Random Sampling
Compatible with Milvus 2.6.x
When working with large-scale datasets, you often do not need to process all of your data to gain insights or test filtering logic. Random sampling lets you work with a statistically representative subset of the data, which significantly reduces query time and resource consumption.
Random sampling operates at the segment level, which keeps it efficient while preserving the randomness of the sample across the data distribution of your collection.
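To make the segment-level idea concrete, here is a toy simulation in plain Python. It is not Milvus's implementation: the segment sizes, the `sample_segment` helper, and the choice to draw a fixed fraction from each segment are all assumptions made for illustration. The sketch only shows that applying the same factor to every segment yields an overall sample of roughly the factor times the total row count, spread across segments in proportion to their size.

```python
import random


def sample_segment(rows, factor, rng):
    """Draw roughly `factor` of the rows from one segment (hypothetical helper)."""
    k = round(len(rows) * factor)
    return rng.sample(rows, k)


def sample_collection(segments, factor, seed=0):
    """Sample every segment independently and concatenate the results."""
    if not 0 < factor < 1:
        raise ValueError("factor must be in the open interval (0, 1)")
    rng = random.Random(seed)
    sampled = []
    for rows in segments:
        sampled.extend(sample_segment(rows, factor, rng))
    return sampled


# Three made-up segments of different sizes holding integer primary keys.
segments = [
    list(range(0, 50_000)),
    list(range(50_000, 80_000)),
    list(range(80_000, 100_000)),
]

sample = sample_collection(segments, 0.01)
print(f"Sampled {len(sample)} of {sum(len(s) for s in segments)} rows")
```

Because each segment is sampled on its own, no single pass over the whole collection is needed, which is the intuition behind the efficiency claim above.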
Key use cases:
Data exploration: Quickly preview the structure and content of a collection with minimal resource usage
Development testing: Test complex filtering logic on a manageable data sample before full deployment
Resource optimization: Reduce computational cost for exploratory queries and statistical analysis
Syntax
filter = "RANDOM_SAMPLE(sampling_factor)"
String filter = "RANDOM_SAMPLE(sampling_factor)";
filter := "RANDOM_SAMPLE(sampling_factor)"
Parameters
sampling_factor: The sampling factor, in the range (0, 1), excluding the boundaries. For example, RANDOM_SAMPLE(0.001) selects approximately 0.1% of the results.
Important rules:
The expression is case-insensitive (RANDOM_SAMPLE or random_sample)
The sampling factor must be in the range (0, 1), excluding the boundaries
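The rules above can be enforced on the client side before a query is sent. The following is a minimal sketch; `random_sample_expr` and `with_random_sample` are hypothetical helper names, not part of any Milvus SDK, and wrapping the extra condition in parentheses is an assumption made here so that an `OR` inside the condition cannot change how the sampling clause is combined.

```python
from decimal import Decimal


def random_sample_expr(factor):
    """Build a RANDOM_SAMPLE(...) clause, rejecting factors outside (0, 1)."""
    factor = float(factor)
    # The chained comparison also rejects NaN, because comparisons with NaN are False.
    if not 0.0 < factor < 1.0:
        raise ValueError(f"sampling factor must be in (0, 1), got {factor!r}")
    # Render in plain decimal notation (0.00001 rather than 1e-05).
    return f"RANDOM_SAMPLE({format(Decimal(repr(factor)), 'f')})"


def with_random_sample(condition, factor):
    """Append the sampling clause to an optional filter condition using AND."""
    sample = random_sample_expr(factor)
    if condition is None or not condition.strip():
        return sample
    return f"({condition}) AND {sample}"


print(random_sample_expr(0.001))
print(with_random_sample('color == "red"', 0.001))
```

The resulting string can then be passed as the `filter` argument of a query or search call.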
Combine with other filters
When you combine the random sampling operator with other filter expressions, you must join them using logical AND. Milvus first applies the other conditions and then performs random sampling on the resulting set.
# Correct: Filter first, then sample
filter = 'color == "red" AND RANDOM_SAMPLE(0.001)'
# Processing: Find all red items → Sample 0.1% of those red items
# Incorrect: OR doesn't make logical sense
filter = 'color == "red" OR RANDOM_SAMPLE(0.001)' # ❌ Invalid logic
# This would mean: "Either red items OR sample everything" - which is meaningless
// Correct: Filter first, then sample
String filter = "color == \"red\" AND RANDOM_SAMPLE(0.001)";
// Processing: Find all red items → Sample 0.1% of those red items
// Incorrect: OR doesn't make logical sense
String invalidFilter = "color == \"red\" OR RANDOM_SAMPLE(0.001)"; // ❌ Invalid logic
// This would mean: "Either red items OR sample everything" - which is meaningless
// Correct: Filter first, then sample
filter := `color == "red" AND RANDOM_SAMPLE(0.001)`
// Processing: Find all red items → Sample 0.1% of those red items
// Incorrect: OR doesn't make logical sense
invalidFilter := `color == "red" OR RANDOM_SAMPLE(0.001)` // ❌ Invalid logic
// This would mean: "Either red items OR sample everything" - which is meaningless
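The filter-first, sample-second order can be illustrated without a running Milvus instance. The sketch below mimics the behavior on an in-memory list; the `filter_then_sample` helper, the made-up rows, and the fixed-fraction draw are assumptions for illustration and do not reflect how Milvus executes the expression internally.

```python
import random


def filter_then_sample(rows, predicate, factor, seed=0):
    """Mimic `<condition> AND RANDOM_SAMPLE(factor)` on an in-memory list."""
    # Step 1: apply the ordinary filter condition.
    matched = [row for row in rows if predicate(row)]
    # Step 2: draw the sample from the matching rows only.
    k = round(len(matched) * factor)
    return random.Random(seed).sample(matched, k)


# Made-up data: 20,000 items, a quarter of them red.
colors = ["red", "green", "blue", "yellow"]
rows = [{"id": i, "color": colors[i % 4]} for i in range(20_000)]

sample = filter_then_sample(rows, lambda r: r["color"] == "red", 0.01)
print(len(sample), {r["color"] for r in sample})
```

The sample size tracks the number of matching rows (5,000 red items here), not the size of the whole collection, so a selective filter combined with a small factor can return very few rows.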
Examples
Example 1: Data exploration
Quickly preview the structure of your collection:
from pymilvus import MilvusClient
client = MilvusClient(uri="http://localhost:19530")
# Sample approximately 1% of the entire collection
result = client.query(
    collection_name="product_catalog",
    filter="RANDOM_SAMPLE(0.01)",
    output_fields=["id", "product_name"],
    limit=10
)
print(f"Sampled {len(result)} products from collection")
import io.milvus.v2.client.*;
import io.milvus.v2.service.vector.request.QueryReq;
import io.milvus.v2.service.vector.response.QueryResp;
import java.util.Arrays;
import java.util.List;
ConnectConfig config = ConnectConfig.builder()
.uri("http://localhost:19530")
.build();
MilvusClientV2 client = new MilvusClientV2(config);
QueryReq queryReq = QueryReq.builder()
.collectionName("product_catalog")
.filter("RANDOM_SAMPLE(0.01)")
.outputFields(Arrays.asList("id", "product_name"))
.limit(10)
.build();
QueryResp queryResp = client.query(queryReq);
List<QueryResp.QueryResult> results = queryResp.getQueryResults();
for (QueryResp.QueryResult result : results) {
System.out.println(result.getEntity());
}
import (
"context"
"fmt"
"github.com/milvus-io/milvus/client/v2/entity"
"github.com/milvus-io/milvus/client/v2/milvusclient"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
milvusAddr := "localhost:19530"
client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
Address: milvusAddr,
})
if err != nil {
fmt.Println(err.Error())
// handle error
}
defer client.Close(ctx)
resultSet, err := client.Query(ctx, milvusclient.NewQueryOption("product_catalog").
WithFilter("RANDOM_SAMPLE(0.01)").
WithOutputFields("id", "product_name"))
if err != nil {
fmt.Println(err.Error())
// handle error
}
fmt.Println("id: ", resultSet.GetColumn("id").FieldData().GetScalars())
fmt.Println("product_name: ", resultSet.GetColumn("product_name").FieldData().GetScalars())
Example 2: Combined filtering with random sampling
Test filtering logic on a manageable subset:
# First filter by category and price, then sample 0.5% of results
filter_expression = 'category == "electronics" AND price > 100 AND RANDOM_SAMPLE(0.005)'
result = client.query(
    collection_name="product_catalog",
    filter=filter_expression,
    output_fields=["product_name", "price", "rating"],
    limit=10
)
print(f"Found {len(result)} electronics products in sample")
String filter = "category == \"electronics\" AND price > 100 AND RANDOM_SAMPLE(0.005)";
QueryReq queryReq = QueryReq.builder()
.collectionName("product_catalog")
.filter(filter)
.outputFields(Arrays.asList("product_name", "price", "rating"))
.limit(10)
.build();
QueryResp queryResp = client.query(queryReq);
filter := "category == \"electronics\" AND price > 100 AND RANDOM_SAMPLE(0.005)"
resultSet, err := client.Query(ctx, milvusclient.NewQueryOption("product_catalog").
WithFilter(filter).
WithOutputFields("product_name", "price", "rating"))
if err != nil {
fmt.Println(err.Error())
// handle error
}
Example 3: Quick analysis
Perform a quick statistical analysis on filtered data:
# Get insights from ~0.1% of premium customer data
filter_expression = 'customer_tier == "premium" AND region == "North America" AND RANDOM_SAMPLE(0.001)'
result = client.query(
    collection_name="customer_profiles",
    filter=filter_expression,
    output_fields=["purchase_amount", "satisfaction_score", "last_purchase_date"],
    limit=10
)
# Analyze sample for quick insights
if result:
    average_purchase = sum(r["purchase_amount"] for r in result) / len(result)
    average_satisfaction = sum(r["satisfaction_score"] for r in result) / len(result)
    print(f"Sample size: {len(result)}")
    print(f"Average purchase amount: ${average_purchase:.2f}")
    print(f"Average satisfaction score: {average_satisfaction:.2f}")
String filter = "customer_tier == \"premium\" AND region == \"North America\" AND RANDOM_SAMPLE(0.001)";
QueryReq queryReq = QueryReq.builder()
.collectionName("customer_profiles")
.filter(filter)
.outputFields(Arrays.asList("purchase_amount", "satisfaction_score", "last_purchase_date"))
.limit(10)
.build();
QueryResp queryResp = client.query(queryReq);
filter := "customer_tier == \"premium\" AND region == \"North America\" AND RANDOM_SAMPLE(0.001)"
resultSet, err := client.Query(ctx, milvusclient.NewQueryOption("customer_profiles").
WithFilter(filter).
WithOutputFields("purchase_amount", "satisfaction_score", "last_purchase_date"))
if err != nil {
fmt.Println(err.Error())
// handle error
}
Example 4: Combined with vector search
Use random sampling in filtered search scenarios:
# Search for similar products within a sampled subset
search_results = client.search(
    collection_name="product_catalog",
    data=[[0.1, 0.2, 0.3, 0.4, 0.5]],  # query vector
    filter='category == "books" AND RANDOM_SAMPLE(0.01)',
    search_params={"metric_type": "L2", "params": {}},
    output_fields=["title", "author", "price"],
    limit=10
)
print(f"Found {len(search_results[0])} similar books in sample")
import io.milvus.v2.service.vector.request.SearchReq;
import io.milvus.v2.service.vector.request.data.FloatVec;
import io.milvus.v2.service.vector.response.SearchResp;
import java.util.Collections;
FloatVec queryVector = new FloatVec(new float[]{0.1f, 0.2f, 0.3f, 0.4f, 0.5f});
SearchReq searchReq = SearchReq.builder()
.collectionName("product_catalog")
.data(Collections.singletonList(queryVector))
.topK(10)
.filter("category == \"books\" AND RANDOM_SAMPLE(0.01)")
.outputFields(Arrays.asList("title", "author", "price"))
.build();
SearchResp searchResp = client.search(searchReq);
List<List<SearchResp.SearchResult>> searchResults = searchResp.getSearchResults();
for (List<SearchResp.SearchResult> results : searchResults) {
System.out.println("TopK results:");
for (SearchResp.SearchResult result : results) {
System.out.println(result);
}
}
queryVector := []float32{0.1, 0.2, 0.3, 0.4, 0.5}
resultSets, err := client.Search(ctx, milvusclient.NewSearchOption(
"product_catalog", // collectionName
10, // limit
[]entity.Vector{entity.FloatVector(queryVector)},
).WithConsistencyLevel(entity.ClStrong).
WithFilter("category == \"books\" AND RANDOM_SAMPLE(0.01)").
WithOutputFields("title", "author", "price"))
if err != nil {
fmt.Println(err.Error())
// handle error
}
for _, resultSet := range resultSets {
fmt.Println("title: ", resultSet.GetColumn("title").FieldData().GetScalars())
fmt.Println("author: ", resultSet.GetColumn("author").FieldData().GetScalars())
fmt.Println("price: ", resultSet.GetColumn("price").FieldData().GetScalars())
}
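One consequence of searching within a sampled subset is that the returned neighbors are the closest items in the sample, not necessarily the closest in the whole collection. The brute-force sketch below illustrates this on made-up data; `sampled_search`, the random vectors, and the fixed-fraction draw are assumptions for illustration, and a real search would go through a vector index rather than a linear scan.

```python
import math
import random


def sampled_search(rows, query, category, factor, limit, seed=0):
    """Filter by category, sample the matches, then rank the sample by L2 distance."""
    matched = [r for r in rows if r["category"] == category]
    k = round(len(matched) * factor)
    candidates = random.Random(seed).sample(matched, k)
    ranked = sorted(candidates, key=lambda r: math.dist(r["vector"], query))
    return ranked[:limit]


# Made-up collection: 10,000 rows with 5-dimensional vectors, half of them books.
rng = random.Random(1)
rows = [
    {
        "id": i,
        "category": "books" if i % 2 == 0 else "music",
        "vector": [rng.random() for _ in range(5)],
    }
    for i in range(10_000)
]

query = [0.1, 0.2, 0.3, 0.4, 0.5]
hits = sampled_search(rows, query, "books", 0.01, limit=10)
for hit in hits:
    print(hit["id"], round(math.dist(hit["vector"], query), 4))
```

Only about 50 of the 5,000 books are candidates in this run, so treat such results as a quick preview rather than a full-recall search.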
Best practices
Start small: Begin with smaller sampling factors (0.001 to 0.01) for initial exploration
Development workflow: Use sampling during development, and remove it for production queries
Statistical validity: Larger samples provide a more accurate statistical representation
Performance testing: Monitor query performance and adjust the sampling factor as needed
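The trade-off between sample size and accuracy can be quantified with the standard error of the mean, which shrinks in proportion to the square root of the sample size. The sketch below uses synthetic, normally distributed values as a stand-in for a numeric field; the population, the seed, and the `mean_with_standard_error` helper are all assumptions for illustration.

```python
import random
import statistics


def mean_with_standard_error(values):
    """Return the sample mean and its estimated standard error."""
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / len(values) ** 0.5
    return mean, stderr


# Synthetic stand-in for a numeric field such as a purchase amount.
rng = random.Random(7)
population = [rng.gauss(100.0, 20.0) for _ in range(200_000)]

for factor in (0.001, 0.01, 0.1):
    k = round(len(population) * factor)
    sample = rng.sample(population, k)
    mean, stderr = mean_with_standard_error(sample)
    print(f"factor={factor}: n={k}, mean={mean:.2f}, standard error={stderr:.2f}")
```

Raising the factor tenfold narrows the standard error by only about a factor of three, so a modest factor is often enough for exploratory estimates.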