结构数组
实体中的 "结构数组 "字段或 "结构数组 "字段存储了一组有序的结构元素。数组中的每个 Struct 都共享相同的预定义 Schema,由多个向量和标量字段组成。
下面是一个包含 StructArray 字段的 Collections 实体示例。
{
'id': 0,
'title': 'Walden',
'title_vector': [0.1, 0.2, 0.3, 0.4, 0.5],
'author': 'Henry David Thoreau',
'year_of_publication': 1845,
'chunks': [
{
'text': 'When I wrote the following pages, or rather the bulk of them...',
'text_vector': [0.3, 0.2, 0.3, 0.2, 0.5],
'chapter': 'Economy',
},
{
'text': 'I would fain say something, not so much concerning the Chinese and...',
'text_vector': [0.7, 0.4, 0.2, 0.7, 0.8],
'chapter': 'Economy'
}
]
// hightlight-end
}
在上面的示例中,chunks 字段是一个 StructArray 字段,每个 Struct 元素都包含自己的字段,即text 、text_vector 和chapter 。
何时使用
从自动驾驶到多模态检索,现代人工智能应用越来越依赖于嵌套的异构数据。传统的平面数据模型很难表示复杂的关系,如"一个文档包含许多注释块"或"一个驾驶场景包含多个观察到的操作"。这正是 Milvus 中 StructArray 数据类型的优势所在。
要快速确定 StructArray 字段是否适合您的应用场景,请考虑以下因素:
您的数据是分层结构的,例如一个文档包含许多注释块。
搜索结果应该是文档,而不是块,如上例。
搜索结果中包含大量重复实体,你很难使用分组、重复数据删除和 Rerankers 等技术检索到最终结果。
如果上述问题的答案都是肯定的,那么就应该使用结构数组。
限制
数据类型
创建 Collections 时,可以使用 Struct 类型作为 Array 字段中元素的数据类型。但是,您不能将 StructArray 添加到现有的 Collections 中,而且 Milvus 不支持使用 Struct 类型作为 Collections 字段的数据类型。
数组字段中的 Struct 共享相同的 Schema,这应该在创建数组字段时定义。
Struct 模式包含向量和标量字段,如下所示:
<div> Applicable vector fields: - `FLOAT_VECTOR` - `FLOAT16_VECTOR` - `BFLOAT16_VECTOR` - `INT8_VECTOR` - `BINARY_VECTOR` </div> <div> Applicable scalar fields: - `VARCHAR` - `INT8/16/32/64` - `FLOAT` - `DOUBLE` - `BOOL` </div>无论是在 Collections 层级还是在 Structs 组合中,向量字段的数量都不应大于或等于 10。
可归零和默认值
StructArray 字段不可为空,也不接受任何默认值。
函数
不能使用函数从 Struct 中的标量字段导出向量字段。
索引类型和度量类型
必须对 Collections 中的所有向量场进行索引。为了索引 StructArray 字段中的向量字段,Milvus 使用嵌入列表来组织每个 Struct 元素中的向量嵌入,并将整个嵌入列表作为一个整体进行索引。
你可以使用
AUTOINDEX或HNSW作为索引类型,并使用下面列出的任何度量类型为 StructArray 字段中的嵌入列表建立索引。索引类型
度量类型
备注
AUTOINDEXHNSWIVF_FLATDISKANN
MAX_SIM_COSINEMAX_SIM_IPMAX_SIM_L2
用于以下类型的嵌入列表:
FLOAT_VECTORFLOAT16_VECTORBFLOAT16_VECTORINT8_VECTORBINARY_VECTOR
有关 Milvus 如何计算查询与嵌入列表之间相似度的详情,请参阅最大相似度。
StructArray 字段中的标量字段支持以下索引类型:
向上插入数据
结构体在合并模式下不支持倒插。但是,您仍然可以在覆盖模式下执行上插入操作,以更新 Structs 中的数据。有关合并模式和覆盖模式下 upsert 的区别,请参阅 "Upsert Entities"。
标量过滤
您可以使用匹配系列中的 元素过滤器和操作符对 StructArray 字段中的标量子字段进行标量过滤。有关详情,请参阅 "结构数组字段中的标量过滤"。
添加结构数组
要在 Milvus 中添加一个 StructArray 字段,需要在创建 Collections 时定义一个数组字段,并将其元素的数据类型设置为 Struct。具体过程如下
将字段作为数组字段添加到 Collections Schema 时,将字段的数据类型设置为
DataType.ARRAY。将字段的
element_type属性设置为DataType.STRUCT,使字段成为 Struct 数组。创建 Struct 模式并包含所需字段。然后,在字段的
struct_schema属性中引用 Struct 模式。将字段的
max_capacity属性设置为适当的值,以指定每个实体在该字段中可包含的最大 Struct 数。(可选)可以为 Struct 元素中的任何字段设置
mmap.enabled,以平衡 Struct 中的冷热数据。
下面是如何定义包含 StructArray 字段的 Collections 模式:
from pymilvus import MilvusClient, DataType
client = MilvusClient(
uri="http://localhost:19530",
token="root:Milvus"
)
schema = client.create_schema()
# add the primary field to the collection
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
# add some scalar fields to the collection
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="author", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="year_of_publication", datatype=DataType.INT64)
# add a vector field to the collection
schema.add_field(field_name="title_vector", datatype=DataType.FLOAT_VECTOR, dim=5)
# Create a struct schema
struct_schema = client.create_struct_field_schema()
# add a scalar field to the struct
struct_schema.add_field("text", DataType.VARCHAR, max_length=65535)
struct_schema.add_field("chapter", DataType.VARCHAR, max_length=512)
# add a vector field to the struct with mmap enabled
struct_schema.add_field("text_vector", DataType.FLOAT_VECTOR, mmap_enabled=True, dim=5)
# reference the struct schema in an Array field with its
# element type set to `DataType.STRUCT`
schema.add_field("chunks", datatype=DataType.ARRAY, element_type=DataType.STRUCT,
struct_schema=struct_schema, max_capacity=1000)
import io.milvus.v2.common.DataType;
import io.milvus.v2.service.collection.request.AddFieldReq;
import io.milvus.v2.service.collection.request.CreateCollectionReq;
CreateCollectionReq.CollectionSchema collectionSchema = CreateCollectionReq.CollectionSchema.builder()
.build();
collectionSchema.addField(AddFieldReq.builder()
.fieldName("id")
.dataType(DataType.Int64)
.isPrimaryKey(true)
.autoID(true)
.build());
collectionSchema.addField(AddFieldReq.builder()
.fieldName("title")
.dataType(DataType.VarChar)
.maxLength(512)
.build());
collectionSchema.addField(AddFieldReq.builder()
.fieldName("author")
.dataType(DataType.VarChar)
.maxLength(512)
.build());
collectionSchema.addField(AddFieldReq.builder()
.fieldName("year_of_publication")
.dataType(DataType.Int64)
.build());
collectionSchema.addField(AddFieldReq.builder()
.fieldName("title_vector")
.dataType(DataType.FloatVector)
.dimension(5)
.build());
Map<String, String> params = new HashMap<>();
params.put("mmap_enabled", "true");
collectionSchema.addField(AddFieldReq.builder()
.fieldName("chunks")
.dataType(DataType.Array)
.elementType(DataType.Struct)
.maxCapacity(1000)
.addStructField(AddFieldReq.builder()
.fieldName("text")
.dataType(DataType.VarChar)
.maxLength(65535)
.build())
.addStructField(AddFieldReq.builder()
.fieldName("chapter")
.dataType(DataType.VarChar)
.maxLength(512)
.build())
.addStructField(AddFieldReq.builder()
.fieldName("text_vector")
.dataType(DataType.FloatVector)
.dimension(VECTOR_DIM)
.typeParams(params)
.build())
.build());
// go
import { MilvusClient, DataType } from "@zilliz/milvus2-sdk-node";
const milvusClient = new MilvusClient("http://localhost:19530");
const schema = [
{
name: "id",
data_type: DataType.INT64,
is_primary_key: true,
auto_id: true,
},
{
name: "title",
data_type: DataType.VARCHAR,
max_length: 512,
},
{
name: "author",
data_type: DataType.VARCHAR,
max_length: 512,
},
{
name: "year_of_publication",
data_type: DataType.INT64,
},
{
name: "title_vector",
data_type: DataType.FLOAT_VECTOR,
dim: 5,
},
{
name: "chunks",
data_type: DataType.ARRAY,
element_type: DataType.STRUCT,
fields: [
{
name: "text",
data_type: DataType.VARCHAR,
max_length: 65535,
},
{
name: "chapter",
data_type: DataType.VARCHAR,
max_length: 512,
},
{
name: "text_vector",
data_type: DataType.FLOAT_VECTOR,
dim: 5,
mmap_enabled: true,
},
],
max_capacity: 1000,
},
];
# restful
SCHEMA='{
"autoID": true,
"fields": [
{
"fieldName": "id",
"dataType": "Int64",
"isPrimary": true
},
{
"fieldName": "title",
"dataType": "VarChar",
"elementTypeParams": { "max_length": "512" }
},
{
"fieldName": "author",
"dataType": "VarChar",
"elementTypeParams": { "max_length": "512" }
},
{
"fieldName": "year_of_publication",
"dataType": "Int64"
},
{
"fieldName": "title_vector",
"dataType": "FloatVector",
"elementTypeParams": { "dim": "5" }
}
],
"structArrayFields": [
{
"name": "chunks",
"description": "Array of document chunks with text and vectors",
"elementTypeParams":{
"max_capacity": 1000
},
"fields": [
{
"fieldName": "text",
"dataType": "VarChar",
"elementTypeParams": { "max_length": "65535" }
},
{
"fieldName": "chapter",
"dataType": "VarChar",
"elementTypeParams": { "max_length": "512" }
},
{
"fieldName": "text_vector",
"dataType": "FloatVector",
"elementTypeParams": {
"dim": "5",
"mmap_enabled": "true"
}
}
]
}
]
}'
上面代码示例中高亮显示的几行说明了如何在 Collections 模式中包含 StructArray。
设置索引参数
所有向量字段都必须设置索引,包括 Collections 中的向量字段和元素 Struct 中定义的向量字段。
适用的索引参数因索引类型而异。有关适用索引参数的详细信息,请参阅Index Explained和所选索引类型的文档。
索引嵌入列表
要为嵌入式列表建立索引,需要将其索引类型设置为AUTOINDEX 或上述任何一种适用的索引类型,并使用 Milvus 列出的度量类型来衡量嵌入式列表之间的相似性。
# Create index parameters
index_params = client.prepare_index_params()
# Create an index for the vector field in the collection
index_params.add_index(
field_name="title_vector",
index_type="AUTOINDEX",
metric_type="L2",
)
# Create an index for the vector field in the element Struct
index_params.add_index(
field_name="chunks[text_vector]",
index_type="AUTOINDEX",
metric_type="MAX_SIM_COSINE",
)
import io.milvus.v2.common.IndexParam;
List<IndexParam> indexParams = new ArrayList<>();
indexParams.add(IndexParam.builder()
.fieldName("title_vector")
.indexType(IndexParam.IndexType.AUTOINDEX)
.metricType(IndexParam.MetricType.L2)
.build());
indexParams.add(IndexParam.builder()
.fieldName("chunks[text_vector]")
.indexType(IndexParam.IndexType.AUTOINDEX)
.metricType(IndexParam.MetricType.MAX_SIM_COSINE)
.build());
// go
await milvusClient.createCollection({
collection_name: "books",
fields: schema,
});
const indexParams = [
{
field_name: "title_vector",
index_type: "AUTOINDEX",
metric_type: "L2",
},
{
field_name: "chunks[text_vector]",
index_type: "AUTOINDEX",
metric_type: "MAX_SIM_COSINE",
},
];
# restful
INDEX_PARAMS='[
{
"fieldName": "title_vector",
"indexName": "title_vector_index",
"indexType": "AUTOINDEX",
"metricType": "L2"
},
{
"fieldName": "chunks[text_vector]",
"indexName": "chunks_text_vector_index",
"indexType": "AUTOINDEX",
"metricType": "MAX_SIM_COSINE"
}
]'
为标量结构子字段建立索引
在标量结构子字段上创建索引时,Milvus 实际上是在元素级别而不是行级别建立索引,以加速标量过滤。
下面的代码片段在名为chunks[text] 的标量 struct 子字段上创建了一个索引。
index_params.add_index(
field_name="chunks[text]",
index_type="INVERTED"
)
indexParams.add(IndexParam.builder()
.fieldName("chunks[text]")
.indexType(IndexParam.IndexType.INVERTED)
.build());
// go
indexParams.push({
field_name: "chunks[text]",
index_type: "INVERTED"
})
INDEX_PARAMS += '{
"fieldName": "chunks[text]",
"indexName": "chunks_text_vector_index",
"indexType": "INVERTED"
}'
创建 Collections
Schema 和索引准备就绪后,就可以创建一个包含 StructArray 字段的 Collection。
client.create_collection(
collection_name="my_collection",
schema=schema,
index_params=index_params
)
import io.milvus.v2.service.collection.request.CreateCollectionReq;
CreateCollectionReq requestCreate = CreateCollectionReq.builder()
.collectionName("my_collection")
.collectionSchema(collectionSchema)
.indexParams(indexParams)
.build();
client.createCollection(requestCreate);
// go
await milvusClient.createCollection({
collection_name: "my_collection",
fields: schema,
indexes: indexParams,
});
# restful
curl -X POST "http://localhost:19530/v2/vectordb/collections/create" \
-H "Content-Type: application/json" \
-d "{
\"collectionName\": \"my_collection\",
\"description\": \"A collection for storing book information with struct array chunks\",
\"schema\": $SCHEMA,
\"indexParams\": $INDEX_PARAMS
}"
插入数据
创建 Collections 后,您可以按如下方式插入包含 Structs 数组的数据。
# Sample data
data = {
'title': 'Walden',
'title_vector': [0.1, 0.2, 0.3, 0.4, 0.5],
'author': 'Henry David Thoreau',
'year_of_publication': 1845,
'chunks': [
{
'text': 'When I wrote the following pages, or rather the bulk of them...',
'text_vector': [0.3, 0.2, 0.3, 0.2, 0.5],
'chapter': 'Economy',
},
{
'text': 'I would fain say something, not so much concerning the Chinese and...',
'text_vector': [0.7, 0.4, 0.2, 0.7, 0.8],
'chapter': 'Economy'
}
]
}
# insert data
client.insert(
collection_name="my_collection",
data=[data]
)
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import io.milvus.v2.service.vector.request.InsertReq;
import io.milvus.v2.service.vector.response.InsertResp;
Gson gson = new Gson();
JsonObject row = new JsonObject();
row.addProperty("title", "Walden");
row.add("title_vector", gson.toJsonTree(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f, 0.5f)));
row.addProperty("author", "Henry David Thoreau");
row.addProperty("year_of_publication", 1845);
JsonArray structArr = new JsonArray();
JsonObject struct1 = new JsonObject();
struct1.addProperty("text", "When I wrote the following pages, or rather the bulk of them...");
struct1.add("text_vector", gson.toJsonTree(Arrays.asList(0.3f, 0.2f, 0.3f, 0.2f, 0.5f)));
struct1.addProperty("chapter", "Economy");
structArr.add(struct1);
JsonObject struct2 = new JsonObject();
struct2.addProperty("text", "I would fain say something, not so much concerning the Chinese and...");
struct2.add("text_vector", gson.toJsonTree(Arrays.asList(0.7f, 0.4f, 0.2f, 0.7f, 0.8f)));
struct2.addProperty("chapter", "Economy");
structArr.add(struct2);
row.add("chunks", structArr);
InsertResp insertResp = client.insert(InsertReq.builder()
.collectionName("my_collection")
.data(Collections.singletonList(row))
.build());
// go
{
id: 0,
title: "Walden",
title_vector: [0.1, 0.2, 0.3, 0.4, 0.5],
author: "Henry David Thoreau",
"year-of-publication": 1845,
chunks: [
{
text: "When I wrote the following pages, or rather the bulk of them...",
text_vector: [0.3, 0.2, 0.3, 0.2, 0.5],
chapter: "Economy",
},
{
text: "I would fain say something, not so much concerning the Chinese and...",
text_vector: [0.7, 0.4, 0.2, 0.7, 0.8],
chapter: "Economy",
},
],
},
];
await milvusClient.insert({
collection_name: "books",
data: data,
});
# restful
curl -X POST "http://localhost:19530/v2/vectordb/entities/insert" \
-H "Content-Type: application/json" \
-d '{
"collectionName": "my_collection",
"data": [
{
"title": "Walden",
"title_vector": [0.1, 0.2, 0.3, 0.4, 0.5],
"author": "Henry David Thoreau",
"year_of_publication": 1845,
"chunks": [
{
"text": "When I wrote the following pages, or rather the bulk of them...",
"text_vector": [0.3, 0.2, 0.3, 0.2, 0.5],
"chapter": "Economy"
},
{
"text": "I would fain say something, not so much concerning the Chinese and...",
"text_vector": [0.7, 0.4, 0.2, 0.7, 0.8],
"chapter": "Economy"
}
]
}
]
}'
import json
import random
from typing import List, Dict, Any
# Real classic books (title, author, year)
BOOKS = [
("Pride and Prejudice", "Jane Austen", 1813),
("Moby Dick", "Herman Melville", 1851),
("Frankenstein", "Mary Shelley", 1818),
("The Picture of Dorian Gray", "Oscar Wilde", 1890),
("Dracula", "Bram Stoker", 1897),
("The Adventures of Sherlock Holmes", "Arthur Conan Doyle", 1892),
("Alice's Adventures in Wonderland", "Lewis Carroll", 1865),
("The Time Machine", "H.G. Wells", 1895),
("The Scarlet Letter", "Nathaniel Hawthorne", 1850),
("Leaves of Grass", "Walt Whitman", 1855),
("The Brothers Karamazov", "Fyodor Dostoevsky", 1880),
("Crime and Punishment", "Fyodor Dostoevsky", 1866),
("Anna Karenina", "Leo Tolstoy", 1877),
("War and Peace", "Leo Tolstoy", 1869),
("Great Expectations", "Charles Dickens", 1861),
("Oliver Twist", "Charles Dickens", 1837),
("Wuthering Heights", "Emily Brontë", 1847),
("Jane Eyre", "Charlotte Brontë", 1847),
("The Call of the Wild", "Jack London", 1903),
("The Jungle Book", "Rudyard Kipling", 1894),
]
# Common chapter names for classics
CHAPTERS = [
"Introduction", "Prologue", "Chapter I", "Chapter II", "Chapter III",
"Chapter IV", "Chapter V", "Chapter VI", "Chapter VII", "Chapter VIII",
"Chapter IX", "Chapter X", "Epilogue", "Conclusion", "Afterword",
"Economy", "Where I Lived", "Reading", "Sounds", "Solitude",
"Visitors", "The Bean-Field", "The Village", "The Ponds", "Baker Farm"
]
# Placeholder text snippets (mimicking 19th-century prose)
TEXT_SNIPPETS = [
"When I wrote the following pages, or rather the bulk of them...",
"I would fain say something, not so much concerning the Chinese and...",
"It is a truth universally acknowledged, that a single man in possession...",
"Call me Ishmael. Some years ago—never mind how long precisely...",
"It was the best of times, it was the worst of times...",
"All happy families are alike; each unhappy family is unhappy in its own way.",
"Whether I shall turn out to be the hero of my own life, or whether that station...",
"You will rejoice to hear that no disaster has accompanied the commencement...",
"The world is too much with us; late and soon, getting and spending...",
"He was an old man who fished alone in a skiff in the Gulf Stream..."
]
def random_vector() -> List[float]:
return [round(random.random(), 1) for _ in range(5)]
def generate_chunk() -> Dict[str, Any]:
return {
"text": random.choice(TEXT_SNIPPETS),
"text_vector": random_vector(),
"chapter": random.choice(CHAPTERS)
}
def generate_record(record_id: int) -> Dict[str, Any]:
title, author, year = random.choice(BOOKS)
num_chunks = random.randint(1, 5) # 1 to 5 chunks per book
chunks = [generate_chunk() for _ in range(num_chunks)]
return {
"title": title,
"title_vector": random_vector(),
"author": author,
"year_of_publication": year,
"chunks": chunks
}
# Generate 1000 records
data = [generate_record(i) for i in range(1000)]
# Insert the generated data
client.insert(collection_name="my_collection", data=data)
在 StructArray 字段中进行向量搜索
您可以在 Collections 的向量字段和 StructArray 中执行向量搜索。
具体来说,你应该将 StructArray 字段的名称和 Struct 元素中目标向量字段的名称串联起来,作为搜索请求中anns_field 参数的值,并使用EmbeddingList 来整齐地组织查询向量。
Milvus 提供的EmbeddingList 可以帮助你更整齐地组织针对 StructArray 中的 Embeddings 列表进行搜索的查询向量。每个EmbeddingList 至少包含一个向量嵌入,并期望返回一定数量的 topK 实体。
不过,EmbeddingList 只能用于没有范围搜索或分组搜索参数的search() 请求,更不用说search_iterator() 请求了。
from pymilvus.client.embedding_list import EmbeddingList
# each query embedding list triggers a single search
embeddingList1 = EmbeddingList()
embeddingList1.add([0.2, 0.9, 0.4, -0.3, 0.2])
embeddingList2 = EmbeddingList()
embeddingList2.add([-0.2, -0.2, 0.5, 0.6, 0.9])
embeddingList2.add([-0.4, 0.3, 0.5, 0.8, 0.2])
# a search with a single embedding list
results = client.search(
collection_name="my_collection",
data=[ embeddingList1 ],
anns_field="chunks[text_vector]",
search_params={"metric_type": "MAX_SIM_COSINE"},
limit=3,
output_fields=["chunks[text]"]
)
import io.milvus.v2.service.vector.request.data.EmbeddingList;
import io.milvus.v2.service.vector.request.data.FloatVec;
EmbeddingList embeddingList1 = new EmbeddingList();
embeddingList1.add(new FloatVec(new float[]{0.2f, 0.9f, 0.4f, -0.3f, 0.2f}));
EmbeddingList embeddingList2 = new EmbeddingList();
embeddingList2.add(new FloatVec(new float[]{-0.2f, -0.2f, 0.5f, 0.6f, 0.9f}));
embeddingList2.add(new FloatVec(new float[]{-0.4f, 0.3f, 0.5f, 0.8f, 0.2f}));
Map<String, Object> params = new HashMap<>();
params.put("metric_type", "MAX_SIM_COSINE");
SearchResp searchResp = client.search(SearchReq.builder()
.collectionName("my_collection")
.annsField("chunks[text_vector]")
.data(Collections.singletonList(embeddingList1))
.searchParams(params)
.limit(3)
.outputFields(Collections.singletonList("chunks[text]"))
.build());
// go
const embeddingList1 = [[0.2, 0.9, 0.4, -0.3, 0.2]];
const embeddingList2 = [
[-0.2, -0.2, 0.5, 0.6, 0.9],
[-0.4, 0.3, 0.5, 0.8, 0.2],
];
const results = await milvusClient.search({
collection_name: "books",
data: embeddingList1,
anns_field: "chunks[text_vector]",
search_params: { metric_type: "MAX_SIM_COSINE" },
limit: 3,
output_fields: ["chunks[text]"],
});
# restful
embeddingList1='[[0.2,0.9,0.4,-0.3,0.2]]'
embeddingList2='[[-0.2,-0.2,0.5,0.6,0.9],[-0.4,0.3,0.5,0.8,0.2]]'
curl -X POST "http://localhost:19530/v2/vectordb/entities/search" \
-H "Content-Type: application/json" \
-d "{
\"collectionName\": \"my_collection\",
\"data\": [$embeddingList1],
\"annsField\": \"chunks[text_vector]\",
\"searchParams\": {\"metric_type\": \"MAX_SIM_COSINE\"},
\"limit\": 3,
\"outputFields\": [\"chunks[text]\"]
}"
上述搜索请求使用chunks[text_vector] 来引用 Struct 元素中的text_vector 字段。您可以使用此语法设置anns_field 和output_fields 参数。
输出将是三个最相似实体的列表。
# [
# [
# {
# 'id': 461417939772144945,
# 'distance': 0.9675756096839905,
# 'entity': {
# 'chunks': [
# {'text': 'The world is too much with us; late and soon, getting and spending...'},
# {'text': 'All happy families are alike; each unhappy family is unhappy in its own way.'}
# ]
# }
# },
# {
# 'id': 461417939772144965,
# 'distance': 0.9555778503417969,
# 'entity': {
# 'chunks': [
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
# {'text': 'When I wrote the following pages, or rather the bulk of them...'},
# {'text': 'It was the best of times, it was the worst of times...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'}
# ]
# }
# },
# {
# 'id': 461417939772144962,
# 'distance': 0.9469035863876343,
# 'entity': {
# 'chunks': [
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'},
# {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'}
# ]
# }
# }
# ]
# ]
您还可以在data 参数中包含多个嵌入列表,以检索每个嵌入列表的搜索结果。
# a search with multiple embedding lists
results = client.search(
collection_name="my_collection",
data=[ embeddingList1, embeddingList2 ],
anns_field="chunks[text_vector]",
search_params={"metric_type": "MAX_SIM_COSINE"},
limit=3,
output_fields=["chunks[text]"]
)
print(results)
Map<String, Object> params = new HashMap<>();
params.put("metric_type", "MAX_SIM_COSINE");
SearchResp searchResp = client.search(SearchReq.builder()
.collectionName("my_collection")
.annsField("chunks[text_vector]")
.data(Arrays.asList(embeddingList1, embeddingList2))
.searchParams(params)
.limit(3)
.outputFields(Collections.singletonList("chunks[text]"))
.build());
List<List<SearchResp.SearchResult>> searchResults = searchResp.getSearchResults();
for (int i = 0; i < searchResults.size(); i++) {
System.out.println("Results of No." + i + " embedding list");
List<SearchResp.SearchResult> results = searchResults.get(i);
for (SearchResp.SearchResult result : results) {
System.out.println(result);
}
}
// go
const results2 = await milvusClient.search({
collection_name: "books",
data: [embeddingList1, embeddingList2],
anns_field: "chunks[text_vector]",
search_params: { metric_type: "MAX_SIM_COSINE" },
limit: 3,
output_fields: ["chunks[text]"],
});
# restful
curl -X POST "http://localhost:19530/v2/vectordb/entities/search" \
-H "Content-Type: application/json" \
-d "{
\"collectionName\": \"my_collection\",
\"data\": [$embeddingList1, $embeddingList2],
\"annsField\": \"chunks[text_vector]\",
\"searchParams\": {\"metric_type\": \"MAX_SIM_COSINE\"},
\"limit\": 3,
\"outputFields\": [\"chunks[text]\"]
}"
输出将是每个嵌入列表中三个最相似实体的列表。
# [
# [
# {
# 'id': 461417939772144945,
# 'distance': 0.9675756096839905,
# 'entity': {
# 'chunks': [
# {'text': 'The world is too much with us; late and soon, getting and spending...'},
# {'text': 'All happy families are alike; each unhappy family is unhappy in its own way.'}
# ]
# }
# },
# {
# 'id': 461417939772144965,
# 'distance': 0.9555778503417969,
# 'entity': {
# 'chunks': [
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
# {'text': 'When I wrote the following pages, or rather the bulk of them...'},
# {'text': 'It was the best of times, it was the worst of times...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'}
# ]
# }
# },
# {
# 'id': 461417939772144962,
# 'distance': 0.9469035863876343,
# 'entity': {
# 'chunks': [
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'},
# {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
# {'text': 'The world is too much with us; late and soon, getting and spending...'}
# ]
# }
# }
# ],
# [
# {
# 'id': 461417939772144663,
# 'distance': 1.9761409759521484,
# 'entity': {
# 'chunks': [
# {'text': 'It was the best of times, it was the worst of times...'},
# {'text': 'It is a truth universally acknowledged, that a single man in possession...'},
# {'text': 'Whether I shall turn out to be the hero of my own life, or whether that station...'},
# {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'}
# ]
# }
# },
# {
# 'id': 461417939772144692,
# 'distance': 1.974656581878662,
# 'entity': {
# 'chunks': [
# {'text': 'It is a truth universally acknowledged, that a single man in possession...'},
# {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'}
# ]
# }
# },
# {
# 'id': 461417939772144662,
# 'distance': 1.9406685829162598,
# 'entity': {
# 'chunks': [
# {'text': 'It is a truth universally acknowledged, that a single man in possession...'}
# ]
# }
# }
# ]
# ]
在上述代码示例中,embeddingList1 是一个向量的嵌入列表,而embeddingList2 包含两个向量。每个矢量都会触发一个单独的搜索请求,并期望得到前 K 个相似实体的列表。
在 StructArray 字段中进行标量过滤
您可以使用匹配系列中的 元素过滤器和操作符,对结构数组(StructArray)中的标量子字段进行标量过滤。有关上述两种操作符类型的详细信息和示例,请参阅结构数组操作符。
元素过滤器
这是一种实体级过滤器,用于检查实体的 StructArray 字段中是否至少有一个元素满足谓词。例如,下面的元素过滤器返回在text 子字段中至少包含一个以 "Red "开头的块的实体。
element_filter(chunks, $[text] LIKE "Red%")
您可以在谓词中使用几乎所有的比较、范围和算术操作符,谓词按每个元素进行评估,逻辑操作符可用于组合同一元素上的多个条件。有关详细信息,请参阅基本操作符。
如果过滤搜索或查询请求中存在多个标量过滤表达式,请将元素过滤表达式放在所有实体级过滤表达式之后,如下图所示。
# correct
id > 0 && element_filter(chunks, $[x] > 1)
# incorrect, resulting errors
element_filter(chunks, $[x] > 1) && id > 0
匹配族操作符
匹配族操作符也可以在 StructArray 字段上操作。你可以确定有多少元素(或多大比例)必须满足元素谓词,而不是简单地检查元素是否存在。
MATCH_ANY(chunks, $[text] LIKE "Red%")这将返回在
text子字段中至少包含一个以 "Red "开头的块的实体;从语义上讲,这等同于element_filter。MATCH_ALL(chunks, $[text] LIKE "Red%")返回所有块中文本子字段均以 "Red "开头的实体。
MATCH_LEAST(chunks, $[text] LIKE "Red%", k)返回至少包含
k块的实体,这些块在text子字段中以 "Red "开头。MATCH_MOST(chunks, $[text] LIKE "Red%", k)返回在
text子字段中最多包含以 "红色 "开头的k块的实体。MATCH_EXACT(chunks, $[text] LIKE "Red%", k)返回在
text子字段中恰好包含以 "红色 "开头的k块的实体。
下一步工作
本地 StructArray 数据类型的开发是 Milvus 处理复杂数据结构能力的一大进步。为了更好地了解其使用案例并最大限度地利用这一新功能,我们建议您阅读《使用结构数组的 Schema 设计》。