🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 整合

整合 Milvus 與 Jina AI

Open In Colab GitHub Repository

本指南展示了如何使用 Jina AI 嵌入和 Milvus 來進行相似性搜索和檢索任務。

誰是 Jina AI

Jina AI 於 2020 年在柏林成立,是一家先進的人工智能公司,專注於透過其搜尋基礎徹底改變人工智能的未來。Jina AI 專精於多模態人工智慧,旨在透過其整合式元件套件,包括嵌入式、reerankers、prompt ops 和核心基礎架構,讓企業和開發人員能夠利用多模態資料的力量來創造價值和節省成本。 Jina AI 的尖端嵌入式擁有頂級效能,其 8192 符記長度模型是全面資料表達的理想選擇。這些嵌入式提供多語言支援,並與 OpenAI 等領先平台無縫整合,有助於跨語言應用程式的發展。

Milvus 與 Jina AI 的嵌入式系統

為了有效地儲存和搜尋這些嵌入式資料,以達到速度和規模,需要專為此目的而設計的特定基礎架構。Milvus 是廣為人知的先進開源向量資料庫,能夠處理大規模的向量資料。Milvus 能夠根據大量指標進行快速準確的向量(嵌入)搜尋。它的可擴展性允許無縫處理大量的影像資料,即使資料集成長也能確保高效能的搜尋作業。

範例

Jina 嵌入已經整合到 PyMilvus 模型庫中。現在,我們將以程式碼範例來展示如何實際使用 Jina embeddings。

在開始之前,我們需要為 PyMilvus 安裝模型庫。

$ pip install -U pymilvus
$ pip install "pymilvus[model]"

如果您使用的是 Google Colab,為了啟用剛安裝的相依性,您可能需要重新啟動運行時間。(按一下螢幕上方的「Runtime」功能表,然後從下拉式功能表中選擇「Restart session」)。

通用嵌入

Jina AI 的核心嵌入模型擅於理解詳細的文字,因此非常適合語意搜尋、內容分類,進而支援進階情感分析、文字摘要和個人化推薦系統。

from pymilvus.model.dense import JinaEmbeddingFunction

jina_api_key = "<YOUR_JINA_API_KEY>"
ef = JinaEmbeddingFunction(
    "jina-embeddings-v3", 
    jina_api_key,
    task="retrieval.passage",
    dimensions=1024
)

query = "what is information retrieval?"
doc = "Information retrieval is the process of finding relevant information from a large collection of data or documents."

qvecs = ef.encode_queries([query])  # This method uses `retrieval.query` as the task
dvecs = ef.encode_documents([doc])  # This method uses `retrieval.passage` as the task

雙語嵌入

Jina AI 的雙語模型增強了多語言平台、全球支援和跨語言內容發現。它們專為德英和中英翻譯而設計,可促進多種語言群體之間的理解,簡化跨語言互動。

from pymilvus.model.dense import JinaEmbeddingFunction

jina_api_key = "<YOUR_JINA_API_KEY>"
ef = JinaEmbeddingFunction("jina-embeddings-v2-base-de", jina_api_key)

query = "what is information retrieval?"
doc = "Information Retrieval ist der Prozess, relevante Informationen aus einer großen Sammlung von Daten oder Dokumenten zu finden."

qvecs = ef.encode_queries([query])
dvecs = ef.encode_documents([doc])

程式碼嵌入

Jina AI 的程式碼嵌入模型可透過程式碼和文件提供搜尋能力。它支援英文和 30 種常用程式語言,可用於增強程式碼導航、簡化程式碼檢閱和自動化文件輔助。

from pymilvus.model.dense import JinaEmbeddingFunction

jina_api_key = "<YOUR_JINA_API_KEY>"
ef = JinaEmbeddingFunction("jina-embeddings-v2-base-code", jina_api_key)

# Case1: Enhanced Code Navigation
# query: text description of the functionality
# document: relevant code snippet

query = "function to calculate average in Python."
doc = """
def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    return total / count
"""

# Case2: Streamlined Code Review
# query: text description of the programming concept
# document: relevante code snippet or PR

query = "pull quest related to Collection"
doc = "fix:[restful v2] parameters of create collection ..."

# Case3: Automatic Documentation Assistance
# query: code snippet you need explanation
# document: relevante document or DocsString

query = "What is Collection in Milvus"
doc = """
In Milvus, you store your vector embeddings in collections. All vector embeddings within a collection share the same dimensionality and distance metric for measuring similarity.
Milvus collections support dynamic fields (i.e., fields not pre-defined in the schema) and automatic incrementation of primary keys.
"""

qvecs = ef.encode_queries([query])
dvecs = ef.encode_documents([doc])

使用 Jina 與 Milvus 進行語意搜尋

透過強大的向量嵌入功能,我們可以結合利用 Jina AI 模型與 Milvus Lite 向量資料庫所擷取的嵌入資料來執行語意搜尋。

from pymilvus.model.dense import JinaEmbeddingFunction
from pymilvus import MilvusClient

jina_api_key = "<YOUR_JINA_API_KEY>"
DIMENSION = 1024  # `jina-embeddings-v3` supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), allowing for truncating embeddings to fit your application. 
ef = JinaEmbeddingFunction(
    "jina-embeddings-v3", 
    jina_api_key,
    task="retrieval.passage",
    dimensions=DIMENSION,
)


doc = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.",
]

dvecs = ef.encode_documents(doc) # This method uses `retrieval.passage` as the task

data = [
    {"id": i, "vector": dvecs[i], "text": doc[i], "subject": "history"}
    for i in range(len(dvecs))
]

milvus_client = MilvusClient("./milvus_jina_demo.db")
COLLECTION_NAME = "demo_collection"  # Milvus collection name
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)

res = milvus_client.insert(collection_name=COLLECTION_NAME, data=data)

print(res["insert_count"])

至於MilvusClient 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您想使用Zilliz Cloud,Milvus 的完全管理雲端服務,請調整uritoken ,對應 Zilliz Cloud 的Public Endpoint 和 Api key

有了 Milvus 向量資料庫中的所有資料,我們現在可以透過產生查詢的向量嵌入來執行語意搜尋,並進行向量搜尋。

queries = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"
qvecs = ef.encode_queries([queries]) # This method uses `retrieval.query` as the task

res = milvus_client.search(
    collection_name=COLLECTION_NAME,  # target collection
    data=[qvecs[0]],  # query vectors
    limit=3,  # number of returned entities
    output_fields=["text", "subject"],  # specifies fields to be returned
)[0]

for result in res:
    print(result)
{'id': 1, 'distance': 0.8802614808082581, 'entity': {'text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", 'subject': 'history'}}

Jina Reranker

在使用 embeddings 搜尋之後,Jina Ai 也提供了 reranker 來進一步提升檢索品質。

from pymilvus.model.reranker import JinaRerankFunction

jina_api_key = "<YOUR_JINA_API_KEY>"

rf = JinaRerankFunction("jina-reranker-v1-base-en", jina_api_key)

query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.",
]

rf(query, documents)
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9370958209037781, index=1),
 RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.35420963168144226, index=3),
 RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.3498658835887909, index=0),
 RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.2728956639766693, index=2)]

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?