Rerankers Overview
In information retrieval and generative AI, a reranker is a tool that optimizes the order of results returned by an initial search. Rerankers differ from traditional embedding models: they take a query and a document as input and directly return a similarity score rather than embeddings. This score indicates how relevant the document is to the input query.
Rerankers are often applied after first-stage retrieval, which is typically done with vector Approximate Nearest Neighbor (ANN) techniques. ANN searches are efficient at fetching a broad set of potentially relevant results, but they do not always order those results by actual semantic closeness to the query. A reranker refines that order using deeper contextual analysis, often with machine learning models such as BERT or other Transformer-based models. This can substantially improve the accuracy and relevance of the final results presented to the user.
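The two-stage flow described above can be sketched in plain Python. This is a minimal illustration and not how Milvus or any reranker model is implemented: the first stage is a brute-force inner-product search instead of ANN, and `toy_pair_score` is a hypothetical token-overlap scorer standing in for a cross-encoder. All names and the toy data are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def first_stage(query_vec: Sequence[float],
                doc_vecs: Sequence[Sequence[float]],
                limit: int) -> List[int]:
    """Return the indices of the `limit` documents with the highest inner product."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:limit]


def toy_pair_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query tokens found in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def second_stage(query: str,
                 documents: Sequence[str],
                 candidates: Sequence[int],
                 top_k: int,
                 score_fn: Callable[[str, str], float] = toy_pair_score) -> List[Tuple[int, float]]:
    """Score each (query, document) pair directly and keep the best `top_k`."""
    scored = [(i, score_fn(query, documents[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: the vector search ranks document 0 first, the pair scorer prefers document 1.
toy_documents = [
    "turing test paper 1950",
    "dartmouth conference 1956 birth of ai",
    "chess program 1951",
]
toy_doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]
toy_query_vec = [1.0, 0.0]
toy_query = "dartmouth conference 1956"

toy_candidates = first_stage(toy_query_vec, toy_doc_vecs, limit=2)
toy_reranked = second_stage(toy_query, toy_documents, toy_candidates, top_k=1)
```

The point of the split is that the first stage only compares precomputed vectors, while the second stage looks at the query and each candidate text together, so it is run on a short candidate list only.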
The PyMilvus model library integrates rerank functions that optimize the order of results returned from an initial search. After you have retrieved the nearest embeddings from Milvus, you can use these reranking tools to refine the results and improve their precision.
Rerank Function | API or Open-sourced |
---|---|
BGE | Open-sourced |
Cross Encoder | Open-sourced |
Voyage | API |
Cohere | API |
Jina AI | API |
Before using open-source rerankers, make sure to download and install all required dependencies and models.
For API-based rerankers, get an API key from the provider and set it in the appropriate environment variable or argument.
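For the API-based case, a small helper like the one below is one way to read the key from the environment and fail early with a clear message. The helper name `load_api_key` and the variable name `COHERE_API_KEY` are assumptions for illustration; check your provider's documentation for the names it expects. The commented usage assumes the reranker class accepts an `api_key` argument.

```python
import os


def load_api_key(env_var: str) -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the reranker."
        )
    return key


# Hypothetical usage (requires `pymilvus[model]`, network access, and a valid key):
# from pymilvus.model.reranker import CohereRerankFunction
# cohere_rf = CohereRerankFunction(api_key=load_api_key("COHERE_API_KEY"))
```

Reading the key at startup keeps it out of source code and surfaces a missing key before the first rerank request is sent.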
Example 1: Use the BGE rerank function to rerank documents according to a query
In this example, we demonstrate how to rerank search results using the BGE reranker based on a specific query.
To use a reranker with the PyMilvus model library, start by installing PyMilvus together with the model subpackage, which contains all the necessary reranking utilities:
pip install pymilvus[model]
# or pip install "pymilvus[model]" for zsh.
To use the BGE reranker, first import the BGERerankFunction class:
from pymilvus.model.reranker import BGERerankFunction
Then, create a BGERerankFunction instance for reranking:
bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
To rerank documents based on a query, use the following code:
query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"
documents = [
"In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
"The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
"In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
"The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]
bge_rf(query, documents)
The expected output is similar to the following:
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
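Each result in the output above carries the document text, its score, and its index in the original documents list. The sketch below shows one way to use the index to map reranked results back to your own records and to keep only results above a cutoff. It uses a stand-in tuple with the same three fields so that it runs without downloading a model; the texts are shortened, and the IDs and the 0.5 threshold are hypothetical.

```python
from typing import List, NamedTuple


class RerankResultStub(NamedTuple):
    """Stand-in with the same fields as the results printed above (text, score, index)."""
    text: str
    score: float
    index: int


# Scores and indices copied from the sample output; texts shortened for readability.
stub_results: List[RerankResultStub] = [
    RerankResultStub(text="Dartmouth Conference (1956)", score=0.9911615761470803, index=1),
    RerankResultStub(text="Turing paper (1950)", score=0.0326971950177779, index=0),
    RerankResultStub(text="Logic Theorist (1955)", score=0.006514905766152258, index=3),
    RerankResultStub(text="Turing chess program (1951)", score=0.0042116724917325935, index=2),
]

# Hypothetical identifiers stored in the same order as the input documents list.
source_ids = ["doc-a", "doc-b", "doc-c", "doc-d"]

# `index` points back into the original list, so it can be used to recover metadata.
ranked_ids = [source_ids[r.index] for r in stub_results]

# Keep only results at or above an illustrative relevance cutoff.
relevant = [r for r in stub_results if r.score >= 0.5]
```

A suitable cutoff depends on the model and the data, so treat the value here as a placeholder to tune.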
Example 2: Use a reranker to enhance the relevance of search results
In this guide, we explore how to use the search() method in PyMilvus to conduct similarity searches, and how to improve the relevance of the search results with a reranker. Our demonstration uses the following dataset:
entities = [
{'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."},
{'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."},
{'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'},
{'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]
Dataset components:
doc_id: Unique identifier for each document.
doc_vector: Vector embedding representing the document. For guidance on generating embeddings, refer to Embeddings.
doc_text: Text content of the document.
Preparations
Before starting a similarity search, you need to establish a connection with Milvus, create a collection, and prepare and insert data into that collection. The following code snippet illustrates these preliminary steps.
from pymilvus import MilvusClient, DataType
client = MilvusClient(
uri="http://10.102.6.214:19530" # replace with your own Milvus server address
)
client.drop_collection('test_collection')
# define schema
schema = client.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")
# define index params
index_params = client.prepare_index_params()
index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})
# create collection
client.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)
# insert data into collection
client.insert(collection_name="test_collection", data=entities)
# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}
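The schema above fixes doc_vector at dim=384, so every row passed to insert() needs a vector of that length along with the other declared fields. As an optional safeguard, a check like the following could be run before inserting. The helper name and the toy rows (with 4-dimensional vectors) are assumptions for illustration and are not part of PyMilvus.

```python
from typing import Any, Dict, List, Sequence


def find_entity_problems(
    entities: Sequence[Dict[str, Any]],
    dim: int,
    vector_field: str = "doc_vector",
    required_fields: Sequence[str] = ("doc_id", "doc_vector", "doc_text"),
) -> List[str]:
    """Return human-readable problems; an empty list means the rows look consistent."""
    problems: List[str] = []
    for pos, entity in enumerate(entities):
        missing = [name for name in required_fields if name not in entity]
        if missing:
            problems.append(f"entity {pos}: missing fields {missing}")
            continue
        actual = len(entity[vector_field])
        if actual != dim:
            problems.append(
                f"entity {pos}: {vector_field} has {actual} values, expected {dim}"
            )
    return problems


# Toy rows with 4-dimensional vectors, for illustration only.
toy_entities = [
    {"doc_id": 0, "doc_vector": [0.1, 0.2, 0.3, 0.4], "doc_text": "first"},
    {"doc_id": 1, "doc_vector": [0.1, 0.2, 0.3], "doc_text": "second"},
    {"doc_id": 2, "doc_text": "third"},
]
toy_problems = find_entity_problems(toy_entities, dim=4)
```

With the real dataset you would call it with dim=384 and proceed to insert only when the returned list is empty.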
Conduct a similarity search
After inserting the data, perform a similarity search using the search method.
# search results based on our query
res = client.search(
collection_name="test_collection",
data=[[-0.045217834, 0.035171617, ..., -0.025117004]], # replace with your query vector
limit=3,
output_fields=["doc_id", "doc_text"]
)
for i in res[0]:
    print(f'distance: {i["distance"]}')
    print(f'doc_text: {i["entity"]["doc_text"]}')
The expected output is similar to the following:
distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
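A reranker works on document texts, so the hits returned by search() need to be turned into a list of strings before reranking. The sketch below assumes each hit is a dict shaped like the ones read in the loop above, with the text under `hit["entity"]["doc_text"]`. The sample hits are hand-written with shortened texts, and the helper name is hypothetical.

```python
from typing import Any, Dict, List, Sequence


def hits_to_documents(hits: Sequence[Dict[str, Any]], text_field: str = "doc_text") -> List[str]:
    """Collect the text of each hit, preserving the order returned by the search."""
    return [hit["entity"][text_field] for hit in hits]


# Hand-written sample mirroring the shape of `res` above: one list of hits per query vector.
sample_res = [[
    {"distance": 0.7235960960388184, "entity": {"doc_id": 1, "doc_text": "Dartmouth Conference (1956)"}},
    {"distance": 0.6269873976707458, "entity": {"doc_id": 0, "doc_text": "Turing paper (1950)"}},
    {"distance": 0.5340118408203125, "entity": {"doc_id": 3, "doc_text": "Logic Theorist (1955)"}},
]]

# Texts for the first (and only) query, ready to pass as `documents` to a reranker.
candidate_documents = hits_to_documents(sample_res[0])
```

Passing the retrieved texts this way means the reranker only scores what the search actually returned, instead of a hard-coded list.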
Use a reranker to enhance search results
Then, improve the relevance of your search results with a reranking step. In this example, we use the CrossEncoderRerankFunction built into PyMilvus to rerank the results for improved accuracy.
# use reranker to rerank search results
from pymilvus.model.reranker import CrossEncoderRerankFunction
ce_rf = CrossEncoderRerankFunction(
model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", # Specify the model name.
device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
reranked_results = ce_rf(
query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',
documents=[
"In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
"The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
"In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
"The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
],
top_k=3
)
# print the reranked results
for result in reranked_results:
    print(f'score: {result.score}')
    print(f'doc_text: {result.text}')
The expected output is similar to the following:
score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
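The cross-encoder scores above are not confined to the range 0 to 1, unlike the BGE scores in Example 1. If you want values on a 0 to 1 scale, for example to apply a single threshold, one option is to pass them through the logistic (sigmoid) function. This sketch assumes the scores are raw model logits, which the source does not state. Because the function is monotonic, the ranking is unchanged.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Scores copied from the sample output above.
raw_scores: List[float] = [6.250532627105713, -2.9546022415161133, -4.771512031555176]

# Squash each score; the relative order of the results stays the same.
normalized_scores = [sigmoid(s) for s in raw_scores]
```

The normalized values are convenient for display and thresholding, but they should not be read as calibrated probabilities without further validation.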