بناء RAG مع Milvus و Crawl4AI
يوفّرCrawl4AI زحفًا فائق السرعة وجاهزًا للذكاء الاصطناعي على الويب لـ LLMs. وهو مفتوح المصدر ومحسّن لـ RAG، ويسهّل عملية الكشط مع الاستخراج المتقدم والأداء في الوقت الحقيقي.
في هذا البرنامج التعليمي، سنوضح لك في هذا البرنامج التعليمي كيفية إنشاء خط أنابيب استرجاع-مُعزَّز (RAG) باستخدام Milvus و Crawl4AI. يدمج خط الأنابيب بين Crawl4AI لتصفح بيانات الويب، وMilvus لتخزين المتجهات، وOpenAI لتوليد استجابات ثاقبة مدركة للسياق.
الإعداد
التبعيات والبيئة
للبدء، قم بتثبيت التبعيات المطلوبة عن طريق تشغيل الأمر التالي:
$ pip install -U crawl4ai pymilvus openai requests tqdm
إذا كنت تستخدم Google Colab، لتمكين التبعيات المثبتة للتو، قد تحتاج إلى إعادة تشغيل وقت التشغيل (انقر على قائمة "وقت التشغيل" في أعلى الشاشة، وحدد "إعادة تشغيل الجلسة" من القائمة المنسدلة).
لإعداد crawl4ai بالكامل، قم بتشغيل الأوامر التالية:
# Run post-installation setup
$ crawl4ai-setup
# Verify installation
$ crawl4ai-doctor
[36m[INIT].... → Running post-installation setup...[0m
[36m[INIT].... → Installing Playwright browsers...[0m
[32m[COMPLETE] ● Playwright installation completed successfully.[0m
[36m[INIT].... → Starting database initialization...[0m
[32m[COMPLETE] ● Database initialization completed successfully.[0m
[32m[COMPLETE] ● Post-installation setup completed![0m
[0m[36m[INIT].... → Running Crawl4AI health check...[0m
[36m[INIT].... → Crawl4AI 0.4.247[0m
[36m[TEST].... ℹ Testing crawling capabilities...[0m
[36m[EXPORT].. ℹ Exporting PDF and taking screenshot took 0.80s[0m
[32m[FETCH]... ↓ https://crawl4ai.com... | Status: [32mTrue[0m | Time: 4.22s[0m
[36m[SCRAPE].. ◆ Processed https://crawl4ai.com... | Time: 14ms[0m
[32m[COMPLETE] ● https://crawl4ai.com... | Status: [32mTrue[0m | Total: [33m4.23s[0m[0m
[32m[COMPLETE] ● ✅ Crawling test passed![0m
[0m
إعداد مفتاح OpenAI API
سنستخدم OpenAI كمفتاح OpenAI كمفتاح LLM في هذا المثال. يجب عليك إعداد OPENAI_API_KEY كمتغير بيئة.
import os
os.environ["OPENAI_API_KEY"] = "sk-***********"
قم بإعداد LLM ونموذج التضمين
نقوم بتهيئة عميل OpenAI لإعداد نموذج التضمين.
from openai import OpenAI
openai_client = OpenAI()
حدد دالة لإنشاء تضمينات نصية باستخدام عميل OpenAI. نستخدم نموذج التضمين النصي 3-نموذج التضمين الصغير كمثال.
def emb_text(text):
return (
openai_client.embeddings.create(input=text, model="text-embedding-3-small")
.data[0]
.embedding
)
إنشاء تضمين اختباري وطباعة بُعده والعناصر القليلة الأولى.
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]
زحف البيانات باستخدام Crawl4AI
from crawl4ai import *
async def crawl():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://lilianweng.github.io/posts/2023-06-23-agent/",
)
return result.markdown
markdown_content = await crawl()
[INIT].... → Crawl4AI 0.4.247
[FETCH]... ↓ https://lilianweng.github.io/posts/2023-06-23-agen... | Status: True | Time: 0.07s
[COMPLETE] ● https://lilianweng.github.io/posts/2023-06-23-agen... | Status: True | Total: 0.08s
معالجة المحتوى الذي تم الزحف إليه
لجعل المحتوى الذي تم الزحف إليه قابلاً للإدارة لإدراجه في ملف Milvus، نستخدم ببساطة "#" لفصل المحتوى، والذي يمكن أن يفصل تقريبًا محتوى كل جزء رئيسي من ملف العلامات الذي تم الزحف إليه.
def split_markdown_content(content):
return [section.strip() for section in content.split("# ") if section.strip()]
# Process the crawled markdown content
sections = split_markdown_content(markdown_content)
# Print the first few sections to understand the structure
for i, section in enumerate(sections[:3]):
print(f"Section {i+1}:")
print(section[:300] + "...")
print("-" * 50)
Section 1:
[Lil'Log](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/lilianweng.github.io/> "Lil'Log \(Alt + H\)")
* |
* [ Posts ](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/lilianweng.github.io/> "Posts")
* [ Archive ](https://lilianweng.github.io/posts/2023-06-23-agent/<h...
--------------------------------------------------
Section 2:
LLM Powered Autonomous Agents
Date: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng
Table of Contents
* [Agent System Overview](https://lilianweng.github.io/posts/2023-06-23-agent/<#agent-system-overview>)
* [Component One: Planning](https://lilianweng.github.io/posts/2023...
--------------------------------------------------
Section 3:
Agent System Overview[#](https://lilianweng.github.io/posts/2023-06-23-agent/<#agent-system-overview>)
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:
* **Planning**
* Subgoal and decomposition: The agent breaks down large t...
--------------------------------------------------
تحميل البيانات إلى ملف ميلفوس
إنشاء المجموعة
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"
INFO:numexpr.utils:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
بالنسبة لحجة MilvusClient
:
يعد تعيين
uri
كملف محلي، على سبيل المثال./milvus.db
، هو الطريقة الأكثر ملاءمة، حيث يستخدم تلقائيًا ملف Milvus Lite لتخزين جميع البيانات في هذا الملف.إذا كان لديك حجم كبير من البيانات، يمكنك إعداد خادم Milvus أكثر أداءً على docker أو kubernetes. في هذا الإعداد، يُرجى استخدام الخادم uri، على سبيل المثال
http://localhost:19530
، كـuri
.إذا كنت ترغب في استخدام Zilliz Cloud، الخدمة السحابية المدارة بالكامل لـ Milvus، اضبط
uri
وtoken
، والتي تتوافق مع نقطة النهاية العامة ومفتاح Api في Zilliz Cloud.
تحقق مما إذا كانت المجموعة موجودة بالفعل وأسقطها إذا كانت موجودة.
if milvus_client.has_collection(collection_name):
milvus_client.drop_collection(collection_name)
قم بإنشاء مجموعة جديدة بمعلمات محددة.
إذا لم نحدد أي معلومات عن الحقل، سيقوم ميلفوس تلقائيًا بإنشاء حقل افتراضي id
للمفتاح الأساسي، وحقل vector
لتخزين بيانات المتجه. يتم استخدام حقل JSON محجوز لتخزين الحقول غير المعرفة من قبل النظام الأساسي وقيمها.
milvus_client.create_collection(
collection_name=collection_name,
dimension=embedding_dim,
metric_type="IP", # Inner product distance
consistency_level="Strong", # Strong consistency level
)
إدراج البيانات
from tqdm import tqdm
data = []
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
embedding = emb_text(section)
data.append({"id": i, "vector": embedding, "text": section})
# Insert data into Milvus
milvus_client.insert(collection_name=collection_name, data=data)
Processing sections: 0%| | 0/18 [00:00<?, ?it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 6%|▌ | 1/18 [00:00<00:12, 1.37it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 11%|█ | 2/18 [00:01<00:11, 1.39it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 17%|█▋ | 3/18 [00:02<00:10, 1.40it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 22%|██▏ | 4/18 [00:02<00:07, 1.85it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 28%|██▊ | 5/18 [00:02<00:06, 2.06it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 33%|███▎ | 6/18 [00:03<00:06, 1.94it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 39%|███▉ | 7/18 [00:03<00:05, 2.14it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 44%|████▍ | 8/18 [00:04<00:04, 2.29it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 50%|█████ | 9/18 [00:04<00:04, 2.20it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 56%|█████▌ | 10/18 [00:05<00:03, 2.09it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 61%|██████ | 11/18 [00:06<00:04, 1.68it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 67%|██████▋ | 12/18 [00:06<00:04, 1.48it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 72%|███████▏ | 13/18 [00:07<00:02, 1.75it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 78%|███████▊ | 14/18 [00:07<00:01, 2.02it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 83%|████████▎ | 15/18 [00:07<00:01, 2.12it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 89%|████████▉ | 16/18 [00:08<00:01, 1.61it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 94%|█████████▍| 17/18 [00:09<00:00, 1.92it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Processing sections: 100%|██████████| 18/18 [00:09<00:00, 1.83it/s]
{'insert_count': 18, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'cost': 0}
بناء RAG
استرجاع البيانات لاستعلام
لنحدد سؤال استعلام عن موقع الويب الذي زحفنا إليه للتو.
question = "What are the main components of autonomous agents?"
ابحث عن السؤال في المجموعة واسترجع أفضل 3 مطابقات دلالية.
search_res = milvus_client.search(
collection_name=collection_name,
data=[emb_text(question)],
limit=3,
search_params={"metric_type": "IP", "params": {}},
output_fields=["text"],
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
لنلقِ نظرة على نتائج البحث عن الاستعلام
import json
retrieved_lines_with_distances = [
(res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
[
"Agent System Overview[#](https://lilianweng.github.io/posts/2023-06-23-agent/<#agent-system-overview>)\nIn a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:\n * **Planning**\n * Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\n * Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n * **Memory**\n * Short-term memory: I would consider all the in-context learning (See [Prompt Engineering](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/lilianweng.github.io/posts/2023-03-15-prompt-engineering/>)) as utilizing short-term memory of the model to learn.\n * Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n * **Tool use**\n * The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\n\n Fig. 1. Overview of a LLM-powered autonomous agent system.",
0.6433743238449097
],
[
"LLM Powered Autonomous Agents \nDate: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng \nTable of Contents\n * [Agent System Overview](https://lilianweng.github.io/posts/2023-06-23-agent/<#agent-system-overview>)\n * [Component One: Planning](https://lilianweng.github.io/posts/2023-06-23-agent/<#component-one-planning>)\n * [Task Decomposition](https://lilianweng.github.io/posts/2023-06-23-agent/<#task-decomposition>)\n * [Self-Reflection](https://lilianweng.github.io/posts/2023-06-23-agent/<#self-reflection>)\n * [Component Two: Memory](https://lilianweng.github.io/posts/2023-06-23-agent/<#component-two-memory>)\n * [Types of Memory](https://lilianweng.github.io/posts/2023-06-23-agent/<#types-of-memory>)\n * [Maximum Inner Product Search (MIPS)](https://lilianweng.github.io/posts/2023-06-23-agent/<#maximum-inner-product-search-mips>)\n * [Component Three: Tool Use](https://lilianweng.github.io/posts/2023-06-23-agent/<#component-three-tool-use>)\n * [Case Studies](https://lilianweng.github.io/posts/2023-06-23-agent/<#case-studies>)\n * [Scientific Discovery Agent](https://lilianweng.github.io/posts/2023-06-23-agent/<#scientific-discovery-agent>)\n * [Generative Agents Simulation](https://lilianweng.github.io/posts/2023-06-23-agent/<#generative-agents-simulation>)\n * [Proof-of-Concept Examples](https://lilianweng.github.io/posts/2023-06-23-agent/<#proof-of-concept-examples>)\n * [Challenges](https://lilianweng.github.io/posts/2023-06-23-agent/<#challenges>)\n * [Citation](https://lilianweng.github.io/posts/2023-06-23-agent/<#citation>)\n * [References](https://lilianweng.github.io/posts/2023-06-23-agent/<#references>)\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as [AutoGPT](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/github.com/Significant-Gravitas/Auto-GPT>), [GPT-Engineer](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/github.com/AntonOsika/gpt-engineer>) and [BabyAGI](https://lilianweng.github.io/posts/2023-06-23-agent/<https:/github.com/yoheinakajima/babyagi>), serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.",
0.5462194085121155
],
[
"Component One: Planning[#](https://lilianweng.github.io/posts/2023-06-23-agent/<#component-one-planning>)\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\n#",
0.5223420858383179
]
]
استخدم LLM للحصول على استجابة RAG
تحويل المستندات المسترجعة إلى تنسيق سلسلة.
context = "\n".join(
[line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
تحديد مطالبات النظام والمستخدم لنموذج لاناج. يتم تجميع هذه المطالبة مع المستندات المسترجعة من ميلفوس.
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
استخدم OpenAI ChatGPT لإنشاء استجابة بناءً على المطالبات.
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": USER_PROMPT},
],
)
print(response.choices[0].message.content)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
The main components of autonomous agents are:
1. **Planning**:
- Subgoal and decomposition: Breaking down large tasks into smaller, manageable subgoals.
- Reflection and refinement: Self-criticism and reflection to learn from past actions and improve future steps.
2. **Memory**:
- Short-term memory: In-context learning using prompt engineering.
- Long-term memory: Retaining and recalling information over extended periods using an external vector store and fast retrieval.
3. **Tool use**:
- Calling external APIs for information not contained in the model weights, accessing current information, code execution capabilities, and proprietary information sources.