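Retrieval-Augmented Generation (RAG) with Milvus and CAMEL
This guide shows how to build a Retrieval-Augmented Generation (RAG) system using CAMEL and Milvus.
A RAG system combines a retrieval system with a generative model to produce new text for a given prompt. The system first uses Milvus to retrieve relevant documents from a corpus, then uses a generative model to produce new text based on the retrieved documents.
CAMEL is a multi-agent framework. Milvus is an open-source vector database built for embedding similarity search and AI applications.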
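The retrieve-then-generate flow described above can be illustrated with a small, self-contained sketch. It is not how CAMEL or Milvus work internally: the bag-of-words "embedding", the two-sentence corpus, and the helper names `embed`, `retrieve`, and `build_prompt` are all assumptions made for illustration, and the final generation step is left as a prompt string rather than a model call.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[str]) -> str:
    """Return the corpus document most similar to the query."""
    query_vec = embed(query)
    return max(corpus, key=lambda doc: cosine(query_vec, embed(doc)))


def build_prompt(query: str, context: str) -> str:
    """Combine the retrieved context and the query into a generator prompt."""
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "Milvus is an open-source vector database.",
    "CAMEL is a multi-agent framework.",
]
question = "what is a vector database"
best = retrieve(question, corpus)
print(build_prompt(question, best))
```

In the real pipeline, the toy `embed` function is replaced by an embedding model and the linear scan over `corpus` by a Milvus similarity search.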
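In this notebook, we show how to use the CAMEL Retrieve Module in both customized and automated ways. We also show how to combine AutoRetriever with ChatAgent, and how to further combine AutoRetriever with RolePlaying by using Function Calling.
There are four main parts:
- Customized RAG
- Auto RAG
- Single Agent with Auto RAG
- Role-playing with Auto RAG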
Load Data
Let's first load the CAMEL paper from https://arxiv.org/pdf/2303.17760.pdf. This will be our local example data. Start by installing the dependencies:
$ pip install -U "camel-ai[all]" pymilvus
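If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).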
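import os

import requests

# Create the local data directory if it does not exist yet
os.makedirs("local_data", exist_ok=True)

url = "https://arxiv.org/pdf/2303.17760.pdf"
response = requests.get(url)
# Fail early on an HTTP error instead of saving an error page as a PDF
response.raise_for_status()
with open("local_data/camel paper.pdf", "wb") as file:
    file.write(response.content)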
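Before processing the download, it can help to confirm that the saved file really is a PDF and not, say, an HTML error page. The sketch below checks for the standard `%PDF-` signature at the start of the file; the helper name `looks_like_pdf` is an assumption introduced for this example.

```python
from pathlib import Path


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(b"%PDF-")


pdf_path = Path("local_data/camel paper.pdf")
if pdf_path.exists():
    # Only the first five bytes are needed to check the signature
    with pdf_path.open("rb") as f:
        header = f.read(5)
    if looks_like_pdf(header):
        print("valid PDF header")
    else:
        print("not a PDF; try downloading again")
else:
    print("file not found; run the download step first")
```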
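1. Customized RAG
In this section we set up a customized RAG pipeline, taking VectorRetriever as an example. We use OpenAIEmbedding as the embedding model and MilvusStorage as its storage.
To use OpenAI embeddings, we need to set the OPENAI_API_KEY.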
os.environ["OPENAI_API_KEY"] = "Your Key"
Import and set the embedding instance:
from camel.embeddings import OpenAIEmbedding
embedding_instance = OpenAIEmbedding()
Import and set the vector storage instance:
from camel.storages import MilvusStorage

storage_instance = MilvusStorage(
    vector_dim=embedding_instance.get_output_dim(),
    url_and_api_key=(
        "./milvus_demo.db",  # Your Milvus connection URI
        "",  # Your Milvus token
    ),
    collection_name="camel_paper",
)
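For the url_and_api_key:
- Using a local file, e.g. ./milvus.db, as the Milvus connection URI is the most convenient method, because it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. http://localhost:19530, as your url.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, set the connection URI and token to the Public Endpoint and API key of your Zilliz Cloud instance.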
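One way to switch between these three options without editing the notebook is to read the connection details from environment variables. The sketch below is only a suggestion: the variable names `MILVUS_URI` and `MILVUS_TOKEN` and both helper functions are hypothetical and are not part of CAMEL or Milvus.

```python
import os
from typing import Tuple


def milvus_connection(default_uri: str = "./milvus_demo.db") -> Tuple[str, str]:
    """Build the (uri, token) tuple from environment variables.

    MILVUS_URI and MILVUS_TOKEN are hypothetical names chosen for this sketch.
    """
    uri = os.environ.get("MILVUS_URI", default_uri)
    token = os.environ.get("MILVUS_TOKEN", "")
    return uri, token


def describe(uri: str) -> str:
    """Classify a URI as a remote endpoint or a local Milvus Lite file."""
    if uri.startswith(("http://", "https://")):
        return "remote server"
    return "local Milvus Lite file"


uri, token = milvus_connection()
print(uri, "->", describe(uri))
```

The resulting tuple has the same shape as the `url_and_api_key` argument used in this guide, so it could be passed there directly.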
Import and set the retriever instance:
By default, the similarity_threshold is set to 0.75. You can change it.
from camel.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    embedding_model=embedding_instance, storage=storage_instance
)
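We use the integrated Unstructured Module to split the content into small chunks. The content is split automatically with its chunk_by_title function, with a maximum of 500 characters per chunk, which is a suitable length for OpenAIEmbedding. All the text in the chunks is embedded and stored in the vector storage instance. This takes some time, so please wait.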
vector_retriever.process(content_input_path="local_data/camel paper.pdf")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
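To make the 500-character limit concrete, here is a simplified chunking sketch. It is not the algorithm behind chunk_by_title, which also respects section titles; it only shows the general idea of packing text greedily into pieces that never exceed a maximum length. The function name `chunk_text` is an assumption.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full: store it and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "CAMEL studies communicative agents. " * 40
pieces = chunk_text(sample, max_chars=500)
print(len(pieces), [len(p) for p in pieces])
```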
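Now we can retrieve information from the vector storage by giving a query. By default, it returns the text content of the top 1 chunk with the highest cosine similarity score, and the score must be higher than 0.75 to ensure the retrieved content is relevant to the query. You can also change the top_k value.
Each entry in the returned list includes:
- similarity score
- content path
- metadata
- text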
retrieved_info = vector_retriever.query(query="What is CAMEL?", top_k=1)
print(retrieved_info)
[{'similarity score': '0.8321675658226013', 'content path': 'local_data/camel paper.pdf', 'metadata': {'last_modified': '2024-04-19T14:40:00', 'filetype': 'application/pdf', 'page_number': 45}, 'text': 'CAMEL Data and Code License The intended purpose and licensing of CAMEL is solely for research use. The source code is licensed under Apache 2.0. The datasets are licensed under CC BY NC 4.0, which permits only non-commercial usage. It is advised that any models trained using the dataset should not be utilized for anything other than research purposes.\n\n45'}]
Let's try an irrelevant query:
retrieved_info_irrelevant = vector_retriever.query(
    query="Compared with dumpling and rice, which should I take for dinner?", top_k=1
)
print(retrieved_info_irrelevant)
[{'text': 'No suitable information retrieved from local_data/camel paper.pdf with similarity_threshold = 0.75.'}]
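The difference between the two queries comes down to ranking by cosine similarity, keeping the top_k results, and discarding anything below the threshold. The sketch below reproduces that logic on hand-written two-dimensional vectors. It is an illustration only and not the VectorRetriever implementation; the function names, the example vectors, and the fallback message are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_results(
    query_vec: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    top_k: int = 1,
    similarity_threshold: float = 0.75,
) -> List[dict]:
    """Rank chunks by similarity, keep top_k, and drop those below the threshold."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), text) for text, vec in chunks.items()),
        reverse=True,
    )
    kept = [
        {"similarity score": score, "text": text}
        for score, text in scored[:top_k]
        if score >= similarity_threshold
    ]
    # Mimic the retriever's behaviour of returning a notice instead of weak matches
    return kept or [{"text": "No suitable information retrieved."}]


chunks = {"about camel": [1.0, 0.0], "about dinner": [0.0, 1.0]}
print(filter_results([0.9, 0.1], chunks))  # close to the first chunk
print(filter_results([0.5, 0.5], chunks))  # equally far from both chunks
```

Raising top_k returns more candidates, while raising the threshold trades recall for relevance.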
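2. Auto RAG
In this section we run the AutoRetriever with default settings. It uses OpenAIEmbedding as the default embedding model and Milvus as the default vector storage.
What you need to do is:
- Set the content input paths, which can be local paths or remote URLs
- Set the remote URL and API key for Milvus
- Give a query
The Auto RAG pipeline creates a collection for each content input path. The collection name is set automatically based on the content input path name. If the collection already exists, retrieval runs directly against it.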
from camel.retrievers import AutoRetriever
from camel.types import StorageType

auto_retriever = AutoRetriever(
    url_and_api_key=(
        "./milvus_demo.db",  # Your Milvus connection URI
        "",  # Your Milvus token
    ),
    storage_type=StorageType.MILVUS,
    embedding_model=embedding_instance,
)

retrieved_info = auto_retriever.run_vector_retriever(
    query="What is CAMEL-AI",
    content_input_paths=[
        "local_data/camel paper.pdf",  # example local path
        "https://www.camel-ai.org/",  # example remote url
    ],
    top_k=1,
    return_detailed_info=True,
)

print(retrieved_info)
Original Query:
{What is CAMEL-AI}
Retrieved Context:
{'similarity score': '0.8252888321876526', 'content path': 'local_data/camel paper.pdf', 'metadata': {'last_modified': '2024-04-19T14:40:00', 'filetype': 'application/pdf', 'page_number': 7}, 'text': ' Section 3.2, to simulate assistant-user cooperation. For our analysis, we set our attention on AI Society setting. We also gathered conversational data, named CAMEL AI Society and CAMEL Code datasets and problem-solution pairs data named CAMEL Math and CAMEL Science and analyzed and evaluated their quality. Moreover, we will discuss potential extensions of our framework and highlight both the risks and opportunities that future AI society might present.'}
{'similarity score': '0.8378663659095764', 'content path': 'https://www.camel-ai.org/', 'metadata': {'filetype': 'text/html', 'languages': ['eng'], 'page_number': 1, 'url': 'https://www.camel-ai.org/', 'link_urls': ['#h.3f4tphhd9pn8', 'https://join.slack.com/t/camel-ai/shared_invite/zt-2g7xc41gy-_7rcrNNAArIP6sLQqldkqQ', 'https://discord.gg/CNcNpquyDc'], 'link_texts': [None, None, None], 'emphasized_text_contents': ['Mission', 'CAMEL-AI.org', 'is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we provide, implement, and support various types of agents, tasks, prompts, models, datasets, and simulated environments.', 'Join us via', 'Slack', 'Discord', 'or'], 'emphasized_text_tags': ['span', 'span', 'span', 'span', 'span', 'span', 'span']}, 'text': 'Mission\n\nCAMEL-AI.org is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we provide, implement, and support various types of agents, tasks, prompts, models, datasets, and simulated environments.\n\nJoin us via\n\nSlack\n\nDiscord\n\nor'}
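The exact rule AutoRetriever uses to derive a collection name from a content input path is not shown in this guide. The sketch below shows one plausible approach, under the assumption that the name only needs to satisfy Milvus naming rules (letters, digits, and underscores, not starting with a digit). The function name and the 30-character cap are hypothetical.

```python
import re


def collection_name_from_path(path: str, max_len: int = 30) -> str:
    """Derive a Milvus-friendly collection name from a path or URL (illustrative)."""
    # Replace every run of characters that are not letters or digits with "_"
    name = re.sub(r"[^0-9A-Za-z]+", "_", path).strip("_")
    # Collection names must not be empty or start with a digit
    if not name or name[0].isdigit():
        name = f"c_{name}"
    return name[:max_len]


for source in ["local_data/camel paper.pdf", "https://www.camel-ai.org/"]:
    print(source, "->", collection_name_from_path(source))
```

Because the same input path always maps to the same name, a second run can find the existing collection and skip the embedding step, which matches the reuse behavior described for the Auto RAG pipeline.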
3. Single Agent with Auto RAG
This section shows how to combine the AutoRetriever with a single ChatAgent.
Let's define an agent function that returns a response when given a query.
from camel.agents import ChatAgent
from camel.messages import BaseMessage
from camel.types import RoleType
from camel.retrievers import AutoRetriever
from camel.types import StorageType


def single_agent(query: str) -> str:
    # Set agent role
    assistant_sys_msg = BaseMessage(
        role_name="Assistant",
        role_type=RoleType.ASSISTANT,
        meta_dict=None,
        content="""You are a helpful assistant to answer question,
         I will give you the Original Query and Retrieved Context,
        answer the Original Query based on the Retrieved Context,
        if you can't answer the question just say I don't know.""",
    )

    # Add auto retriever
    auto_retriever = AutoRetriever(
        url_and_api_key=(
            "./milvus_demo.db",  # Your Milvus connection URI
            "",  # Your Milvus token
        ),
        storage_type=StorageType.MILVUS,
        embedding_model=embedding_instance,
    )

    retrieved_info = auto_retriever.run_vector_retriever(
        query=query,
        content_input_paths=[
            "local_data/camel paper.pdf",  # example local path
            "https://www.camel-ai.org/",  # example remote url
        ],
        # vector_storage_local_path="storage_default_run",
        top_k=1,
        return_detailed_info=True,
    )

    # Pass the retrieved information to agent
    user_msg = BaseMessage.make_user_message(role_name="User", content=retrieved_info)
    agent = ChatAgent(assistant_sys_msg)

    # Get response
    assistant_response = agent.step(user_msg)
    return assistant_response.msg.content


print(single_agent("What is CAMEL-AI"))
CAMEL-AI is an open-source community dedicated to the study of autonomous and communicative agents. It provides, implements, and supports various types of agents, tasks, prompts, models, datasets, and simulated environments to facilitate research in this field.
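The agent can answer because the user message already contains both the query and the retrieved context, in the "Original Query / Retrieved Context" layout that the system message refers to. A minimal sketch of building such a message by hand is shown below; the helper name `format_retrieval` is an assumption, and the layout is copied from the output printed earlier in this guide rather than from CAMEL's source code.

```python
from typing import List


def format_retrieval(query: str, contexts: List[str]) -> str:
    """Build a message containing the query followed by the retrieved context."""
    joined = "\n".join(contexts)
    # Doubled braces produce literal "{" and "}" around the query
    return f"Original Query:\n{{{query}}}\nRetrieved Context:\n{joined}"


message = format_retrieval(
    "What is CAMEL-AI",
    ["CAMEL-AI.org is an open-source community dedicated to the study of agents."],
)
print(message)
```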
4. Role-playing with Auto RAG
In this section we show how to combine RETRIEVAL_FUNCS with RolePlaying by applying Function Calling.
from typing import List

from colorama import Fore

from camel.agents.chat_agent import FunctionCallingRecord
from camel.configs import ChatGPTConfig
from camel.functions import (
    MATH_FUNCS,
    RETRIEVAL_FUNCS,
)
from camel.societies import RolePlaying
from camel.types import ModelType
from camel.utils import print_text_animated


def role_playing_with_rag(
    task_prompt, model_type=ModelType.GPT_4O, chat_turn_limit=10
) -> None:
    user_model_config = ChatGPTConfig(temperature=0.0)

    # Tools the assistant may call: math helpers plus the retrieval functions
    function_list = [
        *MATH_FUNCS,
        *RETRIEVAL_FUNCS,
    ]
    assistant_model_config = ChatGPTConfig(
        tools=function_list,
        temperature=0.0,
    )

    role_play_session = RolePlaying(
        assistant_role_name="Searcher",
        user_role_name="Professor",
        assistant_agent_kwargs=dict(
            model_type=model_type,
            model_config=assistant_model_config,
            tools=function_list,
        ),
        user_agent_kwargs=dict(
            model_type=model_type,
            model_config=user_model_config,
        ),
        task_prompt=task_prompt,
        with_task_specify=False,
    )

    print(
        Fore.GREEN
        + f"AI Assistant sys message:\n{role_play_session.assistant_sys_msg}\n"
    )
    print(Fore.BLUE + f"AI User sys message:\n{role_play_session.user_sys_msg}\n")

    print(Fore.YELLOW + f"Original task prompt:\n{task_prompt}\n")
    print(
        Fore.CYAN
        + f"Specified task prompt:\n{role_play_session.specified_task_prompt}\n"
    )
    print(Fore.RED + f"Final task prompt:\n{role_play_session.task_prompt}\n")

    n = 0
    input_msg = role_play_session.init_chat()
    while n < chat_turn_limit:
        n += 1
        assistant_response, user_response = role_play_session.step(input_msg)

        if assistant_response.terminated:
            print(
                Fore.GREEN
                + (
                    "AI Assistant terminated. Reason: "
                    f"{assistant_response.info['termination_reasons']}."
                )
            )
            break
        if user_response.terminated:
            print(
                Fore.GREEN
                + (
                    "AI User terminated. "
                    f"Reason: {user_response.info['termination_reasons']}."
                )
            )
            break

        # Print output from the user
        print_text_animated(Fore.BLUE + f"AI User:\n\n{user_response.msg.content}\n")

        # Print output from the assistant, including any function
        # execution information
        print_text_animated(Fore.GREEN + "AI Assistant:")
        tool_calls: List[FunctionCallingRecord] = assistant_response.info["tool_calls"]
        for func_record in tool_calls:
            print_text_animated(f"{func_record}")
        print_text_animated(f"{assistant_response.msg.content}\n")

        # Stop once the user agent signals that the task is complete
        if "CAMEL_TASK_DONE" in user_response.msg.content:
            break

        input_msg = assistant_response.msg
Run the role-playing with the defined retriever function:
role_playing_with_rag(
    task_prompt="""What is the main termination reasons for AI Society
                   dataset, how many number of messages did camel decided to
                   limit, what's the value plus 100? You should refer to the
                   content in path camel/local_data/camel paper.pdf"""
)
AI Assistant sys message:
BaseMessage(role_name='Searcher', role_type=<RoleType.ASSISTANT: 'assistant'>, meta_dict={'task': "What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what's the value plus 100?", 'assistant_role': 'Searcher', 'user_role': 'Professor'}, content="===== RULES OF ASSISTANT =====\nNever forget you are a Searcher and I am a Professor. Never flip roles! Never instruct me!\nWe share a common interest in collaborating to successfully complete a task.\nYou must help me to complete the task.\nHere is the task: What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what's the value plus 100?. Never forget our task!\nI must instruct you based on your expertise and my needs to complete the task.\n\nI must give you one instruction at a time.\nYou must write a specific solution that appropriately solves the requested instruction and explain your solutions.\nYou must decline my instruction honestly if you cannot perform the instruction due to physical, moral, legal reasons or your capability and explain the reasons.\nUnless I say the task is completed, you should always start with:\n\nSolution: <YOUR_SOLUTION>\n\n<YOUR_SOLUTION> should be very specific, include detailed explanations and provide preferable detailed implementations and examples and lists for task-solving.\nAlways end <YOUR_SOLUTION> with: Next request.")
AI User sys message:
BaseMessage(role_name='Professor', role_type=<RoleType.USER: 'user'>, meta_dict={'task': "What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what's the value plus 100?", 'assistant_role': 'Searcher', 'user_role': 'Professor'}, content='===== RULES OF USER =====\nNever forget you are a Professor and I am a Searcher. Never flip roles! You will always instruct me.\nWe share a common interest in collaborating to successfully complete a task.\nI must help you to complete the task.\nHere is the task: What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what\'s the value plus 100?. Never forget our task!\nYou must instruct me based on my expertise and your needs to solve the task ONLY in the following two ways:\n\n1. Instruct with a necessary input:\nInstruction: <YOUR_INSTRUCTION>\nInput: <YOUR_INPUT>\n\n2. Instruct without any input:\nInstruction: <YOUR_INSTRUCTION>\nInput: None\n\nThe "Instruction" describes a task or question. The paired "Input" provides further context or information for the requested "Instruction".\n\nYou must give me one instruction at a time.\nI must write a response that appropriately solves the requested instruction.\nI must decline your instruction honestly if I cannot perform the instruction due to physical, moral, legal reasons or my capability and explain the reasons.\nYou should instruct me not ask me questions.\nNow you must start to instruct me using the two ways described above.\nDo not add anything else other than your instruction and the optional corresponding input!\nKeep giving me instructions and necessary inputs until you think the task is completed.\nWhen the task is completed, you must only reply with a single word <CAMEL_TASK_DONE>.\nNever say <CAMEL_TASK_DONE> unless my responses have solved your task.')
Original task prompt:
What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what's the value plus 100?
Specified task prompt:
None
Final task prompt:
What is the main termination reasons for AI Society dataset, how many number of messages did camel decided to limit, what's the value plus 100?
AI User:
Instruction: Provide a summary of the main termination reasons in the AI Society dataset.
Input: None
AI Assistant:
Function Execution: local_retriever
Args: {'query': 'main termination reasons for AI Society dataset'}
Result: Original Query:
{main termination reasons for AI Society dataset}
Retrieved Context:
Next we examine the conversation termination reasons for both AI Society and Code datasets. As can be seen in Figure 8, the main termination reasons for AI Society dataset is Assistant Instruct whereas for Code it is Token Limit. The latter is expected as the since responses that contain code tend to be long. It is also interesting to note that in both datasets, the termination due to Maximum Number of Messages is low indicating that the limit of 40 maximum messages is reasonable. Our decision t
Solution: The main termination reason for the AI Society dataset is "Assistant Instruct." This indicates that conversations in the AI Society dataset typically end when the assistant is instructed to terminate the conversation.
Next request.
AI User:
Instruction: Identify the number of messages that the camel decided to limit in the context provided.
Input: None
AI Assistant:
Solution: The context provided from the local retriever indicates that the camel decided to limit the number of messages to a maximum of 40 messages. This is mentioned in the retrieved context where it states that the termination due to Maximum Number of Messages is low, indicating that the limit of 40 maximum messages is reasonable.
Next request.
AI User:
Instruction: Calculate the value of the message limit plus 100.
Input: None
AI Assistant:
Function Execution: add
Args: {'a': 40, 'b': 100}
Result: 140
Solution: The value of the message limit plus 100 is 140.
Next request.
AI User:
CAMEL_TASK_DONE
AI Assistant:
Solution: Understood, the task is completed.
Next request.
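In the transcript above, the assistant answers the last question by calling an `add` tool with the arguments `{'a': 40, 'b': 100}`. The mechanism behind such a call can be sketched as a lookup from tool name to Python function, followed by a call with keyword arguments. This is a simplified illustration and not CAMEL's implementation; the `TOOLS` registry and `execute_tool_call` are hypothetical names.

```python
from typing import Any, Callable, Dict


def add(a: int, b: int) -> int:
    """Add two numbers, standing in for one of the math tools."""
    return a + b


# Registry mapping the tool name chosen by the model to a Python callable
TOOLS: Dict[str, Callable[..., Any]] = {"add": add}


def execute_tool_call(name: str, args: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with the model-provided arguments."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**args)


result = execute_tool_call("add", {"a": 40, "b": 100})
print(f"Function Execution: add\nResult: {result}")
```

The model only decides which tool to call and with which arguments; the arithmetic itself runs as ordinary code, which is why the result is exact.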