🚀 免費嘗試 Zilliz Cloud,完全托管的 Milvus,體驗速度提升 10 倍!立即嘗試

milvus-logo
LFAI
主頁
  • 整合
    • 知識工程

使用 Vanna 和 Milvus 寫 SQL

Open In Colab GitHub Repository

Vanna是一個開放原始碼的 Python RAG(Retrieval-Augmented Generation)框架,用於 SQL 生成和相關功能。Milvus是世界上最先進的開放原始碼向量資料庫,用於嵌入相似性搜尋和人工智能應用程式。

Vanna 可透過兩個簡單的步驟運作 - 在您的資料上訓練 RAG「模型」,然後發問問題,這些問題會回傳 SQL 查詢,這些查詢可以設定為在您的資料庫上執行。本指南將示範如何使用 Vanna 根據資料庫中儲存的資料,產生並執行 SQL 查詢。

先決條件

在執行本筆記本之前,請確定您已安裝下列依賴項目:

$ pip install "vanna[milvus,openai]"

如果您使用的是 Google Colab,為了啟用剛安裝的相依性,您可能需要重新啟動執行時(點選畫面頂端的「Runtime」功能表,並從下拉式功能表中選擇「Restart session」)。

而且您需要在環境變數中設定OPENAI_API_KEY 。您可以從OpenAI 取得 API 金鑰。

import os

os.environ["OPENAI_API_KEY"] = "sk-***********"

資料準備

首先,我們需要繼承 Vanna 的Milvus_VectorStoreOpenAI_Chat 類,並定義一個結合兩者功能的新類VannaMilvus

from pymilvus import MilvusClient, model
from vanna.milvus import Milvus_VectorStore
from vanna.openai import OpenAI_Chat


class VannaMilvus(Milvus_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        Milvus_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

我們使用必要的組態參數初始化VannaMilvus 類別。我們使用milvus_client 的實例來儲存嵌入式資料,並使用從milvus_model初始化的model.DefaultEmbeddingFunction() 來產生嵌入式資料。C

至於MilvusClient 的參數:

  • uri 設定為本機檔案,例如./milvus.db ,是最方便的方法,因為它會自動利用Milvus Lite將所有資料儲存在這個檔案中。
  • 如果您有大規模的資料,您可以在docker 或 kubernetes 上架設效能更高的 Milvus 伺服器。在此設定中,請使用伺服器的 uri,例如http://localhost:19530 ,作為您的uri
  • 如果您想使用Zilliz Cloud(Milvus 的完全管理雲端服務),請調整uritoken ,對應 Zilliz Cloud 的Public Endpoint 和 Api key
milvus_uri = "./milvus_vanna.db"

milvus_client = MilvusClient(uri=milvus_uri)

vn_milvus = VannaMilvus(
    config={
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-3.5-turbo",
        "milvus_client": milvus_client,
        "embedding_function": model.DefaultEmbeddingFunction(),
        "n_results": 2,  # The number of results to return from Milvus semantic search.
    }
)

這是一個只有少數資料樣本的簡單範例,因此我們將n_results 設定為 2,以確保我們搜尋前 2 個最相似的結果。實際上,在處理較大的訓練資料集時,您應該將n_results 設定為較高的值。

我們將使用一個樣本 SQLite 資料庫,其中有幾個包含一些樣本資料的資料表。

import sqlite3

sqlite_path = "./my-database.sqlite"
sql_connect = sqlite3.connect(sqlite_path)
c = sql_connect.cursor()

init_sqls = """
CREATE TABLE IF NOT EXISTS Customer (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Name TEXT NOT NULL,
    Company TEXT NOT NULL,
    City TEXT NOT NULL,
    Phone TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS Company (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Name TEXT NOT NULL,
    Industry TEXT NOT NULL,
    Location TEXT NOT NULL,
    EmployeeCount INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS User (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Username TEXT NOT NULL UNIQUE,
    Email TEXT NOT NULL UNIQUE
);

INSERT INTO Customer (Name, Company, City, Phone) 
VALUES ('John Doe', 'ABC Corp', 'New York', '123-456-7890');

INSERT INTO Customer (Name, Company, City, Phone) 
VALUES ('Jane Smith', 'XYZ Inc', 'Los Angeles', '098-765-4321');

INSERT INTO Company (Name, Industry, Location, EmployeeCount)
VALUES ('ABC Corp', 'cutting-edge technology', 'New York', 100);

INSERT INTO User (Username, Email)
VALUES ('johndoe123', 'johndoe123@example.com');
"""

for sql in init_sqls.split(";"):
    c.execute(sql)

sql_connect.commit()

# Connect to the SQLite database
vn_milvus.connect_to_sqlite(sqlite_path)

使用資料進行訓練

我們可以在 SQLite 資料庫的 DDL 資料上訓練模型。我們取得 DDL 資料,並將其送入train 函式。

# If there exists training data, we should remove it before training.
existing_training_data = vn_milvus.get_training_data()
if len(existing_training_data) > 0:
    for _, training_data in existing_training_data.iterrows():
        vn_milvus.remove_training_data(training_data["id"])

# Get the DDL of the SQLite database
df_ddl = vn_milvus.run_sql("SELECT type, sql FROM sqlite_master WHERE sql is not null")

# Train the model on the DDL data
for ddl in df_ddl["sql"].to_list():
    vn_milvus.train(ddl=ddl)
Adding ddl: CREATE TABLE Customer (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Name TEXT NOT NULL,
    Company TEXT NOT NULL,
    City TEXT NOT NULL,
    Phone TEXT NOT NULL
)
Adding ddl: CREATE TABLE sqlite_sequence(name,seq)
Adding ddl: CREATE TABLE Company (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Name TEXT NOT NULL,
    Industry TEXT NOT NULL,
    Location TEXT NOT NULL,
    EmployeeCount INTEGER NOT NULL
)
Adding ddl: CREATE TABLE User (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Username TEXT NOT NULL UNIQUE,
    Email TEXT NOT NULL UNIQUE
)

除了在 DDL 資料上進行訓練外,我們也可以在資料庫的文件和 SQL 查詢上進行訓練。

# Add documentation about your business terminology or definitions.
vn_milvus.train(
    documentation="ABC Corp specializes in cutting-edge technology solutions and innovation."
)
vn_milvus.train(
    documentation="XYZ Inc is a global leader in manufacturing and supply chain management."
)

# You can also add SQL queries to your training data.
vn_milvus.train(sql="SELECT * FROM Customer WHERE Name = 'John Doe'")
Adding documentation....
Adding documentation....
Using model gpt-3.5-turbo for 65.0 tokens (approx)
Question generated with sql: What are the details of the customer named John Doe? 
Adding SQL...





'595b185c-e6ad-47b0-98fd-0e93ef9b6a0a-sql'

讓我們來看看訓練資料。

training_data = vn_milvus.get_training_data()
training_data
# id 問題 內容
0 595b185c-e6ad-47b0-98fd-0e93ef9b6a0a-sql 名為 Joh... 的客戶的詳細資料是什麼? SELECT * FROM Customer WHERE Name = 'John Doe'
0 25f4956c-e370-4097-994f-996f22d145fa-ddl CREATE TABLE Company (\n ID INTEGER PRIMARY...
1 b95ecc66-f65b-49dc-a9f1-c1842ad230ff-ddl CREATE TABLE Customer (\n ID INTEGER PRIMAR...
2 fcc73d15-30a5-4421-9d73-b8c3b0ed5305-ddl CREATE TABLE sqlite_sequence(name,seq)
3 feae618c-5910-4f6f-8b4b-6cc3e03aec06-ddl CREATE TABLE User (\n ID INTEGER PRIMARY KE...
0 79a48db1-ba1f-4fd5-be99-74f2ca2eaeeb-doc XYZ Inc 是一家全球領先的製造和服務供應商。
1 9f9df1b8-ae62-4823-ad28-d7e0f2d1f4c0-doc ABC Corp 專門從事尖端技術的研發與生產。

產生 SQL 並執行

由於我們已經訓練過 DDL 資料,現在資料表結構可以用來產生 SQL 查詢。

讓我們嘗試一個簡單的問題。

sql = vn_milvus.generate_sql("what is the phone number of John Doe?")
vn_milvus.run_sql(sql)
SQL Prompt: [{'role': 'system', 'content': "You are a SQLite expert. Please help to generate a SQL query to answer the question. Your response should ONLY be based on the given context and follow the response guidelines and format instructions. \n===Tables \nCREATE TABLE Customer (\n    ID INTEGER PRIMARY KEY AUTOINCREMENT,\n    Name TEXT NOT NULL,\n    Company TEXT NOT NULL,\n    City TEXT NOT NULL,\n    Phone TEXT NOT NULL\n)\n\nCREATE TABLE User (\n    ID INTEGER PRIMARY KEY AUTOINCREMENT,\n    Username TEXT NOT NULL UNIQUE,\n    Email TEXT NOT NULL UNIQUE\n)\n\n\n===Additional Context \n\nABC Corp specializes in cutting-edge technology solutions and innovation.\n\nXYZ Inc is a global leader in manufacturing and supply chain management.\n\n===Response Guidelines \n1. If the provided context is sufficient, please generate a valid SQL query without any explanations for the question. \n2. If the provided context is almost sufficient but requires knowledge of a specific string in a particular column, please generate an intermediate SQL query to find the distinct strings in that column. Prepend the query with a comment saying intermediate_sql \n3. If the provided context is insufficient, please explain why it can't be generated. \n4. Please use the most relevant table(s). \n5. If the question has been asked and answered before, please repeat the answer exactly as it was given before. \n"}, {'role': 'user', 'content': 'What are the details of the customer named John Doe?'}, {'role': 'assistant', 'content': "SELECT * FROM Customer WHERE Name = 'John Doe'"}, {'role': 'user', 'content': 'what is the phone number of John Doe?'}]
Using model gpt-3.5-turbo for 367.25 tokens (approx)
LLM Response: SELECT Phone FROM Customer WHERE Name = 'John Doe'
# 電話
0 123-456-7890

下面是一個比較複雜的問題。製造企業名稱資訊在文件資料中,屬於背景資訊。產生的 SQL 查詢將根據特定的製造公司名稱擷取客戶資訊。

sql = vn_milvus.generate_sql("which customer works for a manufacturing corporation?")
vn_milvus.run_sql(sql)
SQL Prompt: [{'role': 'system', 'content': "You are a SQLite expert. Please help to generate a SQL query to answer the question. Your response should ONLY be based on the given context and follow the response guidelines and format instructions. \n===Tables \nCREATE TABLE Company (\n    ID INTEGER PRIMARY KEY AUTOINCREMENT,\n    Name TEXT NOT NULL,\n    Industry TEXT NOT NULL,\n    Location TEXT NOT NULL,\n    EmployeeCount INTEGER NOT NULL\n)\n\nCREATE TABLE Customer (\n    ID INTEGER PRIMARY KEY AUTOINCREMENT,\n    Name TEXT NOT NULL,\n    Company TEXT NOT NULL,\n    City TEXT NOT NULL,\n    Phone TEXT NOT NULL\n)\n\n\n===Additional Context \n\nXYZ Inc is a global leader in manufacturing and supply chain management.\n\nABC Corp specializes in cutting-edge technology solutions and innovation.\n\n===Response Guidelines \n1. If the provided context is sufficient, please generate a valid SQL query without any explanations for the question. \n2. If the provided context is almost sufficient but requires knowledge of a specific string in a particular column, please generate an intermediate SQL query to find the distinct strings in that column. Prepend the query with a comment saying intermediate_sql \n3. If the provided context is insufficient, please explain why it can't be generated. \n4. Please use the most relevant table(s). \n5. If the question has been asked and answered before, please repeat the answer exactly as it was given before. \n"}, {'role': 'user', 'content': 'What are the details of the customer named John Doe?'}, {'role': 'assistant', 'content': "SELECT * FROM Customer WHERE Name = 'John Doe'"}, {'role': 'user', 'content': 'which customer works for a manufacturing corporation?'}]
Using model gpt-3.5-turbo for 384.25 tokens (approx)
LLM Response: SELECT * 
FROM Customer 
WHERE Company = 'XYZ Inc'
# ID 名稱 公司名稱 城市 電話
0 2 簡-史密斯 XYZ Inc 洛杉磯 098-765-4321

中斷 SQLite 和 Milvus 的連線,並將其移除以釋放資源。

sql_connect.close()
milvus_client.close()

os.remove(sqlite_path)
if os.path.exists(milvus_uri):
    os.remove(milvus_uri)

免費嘗試托管的 Milvus

Zilliz Cloud 無縫接入,由 Milvus 提供動力,速度提升 10 倍。

開始使用
反饋

這個頁面有幫助嗎?