To index documents from a relational database using LlamaIndex, you need to extract structured data from tables, convert it into text-based documents, and build an index for LLM-based querying. LlamaIndex doesn’t directly interact with databases, so you’ll first query the database, process the results into a compatible format, and use LlamaIndex’s tools to create and manage the index. This typically involves using SQL queries or an ORM (like SQLAlchemy) to fetch data, transforming rows into text or structured documents, and then indexing them.
Start by connecting to your database and querying the required data. For example, if you have a users table with columns like id, name, and bio, you could write a SQL query to fetch these rows. Using Python, you might use a library like psycopg2 for PostgreSQL or sqlite3 for SQLite. Once the data is retrieved, convert each row into a text document. For instance, a row with name: "Alice" and bio: "Software engineer with 5 years of experience" could become a document like “Name: Alice. Bio: Software engineer with 5 years of experience.” LlamaIndex’s Document class can wrap this text, and each document should include metadata like the original table and row ID for traceability.
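As a sketch of this row-to-document step, here is a minimal example using Python's built-in sqlite3 module and a throwaway in-memory users table (the table contents are made up for illustration). In practice each (text, metadata) pair would become a LlamaIndex Document(text=..., metadata=...):

```python
import sqlite3

# Hypothetical in-memory database standing in for a real users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
conn.executemany(
    "INSERT INTO users (name, bio) VALUES (?, ?)",
    [("Alice", "Software engineer with 5 years of experience"),
     ("Bob", "Data analyst focused on dashboards")],
)

rows = conn.execute("SELECT id, name, bio FROM users ORDER BY id").fetchall()

# Each row becomes text plus metadata; the metadata keeps a link back to the source row.
docs = [
    {
        "text": f"Name: {name}. Bio: {bio}",
        "metadata": {"table": "users", "row_id": row_id},
    }
    for row_id, name, bio in rows
]
# In LlamaIndex these would be Document(text=..., metadata=...) objects.
```

The metadata dict is what later lets you trace a retrieved chunk back to its database row.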
Next, use LlamaIndex’s indexing pipeline. After converting database rows into Document objects, initialize a VectorStoreIndex to create embeddings for semantic search. Here’s a simplified code snippet:
from llama_index.core import VectorStoreIndex, Document  # "from llama_index import ..." on older releases
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///mydb.db")
with engine.connect() as conn:  # SQLAlchemy 2.0 style; engine.execute() was removed
    rows = conn.execute(text("SELECT id, name, bio FROM users"))
    documents = [Document(text=f"Name: {r.name}. Bio: {r.bio}",
                          metadata={"table": "users", "row_id": r.id})
                 for r in rows]
index = VectorStoreIndex.from_documents(documents)
This creates an index where each document represents a database row. You can then query the index using natural language (e.g., “Find users with experience in software engineering”), and LlamaIndex will retrieve relevant documents based on semantic similarity.
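The retrieval step relies on embedding similarity. As a toy illustration of what happens under the hood, with made-up three-dimensional vectors standing in for real embedding-model output, cosine similarity picks the document closest to the query:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up embeddings; a real index stores vectors produced by an embedding model.
doc_vectors = {
    "Name: Alice. Bio: Software engineer with 5 years of experience": [0.9, 0.1, 0.2],
    "Name: Bob. Bio: Pastry chef and baker": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of the natural-language query

best_match = max(doc_vectors, key=lambda doc: cosine(query_vector, doc_vectors[doc]))
```

Real embeddings have hundreds or thousands of dimensions, and the index uses approximate nearest-neighbor search rather than a brute-force max, but the ranking principle is the same.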
For more complex databases, consider joining tables or preprocessing data to create meaningful documents. For example, if a projects table links to users via a foreign key, you might write a SQL join to combine user bios with their projects, creating documents like “Alice works on Project X and Project Y.” Additionally, handle incremental updates by tracking timestamps or using database triggers to refresh the index when data changes. Tools like LlamaIndex’s SQLDatabase wrapper can automate some of this by integrating with SQLAlchemy, but custom pipelines are often needed for fine-grained control. Always include metadata like table names and row IDs to maintain a clear link between indexed documents and the original database records.