Parsing is Hard: Solving Semantic Understanding with Mistral OCR and Milvus
Let’s face it: parsing documents is hard—really hard. PDFs, images, reports, tables, messy handwriting; they’re packed with valuable information that your users want to search for, but extracting that information and express that accurately in your search index is like solving a puzzle where the pieces keep changing shape: you thought you’ve solved it with an extra line of code but tomorrow a new doc gets ingested and you find another corner case to deal with.
In this post, we’ll tackle this challenge head-on using the powerful combo of Mistral OCR and Milvus Vector DB, turning your doc parsing nightmares into a calm dream with searchable, semantically meaningful vector embeddings.
Why Rule-based Parsing Just Won’t Cut It
If you’ve ever struggled with standard OCR tools, you probably know that they have all sorts of issues:
- Complex layouts: Tables, lists, multi-column formats – they can break or pose issues to most parsers.
- Semantic ambiguity: Keywords alone don’t tell you if “apple” means fruit or company.
- The chanllege of scale and cost: Processing thousands of documents becomes painfully slow.
We need a smarter, more systematic approach that doesn’t just extract text—it understands the content. And that’s exactly where Mistral OCR and Milvus come in.
Meet Your Dream Team
Mistral OCR: More than just text extraction
Mistral OCR isn’t your average OCR tool. It’s designed to tackle a wide range of documents.
- Deep Understanding of Complex Documents: Whether it’s embedded images, mathematical equations, or tables, it can understand it all with a very high accuracy.
- Keeps original layouts: Not only does it understand the different layouts in the documents, it also keeps the original layouts and structure intact. On top of that, it’s also capable of parsing multi-page documents.
- Multilingual and Multimodal Mastery: From English to Hindi to Arabic, Mistral OCR can comprehend documents across thousands of languages and scripts, making it invaluable for applications targeting a global user base.
Milvus: Your Vector Database Built for Scale
- Billion+ Scale: Milvus can scale to billions of vectors, making it perfect for storing large-scale documents.
- Full-Text Search: In addition to supporting dense vector embeddings, Milvus also supports Full Text Search. Making it easy to run queries using text and get better results for your RAG system.
Examples:
Let’s take this handwritten note in English, for example. Using a regular OCR tool to extract this text would be a very hard task.
A handwritten note in English
We process it with Mistral OCR
api_key = os.getenv("MISTRAL_API_KEY")
client = Mistral(api_key=api_key)
url = "https://preview.redd.it/ocr-for-handwritten-documents-v0-os036yiv9xod1.png?width=640&format=png&auto=webp&s=29461b68383534a3c1bf76cc9e36a2ba4de13c86"
result = client.ocr.process(
model=ocr_model, document={"type": "image_url", "image_url": url}
)
print(f"Result: {result.pages[0].markdown}")
And we get the following output. It can recognize handwritten text well. We can see that it even keeps the capitalized format of the words "FORCED AND UNNATURAL"!
Today is Thursday, October 20th - But it definitely feels like a Friday. I'm already considering making a second cup of coffee - and I haven't even finished my first. Do I have a problem?
Sometimes I'll fly through older notes I've taken, and my handwriting is unrecamptable. Perhaps it depends on the type of pen I use. I've tried writing in all cups but it looks so FORCED AND UNNATURAL.
Often times, I'll just take notes on my lapten, but I still seem to ermittelt forward pen and paper. Any advice on what to
improve? I already feel stressed at looking back at what I've just written - it looks like I different people wrote this!
Now we can then insert the text into Milvus for semantic search.
from pymilvus import MilvusClient
COLLECTION_NAME = "document_ocr"
milvus_client = MilvusClient(uri='http://localhost:19530')
"""
This is where you would define the index, create a collection etc. For the sake of this example. I am skipping it.
schema = CollectionSchema(...)
milvus_client.create_collection(
collection_name=COLLECTION_NAME,
schema=schema,
)
"""
milvus_client.insert(collection_name=COLLECTION_NAME, data=[result.pages[0].markdown])
But Mistral can also understand documents in different languages or in more complex format, for example let’s try this invoice in German that combines some item names in English.
An Invoice in German
Mistral OCR is still capable of extracting all the information you have and it even creates the table structure in Markdown that represents the table from the scanned image.
Rechnungsadresse:
Jähn Jessel GmbH a. Co. KG Marianne Scheibe Karla-Löffler-Weg 2 66522 Wismar
Lieferadresse:
Jähn Jessel GmbH a. Co. KG Marianne Scheibe Karla-Löffler-Weg 2 66522 Wismar
Rechnungsinformationen:
Bestelldatum: 2004-10-20
Bezahit: Ja
Expressversand: Nein
Rechnungsnummer: 4652
Rechnungsübersicht
| Pos. | Produkt | Preis <br> (Netto) | Menge | Steuersatz | Summe <br> Brutto |
| :--: | :--: | :--: | :--: | :--: | :--: |
| 1 | Grundig CH 7280w Multi-Zerkleinerer (Gourmet, 400 Watt, 11 Glasbehälter), weiß | 183.49 C | 2 | $0 \%$ | 366.98 C |
| 2 | Planet K | 349.9 C | 2 | $19.0 \%$ | 832.76 C |
| 3 | The Cabin in the Woods (Blu-ray) | 159.1 C | 2 | $7.0 \%$ | 340.47 C |
| 4 | Schenkung auf Italienisch Taschenbuch - 30. | 274.33 C | 4 | $19.0 \%$ | 1305.81 C |
| 5 | Xbox 360 - Razer 0N2A Controller Tournament Edition | 227.6 C | 2 | $7.0 \%$ | 487.06 C |
| 6 | Philips LED-Lampe ersetzt 25Watt E27 2700 Kelvin - warm-weiß, 2.7 Watt, 250 Lumen IEnergieklasse A++I | 347.57 C | 3 | $7.0 \%$ | 1115.7 C |
| 7 | Spannende Abenteuer Die verschollene Grabkammer | 242.8 C | 6 | $0 \%$ | 1456.8 C |
| Zw. summe | | 1784.79 C | | | |
| Zzgl. Mwst. 7\% | | 51.4 C | | | |
| Zzgl. Mwst. 19\% | | 118.6 C | | | |
| Gesamtbetrag C inkl. MwSt. | | 1954.79 C | | | |
Real-World Usage: A Case Study
Now that we’ve seen that Mistral OCR can work on different documents, we could imagine how a legal firm that is drowning in case files and contracts leverage this tool. By implementing a RAG system with Mistral OCR and Milvus, what once took a paralegal countless hours, like manually scanning for specific clauses or comparing past cases, now is done by AI in only a couple of minutes.
Next Steps
Ready to extract all your content? Head over to the notebook on GitHub for the full example, join our Discord to chat with the community, and start building today! You can also check out Mistral documentation about their OCR model
Say goodbye to parsing chaos, and hello to intelligent, scalable document understanding.
- Why Rule-based Parsing Just Won't Cut It
- Meet Your Dream Team
- Examples:
- Real-World Usage: A Case Study
On This Page
Try Managed Milvus for Free
Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.
Get StartedLike the article? Spread the word