How does RAGFlow perform OCR on scanned documents?

RAGFlow performs OCR (Optical Character Recognition) through DeepDoc, its visual document understanding model and the default parser from v0.17.0 onward. DeepDoc is trained to handle scanned PDFs and image-heavy documents, converting visual content into machine-readable text while preserving document structure.

Beyond character extraction, DeepDoc also performs TSR (Table Structure Recognition) to identify tables and their cell layouts, and DLR (Document Layout Recognition) to identify document zones such as headers, footers, paragraphs, and sections. Combining these tasks lets the parser treat a scanned page as a structured document: a table is preserved as structured data rather than flattened to text, and headers provide context for the content that follows. The parser outputs text chunks with position metadata (page number and bounding box coordinates), so each result can be traced back to its location in the original document.

For complex or specialized documents (handwritten content, unusual fonts, low-quality scans), DeepDoc’s neural approach generally performs better than traditional OCR engines because it learns document patterns from training data. RAGFlow also offers two alternative parsers as experimental options if you need a different OCR strategy: MinerU (converts PDFs to machine-readable formats) and Docling (open-source document processing). If your scanned documents have already been converted to searchable PDFs (PDFs with an embedded text layer from prior OCR), you can use the Naive parser to skip OCR and improve speed. For messy scanned documents, DeepDoc is the recommended choice, and RAGFlow applies it automatically unless configured otherwise.

For production deployments, Milvus provides a dedicated open-source vector database optimized for RAG pipelines, while Zilliz Cloud offers a managed alternative with enterprise-grade performance and reliability…
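The parser choice and the position metadata described above can be illustrated with a small Python sketch. This is not RAGFlow's API: the `Chunk` record, the function names, and the thresholds are all hypothetical, and the per-page text is assumed to have been extracted beforehand with whatever PDF library you use. Only the parser labels "Naive" and "DeepDoc" come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical record for a parsed chunk with position metadata."""
    text: str
    page: int                                # 1-based page number
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page


def has_text_layer(page_texts: List[str], min_chars: int = 20,
                   min_coverage: float = 0.8) -> bool:
    """Return True when most pages already carry embedded text.

    The thresholds are illustrative assumptions, not values used by RAGFlow.
    """
    if not page_texts:
        return False
    pages_with_text = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    return pages_with_text / len(page_texts) >= min_coverage


def choose_parser(page_texts: List[str]) -> str:
    """Pick a parser label: skip OCR for searchable PDFs, else use DeepDoc."""
    return "Naive" if has_text_layer(page_texts) else "DeepDoc"


def locate(chunk: Chunk) -> str:
    """Format a chunk's position so a result can be traced to its source."""
    x0, y0, x1, y1 = chunk.bbox
    return f"page {chunk.page}, box ({x0}, {y0}, {x1}, {y1})"
```

A purely scanned file yields empty page texts, so `choose_parser` falls back to "DeepDoc"; a file whose pages already contain extractable text is routed to "Naive". In practice you would still spot-check the embedded text, since a poor earlier OCR pass can leave a text layer that is present but unreliable.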

Related Resources: RAG Pipeline with Milvus | Improving Chunking for RAG
