UltraRAG manages multimodal data through a modular, configurable framework designed to process and integrate diverse data types such as text, images, and documents that combine both. A core aspect of its approach is native multimodal support across its key components, including the Retriever, Generator, and Evaluator modules: these modules handle not only plain text but also visual data and cross-modal inputs, enabling a unified processing pipeline. For instance, its VisRAG pipeline parses local PDF documents, automatically extracting both text and charts, and then builds cross-modal indexes for hybrid retrieval, such as querying images with text or vice versa. This capability is essential for applications dealing with scientific papers or technical manuals, where information is often presented in mixed-media form.
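At its core, cross-modal retrieval of this kind works by embedding queries and documents into a shared vector space and ranking by similarity. The sketch below illustrates the idea with random stand-in vectors; in a real VisRAG-style setup the embeddings would come from a joint text-image encoder, and the dimensionality and helper names here are illustrative assumptions, not UltraRAG's actual API.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix, k=5):
    """Rank rows of doc_matrix by cosine similarity to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    top_k = np.argsort(scores)[::-1][:k]
    return top_k, scores

# Stand-in embeddings: 100 chart/figure crops and one text query,
# all projected into the same hypothetical 512-dim space.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(100, 512))
text_query = rng.normal(size=512)       # e.g. "revenue trend by quarter"

top_k, scores = rank_by_cosine(text_query, image_embeddings)
```

The same ranking works in either direction: embedding an image as the query and ranking text chunks gives image-to-text retrieval over the identical index.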
The framework further streamlines multimodal data handling through its Automated Knowledge & Corpus Construction process. A Unified Corpus Server within UltraRAG can parse a wide array of document formats, including .txt, .md, .pdf, and .epub. It integrates tools such as MinerU for layout-aware text recovery and flexible chunking, which helps extract and structure information correctly from complex layouts where text and images are intertwined. This parsing preserves the semantic meaning and the structural relationships between modalities before indexing. The processed data, stored as embeddings, can then be retrieved efficiently from vector databases such as Milvus or Faiss, both of which are supported for large-scale corpus construction and high-performance retrieval.
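To make the chunking step concrete, here is a minimal sketch of flexible chunking: split on paragraph boundaries first, then break oversized paragraphs into overlapping windows. The function name, sizes, and overlap value are assumptions for illustration, not the MinerU or UltraRAG implementation.

```python
def chunk_text(text, max_chars=200, overlap=40):
    """Split text into chunks, preferring paragraph boundaries.

    Paragraphs short enough to fit become single chunks; longer ones
    are cut into fixed-size windows that overlap so no sentence is
    lost at a boundary.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        start = 0
        while start < len(para):
            chunks.append(para[start:start + max_chars])
            start += max_chars - overlap
    return chunks

# Example: a short figure caption stays whole; the long body is windowed.
sample = "Figure 3: revenue by quarter.\n\n" + "Long body text. " * 30
chunks = chunk_text(sample)
```

Each chunk would then be embedded and written to the vector store (Milvus or Faiss) alongside metadata linking it back to its source page and any associated figures.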
UltraRAG's ability to manage complex multimodal workflows is rooted in its modular architecture, inspired by the Model Context Protocol (MCP), and in its declarative YAML configuration system. The MCP-style architecture encapsulates the core RAG components as standardized, independent servers (e.g., Corpus, Retriever, Generation, Evaluation), each exposing a unified tool interface, which promotes flexibility and extensibility. This modularity lets developers plug in and orchestrate different multimodal processing capabilities without extensive code changes. The entire workflow, from data ingestion and processing through retrieval and generation, is defined in YAML files, which simplifies building intricate pipelines with sequential, looped, or conditional logic over multimodal data. This low-code approach significantly reduces engineering overhead, making it easier for researchers and developers to experiment with and deploy sophisticated multimodal RAG systems.
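A declarative pipeline of this kind might look roughly like the following. This is a hypothetical sketch only: the key names (`servers`, `pipeline`, `loop`, and the individual tool names) are illustrative assumptions, not UltraRAG's exact configuration schema.

```yaml
# Hypothetical pipeline sketch -- key and tool names are illustrative.
servers:
  corpus: servers/corpus          # parsing + chunking (e.g. via MinerU)
  retriever: servers/retriever    # cross-modal index over Milvus/Faiss
  generation: servers/generation  # multimodal LLM
  evaluation: servers/evaluation

pipeline:
  - corpus.parse_documents        # extract text and charts from PDFs
  - corpus.build_index            # embed chunks, write to vector store
  - loop:                         # iterative retrieve-then-generate
      times: 2
      steps:
        - retriever.search        # text-to-image or text-to-text retrieval
        - generation.generate
  - evaluation.score
```

Because each entry maps to a tool on an independent server, swapping the retriever backend or adding a conditional branch is a configuration change rather than a code change.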