Does Nemotron 3 Super support multimodal inputs like images?

Nemotron 3 Super itself is a text-only model, so it does not accept images directly. NVIDIA’s ecosystem provides multimodal capabilities through companion models: Llama Nemotron Embed VL and Llama Nemotron Rerank VL, which handle vision-language tasks.

Embed VL generates vector embeddings from both text and images, enabling you to store and search multimodal content. Rerank VL scores the relevance of multimodal items during retrieval, improving the quality of results in RAG systems. These models work alongside Nemotron 3 Super to create end-to-end multimodal AI pipelines.
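The embed-then-rerank flow above can be sketched in a few lines. This is a conceptual stand-in, not the real API: `embed_vl()` and `rerank_vl()` are hypothetical placeholders for calls to Embed VL and Rerank VL, and the toy embedding (a normalized letter histogram) only mimics the shape of a shared text/image vector space.

```python
# Conceptual sketch of a two-stage multimodal retrieval pipeline:
# embed, retrieve candidates, then rerank before handing context to the LLM.
# embed_vl() and rerank_vl() are hypothetical stand-ins for the VL models.
import math

def embed_vl(content: str) -> list[float]:
    # Stand-in embedder: L2-normalized letter histogram. The real model
    # would map either text or an image into one shared vector space.
    counts = [0.0] * 26
    for ch in content.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def rerank_vl(query: str, candidates: list[dict]) -> list[dict]:
    # Stand-in reranker: cosine score between query and candidate vectors.
    # The real Rerank VL scores each (query, candidate) pair jointly,
    # which is slower than embedding search but more accurate.
    qv = embed_vl(query)
    def score(c: dict) -> float:
        return sum(a * b for a, b in zip(qv, embed_vl(c["content"])))
    return sorted(candidates, key=score, reverse=True)

# Stage 1 would be a vector search returning a broad candidate set;
# Stage 2 reranks it so only the best items reach Nemotron 3 Super.
candidates = [
    {"modality": "text", "content": "Resetting your password"},
    {"modality": "image", "content": "screenshot_error_403.png"},
    {"modality": "text", "content": "Billing FAQ overview"},
]
reranked = rerank_vl("403 error screenshot", candidates)
```

In a production pipeline, stage 1 runs against a vector database and stage 2 calls the reranker on only the top-k candidates, keeping the expensive pairwise scoring off the full corpus.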

With Milvus, you can store embeddings from both vision and text modalities in the same collection. Your application can accept queries in any modality (text or image), generate embeddings using the appropriate VL model, and retrieve relevant items from Milvus regardless of their original format. This flexibility enables use cases like searching code repositories with visual diagrams, or finding relevant support documents by uploading a screenshot of an error.
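A minimal sketch of that single-collection pattern, under stated assumptions: a real deployment would use `pymilvus` (a `MilvusClient` collection with a vector field plus a scalar `modality` field, written via `insert` and queried via `search`), and the VL models would produce the embeddings. Here an in-memory list stands in for the collection and a toy letter-histogram embedder stands in for Embed VL, so the cross-modal behavior can be shown end to end.

```python
# In-memory stand-in for one Milvus collection holding embeddings from
# both modalities. The toy embedder below replaces the Embed VL model.
import math

def toy_embed(content: str) -> list[float]:
    # Deterministic toy embedding: L2-normalized letter histogram. In the
    # real pipeline, Embed VL maps text *and* images into one vector space.
    counts = [0.0] * 26
    for ch in content.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

collection: list[dict] = []  # one "collection", mixed modalities

def insert(modality: str, content: str) -> None:
    # Mirrors the shape of a Milvus insert: vector + scalar metadata.
    collection.append({
        "modality": modality,
        "content": content,
        "vector": toy_embed(content),
    })

def search(query_content: str, limit: int = 3) -> list[dict]:
    # Vector search over all stored items, regardless of their modality.
    qv = toy_embed(query_content)
    def score(r: dict) -> float:
        return sum(a * b for a, b in zip(qv, r["vector"]))
    return sorted(collection, key=score, reverse=True)[:limit]

insert("text", "How to fix a 403 forbidden error")
insert("image", "architecture_diagram_auth_flow.png")
insert("text", "Rotating API keys safely")

# An image query (represented here by its filename) can surface text docs:
hits = search("screenshot_403_forbidden.png", limit=1)
```

Because every item lives in the same vector space, the search function never needs to branch on modality; the `modality` field is only metadata your application can filter or display.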
