Does Nemotron 3 Super support multimodal inputs like images?

Nemotron 3 Super itself is a text-only model, so it does not accept images directly. NVIDIA’s ecosystem provides multimodal capabilities through companion models: Llama Nemotron Embed VL and Llama Nemotron Rerank VL, which handle vision-language tasks.

Embed VL generates vector embeddings from both text and images, enabling you to store and search multimodal content. Rerank VL scores the relevance of multimodal items during retrieval, improving the quality of results in RAG systems. These models work alongside Nemotron 3 Super to create end-to-end multimodal AI pipelines.
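The embed-then-rerank flow above can be sketched in a few lines. This is a conceptual stand-in, not the real API: `embed_vl()` and `rerank_vl()` are hypothetical placeholders for calls to Embed VL and Rerank VL, and the toy embedding (a normalized letter histogram) only mimics the shape of a shared text/image vector space.

```python
# Conceptual sketch of a two-stage multimodal retrieval pipeline:
# embed, retrieve candidates, then rerank before handing context to the LLM.
# embed_vl() and rerank_vl() are hypothetical stand-ins for the VL models.
import math

def embed_vl(content: str) -> list[float]:
    # Stand-in embedder: L2-normalized letter histogram. The real model
    # would map either text or an image into one shared vector space.
    counts = [0.0] * 26
    for ch in content.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def rerank_vl(query: str, candidates: list[dict]) -> list[dict]:
    # Stand-in reranker: cosine score between query and candidate vectors.
    # The real Rerank VL scores each (query, candidate) pair jointly,
    # which is slower than embedding search but more accurate.
    qv = embed_vl(query)
    def score(c: dict) -> float:
        return sum(a * b for a, b in zip(qv, embed_vl(c["content"])))
    return sorted(candidates, key=score, reverse=True)

# Stage 1 would be a vector search returning a broad candidate set;
# Stage 2 reranks it so only the best items reach Nemotron 3 Super.
candidates = [
    {"modality": "text", "content": "Resetting your password"},
    {"modality": "image", "content": "screenshot_error_403.png"},
    {"modality": "text", "content": "Billing FAQ overview"},
]
reranked = rerank_vl("403 error screenshot", candidates)
```

In a production pipeline, stage 1 runs against a vector database and stage 2 calls the reranker on only the top-k candidates, keeping the expensive pairwise scoring off the full corpus.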

With Milvus, you can store embeddings from both vision and text modalities in the same collection. Your application can accept queries in any modality (text or image), generate embeddings using the appropriate VL model, and retrieve relevant items from Milvus regardless of their original format. This flexibility enables use cases like searching code repositories with visual diagrams, or finding relevant support documents by uploading a screenshot of an error.
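A minimal sketch of that single-collection pattern, under stated assumptions: a real deployment would use `pymilvus` (a `MilvusClient` collection with a vector field plus a scalar `modality` field, written via `insert` and queried via `search`), and the VL models would produce the embeddings. Here an in-memory list stands in for the collection and a toy letter-histogram embedder stands in for Embed VL, so the cross-modal behavior can be shown end to end.

```python
# In-memory stand-in for one Milvus collection holding embeddings from
# both modalities. The toy embedder below replaces the Embed VL model.
import math

def toy_embed(content: str) -> list[float]:
    # Deterministic toy embedding: L2-normalized letter histogram. In the
    # real pipeline, Embed VL maps text *and* images into one vector space.
    counts = [0.0] * 26
    for ch in content.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

collection: list[dict] = []  # one "collection", mixed modalities

def insert(modality: str, content: str) -> None:
    # Mirrors the shape of a Milvus insert: vector + scalar metadata.
    collection.append({
        "modality": modality,
        "content": content,
        "vector": toy_embed(content),
    })

def search(query_content: str, limit: int = 3) -> list[dict]:
    # Vector search over all stored items, regardless of their modality.
    qv = toy_embed(query_content)
    def score(r: dict) -> float:
        return sum(a * b for a, b in zip(qv, r["vector"]))
    return sorted(collection, key=score, reverse=True)[:limit]

insert("text", "How to fix a 403 forbidden error")
insert("image", "architecture_diagram_auth_flow.png")
insert("text", "Rotating API keys safely")

# An image query (represented here by its filename) can surface text docs:
hits = search("screenshot_403_forbidden.png", limit=1)
```

Because every item lives in the same vector space, the search function never needs to branch on modality; the `modality` field is only metadata your application can filter or display.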
