Yes, Gemma 4 can generate embeddings for both text and images, enabling unified multimodal vector search applications.
Gemma 4’s Per-Layer Embeddings architecture produces vector representations at each decoder layer. That flexibility lets you extract embeddings from intermediate layers rather than only the final output, trading off embedding dimensionality, quality, and latency to suit your use case.
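In a Hugging Face-style pipeline, layer selection usually means requesting all hidden states and pooling tokens from the layer you want. The sketch below shows only that pooling and layer-selection logic; the hidden states are simulated with random arrays, since the exact model checkpoint and output layout are assumptions, not something this article specifies.

```python
import numpy as np

def pool_layer(hidden_states, attention_mask, layer=-1):
    """Mean-pool token vectors from one decoder layer into a single embedding.

    hidden_states: list of (batch, seq_len, dim) arrays, one per layer
                   (shaped like model(..., output_hidden_states=True).hidden_states).
    attention_mask: (batch, seq_len) array of 0/1 marking real (non-padding) tokens.
    layer: which layer to pool; -1 is the final layer, smaller indices are
           earlier, cheaper layers.
    """
    states = hidden_states[layer]                       # (batch, seq, dim)
    mask = attention_mask[..., None].astype(states.dtype)
    summed = (states * mask).sum(axis=1)                # ignore padding tokens
    counts = mask.sum(axis=1).clip(min=1e-9)
    emb = summed / counts                               # mean over real tokens
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

# Simulated model output: 4 layers, batch of 2, 5 tokens, 16-dim states.
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(2, 5, 16)) for _ in range(4)]
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

final_emb = pool_layer(hidden, mask, layer=-1)  # final-layer embeddings
mid_emb = pool_layer(hidden, mask, layer=2)     # intermediate-layer embeddings
print(final_emb.shape)  # → (2, 16)
```

Swapping `layer` is the only change needed to compare final-layer against intermediate-layer embeddings on your own retrieval benchmark.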
The multimodal capability is particularly powerful: images and text are embedded in the same vector space, enabling cross-modal semantic search. You can retrieve images with text queries, or the reverse. A unified embedding space like this is the foundation of modern multimodal retrieval systems.
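Once text and image vectors live in one space, cross-modal search reduces to nearest-neighbor ranking by cosine similarity. The sketch below uses random placeholder vectors in place of real model calls (a hypothetical `embed_text` / `embed_image` pair), so it demonstrates only the ranking logic, not any particular model's API.

```python
import numpy as np

def cosine_top_k(query, corpus, k=2):
    """Rank corpus vectors by cosine similarity to a query vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                         # cosine similarity per corpus row
    order = np.argsort(-scores)[:k]        # indices of the k best matches
    return [(int(i), float(scores[i])) for i in order]

# Placeholders for real model calls: in a shared embedding space, a text
# query vector is directly comparable to image vectors.
rng = np.random.default_rng(1)
image_vectors = rng.normal(size=(5, 16))   # e.g. one embed_image(img) per image
# Simulate a text query whose meaning matches image 3 (vector nearby).
text_query = image_vectors[3] + 0.05 * rng.normal(size=16)

hits = cosine_top_k(text_query, image_vectors, k=2)
print(hits[0][0])  # → 3, the image closest to the "text" query
```

The same function works unchanged in the other direction: embed an image as the query and rank stored text vectors.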
When paired with Milvus, Gemma 4’s embeddings can be indexed and queried at scale: Milvus handles vector storage, similarity search, and filtering efficiently, while Gemma 4 supplies the semantic understanding. This combination avoids proprietary cloud databases and keeps both embedding generation and storage infrastructure under your control.