Choose the Right Analyzer for Your Use Case

This guide focuses on practical decision-making for analyzer selection. For technical details about analyzer components and how to add analyzer parameters, refer to Analyzer Overview.

Understand analyzers in 2 minutes

In Milvus, an analyzer processes the text stored in a field to make it searchable by features like full text search (BM25), phrase match, and text match. Think of it as a text processor that transforms your raw content into searchable tokens.

An analyzer works in a simple, two-stage pipeline:

Analyzer Workflow

  1. Tokenization (required): This initial stage applies a tokenizer to break a continuous string of text into discrete, meaningful units called tokens. The tokenization method can vary significantly depending on the language and content type.

  2. Token filtering (optional): After tokenization, filters are applied to modify, remove, or refine the tokens. These operations can include converting all tokens to lowercase, removing common meaningless words (such as stopwords), or reducing words to their root form (stemming).

Example:

Input: "Hello World!" 
       1. Tokenization → ["Hello", "World", "!"]
       2. Lowercase & Punctuation Filtering → ["hello", "world"]

Why the choice of analyzer matters

Choosing the wrong analyzer can make relevant documents unsearchable or return irrelevant results.

The following table summarizes common problems caused by improper analyzer selection and provides actionable solutions for diagnosing search issues.

Over-tokenization
  • Symptom: Text queries for technical terms, identifiers, or URLs fail to find relevant documents.
  • Example: "user_id" → ['user', 'id']; "C++" → ['c']
  • Cause (bad analyzer): standard analyzer
  • Solution (good analyzer): Use a whitespace tokenizer combined with an alphanumonly filter.

Under-tokenization
  • Symptom: Searching for a component of a multi-word phrase fails to return documents containing the full phrase.
  • Example: "state-of-the-art" → ['state-of-the-art']
  • Cause (bad analyzer): Analyzer with a whitespace tokenizer
  • Solution (good analyzer): Use a standard tokenizer to split on punctuation and spaces, or a custom regex filter.

Language mismatches
  • Symptom: Search results for a specific language are nonsensical or nonexistent.
  • Example: Chinese text "机器学习" → ['机器学习'] (one token)
  • Cause (bad analyzer): english analyzer
  • Solution (good analyzer): Use a language-specific analyzer, such as chinese.
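
A quick way to diagnose these symptoms is to run the same problematic inputs through the analyzer you suspect and through the proposed replacement, then compare the tokens. The following is only a sketch: it assumes a local Milvus instance at the URI shown and uses the whitespace-plus-alphanumonly combination suggested in the table above.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

problem_inputs = ["user_id", "C++", "state-of-the-art"]

# Configuration suspected of over-tokenizing technical terms
standard_params = {"type": "standard"}

# Replacement suggested above for technical identifiers
custom_params = {"tokenizer": "whitespace", "filter": ["alphanumonly"]}

for text in problem_inputs:
    print(repr(text))
    print("  standard analyzer :", client.run_analyzer(text, standard_params))
    print("  custom analyzer   :", client.run_analyzer(text, custom_params))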

First question: Do you need to choose an analyzer?

For many use cases, you don’t need to do anything special. Let’s determine if you’re one of them.

Default behavior: standard analyzer

If you don’t specify an analyzer when using text retrieval features like full text search, Milvus automatically uses the standard analyzer.

The standard analyzer:

  • Splits text on spaces and punctuation

  • Converts all tokens to lowercase

  • Removes most punctuation

Example transformation:

Input:  "The Milvus vector database is built for scale!"
Output: ['the', 'milvus', 'vector', 'database', 'is', 'built', 'for', 'scale']

Decision criteria: A quick check

Use this table to quickly determine if the default standard analyzer meets your needs. If it doesn’t, you’ll need to choose a different path.

English blog posts
  • Standard analyzer OK? ✅ Yes
  • Why: Default behavior is sufficient.
  • What you need: Use the default (no configuration needed).

Chinese documents
  • Standard analyzer OK? ❌ No
  • Why: Chinese text has no spaces, so it is treated as a single token.
  • What you need: Use the built-in chinese analyzer.

Technical documentation
  • Standard analyzer OK? ❌ No
  • Why: Punctuation is stripped from terms like C++.
  • What you need: Create a custom analyzer with a whitespace tokenizer and an alphanumonly filter.

Other space-separated languages (French, Spanish, etc.)
  • Standard analyzer OK? ⚠️ Maybe
  • Why: Accented characters (café vs. cafe) may not match.
  • What you need: A custom analyzer with the asciifolding filter is recommended for better results.

Multilingual or unknown languages
  • Standard analyzer OK? ❌ No
  • Why: The standard analyzer lacks the language-specific logic needed to handle different character sets and tokenization rules.
  • What you need: Use a custom analyzer with the icu tokenizer for Unicode-aware tokenization. Alternatively, consider configuring multi-language analyzers or a language identifier for more precise handling of multilingual content.

If the default standard analyzer cannot meet your requirements, you need to implement a different one. You have two paths:

Path A: Use built-in analyzers

Built-in analyzers are pre-configured solutions for common languages. They are the easiest way to get started when the default standard analyzer isn’t a perfect fit.

Available built-in analyzers

standard
  • Language support: Most space-separated languages (English, French, German, Spanish, etc.)
  • Components: standard tokenizer with the lowercase filter
  • Notes: General-purpose analyzer for initial text processing. For monolingual scenarios, language-specific analyzers (like english) provide better performance.

english
  • Language support: English only; applies stemming and stop-word removal for better English semantic matching
  • Components: standard tokenizer with the lowercase, stemmer, and stop filters
  • Notes: Recommended over standard for English-only content.

chinese
  • Language support: Chinese
  • Components: jieba tokenizer with the cnalphanumonly filter
  • Notes: Currently uses a Simplified Chinese dictionary by default.

Implementation example

To use a built-in analyzer, simply specify its type in the analyzer_params when defining your field schema.

from pymilvus import MilvusClient, DataType

# Using the built-in English analyzer
analyzer_params = {
    "type": "english"
}

# Applying the analyzer config to the target VARCHAR field in your collection schema
schema = MilvusClient.create_schema()
schema.add_field(
    field_name='text',
    datatype=DataType.VARCHAR,
    max_length=200,
    enable_analyzer=True,
    analyzer_params=analyzer_params,
)

For detailed usage, refer to Full Text Search, Text Match, or Phrase Match.

Path B: Create a custom analyzer

When built-in options don’t meet your needs, you can create a custom analyzer by combining a tokenizer with a set of filters. This gives you full control over the text processing pipeline.

Step 1: Select the tokenizer based on language

Choose your tokenizer based on your content’s primary language:

Western languages

For space-separated languages, you have these options:

standard
  • How it works: Splits text on spaces and punctuation marks
  • Best for: General text, mixed punctuation
  • Example: "Hello, world! Visit example.com" → ['Hello', 'world', 'Visit', 'example', 'com']

whitespace
  • How it works: Splits only on whitespace characters
  • Best for: Pre-processed content, user-formatted text
  • Example: "user_id = get_user_data()" → ['user_id', '=', 'get_user_data()']
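
To see the difference on your own data before committing to a schema, you can pass a tokenizer-only configuration to run_analyzer and compare the raw token streams; filters are omitted here and can be added once you settle on a tokenizer. A minimal sketch, assuming a local Milvus instance at the URI shown:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

text = "user_id = get_user_data()"

# Tokenizer-only configurations; no filters applied yet
for tokenizer in ["standard", "whitespace"]:
    tokens = client.run_analyzer(text, {"tokenizer": tokenizer})
    print(f"{tokenizer:>10}:", tokens)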

East Asian languages

Dictionary-based languages require specialized tokenizers for proper word segmentation:

Chinese

jieba
  • How it works: Chinese dictionary-based segmentation with intelligent algorithms
  • Best for: Recommended for Chinese content; combines a dictionary with intelligent algorithms designed specifically for Chinese
  • Example: "机器学习是人工智能的一个分支" → ['机器', '学习', '是', '人工', '智能', '人工智能', '的', '一个', '分支']

lindera
  • How it works: Pure dictionary-based morphological analysis using a Chinese dictionary (cc-cedict)
  • Best for: Processing Chinese text in a more generic manner than jieba
  • Example: "机器学习算法" → ["机器", "学习", "算法"]

Japanese and Korean

Japanese
  • Tokenizer: lindera
  • Dictionary options: ipadic (general-purpose), ipadic-neologd (modern terms), unidic (academic)
  • Best for: Morphological analysis with proper noun handling
  • Example: "東京都渋谷区" → ["東京", "都", "渋谷", "区"]

Korean
  • Tokenizer: lindera
  • Dictionary options: ko-dic
  • Best for: Korean morphological analysis
  • Example: "안녕하세요" → ["안녕", "하", "세요"]

Multilingual or unknown languages

For content where languages are unpredictable or mixed within documents:

icu
  • How it works: Unicode-aware tokenization (International Components for Unicode)
  • Best for: Mixed scripts, unknown languages, or when simple tokenization is sufficient
  • Example: "Hello 世界 مرحبا" → ['Hello', ' ', '世界', ' ', 'مرحبا']

When to use icu:

  • Mixed languages where language identification is impractical.

  • You don’t want the overhead of multi-language analyzers or the language identifier.

  • Content has a primary language with occasional foreign words that contribute little to the overall meaning (e.g., English text with sporadic brand names or technical terms in Japanese or French).

Alternative approaches: For more precise handling of multilingual content, consider using multi-language analyzers or the language identifier. For details, refer to Multi-language Analyzers or Language Identifier.

Step 2: Add filters for precision

After selecting your tokenizer, apply filters based on your specific search requirements and content characteristics.

Commonly used filters

These filters are essential for most space-separated language configurations (English, French, German, Spanish, etc.) and significantly improve search quality:

lowercase
  • How it works: Converts all tokens to lowercase
  • When to use: Universal; applies to all languages with case distinctions
  • Example: ["Apple", "iPhone"] → ['apple', 'iphone']

stemmer
  • How it works: Reduces words to their root form
  • When to use: Languages with word inflections (English, French, German, etc.)
  • Example (English): ["running", "runs", "ran"] → ['run', 'run', 'ran']

stop
  • How it works: Removes common meaningless words (stop words)
  • When to use: Most languages; particularly effective for space-separated languages
  • Example: ["the", "quick", "brown", "fox"] → ['quick', 'brown', 'fox']

For East Asian languages (Chinese, Japanese, Korean, etc.), focus on language-specific filters instead. These languages typically use different approaches for text processing and may not benefit significantly from stemming.
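
The sketch below chains the three filters from the table above in the order they are listed, which matches the built-in english analyzer configuration shown later in this guide, so you can see what each stage contributes. It assumes a local Milvus instance at the URI shown.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

text = "The quick brown foxes were running"

# Filters run in the order listed: lowercase, then stemming, then stop-word removal
analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "stemmer", "language": "english"},
        {"type": "stop", "stop_words": ["_english_"]},
    ],
}

print(client.run_analyzer(text, analyzer_params))
# Roughly expected: ['quick', 'brown', 'fox', 'run'] (stop words removed, inflections stemmed)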

Text normalization filters

These filters standardize text variations to improve matching consistency:

asciifolding
  • How it works: Converts accented characters to their ASCII equivalents
  • When to use: International content, user-generated content
  • Example: ["café", "naïve", "résumé"] → ['cafe', 'naive', 'resume']

Token filtering

Control which tokens are preserved based on character content or length:

removepunct
  • How it works: Removes standalone punctuation tokens
  • When to use: Cleaning up output from the jieba, lindera, and icu tokenizers, which return punctuation marks as standalone tokens
  • Example: ["Hello", "!", "world"] → ['Hello', 'world']

alphanumonly
  • How it works: Keeps only letters and numbers
  • When to use: Technical content, clean text processing
  • Example: ["user123", "test@email.com"] → ['user123', 'test', 'email', 'com']

length
  • How it works: Removes tokens outside a specified length range
  • When to use: Filtering noise (e.g., excessively long tokens)
  • Example (max=10): ["a", "very", "extraordinarily"] → ['a', 'very']

regex
  • How it works: Custom pattern-based filtering
  • When to use: Domain-specific token requirements
  • Example (expr="^prod"): ["test123", "prod456"] → ['prod456']
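
Filters that take parameters, such as length and regex above, are written as dictionaries rather than plain strings in analyzer_params. The sketch below uses the max=10 and expr="^prod" settings from the table; the values are illustrative only, and it assumes a local Milvus instance at the URI shown.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

analyzer_params = {
    "tokenizer": "whitespace",
    "filter": [
        "lowercase",                          # simple filters are plain strings
        {"type": "length", "min": 1, "max": 10},  # parameterized filters are dicts
        {"type": "regex", "expr": "^prod"},       # keep only tokens matching the pattern
    ],
}

print(client.run_analyzer("prod456 test123 extraordinarily", analyzer_params))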

Language-specific filters

These filters handle specific language characteristics:

decompounder
  • Language: German
  • How it works: Splits compound words into searchable components
  • Example: ["dampfschifffahrt"] → ['dampf', 'schiff', 'fahrt']

cnalphanumonly
  • Language: Chinese
  • How it works: Keeps tokens containing only Chinese characters, letters, and digits
  • Example: ["Hello", "世界", "123", "!@#"] → ['Hello', '世界', '123']

cncharonly
  • Language: Chinese
  • How it works: Keeps only tokens consisting of Chinese characters
  • Example: ["Hello", "世界", "123"] → ['世界']
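
For German compounds, the decompounder filter needs the list of component words it should split on. Below is a minimal sketch; the word_list parameter name and the sample components are assumptions for illustration, so check the decompounder filter reference for the exact form. A local Milvus instance at the URI shown is also assumed.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

# German: lowercase first, then split compounds into searchable parts
analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "decompounder",
            # Component dictionary (parameter name assumed; see the filter reference)
            "word_list": ["dampf", "schiff", "fahrt"],
        },
    ],
}

print(client.run_analyzer("Dampfschifffahrt", analyzer_params))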

Step 3: Combine and implement

To create your custom analyzer, you define the tokenizer and a list of filters in the analyzer_params dictionary. The filters are applied in the order they are listed.

# Example: A custom analyzer for technical content
analyzer_params = {
    "tokenizer": "whitespace",
    "filter": ["lowercase", "alphanumonly"]
}

# Applying the analyzer config to the target VARCHAR field in your collection schema
schema.add_field(
    field_name='text',
    datatype=DataType.VARCHAR,
    max_length=200,
    enable_analyzer=True,
    analyzer_params=analyzer_params,
)

Final: Test with run_analyzer

Always validate your configuration before applying it to a collection:

from pymilvus import MilvusClient

# Connect to your Milvus instance (adjust the URI to your deployment)
client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "The Milvus vector database is built for scale!"

# Run analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

Common issues to check:

  • Over-tokenization: Technical terms being split incorrectly

  • Under-tokenization: Phrases not being separated properly

  • Missing tokens: Important terms being filtered out

For detailed usage, refer to run_analyzer.

Recommended configurations by language

This section provides recommended tokenizer and filter configurations for common use cases when working with analyzers in Milvus. Choose the combination that best matches your content type and search requirements.

Before applying an analyzer to your collection, we recommend you use run_analyzer to test and validate text analysis performance.

Languages with accent marks (French, Spanish, German, etc.)

Use a standard tokenizer with lowercase conversion, language-specific stemming, and stopword removal. This configuration also works for other European languages by modifying the language and stop_words parameters.

# French example
analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase", 
        "asciifolding",  # Handle accent marks
        {
            "type": "stemmer",
            "language": "french"
        },
        {
            "type": "stop",
            "stop_words": ["_french_"]
        }
    ]
}

# For other languages, modify the language parameter:
# "language": "spanish" for Spanish
# "language": "german" for German
# "stop_words": ["_spanish_"] or ["_german_"] accordingly

English content

For English text, combine the standard tokenizer with lowercase conversion, stemming, and stop-word removal. The built-in english analyzer is an equivalent shortcut:

analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "stemmer",
            "language": "english"
        },
        {
            "type": "stop",
            "stop_words": ["_english_"]
        }
    ]
}

# Equivalent built-in shortcut:
analyzer_params = {
    "type": "english"
}

Chinese content

Use the jieba tokenizer and apply a character filter to retain only Chinese characters, Latin letters, and digits.

analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}

# Equivalent built-in shortcut:
analyzer_params = {
    "type": "chinese"
}

For Simplified Chinese, cnalphanumonly removes all tokens except those made of Chinese characters, letters, and digits. This prevents punctuation from affecting search quality.

Japanese content

Use the lindera tokenizer with Japanese dictionary and filters to clean punctuation and control token length:

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict": "ipadic"  # Options: ipadic, ipadic-neologd, unidic
    },
    "filter": [
        "removepunct",  # Remove standalone punctuation
        {
            "type": "length",
            "min": 1,
            "max": 20
        }
    ]
}

Korean content

Similar to Japanese, using lindera tokenizer with Korean dictionary:

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict": "ko-dic"
    },
    "filter": [
        "removepunct",
        {
            "type": "length",
            "min": 1,
            "max": 20
        }
    ]
}

Mixed or multilingual content

When working with content that spans multiple languages or uses scripts unpredictably, start with the icu tokenizer. This Unicode-aware tokenizer handles mixed scripts and symbols effectively.

Basic multilingual configuration (no stemming):

analyzer_params = {
    "tokenizer": "icu",
    "filter": ["lowercase", "asciifolding"]
}

Advanced multilingual processing: for finer control over token behavior across different languages, consider multi-language analyzers or the language identifier. For details, refer to Multi-language Analyzers or Language Identifier.

Integrate with text retrieval features

After selecting your analyzer, you can integrate it with text retrieval features provided by Milvus.

  • Full text search

    Analyzers directly impact BM25-based full text search through sparse vector generation. Use the same analyzer for both indexing and querying to ensure consistent tokenization. Language-specific analyzers generally provide better BM25 scoring than generic ones. For implementation details, refer to Full Text Search; a minimal schema sketch follows this list.

  • Text match

    Text match operations perform exact token matching between queries and indexed content based on your analyzer output. For implementation details, refer to Text Match.

  • Phrase match

    Phrase match requires consistent tokenization across multi-word expressions to maintain phrase boundaries and meaning. For implementation details, refer to Phrase Match.
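
As a concrete example of the full text search integration described in the first bullet above, the sketch below attaches an english analyzer to a VARCHAR field and adds a BM25 function that generates the sparse vectors Milvus searches against. It is a minimal outline following the Full Text Search guide; the field and function names are placeholders, and index creation, data insertion, and querying are omitted.

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")  # assumed local instance

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)

# The analyzer chosen in this guide is attached to the raw-text field
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params={"type": "english"},
)

# Sparse vector field populated automatically by the BM25 function at insert time
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

schema.add_function(Function(
    name="text_bm25",
    input_field_names=["text"],
    output_field_names=["sparse"],
    function_type=FunctionType.BM25,
))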