Jieba

The jieba tokenizer processes Chinese text by breaking it down into its component words.

The jieba tokenizer preserves punctuation marks as separate tokens in the output. For example, "你好！世界。" becomes ["你好", "！", "世界", "。"]. To remove these standalone punctuation tokens, use the removepunct filter.
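As a minimal sketch of the behavior described above, the configuration below chains the removepunct filter after the jieba tokenizer. The `"filter": ["removepunct"]` form is assumed from the filter name mentioned here. The `drop_punctuation_tokens` helper is hypothetical and only mimics the effect locally on the sample tokens; it is not Milvus's implementation.

```python
import unicodedata

# Assumed analyzer configuration: jieba tokenizer followed by the removepunct filter.
analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["removepunct"],
}


def drop_punctuation_tokens(tokens):
    """Local illustration only: drop tokens made up entirely of punctuation."""
    return [
        t for t in tokens
        if not all(unicodedata.category(ch).startswith("P") for ch in t)
    ]


# Tokens as produced by the jieba tokenizer for "你好！世界。"
tokens = ["你好", "！", "世界", "。"]
print(drop_punctuation_tokens(tokens))  # ['你好', '世界']
```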

Configuration

Milvus supports two configuration approaches for the jieba tokenizer: a simple configuration and a custom configuration.

Simple configuration

With the simple configuration, you only need to set the tokenizer to "jieba". For example:

Python Java NodeJS Go cURL

# Simple configuration: only specifying the tokenizer name
analyzer_params = {
    "tokenizer": "jieba",  # Use the default settings: dict=["_default_"], mode="search", hmm=True
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");

const analyzer_params = {
    "tokenizer": "jieba",
};

analyzerParams = map[string]any{"tokenizer": "jieba"}

# restful
analyzerParams='{
  "tokenizer": "jieba"
}'

This simple configuration is equivalent to the following custom configuration:

Python Java NodeJS Go cURL

# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba",          # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"],    # Use the default dictionary
        "mode": "search",         # Use search mode for improved recall (see mode details below)
        "hmm": True               # Enable HMM for probabilistic segmentation
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Collections.singletonList("_default_"));
  put("mode", "search");
  put("hmm", true);
}});

const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],
        "mode": "search",
        "hmm": true
    }
};

analyzerParams = map[string]any{
    "tokenizer": map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true},
}

# restful
analyzerParams='{
  "tokenizer": {
    "type": "jieba",
    "dict": ["_default_"],
    "mode": "search",
    "hmm": true
  }
}'

For details on parameters, refer to Custom configuration.

Custom configuration

For more control, you can provide a custom configuration that allows you to specify a custom dictionary, select the segmentation mode, and enable or disable the Hidden Markov Model (HMM). For example:

Python Java NodeJS Go cURL

# Custom configuration with user-defined settings
analyzer_params = {
    "tokenizer": {
        "type": "jieba",           # Fixed tokenizer type
        "dict": ["customDictionary"],  # Custom dictionary list; replace with your own terms
        "mode": "exact",           # Use exact mode (non-overlapping tokens)
        "hmm": False               # Disable HMM; unmatched text will be split into individual characters
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("customDictionary"));
  put("mode", "exact");
  put("hmm", false);
}});

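const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["customDictionary"],
        "mode": "exact",
        "hmm": false
    }
};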

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"customDictionary"},
      "mode": "exact",
      "hmm":  false,
  },
}

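# restful
analyzerParams='{
  "tokenizer": {
    "type": "jieba",
    "dict": ["customDictionary"],
    "mode": "exact",
    "hmm": false
  }
}'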

Parameter	Description	Default Value
`type`	The type of tokenizer. This is fixed to `"jieba"`.	`"jieba"`
`dict`	A list of dictionaries that the analyzer will load as its vocabulary source. Built-in options: `"_default_"`: Loads the engine's built‑in Simplified‑Chinese dictionary. For details, refer to dict.txt. `"_extend_default_"`: Loads everything in `"_default_"` plus an additional Traditional‑Chinese supplement. For details, refer to dict.txt.big. You can also mix the built‑in dictionary with any number of custom dictionaries. Example: `["_default_", "结巴分词器"]`.	`["_default_"]`
`mode`	The segmentation mode. Possible values: `"exact"`: Tries to segment the sentence in the most precise manner, making it ideal for text analysis. `"search"`: Builds on exact mode by further breaking down long words to improve recall, making it suitable for search engine tokenization. For more information, refer to Jieba GitHub Project.	`"search"`
`hmm`	A boolean flag indicating whether to enable the Hidden Markov Model (HMM) for probabilistic segmentation of words not found in the dictionary.	`true`
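
The parameters in the table can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `jieba_analyzer_params` (not part of pymilvus) that applies the documented defaults and rejects values the table does not list.

```python
VALID_MODES = {"exact", "search"}


def jieba_analyzer_params(dicts=None, mode="search", hmm=True):
    """Hypothetical helper: build analyzer_params for the jieba tokenizer."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    if not isinstance(hmm, bool):
        raise TypeError("hmm must be a bool")
    return {
        "tokenizer": {
            "type": "jieba",
            # Fall back to the built-in dictionary when no list is given.
            "dict": list(dicts) if dicts is not None else ["_default_"],
            "mode": mode,
            "hmm": hmm,
        }
    }


# Defaults reproduce the simple configuration.
print(jieba_analyzer_params())

# Built-in dictionary mixed with a custom term, exact mode, HMM disabled.
print(jieba_analyzer_params(dicts=["_default_", "结巴分词器"], mode="exact", hmm=False))
```

Validating `mode` on the client side is a design choice of this sketch; it surfaces a typo before the configuration is sent to the server.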

To load a large custom vocabulary from an external file instead of inlining it via dict, see Custom configuration with a dictionary file below.

After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. Milvus then tokenizes and filters the text in that field with the specified analyzer. For details, refer to Examples.

Custom configuration with a dictionary file

Compatible with Milvus 3.0.x

For large custom vocabularies — domain glossaries, product terminology, or proper-noun lists — store the words in a file and register the file as a remote file resource, then reference it from the tokenizer via the extra_dict_file parameter. The analyzer loads these words into its vocabulary on top of the built-in dictionary.

The file is plain UTF‑8 text with one term per line. For example:

结巴分词器
向量数据库
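
A minimal sketch of producing such a file from a Python list, assuming a hypothetical helper `write_dict_file` and a temporary output directory. It writes one term per line as UTF-8 and skips blank entries and duplicates.

```python
from pathlib import Path
import tempfile


def write_dict_file(path, terms):
    """Write one term per line as UTF-8, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            lines.append(term)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# The file name zh_terms.txt matches the example used on this page.
out_dir = Path(tempfile.mkdtemp())
dict_path = out_dir / "zh_terms.txt"
written = write_dict_file(dict_path, ["结巴分词器", "向量数据库", " ", "结巴分词器"])
print(written)  # ['结巴分词器', '向量数据库']
```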

Upload the file to the object store that your Milvus cluster is configured to use, then register it:

Python Java NodeJS Go cURL

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Register the uploaded file under a name you'll reference from analyzer configs.
client.add_file_resource(
    name="zh_terms",
    path="file/zh_terms.txt",    # full S3 object key, including the rootPath prefix
)


Reference the registered resource in the tokenizer via extra_dict_file:

Python Java NodeJS Go cURL

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],             # keep the built-in dictionary
        "mode": "exact",
        "hmm": False,
        "extra_dict_file": {
            "type": "remote",
            "resource_name": "zh_terms",
            "file_name": "zh_terms.txt",
        },
    },
}

client.run_analyzer(["milvus结巴分词器中文测试"], analyzer_params)
# → [['milvus', '结巴', '分词器', '中文', '测试']]


The extra_dict_file parameter accepts an object with the following fields:

Field	Description
`type`	The resource type. Use `"remote"` for a file registered via `add_file_resource`. For the `"local"` variant used in self-hosted deployments, refer to Manage File Resources.
`resource_name`	The name used when the file was registered with `add_file_resource`.
`file_name`	The filename portion of the registered resource's object-store path (for example, `"zh_terms.txt"` if the resource was registered with `path="file/zh_terms.txt"`).
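
To illustrate how the three fields relate, the sketch below derives the `extra_dict_file` object from the name and path used at registration time. The helper `extra_dict_file_for` is hypothetical, and it assumes the object-store path uses forward slashes.

```python
from pathlib import PurePosixPath


def extra_dict_file_for(resource_name, object_path):
    """Hypothetical helper: derive the extra_dict_file object for a remote resource."""
    return {
        "type": "remote",
        "resource_name": resource_name,
        # file_name is the last component of the registered object-store path.
        "file_name": PurePosixPath(object_path).name,
    }


print(extra_dict_file_for("zh_terms", "file/zh_terms.txt"))
# {'type': 'remote', 'resource_name': 'zh_terms', 'file_name': 'zh_terms.txt'}
```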

Words added via extra_dict_file are merged with the built-in dictionary, so jieba’s segmentation algorithm sees them alongside existing entries. Whether any specific term surfaces as a standalone token depends on jieba’s probability-weighted DAG selection — a long custom term such as 向量数据库 may still be split into 向量 + 数据库 if those shorter entries have higher frequencies in the built-in dictionary.
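
The effect of word frequencies can be sketched with a toy maximum-probability segmenter. This is a simplified model written for illustration, not jieba's actual code, and the frequencies are made up. It picks the split whose words have the highest combined log-probability, which is why a rare long term can lose to two frequent shorter ones.

```python
import math


def segment(text, freq):
    """Toy segmenter: choose the split with the highest total log-probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of the first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get frequency 1 so a path always exists.
            f = freq.get(word, 1 if j == i + 1 else 0)
            if f > 0:
                score = math.log(f) - math.log(total) + best[j][0]
                candidates.append((score, j))
        best[i] = max(candidates)
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(text[i:j])
        i = j
    return tokens


# Rare custom term versus two frequent built-in words: the term is split.
print(segment("向量数据库", {"向量": 5000, "数据库": 5000, "向量数据库": 1}))
# ['向量', '数据库']

# With a comparable frequency, the long term wins.
print(segment("向量数据库", {"向量": 5000, "数据库": 5000, "向量数据库": 5000}))
# ['向量数据库']
```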

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.

Analyzer configuration

Python Java NodeJS Go cURL

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("结巴分词器"));
  put("mode", "exact");
  put("hmm", false);
}});

const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": false
    }
};

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"结巴分词器"},
      "mode": "exact",
      "hmm":  false,
  },
}

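# restful
analyzerParams='{
  "tokenizer": {
    "type": "jieba",
    "dict": ["结巴分词器"],
    "mode": "exact",
    "hmm": false
  }
}'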

Verification using `run_analyzer`

Python Java NodeJS Go cURL

from pymilvus import (
    MilvusClient,
)

client = MilvusClient(
    uri="http://localhost:19530",
    token="root:Milvus"
)

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the jieba analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Jieba analyzer output:", result)

import java.util.ArrayList;
import java.util.List;

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;

ConnectConfig config = ConnectConfig.builder()
        .uri("http://localhost:19530")
        .token("root:Milvus")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

List<String> texts = new ArrayList<>();
texts.add("milvus结巴分词器中文测试");

RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();


import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

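// Create the context used by the client calls below
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{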
    Address: "localhost:19530",
    APIKey:  "root:Milvus",
})
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

bs, _ := json.Marshal(analyzerParams)
texts := []string{"milvus结巴分词器中文测试"}
option := milvusclient.NewRunAnalyzerOption(texts).
    WithAnalyzerParams(string(bs))

result, err := client.RunAnalyzer(ctx, option)
if err != nil {
    fmt.Println(err.Error())
    // handle error
}


Expected output

['milvus', '结巴分词器', '中', '文', '测', '试']
