Jieba
The jieba tokenizer processes Chinese text by breaking it down into its component words.
Configuration
Milvus supports two configuration approaches for the jieba tokenizer: a simple configuration and a custom configuration.
Simple configuration
With the simple configuration, you only need to set the tokenizer to "jieba". For example:
# Simple configuration: only specifying the tokenizer name
analyzer_params = {
"tokenizer": "jieba", # Use the default settings: dict=["_default_"], mode="search", hmm=true
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");
const analyzer_params = {
"tokenizer": "jieba",
};
analyzerParams = map[string]any{"tokenizer": "jieba"}
# restful
analyzerParams='{
"tokenizer": "jieba"
}'
This simple configuration is equivalent to the following custom configuration:
# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba",        # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"],  # Use the default dictionary
        "mode": "search",       # Use search mode for improved recall (see mode details below)
        "hmm": True             # Enable HMM for probabilistic segmentation
    }
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "jieba");
analyzerParams.put("dict", Collections.singletonList("_default_"));
analyzerParams.put("mode", "search");
analyzerParams.put("hmm", true);
// javascript
analyzerParams = map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true}
# restful
For details on parameters, refer to Custom configuration.
Custom configuration
For more control, you can provide a custom configuration that allows you to specify a custom dictionary, select the segmentation mode, and enable or disable the Hidden Markov Model (HMM). For example:
# Custom configuration with user-defined settings
analyzer_params = {
    "tokenizer": {
        "type": "jieba",               # Fixed tokenizer type
        "dict": ["customDictionary"],  # Custom dictionary list; replace with your own terms
        "mode": "exact",               # Use exact mode (non-overlapping tokens)
        "hmm": False                   # Disable HMM; unmatched text will be split into individual characters
    }
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "jieba");
analyzerParams.put("dict", Collections.singletonList("customDictionary"));
analyzerParams.put("mode", "exact");
analyzerParams.put("hmm", false);
// javascript
analyzerParams = map[string]any{"type": "jieba", "dict": []any{"customDictionary"}, "mode": "exact", "hmm": false}
# restful
| Parameter | Description | Default Value |
|---|---|---|
| type | The type of tokenizer. This is fixed to "jieba". | "jieba" |
| dict | A list of dictionaries that the analyzer will load as its vocabulary source. The built-in option "_default_" loads jieba's default Chinese dictionary; you can also supply your own terms. | ["_default_"] |
| mode | The segmentation mode. Possible values: "exact" (segments text into non-overlapping tokens) and "search" (additionally emits overlapping sub-words for improved recall). | "search" |
| hmm | A boolean flag indicating whether to enable the Hidden Markov Model (HMM) for probabilistic segmentation of words not found in the dictionary. | true |
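If you are unsure which segmentation mode suits your data, you can run the same text through both modes and compare the tokens. The sketch below is illustrative only; the sample sentence and connection URI are placeholders, and it reuses the run_analyzer method shown later on this page.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

sample_text = "中文分词测试"  # placeholder sentence for illustration

for mode in ["exact", "search"]:
    params = {
        "tokenizer": {
            "type": "jieba",
            "dict": ["_default_"],
            "mode": mode,
            "hmm": True,
        }
    }
    # "search" mode may emit extra overlapping sub-words compared with "exact"
    print(mode, client.run_analyzer(sample_text, params))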
After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This allows Milvus to process the text in that field using the specified analyzer for efficient tokenization and filtering. For details, refer to Example use.
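As a minimal sketch of that step (the collection name, field names, dimension, and max_length are placeholders, and it assumes the analyzer_params defined above), attaching the analyzer to a VARCHAR field with the Python client might look like this:

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=8)  # placeholder vector field
# Enable the analyzer on the text field and attach the jieba configuration
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params=analyzer_params,
)

client.create_collection(collection_name="demo_jieba", schema=schema)  # placeholder collection name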
Examples
Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.
Analyzer configuration
analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "jieba");
analyzerParams.put("dict", Collections.singletonList("结巴分词器"));
analyzerParams.put("mode", "exact");
analyzerParams.put("hmm", false);
// javascript
analyzerParams = map[string]any{"type": "jieba", "dict": []any{"结巴分词器"}, "mode": "exact", "hmm": false}
# restful
Verification using run_analyzer
from pymilvus import (
    MilvusClient,
)

client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the jieba analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Jieba analyzer output:", result)
import java.util.ArrayList;
import java.util.List;

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;
ConnectConfig config = ConnectConfig.builder()
        .uri("http://localhost:19530")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

List<String> texts = new ArrayList<>();
texts.add("milvus结巴分词器中文测试");

RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();
// javascript
// go
# restful
Expected output
['milvus', '结巴分词器', '中', '文', '测', '试']

Because the custom dictionary contains only 结巴分词器 and hmm is disabled, the remaining text 中文测试 falls back to single-character tokens.