Chinese
The chinese analyzer is designed specifically to handle Chinese text, providing effective segmentation and tokenization.
Definition
The chinese analyzer consists of:

- Tokenizer: Uses the jieba tokenizer to segment Chinese text into tokens based on vocabulary and context. For more information, refer to Jieba.
- Filter: Uses the cnalphanumonly filter to remove tokens that contain any characters other than Chinese characters, English letters, or digits. For more information, refer to Cnalphanumonly.
The functionality of the chinese analyzer is equivalent to the following custom analyzer configuration:
analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}
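To make the filter's behavior concrete, here is a minimal sketch in plain Python that approximates what cnalphanumonly does. This is an illustration, not the actual Milvus implementation; the CJK character range used (U+4E00 to U+9FFF) is an assumption covering the common Unified Ideographs block.

```python
import re

# Approximation of the cnalphanumonly filter: keep only tokens made up
# entirely of Chinese characters, English letters, or digits.
# The CJK range below is an assumption, not taken from the Milvus source.
CN_ALNUM = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")

def cn_alphanum_only(tokens):
    # Drop any token containing a character outside the allowed set,
    # such as punctuation or whitespace.
    return [t for t in tokens if CN_ALNUM.fullmatch(t)]

print(cn_alphanum_only(["Milvus", "是", "！", "高性能"]))
# → ['Milvus', '是', '高性能']
```

Note that "Milvus" survives the filter: tokens made of English letters or digits are kept, which is why the English word appears in the example output at the end of this page.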
Configuration
To apply the chinese analyzer to a field, simply set type to chinese in analyzer_params.
analyzer_params = {
    "type": "chinese",
}
The chinese analyzer does not accept any optional parameters.
Example output
Here’s how the chinese analyzer processes text.
Original text:
"Milvus 是一个高性能、可扩展的向量数据库!"
Expected output:
["Milvus", "是", "一个", "高性", "性能", "高性能", "可", "扩展", "的", "向量", "数据", "据库", "数据库"]
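The overlapping tokens in the output ("高性", "性能", and "高性能") come from jieba's search-engine-style segmentation, which also emits shorter in-vocabulary subwords of longer words to improve recall. The toy sketch below illustrates that idea with a hand-built vocabulary; it is an assumption about the general technique, not jieba's actual algorithm.

```python
# Toy illustration of search-mode expansion (not jieba itself):
# a hand-built vocabulary standing in for jieba's dictionary.
VOCAB = {"高性", "性能", "高性能", "数据", "据库", "数据库"}

def search_mode_expand(word):
    """Emit in-vocabulary 2-character subwords of a long word, then the word."""
    out = []
    if len(word) > 2:
        for i in range(len(word) - 1):
            sub = word[i:i + 2]
            if sub in VOCAB:
                out.append(sub)
    out.append(word)
    return out

print(search_mode_expand("高性能"))  # → ['高性', '性能', '高性能']
print(search_mode_expand("数据库"))  # → ['数据', '据库', '数据库']
```

This mirrors the expected output above, where each three-character word is preceded by its two-character subwords.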