Jieba

The jieba tokenizer processes Chinese text by breaking it down into its component words.

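The jieba tokenizer preserves punctuation marks as separate tokens in the output. For example, "你好！世界。" becomes ["你好", "！", "世界", "。"]. To remove these standalone punctuation tokens, use the removepunct filter.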
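The effect of dropping punctuation tokens can be previewed locally. The sketch below pairs an analyzer configuration that chains the removepunct filter after the jieba tokenizer with a small pure-Python stand-in; `drop_punctuation_tokens` is a hypothetical helper written for illustration only, not part of Milvus, and it assumes "punctuation" means Unicode category P.

```python
import unicodedata

# Analyzer configuration: jieba tokenizer followed by the removepunct filter.
analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["removepunct"],
}


def drop_punctuation_tokens(tokens):
    """Hypothetical local stand-in: drop tokens made up only of punctuation."""
    return [
        token
        for token in tokens
        if not all(unicodedata.category(ch).startswith("P") for ch in token)
    ]


tokens = ["你好", "！", "世界", "。"]
filtered = drop_punctuation_tokens(tokens)
print(filtered)  # ['你好', '世界']
```

The server-side filter may classify characters differently, so treat the helper as a way to reason about the expected shape of the output rather than as a reference implementation.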

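Configuration

Milvus supports two ways to configure the jieba tokenizer: a simple configuration and a custom configuration.

Simple configuration

With the simple configuration, you only need to set the tokenizer to "jieba". For example: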

Python Java NodeJS Go cURL

# Simple configuration: only specifying the tokenizer name
analyzer_params = {
    "tokenizer": "jieba",  # Use the default settings: dict=["_default_"], mode="search", hmm=True
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");

const analyzer_params = {
    "tokenizer": "jieba",
};

analyzerParams := map[string]any{"tokenizer": "jieba"}

# restful
analyzerParams='{
  "tokenizer": "jieba"
}'

This simple configuration is equivalent to the following custom configuration:

Python Java Go

# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba",          # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"],    # Use the default dictionary
        "mode": "search",         # Use search mode for improved recall (see mode details below)
        "hmm": True               # Enable HMM for probabilistic segmentation
    }
}

Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("_default_"));
tokenizer.put("mode", "search");
tokenizer.put("hmm", true);
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", tokenizer);

analyzerParams := map[string]any{
    "tokenizer": map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true},
}

For details on the parameters, see Custom configuration.
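The equivalence between the two forms can be made concrete with a small helper that spells out the defaults. This is a minimal sketch: `expand_tokenizer` and `DEFAULT_JIEBA` are hypothetical names introduced here, and the default values are the ones listed in the parameter table (dict=["_default_"], mode="search", hmm=True). Milvus applies these defaults itself, so the helper is only useful for inspecting or logging the effective configuration.

```python
import copy

# Defaults documented for the jieba tokenizer.
DEFAULT_JIEBA = {"type": "jieba", "dict": ["_default_"], "mode": "search", "hmm": True}


def expand_tokenizer(analyzer_params):
    """Return a copy of analyzer_params with the jieba tokenizer written out in full."""
    tokenizer = analyzer_params.get("tokenizer")
    if tokenizer == "jieba":
        overrides = {}
    elif isinstance(tokenizer, dict) and tokenizer.get("type") == "jieba":
        overrides = tokenizer
    else:
        raise ValueError("expected a jieba tokenizer configuration")
    merged = copy.deepcopy(DEFAULT_JIEBA)
    merged.update(overrides)  # user-supplied settings win over defaults
    expanded = dict(analyzer_params)
    expanded["tokenizer"] = merged
    return expanded


simple = {"tokenizer": "jieba"}
print(expand_tokenizer(simple))
# {'tokenizer': {'type': 'jieba', 'dict': ['_default_'], 'mode': 'search', 'hmm': True}}
```

Passing a partial custom configuration, for example `{"tokenizer": {"type": "jieba", "mode": "exact"}}`, keeps the explicit `mode` and fills in the remaining defaults.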

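Custom configuration

For more control, you can provide a custom configuration that lets you specify custom dictionaries, select the segmentation mode, and enable or disable the Hidden Markov Model (HMM). For example: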

Python Java Go

# Custom configuration with user-defined settings
analyzer_params = {
    "tokenizer": {
        "type": "jieba",           # Fixed tokenizer type
        "dict": ["customDictionary"],  # Custom dictionary list; replace with your own terms
        "mode": "exact",           # Use exact mode (non-overlapping tokens)
        "hmm": False               # Disable HMM; unmatched text will be split into individual characters
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("customDictionary"));
  put("mode", "exact");
  put("hmm", false);
}});

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"customDictionary"},
      "mode": "exact",
      "hmm":  false,
  },
}

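Parameter	Description	Default value
`type`	The type of tokenizer. Fixed to `"jieba"`.	`"jieba"`
`dict`	A list of dictionaries that the analyzer loads as its vocabulary source. Built-in options: `"_default_"`: loads the engine's built-in Simplified Chinese dictionary. For details, see dict.txt. `"_extend_default_"`: loads everything in `"_default_"` plus an additional Traditional Chinese supplement. For details, see dict.txt.big. You can also mix a built-in dictionary with any number of custom terms. Example: `["_default_", "结巴分词器"]`.	`["_default_"]`
`mode`	The segmentation mode. Possible values: `"exact"`: tries to segment the sentence in the most precise way, which makes it well suited to text analysis. `"search"`: builds on exact mode by further splitting long words to improve recall, which makes it suitable for search-engine tokenization. For details, see the Jieba GitHub project.	`"search"`
`hmm`	A boolean flag that indicates whether to enable the Hidden Markov Model (HMM) for probabilistic segmentation of words not found in the dictionary.	`true`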
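To make the difference between the two modes concrete, the sketch below mirrors how the reference Python jieba library expands exact-mode words in search mode: for each word it also emits any 2-character and 3-character substrings that are dictionary entries. The function name and the toy dictionary are invented for illustration, and Milvus's own implementation is assumed, not confirmed, to behave the same way in every detail.

```python
def search_mode_tokens(words, dictionary):
    """Expand exact-mode words with their dictionary-backed 2-grams and 3-grams."""
    tokens = []
    for word in words:
        # 2-grams are tried for words longer than 2 characters,
        # 3-grams for words longer than 3; the full word always comes last.
        for size in (2, 3):
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    gram = word[start:start + size]
                    if gram in dictionary:
                        tokens.append(gram)
        tokens.append(word)
    return tokens


# Toy dictionary and an exact-mode result, both made up for this example.
toy_dictionary = {"中国", "科学", "学院", "科学院", "中国科学院"}
exact = ["中国科学院"]
print(search_mode_tokens(exact, toy_dictionary))
# ['中国', '科学', '学院', '科学院', '中国科学院']
```

Search mode therefore produces overlapping tokens, which helps a query for 科学院 match a document that contains 中国科学院, at the cost of a larger index.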

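To load a large custom vocabulary from an external file instead of inlining it through dict, see Custom configuration with a dictionary file below.

After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. Milvus then uses the specified analyzer to process the text in that field for efficient tokenization and filtering. For details, see Examples.

Custom configuration with a dictionary file (compatible with Milvus 3.0.x)

For large custom vocabularies, such as domain terminology, product names, or proper-noun lists, store the terms in a file, register the file as a remote file resource, and then reference it from the tokenizer through the extra_dict_file parameter. The analyzer loads these terms on top of the built-in dictionary's vocabulary.

The file is plain UTF-8 text with one term per line. For example: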

结巴分词器
向量数据库
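A dictionary file in this format can be generated from a list of terms. The following is a minimal sketch; `write_dictionary` is a hypothetical helper, and the temporary directory is used only so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, terms):
    """Write one term per line as UTF-8, skipping blank entries and duplicates."""
    unique_terms = []
    for term in terms:
        term = term.strip()
        if term and term not in unique_terms:
            unique_terms.append(term)
    Path(path).write_text("\n".join(unique_terms) + "\n", encoding="utf-8")
    return unique_terms


terms = ["结巴分词器", "向量数据库"]
dict_path = Path(tempfile.mkdtemp()) / "zh_terms.txt"
written = write_dictionary(dict_path, terms)
print(dict_path.read_text(encoding="utf-8"))
```

Writing with an explicit `encoding="utf-8"` matters on platforms whose default encoding is not UTF-8.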

Upload the file to the object storage that your Milvus cluster is configured to use, then register it:

Python

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Register the uploaded file under a name you'll reference from analyzer configs.
client.add_file_resource(
    name="zh_terms",
    path="file/zh_terms.txt",    # full S3 object key, including the rootPath prefix
)

Reference the registered resource from the tokenizer through extra_dict_file:

Python

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],             # keep the built-in dictionary
        "mode": "exact",
        "hmm": False,
        "extra_dict_file": {
            "type": "remote",
            "resource_name": "zh_terms",
            "file_name": "zh_terms.txt",
        },
    },
}

client.run_analyzer(["milvus结巴分词器中文测试"], analyzer_params)
# → [['milvus', '结巴', '分词器', '中文', '测试']]

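The extra_dict_file parameter accepts an object with the following fields:

Field	Description
`type`	The resource type. For files registered through `add_file_resource`, use `"remote"`. For the `"local"` variant used in self-hosted deployments, see Manage file resources.
`resource_name`	The name used when registering the file with `add_file_resource`.
`file_name`	The file-name portion of the registered resource's object storage path (for example, `"zh_terms.txt"` if the resource was registered with `path="file/zh_terms.txt"`).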
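Because `file_name` is derived from the registered path, it can be computed instead of typed by hand. The helper below is a hypothetical sketch, not a pymilvus function; it assumes object keys use forward slashes.

```python
import posixpath


def extra_dict_file_for(resource_name, object_path):
    """Build the extra_dict_file object for a resource registered at object_path."""
    file_name = posixpath.basename(object_path)
    if not file_name:
        raise ValueError("object_path must end with a file name")
    return {
        "type": "remote",
        "resource_name": resource_name,
        "file_name": file_name,
    }


print(extra_dict_file_for("zh_terms", "file/zh_terms.txt"))
# {'type': 'remote', 'resource_name': 'zh_terms', 'file_name': 'zh_terms.txt'}
```

The returned dictionary can be placed under the `"extra_dict_file"` key of the tokenizer configuration.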

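Terms added through extra_dict_file are merged with the built-in dictionary, so jieba's segmentation algorithm considers them alongside the existing entries. Whether a particular term surfaces as a standalone token depends on jieba's probability-weighted DAG selection: if shorter entries have higher frequencies in the built-in dictionary, a long custom term such as 向量数据库 may still be split into 向量 + 数据库.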
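The frequency effect described above can be reproduced with a toy maximum-probability segmenter. This is a simplified sketch of the general technique (dynamic programming over all dictionary-backed splits, scoring each word by the log of its relative frequency), not the code Milvus runs, and every frequency below is invented for illustration.

```python
import math


def segment(text, freq):
    """Return the split of text with the highest total log-probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters are always allowed, with a count of 1.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(text[i:j])
        i = j
    return tokens


# The custom term has a low count, so two frequent short words win.
low = {"向量": 1000, "数据库": 1000, "向量数据库": 1}
print(segment("向量数据库", low))   # ['向量', '数据库']

# With a much higher count, the long term is kept whole.
high = {"向量": 1000, "数据库": 1000, "向量数据库": 100000}
print(segment("向量数据库", high))  # ['向量数据库']
```

In the first case the product of two high-probability words beats the single low-probability entry, which is why a term added to the dictionary is not guaranteed to come out as one token.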

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior with the run_analyzer method.

Analyzer configuration

Python Java Go

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("结巴分词器"));
  put("mode", "exact");
  put("hmm", false);
}});

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"结巴分词器"},
      "mode": "exact",
      "hmm":  false,
  },
}

Verification using `run_analyzer`

Python Java Go

from pymilvus import MilvusClient

client = MilvusClient(
    uri="http://localhost:19530",
    token="root:Milvus"
)

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the jieba analyzer with the configuration defined above
result = client.run_analyzer(sample_text, analyzer_params)
print("Jieba analyzer output:", result)

import java.util.ArrayList;
import java.util.List;

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;

ConnectConfig config = ConnectConfig.builder()
        .uri("http://localhost:19530")
        .token("root:Milvus")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

// Sample text to analyze
List<String> texts = new ArrayList<>();
texts.add("milvus结巴分词器中文测试");

// Run the analyzer with the analyzerParams defined above
RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

ctx := context.Background()

client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
    Address: "localhost:19530",
    APIKey:  "root:Milvus",
})
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

// Serialize the analyzerParams defined above to a JSON string
bs, _ := json.Marshal(analyzerParams)
texts := []string{"milvus结巴分词器中文测试"}
option := milvusclient.NewRunAnalyzerOption(texts).
    WithAnalyzerParams(string(bs))

result, err := client.RunAnalyzer(ctx, option)
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

Expected output

['milvus', '结巴分词器', '中', '文', '测', '试']
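The shape of this output follows from the configuration: the dictionary contains only 结巴分词器 and HMM is disabled, so the remaining Chinese text falls back to single characters. The sketch below imitates that fallback with a greedy longest-match pass. It is a deliberately simplified stand-in (jieba itself selects splits by probability rather than greedily), and `toy_tokenize` is a hypothetical name.

```python
import re


def toy_tokenize(text, dictionary):
    """Greedy longest match; ASCII runs stay whole, the rest falls back to single characters."""
    tokens, i = [], 0
    max_len = max((len(word) for word in dictionary), default=1)
    while i < len(text):
        ascii_run = re.match(r"[A-Za-z0-9]+", text[i:])
        if ascii_run:
            tokens.append(ascii_run.group())
            i += ascii_run.end()
            continue
        # Try the longest dictionary entry first; size 1 always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("milvus结巴分词器中文测试", {"结巴分词器"}))
# ['milvus', '结巴分词器', '中', '文', '测', '试']
```

Adding "_default_" back to `dict`, or enabling `hmm`, would let 中文 and 测试 be recognized as words instead of being split into characters.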
