Lindera

The lindera tokenizer performs dictionary-based morphological analysis. It is designed for Japanese and Korean—languages where words are not separated by spaces and grammatical markers (particles) attach directly to words.

For Chinese text: While lindera supports Chinese via the cc-cedict dictionary, we recommend using the jieba tokenizer instead. Jieba is specifically designed for Chinese word segmentation and provides better results.

Overview

Japanese and Korean are agglutinative languages: grammatical markers called particles attach directly to nouns, forming numerous combinations. For example:

Language	Root word	+ Particle	= Combined form	Meaning
Korean	서울 (Seoul)	에서	서울에서	in Seoul
Japanese	東京 (Tokyo)	に	東京に	to Tokyo

The lindera tokenizer:

Segments text into individual morphemes (words and particles)
Tags each token with part-of-speech (POS) information from the dictionary
Applies filters to remove unwanted tokens (e.g., particles, punctuation)

This two-stage process, segmentation with POS tagging followed by tag-based filtering, gives precise control over which tokens are indexed for search.
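The two stages can be sketched in plain Python. This is only an illustration of the idea, not how Lindera is implemented: the morphemes and Sejong-style POS tags below are hand-assigned for the sentence 서울에서 맛있는 음식을 먹었습니다, and filter_by_pos is a hypothetical helper.

```python
# Stage 1 (assumed result): segmentation output as (surface, POS tag) pairs.
# The tags are hand-assigned for illustration, not produced by a dictionary.
tagged_morphemes = [
    ("서울", "NNP"),    # proper noun
    ("에서", "JKB"),    # adverbial case particle
    ("맛있", "VA"),     # adjective stem
    ("는", "ETM"),      # adnominal ending
    ("음식", "NNG"),    # common noun
    ("을", "JKO"),      # object case particle
    ("먹", "VV"),       # verb stem
    ("었", "EP"),       # pre-final ending (past tense)
    ("습니다", "EF"),   # sentence-final ending
]


def filter_by_pos(morphemes, stop_tags):
    """Stage 2: drop every morpheme whose POS tag is in stop_tags (exact match)."""
    return [surface for surface, tag in morphemes if tag not in stop_tags]


stop_tags = {"JKB", "JKO", "ETM", "EP"}
tokens = filter_by_pos(tagged_morphemes, stop_tags)
print(tokens)  # ['서울', '맛있', '음식', '먹', '습니다']
```

Only the tokens whose tags are absent from stop_tags survive, which is why the choice of tags determines what ends up in the index.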

Prerequisites

Milvus 2.6+ users: You can skip this section. All dictionaries are pre-compiled and included in the official release.

For Milvus 2.5.x, you must compile Milvus yourself, and every dictionary you want to use must be explicitly enabled at build time.

To enable dictionaries, list them in TANTIVY_FEATURES in the compilation command:

make milvus TANTIVY_FEATURES=lindera-ipadic,lindera-ko-dic

The complete list of available dictionaries:

Dictionary	Language	Description
lindera-ko-dic	Korean	Korean morphological dictionary (MeCab Ko-dic)
lindera-ipadic	Japanese	Standard morphological dictionary (MeCab IPADIC)
lindera-ipadic-neologd	Japanese	Extended dictionary with new words and proper nouns (IPADIC NEologd)
lindera-unidic	Japanese	Academic standard dictionary (UniDic)
lindera-cc-cedict	Chinese	Community-maintained Chinese-English dictionary (CC-CEDICT)

For example, to enable all dictionaries:

make milvus TANTIVY_FEATURES=lindera-ipadic,lindera-ipadic-neologd,lindera-unidic,lindera-ko-dic,lindera-cc-cedict
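If you script the 2.5.x build, the feature list can be derived from the dict_kind values you plan to use. The sketch below assumes, as the table above suggests, that each feature name is the dictionary kind prefixed with lindera-. The helper tantivy_features is hypothetical and not part of Milvus.

```python
# Dictionary kinds listed in the table above (without the "lindera-" prefix).
KNOWN_DICT_KINDS = ("ipadic", "ipadic-neologd", "unidic", "ko-dic", "cc-cedict")


def tantivy_features(dict_kinds):
    """Build the TANTIVY_FEATURES value for the given dictionary kinds."""
    unknown = [kind for kind in dict_kinds if kind not in KNOWN_DICT_KINDS]
    if unknown:
        raise ValueError(f"unknown dictionary kind(s): {unknown}")
    return ",".join(f"lindera-{kind}" for kind in dict_kinds)


command = f"make milvus TANTIVY_FEATURES={tantivy_features(['ipadic', 'ko-dic'])}"
print(command)  # make milvus TANTIVY_FEATURES=lindera-ipadic,lindera-ko-dic
```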

Configuration

To configure an analyzer using the lindera tokenizer, set tokenizer.type to lindera, choose a dictionary with dict_kind, and optionally apply filters.

Python Java Go NodeJS

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {
                "kind": "korean_stop_tags",
                "tags": ["SP", "SSC", "SSO", "SC", "SE", "SF", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC", "UNK", "EP", "ETM"]
            }
        ]
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
    put("type", "lindera");
    put("dict_kind", "ko-dic");
    put("filter", Arrays.asList(
        new HashMap<String, Object>() {{
            put("kind", "korean_stop_tags");
            put("tags", Arrays.asList(
                "SP", "SSC", "SSO", "SC", "SE", "SF",
                "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ",
                "JX", "JC", "UNK", "EP", "ETM"
            ));
        }}
    ));
}});

analyzerParams := map[string]interface{}{
    "tokenizer": map[string]interface{}{
        "type":      "lindera",
        "dict_kind": "ko-dic",
        "filter": []interface{}{
            map[string]interface{}{
                "kind": "korean_stop_tags",
                "tags": []string{
                    "SP", "SSC", "SSO", "SC", "SE", "SF",
                    "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ",
                    "JX", "JC", "UNK", "EP", "ETM",
                },
            },
        },
    },
}

const analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {
                "kind": "korean_stop_tags",
                "tags": ["SP", "SSC", "SSO", "SC", "SE", "SF", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC", "UNK", "EP", "ETM"]
            }
        ]
    }
};

type

The type of tokenizer. This is fixed to "lindera".

dict_kind

The dictionary used to define the vocabulary. Possible values:

- ko-dic: Korean. Korean morphological dictionary (MeCab Ko-dic).
- ipadic: Japanese. Standard morphological dictionary (MeCab IPADIC).
- ipadic-neologd: Japanese with neologism dictionary (extended). Includes new words and proper nouns (IPADIC NEologd).
- unidic: Japanese UniDic (extended). Academic standard dictionary with detailed linguistic information (UniDic).
- cc-cedict: Mandarin Chinese (traditional/simplified). Community-maintained Chinese-English dictionary (CC-CEDICT).

filter

A list of tokenizer-level filters to apply after segmentation. Each filter is an object with:

- kind: The filter type. Supported values:
  - korean_stop_tags: Remove tokens matching specified Korean POS tags.
  - japanese_stop_tags: Remove tokens matching specified Japanese POS tags.
- tags: A list of POS tags to filter out. The available tags depend on the kind:
  - For korean_stop_tags: Use exact tag codes (e.g., JKS, JKO, SF). Korean tags require exact matching. For the complete list based on the Sejong tagset, see the Lindera Korean stop tags source.
  - For japanese_stop_tags: Use exact tag codes (e.g., 助詞,格助詞, 助詞,係助詞, 助動詞). Japanese tags require exact matching. For the complete list (IPADIC), see Japanese POS tags reference.
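A quick client-side check can catch typos in these parameters before they reach the server. The sketch below only encodes the values listed above. The helper check_lindera_params is hypothetical, and Milvus performs its own validation regardless.

```python
# Values taken from the parameter descriptions above.
DICT_KINDS = {"ko-dic", "ipadic", "ipadic-neologd", "unidic", "cc-cedict"}
FILTER_KINDS = {"korean_stop_tags", "japanese_stop_tags"}


def check_lindera_params(analyzer_params):
    """Return a list of problems found in a lindera tokenizer configuration."""
    problems = []
    tokenizer = analyzer_params.get("tokenizer", {})
    if tokenizer.get("type") != "lindera":
        problems.append('tokenizer.type must be "lindera"')
    if tokenizer.get("dict_kind") not in DICT_KINDS:
        problems.append(f"unsupported dict_kind: {tokenizer.get('dict_kind')!r}")
    # "filter" is optional, so a missing key simply yields no filter checks.
    for i, flt in enumerate(tokenizer.get("filter", [])):
        if flt.get("kind") not in FILTER_KINDS:
            problems.append(f"filter[{i}]: unsupported kind {flt.get('kind')!r}")
        if not isinstance(flt.get("tags"), list):
            problems.append(f"filter[{i}]: tags must be a list of POS tags")
    return problems


params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [{"kind": "korean_stop_tags", "tags": ["JKS", "JKO", "SF"]}],
    }
}
print(check_lindera_params(params))  # []
```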

After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This allows Milvus to process the text in that field using the specified analyzer for efficient tokenization and filtering. For details, refer to Example use.
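For the schema step described above, a minimal pymilvus sketch might look like the following. It assumes a pymilvus version whose add_field accepts enable_analyzer and analyzer_params. The field names id and text, the max_length value, and the shortened tag list are placeholders.

```python
from pymilvus import DataType, MilvusClient

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {"kind": "korean_stop_tags", "tags": ["JKS", "JKO", "JKB", "SF"]}
        ],
    }
}

# Building a schema is a local operation; no server connection is needed yet.
schema = MilvusClient.create_schema(auto_id=True, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,                  # placeholder length
    enable_analyzer=True,             # turn on text analysis for this field
    analyzer_params=analyzer_params,  # the lindera configuration defined above
)
```

The schema would then be passed to create_collection together with whatever vector fields and indexes the collection needs.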

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.

Korean example

Python Java Go NodeJS

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {
                "kind": "korean_stop_tags",
                "tags": ["SP", "SSC", "SSO", "SC", "SE", "SF", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC", "UNK", "EP", "ETM"]
            }
        ]
    }
}

# Sample Korean text: "서울에서 맛있는 음식을 먹었습니다" (I ate delicious food in Seoul)
sample_text = "서울에서 맛있는 음식을 먹었습니다"

result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

ConnectConfig config = ConnectConfig.builder()
        .uri("http://localhost:19530")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
    put("type", "lindera");
    put("dict_kind", "ko-dic");
    put("filter", Arrays.asList(
        new HashMap<String, Object>() {{
            put("kind", "korean_stop_tags");
            put("tags", Arrays.asList(
                "SP", "SSC", "SSO", "SC", "SE", "SF",
                "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ",
                "JX", "JC", "UNK", "EP", "ETM"
            ));
        }}
    ));
}});

List<String> texts = new ArrayList<>();
texts.add("서울에서 맛있는 음식을 먹었습니다");

RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

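// ctx was used but never defined; a background context is enough for this example.
ctx := context.Background()

client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{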
    Address: "localhost:19530",
    APIKey:  "root:Milvus",
})
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

analyzerParams := map[string]interface{}{
    "tokenizer": map[string]interface{}{
        "type":      "lindera",
        "dict_kind": "ko-dic",
        "filter": []interface{}{
            map[string]interface{}{
                "kind": "korean_stop_tags",
                "tags": []string{
                    "SP", "SSC", "SSO", "SC", "SE", "SF",
                    "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ",
                    "JX", "JC", "UNK", "EP", "ETM",
                },
            },
        },
    },
}

bs, _ := json.Marshal(analyzerParams)
texts := []string{"서울에서 맛있는 음식을 먹었습니다"}
option := milvusclient.NewRunAnalyzerOption(texts).
    WithAnalyzerParams(string(bs))

result, err := client.RunAnalyzer(ctx, option)
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

import { MilvusClient } from "@zilliz/milvus2-sdk-node";

const client = new MilvusClient({
  uri: "http://localhost:19530",
});

const analyzer_params = {
  tokenizer: {
    type: "lindera",
    dict_kind: "ko-dic",
    filter: [
      {
        kind: "korean_stop_tags",
        tags: [
          "SP",
          "SSC",
          "SSO",
          "SC",
          "SE",
          "SF",
          "JKS",
          "JKC",
          "JKG",
          "JKO",
          "JKB",
          "JKV",
          "JKQ",
          "JX",
          "JC",
          "UNK",
          "EP",
          "ETM",
        ],
      },
    ],
  },
};

const sample_text = "서울에서 맛있는 음식을 먹었습니다";

const result = await client.run_analyzer(sample_text, analyzer_params);
console.log("Analyzer output:", result);

Expected output:

['서울', '맛있', '음식', '먹', '습니다']

Without korean_stop_tags, the output would also include the particles 에서 (in) and 을 (object marker), as well as endings such as 는 (adnominal ending) and 었 (past-tense marker), which are typically not useful for search.

Japanese example

Python NodeJS

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ipadic",
        "filter": [
            {
                "kind": "japanese_stop_tags",
                "tags": ["接続詞", "助詞,格助詞", "助詞,格助詞,一般", "助詞,格助詞,引用", "助詞,格助詞,連語", "助詞,係助詞", "助詞,終助詞", "助詞,接続助詞", "助詞,特殊", "助詞,副助詞", "助詞,副助詞／並立助詞／終助詞", "助詞,連体化", "助詞,副詞化", "助詞,並立助詞", "助動詞", "記号,一般", "記号,読点", "記号,句点", "記号,空白", "記号,括弧閉", "記号,括弧開", "その他,間投", "フィラー", "非言語音"]
            }
        ]
    }
}

# Sample Japanese text: "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" (The nearest station to Tokyo Skytree is Tokyo Skytree Station)
sample_text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です"

result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

import { MilvusClient } from "@zilliz/milvus2-sdk-node";

const client = new MilvusClient({
  uri: "http://localhost:19530",
});

const analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ipadic",
        "filter": [
            {
                "kind": "japanese_stop_tags",
                "tags": ["接続詞", "助詞,格助詞", "助詞,格助詞,一般", "助詞,格助詞,引用", "助詞,格助詞,連語", "助詞,係助詞", "助詞,終助詞", "助詞,接続助詞", "助詞,特殊", "助詞,副助詞", "助詞,副助詞／並立助詞／終助詞", "助詞,連体化", "助詞,副詞化", "助詞,並立助詞", "助動詞", "記号,一般", "記号,読点", "記号,句点", "記号,空白", "記号,括弧閉", "記号,括弧開", "その他,間投", "フィラー", "非言語音"]
            }
        ]
    }
}

// Sample Japanese text: "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" (The nearest station to Tokyo Skytree is Tokyo Skytree Station)
const sample_text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";

const result = await client.run_analyzer(sample_text, analyzer_params);
console.log("Analyzer output:", result);

Expected output:

['東京', 'スカイ', 'ツリー', '最寄り駅', 'とう', 'きょう', 'スカイ', 'ツリー', '駅']

Without japanese_stop_tags, the output would also include the particles の (possessive) and は (topic marker) and the auxiliary verb です (copula).
