Milvus
Zilliz
Home
  • User Guide
  • Home
  • Docs
  • User Guide

  • Schema & Data Fields

  • JSON Field

  • JSON Shredding

JSON ShreddingCompatible with Milvus 2.6.2+

JSON shredding accelerates JSON queries by converting traditional row-based storage into optimized columnar storage. While maintaining JSON’s flexibility for data modeling, Milvus performs behind-the-scenes columnar optimization that dramatically improves access and query efficiency.

JSON shredding is effective for most JSON query scenarios. The performance benefits become more pronounced with:

  • Larger, more complex JSON documents - Greater performance gains as document size increases

  • Read-heavy workloads - Frequent filtering, sorting, or searching on JSON keys

  • Mixed query patterns - Queries across different JSON keys benefit from the hybrid storage approach

How it works

The JSON shredding process happens in three distinct phases to optimize data for fast retrieval.

Phase 1: Ingestion & key classification

As new JSON documents are written, Milvus continuously samples and analyzes them to build statistics for each JSON key. This analysis includes the key’s occurrence ratio and type stability (whether its data type is consistent across documents).

Based on these statistics, JSON keys are categorized into the following for optimal storage.

Categories of JSON keys

Key Type

Description

Typed keys

Keys that exist in most documents and always have the same data type (e.g., all integers or all strings).

Dynamic keys

Keys that appear frequently but have a mixed data type (e.g., sometimes a string, sometimes an integer).

Shared keys

Infrequently appearing or nested keys that fall below a configurable frequency threshold.

Example classification

Consider the sample JSON data containing the following JSON keys:

{"a": 10, "b": "str1", "f": 1}
{"a": 20, "b": "str2", "f": 2}  
{"a": 30, "b": "str3", "f": 3}
{"a": 40, "b": 1, "f": 4}       // b becomes mixed type
{"a": 50, "b": 2, "e": "rare"}  // e appears infrequently

Based on this data, the keys would be classified as follows:

  • Typed keys: a and f (always an integer)

  • Dynamic keys: b (mixed string/integer)

  • Shared keys: e (infrequently appearing key)

Phase 2: Storage optimization

The classification from Phase 1 dictates the storage layout. Milvus uses a columnar format optimized for queries.

Json Shredding Flow Json Shredding Flow

  • Shredded columns: For typed and dynamic keys, data is written to dedicated columns. This columnar storage allows for fast, direct scans during queries, as Milvus can read only the required data for a given key without processing the entire document.

  • Shared column: All shared keys are stored together in a single, compact binary JSON column. A shared-key inverted index is built on this column. This index is crucial for accelerating queries on low-frequency keys by allowing Milvus to quickly prune the data, effectively narrowing down the search space to only those rows that contain the specified key.

Phase 3: Query execution

The final phase leverages the optimized storage layout to intelligently select the fastest path for each query predicate.

  • Fast path: Queries on typed/dynamic keys (e.g., json['a'] < 100) access dedicated columns directly

  • Optimized path: Queries on shared keys (e.g., json['e'] = 'rare') use inverted index to quickly locate relevant documents

Enable JSON shredding

To activate the feature, set common.enabledJSONKeyStats to true in your milvus.yaml configuration file. New data will automatically trigger the shredding process.

# milvus.yaml
...
common:
  enabledJSONKeyStats: true # Indicates whether to enable JSON key stats build and load processes
...

Once enabled, Milvus will begin analyzing and restructuring your JSON data upon ingestion without any further manual intervention.

Parameter tuning

For most users, once JSON shredding is enabled, the default settings for other parameters are sufficient. However, you can fine-tune the behavior of JSON shredding using these parameters in milvus.yaml.

Parameter Name

Description

Default Value

Tuning Advice

common.enabledJSONKeyStats

Controls whether the JSON shredding build and load processes are enabled.

false

Must be set to true to activate the feature.

common.usingJsonStatsForQuery

Controls whether Milvus uses shredded data for acceleration.

true

Set to false as a recovery measure if queries fail, reverting to the original query path.

queryNode.mmap.jsonStats

Determines whether Milvus uses mmap when loading shredding data.

For details, refer to Use mmap.

true

This setting is generally optimized for performance. Only adjust it if you have specific memory management needs or constraints on your system.

dataCoord.jsonStatsMaxShreddingColumns

The maximum number of JSON keys that will be stored in shredded columns.

If the number of frequently appearing keys exceeds this limit, Milvus will prioritize the most frequent ones for shredding, and the remaining keys will be stored in the shared column.

1024

This is sufficient for most scenarios. For JSON with thousands of frequently appearing keys, you may need to increase this, but monitor storage usage.

dataCoord.jsonStatsShreddingRatioThreshold

The minimum occurrence ratio a JSON key must have to be considered for shredding into a shredded column.

A key is considered frequently appearing if its ratio is above this threshold.

0.3

Increase (e.g., to 0.5) if the number of keys that meet the shredding criteria exceeds the dataCoord.jsonStatsMaxShreddingColumns limit. This makes the threshold stricter, reducing the number of keys that qualify for shredding.

Decrease (e.g., to 0.1) if you want to shred more keys that appear less frequently than the default 30% threshold.

Performance benchmarks

Our testing demonstrates significant performance improvements across different JSON key types and query patterns.

Test environment and methodology

  • Hardware: 1 core/8GB cluster

  • Dataset: 1 million documents from JSONBench

  • Average document size: 478.89 bytes

  • Test duration: 100 seconds measuring QPS and latency

Results: typed keys

This test measured performance when querying a key present in most documents.

Query Expression

Key Value Type

QPS (without shredding)

QPS (with shredding)

Performance Boost

json['time_us'] > 0

Integer

8.69

287.50

33x

json['kind'] == 'commit'

String

8.42

126.1

14.9x

Results: shared keys

This test focused on querying sparse, nested keys that fall into the “shared” category.

Query Expression

Key Value Type

QPS (without shredding)

QPS (with shredding)

Performance Boost

json['identity']['seq'] > 0

Nested Integer

4.33

385

88.9x

json['identity']['did'] == 'xxxxx'

Nested String

7.6

352

46.3x

Key insights

  • Shared key queries show the most dramatic improvements (up to 89x faster)

  • Typed key queries provide consistent 15-30x performance gains

  • All query types benefit from JSON Shredding with no performance regressions

FAQ

  • How do I verify if JSON shredding works properly?

    1. First, check if the data has been built by using the show segment --format table command in the Birdwatcher tool. If successful, the output will contain shredding_data/ and shared_key_index/ under the Json Key Stats field.

      Birdwatcher Output Birdwatcher Output

    2. Next, verify that the data has been loaded by running show loaded-json-stats on the query node. The output will display details about the loaded shredded data for each query node.

  • How do I select between JSON shredding and JSON indexing?

    • JSON shredding is ideal for keys that appear frequently in your documents, especially for complex JSON structures. It combines the benefits of columnar storage and inverted indexing, making it well-suited for read-heavy scenarios where you query many different keys. However, it is not recommended for very small JSON documents as the performance gain is minimal. The smaller the proportion of the key’s value to the total size of the JSON document, the better the performance optimization from shredding.

    • JSON indexing is better for targeted optimization of specific key-based queries and has lower storage overhead. It’s suitable for simpler JSON structures. Note that JSON shredding does not cover queries on keys inside arrays, so you need a JSON index to accelerate those.

  • What if I encounter an error?

    If the build or load process fails, you can quickly disable the feature by setting common.enabledJSONKeyStats=false. To clear any remaining tasks, use the remove stats-task <task_id> command in Birdwatcher. If a query fails, set common.usingJsonStatsForQuery=false to revert to the original query path, bypassing the shredded data.