JSON ShreddingCompatible with Milvus 2.6.2+
JSON shredding accelerates JSON queries by converting traditional row-based storage into optimized columnar storage. While maintaining JSON’s flexibility for data modeling, Milvus performs behind-the-scenes columnar optimization that dramatically improves access and query efficiency.
JSON shredding is effective for most JSON query scenarios. The performance benefits become more pronounced with:
Larger, more complex JSON documents - Greater performance gains as document size increases
Read-heavy workloads - Frequent filtering, sorting, or searching on JSON keys
Mixed query patterns - Queries across different JSON keys benefit from the hybrid storage approach
How it works
The JSON shredding process happens in three distinct phases to optimize data for fast retrieval.
Phase 1: Ingestion & key classification
As new JSON documents are written, Milvus continuously samples and analyzes them to build statistics for each JSON key. This analysis includes the key’s occurrence ratio and type stability (whether its data type is consistent across documents).
Based on these statistics, JSON keys are categorized into the following for optimal storage.
Categories of JSON keys
Key Type |
Description |
---|---|
Typed keys |
Keys that exist in most documents and always have the same data type (e.g., all integers or all strings). |
Dynamic keys |
Keys that appear frequently but have a mixed data type (e.g., sometimes a string, sometimes an integer). |
Shared keys |
Infrequently appearing or nested keys that fall below a configurable frequency threshold. |
Example classification
Consider the sample JSON data containing the following JSON keys:
{"a": 10, "b": "str1", "f": 1}
{"a": 20, "b": "str2", "f": 2}
{"a": 30, "b": "str3", "f": 3}
{"a": 40, "b": 1, "f": 4} // b becomes mixed type
{"a": 50, "b": 2, "e": "rare"} // e appears infrequently
Based on this data, the keys would be classified as follows:
Typed keys:
a
andf
(always an integer)Dynamic keys:
b
(mixed string/integer)Shared keys:
e
(infrequently appearing key)
Phase 2: Storage optimization
The classification from Phase 1 dictates the storage layout. Milvus uses a columnar format optimized for queries.
Json Shredding Flow
Shredded columns: For typed and dynamic keys, data is written to dedicated columns. This columnar storage allows for fast, direct scans during queries, as Milvus can read only the required data for a given key without processing the entire document.
Shared column: All shared keys are stored together in a single, compact binary JSON column. A shared-key inverted index is built on this column. This index is crucial for accelerating queries on low-frequency keys by allowing Milvus to quickly prune the data, effectively narrowing down the search space to only those rows that contain the specified key.
Phase 3: Query execution
The final phase leverages the optimized storage layout to intelligently select the fastest path for each query predicate.
Fast path: Queries on typed/dynamic keys (e.g.,
json['a'] < 100
) access dedicated columns directlyOptimized path: Queries on shared keys (e.g.,
json['e'] = 'rare'
) use inverted index to quickly locate relevant documents
Enable JSON shredding
To activate the feature, set common.enabledJSONKeyStats
to true
in your milvus.yaml
configuration file. New data will automatically trigger the shredding process.
# milvus.yaml
...
common:
enabledJSONKeyStats: true # Indicates whether to enable JSON key stats build and load processes
...
Once enabled, Milvus will begin analyzing and restructuring your JSON data upon ingestion without any further manual intervention.
Parameter tuning
For most users, once JSON shredding is enabled, the default settings for other parameters are sufficient. However, you can fine-tune the behavior of JSON shredding using these parameters in milvus.yaml
.
Parameter Name |
Description |
Default Value |
Tuning Advice |
---|---|---|---|
|
Controls whether the JSON shredding build and load processes are enabled. |
false |
Must be set to true to activate the feature. |
|
Controls whether Milvus uses shredded data for acceleration. |
true |
Set to false as a recovery measure if queries fail, reverting to the original query path. |
|
Determines whether Milvus uses mmap when loading shredding data. For details, refer to Use mmap. |
true |
This setting is generally optimized for performance. Only adjust it if you have specific memory management needs or constraints on your system. |
|
The maximum number of JSON keys that will be stored in shredded columns. If the number of frequently appearing keys exceeds this limit, Milvus will prioritize the most frequent ones for shredding, and the remaining keys will be stored in the shared column. |
1024 |
This is sufficient for most scenarios. For JSON with thousands of frequently appearing keys, you may need to increase this, but monitor storage usage. |
|
The minimum occurrence ratio a JSON key must have to be considered for shredding into a shredded column. A key is considered frequently appearing if its ratio is above this threshold. |
0.3 |
Increase (e.g., to 0.5) if the number of keys that meet the shredding criteria exceeds the Decrease (e.g., to 0.1) if you want to shred more keys that appear less frequently than the default 30% threshold. |
Performance benchmarks
Our testing demonstrates significant performance improvements across different JSON key types and query patterns.
Test environment and methodology
Hardware: 1 core/8GB cluster
Dataset: 1 million documents from JSONBench
Average document size: 478.89 bytes
Test duration: 100 seconds measuring QPS and latency
Results: typed keys
This test measured performance when querying a key present in most documents.
Query Expression |
Key Value Type |
QPS (without shredding) |
QPS (with shredding) |
Performance Boost |
---|---|---|---|---|
|
Integer |
8.69 |
287.50 |
33x |
|
String |
8.42 |
126.1 |
14.9x |
Results: shared keys
This test focused on querying sparse, nested keys that fall into the “shared” category.
Query Expression |
Key Value Type |
QPS (without shredding) |
QPS (with shredding) |
Performance Boost |
---|---|---|---|---|
|
Nested Integer |
4.33 |
385 |
88.9x |
|
Nested String |
7.6 |
352 |
46.3x |
Key insights
Shared key queries show the most dramatic improvements (up to 89x faster)
Typed key queries provide consistent 15-30x performance gains
All query types benefit from JSON Shredding with no performance regressions
FAQ
How do I verify if JSON shredding works properly?
First, check if the data has been built by using the
show segment --format table
command in the Birdwatcher tool. If successful, the output will containshredding_data/
andshared_key_index/
under the Json Key Stats field.Birdwatcher Output
Next, verify that the data has been loaded by running
show loaded-json-stats
on the query node. The output will display details about the loaded shredded data for each query node.
How do I select between JSON shredding and JSON indexing?
JSON shredding is ideal for keys that appear frequently in your documents, especially for complex JSON structures. It combines the benefits of columnar storage and inverted indexing, making it well-suited for read-heavy scenarios where you query many different keys. However, it is not recommended for very small JSON documents as the performance gain is minimal. The smaller the proportion of the key’s value to the total size of the JSON document, the better the performance optimization from shredding.
JSON indexing is better for targeted optimization of specific key-based queries and has lower storage overhead. It’s suitable for simpler JSON structures. Note that JSON shredding does not cover queries on keys inside arrays, so you need a JSON index to accelerate those.
What if I encounter an error?
If the build or load process fails, you can quickly disable the feature by setting
common.enabledJSONKeyStats=false
. To clear any remaining tasks, use theremove stats-task <task_id>
command in Birdwatcher. If a query fails, setcommon.usingJsonStatsForQuery=false
to revert to the original query path, bypassing the shredded data.