
Import Data

This topic describes how to import data in Milvus via bulk load.

Inserting a large batch of entities into Milvus in the regular way involves massive network transmission across the client, proxy, Pulsar, and data nodes. To avoid this, Milvus 2.1 supports loading data from files via bulk load. You can import large amounts of data into a collection with just a few lines of code, and the whole batch of entities is imported atomically.

You can also migrate data to Milvus with MilvusDM, an open-source tool designed specifically for importing and exporting data with Milvus.

Prepare data file

You can prepare the data file in a row-based or a column-based format.

  • Row-based data file

A row-based data file is a JSON file containing multiple rows. The root key must be "rows", and the file name can be arbitrary. A sketch for generating such a file programmatically follows this list.

```json
{
  "rows":[
    {"book_id": 101, "word_count": 13, "book_intro": [1.1, 1.2]},
    {"book_id": 102, "word_count": 25, "book_intro": [2.1, 2.2]},
    {"book_id": 103, "word_count": 7, "book_intro": [3.1, 3.2]},
    {"book_id": 104, "word_count": 12, "book_intro": [4.1, 4.2]},
    {"book_id": 105, "word_count": 34, "book_intro": [5.1, 5.2]}
  ]
}
```
  • Column-based data file

A column-based data file can be a JSON file containing multiple columns, a set of Numpy files each containing a single column, or a JSON file holding some columns together with Numpy files holding the rest.

  • JSON file containing multiple columns
```json
{
        "book_id": [101, 102, 103, 104, 105],
        "word_count": [13, 25, 7, 12, 34],
        "book_intro": [
                [1.1, 1.2],
                [2.1, 2.2],
                [3.1, 3.2],
                [4.1, 4.2],
                [5.1, 5.2]
        ]
}
```
  • Numpy files

```python
import numpy

# Save each scalar column as its own Numpy file.
numpy.save('book_id.npy', numpy.array([101, 102, 103, 104, 105]))
numpy.save('word_count.npy', numpy.array([13, 25, 7, 12, 34]))

# Save the vector column as a 2-D array: one row per entity.
arr = numpy.array([[1.1, 1.2],
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2],
                   [5.1, 5.2]])
numpy.save('book_intro.npy', arr)
```
  • A JSON file holding some columns, plus Numpy files for the rest

For example, keep the scalar columns in a JSON file and store the vector column in a Numpy file (the book_intro.npy generated above):

```json
{
    "book_id": [101, 102, 103, 104, 105],
    "word_count": [13, 25, 7, 12, 34]
}
```
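For larger datasets you would normally generate these files programmatically rather than by hand. Below is a minimal sketch for the row-based format using only the Python standard library; the field names match the examples above, and the random values are placeholders:

```python
import json
import random

# Build rows matching the example schema: book_id, word_count, book_intro.
rows = [
    {
        "book_id": i,
        "word_count": random.randint(1, 100),
        "book_intro": [random.random() for _ in range(2)],
    }
    for i in range(101, 106)
]

# The root key must be "rows".
with open("row_based_1.json", "w") as f:
    json.dump({"rows": rows}, f)
```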

Upload data file

Upload data files to object storage.

You can upload the data files to MinIO or to local storage (available only in Milvus Standalone).

  • Upload to MinIO

Upload the data files to the bucket defined by minio.bucketName in the configuration file milvus.yml. A minimal upload sketch follows this list.

  • Upload to local storage

Copy the data files into the directory defined by localStorage.path in the configuration file milvus.yml.
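For MinIO, one way to upload the files is the MinIO Python client. This is a minimal sketch; the endpoint, credentials, and bucket name are assumptions and must match your deployment and the minio.bucketName setting:

```python
from minio import Minio

# Assumed endpoint and credentials: replace with your deployment's values.
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

# "a-bucket" is assumed here as the minio.bucketName value: check your milvus.yml.
for name in ["row_based_1.json", "row_based_2.json"]:
    client.fput_object("a-bucket", name, name)
```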

Insert data to Milvus

Import the data into the collection.

  • For row-based files
```python
from pymilvus import utility
tasks = utility.bulk_load(
    collection_name="book",
    is_row_based=True,
    files=["row_based_1.json", "row_based_2.json"]
)
```
  • For column-based files
```python
from pymilvus import utility
tasks = utility.bulk_load(
    collection_name="book",
    is_row_based=False,
    files=["columns.json", "book_intro.npy"]
)
```
| Parameter | Description |
| --------- | ----------- |
| collection_name | Name of the collection to load data into. |
| is_row_based | Boolean value to indicate whether the file is row-based. |
| files | List of file names to load into Milvus. |
| partition_name (optional) | Name of the partition to insert data into. |
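For example, to load a row-based file into a specific partition rather than the default one (this sketch assumes a partition named "novel" already exists in the collection):

```python
from pymilvus import utility
tasks = utility.bulk_load(
    collection_name="book",
    is_row_based=True,
    partition_name="novel",  # hypothetical partition; it must exist beforehand
    files=["row_based_1.json"]
)
```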

Check the import task state

Check the state of the import task.

```python
state = utility.get_bulk_load_state(tasks[0])
print(state.state_name())
print(state.ids())
print(state.infos())
```

The state codes and their corresponding descriptions:

| State code | State | Description |
| ---------- | ----- | ----------- |
| 0 | BulkLoadPending | Task is in the pending list |
| 1 | BulkLoadFailed | Task failed; get the failure reason with state.infos["failed_reason"] |
| 2 | BulkLoadStarted | Task has been dispatched to a data node and is about to be executed |
| 3 | BulkLoadDownloaded | Data file has been downloaded from MinIO to local storage |
| 4 | BulkLoadParsed | Data file has been validated and parsed |
| 5 | BulkLoadPersisted | New segments have been generated and persisted |
| 6 | BulkLoadCompleted | Task completed |
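In practice you may want to poll until a task reaches a terminal state. Below is a minimal polling sketch; the state-name strings are assumed to match the table above, and the failed_reason key follows the description of state code 1:

```python
import time
from pymilvus import utility

def wait_for_bulk_load(task_id, timeout=300, interval=5):
    """Poll a bulk-load task until it completes, fails, or times out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = utility.get_bulk_load_state(task_id)
        name = state.state_name()
        if name == "BulkLoadCompleted":
            return state
        if name == "BulkLoadFailed":
            # "failed_reason" key assumed, per the state table above.
            raise RuntimeError("bulk load failed: %s" % state.infos()["failed_reason"])
        time.sleep(interval)
    raise TimeoutError("task %s did not finish within %s s" % (task_id, timeout))

# Usage: block until the first import task finishes.
# state = wait_for_bulk_load(tasks[0])
```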

Limits

| Feature | Maximum limit |
| ------- | ------------- |
| Max size of task pending list | 32 |
| Max size of a data file | 4 GB |
