Milvus monitoring framework overview
This topic explains how Milvus uses Prometheus to monitor metrics and Grafana to visualize metrics and create alerts.
Prometheus in Milvus
Prometheus is an open-source monitoring and alerting toolkit that is widely used with Kubernetes deployments. It collects and stores metrics as time-series data: each metric value is stored with the timestamp at which it was recorded, along with optional key-value pairs called labels.
Currently, Milvus uses the following components of Prometheus:
- Prometheus endpoint, to pull data from the endpoints set up by exporters.
- Prometheus operator, to manage Prometheus monitoring instances.
- Kube-prometheus, to provide easy-to-operate, end-to-end monitoring of Kubernetes clusters.
Metric names
A valid metric name in Prometheus contains three elements: namespace, subsystem, and name. These three elements are connected with "_".
The namespace of Milvus metrics monitored by Prometheus is "milvus". Depending on the role that a metric belongs to, its subsystem should be one of the following eight roles: "rootcoord", "proxy", "querycoord", "querynode", "indexcoord", "indexnode", "datacoord", "datanode".
For instance, the Milvus metric that calculates the total number of vectors queried is named milvus_proxy_search_vectors_count.
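The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names build_metric_name and split_metric_name are hypothetical and are not part of Milvus or Prometheus. The role list is taken from the paragraph above.

```python
# Illustrative helpers (hypothetical names, not a Milvus or Prometheus API).
MILVUS_NAMESPACE = "milvus"
MILVUS_ROLES = {
    "rootcoord", "proxy", "querycoord", "querynode",
    "indexcoord", "indexnode", "datacoord", "datanode",
}


def build_metric_name(subsystem, name, namespace=MILVUS_NAMESPACE):
    """Join namespace, subsystem, and name with underscores."""
    if subsystem not in MILVUS_ROLES:
        raise ValueError(f"unknown Milvus role: {subsystem!r}")
    return "_".join((namespace, subsystem, name))


def split_metric_name(full_name):
    """Split a full metric name back into (namespace, subsystem, name).

    Only the first two underscores are treated as separators, because
    the final name element may itself contain underscores.
    """
    namespace, subsystem, name = full_name.split("_", 2)
    return namespace, subsystem, name


print(build_metric_name("proxy", "search_vectors_count"))
print(split_metric_name("milvus_proxy_search_vectors_count"))
```

Splitting on only the first two underscores works here because none of the eight role names contains an underscore.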
Metric types
Prometheus supports four types of metrics:
- Counter: a cumulative metric whose value can only increase, or be reset to zero upon restart.
- Gauge: a metric whose value can go up or down.
- Histogram: a metric whose observations are counted in configurable buckets. A common example is request duration.
- Summary: a metric similar to a histogram that calculates configurable quantiles over a sliding time window.
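To make the differences concrete, here is a minimal, dependency-free sketch of the first three types. It is a toy model written for illustration, not the Prometheus client library, and the bucket bounds in the usage lines are arbitrary assumptions. Summary is omitted because a sliding-window quantile estimator does not fit in a short example.

```python
import bisect
import math


class Counter:
    """Cumulative value that can only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations in buckets defined by upper bounds."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs."""
        total, result = 0, []
        for bound, count in zip(self.upper_bounds, self.counts):
            total += count
            result.append((bound, total))
        return result


# Example: request durations in seconds (bucket bounds are assumptions).
durations = Histogram([0.1, 0.5])
for seconds in (0.05, 0.3, 2.0):
    durations.observe(seconds)
print(durations.cumulative())
```

The cumulative() method mirrors how Prometheus exposes histogram buckets: each bucket reports the number of observations less than or equal to its upper bound.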
Metric labels
Prometheus differentiates samples with the same metric name by labeling them. A label is a certain attribute of a metric. Metrics with the same name must have the same value for the variable_labels field. The following table lists the names and meanings of common labels of Milvus metrics.
Label name | Definition | Values |
---|---|---|
"node_id" | The unique identity of a role. | A globally unique ID generated by Milvus. |
"status" | The status of a processed operation or request. | "abandon", "success", or "fail". |
"query_type" | The type of a read request. | "search" or "query". |
"msg_type" | The type of messages. | "insert", "delete", "search", or "query". |
"segment_state" | The status of a segment. | "Sealed", "Growing", "Flushed", "Flushing", "Dropped", or "Importing". |
"cache_state" | The status of a cached object. | "hit" or "miss". |
"cache_name" | The name of a cached object. This label is used together with the label "cache_state". | E.g. "CollectionID", "Schema", etc. |
"channel_name" | Physical topics in message storage (Pulsar or Kafka). | E.g. "by-dev-rootcoord-dml_0", "by-dev-rootcoord-dml_255", etc. |
"function_name" | The name of a function that handles certain requests. | E.g. "CreateCollection", "CreatePartition", "CreateIndex", etc. |
"user_name" | The user name used for authentication. | A user name of your preference. |
"index_task_status" | The status of an index task in meta storage. | "unissued", "in-progress", "failed", "finished", or "recycled". |
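In the Prometheus text exposition format, labels appear in braces after the metric name. The sketch below parses one such sample line into its name, labels, and value. The sample line, including its label values and the number 42, is made up for illustration, and parse_sample is a hypothetical helper. The parser is deliberately simplified: it does not handle optional timestamps, comment lines, or escape sequences inside label values.

```python
import re

# Simplified patterns for "name{key="value",...} value" sample lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)$'                    # sample value
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample line; the label values and count are invented.
line = 'milvus_proxy_search_vectors_count{node_id="1",query_type="search"} 42'
name, labels, value = parse_sample(line)
print(name, labels, value)
```

Two samples with the same metric name but different label values, for example "search" and "query" for "query_type", are treated by Prometheus as separate time series.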
Grafana in Milvus
Grafana is an open-source visualization stack that can connect to a wide range of data sources. By querying metrics from those sources, it helps users understand, analyze, and monitor large volumes of data.
Milvus uses Grafana’s customizable dashboards for metric visualization.
What’s next
After learning about the basic workflow of monitoring and alerting, learn: