Milvus monitoring framework overview
This topic explains how Milvus uses Prometheus to monitor metrics and Grafana to visualize metrics and create alerts.
Prometheus in Milvus
Prometheus is an open-source monitoring and alerting toolkit for Kubernetes implementations. It collects and stores metrics as time-series data: each sample is stored with the timestamp at which it was recorded, along with optional key-value pairs called labels.
Currently Milvus uses the following components of Prometheus:
- Prometheus endpoint to pull data from endpoints set by exporters.
- Prometheus operator to effectively manage Prometheus monitoring instances.
- Kube-prometheus to provide easy-to-operate, end-to-end Kubernetes cluster monitoring.
Metric names
A valid metric name in Prometheus contains three elements: namespace, subsystem, and name. These three elements are connected with "_".
The namespace of Milvus metrics monitored by Prometheus is "milvus". Depending on the role that a metric belongs to, its subsystem should be one of the following eight roles: "rootcoord", "proxy", "querycoord", "querynode", "indexcoord", "indexnode", "datacoord", "datanode".
For instance, the Milvus metric that calculates the total number of vectors queried is named milvus_proxy_search_vectors_count.
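The naming rule can be illustrated with a short Python sketch. The helper names `build_metric_name` and `split_metric_name` are hypothetical and are not part of Milvus or Prometheus. The sketch only assumes what is stated above: the namespace "milvus", the eight subsystems, and "_" as the separator.

```python
from typing import Tuple

# Values taken from the text above; the helpers themselves are illustrative only.
MILVUS_NAMESPACE = "milvus"
MILVUS_SUBSYSTEMS = {
    "rootcoord", "proxy", "querycoord", "querynode",
    "indexcoord", "indexnode", "datacoord", "datanode",
}


def build_metric_name(subsystem: str, name: str) -> str:
    """Join namespace, subsystem, and name with underscores."""
    if subsystem not in MILVUS_SUBSYSTEMS:
        raise ValueError(f"unknown subsystem: {subsystem!r}")
    return "_".join((MILVUS_NAMESPACE, subsystem, name))


def split_metric_name(full_name: str) -> Tuple[str, str, str]:
    """Split a full metric name into (namespace, subsystem, name).

    Only the first two underscores are treated as separators, because
    the name element may itself contain underscores.
    """
    namespace, subsystem, name = full_name.split("_", 2)
    if namespace != MILVUS_NAMESPACE or subsystem not in MILVUS_SUBSYSTEMS:
        raise ValueError(f"not a Milvus metric name: {full_name!r}")
    return namespace, subsystem, name


if __name__ == "__main__":
    print(build_metric_name("proxy", "search_vectors_count"))
    print(split_metric_name("milvus_proxy_search_vectors_count"))
```

Splitting on only the first two underscores works here because none of the eight subsystem names contains an underscore.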
Metric types
Prometheus supports four types of metrics:
- Counter: a cumulative metric whose value can only increase or be reset to zero on restart.
- Gauge: a metric whose value can go up or down.
- Histogram: a metric that counts observations in configurable buckets. A common example is request duration.
- Summary: a metric similar to a histogram that calculates configurable quantiles over a sliding time window.
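To make the difference between the types more concrete, the following is a minimal, standard-library-only Python sketch of how a counter, a gauge, and a histogram behave. It is not the Prometheus client library and not how Milvus implements its metrics. The class names and the bucket bounds are assumptions chosen for illustration, and the summary type is left out for brevity.

```python
import bisect


class Counter:
    """Cumulative value that can only increase (or restart at zero)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, such as a current queue length."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in buckets defined by upper bounds."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches every observation above the last bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value

    def cumulative_buckets(self):
        """Return (upper_bound, count of observations <= upper_bound) pairs."""
        total = 0
        result = []
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            total += bucket_count
            result.append((bound, total))
        return result


if __name__ == "__main__":
    durations = Histogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds
    for seconds in (0.05, 0.3, 0.5, 2.0):
        durations.observe(seconds)
    print(durations.cumulative_buckets())
```

Prometheus exposes histogram buckets cumulatively, which is why `cumulative_buckets` reports running totals rather than per-bucket counts.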
Metric labels
Prometheus differentiates samples with the same metric name by labeling them. A label is a certain attribute of a metric. Metrics with the same name must have the same value for the variable_labels field. The following table lists the names and meanings of common labels of Milvus metrics.
Label name | Definition | Values |
---|---|---|
"node_id" | The unique identity of a role. | A globally unique ID generated by Milvus. |
"status" | The status of a processed operation or request. | "abandon", "success", or "fail". |
"query_type" | The type of a read request. | "search" or "query". |
"msg_type" | The type of messages. | "insert", "delete", "search", or "query". |
"segment_state" | The status of a segment. | "Sealed", "Growing", "Flushed", "Flushing", "Dropped", or "Importing". |
"cache_state" | The status of a cached object. | "hit" or "miss". |
"cache_name" | The name of a cached object. This label is used together with the label "cache_state". | E.g. "CollectionID", "Schema", etc. |
"channel_name" | Physical topics in message storage (Pulsar or Kafka). | E.g. "by-dev-rootcoord-dml_0", "by-dev-rootcoord-dml_255", etc. |
"function_name" | The name of a function that handles certain requests. | E.g. "CreateCollection", "CreatePartition", "CreateIndex", etc. |
"user_name" | The user name used for authentication. | A user name of your preference. |
"index_task_status" | The status of an index task in meta storage. | "unissued", "in-progress", "failed", "finished", or "recycled". |
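As a rough illustration of how labels separate samples that share one metric name, the sketch below parses a few lines in the Prometheus text exposition style and totals them by one label. The metric name `milvus_proxy_example_requests`, the sample values, and the helper functions are all made up for this example. The parser is deliberately simplified: it ignores timestamps, comment lines, and escaped quotes inside label values.

```python
import re
from collections import defaultdict

# Simplified patterns: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Parse one sample line into (metric_name, labels_dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def total_by_label(lines, label_name):
    """Sum sample values grouped by the value of one label."""
    totals = defaultdict(float)
    for line in lines:
        _, labels, value = parse_sample(line)
        totals[labels.get(label_name, "")] += value
    return dict(totals)


# Hypothetical samples: same metric name, different label values.
SAMPLES = [
    'milvus_proxy_example_requests{node_id="1",status="success"} 40',
    'milvus_proxy_example_requests{node_id="1",status="fail"} 2',
    'milvus_proxy_example_requests{node_id="2",status="success"} 7',
]

if __name__ == "__main__":
    print(total_by_label(SAMPLES, "status"))
    print(total_by_label(SAMPLES, "node_id"))
```

The three lines share a metric name but are distinct samples because their label values differ, so they can be grouped by "status" or by "node_id" independently.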
Grafana in Milvus
Grafana is an open-source visualization stack that can connect to a wide range of data sources. By querying metrics, it helps users understand, analyze, and monitor large volumes of data.
Milvus uses Grafana's customizable dashboards for metric visualization.
What's next
After learning about the basic workflow of monitoring and alerting, learn: