This topic explains how Milvus uses Prometheus to monitor metrics and Grafana to visualize metrics and create alerts.
Prometheus is an open-source monitoring and alerting toolkit that is widely used in Kubernetes deployments. It collects and stores metrics as time-series data: each sample is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
Currently, Milvus uses the following components of Prometheus:
- Prometheus endpoint to pull data from endpoints set by exporters.
- Prometheus operator to effectively manage Prometheus monitoring instances.
- Kube-prometheus to provide easy-to-operate, end-to-end Kubernetes cluster monitoring.
A valid metric name in Prometheus contains three elements: namespace, subsystem, and name. These three elements are connected with "_".
The namespace of Milvus metrics monitored by Prometheus is "milvus". Depending on the role that a metric belongs to, its subsystem should be one of the following eight roles: "rootcoord", "proxy", "querycoord", "querynode", "indexcoord", "indexnode", "datacoord", "datanode".
For instance, the Milvus metric that calculates the total number of vectors queried is named
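The naming convention described above can be sketched in a few lines of Python. This is only an illustration, not Milvus code: the helper names `build_metric_name` and `split_metric_name` and the metric name `example_requests_total` are hypothetical, while the namespace and the eight subsystems are taken from the text.

```python
from typing import Tuple

# Namespace and subsystems as listed in the text above.
NAMESPACE = "milvus"
SUBSYSTEMS = (
    "rootcoord", "proxy", "querycoord", "querynode",
    "indexcoord", "indexnode", "datacoord", "datanode",
)


def build_metric_name(subsystem: str, name: str) -> str:
    """Join namespace, subsystem, and name with underscores."""
    if subsystem not in SUBSYSTEMS:
        raise ValueError(f"unknown subsystem: {subsystem!r}")
    return "_".join((NAMESPACE, subsystem, name))


def split_metric_name(full_name: str) -> Tuple[str, str, str]:
    """Split a full metric name back into (namespace, subsystem, name)."""
    # Split only on the first two underscores; the name itself may contain more.
    namespace, subsystem, name = full_name.split("_", 2)
    if namespace != NAMESPACE or subsystem not in SUBSYSTEMS:
        raise ValueError(f"not a Milvus-style metric name: {full_name!r}")
    return namespace, subsystem, name


# "example_requests_total" is a hypothetical metric name used for illustration.
full = build_metric_name("proxy", "example_requests_total")
print(full)
print(split_metric_name(full))
```

Splitting on only the first two underscores matters because the final element of a metric name usually contains underscores itself.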
Prometheus supports four types of metrics:
- Counter: a cumulative metric whose value can only increase or be reset to zero upon restart.
- Gauge: a metric whose value can go up or down.
- Histogram: a metric that counts observations in configurable buckets. A common example is request duration.
- Summary: a metric similar to a histogram that calculates configurable quantiles over a sliding time window.
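To make the differences between the four types concrete, here is a toy, standard-library-only model of their semantics. It is a simplified sketch and not the Prometheus client API: real client libraries add labels, thread safety, and streaming quantile estimation, and the class and method names below are chosen for illustration only.

```python
import bisect
import math
import time
from collections import deque


class Counter:
    """Cumulative value that can only increase (a restart recreates it at zero)."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations in configurable buckets, with a running sum and count."""
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Cumulative counts per bucket (each bucket includes all smaller ones)."""
        total, out = 0, []
        for c in self.bucket_counts:
            total += c
            out.append(total)
        return out


class Summary:
    """Nearest-rank quantiles over a sliding time window (stores raw samples)."""
    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs in arrival order

    def observe(self, value, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, value))

    def quantile(self, q, now=None):
        now = time.monotonic() if now is None else now
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()
        if not self.samples:
            return None
        values = sorted(v for _, v in self.samples)
        return values[max(0, math.ceil(q * len(values)) - 1)]


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(latency.cumulative())  # cumulative counts for <=0.1, <=0.5, <=1.0, +Inf
```

The histogram keeps only fixed-size bucket counters, whereas this summary has to retain the samples inside its window. That difference is one reason histograms are generally cheaper to record and easier to aggregate across instances.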
Prometheus differentiates samples with the same metric name by labeling them. A label is a certain attribute of a metric. Metrics with the same name must have the same value for the variable_labels field. The following table lists the names and meanings of common labels of Milvus metrics.
|Label name||Definition||Option/Value|
|"node_id"||The unique identity of a role.||A globally unique ID generated by Milvus.|
|"status"||The status of a processed operation or request.||"abandon", "success", or "fail".|
|"query_type"||The type of a read request.||"search" or "query".|
|"msg_type"||The type of messages.||"insert", "delete", "search", or "query".|
|"segment_state"||The status of a segment.||"Sealed", "Growing", "Flushed", "Flushing", "Dropped", or "Importing".|
|"cache_state"||The status of a cached object.||"hit" or "miss".|
|"cache_name"||The name of a cached object. This label is used together with the label "cache_state".||E.g., "CollectionID", "Schema", etc.|
|"channel_name"||Physical topics in message storage (Pulsar or Kafka).||E.g., "by-dev-rootcoord-dml_0", "by-dev-rootcoord-dml_255", etc.|
|"function_name"||The name of a function that handles certain requests.||E.g., "CreateCollection", "CreatePartition", "CreateIndex", etc.|
|"user_name"||The user name used for authentication.||A user name of your preference.|
|"index_task_status"||The status of an index task in meta storage.||"unissued", "in-progress", "failed", "finished", or "recycled".|
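As an illustration of how labels separate samples that share a metric name, the sketch below parses a few lines in the Prometheus text exposition format and filters them by label. The metric name `milvus_proxy_example_requests_total` and the sample values are made up for this example, and the helpers `parse_sample` and `select` are hypothetical. The parser is deliberately simplified: it ignores optional timestamps and does not unescape label values.

```python
import re

# Simplified patterns for one sample line: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; names and values are invented for illustration.
SCRAPE = """\
milvus_proxy_example_requests_total{node_id="1",status="success"} 12
milvus_proxy_example_requests_total{node_id="1",status="fail"} 3
milvus_proxy_example_requests_total{node_id="2",status="success"} 7
"""


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one exposition-format line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def select(samples, **wanted):
    """Keep samples whose labels match every key/value pair in `wanted`."""
    return [
        s for s in samples
        if all(s[1].get(key) == value for key, value in wanted.items())
    ]


samples = [parse_sample(line) for line in SCRAPE.splitlines() if line.strip()]
successes = select(samples, status="success")
total_success = sum(value for _, _, value in successes)
print(len(successes), total_success)
```

All three samples share one metric name, but each combination of "node_id" and "status" is a separate time series, which is what makes per-node or per-status filtering possible.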
Grafana is an open-source visualization stack that can connect to a wide range of data sources. By querying metrics from those sources, it helps users understand, analyze, and monitor large volumes of data.
Milvus uses Grafana's customizable dashboards for metric visualization.
After learning about the basic workflow of monitoring and alerting, learn: