RabbitMQ ships with a native Prometheus plugin, which by default has granular metrics disabled. Granular metrics are per-queue/per-vhost metrics: detailed metrics that provide message lag and consumer information on a per-queue and per-vhost basis. You could enable granular per-object metrics, but this is not recommended: the plugin becomes much slower on a large cluster, and the label cardinality in your time series database can become very high.
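For reference, this is the switch we want to avoid flipping globally; a minimal rabbitmq.conf sketch, shown only so you know what the per-object option looks like:
# rabbitmq.conf - returns per-object (per-queue/per-vhost) metrics on the main /metrics endpoint.
# Not recommended on large clusters because of scrape time and label cardinality.
prometheus.return_per_object_metrics = true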
One way to work around this is the unofficial OSS RabbitMQ exporter written by kbudde, which lets you enable granular metrics and selectively disable the metrics the native Prometheus plugin already provides. In practice this means a mixed approach: the unofficial exporter serves the detailed per-queue metrics (with everything else disabled), while the native RabbitMQ Prometheus plugin serves all other metrics.
However, the native RabbitMQ Prometheus plugin recently added another endpoint that serves detailed metrics which are both configurable and granular. This lets us avoid running two exporters and worrying about metric duplicates, label naming, multiple datasources for dashboards, and other issues. It also does not require enabling per-object metrics for the whole Prometheus plugin, which is slow and can cause high label cardinality. This post walks through how to scrape the detailed metrics and provides dashboards and alerts for them.
Scraping the Detailed Endpoint
The Prometheus plugin accepts two HTTP query parameters, family and vhost, on the /metrics/detailed path. The family parameter selects which metric families to return and the vhost parameter filters those metrics to specific virtual hosts. An example of the HTTP path with both query parameters would be /metrics/detailed?vhost=test&family=queue_coarse_metrics.
RabbitMQ provides a detailed description here.
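A quick way to see what a family returns is to query the endpoint directly; a minimal sketch, assuming the plugin listens on its default port 15692 and a vhost named test exists:
# Fetch two metric families, filtered to the test vhost.
curl -s 'http://localhost:15692/metrics/detailed?vhost=test&family=queue_coarse_metrics&family=queue_consumer_count'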
For our per-queue dashboards and alerts we need two families of metrics: queue_coarse_metrics, which provides ready/unacked/total message counts plus reductions per queue, and queue_consumer_count, which provides the consumer count per queue (exposed as the rabbitmq_detailed_queue_consumers metric). For these requirements we'll add the parameters ?family=queue_coarse_metrics&family=queue_consumer_count and we won't filter by any vhost. As I'm using the Prometheus Operator, we'll define a ServiceMonitor for the detailed endpoint. We'll keep the default ServiceMonitor that scrapes the non-detailed cluster-wide metrics, and add a detailed one that scrapes just these narrowed-down detailed metrics. The detailed ServiceMonitor is defined below - the path and params for the endpoint are the important bits.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: rabbitmq
    app.kubernetes.io/name: rabbitmq
    argocd.argoproj.io/instance: message-queue
    tanka.dev/environment: 9013254899e6601a6bf94f789e62faf2903ea90287e4fdc7
  name: rabbitmq-detailed
  namespace: staging
spec:
  endpoints:
    - interval: 30s
      params:
        family:
          - queue_coarse_metrics
          - queue_consumer_count
      path: /metrics/detailed
      port: metrics
  namespaceSelector:
    matchNames:
      - staging
  selector:
    matchLabels:
      app.kubernetes.io/instance: rabbitmq
      app.kubernetes.io/name: rabbitmq
I also recommend adding a Prometheus recording rule that defines a new metric called rabbitmq_queue_info. It joins the default rabbitmq_identity_info metric with the detailed consumer metric on the instance/cluster/node labels, and is used in the dashboard to filter on and select the rabbitmq_cluster. Add the rule below to your Prometheus config.
- "name": "rabbitmq.rules"
"rules":
- "expr": |
rabbitmq_detailed_queue_consumers * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)
"record": "rabbitmq_queue_info"
Now that we have our metrics in, let's add a dashboard for the queue metrics.
Visualizing our Metrics
As the default and very popular RabbitMQ dashboard already provides a cluster overview, we'll focus only on a dashboard with detailed visualizations per queue and vhost. Therefore, it should have:
- Grafana templates for the cluster, vhosts and queues.
- A graph that shows ready & unacked messages per vhost (a sample query is sketched below this list).
- A graph that shows ready & unacked messages per queue.
- A table that shows consumers & queue length per queue.
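The per-vhost graphs can reuse the same join pattern as the recording rule; a minimal sketch of the "ready messages per vhost" query, assuming the queue_coarse_metrics family exposes rabbitmq_detailed_queue_messages_ready:
# Ready messages per vhost, joined with rabbitmq_identity_info to pick up the cluster label.
sum by (rabbitmq_cluster, vhost) (
  rabbitmq_detailed_queue_messages_ready
  * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)
)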
The dashboard below meets these requirements and visualizes the coarse metrics that the RabbitMQ plugin provides. It fills the gap in the official default RabbitMQ dashboard, which lacks a breakdown per vhost and queue.
The dashboard is published and can be imported from the Grafana dashboard library.
Alerting on our Metrics
Now that we have our rules and visualizations in place, we'd like to alert on the things the dashboard shows. The Prometheus alerts are:
- Too many messages in a queue
- "alert": "RabbitmqTooManyMessagesInQueue"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "More than 100 messages in the queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} for the past 2 minutes."
"summary": "RabbitMQ too many messages in queue."
"expr": |
sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_messages * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 100
"for": "2m"
"labels":
"severity": "warning"
- No consumers for a queue
- "alert": "RabbitmqNoConsumer"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "The queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} has 0 consumers for the past 2 minutes."
"summary": "RabbitMQ queue has no consumers."
"expr": |
sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_consumers{queue!~".*dlx.*"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) == 0
"for": "2m"
"labels":
"severity": "warning"
- Unroutable messages per cluster
- "alert": "RabbitmqUnroutableMessages"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "The Rabbitmq cluster {{ $labels.rabbitmq_cluster }} has unroutable messages for the past 2 minutes."
"summary": "The Rabbitmq cluster has unroutable messages."
"expr": |
sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_dropped_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 0 or
sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_returned_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 0
"for": "2m"
"labels":
"severity": "info"
Summary
The dashboards and alerts are extracted from the RabbitMQ mixin I've written. A Prometheus mixin is a library, written in jsonnet, that bundles alerts and dashboards. The Grafana dashboard is available in the dashboard library.
Hopefully this post simplifies your setup down to a single exporter, and the dashboards and alerts give you detailed per-queue visualizations and alerting out of the box.