RabbitMQ ships with a native Prometheus plugin, which by default has granular metrics disabled. Granular metrics are per-queue/per-vhost metrics: detailed metrics that provide message lag and consumer information on a per-queue and per-vhost basis. You could enable granular per-object metrics, but this is not recommended: the plugin becomes much slower on a large cluster, and the label cardinality in your time series database can become very high.
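For reference, this is the switch we want to avoid flipping globally; a minimal rabbitmq.conf sketch, shown only so you know what the per-object option looks like:
# rabbitmq.conf - returns per-object (per-queue/per-vhost) metrics on the main /metrics endpoint.
# Not recommended on large clusters because of scrape time and label cardinality.
prometheus.return_per_object_metrics = true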
One way to work around this is the unofficial OSS RabbitMQ exporter written by kbudde, which lets you enable granular metrics and selectively disable the metrics the native Prometheus plugin already provides. In practice this means a mixed approach: the unofficial exporter serves the detailed per-queue metrics (with everything else disabled), while the native RabbitMQ Prometheus plugin serves all other metrics.
However, the native RabbitMQ Prometheus plugin recently added another endpoint that serves detailed metrics which are both configurable and granular. This lets us avoid running two exporters and worrying about metric duplicates, label naming, multiple datasources for dashboards, and other issues. It also does not require enabling per-object metrics for the whole Prometheus plugin, which is slow and can cause high label cardinality. This post walks through how to scrape the detailed metrics and provides dashboards and alerts for them.
Scraping the Detailed Endpoint
The Prometheus plugin accepts two HTTP query parameters, family and vhost, on the /metrics/detailed path. The family parameter selects which metric families to return and the vhost parameter filters those metrics to specific virtual hosts. An example of the HTTP path with both query parameters would be /metrics/detailed?vhost=test&family=queue_coarse_metrics.
RabbitMQ provides a detailed description here.
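A quick way to see what a family returns is to query the endpoint directly; a minimal sketch, assuming the plugin listens on its default port 15692 and a vhost named test exists:
# Fetch two metric families, filtered to the test vhost.
curl -s 'http://localhost:15692/metrics/detailed?vhost=test&family=queue_coarse_metrics&family=queue_consumer_count'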
For our per-queue dashboards and alerts we need two families of metrics: queue_coarse_metrics, which provides ready/unacked/total message counts plus reductions per queue, and queue_consumer_count, which provides the consumer count per queue (exposed as the rabbitmq_detailed_queue_consumers metric). For these requirements we'll add the parameters ?family=queue_coarse_metrics&family=queue_consumer_count and we won't filter by any vhost. As I'm using the Prometheus Operator, we'll define a ServiceMonitor for the detailed endpoint. We'll keep the default ServiceMonitor that scrapes the non-detailed cluster-wide metrics, and add a detailed one that scrapes just these narrowed-down detailed metrics. The detailed ServiceMonitor is defined below - the path and params for the endpoint are the important bits.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: rabbitmq
    app.kubernetes.io/name: rabbitmq
    argocd.argoproj.io/instance: message-queue
    tanka.dev/environment: 9013254899e6601a6bf94f789e62faf2903ea90287e4fdc7
  name: rabbitmq-detailed
  namespace: staging
spec:
  endpoints:
    - interval: 30s
      params:
        family:
          - queue_coarse_metrics
          - queue_consumer_count
      path: /metrics/detailed
      port: metrics
  namespaceSelector:
    matchNames:
      - staging
  selector:
    matchLabels:
      app.kubernetes.io/instance: rabbitmq
      app.kubernetes.io/name: rabbitmq
I also recommend adding a Prometheus recording rule that defines a new metric called rabbitmq_queue_info. It joins the default rabbitmq_identity_info metric with the detailed consumer metric on the instance/cluster/node labels, and is used in the dashboard to filter on and select the rabbitmq_cluster. Add the rule below to your Prometheus config.
- "name": "rabbitmq.rules"
"rules":
- "expr": |
rabbitmq_detailed_queue_consumers * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)
"record": "rabbitmq_queue_info"
Now that we have our metrics in, let's add a dashboard for the queue metrics.
Visualizing our Metrics
As the default and very popular RabbitMQ dashboard already provides a cluster overview, we'll focus only on a dashboard with detailed visualizations per queue and vhost. Therefore, it should have:
- Grafana templates for the cluster, vhosts and queues.
- A graph that shows ready & unacked messages per vhost (a sample query is sketched below this list).
- A graph that shows ready & unacked messages per queue.
- A table that shows consumers & queue length per queue.
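The per-vhost graphs can reuse the same join pattern as the recording rule; a minimal sketch of the "ready messages per vhost" query, assuming the queue_coarse_metrics family exposes rabbitmq_detailed_queue_messages_ready:
# Ready messages per vhost, joined with rabbitmq_identity_info to pick up the cluster label.
sum by (rabbitmq_cluster, vhost) (
  rabbitmq_detailed_queue_messages_ready
  * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)
)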
The dashboard below meets these requirements and visualizes the coarse metrics that the RabbitMQ plugin provides. It fills the gap in the official default RabbitMQ dashboard, which lacks a breakdown per vhost and queue.
The dashboard is published and can be imported from the Grafana dashboard library.
Alerting on our Metrics
Now that we have our rules and visualizations in place, we'd like to alert on the things the dashboard shows. The Prometheus alerts are:
- Too many messages in a queue
- "alert": "RabbitmqTooManyMessagesInQueue"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "More than 100 messages in the queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} for the past 2 minutes."
"summary": "RabbitMQ too many messages in queue."
"expr": |
sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_messages * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 100
"for": "2m"
"labels":
"severity": "warning"
- No consumers for a queue
- "alert": "RabbitmqNoConsumer"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "The queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} has 0 consumers for the past 2 minutes."
"summary": "RabbitMQ queue has no consumers."
"expr": |
sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_consumers{queue!~".*dlx.*"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) == 0
"for": "2m"
"labels":
"severity": "warning"
- Unroutable messages per cluster
- "alert": "RabbitmqUnroutableMessages"
"annotations":
"dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
"description": "The Rabbitmq cluster {{ $labels.rabbitmq_cluster }} has unroutable messages for the past 2 minutes."
"summary": "The Rabbitmq cluster has unroutable messages."
"expr": |
sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_dropped_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 0 or
sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_returned_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (rabbitmq_cluster, instance, rabbitmq_node)) > 0
"for": "2m"
"labels":
"severity": "info"
Summary
The dashboards and alerts are extracted from the RabbitMQ mixin I've written. A Prometheus mixin is a library, written in jsonnet, that bundles alerts and dashboards. The Grafana dashboard is available in the dashboard library.
Hopefully this post simplifies your setup down to a single exporter, and the dashboards and alerts give you detailed per-queue visualizations and alerting out of the box.