RabbitMQ ships with a native, built-in Prometheus plugin, and by default its granular metrics are disabled. Granular metrics are per-queue and per-vhost metrics: detailed metrics that expose message lag and consumer information for each queue and vhost. You could enable granular per-object metrics, but this is not recommended: the plugin becomes much slower on a large cluster, and the label cardinality in your time series database can become high.
To solve this you could use the unofficial OSS RabbitMQ exporter written by kbudde, which lets you enable granular metrics and also disable specific metrics that the native Prometheus plugin already provides. The unofficial exporter describes a mixed approach: use the unofficial exporter for detailed metrics only, with all of its other metrics disabled, and use the native RabbitMQ Prometheus plugin for everything else.
However, the native RabbitMQ Prometheus plugin recently added another endpoint that provides detailed metrics that are both configurable and granular. This lets us avoid running two exporters and worrying about metric duplicates, label naming, and multiple datasources for dashboards, amongst other issues. It also does not require enabling per-object metrics for the whole Prometheus plugin, which is slow and can cause high label cardinality. This post walks through how to scrape the detailed metrics and also provides dashboards and alerts for them.
The Prometheus plugin accepts two HTTP query parameters, family and vhost, on the /metrics/detailed path. The family parameter indicates which metrics to return, and the vhost parameter indicates which virtual hosts to filter those metrics against. An example of the HTTP path with both query parameters would be /metrics/detailed?vhost=test&family=queue_coarse_metrics.
RabbitMQ provides a detailed description here.
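Both parameters can be repeated, which is easy to get wrong when building the path by hand. A minimal sketch of assembling it in Python follows; the helper name detailed_metrics_path is hypothetical and only the path and parameter names come from the plugin.

```python
from urllib.parse import urlencode


def detailed_metrics_path(families, vhosts=()):
    """Build the /metrics/detailed path with repeated family/vhost parameters.

    Hypothetical helper for illustration; not part of RabbitMQ or Prometheus.
    """
    # Each family and vhost becomes its own key=value pair in the query string.
    params = [("family", f) for f in families] + [("vhost", v) for v in vhosts]
    query = urlencode(params)
    return "/metrics/detailed" + ("?" + query if query else "")


# The two families used later in this post, with no vhost filter.
print(detailed_metrics_path(["queue_coarse_metrics", "queue_consumer_count"]))
# A single family restricted to the "test" vhost.
print(detailed_metrics_path(["queue_coarse_metrics"], vhosts=["test"]))
```

Passing the parameters as a list of pairs, rather than a dict, is what allows the same key to appear more than once.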
For our per-queue dashboards and alerts we need two families of metrics: queue_coarse_metrics, which provides the ready/unacked/total message counts and reductions for a queue, and queue_consumer_count, which provides the consumer count for a queue through the rabbitmq_detailed_queue_consumers metric. For these requirements we'll add the parameters ?family=queue_coarse_metrics&family=queue_consumer_count and we won't filter by any vhost. As I'm using the Prometheus Operator, we'll define a ServiceMonitor for the detailed endpoint. We'll keep the default ServiceMonitor that scrapes non-detailed cluster-wide metrics and add a detailed one that scrapes only the narrowed per-queue metrics. The detailed ServiceMonitor is defined below; the path and params for the endpoint are the important bits.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: rabbitmq
    app.kubernetes.io/name: rabbitmq
    argocd.argoproj.io/instance: message-queue
    tanka.dev/environment: 9013254899e6601a6bf94f789e62faf2903ea90287e4fdc7
  name: rabbitmq-detailed
  namespace: staging
spec:
  endpoints:
  - interval: 30s
    params:
      family:
      - queue_coarse_metrics
      - queue_consumer_count
    path: /metrics/detailed
    port: metrics
  namespaceSelector:
    matchNames:
    - staging
  selector:
    matchLabels:
      app.kubernetes.io/instance: rabbitmq
      app.kubernetes.io/name: rabbitmq
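Since label cardinality is the main concern with per-object metrics, it can help to look at how many series a single scrape of the detailed endpoint returns. The sketch below counts series per metric name in a response body in the Prometheus text format. The SAMPLE text and its queue names are made up for illustration, and the parser is deliberately simple rather than a full implementation of the format.

```python
from collections import Counter

# Assumed, hand-written sample of a detailed scrape; real output will differ.
SAMPLE = """\
# TYPE rabbitmq_detailed_queue_messages gauge
rabbitmq_detailed_queue_messages{vhost="/",queue="orders"} 12
rabbitmq_detailed_queue_messages{vhost="/",queue="emails"} 0
# TYPE rabbitmq_detailed_queue_consumers gauge
rabbitmq_detailed_queue_consumers{vhost="/",queue="orders"} 2
rabbitmq_detailed_queue_consumers{vhost="/",queue="emails"} 0
"""


def series_per_metric(text):
    """Count how many series each metric name has in a scrape body."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or, without labels, at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


for name, count in sorted(series_per_metric(SAMPLE).items()):
    print(name, count)
```

Running the same function against the body of a real scrape gives a rough idea of how the series count grows with the number of queues.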
I also recommend adding a Prometheus recording rule that defines a new metric called rabbitmq_queue_info. It joins the default rabbitmq_identity_info metric with the detailed consumer metric on the instance label, which attaches the rabbitmq_cluster and rabbitmq_node labels to each per-queue series. The dashboard uses it to filter on and select the rabbitmq_cluster. Add the rule below to your Prometheus config.
- "name": "rabbitmq.rules"
  "rules":
  - "expr": |
      rabbitmq_detailed_queue_consumers * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info
    "record": "rabbitmq_queue_info"
Now that we have our metrics in let's add a dashboard for the queue metrics.
As the default and very popular RabbitMQ dashboard already provides a cluster overview, we'll focus only on a dashboard with detailed visualizations per queue and vhost. Therefore it should break metrics down by vhost and by queue. The dashboard below covers these requirements and visualizes the coarse metrics that the RabbitMQ plugin exposes.
This fills the gap in the official, default RabbitMQ dashboard: a visualization breakdown per vhost and queue.
The dashboard is published and can be imported from the Grafana dashboard library.
Now that we have our rules and visualization in place we would like to alert on the things the dashboard shows. The Prometheus alerts would be:
- "alert": "RabbitmqTooManyMessagesInQueue"
  "annotations":
    "description": "More than 100 messages in the queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} for the past 2 minutes."
    "summary": "RabbitMQ too many messages in queue."
  "expr": |
    sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_messages * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) > 100
  "for": "2m"
  "labels":
    "severity": "warning"
- "alert": "RabbitmqNoConsumer"
  "annotations":
    "dashboard_url": "https://grafana.com/d/rabbitmq-queue-12mk4klgjweg/rabbitmq-queue"
    "description": "The queue {{ $labels.rabbitmq_cluster }}/{{ $labels.vhost }}/{{ $labels.queue }} has 0 consumers for the past 2 minutes."
    "summary": "RabbitMQ queue has no consumers."
  "expr": |
    sum by (rabbitmq_cluster, instance, vhost, queue) (rabbitmq_detailed_queue_consumers{queue!~".*dlx.*"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) == 0
  "for": "2m"
  "labels":
    "severity": "warning"
- "alert": "RabbitmqUnroutableMessages"
  "annotations":
    "description": "The Rabbitmq cluster {{ $labels.rabbitmq_cluster }} has unroutable messages for the past 2 minutes."
    "summary": "The Rabbitmq cluster has unroutable messages."
  "expr": |
    sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_dropped_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) > 0 or
    sum by(rabbitmq_node, rabbitmq_cluster) (rate(rabbitmq_channel_messages_unroutable_returned_total[1m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) > 0
  "for": "2m"
  "labels":
    "severity": "info"
The dashboards and alerts are extracted from the RabbitMQ-mixin I've written. A Prometheus mixin is a library of alerts and dashboards written in Jsonnet. The Grafana dashboard is available in the dashboard library.
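To spot-check the RabbitmqNoConsumer condition outside Prometheus, the same filter can be applied to a raw scrape of the detailed endpoint. The following is a rough sketch, not part of the mixin: the SCRAPE text, its queue names, and the function name are assumptions, and the regular expressions only handle simple label values without braces or escaped quotes.

```python
import re

# Assumed, hand-written sample of a detailed scrape; real output will differ.
SCRAPE = """\
# TYPE rabbitmq_detailed_queue_consumers gauge
rabbitmq_detailed_queue_consumers{vhost="/",queue="orders"} 2
rabbitmq_detailed_queue_consumers{vhost="/",queue="emails"} 0
rabbitmq_detailed_queue_consumers{vhost="/",queue="orders.dlx"} 0
"""

SERIES = re.compile(r'^rabbitmq_detailed_queue_consumers\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def queues_without_consumers(text):
    """Return (vhost, queue) pairs with zero consumers, skipping dlx queues."""
    result = []
    for line in text.splitlines():
        match = SERIES.match(line.strip())
        if not match:
            continue  # comments and other metrics are ignored
        labels = dict(LABEL.findall(match.group("labels")))
        # Mirror the alert's queue!~".*dlx.*" matcher.
        if "dlx" in labels.get("queue", ""):
            continue
        if float(match.group("value")) == 0:
            result.append((labels.get("vhost"), labels.get("queue")))
    return result


print(queues_without_consumers(SCRAPE))
```

Unlike the alert, this looks at a single scrape and has no equivalent of the 2 minute "for" window, so a queue that is briefly without consumers will show up here but would not fire the alert.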
Hopefully this post will simplify your setup to use a single exporter. Also, the dashboards and alerts will be a great addition for detailed per-queue visualizations and alerts.