
Celery Monitoring with Prometheus and Grafana


Celery is a Python project used for asynchronous job processing and task scheduling in web applications or distributed systems. It is very commonly used together with Django, with Celery as the asynchronous job processor and Django as the web framework. Celery has great documentation on how to use it, deploy it and integrate it with Django. However, monitoring is less covered, and that is what this blog post aims to address. There is a great Prometheus exporter for Celery that comes with dashboards and alerts.

The Prometheus exporter for Celery can be found here. This blog post is based on the metrics exposed by that exporter.

There are already two dashboards that are published in Grafana:

  • Celery Tasks Overview - a simple overview of tasks, queues and workers.
  • Celery Tasks by Task - a per-task breakdown that shows compute-expensive metrics such as task runtime buckets, alongside task completions, failures, retries and exceptions.

There are also Prometheus alerts stored on GitHub that you can import, covering success rates, queue length and worker uptime.

The dashboards and alerts are a work in progress, so feel free to share feedback in the Celery-exporter repository about what you would like to see or any issues you experience.

If you want to go directly to the dashboards, use the links above; the rest of the blog post describes setting up the Celery-exporter and the various alerts and dashboards.

Setting up Celery and Celery-exporter

First, we need to ensure Celery sends task events, as the exporter depends on them. Task events are enabled through the CELERY_WORKER_SEND_TASK_EVENTS setting; we also want to enable sending the SENT event. The SENT event indicates when a task is sent by a client, while the RECEIVED event (enabled by default) tracks when a task is received by a worker. Having both events will show the difference between the tasks clients send and the tasks workers receive. The settings below can be appended to your Django/Celery application.

# https://docs.celeryq.dev/en/stable/userguide/configuration.html#worker-send-task-events
CELERY_WORKER_SEND_TASK_EVENTS = True
# https://docs.celeryq.dev/en/stable/userguide/configuration.html#std-setting-task_send_sent_event
CELERY_TASK_SEND_SENT_EVENT = True

Now Celery will emit events, and the Celery-exporter subscribes to these events and turns them into metrics. The Celery-exporter has Docker images published at danihodovic/celery-exporter, and Docker is the default way of deploying it.

To run it you just need to configure the broker URL that Celery uses. To do this, set the CE_BROKER_URL environment variable:

CE_BROKER_URL=redis://<my redis url for example>

Additionally, I'd recommend setting up histogram buckets that are better suited to your use case. The Celery-exporter's defaults are the Prometheus client's default buckets, which are tuned for request latencies and range roughly from a few milliseconds up to 10 seconds. Since Celery is used for asynchronous processing, there's a high probability your tasks run longer than most of these buckets. For one of my projects, I've set them to the following:

CE_BUCKETS=1,2.5,5,10,30,60,300,600,900,1800
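
For reference, a minimal Docker Compose sketch that wires both variables into the exporter could look like the following. The service names and the Redis URL are placeholders; adjust them for your environment:

# docker-compose.yml - illustrative sketch, adjust names and the broker URL
services:
  celery-exporter:
    image: danihodovic/celery-exporter
    environment:
      # Point this at the same broker your Celery app uses
      CE_BROKER_URL: redis://redis:6379/0
      # Buckets (in seconds) tuned for longer-running tasks
      CE_BUCKETS: "1,2.5,5,10,30,60,300,600,900,1800"
    ports:
      # The exporter serves metrics on port 9808
      - "9808:9808"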

Now that the environment variables are set, you can run the Docker image and add the endpoint <your-celery-exporter-endpoint>:9808/metrics as a scrape target in your Prometheus configuration.
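
As a sketch, the corresponding scrape configuration could look like this (the job name is arbitrary, and the target placeholder is the same as above):

# prometheus.yml - illustrative scrape job for the exporter
scrape_configs:
  - job_name: celery-exporter
    static_configs:
      - targets:
          # Replace with the host/port where the exporter is reachable
          - <your-celery-exporter-endpoint>:9808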

Helm Chart

The Celery-exporter also comes with a Helm chart. It is hosted at https://danihodovic.github.io/celery-exporter. It supports both Prometheus scrape annotations and the Prometheus Operator's ServiceMonitor custom resource. With the Helm values below, the exporter should be deployable to your Kubernetes cluster:

env:
    - name: "CE_BROKER_URL"
      valueFrom:
        secretKeyRef:
          key: "redisUrl"
          name: "<my-redis-secret>"
    - name: "CE_BUCKETS"
      value: "1,2.5,5,10,30,60,300,600,900,1800"
podAnnotations:
  prometheus.io/scrape: "true"
serviceMonitor:
  enabled: true

The Helm chart source can be found here.

Grafana Dashboards

As mentioned previously, the Celery-mixin has two dashboards: a Celery overview dashboard and a per-task breakdown dashboard. They are split into two because a single dashboard would otherwise contain too many graphs. In addition, filters would only apply to a portion of the panels, since not all metrics contain the filtered labels, which makes it unclear when they take effect; and some expensive metrics would issue heavy queries against your Prometheus backend if no filters were applied.

The upcoming sections will describe each dashboard.

Celery Tasks Overview

The Celery overview dashboard focuses on providing an overview of your entire Celery system. The core parts of the dashboard are:

  • Summary - a section that summarizes the state of your Celery setup
    • Number of workers
    • Tasks active
    • Tasks received in the last week
    • Success rate over the last week (a query sketch follows below)
    • Average runtime over the last week
    • Top failing tasks over the last week
    • Top task exceptions over the last week
    • Top average runtime by task over the last week
  • Queues - a section that covers queue length
  • Tasks - a section that covers task stats
    • Task stats table - instant insight into all task states and success rates
    • Task state over time - a graph visualizing task state over time
    • Task runtime over time - a graph visualizing runtime over time
[Screenshot: Celery Tasks Overview dashboard]
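
To give an idea of what powers the success-rate panels, a query along the following lines computes the share of succeeded tasks over the last week. The metric names are assumptions based on the exporter's task counters, so check your /metrics output for the exact names:

# Fraction of tasks that succeeded over the last week (illustrative PromQL)
sum(rate(celery_task_succeeded_total[1w]))
/
(sum(rate(celery_task_succeeded_total[1w])) + sum(rate(celery_task_failed_total[1w])))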

Celery Tasks by Task

The Celery tasks by task dashboard focuses on providing a breakdown of specific tasks and visualizing the more expensive metrics, such as task runtime. The core parts of the dashboard are:

  • Filters - allow us to filter by queue and task, and are applied to the majority of panels
  • Tasks - a section that covers task stats
    • Task stats table - instant insight into all task states and success rates
    • Task exceptions table - instant insight into all exceptions
    • Task state over time - a graph visualizing task state over time
    • Task exceptions over time - a graph visualizing exceptions over time
    • Task runtime over time - a graph visualizing runtime over time

Note: some views are replicated from the overview dashboard; just remember that they are now broken down by task, which is why they live in a separate dashboard.

[Screenshot: Celery Tasks by Task dashboard]

Alerts

Alerts are trickier to get right for a generic use case; however, they are still provided by the Celery-mixin. They are also configurable through the config.libsonnet file in the repository, and if you are familiar with Jsonnet then customizing the alerts should be fairly straightforward. The alerts can be found on GitHub; I'll describe each alert below and then sketch what one of them looks like as a rendered Prometheus rule.

  • Alert name: CeleryTaskHighFailRate

Alerts when more than 5% of a specific task's executions have failed over the past 10 minutes.

  • Alert name: CeleryHighQueueLength

Alerts when the queue length for a specific queue is higher than 100 for 20 minutes.

  • Alert name: CeleryWorkerDown

Alerts when a worker is offline for more than 15 minutes.
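
To make the thresholds concrete, below is a rough sketch of what the fail-rate alert could look like as a rendered Prometheus rule. The actual rules are generated from the Jsonnet in the repository, and the metric and label names here are assumptions based on the exporter's task counters, so treat this as illustrative rather than the exact rule shipped with the mixin:

# Illustrative alerting rule - the real rules are generated by the Celery-mixin
groups:
  - name: celery
    rules:
      - alert: CeleryTaskHighFailRate
        expr: |
          sum(rate(celery_task_failed_total[10m])) by (name)
            /
          (sum(rate(celery_task_failed_total[10m])) by (name)
            + sum(rate(celery_task_succeeded_total[10m])) by (name))
          > 0.05
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of task {{ $labels.name }} executions failed in the last 10 minutes."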

Summary

The Celery-exporter is a great exporter, and Grafana and Prometheus are amazing open source tools for monitoring. The dashboards and alerts presented in this blog post should be easy to reuse and extend if needed. I think they set a good basis for Celery monitoring, but they can still be improved and adjusted, so if you have any suggestions, please open an issue in the Celery-exporter GitHub repository. I'm looking for any input that helps standardize dashboards and alerts for Celery over time!

I've also written a blog post on Django Monitoring with Prometheus and Grafana!

