Celery is a Python project used for asynchronous job processing and task scheduling in web applications and distributed systems. It is very commonly paired with Django: Celery as the asynchronous job processor and Django as the web framework. Celery has great documentation on how to use it, deploy it and integrate it with Django. Monitoring, however, is less covered, and that is the gap this blog post aims to fill. There is a great Prometheus exporter for Celery, and it comes with dashboards and alerts.
The Prometheus exporter for Celery can be found here. This blog post is based on the metrics exposed by that exporter.
There are already two dashboards that are published in Grafana:
There are also Prometheus alerts on GitHub that you can import; they cover success rates, queue length and worker uptime.
The dashboards and alerts are a work in progress, so feel free to share feedback in the Celery-exporter repository on what you would like to see or any issues you experience.
If you want to go directly to the dashboards you can use the links above. The rest of the blog post describes setting up the Celery-exporter and the various alerts and dashboards.
First, we need to ensure Celery sends task events, as the exporter depends on those. Task events can be enabled through the setting CELERY_WORKER_SEND_TASK_EVENTS. We also want to enable sending the SENT event. The SENT event indicates when a client sends a task, and the RECEIVED event (enabled by default) tracks when a worker receives it. Having both events will show the difference in tasks between clients and workers. The settings below can be appended to your Django/Celery application.
    # https://docs.celeryq.dev/en/stable/userguide/configuration.html#worker-send-task-events
    CELERY_WORKER_SEND_TASK_EVENTS = True
    # https://docs.celeryq.dev/en/stable/userguide/configuration.html#std-setting-task_send_sent_event
    CELERY_TASK_SEND_SENT_EVENT = True
Now Celery will emit events, and the Celery-exporter subscribes to these events and turns them into metrics. The Celery-exporter has Docker images published at danihodovic/celery-exporter, and that is the default way of deploying it.
To run it you just need to configure the broker URL that Celery uses. To do this, set the CE_BROKER_URL environment variable:
CE_BROKER_URL=redis://<my redis url for example>
Additionally, I'd recommend setting up histogram buckets that are better suited to your use case. The Celery-exporter's defaults are the Prometheus client's default buckets, which are designed for request latencies: they range from 5 ms to 10 s. Since Celery is used for asynchronous processing, there's a high probability your tasks run longer than most of these buckets. For one of my projects I've set them to the following:
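To make the bucket trade-off concrete, here is a minimal Python sketch of how a task runtime maps to a histogram bucket. It assumes the exporter's defaults match the default buckets of the Python Prometheus client, and it reuses the CE_BUCKETS value from the Helm values later in this post. The function names and the sample runtimes are made up for illustration; this is not the exporter's own code.

```python
from bisect import bisect_left

# Default bucket upper bounds (seconds) of the Python Prometheus client.
# Assumption: the exporter falls back to these when CE_BUCKETS is unset.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                   0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]

# Custom buckets, parsed the way a comma-separated CE_BUCKETS value could be.
CUSTOM_BUCKETS = [float(b) for b in "1,2.5,5,10,30,60,300,600,900,1800".split(",")]


def bucket_for(duration_seconds, buckets):
    """Return the smallest upper bound (le) that contains the duration.

    Anything above the largest bound lands in the implicit +Inf bucket.
    """
    index = bisect_left(buckets, duration_seconds)
    return buckets[index] if index < len(buckets) else float("inf")


def summarize(durations, buckets):
    """Count how many durations land in each bucket (non-cumulative)."""
    counts = {}
    for duration in durations:
        le = bucket_for(duration, buckets)
        counts[le] = counts.get(le, 0) + 1
    return counts


# Hypothetical task runtimes in seconds, invented for this example.
SAMPLE_RUNTIMES = [0.8, 4.0, 45.0, 120.0, 700.0]

if __name__ == "__main__":
    print("default:", summarize(SAMPLE_RUNTIMES, DEFAULT_BUCKETS))
    print("custom: ", summarize(SAMPLE_RUNTIMES, CUSTOM_BUCKETS))
```

With the default buckets, every task slower than 10 s collapses into +Inf, so percentile panels cannot tell a 45 s task from a 700 s one. The custom buckets keep those runtimes apart.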
Now that the environment variables are set, you can run the Docker image and add the scrape endpoint <your-celery-exporter-endpoint>:9808/metrics to your Prometheus configuration to scrape the metrics.
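Before wiring up Prometheus, it can help to check what the endpoint returns. The sketch below parses a Prometheus text exposition payload in Python and computes the difference between sent and received tasks per task name. The metric names (celery_task_sent_total, celery_task_received_total), the name and hostname labels, and the sample values are assumptions made for illustration, so compare them against what your exporter actually exposes. The parser is deliberately simplified and skips lines that carry a trailing timestamp.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url, timeout=5):
    """Download the raw metrics text, e.g. from http://<host>:9808/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore anything it cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def sent_minus_received(samples):
    """Per task name: tasks sent by clients minus tasks received by workers."""
    diff = defaultdict(float)
    for name, labels, value in samples:
        task = labels.get("name", "unknown")
        if name == "celery_task_sent_total":  # assumed metric name
            diff[task] += value
        elif name == "celery_task_received_total":  # assumed metric name
            diff[task] -= value
    return dict(diff)


# Hypothetical payload, invented for this example.
SAMPLE = """\
# TYPE celery_task_sent_total counter
celery_task_sent_total{name="app.tasks.send_email",hostname="web-1"} 120.0
celery_task_sent_total{name="app.tasks.resize_image",hostname="web-1"} 40.0
# TYPE celery_task_received_total counter
celery_task_received_total{name="app.tasks.send_email",hostname="worker-1"} 118.0
celery_task_received_total{name="app.tasks.resize_image",hostname="worker-1"} 40.0
"""

if __name__ == "__main__":
    print(sent_minus_received(parse_metrics(SAMPLE)))
```

In a real setup you would pass the output of fetch_metrics to parse_metrics instead of the sample string. A difference that keeps growing suggests clients are publishing tasks faster than workers pick them up.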
The Celery-exporter also comes with a Helm chart, hosted at https://danihodovic.github.io/celery-exporter. It supports both setting Prometheus scrape annotations and the Prometheus-operator's ServiceMonitor custom resource definition. With the Helm values below, the exporter should be deployable to your Kubernetes cluster:
    env:
      - name: "CE_BROKER_URL"
        valueFrom:
          secretKeyRef:
            key: "redisUrl"
            name: "<my-redis-secret>"
      - name: "CE_BUCKETS"
        value: "1,2.5,5,10,30,60,300,600,900,1800"
    podAnnotations:
      prometheus.io/scrape: "true"
    serviceMonitor:
      enabled: true
The Helm chart source can be found here.
As mentioned previously, the Celery-mixin has two dashboards: a Celery overview dashboard and a Celery tasks by task dashboard, which breaks metrics down per task. They are split in two for a few reasons. A single dashboard would hold too many graphs. Filters would apply to only a portion of the panels, since not all metrics carry the filtered labels, which would make it unclear when they apply. And some expensive metrics would run heavy queries against your Prometheus backend if no filters were applied.
The upcoming sections will describe each dashboard.
The Celery overview dashboard focuses on providing an overview of your entire Celery system. The following things are core for the dashboard:
The Celery tasks by task dashboard focuses on providing a breakdown of specific tasks and visualizing the more expensive metrics such as task runtime. The following things are core for the dashboard:
Note: some views are replicated from the overview. Just remember that they're now broken down by task, which is why they live in a separate dashboard.
Alerts are trickier to get right for a generic use case, but the Celery-mixin still provides them. They are configurable through the config.libsonnet file in the repository; if you are familiar with Jsonnet, customizing the alerts should be fairly straightforward. The alerts can be found on GitHub, and I'll describe them below.
Alerts when more than 5% of the runs of a specific task have failed over the past 10 minutes.
Alerts when the queue length for a specific queue is higher than 100 for 20 minutes.
Alerts when a worker is offline for more than 15 minutes.
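The real alerts are Prometheus rules generated from Jsonnet, but the threshold logic behind the three descriptions above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the (timestamp, value) sample format and the way the failure ratio is computed are mine, not taken from the Celery-mixin. Only the thresholds (5%, 100 for 20 minutes, 15 minutes) come from this post.

```python
FAILURE_RATIO_THRESHOLD = 0.05   # more than 5% failed
QUEUE_LENGTH_THRESHOLD = 100     # queue longer than 100 ...
QUEUE_LENGTH_FOR = 20 * 60       # ... for 20 minutes
WORKER_DOWN_FOR = 15 * 60        # worker offline for 15 minutes


def failure_ratio(failed, succeeded):
    """Share of failed runs among finished runs in a window (0.0 if none ran)."""
    total = failed + succeeded
    return failed / total if total else 0.0


def held_for(samples, predicate, duration_seconds):
    """True if predicate held on every trailing sample for at least the duration.

    samples is a list of (timestamp_seconds, value) pairs in ascending order.
    This loosely mimics the `for` clause of a Prometheus alerting rule.
    """
    if not samples:
        return False
    latest_time = samples[-1][0]
    start = None
    for timestamp, value in reversed(samples):
        if not predicate(value):
            break  # the streak of matching samples ends here
        start = timestamp
    return start is not None and latest_time - start >= duration_seconds


def task_fail_rate_alert(failed_last_10m, succeeded_last_10m):
    return failure_ratio(failed_last_10m, succeeded_last_10m) > FAILURE_RATIO_THRESHOLD


def queue_length_alert(queue_samples):
    return held_for(queue_samples, lambda v: v > QUEUE_LENGTH_THRESHOLD, QUEUE_LENGTH_FOR)


def worker_down_alert(worker_up_samples):
    # Assumes a value of 0 means the worker is offline.
    return held_for(worker_up_samples, lambda v: v == 0, WORKER_DOWN_FOR)
```

For example, a queue that stays at 150 for 25 minutes triggers queue_length_alert, while one that dips to 50 halfway through does not, because the streak restarts after the dip.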
The Celery-exporter is a great exporter, and Grafana and Prometheus are amazing open source tools for monitoring. The dashboards and alerts presented in this blog post should be easy to reuse and extend if needed. I think they set a good basis for Celery monitoring, but they can be improved and adjusted, so if you have any suggestions, please open issues in the Celery-exporter GitHub repository. I'm looking for any input to hopefully standardize dashboards and alerts for Celery over time!
I've also written a blog post on Django Monitoring with Prometheus and Grafana!