Celery is a python project used for asynchronous job processing and task scheduling in web applications or distributed systems. It is very commonly used together with Django, Celery as the asynchronous job processor and Django as the web framework. Celery has great documentation on how to use it, deploy it and integrate it with Django. However, monitoring is less covered - this is what this blog post aims to do. There is a great Prometheus exporter for Celery that has dashboards and alerts that come with them.
Blog Posts
Most Popular Blog Tags
Core Continuous Integration (CI) Steps for Python and Django Applications
Recently I worked on a Django project where I built a Django API using DRF, but I also was responsible for the infrastructure setup. Setting up a baseline Django project was easy due to the Django cookiecutter, but for the CI/CD setup the path was not as straight forward. Overtime, I've found a large ecosystem of tools that format, lint and test Django/Python projects very well. Getting a good CI suite with GitHub Actions was easy due to the ecosystem of formatting, linting and testing tools, but there was no blog post/guide on the all the various tools available. Therefore, I thought of sharing all the various actions we're using to ensure the highest code standard as possible using these tools.
Django Monitoring with Prometheus and Grafana
The Prometheus package for Django provides a great Prometheus integration, but the open source dashboards and alerts that exist are not that great. The to-go Grafana dashboard does not use a large portion of metrics provided by the Django-Prometheus package, alongside this there are no filters for views, methods, jobs and namespaces. This blog post will introduce the Django-mixin - a set of Prometheus rules and Grafana dashboards for Django. The dashboard and alerts will provide insights on applied/unapplied migrations, RED (requests per second, error percentage of the request, latency for each request) metrics, database ops and cache hit rate.
Django Error Tracking and Performance Monitoring with Sentry
As good as a framework that Django is, the default method of sending an email when getting an error leave much to be desired. The database or cache acting flaky? Expect 1000s of emails depicting that error, which usually does not provide enough details for you to understand the issue. Luckily, Sentry, which is an error tracking and monitoring platform provides an out-of-the-box integration for Django and Celery.
RabbitMQ Per Queue Monitoring
RabbitMQ has a native built-in Prometheus plugin and by default it has granular metrics disabled. Granular metrics means per-queue/vhost metrics - detailed metrics that provide message lag and consumer info on a queue and vhost basis. You could enable granular per-object metrics but this is not recommended as the plugin becomes much slower on a large cluster and the label cardinality for your time series database could become high.
To solve this you could use the unofficial OSS RabbitMQ exporter written by kbudde that will allow you to have granular metrics enabled and also disable specific metrics that the native Prometheus plugin provides. The unofficial exporter refers to a mixed approach where you use the unofficial exporter for detailed metrics and disable all other metrics and use the native RabbitMQ Prometheus plugin for all other metrics.
IRSA and Workload Identity with Terraform
The go-to practice when pods require permissions to access cloud services when using Kubernetes is using service accounts. The various clouds offering managed Kubernetes solutions have different implementations but they have the same concept, EKS has IRSA and GKE has Workload Identity. The service accounts that your containers use will have the required permissions to impersonate cloud IAM roles(AWS) or service accounts(GCP) so that they can access cloud resources. There are other alternatives as AWS instance roles but they are not fine-grained enough when running containerized workflows, every container has access to the resources the node is allowed to access. It might be a bit more complex and different coming from a non Kubernetes background but preexisting Terraform modules simplify the creation of the required resources to allow Kubernetes service accounts to impersonate and access cloud resources.
Private EKS API Endpoint behind OpenVPN
AWS offers a managed Kubernetes solution called Elastic Kubernetes Service (EKS). When an EKS cluster is spun up the Kubernetes API is by default accessible by the public. However, this might be something that your company does not approve of due to security reasons, they might want to limit Kubernetes API access only to private networks. In that case you might want to bring up a service as OpenVPN and route private traffic through it. That would allow you to access the Kubernetes API through a private endpoint using OpenVPN. In this blog post we'll use Terraform to provision our infrastructure required for a private EKS cluster and we'll use OpenVPN Access Server as our VPN solution.
CI/CD for Apollo GraphQL Managed Federation
GraphQL federation is great to use when you want a single API/gateway for all your queries. The simple to-go approach is schema stitching, where you run a gateway microservice which targets all other microservices and composes a graph. This works initially fine, however over time you'd like schema checking, auto-polling for graph updates, seamless rollouts(no issues with schema stitching when rolling out) and overall a process that's well integrated into your continuous integration and continuous delivery pipeline. The basic approach of schema stiching does not provide this, using managed federation provided by Apollo Studio improves the workflow and solves many of the pain points.
NestJS Apollo GraphQL Prometheus Metrics and Grafana Dashboards
Apollo GraphQL and NestJS are gaining traction quickly, however the monitoring approaches are unclear. At the moment (late 2021 / early 2022) there are no default exporters or libraries for Prometheus metrics and the same goes for Grafana dashboards, this blog post will provide both. Just to ensure that you are aware - Apollo Studio provides metrics and many other features for your graphs. The only downside is you'll most likely end up with a paid plan and you will be locked-in to their offering. Also, there is no way of exporting metrics to your Prometheus instance and centralizing alerting & dashboards.
This blog post will be based on a NestJS implementation for the dependency injection of Prometheus metrics, however it should work similarly in other setups.
Creating Awesome Alertmanager Templates for Slack
Prometheus, Grafana, and Alertmanager is the default stack when deploying a monitoring system. The Prometheus and Grafana bits are well documented, and there exist tons of open source approaches on how to make the best use of them. Alertmanager, on the other hand, is not highlighted as much, and even though the use case can be seen as fairly simple, it can be complex. The templating language has lots of features and capabilities. Alertmanager configuration, templates, and rules make a huge difference, especially when the team has an approach of 'not staring at dashboards all day'. Detailed Slack alerts can be created with tons of information, such as dashboard links, runbook links, and alert descriptions, which go well together with the rest of a ChatOps stack. This post goes through how to make efficient Slack alerts.