
Karpenter Monitoring with Prometheus and Grafana


With the release of Karpenter v1 we have more and better Prometheus metrics, but the existing Grafana dashboards are limited and there are no open source alerts. I therefore created a monitoring mixin that provides a set of Prometheus rules and Grafana dashboards for Karpenter. This blog post introduces the kubernetes-autoscaling-mixin, a set of Prometheus rules and Grafana dashboards for Kubernetes autoscaling, and covers only its Karpenter monitoring.

Three dashboards are already published on Grafana:

  • Karpenter Overview - A simple overview of node pools, nodes and pods.
  • Karpenter Activity - Insights into Karpenter's activity, showing when scale-ups and scale-downs happen and the reasoning behind them.
  • Karpenter Performance - Insights into Karpenter's performance, showing, for example, cloud provider errors, node termination duration and pod startup duration.

There are also Prometheus alerts on GitHub that you can import. They cover the common Karpenter issues.

Note: I've written a blog post about Comprehensive Kubernetes Autoscaling Monitoring with Prometheus and Grafana. This blog post takes excerpts from it and focuses on Karpenter monitoring. It also uses the same monitoring mixin, but you can disable the other dashboards and alerts if you only want to monitor Karpenter. I'd also recommend checking out the autoscaling blog post, which covers HPAs, VPAs and PDBs.

If you want to go directly to the dashboards, use the links above. The rest of the blog post describes the alerts and dashboards.

Grafana Dashboards

The upcoming sections will describe each dashboard.

Karpenter Overview

The Grafana dashboard provides an overview of Karpenter in your Kubernetes cluster. It includes the following panels:

  • Filters - Lets you filter by namespace and Karpenter controller, and break down the node pools by region, zone, architecture, OS, instance type and capacity type.
  • Node pool summary - Provides an overview of the node pools: the node pool count and the usage and limits of the node pools. It also summarizes the node pools by region, zone, architecture, OS, instance type and capacity type.
  • Pod summary - Provides an overview of pod usage and limits, along with a summary by node pool, instance type, and capacity type.
  • Node pools - Displays a table of the node pools and their characteristics.
  • Nodes - Displays a table of the nodes and their characteristics.

[Screenshot: Karpenter Overview dashboard, part 1]

[Screenshot: Karpenter Overview dashboard, part 2]

Karpenter Activity

The Grafana dashboard offers an overview of node pool status (disruptions and scaling) and pod activity (phases and startup times) in your Kubernetes cluster. It includes the following panels:

  • Node pool activity - Shows the activity of the node pools: the number of disruptions and scaling events and the reasoning behind them.
  • Pod activity - Displays pod activity, including time series for pod phases and startup durations.

[Screenshot: Karpenter Activity dashboard]

Karpenter Performance

The Grafana dashboard provides an overview of Karpenter's performance in your Kubernetes cluster. It includes the following panels:

  • Summary - Summarizes Karpenter performance, displaying cluster sync status, total node count, cloud provider errors, node termination duration, and pod startup duration.
  • Interruption queue - Provides insights into the interruption queue. It shows the received messages, the deleted messages and the interruption duration.
  • Work queue - Visualizes work queue depth along with queuing and processing durations.
  • Controller - Summarizes the controller's reconciliation requests per second, categorized by request type.

[Screenshot: Karpenter Performance dashboard, part 1]

[Screenshot: Karpenter Performance dashboard, part 2]

Alerts

Alerts are trickier to get right for a generic use case, but the kubernetes-autoscaling-mixin still provides them. They are configurable through the config.libsonnet file in the repository, so if you are familiar with Jsonnet, customizing the alerts should be fairly straightforward. The alerts can be found on GitHub, and I describe each of them below.

  • Alert name: KarpenterCloudProviderErrors

Alerts when Karpenter has had cloud provider errors in the last 5 minutes.

  • Alert name: KarpenterNodeClaimsInstanceTerminationDurationHigh

Alerts when the instance termination duration for node claims has been high over the last 5 minutes. This indicates that instances are taking too long to terminate.

  • Alert name: KarpenterNodepoolNearCapacity

Alerts when a Karpenter node pool has been near capacity over the last 15 minutes. The current threshold is 75% of the limit. This indicates that the node pool limits may need to be raised.
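
To make the near-capacity check concrete, here is a minimal Python sketch of the idea: compare usage to the limit and flag anything above 75%. It also builds an illustrative Prometheus alerting rule as a plain dictionary. This is not the mixin's actual expression. The metric names karpenter_nodepools_usage and karpenter_nodepools_limit, the nodepool and resource_type labels, the severity label and the use of a 15m "for" clause are all assumptions on my part, so check them against your Karpenter version and the rules on GitHub.

```python
import json

# Assumed metric names (not stated in this post); verify against your Karpenter version.
USAGE_METRIC = "karpenter_nodepools_usage"
LIMIT_METRIC = "karpenter_nodepools_limit"


def near_capacity(usage: float, limit: float, threshold_percent: float = 75.0) -> bool:
    """Return True when usage exceeds threshold_percent of a positive limit."""
    if limit <= 0:
        # No limit configured, so there is no ceiling to approach.
        return False
    return 100.0 * usage / limit > threshold_percent


def near_capacity_rule(threshold_percent: float = 75.0, for_duration: str = "15m") -> dict:
    """Build an illustrative alerting rule; label names are assumptions."""
    expr = (
        f"100 * sum by (nodepool, resource_type) ({USAGE_METRIC})"
        f" / sum by (nodepool, resource_type) ({LIMIT_METRIC})"
        f" > {threshold_percent:g}"
    )
    return {
        "alert": "KarpenterNodepoolNearCapacity",
        "expr": expr,
        "for": for_duration,  # assumed: the condition must hold this long before firing
        "labels": {"severity": "warning"},  # hypothetical severity
        "annotations": {"summary": "Karpenter node pool is near its configured limit."},
    }


if __name__ == "__main__":
    # Print the rule wrapped in the groups/rules layout used by Prometheus rule files.
    groups = {"groups": [{"name": "karpenter", "rules": [near_capacity_rule()]}]}
    print(json.dumps(groups, indent=2))
```

With these assumptions, a node pool using 80 of 100 CPU cores would be flagged, while one sitting exactly at 75 would not, since the comparison is strictly greater than the threshold. In the mixin itself you would change the threshold through config.libsonnet rather than in code like this.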

Summary

Karpenter is a great tool for autoscaling your Kubernetes cluster, but it's important to monitor it to ensure that it's working as expected. The kubernetes-autoscaling-mixin provides a set of Prometheus rules and Grafana dashboards for exactly that. The dashboards give an overview of Karpenter in your Kubernetes cluster, including node pool and pod activity, and performance. The alerts help you identify issues with Karpenter, such as cloud provider errors and node pools that are near capacity. If you're using Karpenter in your Kubernetes cluster, I highly recommend checking out the kubernetes-autoscaling-mixin.

