Karpenter Monitoring with Prometheus and Grafana

Table of Contents

Grafana Dashboards
Alerts
Conclusion

With the release of Karpenter v1 there are more and better Prometheus metrics, but the existing Grafana dashboards aren’t that great and there are no open source alerts. Therefore, I decided to create a monitoring mixin that covers Karpenter. This blog post introduces the kubernetes-autoscaling-mixin, a set of Prometheus rules and Grafana dashboards for Kubernetes autoscaling, and covers only the Karpenter part of it.

Three dashboards are already published on Grafana:

Karpenter Overview - A simple overview of node pools, nodes and pods.
Karpenter Activity - Provides insight into Karpenter’s activity, showing when scale-ups and scale-downs happen and the reasons behind them.
Karpenter Performance - Provides insight into Karpenter’s performance, showing, for example, cloud provider errors, node termination duration and pod startup duration.

There are also Prometheus alerts, stored on GitHub, that you can import and that cover the common Karpenter issues.

Note: I’ve written a blog post about Comprehensive Kubernetes Autoscaling Monitoring with Prometheus and Grafana. This blog post takes excerpts from that one, focuses on Karpenter monitoring, and uses the same monitoring mixin. You can turn off the other dashboards and alerts if you only want to monitor Karpenter. I’d also recommend checking out the autoscaling blog post; it covers HPAs, VPAs, and PDBs.

If you want to go directly to the dashboards, use the links in the preceding section. The rest of the blog post describes the various alerts and dashboards.

Grafana Dashboards

The following sections describe each dashboard.

Karpenter Overview

The Grafana dashboard provides an overview of Karpenter in your Kubernetes cluster. It includes the following panels:

Filters - Let you filter by namespace and Karpenter controller, and break down the node pools by region, zone, architecture, OS, instance type and capacity type.
Node pool summary - Provides an overview of the node pools: the node pool count and the usage and limits of the node pools. It also summarizes the node pools by region, zone, architecture, OS, instance type and capacity type.
Pod summary - Provides an overview of pod usage and limits, along with a summary by node pool, instance type, and capacity type.
Node pools - Displays a table of the node pools and their characteristics.
Nodes - Displays a table of the nodes and their characteristics.

Karpenter-overview-1

Karpenter-overview-2

Karpenter Activity

The Grafana dashboard offers an overview of node pool status (disruptions and scaling) and pod activity (phases and startup times) in your Kubernetes cluster. It includes the following panels:

Node pool activity - Shows the activity of the node pools: the number of disruptions and scaling events and the reasons behind them.
Pod activity - Displays pod activity, including time series for pod phases and startup durations.

Karpenter-activity-1

Karpenter Performance

The Grafana dashboard provides an overview of Karpenter’s performance in your Kubernetes cluster. It includes the following panels:

Summary - Summarizes Karpenter performance, displaying cluster sync status, total node count, cloud provider errors, node termination duration, and pod startup duration.
Interruption queue - Provides insight into the interruption queue, showing the received messages, the deleted messages and the interruption duration.
Work queue - Visualizes work queue depth along with queuing and processing durations.
Controller - Summarizes the controller’s reconciliation requests per second, categorized by request type.

Karpenter-performance-1

Karpenter-performance-2

Alerts

Alerts are trickier to get right for a generic use case, but the kubernetes-autoscaling-mixin still provides them. They’re configurable through the config.libsonnet package in the repository, so if you are familiar with Jsonnet, customizing the alerts should be fairly straightforward. The alerts are available on GitHub, and each one is described below.

Alert name: KarpenterCloudProviderErrors

Alerts when Karpenter has had cloud provider errors in the last 15 minutes.
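To make the "errors in the last 15 minutes" condition concrete, here is a minimal Python sketch of the idea. It is not the mixin's rule, which is a Prometheus expression; the function names, the sample format and the zero threshold are all assumptions made for illustration. It sums how much an error counter grew inside the window and treats a drop in value as a counter reset.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value); a hypothetical sample format
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample], now: float, window_s: float = 900.0) -> float:
    """Sum the growth of a counter over the samples that fall inside the window.

    A drop in value is treated as a counter reset (for example a controller
    restart), so the value after the reset counts as new increase.
    """
    in_window = [s for s in samples if now - window_s <= s[0] <= now]
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase


def cloud_provider_errors_firing(
    samples: List[Sample], now: float, window_s: float = 900.0, threshold: float = 0.0
) -> bool:
    """Return True when the counter grew by more than `threshold` in the window."""
    return counter_increase(samples, now, window_s) > threshold


if __name__ == "__main__":
    errors = [(0.0, 5.0), (300.0, 5.0), (600.0, 7.0), (900.0, 8.0)]
    print(counter_increase(errors, now=900.0))              # 3.0
    print(cloud_provider_errors_firing(errors, now=900.0))  # True
```

Unlike Prometheus, this sketch does not extrapolate to the window edges, so the numbers will not match a real query exactly.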

Alert name: KarpenterNodeClaimsTerminationDurationHigh

Alerts when the termination duration for node claims has been high over the last 15 minutes. This indicates that terminating the instances behind the node claims is taking too long.
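As a rough illustration of what "termination duration is high" can mean, the sketch below takes the durations observed in a window, computes a high quantile and compares it to a threshold. The quantile (0.95), the threshold (1200 seconds) and the function names are hypothetical; the post does not state the values the mixin uses, so check the alert definition on GitHub for those.

```python
import math
from typing import Sequence


def nearest_rank_quantile(values: Sequence[float], q: float) -> float:
    """Return the q-quantile of `values` using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank into the sorted values
    return ordered[rank - 1]


def termination_duration_high(
    durations_s: Sequence[float], threshold_s: float = 1200.0, q: float = 0.95
) -> bool:
    """Return True when the q-quantile of the durations exceeds the threshold.

    An empty window never fires, because there is nothing to measure.
    """
    return bool(durations_s) and nearest_rank_quantile(durations_s, q) > threshold_s


if __name__ == "__main__":
    window = [30.0, 45.0, 60.0, 90.0, 1500.0]
    print(nearest_rank_quantile(window, 0.95))  # 1500.0
    print(termination_duration_high(window))    # True
```

A quantile is used here instead of the mean so that one slow termination among many fast ones is still visible, though a small window makes any quantile noisy.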

Alert name: KarpenterNodepoolNearCapacity

Alerts when a Karpenter node pool has been near capacity over the last 15 minutes. The current threshold is 75% of the limit. This indicates that the node pool limits need to be raised.
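The 75% check is simple arithmetic: usage divided by limit, compared to 0.75. The sketch below works through it per resource type. The function name, the dictionary layout and the choice of "at or above" rather than "strictly above" are assumptions for illustration, not the mixin's actual rule.

```python
from typing import Dict, List


def resources_near_capacity(
    usage: Dict[str, float], limit: Dict[str, float], threshold: float = 0.75
) -> List[str]:
    """Return the resource types whose usage is at or above threshold * limit."""
    flagged = []
    for resource, resource_limit in limit.items():
        if resource_limit <= 0:
            continue  # no meaningful limit configured for this resource
        if usage.get(resource, 0.0) / resource_limit >= threshold:
            flagged.append(resource)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical node pool: 80 of 100 CPU cores and 100 of 400 GiB memory in use.
    usage = {"cpu": 80.0, "memory": 100.0}
    limit = {"cpu": 100.0, "memory": 400.0}
    print(resources_near_capacity(usage, limit))  # ['cpu']
```

Here CPU sits at 80% of its limit and is flagged, while memory at 25% is not.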

Conclusion

Karpenter is a great tool for autoscaling your Kubernetes cluster, but it’s important to monitor it to ensure that it’s working as expected. The kubernetes-autoscaling-mixin provides a set of Prometheus rules and Grafana dashboards that help you do that. The dashboards provide an overview of Karpenter in your Kubernetes cluster, including node pool and pod activity, and performance. The alerts help you identify issues with Karpenter, such as cloud provider errors and node pools that are near capacity. If you’re using Karpenter in your Kubernetes cluster, I highly recommend checking out the kubernetes-autoscaling-mixin.