Monitoring Go runtime with Prometheus and Grafana

Published on March 25, 2026, 17:40 UTC · 4 minute read

Go applications expose a useful set of runtime metrics, but raw /metrics output does not make it easy to spot GC pressure, scheduler latency, memory growth, or file descriptor exhaustion. This post covers a go-mixin for Prometheus and Grafana that adds a dashboard and alerts for the Go runtime.

The mixin is available on GitHub. It currently ships with one Grafana dashboard and three alert rules:

  • Go / Overview - A dashboard for runtime CPU usage, scheduler latency, garbage collection, heap churn, mutex contention, cgo activity, and file descriptor pressure.
  • GoHighGcCpu - Alerts when a Go process spends too much CPU time in garbage collection.
  • GoHighSchedulerLatency - Alerts when runnable goroutines wait too long to be scheduled.
  • GoHighFdUsage - Alerts when a Go process is close to its file descriptor limit.

The repo also includes generated dashboard JSON and Prometheus rule files, so you can either vendor the mixin into your Jsonnet setup or import the generated files directly.

Prerequisites

The mixin assumes that your Go processes already expose Prometheus metrics and that Prometheus is scraping them. The dashboard and alerts depend on standard Go and process metrics such as go_info, go_sched_latencies_seconds_bucket, go_gc_pauses_seconds_bucket, process_open_fds, and process_resident_memory_bytes.

If your application already uses prometheus/client_golang and exposes /metrics with promhttp, you likely have most of what you need. In Kubernetes that usually means adding or updating a ServiceMonitor or PodMonitor so Prometheus can scrape the target.

Setup

Clone the repo and install the Jsonnet dependencies:

git clone https://github.com/adinhodovic/go-mixin
cd go-mixin

jb install

Then generate the Prometheus rules and Grafana dashboard:

make prometheus_alerts.yaml
make dashboards_out

Load prometheus_alerts.yaml into Prometheus and import dashboards_out/go-overview.json into Grafana.
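If you run a plain Prometheus server rather than the Prometheus Operator, loading the generated rules is a one-line rule_files entry in prometheus.yml; the path below assumes you copy the file next to your Prometheus configuration:

```yaml
rule_files:
  - prometheus_alerts.yaml
```

With the Operator, the equivalent is wrapping the generated rule groups in a PrometheusRule resource.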

If you vendor the mixin into an existing Jsonnet setup, you can override the selectors in config.libsonnet. By default the mixin includes alert targets for node-exporter, kubelet, and kube-apiserver, while the generic all matcher is disabled.

{
  _config+:: {
    alerts+: {
      defaults+: {
        highGcCpu+: {
          threshold: 0.25,
        },

        highSchedulerLatency+: {
          threshold: 0.01,
        },

        highFdUsage+: {
          threshold: 0.80,
        },
      },

      jobs+: {
        all: {
          enabled: false,
        },

        myService: {
          enabled: true,
          matcher: 'job="my-service"',

          highGcCpu+: {
            severity: 'critical',
            threshold: 0.35,
          },

          highSchedulerLatency+: {
            threshold: 0.02,
          },
        },

        kubelet: {
          highGcCpu+: {
            threshold: 0.40,
          },

          highFdUsage+: {
            severity: 'critical',
            threshold: 0.90,
          },
        },
      },
    },
  },
}

This keeps the generic matcher disabled, adds a dedicated job="my-service" target, and shows how to override thresholds and severity per job.

alerts.defaults defines the shared thresholds and severity, and each entry under alerts.jobs can override only the fields you need.

Grafana dashboard

Go / Overview

The dashboard is split into a few sections so it is easier to move from a high-level runtime view to a single pod or instance:

  • Filters - Filter by cluster, namespace, job, pod, container, and instance.
  • Overview - Shows instance count, memory used, CPU cores, current goroutines, and a runtime configuration table with Go version, GOMAXPROCS, and GOMEMLIMIT.
  • CPU and scheduling - Shows CPU usage by target, runnable goroutines, thread count, scheduler latency, and GC CPU ratio.
  • Memory - Shows resident memory, runtime footprint, heap breakdown, scan pressure, next GC target, and allocation size percentiles.
  • Garbage collection - Shows GC CPU phases, cycle counts, GC duration quantiles, pause latency, and heap allocation and free churn.
  • Contention and OS - Shows open file descriptors, file descriptor usage ratio, mutex wait time, and cgo call rate.

Most runtime dashboards stop at goroutines, heap size, and GC pauses. This one goes a bit deeper into scheduler backlog, scan pressure, allocation size, and mutex wait time, which makes it easier to tell whether a problem is caused by allocation churn, lock contention, or plain CPU pressure.

[Screenshot: Go runtime overview]

[Screenshot: memory and garbage collection]

[Screenshot: contention and OS]

Alerts

The mixin keeps the alert set small and focused. All three alerts link back to the dashboard with the matching namespace, job, and pod filters.

  • GoHighGcCpu - Fires when the ratio between go_cpu_classes_gc_total_cpu_seconds_total and total Go runtime CPU stays above the configured threshold. This is useful when a process is still up but spends too much time collecting garbage.
  • GoHighSchedulerLatency - Fires when Go scheduler P99 latency stays above the configured threshold. This usually points to runnable goroutines waiting too long for CPU time.
  • GoHighFdUsage - Fires when process_open_fds / process_max_fds stays above the configured threshold. This catches descriptor exhaustion before the process starts failing to open sockets or files.

The defaults are conservative: 15m for all three alerts, 25% GC CPU, 0.01 seconds for scheduler P99 latency, and 80% file descriptor usage. Override them per job if you know a workload has different behavior.
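To give a sense of the shape of these rules, here is a rough sketch of the file descriptor alert with its default threshold and duration. The generated prometheus_alerts.yaml is the source of truth; the group name, severity label, and annotation text below are illustrative, not copied from the mixin:

```yaml
groups:
  - name: go
    rules:
      - alert: GoHighFdUsage
        expr: process_open_fds / process_max_fds > 0.80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Go process is close to its file descriptor limit.
```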

Issues and feedback are welcome in the GitHub repository.
