Argo Workflows monitoring with Prometheus and Grafana

Published on April 04, 2026, 20:00 UTC

Argo Workflows exposes enough metrics to see whether workflows are backing up, CronWorkflows are firing, and the controller is keeping up, but raw /metrics output does not make any of that easy to read. This post covers argo-workflows-mixin, a Prometheus and Grafana mixin that adds two dashboards and a focused alert set for Argo Workflows.

The mixin is available on GitHub. It currently ships with two Grafana dashboards and four alert rules:

  • Argo Workflows / Overview - A cluster-level view of workflow phases, recent completions, success rate, workflow pods, and CronWorkflow activity.
  • Argo Workflows / Controller - A controller-focused dashboard for reconciliation errors, Kubernetes API traffic, work queue pressure, and worker saturation.

The repo includes generated dashboard JSON in dashboards_out/ and a ready-to-load prometheus_alerts.yaml, so you can either vendor the mixin into your Jsonnet setup or import the generated files directly.

Prerequisites

The mixin assumes that Argo Workflows metrics are already exposed and scraped by Prometheus.

In Kubernetes that usually means exposing the controller metrics endpoint and scraping it with a ServiceMonitor or PodMonitor. If Prometheus cannot scrape the controller, most of the mixin will stay empty.
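As a rough sketch, a ServiceMonitor for the workflow controller could look like the manifest below. The namespace, label selector, and port name here are assumptions; match them to however your Argo Workflows installation labels and exposes the controller metrics Service.

```yaml
# Sketch only: adjust namespace, matchLabels, and the port name to
# your Argo Workflows installation before applying.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflows-controller
  namespace: argo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argo-workflows-workflow-controller
  endpoints:
    - port: metrics
      interval: 30s
```

Once Prometheus picks up the target, the mixin's dashboards and alerts should start populating.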

Setup

Clone the repo and install the Jsonnet dependencies:

git clone https://github.com/adinhodovic/argo-workflows-mixin
cd argo-workflows-mixin

jb install

Then generate the alert file and Grafana dashboards:

make prometheus_alerts.yaml
make dashboards_out

Load prometheus_alerts.yaml into Prometheus and import dashboards_out/argo-workflows-overview.json and dashboards_out/argo-workflows-controller.json into Grafana.
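If you run the Prometheus Operator rather than a vanilla Prometheus, one way to load the generated alerts is to wrap the rule groups in a PrometheusRule resource. The metadata below is an assumption; the actual groups come from the generated prometheus_alerts.yaml.

```yaml
# Sketch only: paste the `groups:` section of the generated
# prometheus_alerts.yaml under spec.groups below.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-workflows-alerts
  namespace: monitoring
spec:
  groups: []
```

For a plain Prometheus, referencing the file from rule_files in prometheus.yml works just as well.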

If you vendor the mixin into an existing Jsonnet setup, override the selector in config.libsonnet so the mixin only targets your Argo Workflows metrics job. You can also adjust alert thresholds there.

{
  _config+:: {
    alerts+: {
      workflowFailureRate+: {
        threshold: '20',
      },

      workflowsPending+: {
        threshold: '10',
        interval: '10m',
      },

      queueDepthHigh+: {
        severity: 'warning',
        threshold: '200',
      },
    },
  },
}

This overrides only the alert thresholds that differ from the defaults.
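The selector override mentioned earlier might look something like the snippet below. The key name argoWorkflowsSelector is a hypothetical placeholder; check the mixin's config.libsonnet for the exact field it expects.

```jsonnet
{
  _config+:: {
    // Hypothetical key name; confirm the exact field in config.libsonnet.
    // Restrict the mixin's queries to your Argo Workflows metrics job.
    argoWorkflowsSelector: 'job="argo-workflows-controller-metrics"',
  },
}
```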

Grafana dashboards

Argo Workflows / Overview

The overview dashboard starts at the workload level and then drills into workflow and pod behavior:

  • Filters - Filter by cluster, namespace, job, and workflow namespace.
  • Summary - Shows active workflows, unhealthy workflows, recent completions, success rate, running pods, and pending workflows.
  • Workflows - Shows current workflows by phase, completions by phase, success rate over time, workflow operation duration, and a namespace table for recent totals, failures, and success rate.
  • Pods - Shows workflow-created pods by phase, pending reasons, restarts, and missing pods.
  • CronWorkflows - Shows trigger counts and concurrency policy actions so you can see whether scheduled workflows are firing and whether Forbid or Replace policies are getting in the way.

This is the dashboard to keep open when you want a quick answer to "are workflows moving normally?". It is also the place to start before switching to the controller dashboard.

Argo Workflows / Controller

The controller dashboard focuses on reconciliation and controller internals:

  • Summary - Shows controller error rate, Kubernetes API success rate, queue depth, and busy workers.
  • Errors - Breaks down controller errors by cause and log messages by level.
  • Kubernetes API - Shows request rate by kind and verb, request rate by status code, success rate, and request duration percentiles.
  • Work queues - Shows queue depth, queue adds, queue latency, queue duration, retries, longest running work items, unfinished work, and busy workers by worker type.

When workflows are stuck in Pending, success rate drops, or throughput changes sharply, this dashboard makes it easier to tell whether the problem is in Argo Workflows itself, the Kubernetes API, or controller queue pressure.

Alerts

The mixin keeps the alert set small. All four alerts default to warning severity in config.libsonnet, and you can tighten the thresholds once you know what normal looks like in your cluster.

  • ArgoWorkflowsHighWorkflowFailureRate - Fires when more than 10% of workflows in a namespace end in Failed or Error over the last 5m. This is the highest signal alert in the set because it points to broken workflow logic, bad inputs, or widespread downstream failures.
  • ArgoWorkflowsPendingWorkflows - Fires when more than 5 workflows stay in Pending for 15m. This usually points to scheduling issues, missing resources, unschedulable pods, or controller backlog.
  • ArgoWorkflowsControllerHighErrorRate - Fires when controller errors exceed 5 per second over 5m, grouped by cause. This catches repeated reconciliation failures such as OperationPanic, CronWorkflowSubmissionError, or other controller-side problems.
  • ArgoWorkflowsQueueDepthHigh - Fires when a controller queue stays above 100 items for 5m. A growing queue usually means the controller cannot keep up with the incoming work.

Issues and feedback are welcome in the GitHub repository.

Kubernetes events offer valuable insights into the activities within your cluster, providing a comprehensive view of each resource's status. While they're beneficial for debugging individual resources, they often face challenges due to the absence of aggregation. This can lead to issues such as events being garbage collected, the necessity to view them promptly, difficulties in filtering and searching, and limited accessibility for other systems. The blog post explores configuring Loki with Alloy to efficiently scrape Kubernetes events and visualize them in Grafana.