Argo Workflows exposes enough metrics to see whether workflows are backing up, CronWorkflows are firing, and the controller is keeping up, but raw /metrics output does not make any of that easy to read. This post covers argo-workflows-mixin, a Prometheus and Grafana mixin that adds two dashboards and a focused alert set for Argo Workflows.
The mixin is available on GitHub. It currently ships with two Grafana dashboards and a small set of alert rules:
- Argo Workflows / Overview - A cluster-level view of workflow phases, recent completions, success rate, workflow pods, and CronWorkflow activity.
- Argo Workflows / Controller - A controller-focused dashboard for reconciliation errors, Kubernetes API traffic, work queue pressure, and worker saturation.
The repo includes generated dashboard JSON in dashboards_out/ and a ready-to-load prometheus_alerts.yaml, so you can either vendor the mixin into your Jsonnet setup or import the generated files directly.
Prerequisites
The mixin assumes that Argo Workflows metrics are already exposed and scraped by Prometheus.
In Kubernetes that usually means exposing the controller metrics endpoint and scraping it with a ServiceMonitor or PodMonitor. If Prometheus cannot scrape the controller, most of the mixin will stay empty.
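As a rough sketch, a ServiceMonitor for the workflow controller could look like the following. The namespace, Service labels, and port name here are assumptions that depend on how Argo Workflows was installed; by default the controller serves metrics on port 9090 at /metrics:

```yaml
# Illustrative ServiceMonitor; adjust the namespace, label selector,
# and port name to match your Argo Workflows installation.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflows-controller
  namespace: argo            # assumed install namespace
spec:
  selector:
    matchLabels:
      app: workflow-controller   # assumed label on the metrics Service
  endpoints:
    - port: metrics              # assumed port name; default metrics port is 9090
      path: /metrics
```

Once Prometheus picks up the target, the `job` label it assigns is the one the mixin's selector needs to match.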
Setup
Clone the repo and install the Jsonnet dependencies:
git clone https://github.com/adinhodovic/argo-workflows-mixin
cd argo-workflows-mixin
jb install
Then generate the alert file and Grafana dashboards:
make prometheus_alerts.yaml
make dashboards_out
Load prometheus_alerts.yaml into Prometheus and import dashboards_out/argo-workflows-overview.json and dashboards_out/argo-workflows-controller.json into Grafana.
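For a plain Prometheus deployment, loading the generated alerts is a single `rule_files` entry in prometheus.yml; the mount path below is illustrative:

```yaml
# Illustrative prometheus.yml fragment; the file path is an assumption
# about where you mount the generated rules.
rule_files:
  - /etc/prometheus/rules/prometheus_alerts.yaml
```

If you run the Prometheus Operator instead, the equivalent is wrapping the generated groups in a PrometheusRule resource.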
If you vendor the mixin into an existing Jsonnet setup, override the selector in config.libsonnet so the mixin only targets your Argo Workflows metrics job. You can also adjust alert thresholds there.
{
  _config+:: {
    alerts+: {
      workflowFailureRate+: {
        threshold: '20',
      },
      workflowsPending+: {
        threshold: '10',
        interval: '10m',
      },
      queueDepthHigh+: {
        severity: 'warning',
        threshold: '200',
      },
    },
  },
}
This overrides only the alert thresholds that differ from the defaults.
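The selector override mentioned earlier follows the same pattern. The key name below (`argoWorkflowsSelector`) is a hypothetical placeholder; check config.libsonnet in the repo for the actual field:

```jsonnet
// Hypothetical selector key; look up the real name in config.libsonnet.
{
  _config+:: {
    // Restrict all queries and alerts to your Argo Workflows scrape job.
    argoWorkflowsSelector: 'job="argo-workflows"',
  },
}
```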
Grafana dashboards
Argo Workflows / Overview
The overview dashboard starts at the workload level and then drills into workflow and pod behavior:
- Filters - Filter by cluster, namespace, job, and workflow namespace.
- Summary - Shows active workflows, unhealthy workflows, recent completions, success rate, running pods, and pending workflows.
- Workflows - Shows current workflows by phase, completions by phase, success rate over time, workflow operation duration, and a namespace table for recent totals, failures, and success rate.
- Pods - Shows workflow-created pods by phase, pending reasons, restarts, and missing pods.
- CronWorkflows - Shows trigger counts and concurrency policy actions so you can see whether scheduled workflows are firing and whether Forbid or Replace policies are getting in the way.
This is the dashboard to keep open when you want a quick answer to "are workflows moving normally?". It is also the place to start before switching to the controller dashboard.
Argo Workflows / Controller
The controller dashboard focuses on reconciliation and controller internals:
- Summary - Shows controller error rate, Kubernetes API success rate, queue depth, and busy workers.
- Errors - Breaks down controller errors by cause and log messages by level.
- Kubernetes API - Shows request rate by kind and verb, request rate by status code, success rate, and request duration percentiles.
- Work queues - Shows queue depth, queue adds, queue latency, queue duration, retries, longest running work items, unfinished work, and busy workers by worker type.
When workflows are stuck in Pending, success rate drops, or throughput changes sharply, this dashboard makes it easier to tell whether the problem is in Argo Workflows itself, the Kubernetes API, or controller queue pressure.
Alerts
The mixin keeps the alert set small. All four alerts default to warning severity in config.libsonnet, and you can tighten the thresholds once you know what normal looks like in your cluster.
- ArgoWorkflowsHighWorkflowFailureRate - Fires when more than 10% of workflows in a namespace end in Failed or Error over the last 5m. This is the highest-signal alert in the set because it points to broken workflow logic, bad inputs, or widespread downstream failures.
- ArgoWorkflowsPendingWorkflows - Fires when more than 5 workflows stay in Pending for 15m. This usually points to scheduling issues, missing resources, unschedulable pods, or controller backlog.
- ArgoWorkflowsControllerHighErrorRate - Fires when controller errors exceed 5 per second over 5m, grouped by cause. This catches repeated reconciliation failures such as OperationPanic, CronWorkflowSubmissionError, or other controller-side problems.
- ArgoWorkflowsQueueDepthHigh - Fires when a controller queue stays above 100 items for 5m. A growing queue usually means the controller cannot keep up with the incoming work.
Issues and feedback are welcome in the GitHub repository.