Proactive Kubernetes Budget Alerts Using Prometheus

Published on June 13, 2025, 15:00 UTC

Not long ago, I shared how to monitor Kubernetes costs using OpenCost, Prometheus, and Grafana. In this follow-up, I want to highlight a specific feature: setting up proactive budget alerts using OpenCost and Prometheus. It’s a simple but powerful way to catch cost overruns early.

For the full guide, including setup and dashboards, see the original post: Kubernetes Cost Tracking Simplified with OpenCost, Prometheus, and Grafana. This post builds on that setup and uses the same tools.

Assuming OpenCost metrics are already flowing into Prometheus, the following sections cover the Prometheus alert rules used to track and respond to cost thresholds.

Alerts

The cost alerts are customizable through the config.libsonnet file provided in the repository for the Prometheus alerts. If you’re familiar with Jsonnet, adjusting and tailoring these alerts to your needs should be straightforward. That said, these alerts aren’t plug-and-play: you must adapt the configuration to reflect your own Kubernetes workloads, budget expectations, and cost tolerances.

You can find the default alert definitions in the prometheus_alerts.yaml file. Below are descriptions and example use cases for the two primary alerts included:
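Before looking at each alert, it may help to see where an expr like the ones below sits inside a Prometheus rule file. The following is a minimal sketch of the standard alerting-rule layout, not a copy of prometheus_alerts.yaml: the group name, the for duration, the severity label, and the annotation text are all assumptions, and the expression is deliberately simplified to node cost only.

```yaml
groups:
  - name: opencost-budget-alerts        # hypothetical group name
    rules:
      - alert: OpenCostMonthlyBudgetExceeded
        # Simplified for illustration: node cost only, projected over 730 hours.
        expr: |
          sum(node_total_hourly_cost{job=~"opencost"}) by (cluster) * 730 > 200
        for: 30m                         # assumed hold time before the alert fires
        labels:
          severity: warning              # assumed label for routing
        annotations:
          summary: Projected monthly cost exceeds the budget.
          description: >-
            Cluster {{ $labels.cluster }} is projected to cost
            {{ $value | printf "%.2f" }} USD per month.
```

Because the comparison has no bool modifier, $value in the annotation carries the projected cost (the left-hand side), which is usually what you want to see in the notification.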

OpenCostMonthlyBudgetExceeded

This alert fires when your projected monthly cost exceeds a predefined threshold. The projection takes the current hourly node cost, adds the hourly persistent volume cost (volume capacity in GiB multiplied by the hourly price per GiB), and multiplies the total by 730 hours, the average length of a month (8,760 hours / 12). Example:

expr: |
  # Projected monthly node cost per cluster.
  (
    sum(
      node_total_hourly_cost{
        job=~"opencost"
      }
    ) by (cluster) * 730
    or vector(0)
  )
  +
  # Projected monthly persistent volume cost per cluster (GiB * hourly price per GiB).
  (
    sum(
      sum(
        kube_persistentvolume_capacity_bytes{
          job=~"opencost"
        }
        / 1024 / 1024 / 1024
      ) by (cluster, persistentvolume)
      *
      sum(
        pv_hourly_cost{
          job=~"opencost"
        }
      ) by (cluster, persistentvolume)
    ) by (cluster) * 730
    or vector(0)
  )
  > 200

Default threshold: $200. Adjust this based on your budget expectations in the config.libsonnet file using the field monthlyCostThreshold.

What it means: if your current usage rate continued for a full month, you would spend more than $200.

Use case: set this alert if you run many clusters and want to catch cost creep early.
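To make the projection concrete, here is the arithmetic written out. The symbols and the dollar figures are made up for illustration: c_node stands for the summed hourly node cost and c_pv for the summed hourly persistent volume cost of one cluster.

```latex
\[
\text{projected monthly cost} = \left(c_{\text{node}} + c_{\text{pv}}\right) \times 730
\]
\[
(0.25 + 0.03) \times 730 = 0.28 \times 730 = 204.40 > 200
\]
```

In this hypothetical case a cluster costing 28 cents per hour crosses the default threshold, so the alert would fire. At 27 cents per hour the projection is 197.10 and it would not.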

OpenCostAnomalyDetected

This alert watches for sudden cost spikes. It compares the average hourly cost over the past 3 hours with the average over the last 7 days, and fires if the short-term cost exceeds the long-term average by more than 15%.

Example:

expr: |
  (
    (
      (
        avg_over_time(
          sum(
            node_total_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [3h:30m]
        )
        or vector(0)
      )
      +
      (
        avg_over_time(
          sum(
            (
              kube_persistentvolume_capacity_bytes{
                job=~"opencost"
              } / 1024 / 1024 / 1024
            )
            * on (cluster, persistentvolume)
            group_left()
            pv_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [3h:30m]
        )
        or vector(0)
      )
    )
    -
    (
      (
        avg_over_time(
          sum(
            node_total_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [7d:30m]
        )
        or vector(0)
      )
      +
      (
        avg_over_time(
          sum(
            (
              kube_persistentvolume_capacity_bytes{
                job=~"opencost"
              } / 1024 / 1024 / 1024
            )
            * on (cluster, persistentvolume)
            group_left()
            pv_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [7d:30m]
        )
        or vector(0)
      )
    )
  )
  /
  (
    (
      (
        avg_over_time(
          sum(
            node_total_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [7d:30m]
        )
        or vector(0)
      )
      +
      (
        avg_over_time(
          sum(
            (
              kube_persistentvolume_capacity_bytes{
                job=~"opencost"
              } / 1024 / 1024 / 1024
            )
            * on (cluster, persistentvolume)
            group_left()
            pv_hourly_cost{
              job=~"opencost"
            }
          ) by (cluster) [7d:30m]
        )
        or vector(0)
      )
    )
  )
  > (15 / 100)

Default threshold: 15%. Adjust this in the config.libsonnet file using the field anomalyThreshold.

What it means: it alerts you when your hourly cost suddenly jumps above normal, possibly due to a misconfigured deployment or an unexpected workload.

Use case: use this to catch runaway workloads or unexpected autoscaler behavior quickly.
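Stripped of the PromQL detail, the expression above reduces to a relative-change check. The notation is mine rather than the mixin's: the barred terms are the average hourly cost (nodes plus persistent volumes) over the trailing 3 hours and 7 days, and the sample values are invented.

```latex
\[
\frac{\bar{c}_{3h} - \bar{c}_{7d}}{\bar{c}_{7d}} > \frac{15}{100}
\]
\[
\frac{1.20 - 1.00}{1.00} = 0.20 > 0.15
\qquad\text{but}\qquad
\frac{1.10 - 1.00}{1.00} = 0.10 \not> 0.15
\]
```

So with a 7-day average of $1.00 per hour, a recent average of $1.20 would trigger the alert, while $1.10 would not. Since the 3-hour window is itself part of the 7-day window, a sustained increase gradually raises the baseline and the alert eventually clears on its own.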

Conclusion

Setting up cost alerts with OpenCost and Prometheus is a simple way to stay ahead of unexpected spending. With just a few rules, you can catch cost spikes or budget overruns early and take action before they become a problem.
