Kubernetes events offer valuable insights into the activities within your cluster, providing a comprehensive view of each resource’s status. While they’re beneficial for debugging individual resources, they suffer from the absence of aggregation. This can lead to issues such as events being garbage collected, the need to view them before they expire, difficulties in filtering and searching, and limited accessibility for other systems. This blog post explores configuring Loki with Alloy to efficiently scrape Kubernetes events and visualize them in Grafana.
This blog post presents an opinionated approach using Loki, Prometheus, and Alloy as the tools of choice. Loki serves as a cost-effective and user-friendly log aggregation system, while Alloy functions as a tool for telemetry collections and Prometheus stores time series data. The post focuses on the additional configuration required, assuming you have already installed Loki, Alloy, and Prometheus.
This blog post also introduces the kubernetes-events-mixin, where you can find a set of Grafana dashboards and Prometheus rules for monitoring Kubernetes events. The mixin won’t work out of the box; it requires the Alloy and Loki configuration described in the rest of this blog post.
Configuring Alloy to Scrape Kubernetes Events
First, use Alloy’s Kubernetes events source to scrape events from the cluster. Deploy Alloy using Helm with the following values, which scrape the Kubernetes events and forward them to Loki:
alloy:
  configMap:
    content: |
      loki.process "default" {
        stage.replace {
          expression = "(\"type\":\"Normal\")"
          replace = "\"type\":\"Normal\",\"level\":\"info\""
        }
        stage.replace {
          expression = "(\"type\":\"Warning\")"
          replace = "\"type\":\"Warning\",\"level\":\"warning\""
        }
        stage.json {
          expressions = {
            "k8s_resource_kind" = "kind",
            "k8s_resource_name" = "name",
          }
        }
        stage.labels {
          values = {
            "k8s_namespace_name" = "namespace",
            "k8s_resource_kind" = "k8s_resource_kind"
          }
        }
        stage.structured_metadata {
          values = {
            "k8s_resource_name" = "k8s_resource_name"
          }
        }
        stage.label_keep {
          values = ["cluster", "organization", "region", "job", "k8s_namespace_name", "k8s_resource_kind"]
        }
        forward_to = [loki.write.default.receiver]
      }
      loki.source.kubernetes_events "default" {
        forward_to = [loki.process.default.receiver]
        log_format = "json"
      }
      loki.write "default" {
        endpoint {
          url = "http://loki-gateway.logging.svc/loki/api/v1/push"
        }
        external_labels = {
          "cluster" = "my-cluster",
          "environment" = "production",
          "region" = "europe-west1",
        }
      }
  enabled: true
controller:
  type: statefulset
The configuration performs the following actions:

- loki.source.kubernetes_events - Scrapes the Kubernetes events and forwards them to the Loki processor.
- loki.process - Handles the Kubernetes events by deriving a level field from the type field and adding labels and structured metadata. The structured metadata is crucial for filtering and searching the events, and Grafana uses the level field to assess the severity of each event. The label k8s_resource_kind differentiates between the various Kubernetes kinds alongside k8s_namespace_name, which indicates the namespace the resource is in. They’re indexed, but Kubernetes resource kinds typically shouldn’t lead to label cardinality issues since they’re usually limited in number. However, if you have many different API kinds, you might want to consider an alternative approach.
- loki.write - Forwards the processed events to Loki. The external_labels field adds additional labels to the events, such as the cluster, environment, and region.
- controller - Specifies the type of controller to deploy Alloy as, in this case a statefulset. You only need a single instance of Alloy to scrape the Kubernetes events.
The events should be flowing into Loki after deploying Alloy with the preceding configuration. You can verify this by querying the Loki API for the Kubernetes events:
sum (count_over_time({job="loki.source.kubernetes_events"} | json [1m])) by (k8s_namespace_name, k8s_resource_kind, type)
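To drill into the events for a single resource, you can additionally filter on the structured metadata added by the processing stage. A minimal sketch, assuming the labels configured above; the namespace, kind, and resource name are placeholders:

{job="loki.source.kubernetes_events", k8s_namespace_name="default", k8s_resource_kind="Pod"} | k8s_resource_name="my-app-7d9c5b6f4-abcde" | json | type="Warning"

Because k8s_resource_name is stored as structured metadata rather than an indexed label, it can hold high-cardinality values such as pod names without bloating the index.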
Configuring Loki to Write Metrics to Prometheus
Loki’s strong suit isn’t aggregation over long periods of time or complex queries, which is where Prometheus comes in. Prometheus is a time series database that excels at storing and querying time series data. Therefore, instead of running complex log queries over long time periods, the ruler’s remote_write feature combined with recording rules runs those queries over short intervals at a regular cadence and writes the results to Prometheus. The goal is to count the number of events by k8s_namespace_name, k8s_resource_kind, and type every minute and store that count in Prometheus. This way the data can be queried cheaply in Grafana without putting too much pressure on Loki.
To write metrics from Loki to Prometheus, you need to configure Loki. Deploy Loki using Helm with the following values:
loki:
  rulerConfig:
    remote_write:
      client:
        url: http://prometheus-k8s.monitoring.svc:9090/api/v1/write
      enabled: true
    rule_path: /rules
    storage:
      local:
        directory: /rules
      type: local
    wal:
      dir: /var/loki/ruler/wal
Replace prometheus-k8s.monitoring.svc with your Prometheus service endpoint. The configuration writes the metrics to Prometheus using remote write.
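Note that Prometheus only accepts remote writes when its remote write receiver is enabled (the --web.enable-remote-write-receiver flag). If you run the Prometheus Operator, as the prometheus-k8s service name suggests, a minimal sketch of the relevant setting, assuming a Prometheus resource named k8s in the monitoring namespace:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Allow Loki's ruler to push recorded metrics via /api/v1/write.
  enableRemoteWriteReceiver: true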
Loki also requires configuration to load rules from ConfigMaps. The following configuration enables a sidecar container that loads the rules from a ConfigMap:
sidecar:
  rules:
    folder: /rules/fake
    label: loki.grafana.com/rule
    labelValue: "true"
    searchNamespace: ALL
The sidecar loads rules from any ConfigMap with the label loki.grafana.com/rule=true and stores them in the folder /rules/fake. Single-tenant deployments use the fake tenant folder.
Adding Prometheus Rules to Loki
To write metrics to Prometheus, you need to add Prometheus rules to Loki. Create a ConfigMap with the following rules:
apiVersion: v1
data:
  kubernetes-events.yaml: |-
    "groups":
    - "interval": "1m"
      "name": "kubernetes-events.rules"
      "rules":
      - "expr": |
          sum (count_over_time({job="loki.source.kubernetes_events"} | json [1m])) by (k8s_namespace_name, k8s_resource_kind, type)
        "record": "namespace_kind_type:kubernetes_events:count1m"
kind: ConfigMap
metadata:
  labels:
    loki.grafana.com/rule: "true"
  name: kubernetes-events
  namespace: logging
If everything is configured correctly, you can query the following metric in your Prometheus instance:
namespace_kind_type:kubernetes_events:count1m
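With the recorded series in Prometheus, longer-range aggregations become cheap. As a rough example, a query for the top sources of warning events over the last week (assuming the recording rule above has been evaluating for that long):

topk(10, sum by (k8s_namespace_name, k8s_resource_kind) (sum_over_time(namespace_kind_type:kubernetes_events:count1m{type="Warning"}[1w])))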
Grafana Dashboards
Now that you have the Kubernetes events in Loki and Prometheus, you can visualize them in Grafana.
As mentioned previously, the kubernetes-events-mixin has two dashboards: a Kubernetes events overview and a Kubernetes events timeline. The upcoming sections describe each dashboard.
Kubernetes Events Overview Dashboard
The Kubernetes events overview dashboard focuses on providing an overview of Kubernetes events. It primarily uses the Prometheus metrics to visualize the events. The core parts of the dashboard are:

- Summary - Provides a section that summarizes events over time by kind, namespace, and type. It also shows the top sources of warning and normal events over the last week.
- Kind Summary - Provides a section that shows events by kind and namespace using the applied filters. It also shows a pie chart with a breakdown by type.
Kubernetes Events Timeline Dashboard
The Kubernetes events timeline dashboard focuses on providing a timeline of Kubernetes events. It uses Loki logs to visualize the events. The dashboard offers more detailed insights into individual events but requires more aggressive filtering, limiting visualization to kind and namespace only. The dashboard also isn’t very useful without a search for the name of the resource that originated the event; an illustrative query sketch follows the list. The core parts of the dashboard are:

- Events Logs - Displays events in a log panel limited to 100 entries, showing the name, type, and message. Name searches are highly recommended; otherwise the logs are too noisy because they come from too many sources of events.
- Events Timeline - Displays a timeline of events by kind and namespace, showing the type, reason, and message of each event. Again, name searches are highly recommended.
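To give a sense of the underlying log queries, here is a rough LogQL sketch of the kind of name search the timeline dashboard relies on; the mixin’s actual panel queries differ, and the namespace, kind, and name pattern are placeholders:

{job="loki.source.kubernetes_events", k8s_namespace_name="default", k8s_resource_kind="Deployment"} | json | name=~".*my-app.*"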
Summary
This blog post explored configuring Loki and Alloy to efficiently scrape Kubernetes events and visualize them in Grafana. The post presented an opinionated approach using Loki, Prometheus, and Alloy as the tools of choice. It also introduced the kubernetes-events-mixin, where you can find a set of Grafana dashboards and Prometheus rules for monitoring Kubernetes events. This approach is an awesome improvement over previous event monitoring setups I’ve had. The Grafana UI, specifically the timeline panel, displays events over time in a great way.