
Creating Awesome Alertmanager Templates for Slack

4 months ago


Prometheus, Grafana and Alertmanager are my default stack when deploying a monitoring system. The Prometheus and Grafana parts are well documented, and there are plenty of open source examples of how to get the most out of them. Alertmanager, on the other hand, gets less attention. Its use case looks fairly simple, but it can get complex: the templating language has lots of features and functionality. Alertmanager configuration, templates and rules make a huge difference, especially when the team has an approach of 'not staring at dashboards all day'. You can create detailed Slack alerts with plenty of information, such as dashboard links, runbook links and alert descriptions, which fit well with the rest of your ChatOps stack. This post goes through how to build efficient Slack alerts.

Basics: guidelines for alert names, labels, and annotations

The monitoring-mixin documentation goes through guidelines for alert names, labels and annotations. They are more or less standards, or best practices, that many monitoring-mixins follow. Monitoring-mixins are OSS collections of Prometheus alerts and rules and Grafana dashboards for a specific technology. Even if you're not familiar with monitoring-mixins, there's a good chance that you've used them. For example, the kube-prometheus project (the backbone of the kube-prometheus-stack Helm chart) uses both the Kubernetes-monitoring and Node exporter mixins, amongst others. The great thing about these guidelines is that a single Alertmanager template can make use of labels and annotations shared across all alerts. We expect all OSS Prometheus rules and alerts to follow a specific pattern, and we apply the same pattern internally to our own alerts and rules.

The original documentation describes the guidelines well, so I'll just summarize them. Use the mandatory annotation summary to summarize the alert, and the optional annotation description for any details regarding the alert. Use the label severity to indicate the severity of an alert, with the following values:

  • info - not routed anywhere, but provides insights when debugging.
  • warning - not urgent enough to wake someone up or to require immediate action. In my case warnings go to Slack; larger organizations might queue them into a bug tracking system.
  • critical - someone gets paged.

Additionally, there are two optional but recommended annotations:

  • dashboard_url - a URL to a dashboard related to the alert.
  • runbook_url - a URL to a runbook for handling the alert.
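
As a rough illustration of these guidelines, a Prometheus alerting rule could look like the sketch below. The group name, alert name, expression, threshold and both URLs are made up for this example and are not taken from any mixin.

```yaml
groups:
  - name: example-app  # hypothetical group name
    rules:
      - alert: ExampleAppHighErrorRate  # hypothetical alert name
        # Hypothetical expression: share of 5xx responses per service over 5 minutes.
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 10m
        labels:
          severity: warning  # one of info, warning, critical
        annotations:
          # Mandatory, and generic to the whole alert group.
          summary: A service is returning a high rate of 5xx responses.
          # Optional, and specific to each individual alert.
          description: 'Service {{ $labels.service }} has a 5xx error rate of {{ $value | humanizePercentage }}.'
          # Optional but recommended; both URLs are placeholders.
          dashboard_url: https://grafana.example.com/d/example-app
          runbook_url: https://runbooks.example.com/ExampleAppHighErrorRate
```

Keeping summary free of label values means it reads the same for every alert in a group, while description carries the per-alert details.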

From the above guidelines we can conclude that we'll route warnings to Slack. However, I also route critical alerts to Slack, since Slack messages provide easy access to dashboard links, runbook links and extensive information about the alert. This is more difficult to provide through a paging incident sent to your phone. It's also easier to interact with alerts using Slack on your computer than with text messages on your phone. We'll use summary as the Slack message headline and description for anything detailed regarding the alert. We'll also add buttons with links to your dashboards and runbooks.
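
A minimal sketch of such routing in the Alertmanager configuration could look as follows. The receiver names null and pager are assumptions for this example, and all three receivers are assumed to be defined elsewhere in the same file. The matchers syntax needs a reasonably recent Alertmanager version.

```yaml
route:
  # Assumed catch-all receiver with no integrations, so info alerts are not sent anywhere.
  receiver: 'null'
  routes:
    # Critical alerts page someone. continue: true lets them
    # also match the Slack route below.
    - receiver: pager  # hypothetical paging receiver
      matchers:
        - severity="critical"
      continue: true
    # Warnings and critical alerts both end up in Slack.
    - receiver: slack
      matchers:
        - severity=~"warning|critical"
```

Without continue: true on the first route, a critical alert would stop at the paging receiver and never reach Slack.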

Slack template

Alertmanager has great support for custom templates that make use of both labels and annotations. We'll create one inspired by Monzo's template, adjusted to the above guidelines and to my preferences. First we'll define a silence link. This is great because we can silence alerts directly from Slack with all the labels needed to target that individual alert, rather than silencing a whole group of alerts. For example, you can silence the alert for a specific container that's crashing without silencing it for all other containers.

{{ define "__alert_silence_link" -}}
    {{ .ExternalURL }}/#/silences/new?filter=%7B
    {{- range .CommonLabels.SortedPairs -}}
        {{- if ne .Name "alertname" -}}
            {{- .Name }}%3D"{{- .Value -}}"%2C%20
        {{- end -}}
    {{- end -}}
    alertname%3D"{{- .CommonLabels.alertname -}}"%7D
{{- end }}

We'll then define the alert severity template with the levels critical, warning and info. Even though we mentioned that alerts with the severity info or critical might not be routed to Slack, we'll add them to the template in case you have a different preference.

{{ define "__alert_severity" -}}
    {{- if eq .CommonLabels.severity "critical" -}}
    *Severity:* `Critical`
    {{- else if eq .CommonLabels.severity "warning" -}}
    *Severity:* `Warning`
    {{- else if eq .CommonLabels.severity "info" -}}
    *Severity:* `Info`
    {{- else -}}
    *Severity:* :question: {{ .CommonLabels.severity }}
    {{- end }}
{{- end }}

Now we'll define our title. The title consists of the status and, if the status is firing, the number of firing alerts, followed by the alert name.

{{ define "slack.title" -}}
  [{{ .Status | toUpper -}}
  {{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{- end -}}
  ] {{ .CommonLabels.alertname }}
{{- end }}

The core Slack message text consists of:

  • Summary.
  • Severity.
  • A range of descriptions individual to each alert.

According to the guidelines, the summary should be generic to the alert group, i.e. it should contain no specifics for each alert in the group. Therefore, there is no loop over the summary annotation for each individual alert in the group; we simply take it from the first alert. However, we loop through the description for each alert, as we expect each individual alert description to be unique. The description annotation can have dynamic labels and values in the text, for example which pod is crashing, what service is having high 5xx errors, or what node has too high CPU usage.

Even though the guidelines are set and most projects are moving towards them, some OSS mixins still use the message annotation instead of summary + description. Therefore we'll add a conditional statement that handles that case: we loop through the message annotation if it's present.

{{ define "slack.text" -}}

    {{ template "__alert_severity" . }}
    {{- if (index .Alerts 0).Annotations.summary }}
    {{- "\n" -}}
    *Summary:* {{ (index .Alerts 0).Annotations.summary }}
    {{- end }}

    {{ range .Alerts }}

        {{- if .Annotations.description }}
        {{- "\n" -}}
        {{ .Annotations.description }}
        {{- "\n" -}}
        {{- end }}
        {{- if .Annotations.message }}
        {{- "\n" -}}
        {{ .Annotations.message }}
        {{- "\n" -}}
        {{- end }}

    {{- end }}

{{- end }}

We'll also define a color, which is used in the Slack message as an indicator of the alert's severity.

{{ define "slack.color" -}}
    {{ if eq .Status "firing" -}}
        {{ if eq .CommonLabels.severity "warning" -}}
            warning
        {{- else if eq .CommonLabels.severity "critical" -}}
            danger
        {{- else -}}
            #439FE0
        {{- end -}}
    {{ else -}}
    good
    {{- end }}
{{- end }}

Configuring the Slack receiver

We'll make use of the templates we created when defining our Slack receiver, and we'll also add a couple of buttons:

  • A runbook button using the runbook_url.
  • A query button which links to the Prometheus query that triggered the alert.
  • A dashboard button that links to the dashboard for the alert.
  • A silence button which pre-populates all fields to silence the alert.

The color, title and text all come from the template we created above.

receivers:
  - name: slack
    slack_configs:
      - channel: '#alerts-<env>'
        color: '{{ template "slack.color" . }}'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        send_resolved: true
        actions:
          - type: button
            text: 'Runbook :green_book:'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Query :mag:'
            url: '{{ (index .Alerts 0).GeneratorURL }}'
          - type: button
            text: 'Dashboard :chart_with_upwards_trend:'
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
          - type: button
            text: 'Silence :no_bell:'
            url: '{{ template "__alert_silence_link" . }}'
templates: ['/etc/alertmanager/configmaps/**/*.tmpl']
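
The templates path above looks like the layout used by the Prometheus Operator, which mounts each ConfigMap listed on the Alertmanager resource under /etc/alertmanager/configmaps/<configmap-name>/. Assuming that setup, the template file could be shipped roughly as in the sketch below. The resource names, the namespace and the file name are placeholders.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-slack-templates  # hypothetical name
  namespace: monitoring               # hypothetical namespace
data:
  # Any file name ending in .tmpl matches the glob in the templates setting.
  slack.tmpl: |
    {{ define "slack.title" -}}
      [{{ .Status | toUpper -}}
      {{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{- end -}}
      ] {{ .CommonLabels.alertname }}
    {{- end }}
    {{/* the remaining definitions from this post go here */}}
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main             # hypothetical name
  namespace: monitoring  # hypothetical namespace
spec:
  # Each ConfigMap listed here is mounted into the Alertmanager pods.
  configMaps:
    - alertmanager-slack-templates
```

If you run Alertmanager without the operator, any directory works as long as the templates setting points at the .tmpl files.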

Screenshots

Here are a couple of screenshots of alerts with various statuses, severities and button links.

A single KubePodCrashing alert that is resolved.


Multiple HelmOperatorFailedReleaseChart alerts that have the severity critical and that have a dashboard button link.


A single KubeDeploymentReplicasMismatch alert that has the severity warning and that has a runbook button link.


Summary

This post shows how to build efficient Slack alerts that go together with any other ChatOps setup you have running. You might be running incident management through Slack, where the incident commander leads and creates threads in Slack, or you might use Dispatch's Slack integration. In all of these cases the Slack alerts should integrate easily and add great value.

