A while back, I wrote a blog post on creating a low-cost managed Kubernetes cluster. The solution centers around Google Kubernetes Engine’s (GKE) free zonal cluster and preemptible node pools, which together make for a very low-cost Kubernetes cluster that is useful for learning or for small workloads. I still run the same setup today; however, over time the default GKE cluster has become bloated. Google has enabled logging, monitoring, and other features by default, which is great for production workloads, but if you are looking to cut costs, many of these features don’t make sense.
The first blog post is available here and remains relevant today. This blog post follows up on the previous one, focusing on reducing costs by disabling unnecessary features and services in GKE. The GKE Terraform module serves as a basis for the configuration.
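For context, every snippet below extends a single module block like the following. This is only a minimal sketch: the project, network, and IP-range values are placeholders of mine, and the module’s documentation lists the full set of required inputs.
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"

  # Placeholder values; replace with your own project and network.
  project_id        = "my-project-id"
  name              = "low-cost-cluster"
  regional          = false # a zonal cluster, which qualifies for the free tier
  region            = "europe-west1"
  zones             = ["europe-west1-b"]
  network           = "default"
  subnetwork        = "default"
  ip_range_pods     = "pods-range"
  ip_range_services = "services-range"
}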
Disabling Logging
GKE sends logs to Google Cloud Logging by default. Log storage can account for a large share of the cluster cost in some cases, and turning it off can save a lot of money. The following Terraform snippet disables logging:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
logging_service = "none"
}
Grafana’s Loki is a great alternative for logging, and it’s easy to set up. You can find the Loki documentation here.
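If you manage add-ons with Terraform anyway, a minimal sketch using the Helm provider might look like this. The chart choice and namespace are assumptions on my part (the loki-stack chart bundles Loki with Promtail); check the Loki docs for the currently recommended chart.
resource "helm_release" "loki" {
  # Assumes the helm provider is already configured against the cluster.
  name             = "loki"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "loki-stack" # bundles Loki and Promtail for log shipping
  namespace        = "logging"
  create_namespace = true
}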
Note: this won’t turn off audit logs or cluster autoscaler logs; those remain available and don’t cost anything.
Disabling Managed Prometheus
GKE enables managed Prometheus by default in new clusters, and time series databases are in general resource-intensive. If you don’t need it, you can turn it off. The following Terraform snippet disables monitoring:
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
  monitoring_service = "none"
  # On recent module versions, managed Prometheus has its own toggle:
  monitoring_enable_managed_prometheus = false
}
An alternative is deploying kube-prometheus yourself. This lets you run Prometheus on your own terms rather than relying on Google’s managed Prometheus, and it should be a much more cost-effective solution, albeit more complex to operate.
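As a sketch, the kube-prometheus-stack Helm chart packages the same components and can be installed with the Helm provider; the namespace here is my own choice:
resource "helm_release" "kube_prometheus_stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true
}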
Disable Managed Backups
Managed backups cost $1 per pod per month, which can add up quickly. If you don’t need backups, you can turn them off. The following Terraform snippet disables the backup agent:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
gke_backup_agent_config = false
}
Velero is a great alternative for backups, and it’s easy to set up. You can find the Velero documentation here. At $1 per pod per month, the managed offering is expensive for what it does; Velero is much more cost-effective.
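A minimal sketch of installing Velero with the Helm provider follows. The backup storage configuration (GCS bucket, GCP plugin, credentials) is intentionally omitted here, since the values layout depends on the chart version; see the Velero docs.
resource "helm_release" "velero" {
  name             = "velero"
  repository       = "https://vmware-tanzu.github.io/helm-charts"
  chart            = "velero"
  namespace        = "velero"
  create_namespace = true

  # The GCS bucket, GCP plugin, and credentials still need to be set
  # through the chart's values; consult the Velero docs for details.
}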
Disable VPAs
If you aren’t using the Vertical Pod Autoscaler (VPA), you can turn it off. The following Terraform snippet disables VPA:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
enable_vertical_pod_autoscaling = false
}
An alternative is Fairwinds’ VPA Helm chart, which lets you deploy the VPA on your own terms rather than relying on Google’s managed VPA.
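A minimal sketch with the Helm provider, assuming Fairwinds’ chart repository and chart name:
resource "helm_release" "vpa" {
  name             = "vpa"
  repository       = "https://charts.fairwinds.com/stable"
  chart            = "vpa" # Fairwinds' packaging of the upstream VPA
  namespace        = "vpa"
  create_namespace = true
}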
Optimize Autoscaling
Ensure the node pools you create use the OPTIMIZE_UTILIZATION autoscaling profile. This configures the autoscaler to scale the cluster down more aggressively: underutilized nodes are removed sooner, so the cluster shrinks faster. The following Terraform snippet configures the autoscaler to optimize utilization:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
cluster_autoscaling = {
autoscaling_profile = "OPTIMIZE_UTILIZATION"
}
}
Disable Network Policies
When network policy enforcement is enabled, GKE runs a dedicated policy provider (Calico, or Cilium when using Dataplane V2). This is great for production workloads, but for a small cluster it’s unnecessary. Disabling network policies saves resources and reduces complexity. The following Terraform snippet disables network policy enforcement:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
network_policy {
enabled = false
}
}
Disable Cost Allocation
Cost allocation is a feature that breaks down cluster costs per namespace and label in the BigQuery billing export. If you don’t need it, you can turn it off. The following Terraform snippet disables cost allocation:
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
  enable_cost_allocation = false
}
As an alternative, I’ve written a blog post on Kubernetes Cost Tracking with OpenCost, Prometheus, and Grafana. The solution deploys the CNCF project OpenCost, which allows you to track Kubernetes costs using Prometheus and Grafana.
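If you want to keep everything in Terraform, a sketch of installing OpenCost with the Helm provider might look like this; the repository URL is as I recall it, so verify it against the OpenCost docs:
resource "helm_release" "opencost" {
  name             = "opencost"
  repository       = "https://opencost.github.io/opencost-helm-chart"
  chart            = "opencost"
  namespace        = "opencost"
  create_namespace = true

  # OpenCost reads metrics from a Prometheus reachable in-cluster.
}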
Scaling down Managed Pods
GKE deploys a number of managed pods by default. You can scale some of them down to save resources. However, this requires editing ConfigMaps in the kube-system namespace and changing the number of replicas for the components you want to scale down.
Konnectivity Agent
Edit the konnectivity-agent-autoscaler-config ConfigMap in the kube-system namespace (for example with kubectl edit configmap konnectivity-agent-autoscaler-config -n kube-system) and change the number of replicas to a smaller number:
apiVersion: v1
data:
  ladder: |-
    {
      "coresToReplicas": [],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [3, 3],
        [4, 3],
        [5, 3],
        [6, 3],
        [10, 8],
        [100, 12],
        [250, 18],
        [500, 25],
        [2000, 50],
        [5000, 100]
      ]
    }
kind: ConfigMap
metadata:
  annotations:
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: konnectivity-agent-autoscaler-config
  namespace: kube-system
Note: by default, a 6-node cluster runs 6 konnectivity agents; the ladder above scales that down to 3.
Kube DNS
Edit the kube-dns-autoscaler ConfigMap in the kube-system namespace and change the minimum number of replicas to the value you want:
apiVersion: v1
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":3}'
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
Note: this snippet actually increases the minimum number of kube-dns replicas from 2 to 3. I had DNS availability issues when the default of 2 was combined with aggressive scale-downs and preemptible node pools.
Using your Own Node Pools
GKE Autopilot is a cool feature, but I’ve had mixed experiences with it. In high-scale production environments where cost isn’t the biggest factor, it performs well. In staging clusters, however, it tends to provision nodes that are too small, so system and monitoring DaemonSets consume a large portion of each node’s resources. In those circumstances, managing your own node pools is preferable. The following Terraform snippet creates a node pool:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Using Preemptible Node Pools
Preemptible nodes are a great way to save costs. They’re usually 60-70% cheaper than regular nodes. The following Terraform snippet creates a preemptible node pool:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
preemptible = true
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Use Low-Cost Disk Options
Using standard persistent disks instead of SSDs can save costs, and the usual default of 100 GB of storage is excessive as well. The following Terraform snippet creates a node pool with standard disks and 40 GB of storage:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
preemptible = true
disk_size_gb = 40
disk_type = "pd-standard"
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Summary
Combining most of the preceding optimizations can save a significant amount of money. The individual savings might not seem like much, but together they can make up a large portion of the cluster cost. This was less of an issue when GKE was first released, but over time Google has added more features and services to the default cluster configuration. As noted earlier, this is great for production workloads, but for small clusters it can be overkill; self-hosting some of these components is much more affordable. The Terraform snippets in this post can help you optimize your GKE cluster and save money.
Here are additional blog posts I’ve written on Kubernetes cost management: