A while back, I wrote a blog post on creating a low-cost managed Kubernetes cluster. The solution centers around Google Kubernetes Engine’s (GKE) free zonal cluster and preemptible node pools, which together make for a very low-cost Kubernetes cluster that is useful for learning or for small workloads. I still run the same setup today; however, over time the default GKE cluster has become bloated. Google has enabled logging, monitoring, and other features by default, which is great for production workloads, but if you are looking to cut costs, many of these features don’t make sense.
The first blog post is available here and remains relevant today. This blog post follows up on the previous one, focusing on reducing costs by disabling unnecessary features and services in GKE. The GKE Terraform module serves as a basis for the configuration.
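For context, every snippet below extends a single module block like the following. This is only a minimal sketch: the project, network, and IP-range values are placeholders of mine, and the module’s documentation lists the full set of required inputs.
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"

  # Placeholder values; replace with your own project and network.
  project_id        = "my-project-id"
  name              = "low-cost-cluster"
  regional          = false # a zonal cluster, which qualifies for the free tier
  region            = "europe-west1"
  zones             = ["europe-west1-b"]
  network           = "default"
  subnetwork        = "default"
  ip_range_pods     = "pods-range"
  ip_range_services = "services-range"
}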
Disabling Logging
GKE sends logs to Google Cloud Logging by default. Log storage can account for a large share of the cluster cost in some cases, and turning it off can save a lot of money. The following Terraform snippet disables logging:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
logging_service = "none"
}
Grafana’s Loki is a great alternative for logging, and it’s easy to set up. You can find the Loki documentation here.
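If you manage add-ons with Terraform anyway, a minimal sketch using the Helm provider might look like this. The chart choice and namespace are assumptions on my part (the loki-stack chart bundles Loki with Promtail); check the Loki docs for the currently recommended chart.
resource "helm_release" "loki" {
  # Assumes the helm provider is already configured against the cluster.
  name             = "loki"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "loki-stack" # bundles Loki and Promtail for log shipping
  namespace        = "logging"
  create_namespace = true
}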
Note: this won’t turn off audit logs or cluster autoscaler logs; those remain available and don’t cost anything.
Disabling Managed Prometheus
GKE enables managed Prometheus by default in new clusters, and time series databases are in general resource-intensive. If you don’t need it, you can turn it off. The following Terraform snippet disables monitoring:
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
  monitoring_service = "none"
  # On recent module versions, managed Prometheus has its own toggle:
  monitoring_enable_managed_prometheus = false
}
An alternative is deploying kube-prometheus yourself. This lets you run Prometheus on your own terms rather than relying on Google’s managed Prometheus, and it should be a much more cost-effective solution, albeit more complex to operate.
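As a sketch, the kube-prometheus-stack Helm chart packages the same components and can be installed with the Helm provider; the namespace here is my own choice:
resource "helm_release" "kube_prometheus_stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true
}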
Disable Managed Backups
Managed backups cost $1 per pod per month, which can add up quickly. If you don’t need backups, you can turn them off. The following Terraform snippet disables the backup agent:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
gke_backup_agent_config = false
}
Velero is a great alternative for backups, and it’s easy to set up. You can find the Velero documentation here. At $1 per pod per month, the managed offering is expensive for what it does; Velero is much more cost-effective.
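A minimal sketch of installing Velero with the Helm provider follows. The backup storage configuration (GCS bucket, GCP plugin, credentials) is intentionally omitted here, since the values layout depends on the chart version; see the Velero docs.
resource "helm_release" "velero" {
  name             = "velero"
  repository       = "https://vmware-tanzu.github.io/helm-charts"
  chart            = "velero"
  namespace        = "velero"
  create_namespace = true

  # The GCS bucket, GCP plugin, and credentials still need to be set
  # through the chart's values; consult the Velero docs for details.
}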
Disable VPAs
If you aren’t using the Vertical Pod Autoscaler (VPA), you can turn it off. The following Terraform snippet disables VPA:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
enable_vertical_pod_autoscaling = false
}
An alternative is Fairwinds’ VPA Helm chart, which lets you deploy the VPA on your own terms rather than relying on Google’s managed VPA.
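A minimal sketch with the Helm provider, assuming Fairwinds’ chart repository and chart name:
resource "helm_release" "vpa" {
  name             = "vpa"
  repository       = "https://charts.fairwinds.com/stable"
  chart            = "vpa" # Fairwinds' packaging of the upstream VPA
  namespace        = "vpa"
  create_namespace = true
}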
Optimize Autoscaling
Ensure the node pools you create use the OPTIMIZE_UTILIZATION autoscaling profile. This configures the autoscaler to scale the cluster down more aggressively: underutilized nodes are removed sooner, so the cluster shrinks faster. The following Terraform snippet configures the autoscaler to optimize utilization:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
cluster_autoscaling = {
autoscaling_profile = "OPTIMIZE_UTILIZATION"
}
}
Disable Network Policies
When network policy enforcement is enabled, GKE runs a dedicated policy provider (Calico, or Cilium when using Dataplane V2). This is great for production workloads, but for a small cluster it’s unnecessary. Disabling network policies saves resources and reduces complexity. The following Terraform snippet disables network policy enforcement:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
network_policy {
enabled = false
}
}
Disable Cost Allocation
Cost allocation is a feature that breaks down cluster costs per namespace and label in the BigQuery billing export. If you don’t need it, you can turn it off. The following Terraform snippet disables cost allocation:
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
  enable_cost_allocation = false
}
As an alternative, I’ve written a blog post on Kubernetes Cost Tracking with OpenCost, Prometheus, and Grafana. The solution deploys the CNCF project OpenCost, which allows you to track Kubernetes costs using Prometheus and Grafana.
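If you want to keep everything in Terraform, a sketch of installing OpenCost with the Helm provider might look like this; the repository URL is as I recall it, so verify it against the OpenCost docs:
resource "helm_release" "opencost" {
  name             = "opencost"
  repository       = "https://opencost.github.io/opencost-helm-chart"
  chart            = "opencost"
  namespace        = "opencost"
  create_namespace = true

  # OpenCost reads metrics from a Prometheus reachable in-cluster.
}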
Scaling down Managed Pods
GKE deploys a number of managed pods by default. You can scale some of them down to save resources. However, this requires editing ConfigMaps in the kube-system namespace and changing the number of replicas for the components you want to scale down.
Konnectivity Agent
Edit the konnectivity-agent-autoscaler-config ConfigMap in the kube-system namespace (for example with kubectl edit configmap konnectivity-agent-autoscaler-config -n kube-system) and change the number of replicas to a smaller number:
apiVersion: v1
data:
  ladder: |-
    {
      "coresToReplicas": [],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [3, 3],
        [4, 3],
        [5, 3],
        [6, 3],
        [10, 8],
        [100, 12],
        [250, 18],
        [500, 25],
        [2000, 50],
        [5000, 100]
      ]
    }
kind: ConfigMap
metadata:
  annotations:
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: konnectivity-agent-autoscaler-config
  namespace: kube-system
Note: by default, a 6-node cluster runs 6 konnectivity agents; the ladder above scales that down to 3.
Kube DNS
Edit the kube-dns-autoscaler ConfigMap in the kube-system namespace and change the minimum number of replicas to the value you want:
apiVersion: v1
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":3}'
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
Note: this snippet actually increases the minimum number of kube-dns replicas from 2 to 3. I had DNS availability issues when the default of 2 was combined with aggressive scale-downs and preemptible node pools.
Using your Own Node Pools
GKE Autopilot is a cool feature, but I’ve had mixed experiences with it. In high-scale production environments where cost isn’t the biggest factor, it performs well. In staging clusters, however, it tends to provision nodes that are too small, so system and monitoring DaemonSets consume a large portion of each node’s resources. In those circumstances, managing your own node pools is preferable. The following Terraform snippet creates a node pool:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Using Preemptible Node Pools
Preemptible nodes are a great way to save costs. They’re usually 60-70% cheaper than regular nodes. The following Terraform snippet creates a preemptible node pool:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
preemptible = true
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Use Low-Cost Disk Options
Using standard persistent disks instead of SSDs can save costs, and the usual default of 100 GB of storage is excessive as well. The following Terraform snippet creates a node pool with standard disks and 40 GB of storage:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
node_pools = [
{
name = "default-europe-west1-3d13"
machine_type = "t2d-standard-1"
min_count = 2
max_count = 3
auto_repair = true
auto_upgrade = false
preemptible = true
disk_size_gb = 40
disk_type = "pd-standard"
initial_node_count = 2
version = data.google_container_engine_versions.node_pool_version.latest_node_version
},
}
Summary
Combining most of the preceding optimizations can save a significant amount of money. The individual savings might not seem like much, but together they can make up a large portion of the cluster cost. This was less of an issue when GKE was first released, but over time Google has added more features and services to the default cluster configuration. As noted earlier, this is great for production workloads, but for small clusters it can be overkill; self-hosting some of these components is much more affordable. The Terraform snippets in this post can help you optimize your GKE cluster and save money.
Here are additional blog posts I’ve written on Kubernetes cost management: