Provisioning a Private Talos Kubernetes Cluster on Hetzner Cloud

This is a follow-up to Private Networking on Hetzner Cloud with Tailscale.

The previous post was about the network. This one is about what I put inside that network: a private Kubernetes cluster running Talos on Hetzner Cloud.

The important part is not just “Kubernetes on Hetzner Cloud”. There are many posts about it. The part I cared about was making the cluster private from the first boot. No public IPs on the control plane. No public IPs on the workers. Access only through the Tailnet.

That made Talos a good fit. No package manager, no SSH. You give it machine configuration, it becomes a Kubernetes node, and that is mostly it.

Mostly.

What I Wanted from the Cluster

Private-only nodes: every Kubernetes node should live only on the Hetzner private network.
Terraform-managed bootstrap: machines, Talos config, kubeconfig, and base add-ons should come from code.
Talos: no manual server maintenance.
Separate node pools: platform components should not fight application workloads.
GitOps: Terraform can bootstrap ArgoCD, then ArgoCD owns the platform.

The goal was to build something small enough that I could understand every moving part, but powerful enough that I could run actual projects on it.

Cluster Shape

The private network from the previous post gives the cluster a /24 to live in. I split that range into explicit chunks for the nodes, and picked separate service and pod ranges outside it:

Control plane: 10.0.128.16/28
Platform workers: 10.0.128.32/27
General workers: 10.0.128.64/27
Service network: 10.0.192.0/21
Pod network: 10.0.200.0/19

The control plane has three nodes. Platform workers run ArgoCD and other platform components. General workers run applications like snapbyte.dev.

flowchart TB
    Tailnet((Tailnet))
    Internet((Internet))

    subgraph VPC["Private network 10.0.0.0/16"]
        subgraph Subnet["Subnet 10.0.128.0/24"]
            NAT["NAT Gateway"]

            subgraph CP["Control Plane 10.0.128.16/28"]
                CP1["cp-1"]
                CP2["cp-2"]
                CP3["cp-3"]
            end

            subgraph Platform["Platform Workers 10.0.128.32/27"]
                ArgoCD["ArgoCD"]
                PlatformApps["Platform components"]
            end

            subgraph General["General Workers 10.0.128.64/27"]
                PublicApps["Public apps"]
                InternalApps["Internal apps"]
            end
        end
    end

    Tailnet -->|kubectl and talosctl| CP1
    Tailnet --> Platform
    Tailnet --> General
    CP --> NAT
    Platform --> NAT
    General --> NAT
    NAT --> Internet

The Kubernetes API endpoint is the first control plane node’s private IP:

locals {
  cluster_endpoint = "https://${local.control_plane_private_ips[0]}:6443"
}

That endpoint is only useful if you are already inside the private network through Tailscale.

Building the Talos Image

Before Terraform could create any nodes, I needed a Talos image that Hetzner could boot.

I started this cluster on Talos v1.11.3. The later v1.12.6 upgrade came from an operational incident, not the initial design.

Hetzner does not give you Talos as an image option, so I build my own snapshot with Packer. The flow is based on hcloud-talos/terraform-hcloud-talos.

It starts a temporary Hetzner server, downloads the Talos raw image from the Talos Image Factory, writes it to disk, and saves the result as a snapshot.

variable "talos_version" {
  type    = string
  default = "v1.11.3"
}

source "hcloud" "talos" {
  rescue       = "linux64"
  image        = "debian-11"
  location     = "nbg1"
  server_type  = "cx22"
  ssh_username = "root"

  snapshot_name = "talos-${var.talos_version}-amd64"
  snapshot_labels = {
    type    = "infra"
    os      = "talos"
    version = var.talos_version
    arch    = "amd64"
  }
}

The label part is the important bit. Terraform can later find the image by label selector instead of relying on the snapshot name:

data "hcloud_image" "talos" {
  with_selector = "os=talos,type=infra,version=${var.talos_version},arch=amd64"
}
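
A minimal sketch of the build step that does the download and disk write, assuming the rescue system exposes the target disk as /dev/sda and that a Packer variable named talos_schematic_id (hypothetical here) holds the Image Factory schematic ID:

```hcl
# Hypothetical variable: schematic ID generated by the Talos Image Factory.
variable "talos_schematic_id" {
  type = string
}

build {
  sources = ["source.hcloud.talos"]

  # The temporary server boots into rescue mode, so its disk can be overwritten.
  provisioner "shell" {
    inline = [
      # Download the compressed raw disk image for Hetzner Cloud.
      "wget -O /tmp/talos.raw.xz https://factory.talos.dev/image/${var.talos_schematic_id}/${var.talos_version}/hcloud-amd64.raw.xz",
      # Decompress and write it straight onto the disk (assumed to be /dev/sda).
      "xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync",
    ]
  }
}
```

When the build finishes, Packer snapshots the disk and applies the labels from the source block.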

Worker Pools

Before creating the machines, I needed a way to describe what kind of nodes I wanted.

This is basically the same idea as node pools in managed Kubernetes offerings. GKE, EKS, and AKS all let you create groups of nodes with different sizes, labels, or taints. I wanted the same mental model.

Each pool also gets its own Hetzner placement group. That tells Hetzner to spread the nodes in that pool across different physical hosts where possible. It does not make the pool highly available, but it avoids the failure mode where every platform worker ends up on the same machine.

The pool config looks like this:

worker_pools = {
  platform = {
    count      = 3
    sku        = "cx33"
    cidr       = "10.0.128.32/27"
    datacenter = "nbg1-dc3"
    labels     = { purpose = "platform" }
  }

  general = {
    count      = 3
    sku        = "cx23"
    cidr       = "10.0.128.64/27"
    datacenter = "nbg1-dc3"
    labels     = { purpose = "general" }
  }
}

This makes the Terraform code easier to reason about. It lets me create named groups of machines with known CIDR ranges, placement groups, and labels.
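
The per-pool placement group can be sketched as one resource with for_each over the pool map. The resource name and naming scheme below are assumptions, not necessarily what the module uses:

```hcl
# One spread placement group per worker pool (resource name is illustrative).
resource "hcloud_placement_group" "worker" {
  for_each = var.worker_pools

  name   = "${local.cluster_name}-${each.key}"
  type   = "spread" # ask Hetzner to place members on different physical hosts
  labels = each.value.labels
}
```

A server joins a group by setting placement_group_id on its hcloud_server resource.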

Terraform Creates the Machines

The node resources are just regular Hetzner servers, but with the public network disabled.

resource "hcloud_server" "control_plane" {
  count = var.control_plane.count

  name        = "${local.cluster_name}-cp-${count.index + 1}"
  datacenter  = var.control_plane.datacenter
  image       = data.hcloud_image.talos.id
  server_type = var.control_plane.sku

  public_net {
    ipv4_enabled = false
    ipv6_enabled = false
  }

  network {
    network_id = var.network_id
    ip         = local.control_plane_ips[count.index]
  }
}

The worker pool map from the previous section gets flattened into individual servers. The code is not elegant, but the outcome is simple: if I add another worker to the general pool, it gets the next private IP in that pool and the right labels.

locals {
  workers_flat = merge([
    for pool_name, pool_config in var.worker_pools : {
      for i in range(1, pool_config.count + 1) :
      "${pool_name}-${i}" => {
        pool       = pool_name
        index      = i
        sku        = pool_config.sku
        datacenter = pool_config.datacenter
        labels     = pool_config.labels
      }
    }
  ]...)
}
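
A sketch of how the flattened map could turn into servers. It assumes the worker IP is derived with cidrhost from the pool CIDR and the worker index; the real module may allocate addresses differently:

```hcl
resource "hcloud_server" "worker" {
  for_each = local.workers_flat

  name        = "${local.cluster_name}-${each.key}"
  datacenter  = each.value.datacenter
  image       = data.hcloud_image.talos.id
  server_type = each.value.sku
  labels      = each.value.labels

  # Private-only, same as the control plane.
  public_net {
    ipv4_enabled = false
    ipv6_enabled = false
  }

  network {
    network_id = var.network_id
    # Assumption: the i-th worker takes the i-th host address of its pool,
    # e.g. cidrhost("10.0.128.64/27", 1) is "10.0.128.65".
    ip = cidrhost(var.worker_pools[each.value.pool].cidr, each.value.index)
  }
}
```

Because the key is "<pool>-<index>", raising count in a pool adds one new map entry and leaves the existing servers untouched.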

Talos Bootstraps Kubernetes

Once the servers exist, the Talos Terraform provider takes over. It generates machine secrets, creates control plane and worker configs, applies patches, bootstraps the first control plane node, waits for Talos cluster health, and gives me a kubeconfig.

There is one base config for control plane nodes, and one worker base config per pool:

data "talos_machine_configuration" "control_plane" {
  cluster_name       = local.cluster_name
  cluster_endpoint   = local.cluster_endpoint
  machine_type       = "controlplane"
  machine_secrets    = talos_machine_secrets.this.machine_secrets
  talos_version      = var.talos_version
  kubernetes_version = var.kubernetes_version
}

data "talos_machine_configuration" "worker" {
  for_each = var.worker_pools

  cluster_name       = local.cluster_name
  cluster_endpoint   = local.cluster_endpoint
  machine_type       = "worker"
  machine_secrets    = talos_machine_secrets.this.machine_secrets
  talos_version      = var.talos_version
  kubernetes_version = var.kubernetes_version
}

Then Terraform applies the patched control-plane config to each control-plane node:

resource "talos_machine_configuration_apply" "control_plane" {
  count = var.control_plane.count

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.control_plane.machine_configuration
  node                        = local.control_plane_private_ips[count.index]

  config_patches = [
    yamlencode(local.control_plane_patch),
  ]
}

Workers follow the same pattern, except the patch comes from the worker pool:

resource "talos_machine_configuration_apply" "worker" {
  for_each = local.workers_flat

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.worker[each.value.pool].machine_configuration
  node                        = flatten(hcloud_server.worker[each.key].network)[0].ip

  config_patches = [
    yamlencode(local.worker_pool_patches[each.value.pool]),
  ]
}
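
The local.worker_pool_patches map is not shown in this post. One way it could be built, assuming it layers the pool labels on top of a shared local.common_patch through Talos's machine.nodeLabels field:

```hcl
locals {
  # Assumption: one full patch per pool, the common patch plus node labels.
  worker_pool_patches = {
    for pool_name, pool_config in var.worker_pools :
    pool_name => merge(local.common_patch, {
      # merge() is shallow, so the machine block is merged explicitly.
      machine = merge(local.common_patch.machine, {
        nodeLabels = pool_config.labels
      })
    })
  }
}
```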

Nodes need the right installer image, node IP selection, default route, pod and service CIDRs, and CNI behavior.

This is the part I messed up, and it later caused the first real failure.

A simplified version of the patch, showing the final intent, looks like this:

common_patch = {
  machine = {
    install = {
      disk  = var.install_disk
      image = "factory.talos.dev/installer/${var.talos_schematic_id}:${var.talos_version}"
    }

    kubelet = {
      extraArgs = {
        "cloud-provider"             = "external"
        "rotate-server-certificates" = "true"
      }
      nodeIP = {
        validSubnets = local.node_cidrs
      }
    }

    network = {
      interfaces = [
        {
          interface = "eth0"
          routes = [{
            network = "0.0.0.0/0"
            gateway = var.gateway
          }]
          dhcp = true
        },
        {
          interface = "eth1"
          ignore    = true
        }
      ]
    }

    features = {
      hostDNS = {
        enabled              = true
        forwardKubeDNSToHost = true
        resolveMemberNames   = true
      }
    }
  }

  cluster = {
    network = {
      podSubnets     = [var.pod_ipv4_cidr]
      serviceSubnets = [var.service_ipv4_cidr]
      cni = {
        name = "none"
      }
    }
    proxy = {
      disabled = true
    }
  }
}

Most of that is ordinary cluster setup. The important bits are: nodeIP.validSubnets, the default route, and the interface names. If those are wrong, the cluster does not fail in an obvious way. It half-works, which is worse.

The control plane patch adds the other important private-networking detail:

cluster = {
  etcd = {
    advertisedSubnets = [var.control_plane.cidr]
  }

  controllerManager = {
    extraArgs = {
      "cloud-provider"           = "external"
      "node-cidr-mask-size-ipv4" = "24"
      "bind-address"             = "0.0.0.0"
    }
  }
}

That keeps etcd on the control-plane private CIDR and makes Kubernetes pod CIDR allocation line up with the cluster network.
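
The relationship between the pod range and the per-node mask can be written down as a guard. This is a sketch with illustrative local and check names, and it needs Terraform 1.5 or newer for the check block:

```hcl
locals {
  # Prefix length of the pod range, for example 19 for a /19.
  pod_prefix_length = tonumber(split("/", var.pod_ipv4_cidr)[1])

  # Must match node-cidr-mask-size-ipv4 on the controller manager.
  node_cidr_mask_size = 24

  # Per-node blocks the pod range can hold: 2^(24 - 19) = 32 for a /19.
  max_nodes = pow(2, local.node_cidr_mask_size - local.pod_prefix_length)

  # Control plane nodes plus every worker in every pool.
  planned_nodes = var.control_plane.count + sum([for _, pool in var.worker_pools : pool.count])
}

check "pod_cidr_capacity" {
  assert {
    condition     = local.planned_nodes <= local.max_nodes
    error_message = "The pod CIDR cannot provide one /24 per planned node."
  }
}
```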

The worker patch mostly adds the pool labels, so nodes become purpose=platform or purpose=general.

After the configs are applied, Terraform bootstraps only the first control plane node:

resource "talos_machine_bootstrap" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = local.control_plane_private_ips[0]
}
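
The last two steps, waiting for health and fetching a kubeconfig, might look like the sketch below. It assumes a recent Talos provider where talos_cluster_kubeconfig is a resource. It also assumes the Kubernetes-level checks are skipped, because nodes cannot become Ready before a CNI is installed:

```hcl
# Wait until the Talos nodes report healthy.
data "talos_cluster_health" "this" {
  depends_on = [talos_machine_bootstrap.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = local.control_plane_private_ips
  endpoints            = local.control_plane_private_ips

  # Assumption: with cni.name = "none", Kubernetes checks would not pass yet.
  skip_kubernetes_checks = true
}

# Retrieve a kubeconfig from the bootstrapped control plane node.
resource "talos_cluster_kubeconfig" "this" {
  depends_on = [talos_machine_bootstrap.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = local.control_plane_private_ips[0]
}
```

The kubeconfig is then available as talos_cluster_kubeconfig.this.kubeconfig_raw for the Helm and Kubernetes providers.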

The First Failure Was Networking

The first failure was simple: the nodes did not agree on their own network identity.

Private-only Hetzner machines still need a default route for egress. In my setup, the node route goes to the subnet gateway, and Hetzner’s network route sends outbound traffic to the NAT gateway.

Node -> Subnet Gateway -> NAT Gateway -> Internet

In practice, the Talos machine config needs to point that route at the right interface, and kubelet needs to pick the right node IP.
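
On the Hetzner side, the second hop of that chain is a route on the private network. A minimal sketch, where the resource name and var.nat_gateway_private_ip are hypothetical:

```hcl
# Send all outbound traffic from the private network to the NAT gateway.
resource "hcloud_network_route" "default_egress" {
  network_id  = var.network_id
  destination = "0.0.0.0/0"
  gateway     = var.nat_gateway_private_ip # hypothetical: private IP of the NAT gateway server
}
```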

I had a few false starts here. At one point I assumed the private network interface was enp7s0. Then I tried routes on both enp7s0 and eth0, with eth1 ignored. The Hetzner Cloud VMs turned out to use eth0 for the network path I needed, so eth0 was the interface that actually mattered.

The other subtle part was nodeIP.validSubnets. My first attempt pointed kubelet at the broader subnet. Node IP selection is more predictable when the allowed ranges are exactly the ranges where nodes live.

So the module builds that list from the control plane CIDR and each worker pool CIDR:

locals {
  node_cidrs = concat(
    [var.control_plane.cidr],
    [for _, pool_config in var.worker_pools : pool_config.cidr]
  )
}

Configuring networking is not easy. It is a chain of small settings that all need to agree: Hetzner routes, Talos interfaces, kubelet node IPs, etcd advertised subnets, and Kubernetes pod CIDR allocation.

Cilium First, Then Cloud Integrations

Once the nodes agreed on their private addresses and routes, the next step was getting the cluster network itself working.

I install Cilium with Helm after Talos bootstraps Kubernetes. It runs in native routing mode and replaces kube-proxy.

resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io/"
  chart      = "cilium"
  namespace  = "kube-system"

  set {
    name  = "routingMode"
    value = "native"
  }

  set {
    name  = "kubeProxyReplacement"
    value = "true"
  }
}
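
The release above is trimmed. A native-routing, kube-proxy-free install on Talos usually needs a few more values. The local below is a sketch of what those could be, using standard Cilium chart keys, and is not necessarily the exact set this cluster runs with:

```hcl
locals {
  # Illustrative values; pass them with: values = [yamlencode(local.cilium_values)]
  cilium_values = {
    routingMode           = "native"
    ipv4NativeRoutingCIDR = var.pod_ipv4_cidr # traffic inside this range is not masqueraded
    kubeProxyReplacement  = true

    # Use the per-node pod CIDRs allocated by the controller manager.
    ipam = { mode = "kubernetes" }

    # Without kube-proxy, Cilium needs a direct address for the API server.
    k8sServiceHost = local.control_plane_private_ips[0]
    k8sServicePort = 6443

    # Talos mounts cgroup v2 itself, so Cilium should not try to.
    cgroup = {
      autoMount = { enabled = false }
      hostRoot  = "/sys/fs/cgroup"
    }
  }
}
```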

After Cilium is in place, the Hetzner Cloud Controller Manager can run. That order matters because the CCM depends on a working cluster network and is responsible for Hetzner-specific behavior like load balancers and node metadata.

The CSI driver and metrics server follow the same idea. Terraform installs the base pieces that make the cluster usable.
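
The ordering can be made explicit with depends_on. This sketch assumes the official Hetzner chart and that a secret named hcloud with the API token and network already exists in kube-system:

```hcl
resource "helm_release" "hcloud_ccm" {
  name       = "hcloud-cloud-controller-manager"
  repository = "https://charts.hetzner.cloud"
  chart      = "hcloud-cloud-controller-manager"
  namespace  = "kube-system"

  # Install only after the cluster network is up.
  depends_on = [helm_release.cilium]

  # Private network support, so the CCM works with private node IPs.
  set {
    name  = "networking.enabled"
    value = "true"
  }

  set {
    name  = "networking.clusterCIDR"
    value = var.pod_ipv4_cidr
  }
}
```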

GitOps from Day One

Terraform installs ArgoCD because something needs to install ArgoCD. After that bootstrap step, the responsibilities are clear: Terraform owns infrastructure and the first bootstrap, ArgoCD owns platform components.

This is where the earlier platform pool starts to make sense. ArgoCD should not land randomly on the same general workers as application traffic.

global:
  nodeSelector:
    purpose: platform

The Terraform side creates the namespace, repository credentials, and Helm release. ArgoCD itself is exposed on the internal ingress as argocd.int.noreturn.dev. That keeps the GitOps UI private, but still easy to reach from my laptop through Tailscale.
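
A trimmed sketch of that Terraform side, assuming the community argo-helm chart and the Kubernetes provider. Repository credentials and ingress settings are left out:

```hcl
resource "kubernetes_namespace" "argocd" {
  metadata {
    name = "argocd"
  }
}

resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = kubernetes_namespace.argocd.metadata[0].name

  # Pin every ArgoCD component to the platform pool.
  values = [yamlencode({
    global = {
      nodeSelector = { purpose = "platform" }
    }
  })]
}
```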

From there, the platform repository can install cert-manager, external-secrets, ingress controllers, monitoring, and the rest.

Then Real Workloads Showed Up

The original cluster came up on Talos v1.11.3. It worked. And then snapbyte.dev started running on the general workers.

That is when the cluster stopped being a learning exercise.

A high-churn set of CronJobs in snapbyte started leaving the cluster in a bad state. One worker showed SystemOOM, Grafana had gaps in node metrics, disk I/O looked suspicious, and Talos showed hundreds of containerd-shim-runc-v2 processes on a hot worker.

At first it looked like a disk problem. Then it looked like a workload problem. Then it looked like both.

Restarting the affected workers would probably have been fine, but I got curious around 1am one night and kept digging.

Eventually, the runtime became part of the investigation because those nodes were on Talos v1.11.3 with containerd 2.1.4. That pointed me to this issue: containerd-shim processes leak during high pod churn.

By 3am, curiosity had turned into an in-place Talos upgrade to v1.12.6, moving the nodes to containerd 2.1.6, and a much more careful rollout process than I had planned when I first built this cluster.

But that is its own post.

What I Learned

Private-only nodes still need a default route for outbound traffic.
The Talos interface names have to match what the Hetzner VM actually uses.
nodeIP.validSubnets should only include the ranges where Kubernetes nodes actually live: the control plane CIDR and each worker pool CIDR.
etcd.advertisedSubnets should stay on the control-plane private CIDR.
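The service and pod CIDRs are not random ranges. I used 10.0.192.0/21 for services and 10.0.200.0/19 for pods, outside the node ranges in 10.0.128.0/24. One thing to double-check: 10.0.200.0 is not on a /19 boundary, and strict CIDR parsing normalizes it to 10.0.192.0/19 (10.0.192.0 to 10.0.223.255), which contains the service range.
The pod range is /19 because Kubernetes allocates one /24 pod CIDR per node in this setup. A /19 contains 2^(24-19) = 32 /24 ranges, so the cluster has room for up to 32 nodes, as long as the block itself does not overlap the node or service ranges.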
Talos passes those ranges to Kubernetes, the controller manager allocates /24 pod CIDRs to nodes from the pod range, and Cilium routes traffic based on that Kubernetes view of the network.
Running real workloads changes the meaning of the cluster. Once snapbyte.dev was running there, it was not a learning exercise anymore. A personal cluster is still production if real services depend on it.

Next up: the Talos upgrade from v1.11.3 to v1.12.6, why I did it, what I checked before each node, and how a small snapbyte incident became a container runtime debugging session.