Don't Even Touch Kubernetes Without This

A comprehensive guide covering everything you need to plan before deploying a production Kubernetes cluster: architecture, storage, networking, security, observability, autoscaling, and CI/CD pipelines.

Kubernetes overview

You can deploy a basic Kubernetes cluster in 15 minutes. But making it actually work in production — answering every question and planning everything before installation — takes days of brainstorming.

This guide covers everything you need to think through before diving into Kubernetes.

Architecture

Kubernetes consists of nodes (servers) divided into control plane nodes and worker nodes. Alongside the Control Plane sits etcd — a distributed database that stores information about pods, services, and configs. The system works declaratively: the administrator describes the desired state in a config, and the Control Plane continuously reconciles the actual state of the system with that description.
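The declarative model is easiest to see in a minimal Deployment manifest (the name, image, and replica count here are illustrative):

```yaml
# Desired state: three replicas of an nginx web server.
# The Control Plane keeps reconciling reality with this spec —
# if a pod dies, a replacement is created automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25    # illustrative image/tag
          ports:
            - containerPort: 80
```

You never tell Kubernetes *how* to reach this state — you only describe *what* the state should be.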

Critical point: for large, heavily loaded clusters, the Control Plane and etcd should be on separate dedicated nodes without any workload.

What to Deploy

Deployment methods:

  • Managed Service in the cloud — the provider manages the Control Plane, etcd, and updates. Advantages: automatic node scaling and persistent volume provisioning. Disadvantages: dependency on the provider and its limits.
  • Bare Metal or your own VMs — full control, but all the complexity of installation, updates, and backups falls on the administrator.
  • Cloud VMs — a compromise between control and convenience.
  • Container hosting — maximum simplification for quick solutions.

Installation tools:

  • Kubeadm — the official tool, but the "samurai's path" with manual configuration of everything.
  • Kubespray — the recommended choice, automates 90% of the work via Ansible.
  • Specialized solutions — for specific OSes like Talos OS.

OS for nodes:

  • Standard Linux distributions (Ubuntu LTS, CentOS, Debian)
  • Specialized (Fedora CoreOS, Flatcar Container Linux with a read-only filesystem)
  • Talos — a minimalist option without a shell, managed entirely via API

Container Runtime:

  • Docker (dockershim) — the dockershim layer was removed in Kubernetes 1.24, but Docker Engine can still be used via compatibility layers such as cri-dockerd.
  • Containerd — the current de facto standard, supported by Kubernetes directly.
  • CRI-O — an alternative from Red Hat.
  • Special variants for stronger isolation or virtualization — Kata Containers runs containers inside lightweight VMs, while KubeVirt runs full virtual machines as Kubernetes workloads.

Data Storage

Storage architecture

Critical for stateful applications (databases, queues, caches). Forget about using local disks on nodes to store important data in production!

Solutions:

  • Ceph + Rook — distributed storage where Ceph pods run inside Kubernetes with automatic recovery on node failure.
  • LINSTOR — an alternative to Ceph.
  • Cloud CSI drivers (EBS, GCP Persistent Disk, Azure Disk) for cloud deployments.

The mechanism works like this: an application creates a PVC (Persistent Volume Claim) — a request for a disk. Kubernetes contacts the storage system via the Container Storage Interface (CSI), the system creates the disk, and Kubernetes attaches it to the pod.

StorageClass lets you define storage tiers (fast-ssd, slow-hdd, cloud-premium) and abstract away from the specific implementation.
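The two pieces fit together like this — a sketch assuming the AWS EBS CSI driver; the class name, provisioner, and sizes are illustrative:

```yaml
# Illustrative StorageClass for a fast tier; the provisioner
# depends on which CSI driver is installed in your cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com        # example: AWS EBS CSI driver
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# The application requests a disk via a PVC referencing that class;
# Kubernetes asks the CSI driver to create and attach the volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 20Gi
```

Swapping `fast-ssd` for another class changes the backing storage without touching the application.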

Network Infrastructure

Networking in Kubernetes is arguably one of the most complex topics. Kubernetes creates a virtual network on top of the physical one. Each pod gets its own IP address. For this magic to work, you need a CNI plugin (Container Network Interface).

Popular CNI plugins:

  • Cilium — uses eBPF in the Linux kernel for high performance; includes the Hubble tool for visualizing network flows.
  • Calico — a mature plugin supporting VXLAN/IPIP or BGP routing.
  • Flannel — a simple option for beginners.
  • Cloud-specific plugins (AWS VPC CNI, Azure CNI, GKE Netd).

Service — an abstraction for accessing pods:

  • ClusterIP — internal access only.
  • NodePort — opens a port on every node (inconvenient for production).
  • LoadBalancer — in the cloud, creates a cloud load balancer; on bare metal, requires MetalLB.

MetalLB emulates a cloud LoadBalancer on bare metal, taking an IP from a dedicated pool and announcing it on the network.
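On bare metal this looks roughly as follows — a sketch using MetalLB's CRD-based configuration; the pool name, namespace, and address range are assumptions for the example:

```yaml
# Illustrative MetalLB address pool: IPs MetalLB may hand out.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # example range on the local network
---
# Announce those IPs on the LAN via ARP (Layer 2 mode).
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
---
# A LoadBalancer Service then receives an IP from the pool.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

In a cloud Managed cluster the same Service manifest would instead provision the provider's load balancer.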

Ingress — the standard for publishing web applications (HTTP/HTTPS). Requires an Ingress Controller:

  • Nginx Ingress Controller — the most popular, though vulnerabilities have been found.
  • Traefik — automatically obtains Let's Encrypt certificates.
  • HAProxy Ingress, Contour — alternatives.

Gateway API is a new specification that should eventually replace Ingress, but for now it's recommended to start with standard Ingress.
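A standard Ingress manifest looks like this — hostname, Service name, and the cert-manager annotation are illustrative:

```yaml
# Illustrative Ingress: route HTTPS traffic for one host to a Service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # only if cert-manager is installed
spec:
  ingressClassName: nginx        # which Ingress Controller handles this
  tls:
    - hosts: [app.example.com]
      secretName: web-tls        # certificate stored here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```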

Information Security

Security overview

Passwords, tokens, and keys should NEVER be in code or YAML manifests.

Storage mechanisms:

  • Secrets — for sensitive data (passwords, tokens, certificates); stored in etcd merely base64-encoded (encoding, not encryption!), so enable encryption at rest and restrict access via RBAC.
  • ConfigMaps — for non-sensitive configuration.
  • Integration with external systems — HashiCorp Vault, External Secrets Operator for centralized management and rotation.
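A minimal sketch of a Secret and a pod consuming it — names and the image are illustrative, and the real value would of course never be committed to Git:

```yaml
# Illustrative Secret; stringData is base64-encoded by the API server on write.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  DB_PASSWORD: change-me        # placeholder — never commit real values
---
# The pod consumes the Secret as environment variables,
# keeping credentials out of the manifest and the image.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # illustrative image
      envFrom:
        - secretRef:
            name: db-credentials
```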

Authentication:

  • X.509 certificates
  • Tokens (static, JWT, OIDC via Keycloak or Dex)
  • LDAP/AD integration

RBAC (Role-Based Access Control):

  • Roles and ClusterRoles describe permissions.
  • RoleBindings and ClusterRoleBindings link roles to users/groups/service accounts.
  • Namespaces provide logical isolation of resources.
  • The golden rule: least privilege.
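The least-privilege rule in practice — a sketch granting read-only access to pods in a single namespace (namespace and user names are illustrative):

```yaml
# Role: what is allowed, scoped to one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a              # illustrative namespace
rules:
  - apiGroups: [""]              # "" = the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding: who gets that role.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: team-a
subjects:
  - kind: User
    name: jane                   # illustrative user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

A ClusterRole/ClusterRoleBinding pair has the same shape but applies cluster-wide.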

Observability

Running a Kubernetes cluster without a configured monitoring and logging system is like putting a blind kitten in charge of administration.

Metrics:

  • Metrics Server — collects CPU/RAM from nodes and pods, needed for kubectl top and HPA.
  • Prometheus + Grafana — the de facto standard: Prometheus collects time-series metrics, Grafana builds dashboards.
  • Applications should expose metrics via an HTTP endpoint at /metrics in Prometheus format.

Logs:

  • Containers write logs to stdout/stderr.
  • You need centralized collection: agents (Fluentd, Fluent Bit, Promtail) on each node collect logs and send them to storage.
  • Storage options: Elasticsearch (ELK/EFK stack) or Loki (better integration with Prometheus, LGTM stack).
  • Logs should be structured for easy searching.

Tracing:

  • For microservice architectures, you need distributed tracing.
  • Tools: Jaeger, Zipkin, Tempo.
  • Applications must propagate trace and span IDs (for example, via W3C Trace Context headers) and report spans describing their operations.

Scaling

Scaling diagram

Operators:

These are custom controllers for managing complex applications. They work through Custom Resource Definitions (CRDs):

  • CRDs and an operator are installed (e.g., for PostgreSQL).
  • A YAML file describing the resource (PostgresCluster) is created.
  • The operator automatically creates StatefulSets/Deployments, provisions disks, configures replication and backups.
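The steps above can be sketched with one operator as an example — here the CloudNativePG operator's Cluster resource; the field names follow that operator's API, and the cluster name and size are illustrative:

```yaml
# With the CRDs and operator installed, this short manifest is all
# the administrator writes; the operator does the rest.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3          # operator creates the pods, replication, and failover
  storage:
    size: 20Gi          # operator provisions PVCs via your StorageClass
```

Compare this to hand-writing the StatefulSet, Services, PVCs, and replication configuration yourself.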

Autoscaling:

  • HPA (Horizontal Pod Autoscaler) — automatically adds/removes pods based on metrics (CPU, RAM, custom metrics).
  • VPA (Vertical Pod Autoscaler) — adjusts CPU/RAM requests and limits; less popular and can conflict with HPA.
  • Cluster Autoscaler — adds/removes entire nodes. This is the killer feature of cloud Managed Kubernetes. On bare metal, it requires the Cluster API.
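An HPA manifest is short — this sketch targets the illustrative `web` Deployment and keeps average CPU around 70%:

```yaml
# Illustrative HPA: scale between 2 and 10 replicas based on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that HPA compares usage against the pod's CPU *requests*, so requests must be set for it to work.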

A note on cloud provider limits: some Russian clouds have rather amusing limits on the maximum number of nodes in a single Managed cluster (e.g., 32-500 nodes), while Kubernetes officially supports 5,000 nodes and 150,000 pods.

Applications for Kubernetes

The 12-Factor App methodology is required reading. Your application should:

  • Read configs and secrets from the environment or files
  • Write logs to stdout/stderr, not to files
  • Be as stateless as possible
  • Start quickly
  • Handle SIGTERM correctly for graceful shutdown
  • Expose health endpoints for readiness/liveness probes, and metrics for monitoring
  • Be packaged in a lightweight Docker image

CI/CD pipelines should automatically:

  • Build Docker images on each commit to Git
  • Push images to a Docker Registry (Harbor, GitLab Registry, Docker Hub)
  • Update YAML manifests with the new image version
  • Deploy manifests to Kubernetes
  • Run tests after deployment
  • Enable rollback to the previous version

Tools: Jenkins, GitLab CI, GitHub Actions, Tekton, Argo CD, Flux.
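With GitLab CI as one example, such a pipeline can be sketched like this — the Deployment name and deploy approach are assumptions; `$CI_REGISTRY_IMAGE` and `$CI_COMMIT_SHORT_SHA` are GitLab's built-in variables:

```yaml
# Illustrative .gitlab-ci.yml: build an image per commit, push it,
# roll it out, and verify the rollout succeeded.
stages: [build, deploy]

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

deploy:
  stage: deploy
  script:
    # Point the Deployment at the new image tag.
    - kubectl set image deployment/web app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
    # Fail the job (and alert the team) if the rollout does not complete.
    - kubectl rollout status deployment/web
```

Rolling back is then `kubectl rollout undo deployment/web`; GitOps tools like Argo CD or Flux replace the imperative `kubectl` steps with a Git commit to the manifest repository.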

Internal Developer Platform:

In an ideal world, your developers shouldn't even know the word "Kubernetes." This is the highest level of abstraction — a developer writes code, commits to Git, and the platform automatically builds, tests, and deploys to production.

Is It Worth It?

If you have a genuinely complex, large, dynamic system that needs to be fault-tolerant, scale under load, and update frequently without downtime — the answer is unequivocally YES. The initial pain pays off with flexibility, speed, and reliability.

If Kubernetes seems too complex, there are simplified alternatives like container hosting services that completely hide the orchestrator and accept docker-compose.yaml files, or Managed Kubernetes with a ready-made Control Plane, etcd, and autoscaling.

Resources for Learning

  • Minikube — a sandbox for learning
  • Kubespray — automated deployment
  • Official Kubernetes documentation — tutorials, concepts, tasks
  • Production Kubernetes (O'Reilly) — a book also available in Russian
  • Networking and Kubernetes (O'Reilly) — deep dive into networking, available in Russian
  • Lens — a GUI for Kubernetes
  • The Twelve-Factor App — application development methodology