Migrating a production monolith to Kubernetes is one of the most impactful — and risky — infrastructure transformations an engineering team can undertake. Done poorly, it leads to extended outages and frustrated users. Done right, it unlocks a new level of scalability, resilience, and deployment velocity.
Here's a battle-tested strategy for achieving zero-downtime migration.
Phase 1: Containerize Without Migrating
The first mistake teams make is trying to containerize and migrate simultaneously. Instead, containerize your application first and deploy it alongside the existing infrastructure.
# Dockerfile for a Spring Boot monolith
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY target/application.jar app.jar
# Note: Kubernetes ignores Docker's HEALTHCHECK; this only covers plain `docker run`
# smoke tests. Readiness and liveness checks are configured as probes in the pod spec.
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:8080/actuator/health || exit 1
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
Run the containerized version in parallel with production. Route a small percentage of traffic to it using weighted load balancing. This validates that the container behaves identically to the bare-metal deployment.
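If a service mesh such as Istio sits in front of both deployments, the weighted split can be expressed declaratively. A minimal sketch, assuming Istio; the host names and the 5% weight are illustrative, not from the original setup:

```yaml
# Send 5% of traffic to the containerized deployment, 95% to the legacy backend
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: monolith-canary
spec:
  hosts:
    - monolith.example.com
  http:
    - route:
        - destination:
            host: monolith-legacy     # existing bare-metal/VM backend
          weight: 95
        - destination:
            host: monolith-container  # new containerized deployment
          weight: 5
```

Start small (1-5%), watch error rates and latency, and ratchet the weight up only when the two backends behave identically.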
Phase 2: The Strangler Fig Pattern
Rather than migrating everything at once, use the Strangler Fig pattern — gradually extract services and route traffic to them. Start with the least critical components.
The strangler fig grows around a host tree until it eventually replaces it entirely. Your new microservices should grow around the monolith the same way.
Identify bounded contexts within your monolith. Extract them one by one:
- Identify the boundary — find a module with clear inputs/outputs and minimal shared state
- Create the new service — build it as a standalone K8s deployment
- Proxy traffic — use an API gateway to route requests to the new service
- Validate in shadow mode — run both old and new in parallel, comparing outputs
- Cut over — once confidence is high, route 100% to the new service
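The proxy step above can be as simple as a path-based routing rule at the edge. A minimal sketch using a Kubernetes Ingress; the paths and service names are assumptions for illustration:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strangler-routing
spec:
  rules:
    - http:
        paths:
          # The extracted service owns /payments; everything else stays on the monolith
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monolith
                port:
                  number: 8080
```

Each extraction adds one specific path rule while the catch-all `/` rule keeps the monolith as the default backend — the fig grows one branch at a time.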
Phase 3: Kubernetes Deployment Strategy
For zero-downtime deployments within Kubernetes, configure your deployments correctly:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never take a pod down before a new one is ready
      maxSurge: 1         # Add one pod before removing old ones
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:1.0.0   # your containerized application image
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]   # Graceful drain
Critical Configuration Points
- maxUnavailable: 0 — ensures a new pod is fully ready before old ones shut down
- Readiness probes — prevent traffic from reaching pods that aren't ready to serve
- preStop hook — allows in-flight requests to complete before pod termination
- PodDisruptionBudgets — limit how many pods voluntary disruptions (node drains, cluster upgrades) can evict at once
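A PodDisruptionBudget for the deployment above might look like this (the `app: payment-service` selector is an assumption matching the sketch's pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2          # with 3 replicas, voluntary evictions may remove at most 1 pod
  selector:
    matchLabels:
      app: payment-service
```

Without this, a routine node drain can evict all replicas simultaneously — exactly the kind of self-inflicted outage a migration cannot afford.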
Phase 4: Database Migration
The hardest part is the database. Stateless services are easy to migrate — stateful components require careful planning:
- Read replicas first — point new services to read replicas; keep writes on the original database
- Change Data Capture (CDC) — use tools like Debezium to stream changes between databases
- Feature flags — toggle between old and new data paths without deployments
- Dual-write with reconciliation — write to both databases and reconcile differences
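As one sketch of the CDC approach, a Debezium source connector can be declared as a Strimzi KafkaConnector resource. This assumes Strimzi, Debezium 2.x, and a PostgreSQL source; every connection detail below is a placeholder:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: monolith-db-cdc
  labels:
    strimzi.io/cluster: my-connect-cluster   # your Kafka Connect cluster
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: legacy-db.internal    # the monolith's database
    database.port: 5432
    database.dbname: monolith
    table.include.list: public.orders,public.payments
    topic.prefix: monolith-cdc               # change events land on monolith-cdc.* topics
```

New services consume the change-event topics to build their own datastores, which keeps the original database as the single source of truth for writes until cutover.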
Monitoring the Migration
You cannot migrate what you cannot measure. Before starting, establish comprehensive observability:
- Golden signals — latency, traffic, errors, saturation for every service
- Distributed tracing — correlate requests across old and new infrastructure
- Dashboards — real-time comparison of old vs. new system performance
- Automated rollback — if error rate exceeds thresholds, automatically route back to the old system
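Automated rollback can be delegated to a progressive-delivery controller rather than built by hand. A sketch using Flagger, assuming it is installed alongside a supported mesh or ingress; the thresholds are illustrative:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5            # roll back after 5 consecutive failed checks
    maxWeight: 50
    stepWeight: 10          # shift traffic in 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99           # abort and roll back if success rate drops below 99%
        interval: 1m
```

The controller shifts traffic gradually, evaluates the metrics each interval, and automatically routes everything back to the stable version when the error budget is blown — no human in the loop at 3 a.m.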
Zero-downtime migration isn't a single event — it's a process. Take it slow, measure everything, and always have a rollback plan.