Containerization Best Practices for DevOps
Master containerization strategies.
Container Fundamentals and Docker Basics
Containers transformed how I think about application deployment. After years of wrestling with “it works on my machine” problems, containers finally gave us a way to package applications with their entire runtime environment.
Why Containers Matter
Traditional deployment meant installing applications directly on servers, managing dependencies, and hoping everything worked together. I’ve seen production outages caused by a missing library version or conflicting Python packages. Containers solve this by packaging everything your application needs into a single, portable unit.
Think of containers as something like lightweight virtual machines, but far more efficient: they share the host operating system kernel while maintaining isolation between applications.
# Check if Docker is running
docker --version
docker info
Understanding Container Images
Container images are the blueprints for containers. They contain your application code, runtime, system tools, libraries, and settings. Images are built in layers, which makes them efficient to store and transfer.
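You can see those layers for yourself with docker history, which is also a quick way to find out where an image’s size comes from:
# Show each layer of an image and the size it adds
docker history nginx:alpine
# Print the total image size in bytes
docker inspect --format '{{.Size}}' nginx:alpine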
# Simple example - a basic web server
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/
EXPOSE 80
This Dockerfile creates an image based on the lightweight Alpine Linux build of Nginx. The COPY instruction adds your HTML file to the web server’s document root.
Building and running this container:
# Build the image
docker build -t my-web-server .
# Run a container from the image
docker run -d -p 8080:80 --name web my-web-server
The -d flag runs the container in the background, and -p 8080:80 maps port 8080 on your host to port 80 in the container.
Container Lifecycle Management
Containers have a simple lifecycle: created, running, stopped, or removed. Understanding this lifecycle helps you manage containers effectively.
# List running containers
docker ps
# Stop a running container
docker stop web
# Start a stopped container
docker start web
# Remove a container
docker rm web
I always use meaningful names for containers in development. It makes debugging much easier when you can identify containers by purpose rather than random IDs.
Working with Container Logs
Container logs are crucial for debugging. Docker captures everything your application writes to stdout and stderr.
# View container logs
docker logs web
# Follow logs in real-time
docker logs -f web
# Show only the last 50 lines
docker logs --tail 50 web
In production, I’ve learned to always log to stdout/stderr rather than files. This makes log aggregation much simpler and follows the twelve-factor app methodology.
Container Networking Basics
Containers can communicate with each other and the outside world through Docker’s networking system. By default, Docker creates a bridge network that allows containers to communicate.
# List Docker networks
docker network ls
# Create a custom network
docker network create my-network
# Run containers on the custom network
docker run -d --network my-network --name app1 nginx:alpine
docker run -d --network my-network --name app2 nginx:alpine
Custom networks provide better isolation and allow containers to communicate using container names as hostnames.
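A quick way to verify that name resolution works on the custom network, reusing the two containers above:
# app1 resolves "app2" through Docker's embedded DNS and fetches its default page
docker exec app1 wget -qO- http://app2
# Clean up when finished
docker rm -f app1 app2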
Volume Management for Data Persistence
Containers are ephemeral by design - when you remove a container, its data disappears. Volumes solve this by providing persistent storage that survives container restarts and removals.
# Create a named volume
docker volume create my-data
# Run a container with a volume mount
docker run -d -v my-data:/data --name data-container alpine sleep 3600
# List all volumes
docker volume ls
I prefer named volumes over bind mounts for production workloads because Docker manages them automatically and they work consistently across different host operating systems.
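Named volumes also make it easy to prove the persistence claim: write a file, destroy the container, and read the file back from a brand-new one (continuing the my-data example):
# Write data, then remove the container entirely
docker exec data-container sh -c 'echo hello > /data/test.txt'
docker rm -f data-container
# A fresh container sees the same data
docker run --rm -v my-data:/data alpine cat /data/test.txt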
Resource Management and Limits
Containers share host resources, so it’s important to set appropriate limits to prevent one container from consuming all available CPU or memory.
# Run a container with resource limits
docker run -d \
--memory="512m" \
--cpus="1.0" \
--name limited-container \
nginx:alpine
# Check resource usage
docker stats limited-container
Setting resource limits prevents runaway containers from affecting other applications on the same host.
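Limits can also be adjusted on a running container without recreating it - a small sketch using docker update on the container above:
# Tighten the limits in place (memory-swap must stay >= memory)
docker update --memory 256m --memory-swap 512m --cpus 0.5 limited-container
# Confirm the new values took effect
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' limited-container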
Container Image Management
Managing images efficiently becomes important as you work with more containers:
# List all images
docker images
# Remove an image
docker rmi nginx:alpine
# Remove unused images
docker image prune
I run docker system prune regularly in development environments to keep disk usage under control.
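Before pruning, docker system df shows where the disk space is actually going:
# Summarize disk usage by images, containers, local volumes, and build cache
docker system df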
What Makes Containers Different
Containers aren’t just lightweight VMs. They share the host kernel, which makes them much more efficient but also means they have different security and compatibility considerations.
Virtual machines virtualize hardware, while containers virtualize the operating system. This means containers start faster, use less memory, and allow higher density on the same hardware.
The trade-off is that all containers on a host share the same kernel. You can’t run Windows containers on a Linux host, and kernel-level security vulnerabilities affect all containers.
Understanding these fundamentals sets the foundation for everything else we’ll cover. In the next part, we’ll dive into writing effective Dockerfiles and building optimized images that are both secure and efficient.
Writing Effective Dockerfiles
I’ve written hundreds of Dockerfiles over the years, and I’ve learned that the difference between a good and bad Dockerfile often determines whether your containers succeed in production. A well-crafted Dockerfile creates smaller, more secure, and faster-building images.
Dockerfile Structure and Layer Optimization
Every instruction in a Dockerfile creates a new layer. Understanding this is crucial for building efficient images. Docker caches layers, so ordering instructions correctly can dramatically speed up your builds.
# Poor layer ordering - cache invalidated frequently
FROM node:18-alpine
COPY . /app
WORKDIR /app
RUN npm install
EXPOSE 3000
CMD ["npm", "start"]
This Dockerfile copies all source code before installing dependencies. Every code change invalidates the npm install cache, making builds slower.
# Better layer ordering - leverages cache effectively
FROM node:18-alpine
WORKDIR /app
# Copy package files first
COPY package*.json ./
# Install dependencies (cached unless package files change)
RUN npm ci --only=production
# Copy source code last
COPY . .
EXPOSE 3000
CMD ["npm", "start"]
Now dependency installation only runs when package files change, not on every code modification. I’ve seen this simple change reduce build times from 5 minutes to 30 seconds.
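You can watch the cache doing its job by building twice. After a source-only change, the dependency layer is reused (a sketch, assuming your code lives under src/; recent BuildKit-based builds mark reused steps as CACHED):
# First build populates the cache
docker build -t my-app .
# Simulate a code change, then rebuild - the npm ci layer is not re-run
touch src/index.js
docker build -t my-app .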
Choosing the Right Base Image
Base image selection affects security, size, and compatibility. I’ve learned to be deliberate about this choice rather than defaulting to full distributions.
# Full Ubuntu image - large but familiar
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
CMD ["python3", "app.py"]
This works but creates a 200MB+ image for a simple Python app. Alpine Linux offers a much smaller alternative:
# Alpine-based image - smaller and more secure
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Alpine reduces the image size to under 50MB. However, Alpine uses musl libc instead of glibc, which can cause compatibility issues with some Python packages.
For maximum security and minimal size, consider distroless images:
# Multi-stage build with distroless runtime
FROM python:3.11-slim as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt
FROM gcr.io/distroless/python3
COPY --from=builder /root/.local /root/.local
COPY . /app
WORKDIR /app
ENV PATH=/root/.local/bin:$PATH
CMD ["app.py"]
Multi-Stage Builds for Compiled Applications
Multi-stage builds separate build-time dependencies from runtime requirements. This is especially powerful for compiled languages:
# Go application with multi-stage build
FROM golang:1.21-alpine AS builder
WORKDIR /app
# Copy go mod files first for better caching
COPY go.mod go.sum ./
RUN go mod download
# Copy source and build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o app ./cmd/server
# Runtime stage
FROM alpine:3.18
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/app .
CMD ["./app"]
The builder stage includes the full Go toolchain (300MB+), while the runtime stage contains only the compiled binary and minimal Alpine base (10MB total).
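Building the image and checking its size makes the payoff concrete:
# Build, then compare against the golang:1.21-alpine builder image (~300MB)
docker build -t go-app .
docker images go-app --format '{{.Repository}}:{{.Tag}} {{.Size}}'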
Managing Dependencies and Package Installation
How you install packages significantly impacts image size and security. I always clean up package caches and temporary files in the same layer where they’re created.
# Poor practice - leaves package cache
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y curl nginx
RUN rm -rf /var/lib/apt/lists/*
Each RUN instruction creates a separate layer, so the package cache exists in the middle layer even though it’s deleted in the final layer.
# Better practice - single layer with cleanup
FROM ubuntu:22.04
RUN apt-get update && \
apt-get install -y --no-install-recommends \
curl \
nginx && \
rm -rf /var/lib/apt/lists/* && \
apt-get clean
The --no-install-recommends flag prevents apt from installing recommended packages, and cleanup happens in the same layer as installation.
Using .dockerignore Effectively
The .dockerignore file prevents unnecessary files from being sent to the Docker daemon during builds:
# Version control
.git
.gitignore
# Dependencies
node_modules
__pycache__
*.pyc
# Build artifacts
dist
build
target
# IDE files
.vscode
.idea
# Environment files
.env
.env.local
# Documentation
README.md
docs/
I’ve seen builds fail because someone accidentally included a 2GB dataset in the build context. A good .dockerignore prevents these issues.
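If you suspect an oversized context, BuildKit reports the transfer size at the start of every build - a quick sanity check before digging further:
# The "transferring context" line shows how much data is sent to the daemon
docker build --progress=plain -t context-check . 2>&1 | grep -i 'transferring context'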
Environment Variables and Configuration
Handle configuration through environment variables rather than baking values into images:
FROM node:18-alpine
WORKDIR /app
# Set default environment variables
ENV NODE_ENV=production
ENV PORT=3000
ENV LOG_LEVEL=info
COPY package*.json ./
RUN npm ci --only=production
COPY . .
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nextjs -u 1001
USER nextjs
EXPOSE $PORT
CMD ["node", "server.js"]
Security Considerations
Never run containers as root unless absolutely necessary:
FROM python:3.11-slim
# Create app user
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
# Install dependencies as root
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy app and change ownership
COPY . .
RUN chown -R appuser:appuser /app
# Switch to non-root user
USER appuser
CMD ["python", "app.py"]
Avoid including secrets in images:
# Bad - secret baked into image
FROM alpine
ENV API_KEY=secret123
CMD ["./app"]
# Good - secret provided at runtime
FROM alpine
ENV API_KEY=""
CMD ["./app"]
The goal is creating images that are small, secure, and fast to build. Every instruction should serve a purpose, and the order should optimize for caching and security.
In the next part, I’ll cover container security in depth, including vulnerability scanning, image signing, and runtime security practices that I’ve learned are essential for production deployments.
Container Security and Vulnerability Management
Container security kept me awake at night during my first production deployment. A single vulnerable base image could expose your entire application stack. I’ve since learned that security isn’t something you add later - it must be built into every step of your container workflow.
Understanding Container Attack Surfaces
Containers share the host kernel, which creates unique security considerations. The attack surface includes:
- Base image vulnerabilities
- Application dependencies
- Container runtime configuration
- Host system security
- Network exposure
- Secrets management
I’ve seen organizations focus only on application security while ignoring base image vulnerabilities. This is like locking your front door while leaving windows open.
Vulnerability Scanning in CI/CD Pipelines
Automated vulnerability scanning catches security issues before they reach production. I integrate scanning at build time and runtime monitoring:
# GitHub Actions workflow with security scanning
name: Container Security Pipeline
on:
push:
branches: [ main, develop ]
jobs:
security-scan:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Build container image
run: docker build -t myapp:${{ github.sha }} .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH,MEDIUM'
exit-code: '1'
- name: Upload scan results
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: 'trivy-results.sarif'
For detailed scanning, I use Trivy’s comprehensive modes:
# Scan for vulnerabilities and misconfigurations
trivy image --severity HIGH,CRITICAL myapp:latest
# Include secret detection
trivy image --scanners vuln,secret myapp:latest
# Scan Dockerfile for best practices
trivy config Dockerfile
Implementing Image Signing and Verification
Image signing ensures the integrity and authenticity of your container images. I use Cosign for its simplicity:
# Generate a key pair for signing
cosign generate-key-pair
# Sign an image after building
docker build -t myregistry.io/myapp:v1.0.0 .
cosign sign --key cosign.key myregistry.io/myapp:v1.0.0
# Verify image signature before deployment
cosign verify --key cosign.pub myregistry.io/myapp:v1.0.0
For production environments, I integrate signature verification into deployment pipelines:
# Kubernetes admission controller policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-image-signatures
spec:
validationFailureAction: enforce
rules:
- name: verify-signature
match:
any:
- resources:
kinds:
- Pod
verifyImages:
- imageReferences:
- "myregistry.io/*"
attestors:
- entries:
- keys:
publicKeys: |
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
-----END PUBLIC KEY-----
Secure Base Image Selection
Choosing secure base images is your first line of defense:
# Bad - generic latest tags are unpredictable
FROM node:latest

# Better - use a specific, recent version
FROM node:18.17.1-alpine3.18

# Even better - pin a digest for immutability
FROM node:18.17.1-alpine3.18@sha256:f77a1aef2da8d83e45ec990f45df906f9c3e8b8c0c6b2b5b5c5c5c5c5c5c5c5c
For maximum security, I prefer distroless images:
# Multi-stage build with distroless runtime
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
FROM gcr.io/distroless/nodejs18-debian11
COPY --from=builder /app/node_modules ./node_modules
COPY . .
CMD ["server.js"]
Distroless images contain only your application and runtime dependencies. No shell, no package manager, no debugging tools that attackers could exploit.
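The flip side is that you can’t exec a shell into a distroless container. In Kubernetes, an ephemeral debug container is the usual workaround - a sketch, assuming your cluster supports kubectl debug and the pod’s container is named app:
# Attach a throwaway busybox container that shares the pod's process namespace
kubectl debug -it <pod-name> --image=busybox:1.36 --target=app
# Inside it, ps aux shows the distroless application's processes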
Runtime Security Configuration
Never run containers as root unless absolutely necessary:
# Create and use non-root user
FROM python:3.11-slim
# Create app user with specific UID/GID
RUN groupadd -r -g 1001 appuser && \
useradd -r -u 1001 -g appuser appuser
WORKDIR /app
# Install dependencies as root
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy app files and set ownership
COPY . .
RUN chown -R appuser:appuser /app
# Switch to non-root user
USER appuser
CMD ["python", "app.py"]
When deploying, I use security contexts to enforce additional restrictions:
apiVersion: apps/v1
kind: Deployment
metadata:
name: secure-app
spec:
template:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1001
runAsGroup: 1001
containers:
- name: app
image: myregistry.io/myapp:v1.0.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
resources:
limits:
memory: "512Mi"
cpu: "500m"
Secrets Management
Never include secrets in container images:
# Bad - secret in image
FROM alpine
ENV API_KEY=sk-1234567890abcdef
CMD ["./app"]
# Good - secret injected at runtime
FROM alpine
ENV API_KEY=""
CMD ["./app"]
For Kubernetes deployments, use Secrets:
# Kubernetes Secret
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
data:
api-key: c2stMTIzNDU2Nzg5MGFiY2RlZg== # base64 encoded
---
# Deployment using the secret
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
template:
spec:
containers:
- name: app
image: myapp:latest
env:
- name: API_KEY
valueFrom:
secretKeyRef:
name: app-secrets
key: api-key
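Rather than base64-encoding values by hand, let kubectl do it - this creates the same Secret shown above:
# Creates data.api-key with the base64 encoding handled for you
kubectl create secret generic app-secrets --from-literal=api-key='sk-1234567890abcdef'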
Network Security and Isolation
Use NetworkPolicies to control traffic flow:
# Network policy for database isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: database-policy
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: api
ports:
- protocol: TCP
port: 5432
egress:
- to: []
    ports:
    - protocol: UDP
      port: 53 # DNS
    - protocol: TCP
      port: 53 # DNS over TCP
This policy allows only API pods to connect to the database and restricts database egress to DNS queries only.
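It’s worth verifying that a policy actually blocks traffic rather than trusting the YAML. A quick probe from an unlabeled pod (netshoot ships a full nc; substitute your database service name):
# Should time out, since this pod lacks the app=api label
kubectl run policy-test --rm -it --image=nicolaka/netshoot -- nc -zv -w 3 <database-service> 5432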
Security isn’t a one-time setup - it’s an ongoing process. Regular scanning, monitoring, and updates are essential for maintaining a secure container environment. In the next part, I’ll cover container orchestration and how to manage containers at scale while maintaining security and reliability.
Container Orchestration and Docker Compose
Managing multiple containers manually becomes unwieldy fast. I learned this the hard way when I had to coordinate a web server, database, cache, and background workers across different environments. Docker Compose solved this by letting me define entire application stacks in a single file.
Understanding Multi-Container Applications
Modern applications rarely run in isolation. A typical web application might include:
- Web server (frontend)
- API server (backend)
- Database (PostgreSQL, MySQL)
- Cache (Redis, Memcached)
- Message queue (RabbitMQ, Apache Kafka)
Coordinating these services manually means remembering port mappings, network configurations, environment variables, and startup order. Docker Compose eliminates this complexity.
Docker Compose Fundamentals
Docker Compose uses YAML files to define services, networks, and volumes:
# docker-compose.yml
version: '3.8'
services:
web:
build: .
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://user:password@db:5432/myapp
depends_on:
- db
- redis
db:
image: postgres:15-alpine
environment:
- POSTGRES_DB=myapp
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- postgres_data:/var/lib/postgresql/data
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redis_data:/data
volumes:
postgres_data:
redis_data:
This defines three services that can communicate with each other using service names as hostnames. The web service can connect to the database at db:5432 and to Redis at redis:6379.
Service Configuration and Dependencies
The depends_on directive controls startup order, but it doesn’t wait for services to be ready. For applications that need the database to be fully initialized, add a health check:
version: '3.8'
services:
web:
build: .
ports:
- "8000:8000"
depends_on:
db:
condition: service_healthy
db:
image: postgres:15-alpine
environment:
- POSTGRES_DB=myapp
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
interval: 10s
timeout: 5s
retries: 5
volumes:
postgres_data:
The healthcheck ensures the database is accepting connections before starting the web service. I always include health checks for critical services like databases.
Networking in Docker Compose
Docker Compose automatically creates a network for your services. You can also define custom networks for better isolation:
version: '3.8'
services:
frontend:
build: ./frontend
networks:
- frontend-net
ports:
- "3000:3000"
api:
build: ./api
networks:
- frontend-net
- backend-net
db:
image: postgres:15-alpine
networks:
- backend-net
volumes:
- postgres_data:/var/lib/postgresql/data
networks:
frontend-net:
backend-net:
volumes:
postgres_data:
This setup isolates the database on the backend network, preventing the frontend from directly accessing it.
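You can check the isolation from inside the containers - a sketch, assuming the images include busybox’s nslookup:
# The frontend can't even resolve the database's name
docker compose exec frontend nslookup db   # fails: not on backend-net
# The api sits on both networks, so resolution succeeds
docker compose exec api nslookup db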
Environment Configuration
Managing configuration across different environments is crucial. I use multiple compose files and environment files:
# docker-compose.yml (base configuration)
version: '3.8'
services:
web:
build: .
environment:
- NODE_ENV=${NODE_ENV:-development}
env_file:
- .env
db:
image: postgres:15-alpine
environment:
- POSTGRES_DB=${POSTGRES_DB}
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
# docker-compose.override.yml (development overrides)
version: '3.8'
services:
web:
ports:
- "8000:8000"
volumes:
- ./src:/app/src
command: npm run dev
db:
ports:
- "5432:5432"
Use different configurations with:
# Development (uses docker-compose.override.yml automatically)
docker-compose up
# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up
Development Workflow Optimization
For development, I optimize for fast feedback loops and easy debugging:
version: '3.8'
services:
web:
build:
context: .
dockerfile: Dockerfile.dev
ports:
- "8000:8000"
volumes:
- ./src:/app/src
- /app/node_modules # Anonymous volume to preserve node_modules
environment:
- NODE_ENV=development
- CHOKIDAR_USEPOLLING=true # For file watching in containers
command: npm run dev
db:
image: postgres:15-alpine
ports:
- "5432:5432" # Expose for external tools
environment:
- POSTGRES_DB=myapp_dev
- POSTGRES_USER=dev
- POSTGRES_PASSWORD=devpass
volumes:
- postgres_dev_data:/var/lib/postgresql/data
volumes:
postgres_dev_data:
Common Patterns and Troubleshooting
I’ve encountered these issues repeatedly and learned to avoid them:
Problem: Services can’t communicate
# Check if services are on the same network
docker-compose ps
docker network ls
Problem: Database connection refused
# Add health checks and proper depends_on
services:
web:
depends_on:
db:
condition: service_healthy
db:
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user"]
interval: 10s
retries: 5
Problem: File changes not reflected in development
# Ensure proper volume mounting and file watching
services:
web:
volumes:
- ./src:/app/src
environment:
- CHOKIDAR_USEPOLLING=true
Docker Compose provides an excellent foundation for understanding container orchestration concepts. The skills you learn here translate directly to Kubernetes, which I’ll cover in the next part along with production-grade orchestration strategies.
Production Kubernetes Deployment
Moving from Docker Compose to Kubernetes felt overwhelming at first. The complexity seemed unnecessary for simple applications. But after managing production workloads for years, I understand why Kubernetes became the standard - it handles the operational complexity that emerges at scale.
Kubernetes Architecture and Core Concepts
Kubernetes orchestrates containers across multiple machines, providing features that Docker Compose can’t match: automatic failover, rolling updates, service discovery, and resource management across a cluster.
The key components you need to understand:
- Pods: The smallest deployable units, usually containing one container
- Deployments: Manage replica sets and rolling updates
- Services: Provide stable network endpoints for pods
- ConfigMaps and Secrets: Manage configuration and sensitive data
- Ingress: Handle external traffic routing
Basic deployment example:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
labels:
app: web-app
spec:
replicas: 3
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
containers:
- name: web
image: myregistry.io/web-app:v1.2.0
ports:
- containerPort: 8080
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
This deployment creates three replicas with proper resource limits and health checks. The probes ensure Kubernetes knows when your application is healthy and ready to receive traffic.
Service Discovery and Load Balancing
Services provide stable endpoints for your pods:
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: web-app-service
spec:
selector:
app: web-app
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-app-ingress
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts:
- myapp.example.com
secretName: web-app-tls
rules:
- host: myapp.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: web-app-service
port:
number: 80
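Before DNS and TLS are in place, port-forwarding gives you a direct line to the service for testing (the /health path comes from the probes defined above):
# Forward a local port to the service (leave running; use a second terminal for curl)
kubectl port-forward svc/web-app-service 8080:80
# In another terminal:
curl http://localhost:8080/health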
Configuration Management
Separate configuration from code using ConfigMaps and Secrets:
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
app.properties: |
log.level=INFO
feature.new-ui=true
cache.ttl=300
---
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: db-credentials
type: Opaque
data:
url: cG9zdGdyZXNxbDovL3VzZXI6cGFzc3dvcmRAZGI6NTQzMi9teWFwcA==
username: dXNlcg==
password: cGFzc3dvcmQ=
Mount these in your deployment:
spec:
template:
spec:
containers:
- name: web
volumeMounts:
- name: config-volume
mountPath: /etc/config
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
volumes:
- name: config-volume
configMap:
name: app-config
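A quick check that the pieces landed where you expect:
# Read the mounted config from a running pod
kubectl exec deploy/web-app -- cat /etc/config/app.properties
# Confirm the secret-backed variable is set (avoid echoing real secrets in shared terminals)
kubectl exec deploy/web-app -- printenv DATABASE_URL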
Resource Management and Autoscaling
Proper resource management prevents one application from affecting others:
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
The HPA automatically scales your deployment based on CPU and memory usage.
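Watching the HPA react is the best way to build trust in it:
# Current vs. target utilization and replica count, refreshed live
kubectl get hpa web-app-hpa --watch
# Scaling events and the reasons behind them
kubectl describe hpa web-app-hpa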
High Availability and Fault Tolerance
Design deployments to survive node failures:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 6
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 2
  template:
    metadata:
      labels:
        app: web-app
    spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-app
topologyKey: kubernetes.io/hostname
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
minAvailable: 4
selector:
matchLabels:
app: web-app
This configuration spreads pods across different nodes and ensures at least 4 pods remain available during disruptions.
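Node drains are where the PDB earns its keep - eviction pauses rather than dipping below minAvailable:
# Drain respects the PDB; it waits instead of evicting too many pods at once
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data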
Monitoring Integration
Implement monitoring to understand your application’s behavior:
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: web-app-metrics
spec:
selector:
matchLabels:
app: web-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
---
# prometheus-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: web-app-alerts
spec:
groups:
- name: web-app
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: High error rate detected
description: "Error rate is {{ $value }} errors per second"
Deployment Strategies
Implement safe deployment practices:
#!/bin/bash
# deploy.sh - Safe deployment script
IMAGE_TAG=${1:-latest}
NAMESPACE=${2:-default}
echo "Deploying web-app:$IMAGE_TAG to namespace $NAMESPACE"
# Update the deployment
kubectl set image deployment/web-app web=myregistry.io/web-app:$IMAGE_TAG -n $NAMESPACE
# Wait for rollout to complete
kubectl rollout status deployment/web-app -n $NAMESPACE --timeout=300s
# Verify deployment health
READY_REPLICAS=$(kubectl get deployment web-app -n $NAMESPACE -o jsonpath='{.status.readyReplicas}')
DESIRED_REPLICAS=$(kubectl get deployment web-app -n $NAMESPACE -o jsonpath='{.spec.replicas}')
if [ "$READY_REPLICAS" != "$DESIRED_REPLICAS" ]; then
echo "Deployment failed: $READY_REPLICAS/$DESIRED_REPLICAS replicas ready"
kubectl rollout undo deployment/web-app -n $NAMESPACE
exit 1
fi
echo "Deployment successful"
Kubernetes provides the foundation for running containers at scale, but it requires careful planning and configuration. The complexity pays off when you need features like automatic scaling, rolling updates, and multi-region deployments. In the next part, I’ll cover CI/CD integration and how to automate your container deployment pipeline.
CI/CD Pipeline Integration
Manually building and deploying containers works for development, but production demands automation. I’ve built CI/CD pipelines that deploy hundreds of times per day while maintaining security and reliability. The key is treating your pipeline as code and building in quality gates at every step.
Pipeline Architecture Strategy
A robust container CI/CD pipeline includes several stages that fail fast and provide clear feedback:
- Source Stage: Code commit triggers the pipeline
- Test Stage: Unit tests, integration tests, and code quality checks
- Security Stage: Vulnerability scanning and compliance checks
- Build Stage: Container image creation and optimization
- Deploy Stage: Progressive rollout with monitoring and rollback capabilities
GitHub Actions Container Pipeline
GitHub Actions provides excellent container support with built-in Docker registry integration:
# .github/workflows/container-pipeline.yml
name: Container CI/CD Pipeline
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test -- --coverage
- name: Upload coverage
uses: codecov/codecov-action@v3
security-scan:
runs-on: ubuntu-latest
needs: test
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Run Trivy scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload scan results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
build-and-push:
runs-on: ubuntu-latest
needs: [test, security-scan]
if: github.event_name != 'pull_request'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push
uses: docker/build-push-action@v5
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy-staging:
runs-on: ubuntu-latest
needs: build-and-push
if: github.ref == 'refs/heads/develop'
environment: staging
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/web-app web=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} -n staging
kubectl rollout status deployment/web-app -n staging
Advanced Testing Strategies
Comprehensive testing prevents issues from reaching production:
# docker-compose.test.yml
version: '3.8'
services:
app:
build:
context: .
dockerfile: Dockerfile.test
environment:
- NODE_ENV=test
- DATABASE_URL=postgresql://test:test@db:5432/testdb
depends_on:
db:
condition: service_healthy
db:
image: postgres:15-alpine
environment:
- POSTGRES_DB=testdb
- POSTGRES_USER=test
- POSTGRES_PASSWORD=test
healthcheck:
test: ["CMD-SHELL", "pg_isready -U test -d testdb"]
interval: 5s
retries: 5
integration-tests:
build:
context: .
dockerfile: Dockerfile.test
environment:
- API_URL=http://app:8080
depends_on:
- app
command: npm run test:integration
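A typical CI invocation runs the test service against the stack and tears everything down afterwards:
# Run integration tests, then clean up containers and volumes
docker compose -f docker-compose.test.yml run --rm integration-tests
docker compose -f docker-compose.test.yml down -v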
Security Integration
Security scanning should be integrated throughout the pipeline:
#!/bin/bash
# security-scan.sh
IMAGE_NAME=${1:-myapp:latest}
SEVERITY_THRESHOLD=${2:-HIGH}
echo "Running security scan for $IMAGE_NAME"
# Scan the built image
trivy image \
--severity $SEVERITY_THRESHOLD,CRITICAL \
--format json \
--output image-scan.json \
$IMAGE_NAME
# Scan for secrets
trivy fs \
--scanners secret \
--format json \
--output secret-scan.json \
.
# Check results
CRITICAL_VULNS=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' image-scan.json)
if [ "$CRITICAL_VULNS" -gt 0 ]; then
echo "Critical vulnerabilities found. Failing build."
exit 1
fi
echo "Security scan passed"
Progressive Deployment
Implement safe deployment patterns that minimize risk:
# canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: web-app-rollout
spec:
replicas: 10
strategy:
canary:
steps:
- setWeight: 10
- pause: {duration: 2m}
- setWeight: 50
- pause: {duration: 5m}
analysis:
templates:
- templateName: success-rate
canaryService: web-app-canary
stableService: web-app-stable
selector:
matchLabels:
app: web-app
template:
spec:
containers:
- name: web
image: myregistry.io/web-app:latest
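Day-to-day operation happens through the Argo Rollouts kubectl plugin (assuming it’s installed):
# Watch the canary steps progress in real time
kubectl argo rollouts get rollout web-app-rollout --watch
# Manually advance past a pause step, or abort back to the stable version
kubectl argo rollouts promote web-app-rollout
kubectl argo rollouts abort web-app-rollout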
Multi-Environment Pipeline
Manage deployments across environments with proper promotion gates:
# Multi-environment workflow
jobs:
deploy-dev:
needs: build
environment: development
steps:
- name: Deploy to dev
run: kubectl set image deployment/web-app web=${{ needs.build.outputs.image }} -n dev
test-dev:
needs: deploy-dev
steps:
- name: Run smoke tests
run: curl -f https://dev.myapp.com/health
deploy-staging:
needs: test-dev
environment: staging
steps:
- name: Deploy to staging
run: kubectl set image deployment/web-app web=${{ needs.build.outputs.image }} -n staging
deploy-production:
needs: deploy-staging
environment: production
steps:
- name: Deploy to production
run: |
kubectl set image deployment/web-app web=${{ needs.build.outputs.image }} -n production
kubectl rollout status deployment/web-app -n production
Pipeline Troubleshooting
Common issues I’ve encountered and their solutions:
Flaky tests causing failures:
- name: Run tests with retry
uses: nick-invision/retry@v2
with:
timeout_minutes: 10
max_attempts: 3
command: npm test
Docker build cache misses:
- name: Build with cache
uses: docker/build-push-action@v5
with:
cache-from: type=gha
cache-to: type=gha,mode=max
A well-designed CI/CD pipeline becomes the backbone of your containerization strategy. It should be fast, reliable, and provide clear feedback when things go wrong. In the next part, I’ll cover monitoring and observability for production container environments.
Monitoring and Observability
Production containers without proper monitoring are like flying blind. I learned this during a midnight outage when our application was failing, but we had no visibility into why. Since then, I’ve built comprehensive observability into every containerized system.
The Three Pillars of Observability
Effective observability requires three types of data:
Metrics: Numerical measurements over time (CPU usage, request rates, error counts)
Logs: Discrete events with context (application logs, error messages, audit trails)
Traces: Request flows through distributed systems (service calls, database queries)
Each pillar provides different insights, but they’re most powerful when correlated together.
Prometheus and Grafana Stack
Prometheus scrapes metrics from your applications, while Grafana provides visualization and alerting:
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
Deploy Prometheus with proper resource limits:
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
containers:
- name: prometheus
image: prom/prometheus:v2.45.0
ports:
- containerPort: 9090
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
Application Metrics Integration
Your applications need to expose metrics. I’ll show you the Node.js instrumentation I use in production:
// metrics.js
const promClient = require('prom-client');
const register = new promClient.Registry();
// Add default metrics
promClient.collectDefaultMetrics({ register });
// Custom business metrics
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
const httpRequestsTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestsTotal);
// Middleware to collect HTTP metrics
const metricsMiddleware = (req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestsTotal
.labels(req.method, route, res.statusCode)
.inc();
});
next();
};
module.exports = { register, metricsMiddleware };
Expose metrics in your application:
// app.js
const express = require('express');
const { register, metricsMiddleware } = require('./metrics');
const app = express();
app.use(metricsMiddleware);
// Metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
app.listen(8080);
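With the app running, a quick curl confirms Prometheus will have something to scrape:
# Metrics are plain text; grep for the custom counter defined above
curl -s localhost:8080/metrics | grep http_requests_total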
Add Prometheus annotations to your deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
Centralized Logging
Structured logging makes debugging much easier:
// logger.js
const winston = require('winston');
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'web-app',
version: process.env.APP_VERSION || 'unknown'
},
transports: [new winston.transports.Console()]
});
// Request logging middleware
const requestLogger = (req, res, next) => {
  req.startTime = Date.now();
  const correlationId = req.headers['x-correlation-id'] || generateId();
  req.correlationId = correlationId;
logger.info('Request started', {
correlationId,
method: req.method,
url: req.url,
ip: req.ip
});
res.on('finish', () => {
logger.info('Request completed', {
correlationId,
statusCode: res.statusCode,
duration: Date.now() - req.startTime
});
});
next();
};
function generateId() {
return Math.random().toString(36).substring(2, 15);
}
module.exports = { logger, requestLogger };
Alerting Rules
Set up intelligent alerting that catches real issues:
# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: application-alerts
spec:
groups:
- name: application.rules
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "Service {{ $labels.service }} 95th percentile latency is {{ $value }}s"
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.pod }} is restarting frequently"
Performance Monitoring
Monitor key performance indicators and set up automated responses:
#!/bin/bash
# performance-monitor.sh
check_performance() {
local service=$1
local threshold_p95=1.0
local threshold_error_rate=0.05
# Get 95th percentile latency
p95_latency=$(curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[5m]))by(le))" | jq -r '.data.result[0].value[1]')
echo "Service: $service, P95 latency: ${p95_latency}s"
if (( $(echo "$p95_latency > $threshold_p95" | bc -l) )); then
echo "WARNING: High latency detected"
kubectl scale deployment $service --replicas=6
fi
}
# Monitor all services
for service in web-app api-service; do
check_performance $service
done
Comprehensive monitoring provides the visibility needed to maintain performance and debug issues in production. In the final part, I’ll cover troubleshooting techniques and operational practices for production container environments.
Troubleshooting and Production Operations
Production containers will fail. I’ve been woken up at 3 AM by alerts more times than I care to count. The difference between a minor incident and a major outage often comes down to how quickly you can diagnose and resolve issues.
Systematic Troubleshooting Approach
When containers misbehave, I follow a systematic approach that starts broad and narrows down to the root cause. Panic leads to random changes that make problems worse.
The Container Troubleshooting Hierarchy:
- Application Layer: Is the application code working correctly?
- Container Layer: Is the container configured and running properly?
- Orchestration Layer: Is Kubernetes scheduling containers correctly?
- Network Layer: Can services communicate with each other?
- Infrastructure Layer: Are the underlying nodes healthy?
Start at the application layer and work your way down. Most issues are application-related, not infrastructure problems.
Essential Debugging Commands
These commands have saved me countless hours:
# Check pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp
# View container logs
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous # Previous container instance
kubectl logs -f <pod-name> # Follow logs in real-time
# Execute commands in running containers
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec <pod-name> -- ps aux
# Debug networking issues
kubectl get svc,endpoints
kubectl describe svc <service-name>
# Check resource usage
kubectl top pods
kubectl top nodes
For Docker without Kubernetes:
# Container inspection
docker ps -a
docker logs <container-id>
docker exec -it <container-id> /bin/sh
# Network debugging
docker network inspect <network-name>
Common Container Issues
I’ve encountered these problems repeatedly:
Container Won’t Start
Symptoms: Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff
Diagnosis:
kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>
Common Solutions:
- Image Pull Issues:
# Check image name and registry credentials
kubectl describe pod <pod-name> | grep -A5 "Failed to pull image"
docker pull <image-name> # Test locally
- Resource Constraints:
# Check node resources
kubectl describe node <node-name>
kubectl top nodes
# Adjust resource requests
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"256Mi"}}}]}}}}'
Network Connectivity Problems
Symptoms: Services can’t communicate or DNS resolution fails
Diagnosis:
# Test service connectivity
kubectl exec <pod-name> -- nslookup <service-name>
kubectl exec <pod-name> -- curl -v http://<service-name>:<port>
# Check service endpoints
kubectl get endpoints <service-name>
Solutions:
# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system <coredns-pod>
# Verify service selectors match pod labels
kubectl get svc <service-name> -o yaml
kubectl get pods --show-labels
Advanced Debugging Techniques
For complex issues, I use specialized tools:
# Deploy network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
# Inside the debugging pod:
ping <service-name>
nmap -p <port> <service-name>
tcpdump -i eth0 -w capture.pcap
Disaster Recovery Essentials
Prepare for worst-case scenarios:
#!/bin/bash
# backup-cluster.sh
BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
# Backup critical resources
for resource in deployments services configmaps secrets; do
kubectl get $resource --all-namespaces -o yaml > $BACKUP_DIR/$resource.yaml
done
tar -czf $BACKUP_DIR.tar.gz -C $(dirname $BACKUP_DIR) $(basename $BACKUP_DIR)
echo "Backup completed: $BACKUP_DIR.tar.gz"
Operational Practices
These practices maintain stable production environments:
Health Checks
// health-check.js
app.get('/health', (req, res) => {
res.json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: process.uptime()
});
});
app.get('/ready', async (req, res) => {
try {
await checkDatabase();
res.json({ status: 'ready' });
} catch (error) {
res.status(503).json({ status: 'not ready', error: error.message });
}
});
Graceful Shutdown
// Handle shutdown signals
process.on('SIGTERM', gracefulShutdown);
function gracefulShutdown(signal) {
console.log(`Received ${signal}. Starting graceful shutdown...`);
server.close((err) => {
if (err) {
console.error('Error during shutdown:', err);
process.exit(1);
}
closeDatabase()
.then(() => process.exit(0))
.catch(() => process.exit(1));
});
// Force shutdown after timeout
setTimeout(() => process.exit(1), 30000);
}
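Kubernetes sends SIGTERM on pod deletion and waits terminationGracePeriodSeconds (30 by default) before SIGKILL - the same sequence you can simulate locally:
# docker stop sends SIGTERM, then SIGKILL after the timeout
docker stop --time 30 <container-name>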
Incident Response
When things go wrong:
- Immediate Response (0-5 minutes): Acknowledge alert, assess impact
- Investigation (5-30 minutes): Gather logs, identify root cause
- Resolution (30+ minutes): Apply fix, verify recovery
- Post-Incident (24-48 hours): Conduct blameless post-mortem
#!/bin/bash
# incident-response.sh
INCIDENT_ID=$(date +%Y%m%d-%H%M%S)
mkdir -p /tmp/incident-$INCIDENT_ID
# Gather system state
kubectl get pods,svc --all-namespaces > /tmp/incident-$INCIDENT_ID/system-state.txt
kubectl get events --all-namespaces > /tmp/incident-$INCIDENT_ID/events.txt
echo "Incident data collected in /tmp/incident-$INCIDENT_ID"
Production containers require discipline and preparation. The techniques in this guide will help you build reliable systems that handle production challenges. Remember: operational excellence is a journey of continuous improvement.