Production Best Practices and Optimization

Alright, you’ve learned the fundamentals, built some applications, and now you’re thinking about production. This is where things get real. Running Kubernetes in production isn’t just about getting your apps to work—it’s about making them reliable, secure, and maintainable.

I’ve seen plenty of teams rush to production with Kubernetes and then spend months fixing issues they could have avoided. Let’s make sure you’re not one of them.

Security - Don’t Be That Company in the News

Security in Kubernetes isn’t optional. It’s not something you add later. It needs to be baked in from the start, because once you’re compromised, you’re in for a world of hurt.

The Principle of Least Privilege

Never give more permissions than absolutely necessary. This applies to everything—users, service accounts, network access, you name it.

# Bad: Running as root
apiVersion: v1
kind: Pod
metadata:
  name: bad-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    # Runs as root by default - dangerous!

# Good: Running as non-root user
apiVersion: v1
kind: Pod
metadata:
  name: good-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
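
To enforce settings like these across an entire namespace instead of per pod, you can lean on the built-in Pod Security Admission controller. A minimal sketch, assuming a namespace named production:

# Reject pods that violate the "restricted" Pod Security Standard
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted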

Network Policies - Lock Down Your Traffic

By default, any pod can talk to any other pod. That’s convenient for development, but terrible for security. Network policies let you control traffic flow:

# Deny all traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
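
Two caveats worth knowing. First, NetworkPolicy objects are only enforced if your CNI plugin supports them (Calico and Cilium do; some minimal network setups silently ignore policies). Second, because the default-deny policy above also blocks Egress, the frontend pods need a matching egress rule before this traffic actually flows (plus, in practice, an egress rule allowing DNS). A sketch of the companion policy:

# Egress counterpart to the ingress rule above
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-egress-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 8080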

RBAC - Control Who Can Do What

Role-Based Access Control (RBAC) controls what users and service accounts can do in your cluster:

# Create a role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "create", "update", "patch", "delete"]
---
# Bind the role to users
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
- kind: User
  name: [email protected]
  apiGroup: rbac.authorization.k8s.io
- kind: User
  name: [email protected]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
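
Before handing a binding to a team, verify it actually grants what you intended. kubectl can impersonate a subject and answer yes or no:

# Check the bound user's effective permissions
kubectl auth can-i create deployments --as [email protected] -n development
kubectl auth can-i delete nodes --as [email protected] -n development  # should print "no"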

Resource Management - Don’t Let One App Kill Your Cluster

Always Set Resource Limits

I can’t stress this enough: always set resource requests and limits. Without them, one misbehaving pod can consume all resources on a node and bring down other applications.

spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        memory: "256Mi"
        cpu: "200m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Use ResourceQuotas for Namespaces

Prevent teams from consuming all cluster resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
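
To see how much of the quota a team has consumed, describe it; the output lists used versus hard values for every resource:

kubectl describe resourcequota team-quota -n team-alpha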

LimitRanges for Default Limits

Set default requests and limits so containers that don’t specify their own still get sane values:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "256Mi"
      cpu: "200m"
    defaultRequest:
      memory: "128Mi"
      cpu: "100m"
    type: Container

Monitoring and Observability - Know What’s Happening

You can’t manage what you can’t see. Proper monitoring is essential for production Kubernetes.

Health Checks Are Non-Negotiable

Every container should have health checks:

spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3

Your application needs to implement these endpoints:

  • /health - Is the app running? (liveness)
  • /ready - Is the app ready to serve traffic? (readiness)
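
A quick way to exercise these endpoints from inside the cluster, assuming the app is exposed through a Service named myapp (illustrative):

# Spot-check the probe endpoints from a throwaway pod
kubectl run probe-check --image=busybox -it --rm --restart=Never -- \
  wget -qO- http://myapp:8080/health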

Logging Strategy

Centralize your logs. Don’t rely on kubectl logs in production:

# Use a logging sidecar or DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:latest
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: dockerlogs
          # Docker-specific path; on containerd-based clusters logs live under /var/log/pods
          mountPath: /var/lib/docker/containers
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: dockerlogs
        hostPath:
          path: /var/lib/docker/containers

Metrics Collection

Use Prometheus for metrics collection:

# Add Prometheus annotations to your pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: app
        image: myapp:latest
        ports:
        - containerPort: 8080
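
These annotations are a convention, not something Prometheus honors out of the box: they only take effect if the server’s scrape configuration includes relabeling rules that read them. A minimal sketch of such a job:

# prometheus.yml fragment that acts on the pod annotations above
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)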

Deployment Strategies - Rolling Out Changes Safely

Rolling Updates with Proper Configuration

Configure rolling updates to minimize disruption:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # Never take down more than 1 pod
      maxSurge: 2           # Can create up to 2 extra pods during update
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1.2.3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
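
While a rollout is in flight, these commands let you watch its progress and back out quickly if the new version misbehaves:

# Watch, inspect, and (if needed) revert a rollout
kubectl rollout status deployment/app
kubectl rollout history deployment/app
kubectl rollout undo deployment/app  # roll back to the previous revision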

Pod Disruption Budgets

Protect your applications during cluster maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp

Configuration Management Best Practices

Environment-Specific Configurations

Use Kustomize or Helm for environment-specific configurations:

# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 1  # Will be overridden per environment
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: LOG_LEVEL
          value: "info"  # Will be overridden per environment

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:  # replaces the deprecated patchesStrategicMerge field
- path: deployment-patch.yaml

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: app
        env:
        - name: LOG_LEVEL
          value: "warn"

Secret Management

Never commit secrets to Git. Use external secret management:

# Use sealed-secrets or external-secrets operator
kubectl create secret generic app-secrets \
  --from-literal=database-password="$(openssl rand -base64 32)" \
  --dry-run=client -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml
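
If you’d rather pull secrets from a cloud secret manager at runtime, the external-secrets operator takes the opposite approach: you commit a manifest that references the external store, and the operator materializes a regular Secret. A sketch, where the SecretStore name and remote key are illustrative:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager  # assumes a SecretStore configured for your provider
    kind: SecretStore
  target:
    name: app-secrets  # the Kubernetes Secret the operator creates
  data:
  - secretKey: database-password
    remoteRef:
      key: prod/app/database-password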

Performance Optimization

Node Affinity and Anti-Affinity

Control where your pods run:

spec:
  affinity:
    # Prefer to run on nodes with SSD storage
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: storage-type
            operator: In
            values:
            - ssd
    # Don't run multiple replicas on the same node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp
        topologyKey: kubernetes.io/hostname

Horizontal Pod Autoscaling

Scale automatically based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
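
The cpu and memory metrics above come from the resource metrics API, so the metrics-server (or an equivalent adapter) must be installed. To watch the autoscaler’s decisions:

# Shows current vs. target utilization and recent scaling events
kubectl get hpa app-hpa --watch
kubectl describe hpa app-hpa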

Backup and Disaster Recovery

Backup Strategies

Don’t forget to back up your persistent data:

# Use Velero for recurring cluster backups (a plain Backup resource runs
# only once; a Schedule creates one on a cron schedule)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:
    - production
    - staging
    storageLocation: aws-s3
    ttl: 720h0m0s  # 30 days
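
A backup you have never restored is not really a backup. The velero CLI makes restore drills easy (the backup name below is illustrative; Schedules produce timestamped names):

# List available backups, then restore one to verify it actually works
velero backup get
velero restore create --from-backup daily-backup-20240101020000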

Database Backups

For databases, use application-specific backup tools:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # NOTE: the stock postgres image does not include the AWS CLI;
            # use a custom image that bundles pg_dump and aws
            image: postgres:13
            command:
            - /bin/bash
            - -c
            - |
              set -euo pipefail  # fail the job (and trigger a retry) if the dump or upload fails
              pg_dump -h postgres-service -U postgres myapp > /backup/backup-$(date +%Y%m%d).sql
              aws s3 cp /backup/backup-$(date +%Y%m%d).sql s3://my-backups/
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          volumes:
          - name: backup-storage
            emptyDir: {}
          restartPolicy: OnFailure

Troubleshooting Production Issues

Essential Debugging Commands

Keep these handy for when things go wrong:

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces

# Investigate specific issues
kubectl describe pod problematic-pod
kubectl logs problematic-pod --previous
kubectl get events --sort-by=.metadata.creationTimestamp

# Check resource usage
kubectl describe node node-name
kubectl get pods -o wide

# Network debugging
kubectl run debug --image=busybox -it --rm -- /bin/sh
# Inside: nslookup service-name, wget -qO- http://service-name/health
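
On clusters with ephemeral containers (enabled by default since Kubernetes 1.25), kubectl debug can attach a tooling container to a running pod without restarting it:

# Attach an ephemeral debug container alongside the "app" container
kubectl debug -it problematic-pod --image=busybox --target=app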

Common Production Issues

  • Pods stuck in Pending: usually resource constraints or node selector issues
  • CrashLoopBackOff: application startup issues; check logs and health checks
  • Service not accessible: check endpoints, labels, and network policies
  • High resource usage: check for memory leaks, inefficient queries, or missing limits

The key to production success is preparation. Set up proper monitoring, have runbooks for common issues, and practice your incident response procedures. Kubernetes is powerful, but with great power comes great responsibility.

In the final part, we’ll put everything together with real-world examples and discuss advanced topics like GitOps, service mesh, and scaling strategies.