Production Best Practices and Optimization
Alright, you’ve learned the fundamentals, built some applications, and now you’re thinking about production. This is where things get real. Running Kubernetes in production isn’t just about getting your apps to work—it’s about making them reliable, secure, and maintainable.
I’ve seen plenty of teams rush to production with Kubernetes and then spend months fixing issues they could have avoided. Let’s make sure you’re not one of them.
Security - Don’t Be That Company in the News
Security in Kubernetes isn’t optional. It’s not something you add later. It needs to be baked in from the start, because once you’re compromised, you’re in for a world of hurt.
The Principle of Least Privilege
Never give more permissions than absolutely necessary. This applies to everything—users, service accounts, network access, you name it.
```yaml
# Bad: Running as root
apiVersion: v1
kind: Pod
metadata:
  name: bad-pod
spec:
  containers:
    - name: app
      image: myapp:latest
      # Runs as root by default - dangerous!
---
# Good: Running as non-root user
apiVersion: v1
kind: Pod
metadata:
  name: good-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
```
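You can also have the cluster enforce these defaults rather than relying on every manifest getting them right. Here's a hedged sketch using the built-in Pod Security admission labels (stable since Kubernetes 1.25; the namespace name is just an example):

```yaml
# Example namespace with the "restricted" Pod Security profile enforced:
# pods that run as root or allow privilege escalation get rejected at admission
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```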
Network Policies - Lock Down Your Traffic
By default, any pod can talk to any other pod. That’s convenient for development, but terrible for security. Network policies let you control traffic flow:
```yaml
# Deny all traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```
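One gotcha with the default-deny policy above: because it covers Egress, it also blocks DNS, so pods in the namespace can no longer resolve service names. Here's a sketch of an egress rule that restores DNS (the k8s-app: kube-dns label matches the standard CoreDNS deployment, but verify it in your cluster):

```yaml
# Allow DNS lookups to the cluster DNS pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```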
RBAC - Control Who Can Do What
Role-Based Access Control (RBAC) controls what users and service accounts can do in your cluster:
```yaml
# Create a role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "create", "update", "patch", "delete"]
---
# Bind the role to users
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
  - kind: User
    name: [email protected]
    apiGroup: rbac.authorization.k8s.io
  - kind: User
    name: [email protected]
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```
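The binding above covers human users. For workloads and CI pipelines, you'd bind the same role to a ServiceAccount instead; a quick sketch with placeholder names:

```yaml
# Hypothetical service account for a CI deployer, bound to the same developer role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: development
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-binding
  namespace: development
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: development
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```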
Resource Management - Don’t Let One App Kill Your Cluster
Always Set Resource Limits
I can’t stress this enough: always set resource requests and limits. Without them, one misbehaving pod can consume all resources on a node and bring down other applications.
```yaml
spec:
  containers:
    - name: app
      image: myapp:latest
      resources:
        requests:
          memory: "256Mi"
          cpu: "200m"
        limits:
          memory: "512Mi"
          cpu: "500m"
```
Use ResourceQuotas for Namespaces
Prevent teams from consuming all cluster resources:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
```
LimitRanges for Default Limits
Set default resource limits so developers don’t have to remember:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        memory: "256Mi"
        cpu: "200m"
      defaultRequest:
        memory: "128Mi"
        cpu: "100m"
      type: Container
```
Monitoring and Observability - Know What’s Happening
You can’t manage what you can’t see. Proper monitoring is essential for production Kubernetes.
Health Checks Are Non-Negotiable
Every container should have health checks:
```yaml
spec:
  containers:
    - name: app
      image: myapp:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
```
Your application needs to implement these endpoints:
- `/health` - Is the app running? (liveness)
- `/ready` - Is the app ready to serve traffic? (readiness)
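One caveat: if your app takes a long time to warm up, the liveness probe above can kill it before it ever becomes healthy. A startupProbe holds off the other probes until startup succeeds; here's a minimal sketch reusing the same /health endpoint (the thresholds are assumptions you'd tune for your app):

```yaml
startupProbe:
  httpGet:
    path: /health      # assumes the same endpoint as the liveness probe
    port: 8080
  periodSeconds: 10
  failureThreshold: 30 # allow up to 30 x 10s = 5 minutes to start
```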
Logging Strategy
Centralize your logs. Don’t rely on `kubectl logs` in production:
```yaml
# Use a logging sidecar or DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd:latest
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: dockerlogs
              mountPath: /var/lib/docker/containers
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockerlogs
          hostPath:
            path: /var/lib/docker/containers
```
Metrics Collection
Use Prometheus for metrics collection:
```yaml
# Add Prometheus annotations to your pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          image: myapp:latest
          ports:
            - containerPort: 8080
```
Deployment Strategies - Rolling Out Changes Safely
Rolling Updates with Proper Configuration
Configure rolling updates to minimize disruption:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Never take down more than 1 pod
      maxSurge: 2        # Can create up to 2 extra pods during update
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v1.2.3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
```
Pod Disruption Budgets
Protect your applications during cluster maintenance:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp
```
Configuration Management Best Practices
Environment-Specific Configurations
Use Kustomize or Helm for environment-specific configurations:
```yaml
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 1  # Will be overridden per environment
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:latest
          env:
            - name: LOG_LEVEL
              value: "info"  # Will be overridden per environment
```
```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patchesStrategicMerge:
  - deployment-patch.yaml
```
```yaml
# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: app
          env:
            - name: LOG_LEVEL
              value: "warn"
```
Secret Management
Never commit secrets to Git. Use external secret management:
```bash
# Use sealed-secrets or the external-secrets operator
kubectl create secret generic app-secrets \
  --from-literal=database-password="$(openssl rand -base64 32)" \
  --dry-run=client -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml
```
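If you go the external-secrets operator route instead, the pattern is to keep only a reference in Git and let the operator pull the real value from your secret backend. A rough sketch (the SecretStore name and remote key are placeholders for whatever your setup uses):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-secret-store  # placeholder: a SecretStore you've configured separately
    kind: SecretStore
  target:
    name: app-secrets      # the Kubernetes Secret the operator will create
  data:
    - secretKey: database-password
      remoteRef:
        key: prod/app/database-password  # placeholder path in the external backend
```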
Performance Optimization
Node Affinity and Anti-Affinity
Control where your pods run:
```yaml
spec:
  affinity:
    # Prefer to run on nodes with SSD storage
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: storage-type
                operator: In
                values:
                  - ssd
    # Don't run multiple replicas on the same node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - myapp
          topologyKey: kubernetes.io/hostname
```
Horizontal Pod Autoscaling
Scale automatically based on metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
Backup and Disaster Recovery
Backup Strategies
Don’t forget to back up your persistent data:
```yaml
# Use Velero for cluster backups
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: daily-backup
spec:
  includedNamespaces:
    - production
    - staging
  storageLocation: aws-s3
  ttl: 720h0m0s  # 30 days
```
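Note that a Backup object like the one above runs once when it's created; despite the name, it doesn't recur. To actually get daily backups, you'd typically wrap the same spec in a Velero Schedule; a sketch, assuming Velero is installed in its default velero namespace:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero      # assumes Velero's default install namespace
spec:
  schedule: "0 1 * * *"  # every day at 1 AM
  template:
    includedNamespaces:
      - production
      - staging
    storageLocation: aws-s3
    ttl: 720h0m0s
```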
Database Backups
For databases, use application-specific backup tools:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Note: the stock postgres image doesn't ship the AWS CLI; use an image
              # that bundles both pg_dump and aws, or upload in a separate step
              image: postgres:13
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h postgres-service -U postgres myapp > /backup/backup-$(date +%Y%m%d).sql
                  aws s3 cp /backup/backup-$(date +%Y%m%d).sql s3://my-backups/
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              emptyDir: {}
          restartPolicy: OnFailure
```
Troubleshooting Production Issues
Essential Debugging Commands
Keep these handy for when things go wrong:
```bash
# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces

# Investigate specific issues
kubectl describe pod problematic-pod
kubectl logs problematic-pod --previous
kubectl get events --sort-by=.metadata.creationTimestamp

# Check resource usage
kubectl describe node node-name
kubectl get pods -o wide

# Network debugging
kubectl run debug --image=busybox -it --rm -- /bin/sh
# Inside the pod: nslookup service-name, wget -qO- http://service-name/health
```
Common Production Issues
- Pods stuck in Pending: usually resource constraints or node selector issues
- CrashLoopBackOff: application startup issues; check logs and health checks
- Service not accessible: check endpoints, labels, and network policies
- High resource usage: check for memory leaks, inefficient queries, or missing limits
The key to production success is preparation. Set up proper monitoring, have runbooks for common issues, and practice your incident response procedures. Kubernetes is powerful, but with great power comes great responsibility.
In the final part, we’ll put everything together with real-world examples and discuss advanced topics like GitOps, service mesh, and scaling strategies.