Troubleshooting and Production Operations
Production containers will fail. I’ve been woken up at 3 AM by alerts more times than I care to count. The difference between a minor incident and a major outage often comes down to how quickly you can diagnose and resolve issues.
Systematic Troubleshooting Approach
When containers misbehave, I follow a systematic approach that starts broad and narrows down to the root cause. Panic leads to random changes that make problems worse.
The Container Troubleshooting Hierarchy:
- Application Layer: Is the application code working correctly?
- Container Layer: Is the container configured and running properly?
- Orchestration Layer: Is Kubernetes scheduling containers correctly?
- Network Layer: Can services communicate with each other?
- Infrastructure Layer: Are the underlying nodes healthy?
Start at the application layer and work your way down. Most issues are application-related, not infrastructure problems.
Essential Debugging Commands
These commands have saved me countless hours:
# Check pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp
# View container logs
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous # Previous container instance
kubectl logs -f <pod-name> # Follow logs in real-time
# Execute commands in running containers
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec <pod-name> -- ps aux
# Debug networking issues
kubectl get svc,endpoints
kubectl describe svc <service-name>
# Check resource usage
kubectl top pods
kubectl top nodes
For Docker without Kubernetes:
# Container inspection
docker ps -a
docker logs <container-id>
docker exec -it <container-id> /bin/sh
# Network debugging
docker network inspect <network-name>
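When docker network inspect looks fine but containers still can't reach each other, I attach a throwaway container to the same network and test from inside it. A quick sketch using the same netshoot image that appears in the advanced debugging section below; network and container names are placeholders:
# Attach a temporary debugging container to the same network
docker run --rm -it --network <network-name> nicolaka/netshoot /bin/bash
# Inside the container (on a user-defined network, Docker's embedded DNS resolves container names):
ping <other-container-name>
curl -v http://<other-container-name>:<port>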
Common Container Issues
I’ve encountered these problems repeatedly:
Container Won’t Start
Symptoms: Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff
Diagnosis:
kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>
Common Solutions:
- Image Pull Issues (credentials fix sketched after this list):
# Check image name and registry credentials
kubectl describe pod <pod-name> | grep -A5 "Failed to pull image"
docker pull <image-name> # Test locally
- Resource Constraints:
# Check node resources
kubectl describe node <node-name>
kubectl top nodes
# Adjust resource requests
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"256Mi"}}}]}}}}'
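For the image pull case, missing or stale registry credentials are the most common root cause I see. A minimal sketch of the fix; the registry details and the secret name regcred are placeholders:
# Store registry credentials as a pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password>
# Attach it to the default service account so new pods pull with it automatically
kubectl patch serviceaccount default -p '{"imagePullSecrets":[{"name":"regcred"}]}'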
Network Connectivity Problems
Symptoms: Services can’t communicate or DNS resolution fails
Diagnosis:
# Test service connectivity
kubectl exec <pod-name> -- nslookup <service-name>
kubectl exec <pod-name> -- curl -v http://<service-name>:<port>
# Check service endpoints
kubectl get endpoints <service-name>
Solutions:
# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system <coredns-pod>
# Verify service selectors match pod labels
kubectl get svc <service-name> -o yaml
kubectl get pods --show-labels
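If DNS resolves and the selectors match but traffic is still being dropped, a restrictive NetworkPolicy is the next thing I rule out:
# Look for policies that could be blocking the traffic
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name> -n <namespace>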
Advanced Debugging Techniques
For complex issues, I use specialized tools:
# Deploy network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
# Inside the debugging pod:
ping <service-name>
nmap -p <port> <service-name>
tcpdump -i eth0 -w capture.pcap
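On clusters that support ephemeral containers (generally available since Kubernetes 1.25), kubectl debug attaches the same tooling directly to a running pod instead of launching a separate one, which helps when a problem only reproduces inside that pod. A sketch:
# Attach an ephemeral netshoot container to the pod (--target shares the process namespace of that container)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>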
Disaster Recovery Essentials
Prepare for worst-case scenarios:
#!/bin/bash
# backup-cluster.sh
BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
# Backup critical resources
for resource in deployments services configmaps secrets; do
  kubectl get $resource --all-namespaces -o yaml > $BACKUP_DIR/$resource.yaml
done
tar -czf $BACKUP_DIR.tar.gz -C $(dirname $BACKUP_DIR) $(basename $BACKUP_DIR)
echo "Backup completed: $BACKUP_DIR.tar.gz"
Operational Practices
These practices help keep production environments stable:
Health Checks
// health-check.js
const express = require('express');
const app = express();

app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
});

app.get('/ready', async (req, res) => {
  try {
    await checkDatabase(); // the app's own dependency check, e.g. a lightweight DB ping
    res.json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not ready', error: error.message });
  }
});
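These endpoints only pay off if Kubernetes actually calls them. A sketch of wiring them up as liveness and readiness probes with a strategic merge patch; the container name app and port 3000 are assumptions about your deployment:
# Point the probes at /health (liveness) and /ready (readiness)
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","livenessProbe":{"httpGet":{"path":"/health","port":3000},"initialDelaySeconds":10,"periodSeconds":15},"readinessProbe":{"httpGet":{"path":"/ready","port":3000},"initialDelaySeconds":5,"periodSeconds":10}}]}}}}'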
Graceful Shutdown
// Handle shutdown signals: SIGTERM from the orchestrator, SIGINT for local Ctrl-C
process.on('SIGTERM', gracefulShutdown);
process.on('SIGINT', gracefulShutdown);

function gracefulShutdown(signal) {
  console.log(`Received ${signal}. Starting graceful shutdown...`);
  server.close((err) => {
    if (err) {
      console.error('Error during shutdown:', err);
      process.exit(1);
    }
    closeDatabase()
      .then(() => process.exit(0))
      .catch(() => process.exit(1));
  });
  // Force shutdown if cleanup takes longer than 30 seconds
  setTimeout(() => process.exit(1), 30000);
}
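The handler above forces an exit after 30 seconds, which is exactly the default termination grace period in Kubernetes, so the kubelet may SIGKILL the process before cleanup finishes. Giving the pod a little more headroom avoids that race; a sketch, with 45 seconds as an assumed value:
# Allow more time than the application's own 30-second shutdown timeout
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":45}}}}'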
Incident Response
When things go wrong:
- Immediate Response (0-5 minutes): Acknowledge alert, assess impact
- Investigation (5-30 minutes): Gather logs, identify root cause
- Resolution (30+ minutes): Apply fix, verify recovery
- Post-Incident (24-48 hours): Conduct blameless post-mortem
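I keep a small script ready so that gathering state at the start of the investigation phase is a single command: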
#!/bin/bash
# incident-response.sh
INCIDENT_ID=$(date +%Y%m%d-%H%M%S)
mkdir -p /tmp/incident-$INCIDENT_ID
# Gather system state
kubectl get pods,svc --all-namespaces > /tmp/incident-$INCIDENT_ID/system-state.txt
kubectl get events --all-namespaces > /tmp/incident-$INCIDENT_ID/events.txt
echo "Incident data collected in /tmp/incident-$INCIDENT_ID"
Operating containers in production takes discipline and preparation. The techniques in this guide will help you build reliable systems that stand up to production challenges. Remember: operational excellence is a journey of continuous improvement.