Troubleshooting and Production Operations

Production containers will fail. I’ve been woken up at 3 AM by alerts more times than I care to count. The difference between a minor incident and a major outage often comes down to how quickly you can diagnose and resolve issues.

Systematic Troubleshooting Approach

When containers misbehave, I follow a systematic approach that starts broad and narrows down to the root cause. Panic leads to random changes that make problems worse.

The Container Troubleshooting Hierarchy:

  1. Application Layer: Is the application code working correctly?
  2. Container Layer: Is the container configured and running properly?
  3. Orchestration Layer: Is Kubernetes scheduling containers correctly?
  4. Network Layer: Can services communicate with each other?
  5. Infrastructure Layer: Are the underlying nodes healthy?

Start at the application layer and work your way down. Most issues are application-related, not infrastructure problems.
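
In practice, that first pass down the hierarchy maps to a handful of commands. A minimal triage sketch (pod names are placeholders; each command is covered in more detail below):

# 1. Application layer: is the app logging errors?
kubectl logs <pod-name> --tail=50

# 2. Container layer: status, restarts, recent events
kubectl describe pod <pod-name>

# 3. Orchestration layer: was the pod scheduled, and where?
kubectl get pod <pod-name> -o wide

# 4. Network layer: do the backing services have endpoints?
kubectl get svc,endpoints

# 5. Infrastructure layer: are the nodes healthy?
kubectl get nodes
kubectl top nodes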

Essential Debugging Commands

These commands have saved me countless hours:

# Check pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp

# View container logs
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous  # Previous container instance
kubectl logs -f <pod-name>  # Follow logs in real-time

# Execute commands in running containers
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec <pod-name> -- ps aux

# Debug networking issues
kubectl get svc,endpoints
kubectl describe svc <service-name>

# Check resource usage
kubectl top pods
kubectl top nodes

For Docker without Kubernetes:

# Container inspection
docker ps -a
docker logs <container-id>
docker exec -it <container-id> /bin/sh

# Network debugging
docker network inspect <network-name>
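
Two more Docker commands worth keeping handy when the basics aren't enough:

# Full container configuration, mounts, and network settings
docker inspect <container-id>

# One-shot snapshot of per-container resource usage
docker stats --no-stream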

Common Container Issues

I’ve encountered these problems repeatedly:

Container Won’t Start

Symptoms: Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff

Diagnosis:

kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>

Common Solutions:

  1. Image Pull Issues:
# Check image name and registry credentials
kubectl describe pod <pod-name> | grep -A5 "Failed to pull image"
docker pull <image-name>  # Test locally

  2. Resource Constraints:
# Check node resources
kubectl describe node <node-name>
kubectl top nodes

# Adjust resource requests
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"256Mi"}}}]}}}}'

Network Connectivity Problems

Symptoms: Services can’t communicate or DNS resolution fails

Diagnosis:

# Test service connectivity
kubectl exec <pod-name> -- nslookup <service-name>
kubectl exec <pod-name> -- curl -v http://<service-name>:<port>

# Check service endpoints
kubectl get endpoints <service-name>

Solutions:

# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system <coredns-pod>

# Verify service selectors match pod labels
kubectl get svc <service-name> -o yaml
kubectl get pods --show-labels
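
To compare the two directly, pull the selector out with jsonpath and list only the pods it actually matches (app=myapp is just an example label):

# Extract the service's selector
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

# List the pods that selector matches
kubectl get pods -l app=myapp --show-labels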

Advanced Debugging Techniques

For complex issues, I use specialized tools:

# Deploy network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# Inside the debugging pod:
ping <service-name>
nmap -p <port> <service-name>
tcpdump -i eth0 -w capture.pcap
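
When the problem is tied to one specific pod rather than the network in general, an ephemeral debug container attaches the same tooling directly to it (requires a cluster with ephemeral containers enabled; names are placeholders):

# Attach a debug container to a running pod, sharing its process namespace
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Open a shell on a node; the node filesystem is mounted under /host
kubectl debug node/<node-name> -it --image=busybox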

Disaster Recovery Essentials

Prepare for worst-case scenarios:

#!/bin/bash
# backup-cluster.sh
BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR

# Backup critical resources
for resource in deployments services configmaps secrets; do
    kubectl get $resource --all-namespaces -o yaml > $BACKUP_DIR/$resource.yaml
done

tar -czf $BACKUP_DIR.tar.gz -C $(dirname $BACKUP_DIR) $(basename $BACKUP_DIR)
echo "Backup completed: $BACKUP_DIR.tar.gz"

Operational Practices

These practices maintain stable production environments:

Health Checks

// health-check.js (assuming an Express app)
const express = require('express');
const app = express();

// Liveness: the process is up and able to respond
app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
});

// Readiness: dependencies are reachable; checkDatabase() is your app's own connectivity check
app.get('/ready', async (req, res) => {
  try {
    await checkDatabase();
    res.json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not ready', error: error.message });
  }
});
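
To make Kubernetes actually use these endpoints, wire them into liveness and readiness probes. The same patch pattern from earlier works; the container name "app" and port 3000 are assumptions, so match them to your deployment:

# Point liveness at /health and readiness at /ready
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","livenessProbe":{"httpGet":{"path":"/health","port":3000},"initialDelaySeconds":10,"periodSeconds":15},"readinessProbe":{"httpGet":{"path":"/ready","port":3000},"initialDelaySeconds":5,"periodSeconds":10}}]}}}}'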

Graceful Shutdown

// Handle shutdown signals: SIGTERM from the orchestrator, SIGINT for local Ctrl-C
process.on('SIGTERM', gracefulShutdown);
process.on('SIGINT', gracefulShutdown);

function gracefulShutdown(signal) {
  console.log(`Received ${signal}. Starting graceful shutdown...`);
  
  server.close((err) => {
    if (err) {
      console.error('Error during shutdown:', err);
      process.exit(1);
    }
    
    closeDatabase()
      .then(() => process.exit(0))
      .catch(() => process.exit(1));
  });
  
  // Force shutdown after timeout
  setTimeout(() => process.exit(1), 30000);
}
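
The 30-second force-exit above matches Kubernetes' default terminationGracePeriodSeconds. If shutdown legitimately takes longer (draining long-lived connections, flushing queues), raise the grace period and keep the in-app timeout slightly below it:

# Give the pod more time between SIGTERM and SIGKILL
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'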

Incident Response

When things go wrong:

  1. Immediate Response (0-5 minutes): Acknowledge alert, assess impact
  2. Investigation (5-30 minutes): Gather logs, identify root cause
  3. Resolution (30+ minutes): Apply fix, verify recovery
  4. Post-Incident (24-48 hours): Conduct blameless post-mortem

A small script that snapshots cluster state at the start of the investigation phase saves scrambling later:

#!/bin/bash
# incident-response.sh
INCIDENT_ID=$(date +%Y%m%d-%H%M%S)
mkdir -p /tmp/incident-$INCIDENT_ID

# Gather system state
kubectl get pods,svc --all-namespaces > /tmp/incident-$INCIDENT_ID/system-state.txt
kubectl get events --all-namespaces > /tmp/incident-$INCIDENT_ID/events.txt

echo "Incident data collected in /tmp/incident-$INCIDENT_ID"

Production containers require discipline and preparation. The techniques in this guide will help you build reliable systems that handle production challenges. Remember: operational excellence is a journey of continuous improvement.