Troubleshooting and Debugging

Debugging containerized applications in Kubernetes environments is fundamentally different from debugging traditional applications. The distributed nature of the system, the ephemeral lifecycle of containers, and the complexity of orchestration create unique challenges that require specialized approaches and tools.

I’ve spent countless hours debugging production issues in containerized environments, and I’ve learned that successful troubleshooting requires a systematic approach combined with deep understanding of how Docker and Kubernetes work together. The key is having the right tools, techniques, and mental models to quickly isolate problems and identify root causes.

Systematic Debugging Methodology

When facing issues in containerized environments, I follow a systematic debugging methodology that starts with understanding the problem scope and progressively narrows down to specific components. This approach prevents the common mistake of diving too deep into details before understanding the broader context.

The first step is always gathering information about the current state of the system. I use a combination of kubectl commands and monitoring tools to get a comprehensive view of what’s happening:

# Get overall cluster health
kubectl get nodes
kubectl top nodes
kubectl get pods --all-namespaces | grep -v Running

# Check specific application status
kubectl get pods -n production -l app=my-app
kubectl describe deployment my-app -n production
kubectl get events -n production --sort-by='.lastTimestamp'

# Review resource utilization
kubectl top pods -n production
kubectl describe node worker-node-1

This initial assessment provides context about whether issues are isolated to specific applications or affecting the entire cluster.
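
A useful refinement of the commands above is to filter for Warning events and for pods that are not running, which usually points straight at the failing component:

# Surface only Warning events across the cluster
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp'

# List pods that are pending, failed, or otherwise not running
kubectl get pods --all-namespaces --field-selector status.phase!=Running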

Container-Level Debugging

When issues are isolated to specific containers, I use a combination of logs, metrics, and interactive debugging to understand what’s happening inside the container. The ephemeral nature of containers makes it crucial to gather information quickly before containers are restarted.

# Get container logs with context
kubectl logs deployment/my-app -n production --previous   # logs from the last terminated container (cannot be combined with -f)
kubectl logs -f deployment/my-app -n production --since=1h

# Get detailed pod information
kubectl describe pod my-app-pod-12345 -n production
kubectl get pod my-app-pod-12345 -n production -o yaml

# Execute commands inside running containers
kubectl exec -it my-app-pod-12345 -n production -- /bin/sh
kubectl exec -it my-app-pod-12345 -n production -- ps aux
kubectl exec -it my-app-pod-12345 -n production -- netstat -tulpn

For applications that don’t include debugging tools in their production images, I run a dedicated debug pod to investigate issues:

apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do sleep 30; done;"]
    volumeMounts:
    - name: proc
      mountPath: /host/proc
      readOnly: true
    - name: sys
      mountPath: /host/sys
      readOnly: true
  volumes:
  - name: proc
    hostPath:
      path: /proc
  - name: sys
    hostPath:
      path: /sys
  hostNetwork: true
  hostPID: true

This debug pod provides access to network debugging tools and host-level information that can help diagnose connectivity and performance issues.
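
On clusters where ephemeral containers are available (stable since Kubernetes 1.25), kubectl debug achieves a similar result without scheduling a separate pod by attaching a tooling container to the running workload. The pod and container names below are placeholders:

# Attach a netshoot ephemeral container to an existing pod, targeting the application container
kubectl debug -it my-app-pod-12345 -n production --image=nicolaka/netshoot --target=my-app

# Or debug a disposable copy of the pod, leaving the original untouched
kubectl debug my-app-pod-12345 -n production -it --image=nicolaka/netshoot --copy-to=my-app-debug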

Application-Level Debugging

Application-level debugging in containerized environments requires instrumentation that provides visibility into application behavior without needing direct access to the container filesystem or process space. I implement comprehensive logging and metrics that support effective debugging:

// Enhanced error logging with context
class ErrorLogger {
  static logError(error, context = {}) {
    const errorInfo = {
      timestamp: new Date().toISOString(),
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
        code: error.code
      },
      context: {
        requestId: context.requestId,
        userId: context.userId,
        operation: context.operation,
        ...context
      },
      system: {
        hostname: process.env.HOSTNAME,
        nodeVersion: process.version,
        memoryUsage: process.memoryUsage(),
        uptime: process.uptime()
      }
    };
    
    logger.error('Application error', errorInfo);
    
    // Increment error metrics
    errorCounter.labels(
      error.name,
      context.operation || 'unknown',
      process.env.HOSTNAME
    ).inc();
  }
}

// Request tracing middleware
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || generateRequestId();
  const startTime = Date.now();
  
  req.context = {
    requestId,
    startTime,
    userAgent: req.get('User-Agent'),
    ip: req.ip
  };
  
  // Log request start
  logger.info('Request started', {
    requestId,
    method: req.method,
    url: req.url,
    userAgent: req.get('User-Agent'),
    ip: req.ip
  });
  
  // Track request completion
  res.on('finish', () => {
    const duration = Date.now() - startTime;
    
    logger.info('Request completed', {
      requestId,
      method: req.method,
      url: req.url,
      statusCode: res.statusCode,
      duration
    });
    
    // Update metrics
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration / 1000);
  });
  
  next();
});

// Global error handler
app.use((error, req, res, next) => {
  ErrorLogger.logError(error, req.context);
  
  res.status(500).json({
    error: 'Internal server error',
    requestId: req.context?.requestId
  });
});

This instrumentation provides the detailed information needed to debug application issues without requiring direct access to containers.
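
Because the logger emits structured JSON, these fields can be filtered straight out of kubectl output. The exact field names depend on how the logging library serializes metadata, so treat this as a sketch:

# Pull only error-level entries from the last hour (assumes JSON log lines with a "level" field)
kubectl logs deployment/my-app -n production --since=1h | jq -c 'select(.level == "error")'

# Follow a single request across log lines by its request ID (placeholder value)
kubectl logs deployment/my-app -n production --since=1h | jq -c 'select(.requestId == "req-abc123")'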

Network Debugging

Network issues are common in containerized environments due to the complexity of Kubernetes networking. I use a systematic approach to diagnose network connectivity problems:

# Test basic connectivity
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the debug pod:
# Test DNS resolution
nslookup my-service.production.svc.cluster.local
dig my-service.production.svc.cluster.local

# Test service connectivity
curl -v http://my-service.production.svc.cluster.local/health
telnet my-service.production.svc.cluster.local 80

# Test external connectivity
curl -v https://api.external-service.com
ping 8.8.8.8

# Check network policies
kubectl get networkpolicies -n production
kubectl describe networkpolicy my-app-policy -n production

For more complex network debugging, I use specialized tools that provide deeper insights into network behavior:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: network-debug
spec:
  selector:
    matchLabels:
      name: network-debug
  template:
    metadata:
      labels:
        name: network-debug
    spec:
      hostNetwork: true
      containers:
      - name: debug
        image: nicolaka/netshoot
        command: ["/bin/bash"]
        args: ["-c", "while true; do sleep 30; done;"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

This DaemonSet provides network debugging capabilities on every node in the cluster.
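
To use it, I exec into the DaemonSet pod that landed on the affected node and capture traffic there. The pod name below is a placeholder, and the commands assume the DaemonSet was created in the current namespace:

# Find the debug pod co-located with the problem workload
kubectl get pods -l name=network-debug -o wide

# Capture traffic on the node's interfaces for the port in question
kubectl exec -it network-debug-x7k2p -- tcpdump -i any -nn port 8080 -c 200

# Inspect conntrack entries when connections appear to hang
kubectl exec -it network-debug-x7k2p -- conntrack -L | grep 8080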

Storage and Volume Debugging

Storage issues can be particularly challenging to debug because they often involve multiple layers: the container filesystem, volume mounts, persistent volumes, and underlying storage systems. I use a systematic approach to isolate storage problems:

# Check persistent volume status
kubectl get pv
kubectl get pvc -n production
kubectl describe pvc my-app-data -n production

# Check volume mounts in pods
kubectl describe pod my-app-pod-12345 -n production
kubectl exec -it my-app-pod-12345 -n production -- df -h
kubectl exec -it my-app-pod-12345 -n production -- mount | grep my-app

# Test file system operations
kubectl exec -it my-app-pod-12345 -n production -- touch /data/test-file
kubectl exec -it my-app-pod-12345 -n production -- ls -la /data/
kubectl exec -it my-app-pod-12345 -n production -- stat /data/

For persistent volume issues, I examine the underlying storage system:

# Check storage class configuration
kubectl get storageclass
kubectl describe storageclass fast-ssd

# Check volume provisioner logs
kubectl logs -n kube-system -l app=ebs-csi-controller

# Check node-level storage
kubectl describe node worker-node-1
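
When it is unclear whether the application or the volume itself is at fault, mounting the same claim from a minimal throwaway pod isolates the storage layer. This sketch assumes the existing my-app-data claim; if the claim is ReadWriteOnce, the pod must land on the node that already mounts it:

apiVersion: v1
kind: Pod
metadata:
  name: storage-debug
  namespace: production
spec:
  containers:
  - name: debug
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data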

Performance Debugging

Performance issues in containerized environments can be caused by resource constraints, inefficient application code, or infrastructure bottlenecks. I use a combination of metrics, profiling, and load testing to identify performance problems:

// Performance profiling middleware
const performanceProfiler = {
  profileRequest(req, res, next) {
    const startTime = process.hrtime.bigint();
    const startCpuUsage = process.cpuUsage();
    const startMemory = process.memoryUsage();
    
    res.on('finish', () => {
      const endTime = process.hrtime.bigint();
      const endCpuUsage = process.cpuUsage(startCpuUsage);
      const endMemory = process.memoryUsage();
      
      const duration = Number(endTime - startTime) / 1e6; // Convert to milliseconds
      const cpuTime = (endCpuUsage.user + endCpuUsage.system) / 1000; // Convert to milliseconds
      const memoryDelta = endMemory.heapUsed - startMemory.heapUsed;
      
      if (duration > 1000) { // Log slow requests
        logger.warn('Slow request detected', {
          requestId: req.context?.requestId,
          method: req.method,
          url: req.url,
          duration,
          cpuTime,
          memoryDelta,
          statusCode: res.statusCode
        });
      }
      
      // Update performance metrics
      requestDurationHistogram
        .labels(req.method, req.route?.path || req.path)
        .observe(duration / 1000);
      
      requestCpuTimeHistogram
        .labels(req.method, req.route?.path || req.path)
        .observe(cpuTime / 1000);
    });
    
    next();
  },
  
  // Memory leak detection
  detectMemoryLeaks() {
    let previousHeapUsed = process.memoryUsage().heapUsed;
    
    setInterval(() => {
      const currentMemory = process.memoryUsage();
      const heapGrowth = currentMemory.heapUsed - previousHeapUsed;
      
      if (heapGrowth > 50 * 1024 * 1024) { // 50MB growth
        logger.warn('Potential memory leak detected', {
          heapUsed: currentMemory.heapUsed,
          heapTotal: currentMemory.heapTotal,
          heapGrowth,
          external: currentMemory.external
        });
      }
      
      previousHeapUsed = currentMemory.heapUsed;
    }, 60000); // Check every minute
  }
};

This profiling middleware surfaces slow requests and flags heap growth patterns that may indicate memory leaks before they degrade application performance.
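
Wiring the profiler into an Express application is a one-time setup. This sketch assumes the requestDurationHistogram and requestCpuTimeHistogram metrics referenced above are already registered with the metrics client:

// Register the profiling middleware early so it wraps every route
app.use(performanceProfiler.profileRequest);

// Start the background heap-growth check once at process start
performanceProfiler.detectMemoryLeaks();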

Resource Constraint Debugging

Resource constraints are a common cause of issues in containerized environments. I use monitoring and analysis tools to identify when applications are hitting resource limits:

# Check resource usage
kubectl top pods -n production --sort-by=memory
kubectl top pods -n production --sort-by=cpu

# Check resource limits and requests
kubectl describe pod my-app-pod-12345 -n production | grep -A 10 "Limits\|Requests"

# Check for OOMKilled containers
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled

# Check node resource availability
kubectl describe node worker-node-1 | grep -A 10 "Allocated resources"

When resource constraints are identified, I analyze the application’s resource usage patterns to determine appropriate resource requests and limits.
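
The result of that analysis goes back into the deployment manifest as requests sized to typical usage and limits sized to tolerable peaks. The numbers below are illustrative, not recommendations:

resources:
  requests:
    cpu: 250m        # roughly the observed steady-state usage
    memory: 256Mi
  limits:
    cpu: "1"         # bursts above this are throttled rather than killed
    memory: 512Mi    # exceeding this gets the container OOM killed, so size it from real peaks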

Distributed Tracing for Debugging

Distributed tracing provides invaluable insights when debugging issues that span multiple services. I implement comprehensive tracing that helps identify bottlenecks and failures in distributed systems:

const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

// Enhanced tracing with error capture
function createTracedFunction(name, fn) {
  return async function(...args) {
    const tracer = trace.getTracer('my-service');
    
    return tracer.startActiveSpan(name, async (span) => {
      try {
        // Add relevant attributes
        span.setAttributes({
          'function.name': name,
          'function.args.count': args.length,
          'service.name': process.env.SERVICE_NAME,
          'service.version': process.env.SERVICE_VERSION
        });
        
        const result = await fn.apply(this, args);
        
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        // Capture error details in span
        span.recordException(error);
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message
        });
        
        // Add error attributes
        span.setAttributes({
          'error.name': error.name,
          'error.message': error.message,
          'error.stack': error.stack
        });
        
        throw error;
      } finally {
        span.end();
      }
    });
  };
}

// Trace database operations
const tracedDbQuery = createTracedFunction('database.query', async (query, params) => {
  const span = trace.getActiveSpan();
  span?.setAttributes({
    'db.statement': query,
    'db.operation': query.split(' ')[0].toUpperCase()
  });
  
  return await db.query(query, params);
});

// Trace HTTP requests
const tracedHttpRequest = createTracedFunction('http.request', async (url, options) => {
  const span = trace.getActiveSpan();
  span?.setAttributes({
    'http.url': url,
    'http.method': options.method || 'GET'
  });
  
  const response = await axios(url, options);
  
  span?.setAttributes({
    'http.status_code': response.status,
    'http.response_size': response.headers['content-length'] || 0
  });
  
  return response;
});

This enhanced tracing provides detailed information about request flows and helps identify where failures occur in distributed systems.
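
The wrapper composes naturally around existing async functions, and because startActiveSpan makes the new span active, traced calls made inside it become child spans automatically once a context manager is registered through the OpenTelemetry SDK. The chargeCustomer function below is a placeholder:

// Wrap an existing business operation (chargeCustomer is hypothetical)
const tracedCharge = createTracedFunction('billing.chargeCustomer', chargeCustomer);

// Any traced call made inside chargeCustomer, such as tracedDbQuery, shows up as a child span
app.post('/payments', async (req, res) => {
  const receipt = await tracedCharge(req.body.userId, req.body.amount);
  res.json(receipt);
});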

Incident Response Procedures

When production issues occur, having well-defined incident response procedures is crucial for minimizing impact and restoring service quickly. I implement incident response procedures that are specifically designed for containerized environments:

#!/bin/bash
# incident-response.sh - Emergency debugging script

set -e

NAMESPACE=${1:-production}
APP_NAME=${2:-my-app}

echo "=== Incident Response Debug Information ==="
echo "Timestamp: $(date)"
echo "Namespace: $NAMESPACE"
echo "Application: $APP_NAME"
echo

echo "=== Cluster Health ==="
kubectl get nodes
kubectl top nodes
echo

echo "=== Application Status ==="
kubectl get pods -n $NAMESPACE -l app=$APP_NAME
kubectl get deployments -n $NAMESPACE -l app=$APP_NAME
kubectl get services -n $NAMESPACE -l app=$APP_NAME
echo

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
echo

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE -l app=$APP_NAME
echo

echo "=== Recent Logs ==="
kubectl logs -n $NAMESPACE -l app=$APP_NAME --since=10m --tail=50
echo

echo "=== Pod Details ==="
for pod in $(kubectl get pods -n $NAMESPACE -l app=$APP_NAME -o jsonpath='{.items[*].metadata.name}'); do
  echo "--- Pod: $pod ---"
  kubectl describe pod $pod -n $NAMESPACE | grep -A 20 "Conditions\|Events"
  echo
done

This incident response script quickly gathers the most important information needed to understand and resolve production issues.
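
During an incident I capture the script’s output to a timestamped file so the snapshot survives subsequent pod restarts and can be attached to the incident record:

# Capture a point-in-time snapshot for the incident record
./incident-response.sh production my-app | tee "incident-$(date +%Y%m%dT%H%M%S).log"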

Preventive Debugging Measures

The best debugging strategy is preventing issues from occurring in the first place. I implement several preventive measures that catch problems early:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: preventive-alerts
spec:
  groups:
  - name: early-warning.rules
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Elevated error rate detected"
        description: "Error rate is {{ $value }} for {{ $labels.service }}"
    
    - alert: SlowResponseTime
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Response time degradation"
        description: "95th percentile response time is {{ $value }}s"
    
    - alert: MemoryLeakSuspected
      expr: increase(process_resident_memory_bytes[1h]) > 100000000
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Potential memory leak"
        description: "Memory usage increased by {{ $value }} bytes in the last hour"

These preventive alerts help identify issues before they become critical problems.

Looking Forward

Effective troubleshooting and debugging in containerized environments requires a combination of systematic methodology, appropriate tools, and deep understanding of how Docker and Kubernetes work together. The techniques and procedures I’ve outlined provide a foundation for quickly identifying and resolving issues when they occur.

The key insight is that debugging containerized applications is fundamentally about understanding the relationships between different system components and having the right observability in place to quickly isolate problems. By implementing comprehensive logging, metrics, tracing, and alerting, you create systems that are not only reliable but also debuggable when issues do occur.

In the final part of this guide, we’ll explore production deployment strategies that bring together all the concepts we’ve covered. We’ll look at how to implement complete production systems that are secure, scalable, observable, and maintainable.