Monitoring and Observability

Running production containers without proper monitoring is like flying blind. I learned this during a midnight outage when our application was failing and we had no visibility into why. Since then, I’ve built comprehensive observability into every containerized system.

The Three Pillars of Observability

Effective observability requires three types of data:

Metrics: Numerical measurements over time (CPU usage, request rates, error counts)
Logs: Discrete events with context (application logs, error messages, audit trails)
Traces: Request flows through distributed systems (service calls, database queries)

Each pillar provides different insights, but they’re most powerful when correlated together.
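
Correlation usually hinges on a shared identifier. Here is a minimal sketch, assuming the OpenTelemetry Node SDK has already been initialized elsewhere and using the winston logger built later in this section: it stamps each log entry with the active trace ID, so a latency spike spotted in metrics can be followed into a trace and then into the matching log lines.

// trace-logging.js (illustrative sketch; assumes OpenTelemetry is initialized elsewhere)
const { trace } = require('@opentelemetry/api');

function logWithTrace(logger, message, fields = {}) {
  const span = trace.getActiveSpan();
  // The same trace_id appears in the tracing backend, which is what lets
  // you pivot from a metric spike to a trace to the relevant log lines.
  logger.info(message, {
    ...fields,
    trace_id: span ? span.spanContext().traceId : undefined
  });
}

module.exports = { logWithTrace };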

Prometheus and Grafana Stack

Prometheus scrapes metrics from your applications, while Grafana provides visualization and alerting:

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true

Deploy Prometheus with proper resource limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
      volumes:
      - name: config
        configMap:
          name: prometheus-config

Application Metrics Integration

Your applications need to expose metrics. I’ll show you the Node.js instrumentation I use in production:

// metrics.js
const promClient = require('prom-client');
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom business metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestsTotal);

// Middleware to collect HTTP metrics
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;
    
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    
    httpRequestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  
  next();
};

module.exports = { register, metricsMiddleware };

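The HTTP metrics above cover the transport layer; the same registry can also hold business-level metrics for whatever the application actually does. A brief sketch follows — the orders_processed_total counter and queue_depth gauge are illustrative names, not part of the original code:

// business-metrics.js (illustrative sketch using the same registry)
const promClient = require('prom-client');
const { register } = require('./metrics');

// Hypothetical domain metrics; rename to match your application
const ordersProcessed = new promClient.Counter({
  name: 'orders_processed_total',
  help: 'Total number of orders processed',
  labelNames: ['status']
});

const queueDepth = new promClient.Gauge({
  name: 'queue_depth',
  help: 'Number of jobs currently waiting in the work queue'
});

register.registerMetric(ordersProcessed);
register.registerMetric(queueDepth);

module.exports = { ordersProcessed, queueDepth };

Call ordersProcessed.inc({ status: 'completed' }) and queueDepth.set(jobs.length) at the relevant points in your code; because both metrics live on the shared registry, they are served by the same /metrics endpoint shown next.
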
Expose metrics in your application:

// app.js
const express = require('express');
const { register, metricsMiddleware } = require('./metrics');

const app = express();
app.use(metricsMiddleware);

// Metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);

Add Prometheus annotations to your deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"

Centralized Logging

Structured logging makes debugging much easier, because every entry is emitted as JSON with consistent fields that a log aggregator can filter and correlate:

// logger.js
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'web-app',
    version: process.env.APP_VERSION || 'unknown'
  },
  transports: [new winston.transports.Console()]
});

// Request logging middleware
const requestLogger = (req, res, next) => {
  const correlationId = req.headers['x-correlation-id'] || generateId();
  req.correlationId = correlationId;
  req.startTime = Date.now();
  
  logger.info('Request started', {
    correlationId,
    method: req.method,
    url: req.url,
    ip: req.ip
  });
  
  res.on('finish', () => {
    logger.info('Request completed', {
      correlationId,
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime
    });
  });
  
  next();
};

function generateId() {
  return Math.random().toString(36).substring(2, 15);
}

module.exports = { logger, requestLogger };

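Wire the middleware into your Express app and include the correlation ID whenever you log from a request handler. A brief sketch, shown as a standalone module; the /orders route and its error handling are illustrative, not from the original application:

// orders.js (illustrative sketch wiring the request logger into Express)
const express = require('express');
const { logger, requestLogger } = require('./logger');

const app = express();
app.use(express.json());
app.use(requestLogger);

app.post('/orders', async (req, res) => {
  try {
    // ... create the order ...
    res.status(201).json({ ok: true });
  } catch (err) {
    // The correlation ID ties this entry to the "Request started" and
    // "Request completed" lines emitted by the middleware.
    logger.error('Order creation failed', {
      correlationId: req.correlationId,
      error: err.message
    });
    res.status(500).json({ ok: false });
  }
});

module.exports = app;
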
Alerting Rules

Set up intelligent alerting that catches real issues:

# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
spec:
  groups:
  - name: application.rules
    rules:
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
        ) > 0.05
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate"

    - alert: HighLatency
      expr: |
        histogram_quantile(0.95, 
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "Service {{ $labels.service }} 95th percentile latency is {{ $value }}s"

    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod is crash looping"
        description: "Pod {{ $labels.pod }} is restarting frequently"

Performance Monitoring

Monitor key performance indicators and set up automated responses:

#!/bin/bash
# performance-monitor.sh

check_performance() {
    local service=$1
    local threshold_p95=1.0

    # Get 95th percentile latency from Prometheus
    p95_latency=$(curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[5m]))by(le))" | jq -r '.data.result[0].value[1]')

    # Skip services that have no recent samples
    if [ -z "$p95_latency" ] || [ "$p95_latency" = "null" ]; then
        echo "Service: $service, no latency data available"
        return
    fi

    echo "Service: $service, P95 latency: ${p95_latency}s"

    if (( $(echo "$p95_latency > $threshold_p95" | bc -l) )); then
        echo "WARNING: High latency detected"
        kubectl scale deployment "$service" --replicas=6
    fi
}

# Monitor all services
for service in web-app api-service; do
    check_performance $service
done

Comprehensive monitoring provides the visibility needed to maintain performance and debug issues in production. In the final part, I’ll cover troubleshooting techniques and operational practices for production container environments.