Advanced Patterns and Techniques

Monitoring Strategies

Health Monitoring

Ensuring service availability and proper functioning:

Health Check Types:

Liveness probes (is the service running?)
Readiness probes (is the service ready for traffic?)
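Startup probes (has the service finished starting? liveness and readiness checks wait until it has)
Dependency health checks (are databases, queues, and downstream services reachable?)
Synthetic transactions (does a scripted end-to-end request succeed?)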

Example Kubernetes Health Probes:

# Kubernetes deployment with health probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: example/order-service:v1.2.3
        ports:
        - containerPort: 8080
        # Liveness probe - determines if the container should be restarted
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        # Readiness probe - determines if the container should receive traffic
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
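
The probes above assume the service answers on /health/live and /health/ready. A minimal sketch of such endpoints, using only the Python standard library, might look like the following. The check_dependencies function and its dependency names are hypothetical placeholders, not part of the original service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    """Hypothetical dependency checks; replace with real pings (database, queue, ...)."""
    return {"database": True, "message_queue": True}


def health_response(path, dependency_status):
    """Map a probe path to an (HTTP status, JSON body) pair."""
    if path == "/health/live":
        # Liveness: the process is up and able to answer. Dependencies are not
        # consulted, so a broken dependency does not trigger a container restart.
        return 200, {"status": "alive"}
    if path == "/health/ready":
        # Readiness: accept traffic only when every dependency is reachable.
        ready = all(dependency_status.values())
        body = {
            "status": "ready" if ready else "not ready",
            "dependencies": dependency_status,
        }
        return (200 if ready else 503), body
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path, check_dependencies())
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(port=8080):
    """Serve the health endpoints; 8080 matches containerPort in the deployment."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling run() starts the server. Kubernetes treats any response code from 200 to 399 as a probe success, so returning 503 from the readiness path is enough to take the pod out of rotation without restarting it.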

Health Monitoring Best Practices:

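Make health checks exercise real functionality instead of always returning success
Include critical dependency health in readiness checks, but keep it out of liveness checks so a dependency outage does not trigger restarts
Set timeouts and failure thresholds so that a single slow response does not mark a container unhealthy
Record and monitor health check results over time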
Implement circuit breakers for dependencies
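
The last practice can be sketched as a small circuit breaker. This is an illustrative, single-threaded sketch under assumed defaults (three consecutive failures, a 30 second reset timeout); the class and parameter names are hypothetical and do not come from a specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the unhealthy dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency request, for example breaker.call(fetch_inventory, order_id), where fetch_inventory is a hypothetical client function. A readiness endpoint can then report the dependency as unavailable while opened_at is set.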

Performance Monitoring

Tracking system performance and resource utilization:

Key Performance Metrics:

Request rate (throughput)
Error rate
Latency (p50, p90, p99)
Resource utilization (CPU, memory)
Saturation (queue depth, thread pool utilization)

The RED Method:

Rate: Requests per second
Errors: Failed requests per second
Duration: Distribution of request latencies
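
As a rough illustration, the three RED signals can be derived from raw request records. The sketch below assumes that a 5xx status marks a failed request and uses the nearest-rank percentile method; the Request type and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # request latency in seconds


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]; expects a non-empty sorted list."""
    rank = math.ceil(q * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_metrics(requests, window_s):
    """Summarize a non-empty list of requests observed during window_s seconds."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,    # Rate
        "errors_per_s": errors / window_s,         # Errors
        "p50_s": percentile(durations, 50),        # Duration
        "p90_s": percentile(durations, 90),
        "p99_s": percentile(durations, 99),
    }
```

In production these values usually come from a metrics system rather than from raw records, but the arithmetic is the same.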

The USE Method:

Utilization: Percentage of time the resource is busy
Saturation: Amount of queued work the resource cannot yet service
Errors: Count of error events
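
A small sketch of how the three USE figures could be aggregated for one resource from periodic samples. The ResourceSample fields are assumptions chosen for illustration; real values would come from the operating system or a metrics agent.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    busy_s: float      # time the resource spent doing work during the interval
    interval_s: float  # length of the sampling interval
    queue_depth: int   # work items waiting at the end of the interval
    errors: int        # error events seen during the interval


def use_summary(samples):
    """Aggregate a non-empty list of samples into USE figures."""
    total_interval = sum(s.interval_s for s in samples)
    return {
        "utilization_pct": 100 * sum(s.busy_s for s in samples) / total_interval,
        "avg_queue_depth": sum(s.queue_depth for s in samples) / len(samples),
        "max_queue_depth": max(s.queue_depth for s in samples),
        "errors": sum(s.errors for s in samples),
    }
```

Reporting the maximum queue depth alongside the average helps because short bursts of saturation can disappear in an average.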

Alerting and Incident Response

Detecting and responding to issues:

Alerting Best Practices:

Alert on symptoms, not causes
Define clear alert thresholds
Reduce alert noise and fatigue
Implement alert severity levels
Provide actionable context

Example Prometheus Alert Rules:

# Prometheus alert rules
groups:
- name: service-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Service {{ $labels.service }} has error rate above 5% (current value: {{ $value | humanizePercentage }})"
      
  - alert: SlowResponseTime
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow response time on {{ $labels.service }}"
      description: "Service {{ $labels.service }} has 95th percentile response time above 2 seconds (current value: {{ $value | humanizeDuration }})"
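
The for field in the rules above keeps an alert in a pending state until its expression has held continuously for the given duration. The following is a simplified model of that behaviour, not Prometheus's actual implementation; the function name and sample format are hypothetical.

```python
def alert_state(samples, threshold, for_s):
    """Return the alert state after the last evaluation.

    samples is a time-ordered list of (timestamp_s, value) pairs, one per
    rule evaluation. The alert becomes "pending" when the value first exceeds
    the threshold and "firing" once it has stayed above it for for_s seconds.
    """
    pending_since = None
    state = "inactive"
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            state = "firing" if ts - pending_since >= for_s else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            pending_since = None
            state = "inactive"
    return state
```

With a threshold of 0.05 and for_s of 120, as in HighErrorRate, a single evaluation that drops below 5% resets the timer. This is how the for clause filters out short spikes.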

Incident Response Process:

Automated detection and alerting
On-call rotation and escalation
Incident classification and prioritization
Communication and coordination
Post-incident review and learning

Advanced Monitoring Techniques

Service Level Objectives (SLOs)

Defining and measuring service reliability:

SLO Components:

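Service Level Indicators (SLIs): Measured signals, such as the fraction of successful requests
Service Level Objectives (SLOs): Target values for an SLI over a time window
Error budgets: The amount of failure the SLO permits within the window
Burn rate alerts: Alerts that fire when the error budget is being consumed too quickly
SLO reporting: Regular reviews of SLO attainment and remaining budget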
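
To make the error budget concrete, here is a minimal sketch for a request-based availability SLO. The function name and the returned keys are hypothetical.

```python
def error_budget(target, total_requests, failed_requests):
    """Return allowed failures and the fraction of the error budget consumed.

    target is the SLO as a fraction, for example 0.999 for 99.9%.
    """
    allowed = (1 - target) * total_requests
    if allowed == 0:
        # No budget at all: any failure exhausts it.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    else:
        consumed = failed_requests / allowed
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }
```

For a 99.9% target and one million requests in the window, the budget is about 1,000 failed requests, so 250 failures consume roughly a quarter of it.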

Example SLO Definition:

# SLO definition
service: order-service
slo:
  name: availability
  target: 99.9%
  window: 30d
sli:
  # Base counter; the criteria below select the good and total requests
  metric: http_requests_total
  success_criteria: status=~"2..|3.."
  total_criteria: status=~"2..|3..|4..|5.."
alerting:
  page_alert:
    threshold: 2%    # 2% of the 30-day error budget consumed within 1h
    window: 1h
  ticket_alert:
    threshold: 5%    # 5% of the 30-day error budget consumed within 6h
    window: 6h
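
The alert thresholds in this definition can be translated into burn rates, meaning how many times faster than the sustainable pace the budget is being spent. The sketch below assumes the 30d window and 99.9% target shown above; the function names are hypothetical.

```python
def burn_rate(budget_fraction, alert_window_h, slo_window_h):
    """Burn rate that consumes budget_fraction of the budget within alert_window_h."""
    return budget_fraction * slo_window_h / alert_window_h


def error_ratio_threshold(rate, target):
    """Error ratio at which the budget burns at the given rate."""
    return rate * (1 - target)


SLO_WINDOW_H = 30 * 24  # window: 30d
TARGET = 0.999          # target: 99.9%

# 2% of the budget in 1h and 5% of the budget in 6h, as in the definition above.
page_rate = burn_rate(0.02, alert_window_h=1, slo_window_h=SLO_WINDOW_H)
ticket_rate = burn_rate(0.05, alert_window_h=6, slo_window_h=SLO_WINDOW_H)
```

Under these assumptions the page alert corresponds to a burn rate of about 14.4, which is an error ratio of roughly 1.44% sustained for an hour. The ticket alert corresponds to a burn rate of about 6, or roughly 0.6% sustained for six hours.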

SLO Implementation Best Practices:

Focus on user-centric metrics
Start with a few critical SLOs
Set realistic and achievable targets
Use error budgets to balance reliability and innovation
Review and refine SLOs regularly