The Foundations of Effective Incident Management

Before diving into specific practices, let’s establish the core principles that underpin effective incident management:

Key Principles

  1. Blameless Culture: Focus on systems and processes, not individuals
  2. Preparedness: Plan and practice for incidents before they occur
  3. Clear Ownership: Define roles and responsibilities clearly
  4. Proportional Response: Match the response to the severity of the incident
  5. Continuous Learning: Use incidents as opportunities to improve

The Incident Lifecycle

Understanding the complete incident lifecycle helps teams develop comprehensive management strategies:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│             │     │             │     │             │     │             │
│  Detection  │────▶│  Response   │────▶│ Resolution  │────▶│ Postmortem  │
│             │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                  │
                                                                  │
                                                                  ▼
                                                           ┌─────────────┐
                                                           │             │
                                                           │ Improvement │
                                                           │             │
                                                           └─────────────┘
  1. Detection: Identifying that an incident is occurring
  2. Response: Assembling the right team and beginning mitigation
  3. Resolution: Implementing fixes to restore service
  4. Postmortem: Analyzing what happened and why
  5. Improvement: Implementing changes to prevent recurrence

Let’s explore each phase in detail.


Incident Detection and Classification

Effective incident management begins with prompt detection and accurate classification.

Detection Mechanisms

Implement multiple layers of detection to catch incidents early:

  1. Monitoring and Alerting: Automated systems that detect anomalies
  2. User Reports: Channels for users to report issues
  3. Business Metrics: Tracking business impact metrics (e.g., order rate)
  4. Synthetic Monitoring: Simulated user journeys to detect issues proactively (see the sketch after the alert rule below)

Example Prometheus Alert Rule:

groups:
- name: availability
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High HTTP error rate"
      description: "Error rate is {{ $value | humanizePercentage }} for the past 2 minutes (threshold: 5%)"
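
To complement the alert rule above, detection mechanism 4 (synthetic monitoring) can be covered with a small scripted user journey that fails loudly when the check breaks. The sketch below is a minimal, assumed example: the endpoint URL, timeout, and latency threshold are illustrative, and the script is meant to run on a schedule (cron, a Kubernetes CronJob, or your monitoring platform’s synthetic runner).

Example Synthetic Check (Python sketch):

import sys
import time

import requests

# Assumed endpoint and thresholds; replace with a real user journey.
CHECKOUT_URL = "https://shop.example.com/api/checkout/health"
TIMEOUT_SECONDS = 5
LATENCY_THRESHOLD_SECONDS = 2.0


def run_check() -> bool:
    """Return True if the simulated journey succeeds within the latency budget."""
    start = time.monotonic()
    try:
        response = requests.get(CHECKOUT_URL, timeout=TIMEOUT_SECONDS)
    except requests.RequestException as exc:
        print(f"SYNTHETIC CHECK FAILED: request error: {exc}")
        return False
    latency = time.monotonic() - start

    if response.status_code != 200:
        print(f"SYNTHETIC CHECK FAILED: status {response.status_code}")
        return False
    if latency > LATENCY_THRESHOLD_SECONDS:
        print(f"SYNTHETIC CHECK FAILED: latency {latency:.2f}s over threshold")
        return False

    print(f"SYNTHETIC CHECK OK: status 200 in {latency:.2f}s")
    return True


if __name__ == "__main__":
    sys.exit(0 if run_check() else 1)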

Incident Classification

Classify incidents to ensure proportional response:

Example Severity Levels:

Level | Name     | Description                                        | Examples                                                                   | Response Time | Communication
------|----------|----------------------------------------------------|----------------------------------------------------------------------------|---------------|--------------------------------
P1    | Critical | Complete service outage or severe business impact  | Payment system down; data loss; security breach                           | Immediate     | Executive updates every 30 min
P2    | High     | Partial service outage or significant degradation  | Checkout slow; feature unavailable; performance degradation               | < 15 min      | Stakeholder updates hourly
P3    | Medium   | Minor service degradation                          | Non-critical feature issue; isolated errors; slow performance in one area | < 1 hour      | Daily summary
P4    | Low      | Minimal or no user impact                          | Cosmetic issues; internal tooling issues; technical debt                  | < 1 day       | Weekly report
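
The severity matrix above can also be encoded as data so that tooling (paging, status updates, reporting) applies the same response expectations automatically. The sketch below is a minimal, assumed representation of the table; the structure and helper are illustrative rather than a standard format.

Example Severity Policy as Code (Python sketch):

from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    """Response expectations for one severity level (mirrors the table above)."""
    name: str
    response_time: str
    communication: str


# Keyed by severity level so automation and humans share one definition.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy("Critical", "Immediate", "Executive updates every 30 min"),
    "P2": SeverityPolicy("High", "< 15 min", "Stakeholder updates hourly"),
    "P3": SeverityPolicy("Medium", "< 1 hour", "Daily summary"),
    "P4": SeverityPolicy("Low", "< 1 day", "Weekly report"),
}


def communication_plan(severity: str) -> str:
    """Summarize the expected response for a classified incident."""
    policy = SEVERITY_POLICIES[severity]
    return f"{severity} ({policy.name}): respond {policy.response_time}; {policy.communication}"


print(communication_plan("P2"))
# -> P2 (High): respond < 15 min; Stakeholder updates hourly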

Example Classification Decision Tree:

Is there complete loss of service?
├── Yes → P1
└── No → Is there significant degradation affecting most users?
    ├── Yes → P2
    └── No → Is there partial degradation affecting some users?
        ├── Yes → P3
        └── No → P4

Automated Triage

Implement automated triage to speed up classification:

Example Automated Triage System:

def classify_incident(alert_data):
    """Automatically classify incident severity based on alert data."""
    
    # Extract metrics from alert
    service = alert_data.get('service')
    error_rate = alert_data.get('error_rate', 0)
    affected_users = alert_data.get('affected_users', 0)
    is_revenue_impacting = alert_data.get('is_revenue_impacting', False)
    
    # Critical services list
    critical_services = ['payments', 'checkout', 'authentication', 'database']
    
    # Classification logic
    if service in critical_services and error_rate > 0.20:
        return 'P1'
    elif is_revenue_impacting or error_rate > 0.10 or affected_users > 1000:
        return 'P2'
    elif error_rate > 0.05 or affected_users > 100:
        return 'P3'
    else:
        return 'P4'
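
A hedged usage sketch follows: the payload fields are assumptions chosen to match what classify_incident expects (they are not a standard alerting webhook format), so in practice you would map your alerting system’s payload into this shape first.

Example Usage:

# Hypothetical payload; field names are assumptions matching classify_incident.
alert_data = {
    'service': 'checkout',
    'error_rate': 0.12,          # 12% of requests failing
    'affected_users': 450,
    'is_revenue_impacting': True,
}

severity = classify_incident(alert_data)
print(f"Incident classified as {severity}")  # -> P2 (revenue impacting)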

On-Call Systems and Rotations

A well-designed on-call system is essential for effective incident response.

On-Call Best Practices

  1. Sustainable Rotations: Design rotations that don’t burn out your team
  2. Clear Escalation Paths: Define who to contact when additional help is needed
  3. Adequate Training: Ensure on-call engineers have the knowledge they need
  4. Fair Compensation: Compensate engineers for on-call duties
  5. Continuous Improvement: Regularly review and improve the on-call experience

Rotation Structures

Different rotation structures work for different team sizes and distributions:

Example Rotation Patterns:

  1. Primary/Secondary Model:

    • Primary: First responder for all alerts
    • Secondary: Backup if primary is unavailable or needs assistance
  2. Follow-the-Sun Model:

    • Teams in different time zones handle on-call during their daytime
    • Minimizes night shifts but requires distributed teams
  3. Specialty-Based Model:

    • Different rotations for different systems (e.g., database, frontend)
    • Engineers only on-call for systems they’re familiar with
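
Pattern 1 (primary/secondary) is usually configured directly in the paging tool, but the underlying rotation logic is simple. The following is a minimal sketch, assuming a weekly hand-off and a hypothetical four-person roster; it only computes who is on call and does not page anyone. The follow-the-sun PagerDuty configuration below covers pattern 2.

Example Primary/Secondary Rotation Calculation (Python sketch):

from datetime import date

# Hypothetical roster; order determines the hand-off sequence.
ROSTER = ["alice", "bob", "carol", "dave"]


def current_on_call(today: date):
    """Return (primary, secondary) for a simple weekly primary/secondary rotation."""
    # ISO week number drives the weekly hand-off; adjust for year boundaries in production.
    week = today.isocalendar()[1]
    primary = ROSTER[week % len(ROSTER)]
    secondary = ROSTER[(week + 1) % len(ROSTER)]
    return primary, secondary


print(current_on_call(date.today()))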

Example PagerDuty Schedule Configuration:

# Follow-the-Sun rotation with 3 teams
schedules:
  - name: "Global SRE On-Call"
    time_zone: "UTC"
    layers:
      - name: "APAC Team"
        start: "2023-01-01T00:00:00Z"
        rotation_virtual_start: "2023-01-01T00:00:00Z"
        rotation_turn_length_seconds: 86400  # 24 hours
        users:
          - user_id: "PXXXXX1"  # Tokyo
          - user_id: "PXXXXX2"  # Singapore
          - user_id: "PXXXXX3"  # Sydney
        restrictions:
          - type: "daily"
            start_time_of_day: "22:00:00"
            duration_seconds: 32400  # 9 hours (22:00 - 07:00 UTC)
            
      - name: "EMEA Team"
        start: "2023-01-01T00:00:00Z"
        rotation_virtual_start: "2023-01-01T00:00:00Z"
        rotation_turn_length_seconds: 86400  # 24 hours
        users:
          - user_id: "PXXXXX4"  # London
          - user_id: "PXXXXX5"  # Berlin
          - user_id: "PXXXXX6"  # Tel Aviv
        restrictions:
          - type: "daily"
            start_time_of_day: "06:00:00"
            duration_seconds: 32400  # 9 hours (06:00 - 15:00 UTC)
            
      - name: "Americas Team"
        start: "2023-01-01T00:00:00Z"
        rotation_virtual_start: "2023-01-01T00:00:00Z"
        rotation_turn_length_seconds: 86400  # 24 hours
        users:
          - user_id: "PXXXXX7"  # New York
          - user_id: "PXXXXX8"  # San Francisco
          - user_id: "PXXXXX9"  # São Paulo
        restrictions:
          - type: "daily"
            start_time_of_day: "14:00:00"
            duration_seconds: 32400  # 9 hours (14:00 - 23:00 UTC)

On-Call Tooling

Equip your on-call engineers with the right tools:

  1. Alerting System: PagerDuty, OpsGenie, or VictorOps
  2. Runbooks: Documented procedures for common incidents
  3. Communication Tools: Slack, Teams, or dedicated incident channels
  4. Dashboards: Real-time visibility into system health
  5. Access Management: Just-in-time access to production systems
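
For item 3 (communication tools), many teams automate the creation of a dedicated incident channel when an incident is declared. The sketch below is a minimal, assumed example using the slack_sdk Python client; the token environment variable, channel naming scheme, and message text are illustrative, and your paging or chat tooling may already provide this.

Example Incident Channel Automation (Python sketch):

import os
from datetime import datetime, timezone

from slack_sdk import WebClient

# Assumes a bot token (with channel-creation and chat scopes) in the environment.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def open_incident_channel(incident_id: str, severity: str, summary: str) -> str:
    """Create a dedicated incident channel and post a kickoff message."""
    channel_name = f"inc-{incident_id}-{severity.lower()}"  # e.g. inc-1042-p1 (hypothetical scheme)
    response = client.conversations_create(name=channel_name)
    channel_id = response["channel"]["id"]

    client.chat_postMessage(
        channel=channel_id,
        text=(
            f":rotating_light: {severity} incident {incident_id} declared at "
            f"{datetime.now(timezone.utc).isoformat()}.\n"
            f"Summary: {summary}"
        ),
    )
    return channel_id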

Example Runbook Template:

# Service Outage Runbook: Payment Processing System

## Quick Reference
- **Service Owner**: Payments Team
- **Service Dashboard**: [Link to Dashboard](https://grafana.example.com/d/payments)
- **Repository**: [GitHub Link](https://github.com/example/payments-service)
- **Architecture Diagram**: [Link to Diagram](https://wiki.example.com/payments/architecture)

## Symptoms
- Payment failure rate > 5%
- Increased latency in payment processing (> 2s)
- Error logs showing connection timeouts to payment gateway

## Initial Assessment
1. Check the service dashboard for error rates and latency
2. Verify payment gateway status: [Gateway Status Page](https://status.paymentprovider.com)
3. Check recent deployments: `kubectl get deployments -n payments --sort-by=.metadata.creationTimestamp`

## Diagnosis Steps
1. **Check for increased error rates**:

kubectl logs -n payments -l app=payment-service --tail=100 | grep ERROR


2. **Check database connectivity**:

kubectl exec -it -n payments $(kubectl get pods -n payments -l app=payment-service -o jsonpath='{.items[0].metadata.name}') -- pg_isready -h payments-db


3. **Verify payment gateway connectivity**:

kubectl exec -it -n payments $(kubectl get pods -n payments -l app=payment-service -o jsonpath='{.items[0].metadata.name}') -- curl -v https://api.paymentprovider.com/health


## Resolution Steps

### Scenario 1: Database Connection Issues
1. Check database pod status:

kubectl get pods -n payments -l app=payments-db

2. Check database logs:

kubectl logs -n payments -l app=payments-db

3. If database is down, check for resource constraints:

kubectl describe node $(kubectl get pods -n payments -l app=payments-db -o jsonpath='{.items[0].spec.nodeName}')

4. Restart database if necessary:

kubectl rollout restart statefulset payments-db -n payments