Advanced Patterns and Techniques

Blameless Postmortems

After an incident is resolved, a thorough postmortem helps teams learn and improve.

Postmortem Philosophy

Effective postmortems follow these principles:

Blameless: Focus on systems and processes, not individuals
Thorough: Dig deep to find root causes
Action-Oriented: Identify concrete improvements
Transparent: Share findings widely
Timely: Conduct while details are fresh

Postmortem Template

Use a consistent template for all postmortems:

# Incident Postmortem: [Brief Incident Description]

## Incident Summary
- **Date**: [Date of incident]
- **Duration**: [HH:MM] to [HH:MM] (X hours Y minutes)
- **Severity**: [P1/P2/P3/P4]
- **Service(s) Affected**: [List of affected services]
- **Customer Impact**: [Description of customer impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:32 | Alert triggered: High error rate on payment service |
| 14:35 | On-call engineer acknowledged alert |
| 14:42 | Identified database connection pool exhaustion |
| 15:10 | Implemented mitigation: increased connection pool size |
| 15:15 | Service recovered, error rates returned to normal |
| 16:00 | Incident closed |

## Root Cause Analysis
[Detailed explanation of what caused the incident. Avoid blaming individuals; focus on systems, processes, and technical factors.]

## Contributing Factors
- [Factor 1]
- [Factor 2]
- [Factor 3]

## What Went Well
- [Positive aspect 1]
- [Positive aspect 2]
- [Positive aspect 3]

## What Went Poorly
- [Area for improvement 1]
- [Area for improvement 2]
- [Area for improvement 3]

## Action Items
| Action | Type | Owner | Due Date | Status |
|--------|------|-------|----------|--------|
| Increase default connection pool size | Prevent | @alice | 2025-04-15 | In Progress |
| Add monitoring for connection pool utilization | Detect | @bob | 2025-04-20 | Not Started |
| Update runbook with connection pool troubleshooting steps | Respond | @charlie | 2025-04-12 | Completed |

## Lessons Learned
[Key takeaways and broader lessons that can be applied to other systems or processes]
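
The Duration field in the summary can be derived from the timeline instead of being computed by hand. The following is a minimal sketch, assuming the timeline is kept as (time, event) pairs in UTC and that duration is measured from the first entry to the last; the names `TIMELINE`, `elapsed`, and `format_duration` are illustrative and not part of any existing tool.

```python
from datetime import datetime, timedelta

# Timeline entries copied from the example table above (times are UTC).
TIMELINE = [
    ("14:32", "Alert triggered: High error rate on payment service"),
    ("14:35", "On-call engineer acknowledged alert"),
    ("14:42", "Identified database connection pool exhaustion"),
    ("15:10", "Implemented mitigation: increased connection pool size"),
    ("15:15", "Service recovered, error rates returned to normal"),
    ("16:00", "Incident closed"),
]


def parse_time(hhmm: str) -> datetime:
    """Parse an HH:MM string; the date part is only a placeholder."""
    return datetime.strptime(hhmm, "%H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM stamps, assuming at most one midnight crossing."""
    delta = parse_time(end) - parse_time(start)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


def format_duration(delta: timedelta) -> str:
    """Render a timedelta in the template's 'X hours Y minutes' form."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


start, end = TIMELINE[0][0], TIMELINE[-1][0]
duration_line = (
    f"- **Duration**: {start} to {end} ({format_duration(elapsed(start, end))})"
)
print(duration_line)
print("Time to acknowledge:", format_duration(elapsed(TIMELINE[0][0], TIMELINE[1][0])))
```

Some teams measure duration only until service recovery (15:15 in this example) rather than until the incident is closed; whichever convention is chosen, applying it consistently keeps postmortems comparable.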

The Five Whys Technique

Use the “Five Whys” technique to identify root causes:

Example:

Problem: The payment service experienced high error rates.

Why? Database connections were being rejected.
Why? The connection pool was exhausted.
Why? The service was creating more connections than expected.
Why? A recent code change removed connection pooling.
Why? The code review process didn't catch the removal of connection pooling.

Root cause: The code review process lacks specific checks for critical resource management patterns.
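
If postmortems are stored in a structured form, the chain of whys can be recorded as data so it can be rendered or reviewed later. This is only a sketch of one possible representation; the `FiveWhys` class and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """A problem statement plus an ordered chain of 'why' answers."""
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "FiveWhys":
        """Record the answer to the next 'why' and allow chaining."""
        self.answers.append(answer)
        return self

    def deepest_cause(self) -> str:
        """Return the last recorded answer, the candidate root cause."""
        if not self.answers:
            raise ValueError("No answers recorded yet")
        return self.answers[-1]

    def render(self) -> str:
        """Format the chain in the same layout as the example above."""
        lines = [f"Problem: {self.problem}"]
        lines += [f"Why? {answer}" for answer in self.answers]
        return "\n".join(lines)


analysis = FiveWhys("The payment service experienced high error rates.")
analysis.ask_why("Database connections were being rejected.")
analysis.ask_why("The connection pool was exhausted.")
analysis.ask_why("The service was creating more connections than expected.")
analysis.ask_why("A recent code change removed connection pooling.")
analysis.ask_why("The code review process didn't catch the removal of connection pooling.")

print(analysis.render())
```

The last answer is only a candidate: the root cause statement is still written by the team, as in the example, where the final answer is generalized into a gap in the code review process.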

Tracking Improvements

Create systems to track and implement improvements identified in postmortems:

Example Improvement Tracking:

# Example improvement tracking system
from datetime import datetime

class ImprovementItem:
    def __init__(self, description, incident_id, improvement_type, owner, due_date):
        self.description = description
        self.incident_id = incident_id
        self.improvement_type = improvement_type  # "prevent", "detect", or "respond"
        self.owner = owner
        self.due_date = due_date
        self.status = "Not Started"
        self.completion_date = None
    
    def update_status(self, status):
        valid_statuses = ["Not Started", "In Progress", "Completed", "Deferred"]
        if status not in valid_statuses:
            raise ValueError(f"Status must be one of: {valid_statuses}")
        
        self.status = status
        if status == "Completed":
            self.completion_date = datetime.now()
    
    def is_overdue(self):
        return self.status != "Completed" and datetime.now() > self.due_date

# Example usage
improvements = [
    ImprovementItem(
        "Increase default connection pool size",
        "INC-2025-042",
        "prevent",
        "@alice",
        datetime(2025, 4, 15)
    ),
    ImprovementItem(
        "Add monitoring for connection pool utilization",
        "INC-2025-042",
        "detect",
        "@bob",
        datetime(2025, 4, 20)
    )
]

# Generate report of overdue items
overdue_items = [item for item in improvements if item.is_overdue()]
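
A list of overdue items becomes more actionable when it is grouped by owner. The sketch below is one way to do that, under two assumptions: it uses a simplified, hypothetical `ActionItem` record so that it runs on its own, and it takes the reference time as a `now` parameter instead of reading the clock, which keeps the overdue check deterministic and easy to test.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ActionItem:
    """Hypothetical, simplified stand-in for a tracked improvement."""
    description: str
    owner: str
    due_date: datetime
    status: str = "Not Started"


def overdue_by_owner(items, now):
    """Group descriptions of unfinished, past-due items by owner."""
    report = defaultdict(list)
    for item in items:
        if item.status != "Completed" and now > item.due_date:
            report[item.owner].append(item.description)
    return dict(report)


# Action items taken from the postmortem template's example table.
items = [
    ActionItem("Increase default connection pool size",
               "@alice", datetime(2025, 4, 15), "In Progress"),
    ActionItem("Add monitoring for connection pool utilization",
               "@bob", datetime(2025, 4, 20)),
    ActionItem("Update runbook with connection pool troubleshooting steps",
               "@charlie", datetime(2025, 4, 12), "Completed"),
]

# The reference date is chosen for illustration only.
report = overdue_by_owner(items, now=datetime(2025, 4, 18))
for owner, descriptions in report.items():
    print(f"{owner}: {len(descriptions)} overdue")
    for description in descriptions:
        print(f"  - {description}")
```

With the illustrative reference date of 2025-04-18, only the first item is reported: the second is not yet due and the third is already completed.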

Building a Learning Culture

Effective incident management goes beyond processes and tools—it requires a culture that values learning and improvement.

Incident Reviews

Hold regular incident reviews to share learnings:

Example Incident Review Format:

# Monthly Incident Review: April 2025

## Incident Summary
- Total incidents: 12 (3 P1, 4 P2, 5 P3)
- Average time to detection: 4.2 minutes
- Average time to mitigation: 32 minutes
- Average time to resolution: 78 minutes

## Top Impacted Services
1. Payment Processing (3 incidents)
2. User Authentication (2 incidents)
3. Search Service (2 incidents)