When Things Need to Stop (And How to Make Them)

Cancellation is where context really shines, but it’s also where I see the most confusion. Too many developers think cancellation is just about timeouts: press a button, the operation stops. In reality, cancellation in distributed systems is more like conducting an orchestra - every section has to come to a stop together, cleanly, on your cue.

The trick isn’t just stopping work—it’s stopping work cleanly, without leaving your system in a weird state or leaking resources all over the place.

The Cascade Effect

One of the coolest things about Go’s context model is how cancellation cascades down through derived contexts. Cancel a parent, and all the children stop automatically:

func runComplexWorkflow(ctx context.Context) error {
    // Create a workflow-specific context
    workflowCtx, cancel := context.WithCancel(ctx)
    defer cancel()
    
    // Channel to collect errors from goroutines
    errChan := make(chan error, 3)
    
    // Start three concurrent operations
    go func() {
        errChan <- fetchUserProfile(workflowCtx)
    }()
    
    go func() {
        errChan <- generateAnalytics(workflowCtx)
    }()
    
    go func() {
        errChan <- updateRecommendations(workflowCtx)
    }()
    
    // Wait for all three to finish, bailing out on the first error
    for i := 0; i < 3; i++ {
        select {
        case err := <-errChan:
            if err != nil {
                // Something failed - cancel everything else
                cancel()
                return fmt.Errorf("workflow failed: %w", err)
            }
        case <-ctx.Done():
            // Parent context cancelled - we're done here
            return ctx.Err()
        }
    }
    
    return nil
}

What I love about this pattern is that one failure automatically stops all related work. No need to manually track and cancel individual operations—the context tree handles it for you.
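
If you want to see the cascade in isolation, here’s a minimal self-contained demo - cancel the parent, and the child’s Done channel fires:

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    parent, cancelParent := context.WithCancel(context.Background())
    child, cancelChild := context.WithCancel(parent)
    defer cancelChild()

    go func() {
        <-child.Done() // fires when the parent OR the child is cancelled
        fmt.Println("child stopped:", child.Err())
    }()

    cancelParent()                     // cancel only the parent...
    time.Sleep(100 * time.Millisecond) // ...and the child observes it
}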

Selective Cancellation (When You Need More Control)

Sometimes you don’t want to cancel everything. Maybe the user data fetch failed, but you still want to show cached recommendations. Here’s how I handle selective cancellation:

type WorkManager struct {
    operations map[string]context.CancelFunc
    mu         sync.RWMutex
}

func NewWorkManager() *WorkManager {
    return &WorkManager{
        operations: make(map[string]context.CancelFunc),
    }
}

func (wm *WorkManager) StartOperation(parent context.Context, name string) context.Context {
    wm.mu.Lock()
    defer wm.mu.Unlock()
    
    // Cancel any existing operation with the same name so its
    // cancel func doesn't leak when we overwrite the map entry
    if cancel, exists := wm.operations[name]; exists {
        cancel()
    }
    
    ctx, cancel := context.WithCancel(parent)
    wm.operations[name] = cancel
    return ctx
}

func (wm *WorkManager) CancelOperation(name string) {
    wm.mu.Lock()
    defer wm.mu.Unlock()
    
    if cancel, exists := wm.operations[name]; exists {
        cancel()
        delete(wm.operations, name)
    }
}

func (wm *WorkManager) CancelAll() {
    wm.mu.Lock()
    defer wm.mu.Unlock()
    
    for _, cancel := range wm.operations {
        cancel()
    }
    wm.operations = make(map[string]context.CancelFunc)
}

This gives you fine-grained control over what gets cancelled and when. I use this pattern in systems where different operations have different criticality levels. One caveat: call CancelOperation when an operation finishes normally too, otherwise the map slowly accumulates stale cancel funcs.
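
Here’s roughly how it looks in use - the operation names and the doWork helper are placeholders, not part of the pattern:

func manageOperations() {
    wm := NewWorkManager()
    
    profileCtx := wm.StartOperation(context.Background(), "profile")
    recsCtx := wm.StartOperation(context.Background(), "recommendations")
    
    go doWork(profileCtx) // doWork stands in for your actual operation
    go doWork(recsCtx)
    
    // The profile fetch is misbehaving - kill just that one
    wm.CancelOperation("profile")
    
    // On shutdown, stop whatever is still running
    wm.CancelAll()
}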

Smart Timeout Coordination

Here’s something that took me a while to figure out: not all operations should have the same timeout. A cache lookup should fail fast, but a complex calculation might need more time:

func processRequestWithSmartTimeouts(ctx context.Context, req *Request) error {
    // Fast operations get short timeouts
    cacheCtx, cacheCancel := context.WithTimeout(ctx, 100*time.Millisecond)
    defer cacheCancel()
    
    // Try cache first
    if data, err := getFromCache(cacheCtx, req.Key); err == nil {
        return processData(ctx, data)
    }
    
    // Cache miss - hit the database with a longer timeout, created
    // only now so the cache lookup doesn't eat into its budget
    dbCtx, dbCancel := context.WithTimeout(ctx, 5*time.Second)
    defer dbCancel()
    
    data, err := getFromDatabase(dbCtx, req.Key)
    if err != nil {
        return err
    }
    
    // Update cache in background (with its own timeout)
    go func() {
        updateCtx, updateCancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer updateCancel()
        updateCache(updateCtx, req.Key, data)
    }()
    
    return processData(ctx, data)
}

Notice how the cache update runs in a background goroutine with its own context? That’s because we don’t want cache update failures to affect the main request.
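
Since Go 1.21 the standard library has a helper for exactly this detachment: context.WithoutCancel returns a context that keeps the parent’s values (request IDs, trace metadata) but ignores its cancellation. The background update above could be written like this instead:

go func() {
    // Detached from the request's cancellation, but still carrying its values
    detached := context.WithoutCancel(ctx)
    updateCtx, updateCancel := context.WithTimeout(detached, 2*time.Second)
    defer updateCancel()
    updateCache(updateCtx, req.Key, data)
}()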

Cancellation with Cleanup

This is where things get tricky. When an operation gets cancelled, you often need to clean up resources, but the cleanup itself might take time:

func processWithCleanup(ctx context.Context) error {
    // Track cleanup functions for the resources we acquire
    var cleanups []func(context.Context) error
    defer func() {
        // Clean up on a fresh context with its own deadline -
        // by this point the main context may already be cancelled
        cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        
        for _, cleanup := range cleanups {
            if err := cleanup(cleanupCtx); err != nil {
                log.Printf("Cleanup failed: %v", err)
            }
        }
    }()
    
    // Acquire resources, registering a cleanup step for each
    db, err := openDatabase(ctx)
    if err != nil {
        return err
    }
    cleanups = append(cleanups, db.Shutdown)
    
    cache, err := openCache(ctx)
    if err != nil {
        return err
    }
    cleanups = append(cleanups, cache.Shutdown)
    
    // Do the actual work
    return performWork(ctx, db, cache)
}

The key insight here is using a separate context for cleanup. Even if the main context is cancelled, you still want to clean up properly.
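
Not every resource offers a context-aware shutdown, but both kinds adapt easily into the cleanup-func shape above. Here http.Server stands in for the context-aware case (its Shutdown method already has the right signature), and file is a hypothetical plain io.Closer:

// Context-aware: Shutdown already matches func(context.Context) error
srv := &http.Server{Addr: ":8080"}
cleanups = append(cleanups, srv.Shutdown)

// Plain io.Closer: wrap it, ignoring the context it can't use
cleanups = append(cleanups, func(context.Context) error {
    return file.Close()
})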

Handling Different Cancellation Reasons

Not all cancellations are created equal. User cancellation is different from timeout, which is different from system shutdown:

func handleCancellation(ctx context.Context, operation string) error {
    err := doSomeWork(ctx)
    
    if err == nil {
        return nil
    }
    
    // Figure out why we were cancelled
    switch {
    case errors.Is(err, context.Canceled):
        // User hit the cancel button - that's fine
        log.Printf("User cancelled %s operation", operation)
        return nil
        
    case errors.Is(err, context.DeadlineExceeded):
        // Operation timed out - might be a problem
        log.Printf("Operation %s timed out", operation)
        return fmt.Errorf("operation timeout: %w", err)
        
    default:
        // Some other error occurred
        return fmt.Errorf("operation failed: %w", err)
    }
}

I treat user cancellation as success (they got what they wanted—the operation stopped), but timeouts might indicate a performance problem that needs investigation.
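
If you’re on Go 1.20 or later, context.WithCancelCause and context.Cause let you attach a reason when you cancel and recover it afterwards, which makes this kind of triage more precise. A quick sketch, with ErrUserAbort as a made-up sentinel:

var ErrUserAbort = errors.New("user pressed cancel")

func runCancellable() {
    ctx, cancel := context.WithCancelCause(context.Background())
    
    // In the UI layer, record why we're cancelling
    cancel(ErrUserAbort)
    
    // Back in the worker, check the recorded cause
    if errors.Is(context.Cause(ctx), ErrUserAbort) {
        log.Println("user abort - treating as success")
    }
}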

The “Cancel Everything” Pattern

Sometimes you need a nuclear option—cancel all ongoing work immediately. Here’s how I implement that:

type CancellationManager struct {
    rootCtx    context.Context
    rootCancel context.CancelFunc
    mu         sync.RWMutex
}

func NewCancellationManager() *CancellationManager {
    ctx, cancel := context.WithCancel(context.Background())
    
    return &CancellationManager{
        rootCtx:    ctx,
        rootCancel: cancel,
    }
}

func (cm *CancellationManager) CreateContext() (context.Context, context.CancelFunc) {
    cm.mu.RLock()
    defer cm.mu.RUnlock()
    
    // Every context derives from the same root, so cancelling the
    // root cancels everything this method has ever handed out
    return context.WithCancel(cm.rootCtx)
}

func (cm *CancellationManager) CancelEverything() {
    cm.mu.Lock()
    defer cm.mu.Unlock()
    
    if cm.rootCancel != nil {
        cm.rootCancel()
        cm.rootCancel = nil
    }
}

This is useful for graceful shutdown scenarios where you want to stop all ongoing work before the process exits.
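
To make that concrete, here’s a sketch of wiring the manager to process signals with os/signal’s NotifyContext - runWorkers is a placeholder for whatever your service actually does:

func main() {
    // Ctrl+C or SIGTERM cancels this context
    sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    defer stop()
    
    cm := NewCancellationManager()
    
    // Relay the signal to everything the manager handed out
    go func() {
        <-sigCtx.Done()
        cm.CancelEverything()
    }()
    
    workCtx, cancel := cm.CreateContext()
    defer cancel()
    runWorkers(workCtx)
}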

The thing about cancellation patterns is that they’re not just about stopping work—they’re about stopping work in a way that leaves your system in a consistent state. Master these patterns, and you’ll build systems that handle failures gracefully instead of falling over in a heap.

Next, we’ll dive into timeout and deadline management. You’ll learn how to set intelligent timeouts that adapt to system conditions, coordinate deadlines across service boundaries, and handle the tricky edge cases that come up in distributed systems.