Implement proper shutdown procedures for Go applications.

Signal Handling Fundamentals

At the core of graceful shutdown is the ability to detect and respond to termination signals. Before diving into complex implementations, let’s establish a solid understanding of signal handling in Go.

Understanding OS Signals

Operating systems communicate with processes through signals. The most common signals relevant to application lifecycle management include:

  1. SIGINT (Ctrl+C): Interrupt signal, typically sent when a user presses Ctrl+C
  2. SIGTERM: Termination signal, the standard way to request graceful termination
  3. SIGKILL: Kill signal, forces immediate termination (cannot be caught or ignored)
  4. SIGHUP: Hangup signal, traditionally used to indicate a controlling terminal has closed

In Go, we can capture and handle these signals using the os/signal package and channels:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Create a channel to receive OS signals
	sigs := make(chan os.Signal, 1)
	
	// Register for specific signals
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	
	// Create a channel to indicate when processing is done
	done := make(chan bool, 1)
	
	// Start a goroutine to handle signals
	go func() {
		// Block until a signal is received
		sig := <-sigs
		fmt.Printf("Received signal: %s\n", sig)
		
		// Perform cleanup operations
		fmt.Println("Starting graceful shutdown...")
		time.Sleep(2 * time.Second) // Simulate cleanup work
		fmt.Println("Cleanup completed, shutting down...")
		
		// Signal completion
		done <- true
	}()
	
	fmt.Println("Application running... Press Ctrl+C to terminate")
	
	// Block until done signal is received
	<-done
	fmt.Println("Application stopped")
}

This simple example demonstrates the basic pattern for signal handling in Go:

  1. Create a channel to receive signals
  2. Register for specific signals using signal.Notify()
  3. Start a goroutine to handle signals and perform cleanup
  4. Block the main goroutine until cleanup is complete

Context-Based Cancellation

Go’s context package provides a powerful mechanism for propagating cancellation signals throughout your application. This is particularly useful for graceful shutdown scenarios:

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// Create a base context with cancellation capability
	ctx, cancel := context.WithCancel(context.Background())
	
	// Create a channel to receive OS signals
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	
	// Create a WaitGroup to track active workers
	var wg sync.WaitGroup
	
	// Start some worker goroutines
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go worker(ctx, i, &wg)
	}
	
	// Handle signals
	go func() {
		sig := <-sigs
		fmt.Printf("\nReceived signal: %s\n", sig)
		fmt.Println("Cancelling context...")
		cancel() // This will propagate cancellation to all workers
	}()
	
	fmt.Println("Application running with workers... Press Ctrl+C to terminate")
	
	// Wait for all workers to finish
	wg.Wait()
	fmt.Println("All workers have completed, shutting down...")
}

func worker(ctx context.Context, id int, wg *sync.WaitGroup) {
	defer wg.Done()
	
	fmt.Printf("Worker %d starting\n", id)
	
	// Simulate work with context awareness
	for {
		select {
		case <-time.After(time.Second):
			fmt.Printf("Worker %d performing task\n", id)
		case <-ctx.Done():
			fmt.Printf("Worker %d received cancellation signal, cleaning up...\n", id)
			// Simulate cleanup work
			time.Sleep(time.Duration(id*500) * time.Millisecond)
			fmt.Printf("Worker %d cleanup complete\n", id)
			return
		}
	}
}

This example demonstrates how to use context cancellation to coordinate shutdown across multiple goroutines:

  1. Create a cancellable context
  2. Pass the context to all workers
  3. When a termination signal is received, call cancel() to notify all workers
  4. Use a WaitGroup to ensure all workers complete their cleanup before the application exits
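
Since Go 1.16, the os/signal package also offers signal.NotifyContext, which collapses the signal channel and the cancel call into a single step: it returns a context that is cancelled when one of the listed signals arrives. A minimal sketch of the same idea using it (the workers from the example above could consume this ctx directly):

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled automatically when SIGINT or SIGTERM arrives;
	// stop unregisters the signal handling and releases resources
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	fmt.Println("Application running... Press Ctrl+C to terminate")

	// Block until a signal cancels the context
	<-ctx.Done()

	fmt.Println("Signal received, starting graceful shutdown...")
	time.Sleep(1 * time.Second) // Simulate cleanup work
	fmt.Println("Application stopped")
}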

Graceful Shutdown Patterns

With the fundamentals established, let’s explore more sophisticated patterns for implementing graceful shutdown in different types of Go applications.

HTTP Server Graceful Shutdown

Since Go 1.8, Go’s standard library has provided built-in support for graceful shutdown of HTTP servers via the Shutdown method. This allows existing connections to complete their requests before the server shuts down:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Create a new server
	server := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Simulate a long-running request
			time.Sleep(5 * time.Second)
			fmt.Fprintf(w, "Hello, World!")
		}),
	}
	
	// Channel to listen for errors coming from the listener
	serverErrors := make(chan error, 1)
	
	// Start the server in a goroutine
	go func() {
		log.Printf("Server listening on %s", server.Addr)
		serverErrors <- server.ListenAndServe()
	}()
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal or an error
	select {
	case err := <-serverErrors:
		log.Fatalf("Error starting server: %v", err)
		
	case sig := <-shutdown:
		log.Printf("Received signal: %v", sig)
		
		// Create a deadline for graceful shutdown
		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
		defer cancel()
		
		// Gracefully shut down the server
		deadline, _ := ctx.Deadline()
		log.Printf("Shutting down server gracefully, deadline: %s", deadline)
		
		if err := server.Shutdown(ctx); err != nil {
			log.Printf("Server shutdown error: %v", err)
			// Force close if graceful shutdown fails
			if err := server.Close(); err != nil {
				log.Printf("Server close error: %v", err)
			}
		}
		
		log.Println("Server shutdown complete")
	}
}

Key aspects of this pattern:

  1. Start the HTTP server in a separate goroutine
  2. Wait for termination signals
  3. When a signal is received, call server.Shutdown() with a timeout context
  4. If graceful shutdown fails within the timeout, force close the server
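
One caveat worth knowing: server.Shutdown stops accepting new connections and waits for in-flight requests to finish, but it does not cancel the contexts of requests that are already running. If handlers can run for a long time, one option is to derive every request's context from an application-level context via the server's BaseContext hook, so that cancelling that context during shutdown becomes visible inside handlers. The sketch below illustrates that idea; the 10-second budget and the order of the cancel calls are illustrative assumptions:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Application-wide context; cancelling it is visible to every request
	// because BaseContext derives each request's context from it.
	appCtx, appCancel := context.WithCancel(context.Background())

	server := &http.Server{
		Addr: ":8080",
		BaseContext: func(_ net.Listener) context.Context {
			return appCtx
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			select {
			case <-time.After(5 * time.Second): // simulated long work
				fmt.Fprintln(w, "done")
			case <-r.Context().Done(): // fires on client disconnect or appCancel()
				http.Error(w, "shutting down", http.StatusServiceUnavailable)
			}
		}),
	}

	go func() {
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("Server error: %v", err)
		}
	}()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs

	// Give in-flight requests a short grace period, then cancel them.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("Shutdown error: %v", err)
	}
	appCancel() // interrupt any handlers still running past the deadline
}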

Multiple Server Coordination

In real-world applications, you might need to coordinate the shutdown of multiple servers or services:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

type Server struct {
	name       string
	httpServer *http.Server
}

func NewServer(name string, addr string, handler http.Handler) *Server {
	return &Server{
		name: name,
		httpServer: &http.Server{
			Addr:    addr,
			Handler: handler,
		},
	}
}

func (s *Server) Start(wg *sync.WaitGroup) {
	defer wg.Done()
	
	log.Printf("%s server starting on %s", s.name, s.httpServer.Addr)
	
	if err := s.httpServer.ListenAndServe(); err != http.ErrServerClosed {
		log.Printf("%s server error: %v", s.name, err)
	}
	
	log.Printf("%s server stopped", s.name)
}

func (s *Server) Shutdown(ctx context.Context) error {
	log.Printf("Shutting down %s server...", s.name)
	return s.httpServer.Shutdown(ctx)
}

func main() {
	// Create API and metrics servers
	apiServer := NewServer("API", ":8080", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // Simulate work
		fmt.Fprintf(w, "API response")
	}))
	
	metricsServer := NewServer("Metrics", ":9090", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Metrics data")
	}))
	
	// WaitGroup for tracking running servers
	var wg sync.WaitGroup
	
	// Start servers
	wg.Add(2)
	go apiServer.Start(&wg)
	go metricsServer.Start(&wg)
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Wait for shutdown signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	
	// Shutdown servers in order (metrics first, then API)
	if err := metricsServer.Shutdown(ctx); err != nil {
		log.Printf("Metrics server shutdown error: %v", err)
	}
	
	if err := apiServer.Shutdown(ctx); err != nil {
		log.Printf("API server shutdown error: %v", err)
	}
	
	// Wait for servers to finish
	log.Println("Waiting for servers to complete shutdown...")
	wg.Wait()
	log.Println("All servers shutdown complete")
}

This pattern demonstrates:

  1. Encapsulating servers in a common interface
  2. Starting each server in its own goroutine
  3. Coordinating shutdown in a specific order
  4. Using a WaitGroup to ensure all servers have fully stopped
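
If you are willing to take a dependency on golang.org/x/sync/errgroup, the same start/stop choreography can be expressed more compactly: each server runs in a group goroutine, a companion goroutine waits for cancellation and calls Shutdown, and Wait surfaces the first error. A rough sketch with placeholder handlers:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	api := &http.Server{Addr: ":8080", Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "API response")
	})}
	metrics := &http.Server{Addr: ":9090", Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Metrics data")
	})}

	g, gctx := errgroup.WithContext(ctx)

	for _, srv := range []*http.Server{api, metrics} {
		srv := srv // capture loop variable
		// One goroutine runs the server...
		g.Go(func() error {
			if err := srv.ListenAndServe(); err != http.ErrServerClosed {
				return err
			}
			return nil
		})
		// ...and one waits for cancellation (signal or sibling error) and shuts it down.
		g.Go(func() error {
			<-gctx.Done()
			shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()
			return srv.Shutdown(shutdownCtx)
		})
	}

	if err := g.Wait(); err != nil {
		log.Printf("Shutdown finished with error: %v", err)
	}
	log.Println("All servers stopped")
}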

Advanced Patterns and Techniques

Resource Cleanup and Management

Proper resource management is critical during shutdown. Let’s explore patterns for cleaning up various types of resources.

Database Connection Cleanup

Ensuring database connections are properly closed prevents connection leaks and allows transactions to complete:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
	
	_ "github.com/go-sql-driver/mysql"
)

type App struct {
	db *sql.DB
}

func NewApp() (*App, error) {
	// Open database connection
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		return nil, fmt.Errorf("failed to open database: %w", err)
	}
	
	// Configure connection pool
	db.SetMaxOpenConns(25)
	db.SetMaxIdleConns(25)
	db.SetConnMaxLifetime(5 * time.Minute)
	
	// Verify connection
	if err := db.Ping(); err != nil {
		db.Close() // Close on error
		return nil, fmt.Errorf("failed to ping database: %w", err)
	}
	
	return &App{db: db}, nil
}

func (a *App) Shutdown(ctx context.Context) error {
	log.Println("Closing database connections...")
	
	// Bound the database close with its own, shorter timeout derived from the parent context
	dbCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	
	// Use a channel to signal completion or timeout
	done := make(chan struct{})
	var err error
	
	go func() {
		// Close the database connection
		err = a.db.Close()
		close(done)
	}()
	
	// Wait for completion or timeout
	select {
	case <-done:
		if err != nil {
			return fmt.Errorf("error closing database: %w", err)
		}
		log.Println("Database connections closed successfully")
		return nil
	case <-dbCtx.Done():
		return fmt.Errorf("database shutdown timed out: %w", dbCtx.Err())
	}
}

func main() {
	// Initialize application
	app, err := NewApp()
	if err != nil {
		log.Fatalf("Failed to initialize app: %v", err)
	}
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	
	// Perform application shutdown
	if err := app.Shutdown(ctx); err != nil {
		log.Printf("Error during shutdown: %v", err)
		os.Exit(1)
	}
	
	log.Println("Application shutdown complete")
}

This pattern demonstrates:

  1. Proper database connection pool configuration
  2. Graceful shutdown with timeout handling
  3. Error handling during shutdown
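
Because sql.DB.Close takes no context, the timeout wrapper above is the usual workaround. It can also help to log the pool's state via db.Stats just before closing, so a hung shutdown can be traced to connections still in use. Below is a small helper that could be called at the top of App.Shutdown in the example above; the name logPoolState is ours:

// logPoolState reports connection-pool usage; calling it at the start of
// (a *App) Shutdown, e.g. logPoolState(a.db), shows whether anything is
// still holding a connection when Close begins.
func logPoolState(db *sql.DB) {
	stats := db.Stats()
	log.Printf("DB pool before close: open=%d in_use=%d idle=%d wait_count=%d",
		stats.OpenConnections, stats.InUse, stats.Idle, stats.WaitCount)
}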

Worker Pool Graceful Shutdown

Worker pools are common in Go applications. Here’s a pattern for gracefully shutting down a worker pool:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// Job represents a unit of work
type Job struct {
	ID int
}

// WorkerPool manages a pool of workers
type WorkerPool struct {
	jobs        chan Job
	results     chan Result
	workerCount int
	shutdown    chan struct{}
	wg          sync.WaitGroup
}

// Result represents the outcome of a job
type Result struct {
	JobID  int
	Output string
	Error  error
}

// NewWorkerPool creates a new worker pool
func NewWorkerPool(workerCount int, queueSize int) *WorkerPool {
	return &WorkerPool{
		jobs:        make(chan Job, queueSize),
		results:     make(chan Result, queueSize),
		workerCount: workerCount,
		shutdown:    make(chan struct{}),
	}
}

// Start launches the worker pool
func (p *WorkerPool) Start() {
	// Start workers
	for i := 1; i <= p.workerCount; i++ {
		p.wg.Add(1)
		go p.worker(i)
	}
	
	log.Printf("Started worker pool with %d workers", p.workerCount)
}

// worker processes jobs
func (p *WorkerPool) worker(id int) {
	defer p.wg.Done()
	
	log.Printf("Worker %d starting", id)
	
	for {
		select {
		case job, ok := <-p.jobs:
			if !ok {
				log.Printf("Worker %d shutting down: job channel closed", id)
				return
			}
			
			// Process job
			log.Printf("Worker %d processing job %d", id, job.ID)
			
			// Simulate work
			time.Sleep(time.Duration(job.ID%3+1) * time.Second)
			
			// Send result
			p.results <- Result{
				JobID:  job.ID,
				Output: fmt.Sprintf("Result for job %d", job.ID),
			}
			
		case <-p.shutdown:
			log.Printf("Worker %d received shutdown signal", id)
			return
		}
	}
}

// Submit adds a job to the pool
func (p *WorkerPool) Submit(job Job) {
	p.jobs <- job
}

// Results returns the results channel
func (p *WorkerPool) Results() <-chan Result {
	return p.results
}

// Shutdown gracefully shuts down the worker pool
func (p *WorkerPool) Shutdown(ctx context.Context) {
	log.Println("Worker pool shutting down...")
	
	// Signal all workers to stop
	close(p.shutdown)
	
	// Close the jobs channel to prevent new jobs
	close(p.jobs)
	
	// Create a channel to signal when workers are done
	done := make(chan struct{})
	
	go func() {
		// Wait for all workers to finish
		p.wg.Wait()
		close(done)
	}()
	
	// Wait for workers to finish or timeout
	select {
	case <-done:
		log.Println("All workers have stopped")
	case <-ctx.Done():
		log.Printf("Worker pool shutdown timed out: %v", ctx.Err())
	}
	
	// Close the results channel
	close(p.results)
}

func main() {
	// Create a worker pool with 5 workers and a queue size of 10
	pool := NewWorkerPool(5, 10)
	pool.Start()
	
	// Start a goroutine to process results
	go func() {
		for result := range pool.Results() {
			log.Printf("Got result: %s (error: %v)", result.Output, result.Error)
		}
		log.Println("Results channel closed")
	}()
	
	// Submit some jobs
	for i := 1; i <= 10; i++ {
		pool.Submit(Job{ID: i})
	}
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	
	// Shutdown the worker pool
	pool.Shutdown(ctx)
	
	log.Println("Application shutdown complete")
}

This pattern demonstrates:

  1. Creating a worker pool with controlled concurrency
  2. Signaling workers to stop processing
  3. Waiting for in-progress work to complete
  4. Handling shutdown timeouts
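
One variant worth noting: the shutdown channel above tells workers to stop even if jobs are still queued. If queued jobs should be finished rather than abandoned, you can rely solely on closing the jobs channel and let each worker drain it with a range loop. Below is a sketch of that drain-style worker and shutdown, reusing the WorkerPool types above (workers would be started with drainWorker instead of worker):

// drainWorker processes every queued job before exiting; it stops only
// when the jobs channel has been closed and emptied.
func (p *WorkerPool) drainWorker(id int) {
	defer p.wg.Done()
	for job := range p.jobs {
		log.Printf("Worker %d processing job %d", id, job.ID)
		p.results <- Result{JobID: job.ID, Output: fmt.Sprintf("Result for job %d", job.ID)}
	}
	log.Printf("Worker %d finished draining", id)
}

// ShutdownDrain stops the pool after all queued jobs have been processed.
func (p *WorkerPool) ShutdownDrain(ctx context.Context) {
	close(p.jobs) // no new jobs; workers exit once the queue is empty

	done := make(chan struct{})
	go func() {
		p.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		log.Println("All queued jobs processed")
		close(p.results) // safe: no worker is left to send
	case <-ctx.Done():
		log.Printf("Drain timed out: %v", ctx.Err())
	}
}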

Implementation Strategies

Coordinating Multiple Services

In microservice architectures, coordinating shutdown across multiple services requires careful orchestration.

Dependency-Aware Shutdown

Services often have dependencies that dictate the order of shutdown. Here’s a pattern for dependency-aware shutdown:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// Service represents a component that can be started and stopped
type Service interface {
	Name() string
	Start() error
	Stop(ctx context.Context) error
	Dependencies() []Service
}

// BaseService provides common functionality for services
type BaseService struct {
	name         string
	dependencies []Service
}

func (s *BaseService) Name() string {
	return s.name
}

func (s *BaseService) Dependencies() []Service {
	return s.dependencies
}

// DatabaseService represents a database connection
type DatabaseService struct {
	BaseService
}

func NewDatabaseService() *DatabaseService {
	return &DatabaseService{
		BaseService: BaseService{
			name:         "database",
			dependencies: []Service{},
		},
	}
}

func (s *DatabaseService) Start() error {
	log.Printf("Starting %s service", s.Name())
	time.Sleep(1 * time.Second) // Simulate startup
	return nil
}

func (s *DatabaseService) Stop(ctx context.Context) error {
	log.Printf("Stopping %s service", s.Name())
	time.Sleep(2 * time.Second) // Simulate cleanup
	return nil
}

// CacheService represents a cache service
type CacheService struct {
	BaseService
}

func NewCacheService() *CacheService {
	return &CacheService{
		BaseService: BaseService{
			name:         "cache",
			dependencies: []Service{},
		},
	}
}

func (s *CacheService) Start() error {
	log.Printf("Starting %s service", s.Name())
	time.Sleep(500 * time.Millisecond) // Simulate startup
	return nil
}

func (s *CacheService) Stop(ctx context.Context) error {
	log.Printf("Stopping %s service", s.Name())
	time.Sleep(1 * time.Second) // Simulate cleanup
	return nil
}

// APIService represents an API server
type APIService struct {
	BaseService
}

func NewAPIService(db *DatabaseService, cache *CacheService) *APIService {
	return &APIService{
		BaseService: BaseService{
			name:         "api",
			dependencies: []Service{db, cache},
		},
	}
}

func (s *APIService) Start() error {
	log.Printf("Starting %s service", s.Name())
	time.Sleep(1 * time.Second) // Simulate startup
	return nil
}

func (s *APIService) Stop(ctx context.Context) error {
	log.Printf("Stopping %s service", s.Name())
	time.Sleep(3 * time.Second) // Simulate cleanup
	return nil
}

// Application coordinates all services
type Application struct {
	services []Service
	mu       sync.Mutex
}

func NewApplication(services ...Service) *Application {
	return &Application{
		services: services,
	}
}

// Start starts all services in dependency order
func (a *Application) Start() error {
	started := make(map[string]bool)
	
	var startService func(Service) error
	startService = func(s Service) error {
		a.mu.Lock()
		if started[s.Name()] {
			a.mu.Unlock()
			return nil
		}
		a.mu.Unlock()
		
		// Start dependencies first
		for _, dep := range s.Dependencies() {
			if err := startService(dep); err != nil {
				return fmt.Errorf("failed to start dependency %s: %w", dep.Name(), err)
			}
		}
		
		// Start the service
		if err := s.Start(); err != nil {
			return fmt.Errorf("failed to start service %s: %w", s.Name(), err)
		}
		
		a.mu.Lock()
		started[s.Name()] = true
		a.mu.Unlock()
		
		return nil
	}
	
	// Start all services
	for _, s := range a.services {
		if err := startService(s); err != nil {
			return err
		}
	}
	
	return nil
}

// Stop stops all services in reverse dependency order
func (a *Application) Stop(ctx context.Context) error {
	// Build a reverse dependency graph
	dependedOnBy := make(map[string][]Service)
	
	for _, s := range a.services {
		for _, dep := range s.Dependencies() {
			dependedOnBy[dep.Name()] = append(dependedOnBy[dep.Name()], s)
		}
	}
	
	// Find services with no dependents (leaf nodes)
	var leaves []Service
	for _, s := range a.services {
		if len(dependedOnBy[s.Name()]) == 0 {
			leaves = append(leaves, s)
		}
	}
	
	// Stop services in reverse dependency order
	stopped := make(map[string]bool)
	
	var wg sync.WaitGroup
	errCh := make(chan error, len(a.services))
	
	var stopService func(Service)
	stopService = func(s Service) {
		defer wg.Done()
		
		a.mu.Lock()
		if stopped[s.Name()] {
			a.mu.Unlock()
			return
		}
		stopped[s.Name()] = true
		a.mu.Unlock()
		
		// Stop the service
		if err := s.Stop(ctx); err != nil {
			errCh <- fmt.Errorf("failed to stop service %s: %w", s.Name(), err)
			return
		}
		
		// Stop dependencies only after all of their dependents have stopped
		for _, dep := range s.Dependencies() {
			// Check, under the lock, whether every service depending on
			// this dependency has already been stopped
			a.mu.Lock()
			canStopDep := true
			for _, depDependent := range dependedOnBy[dep.Name()] {
				if !stopped[depDependent.Name()] {
					canStopDep = false
					break
				}
			}
			a.mu.Unlock()
			
			if canStopDep {
				wg.Add(1)
				go stopService(dep)
			}
		}
	}
	
	// Start stopping leaf services
	for _, s := range leaves {
		wg.Add(1)
		go stopService(s)
	}
	
	// Wait for all services to stop or context to be cancelled
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	
	select {
	case <-done:
		// Check for errors
		close(errCh)
		var errs []error
		for err := range errCh {
			errs = append(errs, err)
		}
		
		if len(errs) > 0 {
			return fmt.Errorf("errors during shutdown: %v", errs)
		}
		return nil
		
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Create services
	db := NewDatabaseService()
	cache := NewCacheService()
	api := NewAPIService(db, cache)
	
	// Create application
	app := NewApplication(api, db, cache)
	
	// Start application
	if err := app.Start(); err != nil {
		log.Fatalf("Failed to start application: %v", err)
	}
	
	log.Println("Application started successfully")
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	
	// Stop application
	if err := app.Stop(ctx); err != nil {
		log.Printf("Error during shutdown: %v", err)
		os.Exit(1)
	}
	
	log.Println("Application shutdown complete")
}

This sophisticated pattern demonstrates:

  1. Modeling service dependencies explicitly
  2. Starting services in dependency order
  3. Stopping services in reverse dependency order
  4. Parallel shutdown where possible
  5. Timeout handling for the entire shutdown process
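
If the dependency graph is simple, for example a straight chain, much of this machinery can be replaced by remembering the start order and stopping in reverse. A minimal sketch against the same Service interface:

// stopInReverse stops services in the reverse of their start order,
// which is sufficient when later services only depend on earlier ones.
func stopInReverse(ctx context.Context, services []Service) error {
	for i := len(services) - 1; i >= 0; i-- {
		if err := services[i].Stop(ctx); err != nil {
			return fmt.Errorf("failed to stop %s: %w", services[i].Name(), err)
		}
	}
	return nil
}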

Health Checks and Readiness Probes

Health checks and readiness probes are essential for coordinating with orchestration systems like Kubernetes.

Implementing Health and Readiness Endpoints

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// HealthStatus represents the health state of a component
type HealthStatus string

const (
	StatusHealthy   HealthStatus = "healthy"
	StatusDegraded  HealthStatus = "degraded"
	StatusUnhealthy HealthStatus = "unhealthy"
	StatusShutdown  HealthStatus = "shutdown"
)

// HealthCheck represents a component that can report its health
type HealthCheck interface {
	Name() string
	Check() HealthStatus
}

// Component represents a service component with health reporting
type Component struct {
	name   string
	status HealthStatus
	mu     sync.RWMutex
}

func NewComponent(name string) *Component {
	return &Component{
		name:   name,
		status: StatusHealthy,
	}
}

func (c *Component) Name() string {
	return c.name
}

func (c *Component) Check() HealthStatus {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.status
}

func (c *Component) SetStatus(status HealthStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.status = status
}

// HealthServer provides health and readiness endpoints
type HealthServer struct {
	components []HealthCheck
	server     *http.Server
	isShutdown bool
	mu         sync.RWMutex
}

func NewHealthServer(addr string) *HealthServer {
	return &HealthServer{
		components: []HealthCheck{},
		server: &http.Server{
			Addr: addr,
		},
	}
}

// AddComponent adds a component to health monitoring
func (hs *HealthServer) AddComponent(component HealthCheck) {
	hs.components = append(hs.components, component)
}

// Start begins serving health and readiness endpoints
func (hs *HealthServer) Start() error {
	mux := http.NewServeMux()
	
	// Health endpoint returns overall system health
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		hs.mu.RLock()
		if hs.isShutdown {
			hs.mu.RUnlock()
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": string(StatusShutdown),
			})
			return
		}
		hs.mu.RUnlock()
		
		overallStatus := StatusHealthy
		componentStatuses := make(map[string]string)
		
		for _, component := range hs.components {
			status := component.Check()
			componentStatuses[component.Name()] = string(status)
			
			if status == StatusUnhealthy {
				overallStatus = StatusUnhealthy
			} else if status == StatusDegraded && overallStatus != StatusUnhealthy {
				overallStatus = StatusDegraded
			}
		}
		
		response := map[string]interface{}{
			"status":     string(overallStatus),
			"components": componentStatuses,
			"timestamp":  time.Now().Format(time.RFC3339),
		}
		
		if overallStatus != StatusHealthy {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		
		json.NewEncoder(w).Encode(response)
	})
	
	// Readiness endpoint indicates if the service is ready to receive traffic
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		hs.mu.RLock()
		isShutdown := hs.isShutdown
		hs.mu.RUnlock()
		
		if isShutdown {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": "not ready - shutting down",
			})
			return
		}
		
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "ready",
		})
	})
	
	hs.server.Handler = mux
	
	go func() {
		log.Printf("Health server listening on %s", hs.server.Addr)
		if err := hs.server.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("Health server error: %v", err)
		}
	}()
	
	return nil
}

// BeginShutdown marks the service as shutting down
func (hs *HealthServer) BeginShutdown() {
	hs.mu.Lock()
	defer hs.mu.Unlock()
	hs.isShutdown = true
	log.Println("Health server marked as shutting down")
}

// Shutdown stops the health server
func (hs *HealthServer) Shutdown(ctx context.Context) error {
	log.Println("Shutting down health server...")
	return hs.server.Shutdown(ctx)
}

func main() {
	// Create components
	dbComponent := NewComponent("database")
	apiComponent := NewComponent("api")
	
	// Create health server
	healthServer := NewHealthServer(":8081")
	healthServer.AddComponent(dbComponent)
	healthServer.AddComponent(apiComponent)
	
	// Create API server
	apiServer := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(2 * time.Second) // Simulate work
			fmt.Fprintf(w, "API response")
		}),
	}
	
	// Start servers
	if err := healthServer.Start(); err != nil {
		log.Fatalf("Failed to start health server: %v", err)
	}
	
	go func() {
		log.Printf("API server listening on %s", apiServer.Addr)
		if err := apiServer.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("API server error: %v", err)
		}
	}()
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Mark as shutting down in health checks
	healthServer.BeginShutdown()
	
	// Simulate degraded status during shutdown
	dbComponent.SetStatus(StatusDegraded)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	
	// Shutdown API server first
	log.Println("Shutting down API server...")
	if err := apiServer.Shutdown(ctx); err != nil {
		log.Printf("API server shutdown error: %v", err)
	}
	
	// Update component status
	apiComponent.SetStatus(StatusShutdown)
	
	// Shutdown health server last
	if err := healthServer.Shutdown(ctx); err != nil {
		log.Printf("Health server shutdown error: %v", err)
	}
	
	log.Println("Application shutdown complete")
}

This pattern demonstrates:

  1. Implementing health and readiness endpoints
  2. Tracking component health status
  3. Updating health status during shutdown
  4. Using health checks to coordinate with orchestration systems
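
In a real service, a component's Check method usually probes the dependency itself rather than returning a stored flag. As one illustration, here is a sketch of a HealthCheck implementation backed by database/sql; it assumes a database/sql import is added to the example above, and the 2-second ping budget is an arbitrary choice:

// DBHealthCheck reports database health by pinging the connection pool.
type DBHealthCheck struct {
	name string
	db   *sql.DB
}

func (c *DBHealthCheck) Name() string { return c.name }

func (c *DBHealthCheck) Check() HealthStatus {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := c.db.PingContext(ctx); err != nil {
		return StatusUnhealthy
	}
	return StatusHealthy
}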

Production Deployment Strategies

Graceful shutdown is particularly important in production environments, especially when dealing with orchestration systems like Kubernetes.

Connection Draining for Zero-Downtime Deployments

In production environments, you often need to ensure that in-flight requests are completed before shutting down:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// ConnectionTracker keeps track of active connections
type ConnectionTracker struct {
	activeConnections int
	draining          bool
	mu                sync.Mutex
	drainComplete     chan struct{}
}

func NewConnectionTracker() *ConnectionTracker {
	return &ConnectionTracker{
		drainComplete: make(chan struct{}),
	}
}

// ConnectionStarted increments the active connection counter
func (ct *ConnectionTracker) ConnectionStarted() {
	ct.mu.Lock()
	defer ct.mu.Unlock()
	ct.activeConnections++
	log.Printf("Connection started. Active connections: %d", ct.activeConnections)
}

// ConnectionFinished decrements the active connection counter
func (ct *ConnectionTracker) ConnectionFinished() {
	ct.mu.Lock()
	defer ct.mu.Unlock()
	ct.activeConnections--
	log.Printf("Connection finished. Active connections: %d", ct.activeConnections)
	
	// If we're draining and this was the last connection, signal completion
	// (the draining flag prevents closing the channel before the drain begins)
	if ct.draining && ct.activeConnections == 0 {
		select {
		case <-ct.drainComplete:
			// Channel already closed
		default:
			close(ct.drainComplete)
		}
	}
}

// WaitForDrain waits for all connections to finish
func (ct *ConnectionTracker) WaitForDrain(ctx context.Context) error {
	ct.mu.Lock()
	ct.draining = true // from now on, reaching zero closes drainComplete
	if ct.activeConnections == 0 {
		ct.mu.Unlock()
		return nil
	}
	ct.mu.Unlock()
	
	select {
	case <-ct.drainComplete:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// ConnectionDrainingHandler wraps an HTTP handler to track connections
func ConnectionDrainingHandler(tracker *ConnectionTracker, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tracker.ConnectionStarted()
		defer tracker.ConnectionFinished()
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Create connection tracker
	tracker := NewConnectionTracker()
	
	// Create server with connection tracking
	server := &http.Server{
		Addr: ":8080",
		Handler: ConnectionDrainingHandler(tracker, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Simulate a long-running request
			duration := time.Duration(2+time.Now().Second()%3) * time.Second
			log.Printf("Handling request, will take %v", duration)
			time.Sleep(duration)
			fmt.Fprintf(w, "Request processed after %v", duration)
		})),
	}
	
	// Start server
	go func() {
		log.Printf("Server listening on %s", server.Addr)
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("Server error: %v", err)
		}
	}()
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Create a deadline for graceful shutdown
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	
	// Step 1: Stop accepting new connections
	log.Println("Shutting down server - no longer accepting new connections")
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("Server shutdown error: %v", err)
	}
	
	// Step 2: Wait for existing connections to drain
	log.Println("Waiting for active connections to complete...")
	if err := tracker.WaitForDrain(ctx); err != nil {
		log.Printf("Connection draining error: %v", err)
	} else {
		log.Println("All connections drained successfully")
	}
	
	log.Println("Server shutdown complete")
}

This pattern demonstrates:

  1. Tracking active connections
  2. Gracefully rejecting new connections while allowing existing ones to complete
  3. Waiting for all in-flight requests to finish before final shutdown

Note that server.Shutdown already blocks until active requests have finished, so the explicit tracker's main value here is visibility: it logs the drain as it happens and gives you a hook for timing out the wait independently of the server.

Kubernetes-Ready Graceful Shutdown

When running in Kubernetes, you need to handle termination signals and coordinate with the container lifecycle:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// ShutdownManager coordinates the shutdown process
type ShutdownManager struct {
	shutdownTimeout time.Duration
	preStopTimeout  time.Duration
	server          *http.Server
	readyToShutdown bool
	mu              sync.RWMutex
}

func NewShutdownManager(server *http.Server, shutdownTimeout, preStopTimeout time.Duration) *ShutdownManager {
	return &ShutdownManager{
		server:          server,
		shutdownTimeout: shutdownTimeout,
		preStopTimeout:  preStopTimeout,
	}
}

// StartPreStop marks the service as no longer ready and waits for the preStop hook duration
func (sm *ShutdownManager) StartPreStop() {
	sm.mu.Lock()
	sm.readyToShutdown = true
	sm.mu.Unlock()
	
	log.Printf("PreStop hook received, waiting %v before starting shutdown", sm.preStopTimeout)
	time.Sleep(sm.preStopTimeout)
}

// IsReady returns whether the service is ready to receive traffic
func (sm *ShutdownManager) IsReady() bool {
	sm.mu.RLock()
	defer sm.mu.RUnlock()
	return !sm.readyToShutdown
}

// Shutdown performs the actual server shutdown
func (sm *ShutdownManager) Shutdown() error {
	log.Println("Starting graceful shutdown...")
	
	ctx, cancel := context.WithTimeout(context.Background(), sm.shutdownTimeout)
	defer cancel()
	
	return sm.server.Shutdown(ctx)
}

func main() {
	// Create server
	server := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(2 * time.Second) // Simulate work
			fmt.Fprintf(w, "Hello, World!")
		}),
	}
	
	// Create shutdown manager
	shutdownManager := NewShutdownManager(
		server,
		30*time.Second, // Shutdown timeout
		5*time.Second,  // PreStop hook duration
	)
	
	// Create health server
	healthServer := &http.Server{
		Addr: ":8081",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Kubernetes readiness probe
			if r.URL.Path == "/ready" {
				if shutdownManager.IsReady() {
					w.WriteHeader(http.StatusOK)
					fmt.Fprintln(w, "Ready")
				} else {
					// Return not ready during shutdown
					w.WriteHeader(http.StatusServiceUnavailable)
					fmt.Fprintln(w, "Not Ready - Shutting Down")
				}
				return
			}
			
			// Kubernetes liveness probe
			if r.URL.Path == "/health" {
				w.WriteHeader(http.StatusOK)
				fmt.Fprintln(w, "Healthy")
				return
			}
			
			w.WriteHeader(http.StatusNotFound)
		}),
	}
	
	// Start servers
	go func() {
		log.Printf("Main server listening on %s", server.Addr)
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("Main server error: %v", err)
		}
	}()
	
	go func() {
		log.Printf("Health server listening on %s", healthServer.Addr)
		if err := healthServer.ListenAndServe(); err != http.ErrServerClosed {
			log.Printf("Health server error: %v", err)
		}
	}()
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, syscall.SIGINT, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	log.Printf("Received signal: %v", sig)
	
	// Start the pre-stop process
	// This simulates the Kubernetes preStop hook
	shutdownManager.StartPreStop()
	
	// Shutdown the main server
	if err := shutdownManager.Shutdown(); err != nil {
		log.Printf("Main server shutdown error: %v", err)
	}
	
	// Shutdown the health server last
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	
	if err := healthServer.Shutdown(ctx); err != nil {
		log.Printf("Health server shutdown error: %v", err)
	}
	
	log.Println("Application shutdown complete")
}

This pattern demonstrates:

  1. Coordinating with Kubernetes lifecycle hooks
  2. Implementing readiness probes that reflect shutdown state
  3. Using a preStop hook delay to allow for load balancer reconfiguration
  4. Proper sequencing of shutdown steps
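
One practical constraint: the preStop delay plus the shutdown timeout must fit inside the pod's terminationGracePeriodSeconds (30 seconds by default), or the kubelet will send SIGKILL mid-shutdown. A hedged way to keep the numbers in sync is to derive both from a single value injected via the environment; the GRACE_PERIOD_SECONDS variable below is our own convention, not a Kubernetes one:

package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// shutdownBudget derives preStop and shutdown timeouts from the pod's grace
// period. GRACE_PERIOD_SECONDS is a made-up variable you would set yourself
// to match terminationGracePeriodSeconds in the pod spec.
func shutdownBudget() (preStop, shutdown time.Duration) {
	grace := 30 * time.Second // Kubernetes default
	if s := os.Getenv("GRACE_PERIOD_SECONDS"); s != "" {
		if n, err := strconv.Atoi(s); err == nil {
			grace = time.Duration(n) * time.Second
		}
	}
	preStop = 5 * time.Second                  // time for endpoints/load balancers to update
	shutdown = grace - preStop - 5*time.Second // leave a 5s margin before SIGKILL
	if shutdown < time.Second {
		shutdown = time.Second
	}
	return preStop, shutdown
}

func main() {
	preStop, shutdown := shutdownBudget()
	log.Printf("preStop=%v shutdown=%v", preStop, shutdown)
}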

Monitoring and Logging During Shutdown

Proper monitoring and logging during shutdown are essential for troubleshooting and ensuring clean termination.

Structured Shutdown Logging

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// LogLevel represents the severity of a log message
type LogLevel string

const (
	LogLevelInfo    LogLevel = "INFO"
	LogLevelWarning LogLevel = "WARNING"
	LogLevelError   LogLevel = "ERROR"
)

// StructuredLogger provides structured logging
type StructuredLogger struct {
	mu sync.Mutex
}

// Log outputs a structured log message
func (l *StructuredLogger) Log(level LogLevel, message string, fields map[string]interface{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	
	if fields == nil {
		fields = make(map[string]interface{})
	}
	
	fields["timestamp"] = time.Now().Format(time.RFC3339)
	fields["level"] = level
	fields["message"] = message
	
	jsonData, err := json.Marshal(fields)
	if err != nil {
		log.Printf("Error marshaling log: %v", err)
		return
	}
	
	fmt.Println(string(jsonData))
}

// ShutdownMonitor tracks the shutdown process
type ShutdownMonitor struct {
	logger           *StructuredLogger
	startTime        time.Time
	shutdownSteps    map[string]ShutdownStepStatus
	mu               sync.Mutex
}

// ShutdownStepStatus represents the status of a shutdown step
type ShutdownStepStatus struct {
	Status    string
	StartTime time.Time
	EndTime   time.Time
	Duration  time.Duration
	Error     error
}

func NewShutdownMonitor(logger *StructuredLogger) *ShutdownMonitor {
	return &ShutdownMonitor{
		logger:        logger,
		shutdownSteps: make(map[string]ShutdownStepStatus),
	}
}

// StartShutdown begins the shutdown process
func (sm *ShutdownMonitor) StartShutdown() {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	
	sm.startTime = time.Now()
	sm.logger.Log(LogLevelInfo, "Starting application shutdown", map[string]interface{}{
		"shutdown_id": sm.startTime.UnixNano(),
	})
}

// BeginStep marks the beginning of a shutdown step
func (sm *ShutdownMonitor) BeginStep(stepName string) {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	
	sm.shutdownSteps[stepName] = ShutdownStepStatus{
		Status:    "in_progress",
		StartTime: time.Now(),
	}
	
	sm.logger.Log(LogLevelInfo, fmt.Sprintf("Beginning shutdown step: %s", stepName), map[string]interface{}{
		"step":        stepName,
		"status":      "in_progress",
		"shutdown_id": sm.startTime.UnixNano(),
	})
}

// EndStep marks the end of a shutdown step
func (sm *ShutdownMonitor) EndStep(stepName string, err error) {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	
	step, exists := sm.shutdownSteps[stepName]
	if !exists {
		sm.logger.Log(LogLevelWarning, fmt.Sprintf("Ending unknown shutdown step: %s", stepName), map[string]interface{}{
			"step":        stepName,
			"shutdown_id": sm.startTime.UnixNano(),
		})
		return
	}
	
	step.EndTime = time.Now()
	step.Duration = step.EndTime.Sub(step.StartTime)
	
	if err != nil {
		step.Status = "failed"
		step.Error = err
		sm.logger.Log(LogLevelError, fmt.Sprintf("Shutdown step failed: %s", stepName), map[string]interface{}{
			"step":        stepName,
			"status":      "failed",
			"duration_ms": step.Duration.Milliseconds(),
			"error":       err.Error(),
			"shutdown_id": sm.startTime.UnixNano(),
		})
	} else {
		step.Status = "completed"
		sm.logger.Log(LogLevelInfo, fmt.Sprintf("Shutdown step completed: %s", stepName), map[string]interface{}{
			"step":        stepName,
			"status":      "completed",
			"duration_ms": step.Duration.Milliseconds(),
			"shutdown_id": sm.startTime.UnixNano(),
		})
	}
	
	sm.shutdownSteps[stepName] = step
}

// CompleteShutdown finalizes the shutdown process
func (sm *ShutdownMonitor) CompleteShutdown() {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	
	duration := time.Since(sm.startTime)
	
	// Count successes and failures
	successes := 0
	failures := 0
	for _, step := range sm.shutdownSteps {
		if step.Status == "completed" {
			successes++
		} else if step.Status == "failed" {
			failures++
		}
	}
	
	sm.logger.Log(LogLevelInfo, "Application shutdown complete", map[string]interface{}{
		"shutdown_id":   sm.startTime.UnixNano(),
		"duration_ms":   duration.Milliseconds(),
		"total_steps":   len(sm.shutdownSteps),
		"success_steps": successes,
		"failed_steps":  failures,
	})
}

func main() {
	// Create structured logger
	logger := &StructuredLogger{}
	
	// Create shutdown monitor
	monitor := NewShutdownMonitor(logger)
	
	// Create server
	server := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(1 * time.Second) // Simulate work
			fmt.Fprintf(w, "Hello, World!")
		}),
	}
	
	// Start server
	go func() {
		logger.Log(LogLevelInfo, "Starting server", map[string]interface{}{
			"address": server.Addr,
		})
		
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			logger.Log(LogLevelError, "Server error", map[string]interface{}{
				"error": err.Error(),
			})
		}
	}()
	
	// Channel to listen for interrupt signals
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
	
	// Block until we receive a signal
	sig := <-shutdown
	logger.Log(LogLevelInfo, "Received termination signal", map[string]interface{}{
		"signal": sig.String(),
	})
	
	// Start the shutdown process
	monitor.StartShutdown()
	
	// Step 1: Stop accepting new connections
	monitor.BeginStep("server_shutdown")
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	
	err := server.Shutdown(ctx)
	monitor.EndStep("server_shutdown", err)
	
	// Step 2: Close database connections (simulated)
	monitor.BeginStep("database_shutdown")
	time.Sleep(2 * time.Second) // Simulate DB shutdown
	monitor.EndStep("database_shutdown", nil)
	
	// Step 3: Flush metrics (simulated)
	monitor.BeginStep("metrics_flush")
	time.Sleep(1 * time.Second) // Simulate metrics flush
	// Simulate an error
	monitor.EndStep("metrics_flush", fmt.Errorf("failed to flush metrics: connection timeout"))
	
	// Complete the shutdown process
	monitor.CompleteShutdown()
}

This pattern demonstrates:

  1. Structured logging during shutdown
  2. Tracking individual shutdown steps
  3. Measuring shutdown duration
  4. Reporting success and failure metrics

The Bottom Line

Implementing robust graceful shutdown patterns is not just a best practice—it’s a critical requirement for production-grade Go applications. By properly handling termination signals, coordinating resource cleanup, and managing connection draining, you can ensure that your services terminate cleanly without disrupting users or compromising data integrity.

The patterns we’ve explored in this guide provide a comprehensive toolkit for implementing graceful shutdown in various contexts:

  1. Signal Handling: Capturing OS signals to trigger controlled shutdown
  2. Context-Based Cancellation: Propagating shutdown signals throughout your application
  3. HTTP Server Shutdown: Allowing in-flight requests to complete before termination
  4. Resource Cleanup: Properly closing database connections and other resources
  5. Worker Pool Management: Gracefully stopping worker pools and background tasks
  6. Service Coordination: Shutting down services in the correct order based on dependencies
  7. Health Checks: Integrating with orchestration systems through health and readiness endpoints
  8. Connection Draining: Ensuring zero-downtime deployments through proper connection handling
  9. Kubernetes Integration: Coordinating with container lifecycle hooks
  10. Monitoring and Logging: Tracking and troubleshooting the shutdown process

When implementing these patterns, remember these key principles:

  • Timeout Everything: Always use timeouts to prevent indefinite blocking during shutdown
  • Order Matters: Shut down services in the reverse order of their dependencies
  • Be Defensive: Handle errors during shutdown gracefully
  • Monitor and Log: Track the shutdown process for troubleshooting
  • Test Thoroughly: Verify shutdown behavior under various conditions

By applying these patterns and principles, you can build Go applications that not only perform well during normal operation but also terminate gracefully when needed, ensuring reliability and data integrity even during deployments, scaling events, or unexpected failures.