Real-World Projects and Implementation

After five parts of building up your operator development skills, it’s time to put everything together into a production-ready project. I’ll walk you through building a complete e-commerce platform operator that demonstrates all the patterns we’ve covered while solving real business problems.

Building a Complete E-commerce Platform Operator

The most valuable operators I’ve built manage entire application stacks rather than individual components. Users want to deploy a “shopping platform” or “blog system,” not worry about coordinating databases, caches, message queues, and multiple services. Let’s build an operator that does exactly that.

First, let’s define what our e-commerce platform needs. From my experience working with development teams, they want to specify high-level requirements like “I need a shopping platform for the staging environment” and have the operator figure out all the details.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: ecommerceplatforms.platform.example.com
spec:
  group: platform.example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              environment:
                type: string
                enum: ["development", "staging", "production"]
              frontend:
                type: object
                properties:
                  replicas:
                    type: integer
                    minimum: 1
                  image:
                    type: string
              backend:
                type: object
                properties:
                  replicas:
                    type: integer
                    minimum: 1
                  image:
                    type: string
              database:
                type: object
                properties:
                  type:
                    type: string
                    enum: ["postgresql", "mysql"]
                  version:
                    type: string
                  storage:
                    type: string
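              cache:
                type: object
                default: {}  # lets the nested default apply when the block is omitted
                properties:
                  enabled:
                    type: boolean
                    default: true
              monitoring:
                type: object
                default: {}  # lets the nested default apply when the block is omitted
                properties:
                  enabled:
                    type: boolean
                    default: true
          # Status is written by the controller, not by users
          status:
            type: object
            x-kubernetes-preserve-unknown-fields: true
            properties:
              phase:
                type: string
    # Required for r.Status().Update() in the controller to work
    subresources:
      status: {}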
  scope: Namespaced
  names:
    plural: ecommerceplatforms
    singular: ecommerceplatform
    kind: EcommercePlatform

The key insight here is that the CRD captures business intent rather than technical implementation details. Users specify what environment they’re targeting and what components they need, and the operator translates that into appropriate Kubernetes resources.
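
The controller code later in this part works with Go types that mirror this schema. Here’s a minimal sketch of what they might look like, using only the standard library so it runs on its own. The type and field names are assumptions derived from the CRD above, and the shape of Endpoints is a guess. In a real project these types would live in the API package and embed the usual Kubernetes object metadata.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// ComponentSpec describes a stateless tier (frontend or backend).
type ComponentSpec struct {
    Replicas int32  `json:"replicas,omitempty"`
    Image    string `json:"image,omitempty"`
}

// DatabaseSpec mirrors spec.database in the CRD.
type DatabaseSpec struct {
    Type    string `json:"type,omitempty"`
    Version string `json:"version,omitempty"`
    Storage string `json:"storage,omitempty"`
}

// ToggleSpec mirrors the cache and monitoring blocks.
type ToggleSpec struct {
    Enabled bool `json:"enabled"`
}

// EcommercePlatformSpec is the user-facing intent.
type EcommercePlatformSpec struct {
    Environment string        `json:"environment,omitempty"`
    Frontend    ComponentSpec `json:"frontend,omitempty"`
    Backend     ComponentSpec `json:"backend,omitempty"`
    Database    DatabaseSpec  `json:"database,omitempty"`
    Cache       ToggleSpec    `json:"cache,omitempty"`
    Monitoring  ToggleSpec    `json:"monitoring,omitempty"`
}

// EcommercePlatformStatus is what the controller reports back.
// Endpoints as a name-to-URL map is an assumption of this sketch.
type EcommercePlatformStatus struct {
    Phase     string            `json:"phase,omitempty"`
    Endpoints map[string]string `json:"endpoints,omitempty"`
}

// parseSpec decodes a JSON-encoded spec into the Go type.
func parseSpec(raw []byte) (EcommercePlatformSpec, error) {
    var spec EcommercePlatformSpec
    err := json.Unmarshal(raw, &spec)
    return spec, err
}

func main() {
    raw := []byte(`{"environment":"staging","backend":{"replicas":2,"image":"backend:dev"},"cache":{"enabled":true}}`)
    spec, err := parseSpec(raw)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s: backend x%d, cache=%t\n", spec.Environment, spec.Backend.Replicas, spec.Cache.Enabled)
}
```

Note that plain JSON decoding does not apply the CRD’s defaults; in a cluster the API server fills those in before the controller ever sees the object.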

The controller needs to handle complex dependencies between components. You can’t start the backend until the database is ready, and you shouldn’t expose the frontend until the backend is healthy. Here’s how I structure the reconciliation logic:

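// Assumed imports: "context", "time", ctrl "sigs.k8s.io/controller-runtime",
// "sigs.k8s.io/controller-runtime/pkg/client", and this project's platformv1 API package.
func (r *EcommercePlatformReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    platform := &platformv1.EcommercePlatform{}
    if err := r.Get(ctx, req.NamespacedName, platform); err != nil {
        // Nothing to do if the resource has already been deleted
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Mark the platform as deploying only on first sight, so periodic
    // reconciles don't flip a healthy platform from "Running" back to "Deploying"
    if platform.Status.Phase == "" {
        platform.Status.Phase = "Deploying"
        if err := r.Status().Update(ctx, platform); err != nil {
            return ctrl.Result{}, err
        }
    }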
    
    // Deploy components in dependency order
    if err := r.ensureDatabase(ctx, platform); err != nil {
        return r.handleError(ctx, platform, "Database deployment failed", err)
    }
    
    if platform.Spec.Cache.Enabled {
        if err := r.ensureCache(ctx, platform); err != nil {
            return r.handleError(ctx, platform, "Cache deployment failed", err)
        }
    }
    
    // Wait for database to be ready before starting backend
    if !r.isDatabaseReady(ctx, platform) {
        return ctrl.Result{RequeueAfter: time.Second * 30}, nil
    }
    
    if err := r.ensureBackend(ctx, platform); err != nil {
        return r.handleError(ctx, platform, "Backend deployment failed", err)
    }
    
    // Wait for backend to be ready before starting frontend
    if !r.isBackendReady(ctx, platform) {
        return ctrl.Result{RequeueAfter: time.Second * 30}, nil
    }
    
    if err := r.ensureFrontend(ctx, platform); err != nil {
        return r.handleError(ctx, platform, "Frontend deployment failed", err)
    }
    
    // Setup monitoring and ingress last
    if platform.Spec.Monitoring.Enabled {
        if err := r.ensureMonitoring(ctx, platform); err != nil {
            return r.handleError(ctx, platform, "Monitoring setup failed", err)
        }
    }
    
    if err := r.ensureIngress(ctx, platform); err != nil {
        return r.handleError(ctx, platform, "Ingress setup failed", err)
    }
    
    // Update final status
    platform.Status.Phase = "Running"
    platform.Status.Endpoints = r.buildEndpointStatus(platform)
    
    return ctrl.Result{RequeueAfter: time.Minute * 10}, r.Status().Update(ctx, platform)
}

This approach ensures that components are deployed in the right order and that the operator waits for dependencies to be ready before proceeding. The handleError function provides consistent error handling and status updates across all components.
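
The handleError helper isn’t shown in this part, so here’s one way it and the ordering logic could look. This is a simplified sketch rather than the real controller: Result, Status, and step are hypothetical stand-ins for ctrl.Result and the platform’s status, and the steps are expressed as a table instead of the hand-written sequence above. It uses only the standard library so you can run it without a cluster.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

// Result and Status are simplified stand-ins for ctrl.Result and the
// platform's status (hypothetical, for illustration only).
type Result struct{ RequeueAfter time.Duration }

type Status struct {
    Phase   string
    Message string
}

// step pairs an ensure function with an optional readiness gate.
type step struct {
    name   string
    ensure func() error
    ready  func() bool // nil means later steps don't wait for this one
}

// errQuotaExceeded simulates a failure for the demo in main.
var errQuotaExceeded = errors.New("storage quota exceeded")

// handleError records the failure in status and returns the wrapped error,
// so the caller's work queue can retry with backoff.
func handleError(st *Status, msg string, err error) (Result, error) {
    st.Phase = "Failed"
    st.Message = fmt.Sprintf("%s: %v", msg, err)
    return Result{}, fmt.Errorf("%s: %w", msg, err)
}

// runSteps applies steps in order and stops at the first error or at the
// first dependency that is not ready yet.
func runSteps(st *Status, steps []step) (Result, error) {
    st.Phase = "Deploying"
    for _, s := range steps {
        if err := s.ensure(); err != nil {
            return handleError(st, s.name+" deployment failed", err)
        }
        if s.ready != nil && !s.ready() {
            st.Message = "waiting for " + s.name
            return Result{RequeueAfter: 30 * time.Second}, nil
        }
    }
    st.Phase = "Running"
    st.Message = ""
    return Result{RequeueAfter: 10 * time.Minute}, nil
}

func main() {
    dbReady := false
    ok := func() error { return nil }
    steps := []step{
        {name: "Database", ensure: ok, ready: func() bool { return dbReady }},
        {name: "Backend", ensure: ok},
    }
    st := &Status{}

    res, err := runSteps(st, steps) // database not ready: requeue soon
    fmt.Println(st.Phase, res.RequeueAfter, err)

    dbReady = true
    res, err = runSteps(st, steps) // everything ready: slow periodic resync
    fmt.Println(st.Phase, res.RequeueAfter, err)

    failing := []step{{name: "Database", ensure: func() error { return errQuotaExceeded }}}
    _, err = runSteps(st, failing)
    fmt.Println(st.Phase, err)
}
```

The table form makes it cheap to add a component, at the cost of hiding conditional steps such as the optional cache; those would need an extra "enabled" check per step.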

Environment-Specific Configuration

One of the most powerful features of this operator is how it adapts behavior based on the target environment. Production deployments need different security settings, resource limits, and monitoring than development environments.

func (r *EcommercePlatformReconciler) ensureDatabase(ctx context.Context, platform *platformv1.EcommercePlatform) error {
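    // Environment-specific defaults; development gets the smallest footprint.
    // This example provisions PostgreSQL only; a MySQL branch would follow the same shape.
    replicas := int32(1)
    storage := "10Gi"
    backupEnabled := false

    switch platform.Spec.Environment {
    case "production":
        replicas = 3
        storage = "100Gi"
        backupEnabled = true
    case "staging":
        replicas = 2
        storage = "50Gi"
        backupEnabled = true
    }

    // An explicit storage request in the spec takes precedence over the default
    if platform.Spec.Database.Storage != "" {
        storage = platform.Spec.Database.Storage
    }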
    
    database := &databasev1.PostgreSQL{
        ObjectMeta: metav1.ObjectMeta{
            Name:      platform.Name + "-db",
            Namespace: platform.Namespace,
        },
        Spec: databasev1.PostgreSQLSpec{
            Version:  platform.Spec.Database.Version,
            Replicas: replicas,
            Storage:  storage,
            Backup: databasev1.BackupSpec{
                Enabled:   backupEnabled,
                Schedule:  "0 2 * * *",
                Retention: "30d",
            },
        },
    }
    
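    // The owner reference lets Kubernetes garbage-collect the database with the platform
    if err := ctrl.SetControllerReference(platform, database, r.Scheme); err != nil {
        return err
    }
    return r.ensureResource(ctx, database)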
}

This pattern of environment-specific defaults means that developers can deploy to different environments without having to understand all the operational differences between them. The operator encodes that knowledge and applies it automatically.
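
As the number of environment-dependent settings grows, I find it easier to keep them as data than as a growing switch statement. Here’s a small standalone sketch of that idea. DatabaseSettings, defaultsByEnv, and resolveDatabaseSettings are hypothetical names, the values are copied from the function above, and letting an explicit storage request win over the default is an assumption of this sketch.

```go
package main

import "fmt"

// DatabaseSettings holds the values the operator derives for a database.
type DatabaseSettings struct {
    Replicas      int32
    Storage       string
    BackupEnabled bool
}

// defaultsByEnv expresses the per-environment defaults as data.
var defaultsByEnv = map[string]DatabaseSettings{
    "development": {Replicas: 1, Storage: "10Gi", BackupEnabled: false},
    "staging":     {Replicas: 2, Storage: "50Gi", BackupEnabled: true},
    "production":  {Replicas: 3, Storage: "100Gi", BackupEnabled: true},
}

// resolveDatabaseSettings returns the defaults for env, falling back to the
// development values for an unknown environment. A non-empty requestedStorage
// overrides the default size.
func resolveDatabaseSettings(env, requestedStorage string) DatabaseSettings {
    settings, ok := defaultsByEnv[env]
    if !ok {
        settings = defaultsByEnv["development"]
    }
    if requestedStorage != "" {
        settings.Storage = requestedStorage
    }
    return settings
}

func main() {
    fmt.Printf("%+v\n", resolveDatabaseSettings("staging", ""))
    fmt.Printf("%+v\n", resolveDatabaseSettings("production", "500Gi"))
}
```

Because the function is pure, it can be unit tested without a cluster, which is where most of the environment-specific bugs I’ve seen would have been caught.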

CI/CD Integration

Production operators need robust CI/CD pipelines that test thoroughly before deployment. Here’s the GitHub Actions workflow I use for operator projects:

name: Operator CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Go
      uses: actions/setup-go@v3
      with:
        # Quoted so YAML doesn't read a version like 1.20 as the number 1.2
        go-version: '1.19'
    
    - name: Run unit tests
      run: make test
    
    - name: Run integration tests
      run: |
        kind create cluster
        make test-integration
    
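    # gosec is one possible scanner; any tool that writes SARIF to this path works
    - name: Run security scan
      uses: securego/gosec@master
      with:
        args: '-no-fail -fmt sarif -out security-scan-results.sarif ./...'

    - name: Upload security scan results
      uses: securecodewarrior/github-action-add-sarif@v1
      with:
        sarif-file: security-scan-results.sarif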

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
    - uses: actions/checkout@v3
    
    - name: Build and push image
      env:
        # Assumed registry path; the job must already be logged in to it
        IMAGE: registry.company.com/ecommerce-operator:${{ github.sha }}
      run: |
        docker build -t "$IMAGE" .
        docker push "$IMAGE"
    
    - name: Deploy to staging
      run: |
        helm upgrade --install ecommerce-operator ./charts/ecommerce-operator \
          --set image.tag=${{ github.sha }} \
          --namespace ecommerce-operator-system \
          --create-namespace
    
    - name: Run smoke tests
      run: |
        kubectl apply -f config/samples/staging-platform.yaml
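        # The controller reports progress through status.phase (jsonpath waits need kubectl 1.23+)
        kubectl wait --for=jsonpath='{.status.phase}'=Running ecommerceplatform/staging-platform --timeout=600s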

The key aspects of this pipeline are comprehensive testing (including integration tests against a real Kubernetes cluster), security scanning, and automated deployment to staging with smoke tests to verify everything works.
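
If you’d rather run the smoke test from Go (inside make test-integration, for example), the wait boils down to polling the platform’s phase until it reaches the expected value or a timeout expires. Here’s a minimal sketch of that loop. waitForPhase is a hypothetical helper, and the getPhase callback stands in for whatever client call reads status.phase in your test suite.

```go
package main

import (
    "fmt"
    "time"
)

// waitForPhase polls getPhase every interval until it returns want.
// It gives up when getPhase fails or when the next poll would fall
// after the timeout.
func waitForPhase(getPhase func() (string, error), want string, timeout, interval time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        phase, err := getPhase()
        if err != nil {
            return fmt.Errorf("reading phase: %w", err)
        }
        if phase == want {
            return nil
        }
        if time.Now().Add(interval).After(deadline) {
            return fmt.Errorf("timed out after %s waiting for phase %q (last seen %q)", timeout, want, phase)
        }
        time.Sleep(interval)
    }
}

func main() {
    // Fake status source that reports Running on the third poll.
    calls := 0
    getPhase := func() (string, error) {
        calls++
        if calls < 3 {
            return "Deploying", nil
        }
        return "Running", nil
    }

    if err := waitForPhase(getPhase, "Running", time.Second, 10*time.Millisecond); err != nil {
        fmt.Println("smoke test failed:", err)
        return
    }
    fmt.Println("platform is Running after", calls, "polls")
}
```

Including the last observed phase in the timeout error saves a round trip to the cluster when a pipeline run fails.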

Production Deployment Example

Here’s how teams actually use the e-commerce platform operator in production:

apiVersion: platform.example.com/v1
kind: EcommercePlatform
metadata:
  name: production-store
  namespace: ecommerce-prod
spec:
  environment: production
  frontend:
    replicas: 5
    image: "registry.company.com/ecommerce-frontend:v2.1.0"
  backend:
    replicas: 3
    image: "registry.company.com/ecommerce-backend:v2.1.0"
  database:
    type: "postgresql"
    version: "14"
    storage: "500Gi"
  cache:
    enabled: true
  monitoring:
    enabled: true

The beauty of this approach is that the same YAML shape works across environments. Change the environment field and the operator applies the appropriate settings automatically; for production that means high availability, backups, monitoring, and security policies.

Deploy it with these commands:

# Deploy the operator
helm install ecommerce-operator ./charts/ecommerce-operator \
  --namespace ecommerce-operator-system \
  --create-namespace \
  --set image.tag=v1.0.0

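# Create the platform (the ecommerce-prod namespace must already exist)
kubectl apply -f production-store.yaml

# Monitor the deployment
kubectl get ecommerceplatform production-store -n ecommerce-prod -w

# Check all components (assumes the operator labels them platform=<name>)
kubectl get all -n ecommerce-prod -l platform=production-store

# View the operator logs
kubectl logs -n ecommerce-operator-system -l app.kubernetes.io/name=ecommerce-operator -f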

Lessons Learned

Building production operators has taught me several important lessons that I wish I’d known when starting out. First, start simple and add complexity gradually. It’s tempting to try to handle every possible scenario from the beginning, but that leads to operators that are hard to understand and maintain.

Second, invest heavily in testing and observability from day one. Operators manage critical infrastructure, and debugging issues in production is much harder than preventing them with good tests and monitoring.

Third, think carefully about your API design. The CRD is the interface that users interact with, and changing it later is difficult. Spend time understanding what users actually need rather than just exposing all the technical knobs.

Finally, remember that operators are long-lived infrastructure components. They need to handle upgrades gracefully, maintain backward compatibility, and provide clear migration paths when breaking changes are necessary.

Summary

This guide has taken you from basic CRD concepts to building production-ready operators that can manage complex application stacks. You’ve learned about reconciliation loops, error handling, security, performance optimization, and operational practices.

The key takeaways are:

  • Operators encode operational knowledge in software
  • Good API design focuses on user intent, not implementation details
  • Comprehensive testing and monitoring are essential for production use
  • Environment-specific behavior makes operators more valuable to development teams
  • CI/CD integration ensures reliable deployments

You now have the knowledge and tools to build operators that solve real problems and provide genuine value to development teams. The patterns and practices in this guide will serve you well as you tackle increasingly complex operational challenges with Kubernetes operators.