Best Practices and Optimization
Moving from development to production with operators requires careful attention to deployment strategies, monitoring, and operational practices. I’ve learned these lessons through managing operators in production environments where downtime isn’t acceptable and reliability is paramount.
Production Deployment Strategies
The biggest mistake I see teams make is deploying operators the same way they deploy applications. Operators are infrastructure components that manage other workloads, so they need different deployment patterns and safety measures.
First, let’s talk about packaging. Helm charts provide the flexibility needed for production deployments while maintaining consistency across environments. Here’s how I structure operator Helm charts:
# values.yaml
replicaCount: 2

image:
  repository: webapp-operator
  tag: "1.0.0"
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

# Anti-affinity to spread replicas across nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - webapp-operator
        topologyKey: kubernetes.io/hostname

leaderElection:
  enabled: true
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s

webhook:
  enabled: true
  port: 9443
  certManager:
    enabled: true
The anti-affinity rules ensure that operator replicas run on different nodes, preventing a single node failure from taking down your entire operator. Leader election ensures only one instance is active at a time, while the others stand ready to take over.
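Those leaderElection values ultimately have to land in the manager's options. A minimal sketch of the wiring, with the durations hard-coded here for brevity (in practice they'd come in as flags populated from the chart; the election ID is an arbitrary example, it just needs to be unique per operator):

// Durations mirror the leaderElection block in values.yaml.
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    LeaderElection:   true,
    LeaderElectionID: "webapp-operator-leader", // arbitrary, but must be unique per operator
    LeaseDuration:    &leaseDuration,
    RenewDeadline:    &renewDeadline,
    RetryPeriod:      &retryPeriod,
})
if err != nil {
    panic(err) // replace with real startup error handling
}
// Start the manager as usual with mgr.Start(ctrl.SetupSignalHandler()).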
Health checks are critical for production operators. They need to verify not just that the process is running, but that it can actually perform its core functions:
func setupHealthChecks(mgr manager.Manager) error {
    // Liveness: the process is up and responding
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        return err
    }

    // Custom readiness check that verifies API connectivity
    if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
        // Verify the informer cache has synced with the Kubernetes API
        if !mgr.GetCache().WaitForCacheSync(req.Context()) {
            return fmt.Errorf("cache not synced")
        }

        // Test a simple API call
        var webapps examplev1.WebAppList
        if err := mgr.GetClient().List(req.Context(), &webapps, client.Limit(1)); err != nil {
            return fmt.Errorf("failed to list WebApps: %w", err)
        }
        return nil
    }); err != nil {
        return err
    }
    return nil
}
This readiness check ensures that Kubernetes won’t route traffic to operator instances that can’t actually process requests, which is crucial during rolling updates or when recovering from failures.
Comprehensive Error Handling
Production operators must handle failures gracefully and provide clear feedback about what went wrong. I use a layered approach to error handling that distinguishes between different types of failures and responds appropriately.
The key insight is that not all errors should be treated the same way. Some errors indicate temporary problems that will resolve themselves, while others require immediate attention or user intervention:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("webapp", req.NamespacedName)

    webapp := &examplev1.WebApp{}
    if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
        if errors.IsNotFound(err) {
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    // Handle deletion with proper cleanup
    if !webapp.ObjectMeta.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, webapp)
    }

    // Reconcile with comprehensive error handling
    result, err := r.reconcileWebApp(ctx, webapp)
    if err != nil {
        // Update status with error information
        r.updateStatusWithError(ctx, webapp, err)

        // Determine retry strategy based on error type
        if isRetryableError(err) {
            log.Info("Retryable error occurred, requeuing", "error", err)
            return ctrl.Result{RequeueAfter: calculateBackoff(webapp)}, nil
        }

        // For permanent errors, don't retry but record the failure
        log.Error(err, "Permanent error occurred")
        r.recordEvent(webapp, "Warning", "ReconcileFailed", err.Error())
        return ctrl.Result{}, nil // Don't return error to avoid infinite retries
    }

    return result, nil
}
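The isRetryableError helper deserves a definition. A minimal sketch, assuming the transient cases worth requeuing are the usual API-server conditions, and reusing the same errors package alias (k8s.io/apimachinery/pkg/api/errors) as above:

// isRetryableError treats transient API-server conditions as retryable;
// everything else is considered permanent until the spec changes.
func isRetryableError(err error) bool {
    return errors.IsConflict(err) ||
        errors.IsServerTimeout(err) ||
        errors.IsTimeout(err) ||
        errors.IsTooManyRequests(err) ||
        errors.IsServiceUnavailable(err)
}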
The calculateBackoff function implements exponential backoff with jitter to prevent thundering herd problems when many resources fail simultaneously:
func calculateBackoff(webapp *examplev1.WebApp) time.Duration {
    // Get failure count from status or annotations
    failureCount := getFailureCount(webapp)

    // Cap the exponent so the bit shift can't overflow on a large
    // failure count; the ceiling below clamps the result anyway
    if failureCount > 9 {
        failureCount = 9
    }

    // Exponential backoff: 1s, 2s, 4s, 8s, ... capped at 5 minutes
    backoff := time.Duration(1<<failureCount) * time.Second
    if backoff > 5*time.Minute {
        backoff = 5 * time.Minute
    }

    // Add up to 1s of jitter to prevent thundering herd
    jitter := time.Duration(rand.Intn(1000)) * time.Millisecond
    return backoff + jitter
}
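The backoff calculation leans on a getFailureCount helper that isn't shown here. A minimal sketch, assuming the controller persists a consecutive-failure counter in an annotation (the example.com/failure-count key is a hypothetical choice, not a convention):

// getFailureCount reads the consecutive-failure counter the controller
// stores on the object; missing or malformed values count as zero.
// Requires the strconv package.
func getFailureCount(webapp *examplev1.WebApp) int {
    raw, ok := webapp.Annotations["example.com/failure-count"]
    if !ok {
        return 0
    }
    n, err := strconv.Atoi(raw)
    if err != nil || n < 0 {
        return 0
    }
    return n
}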
This approach ensures that temporary failures don’t overwhelm your cluster with retry attempts while still providing timely recovery when conditions improve.
Monitoring and Observability
Effective monitoring is what separates operators that work in demos from those that work in production. You need visibility into both the operator’s own health and the health of the resources it manages. I’ve learned that the key is building monitoring into the operator from the beginning, not adding it as an afterthought.
Start with structured logging that provides context about what the operator is doing and why. Here’s how I implement contextual logging that makes troubleshooting much easier:
type ContextualLogger struct {
    logr.Logger
}

// NewContextualLogger builds a named logger from controller-runtime's
// root logger (ctrl.Log)
func NewContextualLogger(name string) *ContextualLogger {
    return &ContextualLogger{Logger: ctrl.Log.WithName(name)}
}

func (l *ContextualLogger) WithWebApp(webapp *examplev1.WebApp) logr.Logger {
    return l.WithValues(
        "webapp.name", webapp.Name,
        "webapp.namespace", webapp.Namespace,
        "webapp.generation", webapp.Generation,
        "webapp.resourceVersion", webapp.ResourceVersion,
    )
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := NewContextualLogger("webapp-controller")
    start := time.Now()

    webapp := &examplev1.WebApp{}
    if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    logger.WithWebApp(webapp).Info("Starting reconciliation",
        "phase", webapp.Status.Phase,
        "replicas.desired", webapp.Spec.Replicas,
        "replicas.ready", webapp.Status.ReadyReplicas,
    )

    defer func() {
        duration := time.Since(start)
        logger.WithWebApp(webapp).Info("Reconciliation completed", "duration", duration)
    }()

    // Reconciliation logic...
    return ctrl.Result{}, nil
}
This logging approach includes all the context needed to understand what happened during reconciliation, making it much easier to debug issues in production.
For metrics, focus on both technical performance and business outcomes. Here are the key metrics I include in every operator:
var (
    webappReconcileTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "webapp_reconcile_total",
            Help: "Total number of WebApp reconciliations",
        },
        []string{"namespace", "name", "result"},
    )

    webappReconcileDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "webapp_reconcile_duration_seconds",
            Help:    "Duration of WebApp reconciliations",
            Buckets: []float64{0.1, 0.5, 1.0, 2.5, 5.0, 10.0},
        },
        []string{"namespace", "name"},
    )

    webappCreationTime = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "webapp_creation_duration_seconds",
            Help:    "Time from WebApp creation to ready state",
            Buckets: []float64{1, 5, 10, 30, 60, 120, 300},
        },
        []string{"namespace"},
    )
)
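Defining the collectors isn't enough on its own; they have to be registered before they show up on the manager's /metrics endpoint. A minimal sketch using controller-runtime's global registry from sigs.k8s.io/controller-runtime/pkg/metrics:

func init() {
    // Anything registered with metrics.Registry is exposed on the
    // manager's /metrics endpoint alongside the built-in metrics.
    metrics.Registry.MustRegister(
        webappReconcileTotal,
        webappReconcileDuration,
        webappCreationTime,
    )
}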
The creation time metric is particularly valuable because it measures the end-to-end user experience: how long it takes from when someone applies a WebApp resource to when it's actually serving traffic.
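Recording it takes a single observation at the right moment. A sketch of one way to wire that in, assuming the status exposes a Phase field and the caller passes in the phase from before this reconciliation updated it (the helper and its transition check are illustrative, not part of the metrics above):

// recordTimeToReady observes the creation-to-ready latency exactly once,
// on the transition into the Ready phase.
func (r *WebAppReconciler) recordTimeToReady(webapp *examplev1.WebApp, previousPhase string) {
    if webapp.Status.Phase == "Ready" && previousPhase != "Ready" {
        webappCreationTime.WithLabelValues(webapp.Namespace).
            Observe(time.Since(webapp.CreationTimestamp.Time).Seconds())
    }
}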
Set up alerting rules that catch problems before they impact users:
groups:
- name: webapp-operator
  rules:
  - alert: WebAppOperatorDown
    expr: up{job="webapp-operator"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "WebApp Operator is down"
      description: "WebApp Operator has been down for more than 5 minutes"
  - alert: WebAppReconcileErrors
    expr: rate(webapp_reconcile_total{result="error"}[5m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High WebApp reconcile error rate"
      description: "WebApp reconcile error rate is {{ $value }} errors/sec"
These alerts focus on the operator’s ability to perform its core function rather than just whether the process is running.
Performance Optimization
As your operator manages more resources, performance becomes critical. The most important optimization is reducing unnecessary API calls and reconciliation loops. Here’s how I structure efficient reconciliation:
func (r *WebAppReconciler) reconcileEfficiently(ctx context.Context, webapp *examplev1.WebApp) error {
    // Batch fetch all related resources concurrently
    resources, err := r.getAllRelatedResources(ctx, webapp)
    if err != nil {
        return err
    }

    // Calculate desired state once
    desiredDeployment := r.buildDeployment(webapp)
    desiredService := r.buildService(webapp)

    // Only update resources that have actually changed
    updates := []client.Object{}
    if !r.deploymentMatches(resources.Deployment, desiredDeployment) {
        updates = append(updates, desiredDeployment)
    }
    if !r.serviceMatches(resources.Service, desiredService) {
        updates = append(updates, desiredService)
    }

    // Batch apply updates
    return r.batchUpdate(ctx, updates)
}
The key insight is to fetch all related resources concurrently, compare them with the desired state, and only make changes when necessary. This dramatically reduces API server load compared to naive approaches that recreate resources on every reconciliation.
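The concurrent fetch is left abstract above. One way to sketch getAllRelatedResources is with golang.org/x/sync/errgroup, fetching each object in parallel and treating NotFound as "doesn't exist yet" (the relatedResources struct is my assumption about the helper's shape):

type relatedResources struct {
    Deployment *appsv1.Deployment
    Service    *corev1.Service
}

func (r *WebAppReconciler) getAllRelatedResources(ctx context.Context, webapp *examplev1.WebApp) (*relatedResources, error) {
    res := &relatedResources{
        Deployment: &appsv1.Deployment{},
        Service:    &corev1.Service{},
    }
    key := types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}

    // Fetch both objects in parallel; IgnoreNotFound leaves the zero-valued
    // object in place, which the matches helpers treat as "needs creating".
    g, gctx := errgroup.WithContext(ctx)
    g.Go(func() error {
        return client.IgnoreNotFound(r.Get(gctx, key, res.Deployment))
    })
    g.Go(func() error {
        return client.IgnoreNotFound(r.Get(gctx, key, res.Service))
    })
    return res, g.Wait()
}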
Testing Strategies
Testing operators requires a different approach than testing typical applications. You need to verify that your operator correctly manages Kubernetes resources across various scenarios, including failure conditions.
I use a layered testing approach that starts with unit tests for individual functions and builds up to integration tests that run against real Kubernetes clusters:
func TestWebAppController(t *testing.T) {
    scheme := runtime.NewScheme()
    _ = examplev1.AddToScheme(scheme)
    _ = appsv1.AddToScheme(scheme)

    webapp := &examplev1.WebApp{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-webapp",
            Namespace: "default",
        },
        Spec: examplev1.WebAppSpec{
            Replicas: 3,
            Image:    "nginx:1.21",
            Port:     80,
        },
    }

    // Name the fake client explicitly so it doesn't shadow the client package
    fakeClient := fake.NewClientBuilder().WithScheme(scheme).WithObjects(webapp).Build()
    reconciler := &WebAppReconciler{Client: fakeClient, Scheme: scheme}

    _, err := reconciler.Reconcile(context.TODO(), reconcile.Request{
        NamespacedName: types.NamespacedName{Name: "test-webapp", Namespace: "default"},
    })
    assert.NoError(t, err)

    // Verify deployment was created with correct configuration
    deployment := &appsv1.Deployment{}
    err = fakeClient.Get(context.TODO(), types.NamespacedName{Name: "test-webapp", Namespace: "default"}, deployment)
    assert.NoError(t, err)
    assert.Equal(t, int32(3), *deployment.Spec.Replicas)
}
For integration testing, I create test scenarios that verify the operator works correctly in real environments:
#!/bin/bash
# Abort on the first failing command so the success message below is honest
set -euo pipefail

echo "Running operator integration tests..."

# Deploy the operator
kubectl apply -f config/crd/bases/
kubectl apply -f config/rbac/
kubectl apply -f config/manager/

# Wait for operator to be ready
kubectl wait --for=condition=Available deployment/webapp-operator-controller-manager \
  -n webapp-operator-system --timeout=300s

# Test basic functionality
kubectl apply -f - <<EOF
apiVersion: example.com/v1
kind: WebApp
metadata:
  name: test-webapp
spec:
  replicas: 2
  image: nginx:alpine
  port: 80
EOF

# Verify resources are created
kubectl wait --for=condition=Available deployment/test-webapp --timeout=300s
kubectl get service test-webapp

# Test scaling
kubectl patch webapp test-webapp --type='merge' -p='{"spec":{"replicas":3}}'
kubectl wait --for=jsonpath='{.spec.replicas}'=3 deployment/test-webapp --timeout=300s

# Test deletion
kubectl delete webapp test-webapp
kubectl wait --for=delete deployment/test-webapp --timeout=300s

echo "All integration tests passed!"
These tests give you confidence that your operator works correctly in real environments and help catch issues that unit tests might miss.
In Part 6, we’ll bring everything together by building a complete, production-ready operator that demonstrates all the patterns and practices we’ve covered. You’ll see how to structure a complex operator project, integrate it with CI/CD pipelines, and deploy it safely to production environments.