Best Practices and Optimization

The difference between a deployment that works and one that works reliably in production comes down to the systematic application of best practices. I’ve learned this through painful experience - deployments that seemed perfect in staging but failed under real load, configurations that worked for months before causing mysterious outages.

The most important insight I’ve gained: production optimization isn’t just about performance - it’s about building systems that remain stable, secure, and maintainable as they scale and evolve.

Resource Management Strategy

Proper resource management prevents the most common production issues I’ve encountered. Under-provisioned applications fail under load, while over-provisioned applications waste money.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-optimized
spec:
  template:
    spec:
      containers:
      - name: web-app
        image: web-app:v2.1.0
        resources:
          requests:
            # Set requests based on baseline usage
            memory: "512Mi"
            cpu: "250m"
          limits:
            # Set limits with headroom for spikes
            memory: "1Gi"
            cpu: "500m"
        env:
        - name: NODE_OPTIONS
          value: "--max-old-space-size=768"  # 75% of memory limit
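The 75% heuristic in the comment above can be derived directly from the container's memory limit; a minimal sketch (the function name is illustrative):

```python
def heap_size_mb(memory_limit: str, fraction: float = 0.75) -> int:
    """Convert a Kubernetes memory limit to a Node.js heap size in MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if memory_limit.endswith(suffix):
            return int(float(memory_limit[: -len(suffix)]) * factor * fraction)
    return int(int(memory_limit) / (1024 * 1024) * fraction)  # plain bytes

print(heap_size_mb("1Gi"))  # 768, matching --max-old-space-size above
```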

I use Vertical Pod Autoscaler to right-size resources automatically:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-optimized
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

Performance Optimization

I optimize performance at multiple levels, from container configuration to application architecture:

# Multi-stage build for an optimal runtime image
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install all dependencies (dev included) so the build step can run
RUN npm ci
COPY . .
RUN npm run build
# Strip dev dependencies before the runtime stage copies node_modules
RUN npm prune --omit=dev && npm cache clean --force

FROM node:18-alpine AS runtime
RUN apk add --no-cache dumb-init
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nextjs -u 1001 -G nodejs

WORKDIR /app
USER nextjs

COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist

ENV NODE_ENV=production
ENV NODE_OPTIONS="--max-old-space-size=768 --optimize-for-size"

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]

Security Best Practices

Security must be built into every layer of production deployments:

apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: app:v1.0.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: var-cache
      mountPath: /var/cache
  volumes:
  - name: tmp
    emptyDir: {}
  - name: var-cache
    emptyDir: {}

Network security policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: production-security-policy
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to: []
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 443
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: database
    ports:
    - protocol: TCP
      port: 5432

Monitoring Excellence

Comprehensive observability enables proactive issue resolution:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
spec:
  groups:
  - name: production.rules
    rules:
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service)
        ) > 0.01
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "High error rate for {{ $labels.service }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{container!=""} /
          container_spec_memory_limit_bytes{container!=""} * 100
        ) > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage for {{ $labels.container }}"
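The HighErrorRate expression is just a per-service ratio check; the same logic as a small Python sketch (the sample rates are illustrative):

```python
def high_error_services(rates, threshold=0.01):
    """Return services whose 5xx share of traffic exceeds the threshold,
    mirroring the HighErrorRate PromQL expression."""
    return [
        service
        for service, (errors_per_s, total_per_s) in rates.items()
        if total_per_s > 0 and errors_per_s / total_per_s > threshold
    ]

print(high_error_services({
    "checkout": (0.5, 20.0),   # 2.5% errors -> fires
    "catalog": (0.01, 50.0),   # 0.02% errors -> quiet
}))  # ['checkout']
```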

Deployment Pipeline Optimization

I optimize CI/CD pipelines for speed, reliability, and security:

name: Optimized Production Deploy
on:
  push:
    tags: ['v*']

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3
    
    - name: Extract image metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ghcr.io/${{ github.repository }}
    
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        cache-from: type=gha
        cache-to: type=gha,mode=max
  
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
    - name: Run Trivy scanner
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ needs.build.outputs.image-tag }}
        format: 'sarif'
        output: 'trivy-results.sarif'
  
  deploy:
    needs: [build, security-scan]
    runs-on: ubuntu-latest
    steps:
    - name: Deploy to production
      run: |
        kubectl set image deployment/web-app \
          web-app=${{ needs.build.outputs.image-tag }}
        kubectl rollout status deployment/web-app --timeout=600s
    
    - name: Verify deployment
      run: |
        kubectl wait --for=condition=ready pod -l app=web-app --timeout=300s
        kubectl run smoke-test --rm -i --restart=Never \
          --image=curlimages/curl -- \
          curl -f http://web-app-service/health
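The smoke test above can also run as a small retrying health check, which tolerates pods that are ready but briefly slow to answer (the URL and timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_s: float = 60,
                    interval_s: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet; retry after a short pause
        time.sleep(interval_s)
    return False

# wait_for_health("http://web-app-service/health", timeout_s=300)
```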

Cost Optimization

I implement cost optimization without compromising reliability:

#!/usr/bin/env python3
import kubernetes


def parse_cpu(quantity):
    """Convert a CPU quantity ('250m', '1', '250000000n') to millicores."""
    units = {"n": 1e-6, "u": 1e-3, "m": 1.0}
    if quantity and quantity[-1] in units:
        return float(quantity[:-1]) * units[quantity[-1]]
    return float(quantity or 0) * 1000


class CostOptimizer:
    def __init__(self):
        kubernetes.config.load_incluster_config()
        self.v1 = kubernetes.client.CoreV1Api()
        # The metrics server (metrics.k8s.io) is reached via the custom objects API
        self.metrics = kubernetes.client.CustomObjectsApi()

    def get_container_metrics(self, namespace, pod_name, container_name):
        """Fetch live usage for one container from the metrics API."""
        pod_metrics = self.metrics.get_namespaced_custom_object(
            "metrics.k8s.io", "v1beta1", namespace, "pods", pod_name
        )
        for container in pod_metrics["containers"]:
            if container["name"] == container_name:
                return container["usage"]
        return {}

    def analyze_resource_usage(self, namespace="production"):
        """Flag containers whose CPU request far exceeds actual usage."""
        pods = self.v1.list_namespaced_pod(namespace)
        recommendations = []

        for pod in pods.items:
            if pod.status.phase != "Running":
                continue

            for container in pod.spec.containers:
                requests = container.resources.requests or {}
                requested_cpu = parse_cpu(requests.get("cpu", "0"))
                if requested_cpu == 0:
                    continue

                actual = self.get_container_metrics(
                    namespace, pod.metadata.name, container.name
                )
                actual_cpu = parse_cpu(actual.get("cpu", "0"))

                # Flag containers running below 50% of their CPU request
                if actual_cpu / requested_cpu * 100 < 50:
                    recommendations.append({
                        "pod": pod.metadata.name,
                        "container": container.name,
                        "type": "cpu_reduction",
                        "current": requests.get("cpu", "0"),
                        # Recommend actual usage plus 20% headroom
                        "recommended": f"{int(actual_cpu * 1.2)}m",
                    })

        return recommendations

Spot instance integration for non-critical workloads:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      tolerations:
      - key: spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        node-type: spot
      containers:
      - name: processor
        image: batch-processor:v1.0.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"

Operational Excellence

I implement practices that make systems reliable and maintainable:

Health Check Standards:

  • Liveness probes detect when containers need restarting
  • Readiness probes ensure traffic only goes to healthy pods
  • Startup probes handle slow-starting applications
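All three probes can be combined on one container; a sketch with illustrative endpoint paths and timings:

```yaml
containers:
- name: app
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 30    # allows up to 5 minutes to start
```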

Graceful Shutdown:

spec:
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]
  terminationGracePeriodSeconds: 30
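On the application side, the process must also handle SIGTERM so in-flight work drains within the grace period; a minimal Python sketch (the worker loop is hypothetical):

```python
import signal

class GracefulShutdown:
    """Set a flag on SIGTERM so the worker loop can drain and exit cleanly."""

    def __init__(self):
        self.stop = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.stop = True

# Sketch of the worker loop:
# shutdown = GracefulShutdown()
# while not shutdown.stop:
#     process_next_item()   # hypothetical unit of work
```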

Resource Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
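One way to sanity-check a quota is to compute how many typical pods it admits; a quick sketch using the 250m/512Mi requests from earlier:

```python
def pods_fitting_quota(quota_cpu_m, quota_mem_mi, pod_cpu_m, pod_mem_mi):
    """How many identical pods fit within a namespace ResourceQuota."""
    return min(quota_cpu_m // pod_cpu_m, quota_mem_mi // pod_mem_mi)

# 100 CPUs = 100000m, 200Gi = 204800Mi; each pod requests 250m / 512Mi
print(pods_fitting_quota(100_000, 204_800, 250, 512))  # 400
```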

These best practices and optimizations have evolved from managing production systems at scale. They address the real challenges that emerge when moving from proof-of-concept deployments to systems that serve real users reliably and cost-effectively.

The key insight: optimization is an ongoing process, not a one-time activity. The best production systems continuously monitor, measure, and improve their performance, security, and cost efficiency.

Next, we’ll explore real-world projects and implementation strategies that demonstrate how to apply all these concepts together in complete production deployment scenarios.