Best Practices and Optimization
The difference between a deployment that works and one that works reliably in production comes down to the systematic application of best practices. I've learned this through painful experience: deployments that seemed perfect in staging but failed under real load, configurations that worked for months before causing mysterious outages.
The most important insight I've gained: production optimization isn't just about performance. It's about building systems that remain stable, secure, and maintainable as they scale and evolve.
Resource Management Strategy
Proper resource management prevents the most common production issues I’ve encountered. Under-provisioned applications fail under load, while over-provisioned applications waste money.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-optimized
spec:
  selector:
    matchLabels:
      app: web-app-optimized
  template:
    metadata:
      labels:
        app: web-app-optimized
    spec:
      containers:
        - name: web-app
          image: web-app:v2.1.0
          resources:
            requests:
              # Set requests based on baseline usage
              memory: "512Mi"
              cpu: "250m"
            limits:
              # Set limits with headroom for spikes
              memory: "1Gi"
              cpu: "500m"
          env:
            - name: NODE_OPTIONS
              value: "--max-old-space-size=768"  # 75% of memory limit
```
I use Vertical Pod Autoscaler to right-size resources automatically:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-optimized
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: web-app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
```
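One caveat: in "Auto" mode the VPA applies new requests by evicting and recreating pods, so expect brief churn whenever recommendations change.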
Performance Optimization
I optimize performance at multiple levels, from container configuration to application architecture:
```dockerfile
# Multi-stage build for optimal runtime image
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Dev dependencies are needed for the build step, so install everything here
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies so only production modules reach the runtime image
RUN npm prune --omit=dev && npm cache clean --force

FROM node:18-alpine AS runtime
RUN apk add --no-cache dumb-init
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nextjs -u 1001 -G nodejs
WORKDIR /app
USER nextjs
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
ENV NODE_ENV=production
# Note: V8 flags like --optimize-for-size are not permitted in NODE_OPTIONS,
# so only the heap cap is set here
ENV NODE_OPTIONS="--max-old-space-size=768"
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]
```
Security Best Practices
Security must be built into every layer of production deployments:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: app:v1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-cache
          mountPath: /var/cache
  volumes:
    - name: tmp
      emptyDir: {}
    - name: var-cache
      emptyDir: {}
```
Network security policies:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: production-security-policy
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # No 'to' selector: allow DNS and HTTPS to any destination
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
```
Monitoring Excellence
Comprehensive observability enables proactive issue resolution:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
spec:
  groups:
    - name: production.rules
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
              sum(rate(http_requests_total[5m])) by (service)
            ) > 0.01
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
        - alert: HighMemoryUsage
          expr: |
            (
              container_memory_working_set_bytes{container!=""} /
              container_spec_memory_limit_bytes{container!=""} * 100
            ) > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage for {{ $labels.container }}"
```
Deployment Pipeline Optimization
I optimize CI/CD pipelines for speed, reliability, and security:
```yaml
name: Optimized Production Deploy
on:
  push:
    tags: ['v*']
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Extract image metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: web-app   # repository name assumed; use your registry path
      - name: Build and push
        id: push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Run Trivy scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ needs.build.outputs.image-tag }}
          format: 'sarif'
          output: 'trivy-results.sarif'
  deploy:
    needs: [build, security-scan]
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: |
          # deploy by digest so the rollout is pinned to the scanned image
          kubectl set image deployment/web-app \
            web-app=web-app@${{ needs.build.outputs.image-digest }}
          kubectl rollout status deployment/web-app --timeout=600s
      - name: Verify deployment
        run: |
          kubectl wait --for=condition=ready pod -l app=web-app --timeout=300s
          kubectl run smoke-test --rm -i --restart=Never \
            --image=curlimages/curl -- \
            curl -f http://web-app-service/health
```
Cost Optimization
I implement cost optimization without compromising reliability:
```python
#!/usr/bin/env python3
import kubernetes


class CostOptimizer:
    def __init__(self):
        kubernetes.config.load_incluster_config()
        self.v1 = kubernetes.client.CoreV1Api()
        # Pod usage comes from the metrics.k8s.io API (requires metrics-server)
        self.metrics = kubernetes.client.CustomObjectsApi()

    @staticmethod
    def parse_cpu(quantity):
        """Convert a CPU quantity string to millicores."""
        if quantity.endswith("n"):      # nanocores (metrics API format)
            return int(quantity[:-1]) / 1_000_000
        if quantity.endswith("u"):      # microcores
            return int(quantity[:-1]) / 1_000
        if quantity.endswith("m"):      # millicores
            return int(quantity[:-1])
        return float(quantity) * 1000   # whole cores

    def get_container_metrics(self, namespace, pod_name):
        """Return current usage per container, keyed by container name."""
        pod_metrics = self.metrics.get_namespaced_custom_object(
            "metrics.k8s.io", "v1beta1", namespace, "pods", pod_name
        )
        return {c["name"]: c["usage"] for c in pod_metrics["containers"]}

    def analyze_resource_usage(self, namespace="production"):
        """Analyze actual vs requested CPU and flag over-provisioning."""
        pods = self.v1.list_namespaced_pod(namespace)
        recommendations = []
        for pod in pods.items:
            if pod.status.phase != "Running":
                continue
            usage = self.get_container_metrics(namespace, pod.metadata.name)
            for container in pod.spec.containers:
                requests = container.resources.requests or {}
                requested = self.parse_cpu(requests.get("cpu", "0"))
                if requested == 0:
                    continue  # no CPU request to compare against
                actual = self.parse_cpu(
                    usage.get(container.name, {}).get("cpu", "0")
                )
                utilization = actual / requested * 100
                if utilization < 50:
                    recommendations.append({
                        "pod": pod.metadata.name,
                        "container": container.name,
                        "type": "cpu_reduction",
                        "current": requests.get("cpu", "0"),
                        # recommend observed usage plus 20% headroom
                        "recommended": f"{max(int(actual * 1.2), 10)}m",
                    })
        return recommendations
```
Spot instance integration for non-critical workloads:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: spot-instance
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        node-type: spot
      containers:
        - name: processor
          image: batch-processor:v1.0.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
```
Operational Excellence
I implement practices that make systems reliable and maintainable:
Health Check Standards:
- Liveness probes detect when containers need restarting
- Readiness probes ensure traffic only goes to healthy pods
- Startup probes handle slow-starting applications (all three are combined in the sketch below)
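A minimal sketch combining all three, assuming the application serves /healthz and /ready on port 8080 (paths, port, and timings are placeholders to tune per service):

```yaml
spec:
  containers:
    - name: app
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # up to 5 minutes to finish starting
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10      # restart the container if this fails
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5       # gate traffic on this check
```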
Graceful Shutdown:
```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # keep serving while the endpoint controller removes the pod
            # from Service endpoints, then let SIGTERM trigger shutdown
            command: ["/bin/sh", "-c", "sleep 15"]
```
Resource Quotas:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
```
These best practices and optimizations have evolved from managing production systems at scale. They address the real challenges that emerge when moving from proof-of-concept deployments to systems that serve real users reliably and cost-effectively.
The key insight: optimization is an ongoing process, not a one-time activity. The best production systems continuously monitor, measure, and improve their performance, security, and cost efficiency.
Next, we’ll explore real-world projects and implementation strategies that demonstrate how to apply all these concepts together in complete production deployment scenarios.