CPU-based autoscaling is a lie for most web services. There, I said it.

I spent a painful week last year watching an HPA scale our API pods from 3 to 15 based on CPU utilization. The dashboards looked great — CPU was being “managed.” Meanwhile, the service was falling over because every single one of those 15 pods was fighting over a connection pool limited to 50 database connections. More pods made the problem worse. We were autoscaling ourselves into an outage.

That experience changed how I think about HPA entirely. CPU and memory are infrastructure metrics. They tell you how the machine is doing, not how your application is doing. If you want autoscaling that actually helps, you need to scale on what matters to your users — request latency, queue depth, active connections, error rates. Custom metrics.

This is the guide I wish I’d had before that incident. We’ll go from the default CPU-based HPA through the full custom metrics pipeline with Prometheus, Datadog, and CloudWatch. If you’re coming from my earlier piece on K8s scaling fundamentals, this picks up right where that left off.


Why Default HPA Falls Short

The out-of-the-box HPA uses the metrics.k8s.io API, which gives you CPU and memory from the metrics-server. For a batch processing job that’s genuinely CPU-bound, this works fine. For everything else, it’s a blunt instrument.

Here’s the standard HPA most tutorials show you:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This tells Kubernetes: “keep average CPU at 70% across pods.” Sounds reasonable until you think about what actually happens during a traffic spike on a typical web service. Your pods receive more requests, they spend most of their time waiting on I/O — database queries, external API calls, cache lookups. CPU barely moves. The HPA sits there doing nothing while your p99 latency climbs through the roof.

Or the opposite happens, like my database connection disaster. Some background job kicks off, CPU spikes, HPA adds pods, and now you’ve got a thundering herd problem on a shared resource that doesn’t scale horizontally.

The fix isn’t to abandon HPA. It’s to feed it metrics that actually represent your application’s health. That’s where the custom and external metrics APIs come in.


The Kubernetes Metrics Architecture

Before diving into implementations, it helps to understand the three metrics APIs that HPA can consume:

  • metrics.k8s.io — Resource metrics (CPU, memory). Served by metrics-server. This is what you get by default.
  • custom.metrics.k8s.io — Custom metrics tied to Kubernetes objects. Things like requests-per-second on a specific deployment, or queue length on a specific pod. Served by an adapter you install.
  • external.metrics.k8s.io — Metrics from outside the cluster entirely. CloudWatch metrics, SQS queue depth, a third-party API’s response time. Also served by an adapter.

The key insight: HPA doesn’t care where the numbers come from. It just queries these APIs and does math. Your job is to get the right numbers into the right API.
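That math is worth seeing once. The core of the HPA algorithm is desiredReplicas = ceil(currentReplicas * currentValue / targetValue), with a tolerance band and min/max clamping layered on top. A quick sketch with made-up numbers:

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentValue / targetValue).
    The real controller also applies a tolerance band (10% by default) and
    clamps the result to minReplicas/maxReplicas."""
    return math.ceil(current_replicas * current_value / target_value)

# 5 pods averaging 90% CPU against a 70% target -> scale up
print(desired_replicas(5, 90, 70))  # 7

# Same pods at 40% -> scale down
print(desired_replicas(5, 40, 70))  # 3
```

The formula is metric-agnostic, which is the whole point: swap the CPU numbers for requests per second or p99 latency and the controller does exactly the same arithmetic.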

I covered the monitoring foundations in my K8s monitoring and logging guide — that’s the observability side. This article is about closing the loop and feeding those observations back into autoscaling decisions.


Prometheus-Based Custom Metrics

Prometheus is the most common path here, and for good reason. If you’re already running Prometheus (and you probably are), you’re halfway there.

You need the Prometheus Adapter, which translates PromQL queries into the custom metrics API format that HPA understands.

Install it with Helm:

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=9090

The adapter needs rules that map Prometheus metrics to Kubernetes custom metrics. Here’s a configuration that exposes HTTP request rate per pod:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total$"
          as: "${1}_per_second"
        metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
      - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          as: "http_request_duration_p99"
        metricsQuery: 'histogram_quantile(0.99, rate(<<.Series>>{<<.LabelMatchers>>}[2m]))'

That second rule is the one I care about most. It exposes p99 latency as a metric HPA can act on. Now you can write an HPA that scales based on what users actually experience:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p99
        target:
          type: AverageValue
          averageValue: "0.5"

A few things worth noting here. I’m using autoscaling/v2 — if you’re still on v2beta2, upgrade. The behavior section is critical and most people skip it. Without it, HPA will flap wildly on spiky metrics. The stabilizationWindowSeconds on scale-down prevents the classic scenario where traffic dips for 30 seconds and HPA yanks away half your pods right before the next wave hits.

I’m also using multiple metrics. HPA evaluates all of them and picks the one that recommends the highest replica count. So if request rate says you need 5 pods but latency says you need 8, you get 8. This is exactly the behavior you want — scale on whichever dimension is hurting most.
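That highest-recommendation-wins rule is simple enough to sketch. Using the made-up numbers from above (request rate says 5 pods, latency says 8):

```python
import math

def recommend(current: int, metrics: dict[str, tuple[float, float]]) -> int:
    """HPA with multiple metrics: compute a recommendation per metric
    (ceil(current * value / target)) and take the maximum."""
    return max(math.ceil(current * value / target) for value, target in metrics.values())

# 5 pods: request rate is under target, but p99 latency is 50% over,
# so the latency recommendation wins.
replicas = recommend(5, {
    "http_requests_per_second": (95.0, 100.0),
    "http_request_duration_p99": (0.75, 0.5),
})
print(replicas)  # 8
```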

Verify your custom metrics are flowing:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

If that returns empty or errors, check the adapter logs. Nine times out of ten it’s a PromQL query that doesn’t match any series, usually because label names don’t line up.
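If the adapter is healthy but a rule still matches nothing, look at the application side too: the adapter can only reshape series your app actually exposes. Purely to show the scrape payload the seriesQuery rules above expect, here's a hand-rolled sketch of the Prometheus text exposition format. In a real service you'd use a client library such as prometheus_client, and the namespace/pod labels come from Prometheus relabeling at scrape time, not from the app; all label values here are hypothetical.

```python
def render_metrics(request_count: int, latency_buckets: dict[float, int]) -> str:
    """Minimal Prometheus text-format sketch: a counter matching
    http_requests_total and a histogram matching
    http_request_duration_seconds_bucket. Omits _sum/_count for brevity."""
    lines = [
        "# TYPE http_requests_total counter",
        f'http_requests_total{{method="GET",path="/api"}} {request_count}',
        "# TYPE http_request_duration_seconds histogram",
    ]
    cumulative = 0
    # Histogram buckets are cumulative: each le bucket includes everything below it.
    for le, count in sorted(latency_buckets.items()):
        cumulative += count
        lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {cumulative}')
    lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {cumulative}')
    return "\n".join(lines) + "\n"

payload = render_metrics(1234, {0.1: 900, 0.5: 300, 1.0: 34})
print(payload)
```

The cumulative bucket structure is what makes histogram_quantile in the second adapter rule possible.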


Datadog as a Metrics Source

If you’re a Datadog shop, the Datadog Cluster Agent can serve both the custom and external metrics APIs directly. No Prometheus adapter needed.

First, enable the external metrics provider in your Datadog Cluster Agent config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: datadog
spec:
  template:
    spec:
      containers:
        - name: cluster-agent
          env:
            - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
              value: "true"
            - name: DD_EXTERNAL_METRICS_PROVIDER_WPA_CONTROLLER
              value: "false"
            - name: DD_APP_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog-keys
                  key: app-key

Then reference Datadog metrics directly in your HPA using the External metric type:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: "aws.sqs.approximate_number_of_messages_visible"
          selector:
            matchLabels:
              queuename: "production-jobs"
        target:
          type: AverageValue
          averageValue: "5"

This scales your worker pods based on SQS queue depth as reported through Datadog. Each pod should handle roughly 5 messages — if the queue grows, more pods spin up. This is dramatically more useful than CPU for a queue consumer.
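For an External metric with an AverageValue target, the recommendation doesn't depend on the current replica count at all: it's the backlog divided by the per-pod target, clamped to the min/max bounds. A sketch with hypothetical queue depths:

```python
import math

def queue_replicas(backlog: int, per_pod_target: int, min_r: int, max_r: int) -> int:
    """External metric with an AverageValue target: desired replicas is
    the total backlog divided by the per-pod target, clamped to the
    HPA's minReplicas/maxReplicas bounds."""
    desired = math.ceil(backlog / per_pod_target)
    return max(min_r, min(max_r, desired))

print(queue_replicas(120, 5, 2, 30))  # 24
print(queue_replicas(3, 5, 2, 30))    # 2  -- clamped to minReplicas
print(queue_replicas(400, 5, 2, 30))  # 30 -- clamped to maxReplicas
```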

The External type is the key difference from Prometheus custom metrics. External metrics aren’t tied to any Kubernetes object. They come from outside the cluster entirely. This is powerful for scaling based on upstream load — queue depth, incoming webhook rate, even business metrics if you’re creative about it.


CloudWatch Metrics and EKS

Running on EKS? You can scale on CloudWatch metrics using the CloudWatch Metrics Adapter. This is particularly useful when your bottleneck lives in an AWS managed service — RDS connection count, ElastiCache evictions, ALB request count. One caveat before you adopt it: the awslabs adapter project hasn't seen active development in a while, so check its status first; KEDA (covered later) can also scale on CloudWatch metrics.

Deploy the adapter:

helm install cloudwatch-adapter \
  --namespace kube-system \
  oci://public.ecr.aws/awslabs/k8s-cloudwatch-adapter-chart

Then define an ExternalMetric custom resource that maps a CloudWatch metric:

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: rds-connections
  namespace: production
spec:
  name: rds-connections
  queries:
    - id: connections
      metricStat:
        metric:
          namespace: "AWS/RDS"
          metricName: "DatabaseConnections"
          dimensions:
            - name: DBInstanceIdentifier
              value: "prod-primary"
        period: 60
        stat: Average
        unit: Count

Now use it in an HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-cloudwatch
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: External
      external:
        metric:
          name: rds-connections
        target:
          type: Value
          value: "40"

This is the metric that would've saved me during that database connection incident. One caveat on the mechanics, though: HPA only scales in one direction relative to a target, so a value above target always means "add pods." A raw connection count used this way is backwards for a shared resource; at the limit, it would ask for more pods, not fewer. To use it as a brake, invert it (expose remaining pool headroom, so shrinking capacity drives the recommendation down) or let it inform a deliberately conservative maxReplicas.
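It's worth running the arithmetic here, because HPA only knows how to chase a target: a value above target always means add pods. With hypothetical numbers (pool limit of 50, target of 40), a raw connection count pushes the wrong way, while an inverted headroom metric acts as a brake:

```python
import math

def hpa_recommend(current: int, metric: float, target: float) -> int:
    # Core HPA ratio: metric above target -> more pods, below -> fewer.
    return math.ceil(current * metric / target)

# Raw connection count as the metric (target 40, pool limit 50):
# at the limit, HPA recommends MORE pods -- the wrong direction
# for a shared resource that each new pod consumes.
print(hpa_recommend(10, 50, 40))  # 13

# Inverted: expose remaining pool headroom instead (target: 10 free).
# As the pool saturates, headroom shrinks and the recommendation drops.
print(hpa_recommend(10, 2, 10))   # 2
```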


Multi-Metric Strategies That Actually Work

The real power comes from combining metrics. I’ve settled on a pattern I use for most web services: one infrastructure metric as a safety net, one application metric as the primary driver, and one external metric for upstream pressure.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-combined
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 25
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p99
        target:
          type: AverageValue
          averageValue: "0.5"
    - type: External
      external:
        metric:
          name: "aws.sqs.approximate_number_of_messages_visible"
          selector:
            matchLabels:
              queuename: "api-async-jobs"
        target:
          type: AverageValue
          averageValue: "10"

Notice the asymmetric behavior config. Scale-up has zero stabilization window and allows doubling — when things go bad, react fast. Scale-down is conservative: wait 5 minutes, remove at most 2 pods per minute. I’ve been burned too many times by aggressive scale-down to do it any other way.
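Those policies translate into hard caps applied each sync period. A sketch of the clamping, using the values from the HPA above and hypothetical replica counts:

```python
import math

def clamp_scale_down(current: int, desired: int, max_pods_per_period: int) -> int:
    # 'type: Pods, value: 2' on scaleDown: never remove more than
    # 2 pods per 60-second period, regardless of the recommendation.
    return max(desired, current - max_pods_per_period)

def clamp_scale_up(current: int, desired: int, max_percent: int) -> int:
    # 'type: Percent, value: 100' on scaleUp: roughly at most double
    # per period (real controller rounding details may differ slightly).
    return min(desired, math.ceil(current * (1 + max_percent / 100)))

print(clamp_scale_down(20, 5, 2))  # 18 -- gradual drain, not a cliff
print(clamp_scale_up(5, 25, 100))  # 10 -- double now, double again next period
```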

The CPU metric at 80% is there as a ceiling, not a primary signal. If something goes wrong with your custom metrics pipeline and latency data stops flowing, CPU will still prevent pods from melting. Defense in depth applies to autoscaling too.

This ties directly into the SLO/SLI framework I wrote about previously. Your HPA targets should derive from your SLOs. If your SLO says p99 latency under 500ms, your HPA target for http_request_duration_p99 should be 0.5. The autoscaler becomes an automated SLO enforcement mechanism.


Debugging HPA When It Misbehaves

HPA will confuse you. It’ll refuse to scale when you think it should, or scale when it shouldn’t. Here’s my debugging checklist.

Check what HPA currently sees:

kubectl describe hpa api-hpa -n production

Look at the Conditions section. The most common issues:

  • AbleToScale: False — usually means the metrics API isn’t responding. Check your adapter pods.
  • ScalingLimited — you’ve hit maxReplicas or minReplicas. Obvious but easy to miss.
  • Metrics showing <unknown> — the adapter can’t find the metric. Label mismatch, wrong metric name, or the series doesn’t exist in Prometheus yet.

Query the metrics API directly:

# Custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

# External metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

If those 404, your adapter isn’t registered as an API service. Check:

kubectl get apiservices | grep metrics

You should see entries for v1beta1.custom.metrics.k8s.io and v1beta1.external.metrics.k8s.io with Available: True.

One gotcha that cost me hours: the HPA controller runs on a default 15-second sync period. If your metric changes faster than that, HPA won’t see every fluctuation. This is usually fine — you don’t want HPA reacting to sub-second spikes — but it’s worth knowing when you’re testing.

For resource optimization, make sure your pods have proper resource requests set. HPA’s CPU percentage calculation uses requests as the denominator. No requests defined means HPA can’t calculate utilization and will ignore the CPU metric entirely.
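The calculation makes that dependency explicit: utilization is actual usage divided by the pod's request, not its limit or the node's capacity. A sketch with hypothetical millicore values:

```python
def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """HPA utilization = usage / resource *request* (not limit).
    With no request set there is no denominator, and HPA skips
    the metric entirely."""
    if request_millicores <= 0:
        raise ValueError("no CPU request set -- HPA cannot compute utilization")
    return 100.0 * usage_millicores / request_millicores

# Pod requesting 500m and using 350m -> 70% utilization
print(cpu_utilization_percent(350, 500))  # 70.0
```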


KEDA: When HPA Isn’t Enough

I should mention KEDA because it solves a specific problem that native HPA can't: scaling to zero. Standard HPA won't go below one replica (scale-to-zero exists only behind the alpha HPAScaleToZero feature gate). KEDA wraps HPA and adds scale-to-zero capability plus a huge library of built-in scalers for common event sources.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaledobject
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/production-jobs
        queueLength: "5"
        awsRegion: eu-west-1
      authenticationRef:
        name: keda-aws-credentials

KEDA creates and manages the HPA for you. Under the hood it’s still the same autoscaling machinery, but with a nicer interface and that critical scale-to-zero feature. For event-driven workloads — queue consumers, webhook processors, scheduled batch jobs — it’s become my default choice over raw HPA.
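The activation behavior is worth understanding, simplified heavily: KEDA itself only decides zero versus non-zero, and delegates everything in between to the HPA it generates. A rough sketch of that decision — not KEDA's actual code; the cooldown value matches the ScaledObject above, everything else is hypothetical:

```python
def keda_target_replicas(queue_len: int, current: int,
                         idle_seconds: float, cooldown: float = 300) -> int:
    """Simplified KEDA activation logic: wake from zero when work
    appears, scale back to zero only after the trigger has been idle
    for the full cooldownPeriod. In between, the generated HPA does
    the usual recommendation math."""
    if current == 0:
        return 1 if queue_len > 0 else 0  # activate on first message
    if queue_len == 0 and idle_seconds >= cooldown:
        return 0                          # drained and cooled down
    return current                        # leave scaling to the HPA

print(keda_target_replicas(12, 0, 0))    # 1 -- wake up
print(keda_target_replicas(0, 4, 120))   # 4 -- drained, but still in cooldown
print(keda_target_replicas(0, 4, 300))   # 0 -- scale to zero
```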


Lessons from Production

After running custom metrics HPA across a dozen services for over a year, here’s what I’ve learned:

Start with one custom metric alongside CPU, not instead of it. Get comfortable with how HPA behaves before removing the safety net. I covered the progression from manual scaling to HPA in my scaling mastery guide — don’t skip steps.

Your metrics pipeline is now in the critical path for autoscaling. If Prometheus goes down, your custom metrics disappear, and HPA falls back to whatever resource metrics are still available. Monitor your monitoring. Seriously.

Test scale-up under realistic conditions, not just synthetic load. The difference between “1000 requests per second of GET /health” and “1000 requests per second of actual user traffic with database queries and external calls” is enormous. Your HPA targets need to reflect real workload characteristics.

Don’t forget about services and networking. Scaling pods is only half the story. If your Service isn’t distributing traffic to new pods quickly enough, or if your readiness probes are too slow, you’ll have pods running but not receiving traffic during the critical scale-up window.

The database connection problem I mentioned at the start? We solved it with a combination of PgBouncer for connection pooling and an HPA that watches request latency, guarded by a custom db_pool_utilization metric. When pool utilization crosses 70%, we stop adding pods even if latency is climbing. One wrinkle: stock HPA always follows the highest metric recommendation, so a brake like this can't just be another entry in the metrics list; it has to act as a ceiling (a conservative maxReplicas, or a controller that adjusts one). Sometimes the right scaling decision is to not scale.
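Sketched with hypothetical numbers, the guardrail logic looks like this. Note that stock HPA always follows the highest metric recommendation, so a check like this has to live outside the HPA itself, for example by capping maxReplicas:

```python
import math

def guarded_replicas(current: int, p99: float, p99_target: float,
                     pool_utilization: float, pool_ceiling: float = 0.70) -> int:
    """Hypothetical guardrail: latency drives scale-up, but once the
    shared DB pool is above the ceiling, adding pods only worsens
    contention, so we refuse to go higher."""
    latency_rec = math.ceil(current * p99 / p99_target)
    if pool_utilization >= pool_ceiling:
        return min(latency_rec, current)  # hold steady; don't pile on
    return latency_rec

print(guarded_replicas(8, 0.9, 0.5, pool_utilization=0.45))  # 15 -- pool has headroom
print(guarded_replicas(8, 0.9, 0.5, pool_utilization=0.80))  # 8  -- pool saturated, hold
```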

That’s the real lesson here. Custom metrics don’t just let you scale better — they let you make smarter decisions about when not to scale. And that’s worth more than any amount of CPU-based autoscaling will ever give you.