Kubernetes eBPF: Next-Generation Observability and Security

eBPF is the biggest shift in Linux observability since strace. I don’t say that lightly. I’ve spent years wiring up monitoring stacks, bolting sidecars onto every pod, and watching resource requests balloon because each workload needed its own little proxy just to get visibility into what was happening on the network. eBPF changes the game entirely — it moves instrumentation into the kernel, where it belongs, and it does it safely, efficiently, and without touching your application code.

If you’re running Kubernetes and you haven’t looked seriously at eBPF-based tooling yet, this is your sign. I’m going to walk through what eBPF actually is, how I replaced a sidecar-heavy service mesh with Cilium, and how tools like Tetragon and Pixie give you observability and security capabilities that used to require kernel modules, per-pod agents, or application code changes.

What eBPF Actually Is (and Why You Should Care)

eBPF — extended Berkeley Packet Filter — lets you run sandboxed programs inside the Linux kernel without changing kernel source code or loading kernel modules. Think of it as a safe, programmable hook into virtually every layer of the operating system: networking, file I/O, syscalls, scheduling, you name it.

The key properties that matter for Kubernetes:

No sidecar required. eBPF programs run in the kernel on each node. They see all traffic and all syscalls for every pod on that node without injecting anything into the pod itself.
Low overhead. eBPF programs are checked by the kernel verifier at load time and then JIT-compiled to native code. For most workloads the cost is small compared to routing every packet through a userspace proxy.
Deep visibility. You’re not limited to L7 proxy data. You can observe DNS lookups, TCP retransmits, file access patterns, process execution — all from the kernel’s perspective.

I’d been hearing about eBPF for years in the networking community, but it wasn’t until I saw what Cilium could do in a production Kubernetes cluster that it clicked for me. This isn’t academic — it’s practical, and it’s ready.

The War Story: Replacing Our Sidecar Mesh with Cilium

Here’s the context. We had a mid-size Kubernetes cluster — around 200 microservices, running on EKS. Every pod had an Envoy sidecar injected by our service mesh. The mesh gave us mTLS, traffic policies, and some observability. It also gave us headaches.

Each sidecar consumed roughly 50-100MB of memory and a non-trivial slice of CPU. Multiply that across hundreds of pods with multiple replicas and you’re burning real money on infrastructure that exists purely for plumbing. Sidecar injection failures caused deployment rollbacks. Envoy version upgrades were a coordination nightmare. Debug sessions regularly turned into “is this a mesh issue or an app issue?” investigations.
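
To put rough numbers on that, here is a back-of-the-envelope calculation in shell. The 200 services and the 50-100MB per sidecar come from the figures above; the average of 3 replicas per service is an assumption for illustration, not a measurement.

```shell
#!/bin/sh
# Back-of-the-envelope sidecar memory cost.
services=200
replicas_per_service=3   # hypothetical average, adjust for your cluster
pods=$((services * replicas_per_service))

# Total sidecar memory in MB: pod count * per-sidecar footprint in MB.
sidecar_total_mb() {
  echo $(( $1 * $2 ))
}

low=$(sidecar_total_mb "$pods" 50)
high=$(sidecar_total_mb "$pods" 100)
echo "pods: $pods"
echo "sidecar memory: ${low}-${high} MB (about $((low / 1000))-$((high / 1000)) GB)"
```

Under those assumptions the sidecars alone account for tens of gigabytes of memory before any application code runs.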

I’d been tracking Cilium’s progress and decided to run a proof of concept. The pitch was simple: replace the sidecar data plane with eBPF-based networking at the kernel level.

Here’s what the migration looked like at a high level:

Install Cilium as the CNI plugin (replacing our existing CNI).
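Enable Cilium’s transparent encryption (WireGuard-based) to replace sidecar mTLS for encryption in transit. This encrypts pod traffic between nodes; on its own it does not provide the per-workload certificate identities that mesh mTLS does.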
Migrate network policies from the mesh format to Cilium’s CiliumNetworkPolicy CRDs.
Enable Hubble for L3/L4/L7 observability.
Remove the sidecar injector and watch resource usage drop.

Installing Cilium on an EKS cluster:

cilium install --version 1.15.4 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

After the rollout, I verified the status:

cilium status --wait

The results were dramatic. Pod memory usage across the cluster dropped by roughly 50%. We reclaimed enough capacity to defer a node group scale-up we’d been planning. Cold start times improved because pods no longer waited for sidecar readiness. And the observability we got from Hubble was, honestly, better than what the mesh had given us.

Cilium Network Policies: Beyond Standard Kubernetes

Standard Kubernetes network policies are useful but limited. They operate at L3/L4: you can allow or deny traffic based on IP blocks, ports, and pod or namespace selectors. Cilium extends this to L7, which means you can write policies based on HTTP methods, paths, headers, DNS names, and even Kafka topics.

Here’s a CiliumNetworkPolicy that restricts an API service to only accept GET and POST requests on specific paths:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api-gateway
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/v1/.*"
              - method: POST
                path: "/api/v1/orders"

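Try doing that with vanilla Kubernetes NetworkPolicy. You can’t. This level of granularity used to require a full service mesh. Now it’s a CRD: the eBPF datapath steers matching traffic to a proxy that Cilium runs once per node rather than once per pod, and the HTTP rules are enforced there.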
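
To make the rule semantics concrete, here is a small shell sketch that mimics the two HTTP rules locally. It is an approximation, not Cilium’s matcher: it assumes each path is treated as an anchored extended regex, and the sample requests are made up.

```shell
#!/bin/sh
# Local approximation of the two HTTP rules in api-l7-policy.
# Assumption: the rule path is an extended regex matched against the whole path.
# Real enforcement happens in Cilium's proxy, not in this script.
l7_allowed() {
  method=$1
  path=$2
  case $method in
    GET)  printf '%s\n' "$path" | grep -Eq '^/api/v1/.*$' ;;
    POST) printf '%s\n' "$path" | grep -Eq '^/api/v1/orders$' ;;
    *)    return 1 ;;
  esac
}

# Hypothetical requests from the frontend pods.
for req in "GET /api/v1/users" "POST /api/v1/orders" "POST /api/v1/users" "DELETE /api/v1/orders"; do
  set -- $req
  if l7_allowed "$1" "$2"; then verdict=allowed; else verdict=denied; fi
  echo "$req -> $verdict"
done
```

In a real cluster, a request that fails the L7 rules typically comes back as an HTTP 403 from the proxy rather than as a dropped connection, which makes policy denials easy to tell apart from network failures.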

For DNS-based policies — say you want to restrict egress to only specific external domains:

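apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-external-dns
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payment-service
  egress:
    # Allow DNS lookups via kube-dns and let Cilium's DNS proxy inspect them;
    # toFQDNs rules rely on the DNS responses it observes.
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Allow HTTPS only to the named payment providers
    - toFQDNs:
        - matchName: "api.stripe.com"
        - matchName: "api.paypal.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP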

This is huge for security. Instead of hoping your payment service doesn’t talk to unexpected endpoints, you enforce it at the kernel level.
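
One detail worth internalizing is that matchName is an exact hostname match, so sibling subdomains are not covered. The shell sketch below illustrates that behavior locally. It is a simplification with made-up test hostnames, not Cilium’s implementation; if you need wildcards, Cilium’s matchPattern field (for example "*.stripe.com") is the place to look.

```shell
#!/bin/sh
# Local illustration of matchName semantics: only the listed hostnames pass.
# Simplification: real enforcement is based on the IPs returned for these
# names in DNS responses, which this script does not model.
fqdn_allowed() {
  case $1 in
    api.stripe.com|api.paypal.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical egress destinations for the payment service.
for host in api.stripe.com api.paypal.com files.stripe.com example.org; do
  if fqdn_allowed "$host"; then
    echo "$host -> allowed"
  else
    echo "$host -> denied"
  fi
done
```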

Hubble: Observability Without the Overhead

Hubble is Cilium’s observability layer. It taps into the same eBPF data plane to give you flow visibility, service maps, and metrics — all without sidecars, all without application changes.

After enabling Hubble (which I did during the Cilium install above), you can observe flows in real time:

hubble observe --namespace production --protocol http

Or filter for DNS queries from a specific pod:

hubble observe --pod production/payment-service --type l7 --protocol dns

Hubble can also export Prometheus metrics once you enable them in the Cilium Helm values. I wired these into our existing Grafana dashboards and suddenly had per-service HTTP request rates, error rates, and latency distributions (the golden signals) without touching a single line of application code. If you’re already invested in observability patterns for distributed systems, Hubble slots right in.
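
Hubble metrics are controlled by the hubble.metrics.enabled Helm value, which takes a list of metric handlers. Here is a minimal sketch that builds that setting and prints the upgrade command instead of running it. The handler names and the release name and namespace (cilium in kube-system) are assumptions; check the chart values for the Cilium version you run.

```shell
#!/bin/sh
# Build the Helm --set argument that turns on selected Hubble metrics.
# Assumption: handler names (dns, drop, tcp, flow, http) exist in your release.
hubble_metrics_flag() {
  list=""
  for m in "$@"; do
    list="${list:+$list,}$m"   # join arguments with commas
  done
  printf 'hubble.metrics.enabled={%s}\n' "$list"
}

flag=$(hubble_metrics_flag dns drop tcp flow http)

# Print the command rather than executing it, so this is safe to run anywhere.
echo "helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set '$flag'"
```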

The service dependency map Hubble generates is genuinely useful too. We discovered two services that were making unexpected cross-namespace calls that nobody on the team knew about. That kind of visibility is what you need when you’re operating at scale.

Tetragon: Runtime Security at the Kernel Level

Observability is one side of the eBPF coin. Security is the other. Tetragon, also from the Cilium project (now a CNCF project), uses eBPF to enforce security policies at the kernel level in real time.

Many traditional runtime security tools collect events and evaluate them in userspace, so they can only react after the fact. Tetragon applies its filters and actions inside the kernel, so it can act on an operation while it is still in flight instead of after an alert fires. That’s a fundamental difference.

Install Tetragon on your cluster:

helm repo add cilium https://helm.cilium.io
helm install tetragon cilium/tetragon -n kube-system

Tetragon uses TracingPolicy CRDs to define what to monitor and enforce. Here’s a policy that detects and logs any process execution inside a container — useful for catching reverse shells or unexpected binaries:

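apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: detect-process-exec
  # A namespaced policy applies only to pods in this Kubernetes namespace.
  # (The matchNamespaces selector filters on Linux namespaces such as Pid
  # or Net, not on Kubernetes namespaces.)
  namespace: production
spec:
  kprobes:
    - call: sys_execve
      syscall: true
      args:
        # First argument of execve: path of the binary being executed
        - index: 0
          type: string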

Want to go further and block writes to sensitive paths? Here’s a policy that prevents any container in the production namespace from writing to /etc/:

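apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: block-etc-writes
  # Applies only to pods in this Kubernetes namespace
  namespace: production
spec:
  kprobes:
    # Hook the kernel's file permission check, where the file path is
    # available; a bare fd on sys_write cannot be matched against a path.
    - call: security_file_permission
      syscall: false
      args:
        - index: 0
          type: file   # struct file *, resolved to a path
        - index: 1
          type: int    # requested access mask
      selectors:
        - matchArgs:
            - index: 0
              operator: Prefix
              values:
                - "/etc/"
            - index: 1
              operator: Equal
              values:
                - "2"  # MAY_WRITE
          matchActions:
            - action: Sigkill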

That Sigkill action means the process gets terminated immediately. No waiting for an alert, no hoping someone’s watching the dashboard. The kernel stops it. This pairs well with container security scanning in your CI/CD pipelines — you catch known vulnerabilities at build time and enforce runtime behavior with Tetragon.

To see Tetragon events in real time:

kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f | \
  tetra getevents --output compact

I’ve found Tetragon particularly valuable for compliance. When auditors ask “how do you prevent unauthorized process execution in production containers?” you can show them a TracingPolicy and the enforcement logs. It’s concrete, it’s verifiable, and it’s enforced at the kernel level.
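
Before turning on Sigkill anywhere near production, it helps to run the same match in report-only mode. The sketch below writes a detection-only policy to a temporary file and prints the apply command. The policy name is made up, and the use of the security_file_permission hook with a file-typed argument is an assumption about your Tetragon version, so validate it on a test cluster first.

```shell
#!/bin/sh
# Write a detection-only policy for writes under /etc/ to a temp file.
# Assumptions: hook name and "file" argument type are supported by the
# installed Tetragon release; "audit-etc-writes" is a hypothetical name.
policy_file="${TMPDIR:-/tmp}/audit-etc-writes.yaml"

cat > "$policy_file" <<'EOF'
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: audit-etc-writes
  namespace: production
spec:
  kprobes:
    - call: security_file_permission
      syscall: false
      args:
        - index: 0
          type: file   # struct file *, resolved to a path
        - index: 1
          type: int    # requested access mask
      selectors:
        - matchArgs:
            - index: 0
              operator: Prefix
              values:
                - "/etc/"
            - index: 1
              operator: Equal
              values:
                - "2"  # MAY_WRITE
          # No matchActions block: matches are reported, nothing is killed.
EOF

echo "wrote $policy_file; apply with: kubectl apply -f $policy_file"
```

Once the events from this policy contain only things you would genuinely want stopped, adding a matchActions entry is a one-line change.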

Pixie: Auto-Instrumented Application Profiling

Pixie (now part of the CNCF as a sandbox project) takes a different angle on eBPF observability. Where Hubble focuses on network flows, Pixie auto-instruments your applications to capture request traces, CPU profiles, and even full request/response bodies — all using eBPF, all without code changes.

Deploy Pixie to your cluster:

px deploy

Then query your cluster’s HTTP traffic using Pixie’s PxL scripting language:

px run px/http_data -- --start_time=-5m --namespace=production

Or get a flamegraph-style CPU profile for a specific pod:

px run px/perf_flamegraph -- --pod=production/api-gateway-7b4d9f8c6-x2k9m

What makes Pixie special is that it captures full protocol traces (HTTP, gRPC, MySQL, PostgreSQL, Kafka, DNS) without any instrumentation libraries. It does this by attaching eBPF probes to the syscalls applications use to send and receive data and reconstructing protocol messages from what it captures. If you’re already using distributed tracing with OpenTelemetry, Pixie complements it by filling in the gaps where you haven’t added manual instrumentation yet.

I used Pixie to debug a latency issue that had stumped us for days. Our traces showed the request was slow, but the spans didn’t cover the actual database call because that particular service hadn’t been instrumented yet. Pixie showed me the full MySQL query, the response time, and the exact query that was doing a full table scan. Ten minutes from “let me try Pixie” to root cause. That’s the power of kernel-level observability.

Putting It All Together: The eBPF Stack

Here’s the stack I run today and recommend:

Layer	Tool	What It Does
CNI + Networking	Cilium	Pod networking, load balancing, encryption
Network Policy	CiliumNetworkPolicy	L3/L4/L7 traffic control, FQDN filtering
Network Observability	Hubble	Flow logs, service maps, golden signals
Runtime Security	Tetragon	Process/file/network enforcement at kernel level
App Profiling	Pixie	Auto-instrumented traces, CPU profiles, protocol capture

This replaces what used to require a service mesh (Istio/Linkerd), a sidecar proxy (Envoy), a separate network policy engine, a runtime security agent (Falco or similar), and a profiling tool. It’s fewer moving parts, less resource consumption, and deeper visibility.

If you’re running ingress controllers like NGINX, Traefik, or Istio, Cilium can also handle ingress natively with its Gateway API implementation — another component you might be able to consolidate.

Getting Started: Practical Advice

If you’re convinced and want to start adopting eBPF tooling, here’s my advice:

Start with Cilium as your CNI. This is the foundation. If you’re on a managed Kubernetes service, check whether eBPF-based networking is available natively — EKS, GKE, and AKS all have options now. On EKS, you can use Cilium as a replacement CNI or run it in chaining mode alongside the VPC CNI.

Enable Hubble immediately. There’s no reason not to. The overhead is minimal and the visibility is immediate. Wire the Prometheus metrics into your existing dashboards.

Add Tetragon for runtime security. Start with detection-only policies (logging, not killing) until you’re confident in your policy definitions. Then graduate to enforcement.

Try Pixie for debugging. You don’t need to run it permanently. Deploy it when you’re investigating an issue, use it to get the data you need, and scale it down if resources are tight.

Check your kernel version. eBPF capabilities depend on kernel version. You want 5.10+ for most features, and 5.15+ for the full Tetragon and Cilium feature set. Most current managed Kubernetes offerings meet this requirement, but verify.

uname -r
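
If you want to turn that check into something scriptable, here is a minimal sketch that compares a kernel release string against a minimum major.minor. The 5.10 threshold is the rule of thumb from above, not an official requirement; check the system requirements of the Cilium and Tetragon versions you plan to run.

```shell
#!/bin/sh
# Succeeds if a kernel release string is at least MAJOR.MINOR.
# Usage: kernel_at_least "$(uname -r)" 5 10
kernel_at_least() {
  release=$1; want_major=$2; want_minor=$3
  major=${release%%.*}          # text before the first dot
  rest=${release#*.}            # text after the first dot
  minor=${rest%%[!0-9]*}        # leading digits of the remainder
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 5 10; then
  echo "kernel $(uname -r): meets the 5.10+ rule of thumb"
else
  echo "kernel $(uname -r): older than 5.10, check feature support"
fi

# To check every node instead of the local machine, the kernel version is
# reported in each node's status:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'
```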

The Shift Is Happening

eBPF isn’t a future technology; it’s a present one. Every major cloud provider is investing in it. GKE’s Dataplane V2 is built on Cilium. Meta, Netflix, and Cloudflare run eBPF in production at massive scale. The CNCF has multiple eBPF projects in its ecosystem.

The sidecar model served us well, but it was always a workaround for the fact that we couldn’t safely instrument the kernel. Now we can. The result is less infrastructure overhead, deeper observability, and security enforcement that actually happens at the right layer.

If you’re still running a sidecar-heavy mesh and wondering whether the complexity is worth it, I’d encourage you to spin up a test cluster with Cilium and Hubble. Give it an afternoon. I think you’ll see what I saw — and you won’t want to go back.