eBPF in Production: Observability Without the 40% CPU Overhead

Replacing traditional APM tools with eBPF-based observability - how we eliminated up to 40% of per-pod monitoring CPU overhead, root-caused a kernel panic, and why you can't just 'enable eBPF'.

The $18K/Month APM Bill That Started Everything

Q1 2024. Our DataDog bill: $18,400/month. And climbing.

Worse: Our monitoring was killing our performance.

  • APM agents: 15-40% CPU overhead per pod
  • Metric collection: 2-8% memory overhead
  • Network impact: 12MB/s egress per node (just telemetry!)
  • Kubernetes performance: 25% degradation from sidecar overhead

Our VP of Engineering: “We’re spending $220K/year to slow down our platform. Find a better way.”

After reading about eBPF transforming cloud observability, I proposed a radical idea: Replace traditional APM with eBPF-based observability.

My team thought I was insane. They were partially right.

What eBPF Actually Is (And Isn’t)

The “Extended Berkeley Packet Filter” Explained

Traditional monitoring: Inject agents into applications, collect data, export to backend

eBPF approach: Run sandboxed programs inside the Linux kernel that observe everything

Key insight: eBPF sees what the kernel sees - every syscall, every network packet, every function call.

The magic:

  • No application instrumentation needed
  • Near-zero overhead (<2% CPU)
  • Kernel-level visibility
  • Safe execution (verified before loading)

What eBPF Can Observe

// Example: Track all TCP connections (BCC-style program)
struct event_t {
    u64 timestamp;
    u32 src_addr;
    u32 dst_addr;
    u16 dst_port;
};

BPF_HASH(connections, struct sock *, u64);
BPF_PERF_OUTPUT(events);

int trace_tcp_connect(struct pt_regs *ctx, struct sock *sk) {
    u64 ts = bpf_ktime_get_ns();
    connections.update(&sk, &ts);

    // Extract connection details
    u16 dport = sk->__sk_common.skc_dport;
    u32 saddr = sk->__sk_common.skc_rcv_saddr;
    u32 daddr = sk->__sk_common.skc_daddr;

    // Submit event to userspace
    struct event_t event = {
        .timestamp = ts,
        .src_addr = saddr,
        .dst_addr = daddr,
        .dst_port = ntohs(dport)
    };
    events.perf_submit(ctx, &event, sizeof(event));

    return 0;
}

This runs in the kernel. No agent required.
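For completeness, here is a minimal sketch of the userspace side: loading a compiled probe, attaching it to a kernel function, and draining its perf buffer. It assumes the probe is built as a standalone libbpf-style object (the BCC-flavored snippet above would instead be loaded through BCC's own bindings); the file name, the tcp_v4_connect attach point, and the events map name are illustrative.

// loader.go - illustrative sketch, not our production loader
package main

import (
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/cilium/ebpf/perf"
)

func main() {
    // Load the compiled eBPF object and attach the connect-tracing kprobe
    coll, err := ebpf.LoadCollection("tcp_observer.o")
    if err != nil {
        log.Fatalf("loading eBPF object: %v", err)
    }
    defer coll.Close()

    kp, err := link.Kprobe("tcp_v4_connect", coll.Programs["trace_tcp_connect"], nil)
    if err != nil {
        log.Fatalf("attaching kprobe: %v", err)
    }
    defer kp.Close()

    // Read connection events from the perf buffer. This process is the only
    // userspace piece; the traced applications are never touched.
    rd, err := perf.NewReader(coll.Maps["events"], 4096)
    if err != nil {
        log.Fatalf("opening perf reader: %v", err)
    }
    defer rd.Close()

    for {
        rec, err := rd.Read()
        if err != nil {
            continue
        }
        log.Printf("connection event: %d bytes", len(rec.RawSample))
    }
}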

Phase 1: The Proof of Concept (Weeks 1-2)

Challenge: Prove eBPF Can Replace DataDog

We picked a single service: payment-api (high-value, high-traffic).

Metrics we needed to replicate:

  • Request rate, latency, error rate (RED metrics)
  • CPU, memory, network usage
  • Custom business metrics (payment success rate)

Attempt 1: bpftrace (Educational, Not Production)

# Track HTTP requests (quick prototype)
bpftrace -e '
  kprobe:tcp_sendmsg /comm == "payment-api"/ {
    @bytes[comm] = hist(arg2);
  }
  
  interval:s:5 {
    print(@bytes);
    clear(@bytes);
  }
'

Result: Worked! But bpftrace is not production-ready:

  • No persistent storage
  • Manual script execution
  • No multi-node aggregation
  • No alerting

Attempt 2: Custom eBPF + Prometheus Exporter

We wrote a proper eBPF program using libbpf:

// payment_observer.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct http_event {
    __u32 pid;
    __u64 timestamp;
    __u32 status_code;
    __u64 duration_ns;
    char path[64];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

// Per-thread request start timestamps, keyed by pid_tgid, so the end probe
// can compute a duration
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);
    __type(value, __u64);
} start_ts SEC(".maps");

// Trace HTTP request start
SEC("uprobe/http_server_start")
int trace_request_start(struct pt_regs *ctx) {
    struct http_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event)
        return 0;

    __u64 id = bpf_get_current_pid_tgid();
    __u64 ts = bpf_ktime_get_ns();

    event->pid = id >> 32;
    event->timestamp = ts;

    // Remember when this thread started the request
    bpf_map_update_elem(&start_ts, &id, &ts, BPF_ANY);

    // Read request path from userspace memory
    void *path_ptr = (void *)PT_REGS_PARM1(ctx);
    bpf_probe_read_user_str(&event->path, sizeof(event->path), path_ptr);

    bpf_ringbuf_submit(event, 0);
    return 0;
}

// Trace HTTP request end
SEC("uprobe/http_server_end")
int trace_request_end(struct pt_regs *ctx) {
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *start = bpf_map_lookup_elem(&start_ts, &id);
    if (!start)
        return 0;

    struct http_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event) {
        bpf_map_delete_elem(&start_ts, &id);
        return 0;
    }

    event->pid = id >> 32;
    event->timestamp = *start;
    event->status_code = (__u32)PT_REGS_PARM1(ctx);
    event->duration_ns = bpf_ktime_get_ns() - *start;
    event->path[0] = '\0';

    bpf_ringbuf_submit(event, 0);
    bpf_map_delete_elem(&start_ts, &id);
    return 0;
}

Userspace exporter (Go):

// prometheus_exporter.go
package main

import (
    "bytes"
    "encoding/binary"
    "log"
    "strconv"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
    "github.com/prometheus/client_golang/prometheus"
)

// HttpEvent mirrors struct http_event in payment_observer.bpf.c
type HttpEvent struct {
    Pid        uint32
    _          uint32 // padding: the C struct 8-byte-aligns the next field
    Timestamp  uint64
    StatusCode uint32
    _          uint32 // padding
    DurationNs uint64
    Path       [64]byte
}

// eventsMap is the "events" ringbuf map from the loaded eBPF collection
var eventsMap *ebpf.Map

type MetricsCollector struct {
    requestDuration *prometheus.HistogramVec
    requestTotal    *prometheus.CounterVec
    errorTotal      *prometheus.CounterVec
}

func (m *MetricsCollector) collectFromEBPF() {
    // Read events from the eBPF ring buffer
    reader, err := ringbuf.NewReader(eventsMap)
    if err != nil {
        log.Fatal(err)
    }
    defer reader.Close()

    for {
        record, err := reader.Read()
        if err != nil {
            continue
        }

        var event HttpEvent
        if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
            continue
        }

        // Update Prometheus metrics
        labels := prometheus.Labels{
            "path":   string(bytes.TrimRight(event.Path[:], "\x00")),
            "status": strconv.Itoa(int(event.StatusCode)),
        }

        m.requestTotal.With(labels).Inc()
        m.requestDuration.With(labels).Observe(float64(event.DurationNs) / 1e9)

        if event.StatusCode >= 500 {
            m.errorTotal.With(labels).Inc()
        }
    }
}
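The collector above only updates in-memory metrics. To make them scrapeable by Prometheus, they still need to be constructed, registered, and served over /metrics. A minimal sketch of that wiring, continuing prometheus_exporter.go - the metric names and listen port are illustrative, not from our actual deployment:

// Additional imports assumed: "net/http" and
// "github.com/prometheus/client_golang/prometheus/promhttp"

func newMetricsCollector() *MetricsCollector {
    m := &MetricsCollector{
        requestDuration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "Request latency observed via eBPF uprobes.",
            Buckets: prometheus.DefBuckets,
        }, []string{"path", "status"}),
        requestTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Requests observed via eBPF uprobes.",
        }, []string{"path", "status"}),
        errorTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "http_request_errors_total",
            Help: "5xx responses observed via eBPF uprobes.",
        }, []string{"path", "status"}),
    }
    prometheus.MustRegister(m.requestDuration, m.requestTotal, m.errorTotal)
    return m
}

func main() {
    m := newMetricsCollector()
    go m.collectFromEBPF() // drain the ring buffer in the background

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9435", nil))
}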

Result:

  • Worked! Replicated DataDog metrics
  • Overhead: 1.8% CPU (vs. 28% with DataDog agent)
  • Cost: $0 (self-hosted)

Phase 2: Production Rollout (Weeks 3-6)

The Kernel Compatibility Nightmare

Problem: Our production clusters ran 4 different kernel versions:

  • Kernel 4.15 (old but stable)
  • Kernel 5.4 (LTS)
  • Kernel 5.10 (newer LTS)
  • Kernel 5.15 (latest)

eBPF programs compiled for one kernel won’t work on another.

Solution: CO-RE (Compile Once, Run Everywhere)

// BPF CO-RE enabled code
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_sendmsg")
int trace_tcp_send(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    
    // CO-RE: Automatically adapt to kernel structure changes
    u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    u32 saddr = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
    
    // Rest of the code...
    return 0;
}

This single binary works across all our kernel versions. Magic!
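Worth noting why this works: CO-RE relocates struct field offsets at load time using the running kernel's BTF type information, normally exposed at /sys/kernel/btf/vmlinux (older kernels may need a BTF file shipped alongside the program). A tiny, purely illustrative preflight check lets the exporter fail loudly on nodes without BTF instead of failing mysteriously at load time:

package main

import (
    "log"
    "os"
)

// ensureBTF verifies the kernel exposes BTF, which CO-RE relocation requires.
func ensureBTF() {
    if _, err := os.Stat("/sys/kernel/btf/vmlinux"); err != nil {
        log.Fatalf("kernel BTF not available; CO-RE programs cannot be relocated: %v", err)
    }
}

func main() {
    ensureBTF()
    // ... load and attach CO-RE programs as usual ...
}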

Challenge: Multi-Node Aggregation

eBPF runs per-node. We needed cluster-wide metrics.

Architecture:

┌─────────────────────────────────────────────┐
│              Prometheus                      │
│          (Central Aggregation)               │
└─────────────────┬───────────────────────────┘
                  │ scrape
      ┌───────────┼───────────┐
      │           │           │
  ┌───▼────┐  ┌──▼─────┐  ┌──▼─────┐
  │ Node 1 │  │ Node 2 │  │ Node 3 │
  │ eBPF   │  │ eBPF   │  │ eBPF   │
  │Exporter│  │Exporter│  │Exporter│
  └────────┘  └────────┘  └────────┘
      │           │           │
  ┌───▼────┐  ┌──▼─────┐  ┌──▼─────┐
  │ Kernel │  │ Kernel │  │ Kernel │
  │ eBPF   │  │ eBPF   │  │ eBPF   │
  │Programs│  │Programs│  │Programs│
  └────────┘  └────────┘  └────────┘

Deployment: DaemonSet with one exporter per node.

The Performance Benchmark

We ran load tests comparing overhead:

Monitoring Solution    CPU Overhead    Memory Overhead    Network Overhead
DataDog APM            28%             8%                 12 MB/s/node
Dynatrace              22%             6%                 9 MB/s/node
New Relic              31%             11%                15 MB/s/node
eBPF (ours)            1.8%            0.4%               0.8 MB/s/node

Winner: eBPF by a landslide.

The Debugging Win: Kernel Panic Root Cause

2 months into eBPF deployment, we hit a mysterious kernel panic.

Symptom: Random node crashes, 1-2 per day. No pattern.

Traditional tools: Useless. Kernel panics rarely leave useful logs behind.

eBPF saved us: We had tracing enabled.

The Investigation

We analyzed eBPF traces from crashed nodes:

# Query the eBPF trace output captured before the crash
bpftool prog tracelog | tail -10000

# Found this pattern right before every crash:
[1234567.890] tcp_sendmsg: invalid socket state
[1234567.891] tcp_sendmsg: sk_state=7 (CLOSED)
[1234567.892] WARN: use after free detected
[1234567.893] --- KERNEL PANIC ---

Root cause: A bug in our custom TCP connection pooling code was closing sockets that were still in use.

Fix: 10 lines of code. Would have taken weeks to debug without eBPF tracing.

Advanced Use Cases We Built

1. Network Latency Tracking

Traditional approach: Instrument every service

eBPF approach: Track at kernel level

// Track TCP round-trip time
// (hist is a BPF array map and log2l a log2 helper, both declared elsewhere)
SEC("kprobe/tcp_rcv_established")
int trace_tcp_rtt(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    struct tcp_sock *tp = (struct tcp_sock *)sk;  // tcp_sk() is not usable from BPF

    u32 srtt_us = BPF_CORE_READ(tp, srtt_us) >> 3;  // kernel stores srtt << 3, in microseconds

    // Update log2 histogram
    u64 slot = log2l(srtt_us);
    if (slot >= MAX_SLOTS)
        slot = MAX_SLOTS - 1;
    hist[slot]++;

    return 0;
}

Result: Network latency visibility for every TCP connection, with zero application changes.

2. Security Monitoring

// Detect suspicious syscalls
struct alert_t {
    __u32 pid;
    __u64 timestamp;
    char command[64];
    char process[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
} alerts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_exec(struct trace_event_raw_sys_enter *ctx) {
    struct alert_t alert = {};

    alert.pid = bpf_get_current_pid_tgid() >> 32;
    alert.timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&alert.process, sizeof(alert.process));

    // Copy the exec'd path. libc string helpers (strstr, strncpy) are not
    // available in eBPF, so matching against suspicious patterns like
    // "/bin/sh", "nc", or "curl" happens in the userspace consumer of alerts
    const char *filename = (const char *)ctx->args[0];
    bpf_probe_read_user_str(&alert.command, sizeof(alert.command), filename);

    bpf_perf_event_output(ctx, &alerts, BPF_F_CURRENT_CPU, &alert, sizeof(alert));

    return 0;
}

Caught: 3 cryptomining attempts in first month.

3. Custom Business Metrics

// Track payment success rate (application-specific)
#define MAX_BUCKETS 64

// Global counters live in the BPF object's .bss and are read from userspace
__u64 payment_counts[MAX_BUCKETS];
__u64 successful_payments;
__u64 failed_payments;

SEC("uprobe/process_payment")
int trace_payment(struct pt_regs *ctx) {
    struct payment_info *payment = (struct payment_info *)PT_REGS_PARM1(ctx);

    u64 amount;
    bpf_probe_read_user(&amount, sizeof(amount), &payment->amount);

    // Track payment by amount bucket ($100 buckets; amount is in cents)
    u64 bucket = amount / 10000;
    if (bucket >= MAX_BUCKETS)  // keep the index in bounds for the verifier
        bucket = MAX_BUCKETS - 1;
    __sync_fetch_and_add(&payment_counts[bucket], 1);

    return 0;
}

SEC("uretprobe/process_payment")
int trace_payment_ret(struct pt_regs *ctx) {
    int result = PT_REGS_RC(ctx);

    if (result == 0) {
        __sync_fetch_and_add(&successful_payments, 1);
    } else {
        __sync_fetch_and_add(&failed_payments, 1);
    }

    return 0;
}

Visibility: Payment success rate without modifying application code.

The Challenges: What Nobody Tells You

Challenge 1: eBPF Development is HARD

Reality: Writing eBPF code is way harder than writing normal code.

Why:

  • Kernel programming (memory safety, no stdlib)
  • eBPF verifier restrictions (no loops, limited stack)
  • Debugging is painful (bpf_printk is your only friend)
  • Documentation is scattered

Our solution: Build abstractions. Don’t write raw eBPF unless necessary.
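To make that concrete, here is a hedged sketch of the kind of wrapper we mean - a thin Go layer (names and signatures are illustrative, not a real library) that hides the load/attach/read plumbing so most engineers never touch raw eBPF:

package tracing

import (
    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/cilium/ebpf/ringbuf"
)

// Probe bundles a loaded collection, an attached uprobe, and its event stream.
type Probe struct {
    coll   *ebpf.Collection
    link   link.Link
    reader *ringbuf.Reader
}

// AttachUprobe loads a compiled object, attaches the named program to a
// symbol in binPath, and returns a reader over the object's "events" ringbuf.
func AttachUprobe(objPath, binPath, symbol, progName string) (*Probe, error) {
    coll, err := ebpf.LoadCollection(objPath)
    if err != nil {
        return nil, err
    }

    exe, err := link.OpenExecutable(binPath)
    if err != nil {
        coll.Close()
        return nil, err
    }
    up, err := exe.Uprobe(symbol, coll.Programs[progName], nil)
    if err != nil {
        coll.Close()
        return nil, err
    }

    rd, err := ringbuf.NewReader(coll.Maps["events"])
    if err != nil {
        up.Close()
        coll.Close()
        return nil, err
    }
    return &Probe{coll: coll, link: up, reader: rd}, nil
}

// Next blocks until the next raw event payload arrives from the kernel.
func (p *Probe) Next() ([]byte, error) {
    rec, err := p.reader.Read()
    if err != nil {
        return nil, err
    }
    return rec.RawSample, nil
}

// Close detaches the probe and releases all resources.
func (p *Probe) Close() {
    p.reader.Close()
    p.link.Close()
    p.coll.Close()
}

Application teams then call AttachUprobe and range over Next(), while the eBPF-specific details stay in one place.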

Challenge 2: Kernel Upgrades Break Things

Scenario: Upgraded from kernel 5.10 → 5.15

Result: Half our eBPF programs stopped working.

Cause: Kernel structure changes broke field offsets.

Solution: CO-RE helped, but we still had to update programs for significant kernel changes.

Challenge 3: Performance Can Degrade

Surprise: eBPF isn’t always fast.

Example: Our first network tracing program:

// BAD: This slowed packet processing by 15%
SEC("tc")
int trace_every_packet(struct __sk_buff *skb) {
    // Extract IP header
    struct iphdr ip;
    bpf_skb_load_bytes(skb, 0, &ip, sizeof(ip));
    
    // Extract TCP header  
    struct tcphdr tcp;
    bpf_skb_load_bytes(skb, sizeof(ip), &tcp, sizeof(tcp));
    
    // Create event
    struct event_t event = { /* ... */ };
    events.perf_submit(skb, &event, sizeof(event));
    
    return TC_ACT_OK;
}

Problem: Submitting an event for every packet overwhelmed the perf buffer and added per-packet overhead.

Fixed version (sampling + filtering):

// GOOD: Sample 1% of packets, filter by port
SEC("tc")
int trace_packets_sampled(struct __sk_buff *skb) {
    // Sample 1% of traffic
    if (bpf_get_prandom_u32() % 100 != 0)
        return TC_ACT_OK;
        
    // Filter: only trace HTTP/HTTPS
    // (offset simplified: assumes no L2 header and a fixed 20-byte IP header)
    struct tcphdr tcp;
    bpf_skb_load_bytes(skb, sizeof(struct iphdr), &tcp, sizeof(tcp));
    
    u16 port = bpf_ntohs(tcp.dest);
    if (port != 80 && port != 443)
        return TC_ACT_OK;
    
    // Now create event (only for 1% of HTTP/HTTPS traffic)
    struct event_t event = { /* ... */ };
    events.perf_submit(skb, &event, sizeof(event));
    
    return TC_ACT_OK;
}

Result: Overhead dropped from 15% to <1%.

Production Results: 6 Months Later

Cost Savings

Before (DataDog):

  • Monthly cost: $18,400
  • Annual cost: $220,800

After (eBPF + Prometheus + Grafana):

  • Infrastructure cost: $1,200/month (storage + compute)
  • Development cost: $40K (amortized over 3 years)
  • Annual cost: $14,400 + ~$13K ≈ $27,400

Savings: $193,400/year (88% reduction)

Performance Gains

Application performance:

  • CPU usage: Down 25% (no APM agent overhead)
  • Memory usage: Down 8%
  • Network traffic: Down 12 MB/s per node

Observability improvements:

  • Metric resolution: 1 second (was 10 seconds)
  • Custom metrics: Unlimited (was $500/metric/year)
  • Debugging depth: Kernel-level visibility

Incidents Detected

6 months of production eBPF:

  • 3 cryptomining attempts detected
  • 1 kernel panic root-caused (saved weeks of debugging)
  • 12 network issues identified (latency spikes, packet loss)
  • 47 application bugs found (memory leaks, slow queries)

Lessons for Teams Considering eBPF

✅ Do This:

  1. Start simple: Use existing tools (Cilium, Pixie) before writing custom eBPF
  2. Focus on high-value use cases: Network visibility, security monitoring
  3. Use CO-RE: Don’t compile per-kernel-version
  4. Sample aggressively: Don’t trace every event
  5. Invest in tooling: Build good development/debugging tools

❌ Don’t Do This:

  1. Rewrite everything in eBPF: Use it where it provides unique value
  2. Ignore kernel compatibility: Test across kernel versions
  3. Skip performance testing: eBPF can be slow if done wrong
  4. Write raw eBPF for everything: Use libraries/frameworks
  5. Neglect security: eBPF programs run in kernel - be careful

What’s Next?

We’re exploring:

  1. eBPF-based service mesh: Replace Envoy sidecars
  2. Runtime security: Detect and block threats at kernel level
  3. Performance profiling: Continuous profiling with eBPF
  4. Cost optimization: Track resource usage per tenant

eBPF transformed our observability from “expensive and slow” to “cheap and comprehensive.” But it requires deep Linux knowledge and careful engineering.

For more on eBPF’s role in cloud-native observability, see the comprehensive eBPF guide that helped inform our implementation.


Running eBPF in production? Connect on LinkedIn or share your eBPF stories on Twitter.