The $18K/Month APM Bill That Started Everything
Q1 2024. Our DataDog bill: $18,400/month. And climbing.
Worse: Our monitoring was killing our performance.
- APM agents: 15-40% CPU overhead per pod
- Metric collection: 2-8% memory overhead
- Network impact: 12MB/s egress per node (just telemetry!)
- Kubernetes performance: 25% degradation from sidecar overhead
Our VP of Engineering: “We’re spending $220K/year to slow down our platform. Find a better way.”
After reading about eBPF transforming cloud observability, I proposed a radical idea: Replace traditional APM with eBPF-based observability.
My team thought I was insane. They were partially right.
What eBPF Actually Is (And Isn’t)
The “Extended Berkeley Packet Filter” Explained
Traditional monitoring: Inject agents into applications, collect data, export to backend
eBPF approach: Run sandboxed programs inside the Linux kernel that observe everything
Key insight: eBPF sees what the kernel sees - every syscall, every network packet, every function call.
The magic:
- No application instrumentation needed
- Near-zero overhead (<2% CPU)
- Kernel-level visibility
- Safe execution (verified before loading)
What eBPF Can Observe
// Example: Track all TCP connections (BCC-style program)
#include <net/sock.h>

struct event_t {
    u64 timestamp;
    u32 src_addr;
    u32 dst_addr;
    u16 dst_port;
};

BPF_HASH(connections, struct sock *, u64);
BPF_PERF_OUTPUT(events);

int trace_tcp_connect(struct pt_regs *ctx, struct sock *sk) {
    u64 ts = bpf_ktime_get_ns();
    connections.update(&sk, &ts);

    // Extract connection details
    u16 dport = sk->__sk_common.skc_dport;
    u32 saddr = sk->__sk_common.skc_rcv_saddr;
    u32 daddr = sk->__sk_common.skc_daddr;

    // Submit event to userspace
    struct event_t event = {
        .timestamp = ts,
        .src_addr = saddr,
        .dst_addr = daddr,
        .dst_port = ntohs(dport)
    };
    events.perf_submit(ctx, &event, sizeof(event));
    return 0;
}
This runs in the kernel. No agent required.
Phase 1: The Proof of Concept (Weeks 1-2)
Challenge: Prove eBPF Can Replace DataDog
We picked a single service: payment-api (high-value, high-traffic).
Metrics we needed to replicate:
- Request rate, latency, error rate (RED metrics)
- CPU, memory, network usage
- Custom business metrics (payment success rate)
Attempt 1: BPFTrace (Educational, Not Production)
# Track HTTP requests (quick prototype)
bpftrace -e '
kprobe:tcp_sendmsg /comm == "payment-api"/ {
    @bytes[comm] = hist(arg2);
}

interval:s:5 {
    print(@bytes);
    clear(@bytes);
}
'
Result: Worked! But BPFTrace is not production-ready:
- No persistent storage
- Manual script execution
- No multi-node aggregation
- No alerting
Attempt 2: Custom eBPF + Prometheus Exporter
We wrote a proper eBPF program using libbpf:
// payment_observer.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct http_event {
    __u32 pid;
    __u64 timestamp;
    __u32 status_code;
    __u64 duration_ns;
    char path[64];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

// Per-request start timestamps, keyed by pid_tgid, so the end probe
// can compute request duration
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);
    __type(value, __u64);
} start_ts SEC(".maps");

// Trace HTTP request start
SEC("uprobe/http_server_start")
int trace_request_start(struct pt_regs *ctx) {
    struct http_event *event;

    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event)
        return 0;

    __u64 id = bpf_get_current_pid_tgid();
    __u64 ts = bpf_ktime_get_ns();

    event->pid = id >> 32;
    event->timestamp = ts;

    // Remember when this request started
    bpf_map_update_elem(&start_ts, &id, &ts, BPF_ANY);

    // Read request path from userspace memory
    void *path_ptr = (void *)PT_REGS_PARM1(ctx);
    bpf_probe_read_user_str(&event->path, sizeof(event->path), path_ptr);

    bpf_ringbuf_submit(event, 0);
    return 0;
}

// Trace HTTP request end
SEC("uprobe/http_server_end")
int trace_request_end(struct pt_regs *ctx) {
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *startp = bpf_map_lookup_elem(&start_ts, &id);
    if (!startp)
        return 0;

    struct http_event *event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event) {
        bpf_map_delete_elem(&start_ts, &id);
        return 0;
    }

    event->pid = id >> 32;
    event->timestamp = *startp;
    event->status_code = (__u32)PT_REGS_PARM1(ctx);
    event->duration_ns = bpf_ktime_get_ns() - *startp;

    bpf_map_delete_elem(&start_ts, &id);
    bpf_ringbuf_submit(event, 0);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
Userspace exporter (Go):
// prometheus_exporter.go
package main

import (
    "bytes"
    "encoding/binary"
    "log"
    "strconv"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
    "github.com/prometheus/client_golang/prometheus"
)

// HttpEvent mirrors struct http_event from the BPF program
type HttpEvent struct {
    Pid        uint32
    _          [4]byte // struct padding
    Timestamp  uint64
    StatusCode uint32
    _          [4]byte // struct padding
    DurationNs uint64
    Path       [64]byte
}

type MetricsCollector struct {
    requestDuration *prometheus.HistogramVec
    requestTotal    *prometheus.CounterVec
    errorTotal      *prometheus.CounterVec
}

func (m *MetricsCollector) collectFromEBPF(eventsMap *ebpf.Map) {
    // Read events from the eBPF ring buffer
    reader, err := ringbuf.NewReader(eventsMap)
    if err != nil {
        log.Fatal(err)
    }
    defer reader.Close()

    for {
        record, err := reader.Read()
        if err != nil {
            continue
        }

        var event HttpEvent
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &event); err != nil {
            continue
        }

        // Update Prometheus metrics
        labels := prometheus.Labels{
            "path":   string(bytes.TrimRight(event.Path[:], "\x00")),
            "status": strconv.Itoa(int(event.StatusCode)),
        }
        m.requestTotal.With(labels).Inc()
        m.requestDuration.With(labels).Observe(float64(event.DurationNs) / 1e9)

        if event.StatusCode >= 500 {
            m.errorTotal.With(labels).Inc()
        }
    }
}
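For completeness, here's roughly how the pieces wire together with cilium/ebpf - a sketch rather than our exact production code. The object file name, binary path, metric names, and port below are stand-ins:

// main.go - load the BPF object, attach the uprobes, register the RED
// metrics, and expose them for Prometheus to scrape (illustrative sketch)
package main

import (
    "log"
    "net/http"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Load the compiled BPF object (assumed name: payment_observer.bpf.o)
    coll, err := ebpf.LoadCollection("payment_observer.bpf.o")
    if err != nil {
        log.Fatalf("loading BPF collection: %v", err)
    }
    defer coll.Close()

    // Attach the uprobes to the service binary (path is an assumption)
    exe, err := link.OpenExecutable("/usr/local/bin/payment-api")
    if err != nil {
        log.Fatalf("opening executable: %v", err)
    }
    start, err := exe.Uprobe("http_server_start", coll.Programs["trace_request_start"], nil)
    if err != nil {
        log.Fatalf("attaching start uprobe: %v", err)
    }
    defer start.Close()
    end, err := exe.Uprobe("http_server_end", coll.Programs["trace_request_end"], nil)
    if err != nil {
        log.Fatalf("attaching end uprobe: %v", err)
    }
    defer end.Close()

    // Register the RED metrics (metric names here are illustrative)
    m := &MetricsCollector{
        requestDuration: prometheus.NewHistogramVec(
            prometheus.HistogramOpts{Name: "http_request_duration_seconds"},
            []string{"path", "status"}),
        requestTotal: prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "http_requests_total"},
            []string{"path", "status"}),
        errorTotal: prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "http_errors_total"},
            []string{"path", "status"}),
    }
    prometheus.MustRegister(m.requestDuration, m.requestTotal, m.errorTotal)

    // Consume ring buffer events in the background and serve /metrics
    go m.collectFromEBPF(coll.Maps["events"])
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9435", nil))
}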
Result:
- Worked! Replicated DataDog metrics
- Overhead: 1.8% CPU (vs. 28% with DataDog agent)
- Cost: $0 (self-hosted)
Phase 2: Production Rollout (Weeks 3-6)
The Kernel Compatibility Nightmare
Problem: Our production clusters ran 4 different kernel versions:
- Kernel 4.15 (old but stable)
- Kernel 5.4 (LTS)
- Kernel 5.10 (newer LTS)
- Kernel 5.15 (latest)
An eBPF program compiled against one kernel's struct layouts generally won't load, or will read the wrong fields, on another.
Solution: CO-RE (Compile Once, Run Everywhere)
// BPF CO-RE enabled code
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_sendmsg")
int trace_tcp_send(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);

    // CO-RE: relocations adapt these reads to the running kernel's struct layout
    u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    u32 saddr = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);

    // Rest of the code...
    return 0;
}
This single binary works across all our kernel versions. Magic!
Challenge: Multi-Node Aggregation
eBPF runs per-node. We needed cluster-wide metrics.
Architecture:
┌─────────────────────────────────────────────┐
│ Prometheus │
│ (Central Aggregation) │
└─────────────────┬───────────────────────────┘
│ scrape
┌───────────┼───────────┐
│ │ │
┌───▼────┐ ┌──▼─────┐ ┌──▼─────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ eBPF │ │ eBPF │ │ eBPF │
│Exporter│ │Exporter│ │Exporter│
└────────┘ └────────┘ └────────┘
│ │ │
┌───▼────┐ ┌──▼─────┐ ┌──▼─────┐
│ Kernel │ │ Kernel │ │ Kernel │
│ eBPF │ │ eBPF │ │ eBPF │
│Programs│ │Programs│ │Programs│
└────────┘ └────────┘ └────────┘
Deployment: DaemonSet with one exporter per node.
The Performance Benchmark
We ran load tests comparing overhead:
| Monitoring Solution | CPU Overhead | Memory Overhead | Network Overhead |
|---|---|---|---|
| DataDog APM | 28% | 8% | 12 MB/s/node |
| Dynatrace | 22% | 6% | 9 MB/s/node |
| New Relic | 31% | 11% | 15 MB/s/node |
| eBPF (ours) | 1.8% | 0.4% | 0.8 MB/s/node |
Winner: eBPF by a landslide.
The Debugging Win: Kernel Panic Root Cause
2 months into eBPF deployment, we hit a mysterious kernel panic.
Symptom: Random node crashes, 1-2 per day. No pattern.
Traditional tools: Useless. A panicking kernel doesn't leave application-level logs behind.
eBPF saved us: We had tracing enabled.
The Investigation
We analyzed eBPF traces from crashed nodes:
# Query the eBPF trace events captured just before the crash
bpftool prog tracelog | tail -10000 | grep tcp_sendmsg
# Found this pattern right before every crash:
[1234567.890] tcp_sendmsg: invalid socket state
[1234567.891] tcp_sendmsg: sk_state=7 (CLOSED)
[1234567.892] WARN: use after free detected
[1234567.893] --- KERNEL PANIC ---
Root cause: A bug in our custom TCP connection pooling code was closing sockets that were still in use.
Fix: 10 lines of code. Would have taken weeks to debug without eBPF tracing.
Advanced Use Cases We Built
1. Network Latency Tracking
Traditional approach: Instrument every service. eBPF approach: Track it at the kernel level.
// Track TCP round-trip time
#include "bits.bpf.h"   /* log2l() helper, shipped with libbpf-tools */
#define MAX_SLOTS 32

/* Global array; libbpf exposes it to userspace via the .bss map */
static __u64 hist[MAX_SLOTS];

SEC("kprobe/tcp_rcv_established")
int trace_tcp_rtt(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    struct tcp_sock *tp = (struct tcp_sock *)sk;   /* tcp_sock embeds sock */
    u32 srtt_us = BPF_CORE_READ(tp, srtt_us) >> 3; // smoothed RTT in microseconds

    // Update the log2 histogram
    u64 slot = log2l(srtt_us);
    if (slot >= MAX_SLOTS)
        slot = MAX_SLOTS - 1;
    __sync_fetch_and_add(&hist[slot], 1);
    return 0;
}
Result: Network latency visibility for every TCP connection, with zero application changes.
2. Security Monitoring
// Detect suspicious syscalls
struct alert_t {
    u32 pid;
    u64 timestamp;
    char command[64];
    char process[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
} alerts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_exec(struct trace_event_raw_sys_enter *ctx) {
    struct alert_t alert = {};

    alert.pid = bpf_get_current_pid_tgid() >> 32;
    alert.timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&alert.process, sizeof(alert.process));

    // Copy the exec'd path from userspace memory
    const char *filename = (const char *)ctx->args[0];
    bpf_probe_read_user_str(&alert.command, sizeof(alert.command), filename);

    // There is no libc strstr() inside BPF, so ship every exec event and
    // flag suspicious commands (/bin/sh, nc, curl, ...) in userspace
    bpf_perf_event_output(ctx, &alerts, BPF_F_CURRENT_CPU, &alert, sizeof(alert));
    return 0;
}
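Because the substring matching can't happen inside the BPF program, it lives in the userspace consumer. A minimal sketch of that side, assuming the alerts perf map and alert_t layout above (not our exact code):

// exec_alerts.go - userspace matcher for exec events (illustrative sketch)
package main

import (
    "bytes"
    "encoding/binary"
    "log"
    "os"
    "strings"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/perf"
)

// Alert mirrors struct alert_t from the BPF program
type Alert struct {
    Pid       uint32
    _         [4]byte // struct padding
    Timestamp uint64
    Command   [64]byte
    Process   [16]byte
}

// Commands we flag as suspicious
var suspicious = []string{"/bin/sh", "nc", "curl"}

func watchExecs(alertsMap *ebpf.Map) {
    rd, err := perf.NewReader(alertsMap, os.Getpagesize())
    if err != nil {
        log.Fatal(err)
    }
    defer rd.Close()

    for {
        record, err := rd.Read()
        if err != nil {
            continue
        }

        var a Alert
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &a); err != nil {
            continue
        }

        cmd := string(bytes.TrimRight(a.Command[:], "\x00"))
        for _, s := range suspicious {
            if strings.Contains(cmd, s) {
                proc := string(bytes.TrimRight(a.Process[:], "\x00"))
                log.Printf("ALERT: suspicious exec pid=%d process=%s cmd=%q", a.Pid, proc, cmd)
                break
            }
        }
    }
}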
Caught: 3 cryptomining attempts in first month.
3. Custom Business Metrics
// Track payment success rate (application-specific)
#define MAX_BUCKETS 128

/* Globals land in the .bss map, where the exporter reads them */
__u64 payment_counts[MAX_BUCKETS];
__u64 successful_payments;
__u64 failed_payments;

SEC("uprobe/process_payment")
int trace_payment(struct pt_regs *ctx) {
    /* struct payment_info comes from the application's own headers */
    struct payment_info *payment = (struct payment_info *)PT_REGS_PARM1(ctx);

    u64 amount = 0;
    bpf_probe_read_user(&amount, sizeof(amount), &payment->amount);

    // Track payments by amount bucket ($100 buckets; amounts are in cents)
    u64 bucket = amount / 10000;
    if (bucket >= MAX_BUCKETS)
        bucket = MAX_BUCKETS - 1;
    __sync_fetch_and_add(&payment_counts[bucket], 1);
    return 0;
}

SEC("uretprobe/process_payment")
int trace_payment_ret(struct pt_regs *ctx) {
    int result = PT_REGS_RC(ctx);
    if (result == 0)
        __sync_fetch_and_add(&successful_payments, 1);
    else
        __sync_fetch_and_add(&failed_payments, 1);
    return 0;
}
Visibility: Payment success rate without modifying application code.
The Challenges: What Nobody Tells You
Challenge 1: eBPF Development is HARD
Reality: Writing eBPF code is way harder than writing normal code.
Why:
- Kernel programming (memory safety, no stdlib)
- eBPF verifier restrictions (no loops, limited stack)
- Debugging is painful (bpf_printk is your only friend)
- Documentation is scattered
Our solution: Build abstractions. Don’t write raw eBPF unless necessary.
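As an example of what "build abstractions" means in practice, here's a hedged sketch using cilium/ebpf's bpf2go code generator. The observer identifier and file names are assumptions, not our real layout:

// observer.go - thin abstraction over generated bindings (illustrative)
// bpf2go compiles the C source and generates typed Go bindings, so feature
// code never handles raw file descriptors, map names, or verifier output.
package observer

//go:generate go run github.com/cilium/ebpf/cmd/bpf2go observer payment_observer.bpf.c

// Load opens the generated objects; callers just defer the returned cleanup.
func Load() (*observerObjects, func(), error) {
    objs := &observerObjects{} // generated by bpf2go
    if err := loadObserverObjects(objs, nil); err != nil {
        return nil, nil, err
    }
    return objs, func() { _ = objs.Close() }, nil
}

Feature teams call Load(), attach what they need, and defer the cleanup; the eBPF plumbing stays in one place.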
Challenge 2: Kernel Upgrades Break Things
Scenario: Upgraded from kernel 5.10 → 5.15
Result: Half our eBPF programs stopped working.
Cause: Kernel structure changes broke field offsets.
Solution: CO-RE helped, but we still had to update programs for significant kernel changes.
Challenge 3: Performance Can Degrade
Surprise: eBPF isn’t always fast.
Example: Our first network tracing program:
// BAD: This slowed packet processing by 15%
SEC("tc")
int trace_every_packet(struct __sk_buff *skb) {
    // Extract IP header (skip the Ethernet header first)
    struct iphdr ip;
    bpf_skb_load_bytes(skb, sizeof(struct ethhdr), &ip, sizeof(ip));

    // Extract TCP header (assumes IPv4 with no IP options)
    struct tcphdr tcp;
    bpf_skb_load_bytes(skb, sizeof(struct ethhdr) + sizeof(ip), &tcp, sizeof(tcp));

    // Create and submit an event for EVERY packet
    struct event_t event = { /* ... */ };
    events.perf_submit(skb, &event, sizeof(event));
    return TC_ACT_OK;
}
Problem: Submitting an event for every packet overwhelmed the event buffer and burned CPU on the hot path.
Fixed version (sampling + filtering):
// GOOD: Sample 1% of packets, filter by port
SEC("tc")
int trace_packets_sampled(struct __sk_buff *skb) {
    // Sample 1% of traffic
    if (bpf_get_prandom_u32() % 100 != 0)
        return TC_ACT_OK;

    // Filter: only trace HTTP/HTTPS (assumes IPv4 with no IP options)
    struct tcphdr tcp;
    bpf_skb_load_bytes(skb, sizeof(struct ethhdr) + sizeof(struct iphdr), &tcp, sizeof(tcp));
    u16 port = bpf_ntohs(tcp.dest);
    if (port != 80 && port != 443)
        return TC_ACT_OK;

    // Now create an event (only for the sampled HTTP/HTTPS traffic)
    struct event_t event = { /* ... */ };
    events.perf_submit(skb, &event, sizeof(event));
    return TC_ACT_OK;
}
Result: Overhead dropped from 15% to <1%.
Production Results: 6 Months Later
Cost Savings
Before (DataDog):
- Monthly cost: $18,400
- Annual cost: $220,800
After (eBPF + Prometheus + Grafana):
- Infrastructure cost: $1,200/month (storage + compute)
- Development cost: $40K (amortized over 3 years)
- Annual cost: $14,400 infrastructure + ~$13K amortized development ≈ $27,400
Savings: $193,400/year (88% reduction)
Performance Gains
Application performance:
- CPU usage: Down 25% (no APM agent overhead)
- Memory usage: Down 8%
- Network traffic: Down 12 MB/s per node
Observability improvements:
- Metric resolution: 1 second (was 10 seconds)
- Custom metrics: Unlimited (was $500/metric/year)
- Debugging depth: Kernel-level visibility
Incidents Detected
6 months of production eBPF:
- 3 cryptomining attempts detected
- 1 kernel panic root-caused (saved weeks of debugging)
- 12 network issues identified (latency spikes, packet loss)
- 47 application bugs found (memory leaks, slow queries)
Lessons for Teams Considering eBPF
✅ Do This:
- Start simple: Use existing tools (Cilium, Pixie) before writing custom eBPF
- Focus on high-value use cases: Network visibility, security monitoring
- Use CO-RE: Don’t compile per-kernel-version
- Sample aggressively: Don’t trace every event
- Invest in tooling: Build good development/debugging tools
❌ Don’t Do This:
- Rewrite everything in eBPF: Use it where it provides unique value
- Ignore kernel compatibility: Test across kernel versions
- Skip performance testing: eBPF can be slow if done wrong
- Write raw eBPF for everything: Use libraries/frameworks
- Neglect security: eBPF programs run in the kernel - be careful
What’s Next?
We’re exploring:
- eBPF-based service mesh: Replace Envoy sidecars
- Runtime security: Detect and block threats at kernel level
- Performance profiling: Continuous profiling with eBPF
- Cost optimization: Track resource usage per tenant
eBPF transformed our observability from “expensive and slow” to “cheap and comprehensive.” But it requires deep Linux knowledge and careful engineering.
For more on eBPF’s role in cloud-native observability, see the comprehensive eBPF guide that helped inform our implementation.
Running eBPF in production? Connect on LinkedIn or share your eBPF stories on Twitter.