The $850K Rate Limiting Mistake: When Token Buckets Aren't Enough

How we learned the hard way that enterprise rate limiting requires more than basic algorithms—featuring bot attacks, Redis failures, and a very expensive Black Friday.

The Day Our Rate Limiting Failed Spectacularly

Black Friday 2024. 2:17 AM EST. Our monitoring dashboard lit up like a Christmas tree. API requests: 2.4M/minute. Normal peak traffic: 180K/minute.

We had rate limiting. Token bucket algorithm. Redis-backed. Industry standard. It was completely useless.

By the time we understood what was happening, the damage was done:

  • $127K in AWS overage charges (4 hours)
  • $340K in customer credits (SLA violations)
  • $383K in lost revenue (checkout failures)
  • 47% of customers affected
  • 12-hour incident response

Total cost: $850,000

This is the story of how “good enough” rate limiting nearly destroyed our business, and what we built to replace it.

What We Had: The Industry Standard Approach

Before that night, we were proud of our rate limiting:

-- Token bucket in Redis (Lua script)
-- KEYS[1] = bucket key; ARGV[1] = tokens requested; ARGV[2] = refill rate
-- (tokens/second); ARGV[3] = bucket capacity; ARGV[4] = current unix time
local key = KEYS[1]
local tokens = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local available = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

-- First request for this key: start with a full bucket
if available == nil then
  available = capacity
  last_refill = now
end

-- Refill based on time elapsed since the last request
available = math.min(capacity, available + (now - last_refill) * rate)

if available >= tokens then
  redis.call('HSET', key, 'tokens', available - tokens, 'ts', now)
  redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)  -- drop idle buckets
  return 1
end

redis.call('HSET', key, 'tokens', available, 'ts', now)
return 0
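
For context, a script like this is loaded once and then invoked on every request. A minimal sketch of that call path, assuming redis-py and a TOKEN_BUCKET_LUA string holding the Lua source above (the helper name and key prefix are illustrative, not our exact integration):

# Illustrative only: invoking the token bucket script via redis-py.
# TOKEN_BUCKET_LUA is assumed to hold the Lua source above.
import time
import redis

r = redis.Redis(host="localhost", port=6379)
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(api_key: str, tokens: int = 1,
                  rate: int = 100, capacity: int = 1000) -> bool:
    allowed = token_bucket(
        keys=[f"ratelimit:{api_key}"],
        args=[tokens, rate, capacity, time.time()],
    )
    return allowed == 1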

Our configuration:

  • 100 requests/second per API key
  • 1,000 burst capacity
  • 60-second refill window
  • Redis Cluster with 3 masters, 3 replicas

This worked perfectly for 18 months. Until it didn’t.

The Attack: Sophisticated Bot Network

The attackers were smart. They didn’t just overwhelm our rate limiter—they understood its weaknesses and exploited them systematically.

What They Did

1. Distributed Attack Pattern

  • 47,000 unique IP addresses
  • 890 compromised AWS accounts (stolen credentials)
  • 3,200+ unique API keys (legitimate trial accounts)
  • Geographic distribution across 23 countries

2. Burst Timing

They timed their bursts to align with our token refill windows:

  • Send 1,000 requests in 10 seconds (use burst capacity)
  • Wait 50 seconds (tokens refill)
  • Repeat

Our token bucket saw each burst as “legitimate traffic that occasionally spikes.”

3. Expensive Endpoints

They targeted our most resource-intensive endpoints:

  • /api/v2/reports/generate (15-30 second execution)
  • /api/v2/exports/bulk (ties up workers for 2+ minutes)
  • /api/v2/analytics/compute (CPU-intensive calculations)

The Real Problem

Our rate limiting only counted requests, not resource cost.

A single /reports/generate request consumed 300x the resources of a simple /users/me call, but our rate limiter treated them identically.

The math that broke us:

3,200 API keys × 1,000 burst capacity = 3.2M requests
× ~15 seconds of compute per request (targeting /reports/generate)
= 48M compute-seconds
= 800,000 compute-minutes
= 13,333 compute-hours
In 4 hours.

Our autoscaling couldn’t keep up. We hit AWS account limits. Services crashed.

Our Redis Cluster Also Failed

During the attack, our Redis cluster became the bottleneck:

The cascade:

  1. Rate limiting checks: 2.4M/minute → 40K/second
  2. Redis CPU: 45% → 94% (3 minutes)
  3. Memory pressure: 62% → 89%
  4. Network saturation: 12Gbps → 24Gbps
  5. Replication lag: 0.2s → 45s
  6. Query timeouts: 0% → 38%

When Redis queries started timing out, our application’s fail-open behavior (we couldn’t determine rate limits, so we allowed requests) made everything worse.

Critical mistake: We hadn’t load tested our rate limiting infrastructure at attack volumes.

What We Built: Context-Aware Multi-Tier Rate Limiting

It took us 6 weeks and 3 senior engineers to build what we should have had from the start.

Architecture: Hierarchical Rate Limiting

interface RateLimitPolicy {
  tier: 'free' | 'pro' | 'enterprise';
  limits: {
    requests: { rate: number; burst: number };
    compute: { units: number; window: number };
    cost: { dollars: number; window: number };
  };
  isolation: 'shared' | 'dedicated';
}

// Cost-based rate limiting: relative weight per endpoint
const endpointCosts: Record<string, number> = {
  'GET /users/me': 1,
  'POST /reports/generate': 300,
  'POST /exports/bulk': 500,
  'POST /analytics/compute': 750,
};

// Compute units charged = endpoint weight × observed execution time (seconds)
function calculateCost(endpoint: string, durationMs: number): number {
  const baseCost = endpointCosts[endpoint] ?? 1;
  const executionSeconds = durationMs / 1000;
  return baseCost * executionSeconds;
}

Key Changes

1. Multi-Dimensional Limits

Instead of just counting requests, we now enforce all of the following (a combined check is sketched after the list):

  • Request rate (requests/second)
  • Compute units (weighted by endpoint cost)
  • Concurrent executions (prevents resource exhaustion)
  • Dollar cost (estimated infrastructure spend)
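
As referenced above, a request is only admitted when every dimension has headroom. A minimal sketch of that combined gate (the dataclasses and field names are illustrative; the real counters live in Redis, not process memory):

from dataclasses import dataclass

@dataclass
class Usage:
    requests: int         # requests in the current window
    compute_units: float  # sum of endpoint costs consumed
    concurrent: int       # executions currently in flight
    dollars: float        # estimated infrastructure spend this window

@dataclass
class Limits:
    requests: int
    compute_units: float
    concurrent: int
    dollars: float

def admit(usage: Usage, limits: Limits, endpoint_cost: float) -> bool:
    # Reject if ANY dimension would be exceeded -- this is what stops
    # "few requests, very expensive endpoints" from slipping through.
    return (
        usage.requests + 1 <= limits.requests
        and usage.compute_units + endpoint_cost <= limits.compute_units
        and usage.concurrent + 1 <= limits.concurrent
        and usage.dollars < limits.dollars
    )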

2. Adaptive Limits Based on System Health

def get_rate_limit(user_tier, current_system_load):
    base_limit = TIER_LIMITS[user_tier]
    
    # Reduce limits when system is stressed
    if current_system_load > 0.8:
        return base_limit * 0.5
    elif current_system_load > 0.6:
        return base_limit * 0.75
    
    return base_limit

# System load metric combines:
# - API latency P99
# - Database connection pool usage
# - Worker queue depth
# - Error rate
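
For reference, a rough sketch of how current_system_load might be blended from those signals; the weights, budgets, and parameter names here are assumptions, not our tuned values:

def get_system_load(p99_latency_ms, db_pool_usage, queue_depth, error_rate,
                    latency_budget_ms=500, queue_budget=1000):
    # Normalize each signal to roughly 0..1, then take a weighted blend.
    signals = {
        "latency": min(p99_latency_ms / latency_budget_ms, 1.0),
        "db_pool": db_pool_usage,                 # already a 0..1 ratio
        "queue":   min(queue_depth / queue_budget, 1.0),
        "errors":  min(error_rate / 0.05, 1.0),   # 5% errors = fully stressed
    }
    weights = {"latency": 0.35, "db_pool": 0.25, "queue": 0.25, "errors": 0.15}
    return sum(weights[name] * value for name, value in signals.items())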

3. Behavioral Analysis

We added anomaly detection:

def is_suspicious(api_key, request_pattern):
    # Baseline: how this key has historically behaved
    normal_pattern = get_historical_pattern(api_key)

    # Check for anomalies against the baseline and fixed thresholds
    suspicious_signals = [
        request_pattern.burst_ratio > 10 * max(normal_pattern.burst_ratio, 1),  # Excessive bursting
        request_pattern.expensive_endpoints > 0.8,  # Mostly costly calls
        request_pattern.geographic_spread > 5,      # Too many regions
        request_pattern.failure_tolerance > 0.9,    # Continues despite errors
    ]

    return sum(suspicious_signals) >= 2

4. Multiple Redis Layers

Instead of one Redis cluster handling everything:

  • L1 Cache: Local in-memory (LRU, 10K keys, 1ms)
  • L2 Cache: Redis Cluster (distributed, 10M keys, 3ms)
  • L3 Fallback: DynamoDB (persistent, unlimited, 50ms)

When Redis is overloaded, we fall back to DynamoDB with cached values.
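
A simplified sketch of that lookup order; the clients, table name, and key layout are illustrative (cachetools for L1, redis-py for L2, boto3 for L3):

import boto3
import redis
from cachetools import TTLCache

local_lru = TTLCache(maxsize=10_000, ttl=1)               # L1: per-process
redis_client = redis.Redis(host="localhost", port=6379)   # L2: shared cluster
dynamo_table = boto3.resource("dynamodb").Table("rate-limit-state")  # L3: fallback

def get_rate_limit_state(key: str):
    # L1: process-local cache (fastest, but per-instance and short-lived)
    if key in local_lru:
        return local_lru[key]

    # L2: Redis (shared, low-latency)
    try:
        state = redis_client.get(key)
        if state is not None:
            local_lru[key] = state
            return state
    except redis.RedisError:
        pass  # Redis overloaded or unreachable: fall through to L3

    # L3: DynamoDB fallback (slower, but survives a Redis outage)
    item = dynamo_table.get_item(Key={"pk": key}).get("Item")
    state = item.get("state") if item else None
    if state is not None:
        local_lru[key] = state
    return state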

The Results: What Actually Works

We’ve been running this system for 9 months. Here’s what changed:

Attack Mitigation

Before:

  • Detected attacks: 12 minutes after start (manual)
  • Mitigation: 47 minutes (manual API key suspension)
  • Damage: $850K

After:

  • Detected attacks: 23 seconds (automated)
  • Mitigation: 1.2 minutes (automatic rate limit reduction)
  • Largest attack since: 890K req/min (blocked at 12% of normal capacity usage)
  • Cost of blocked attack: $0

Performance Impact

Our rate limiting overhead:

  • P50 latency: 0.8ms (L1 cache hit)
  • P95 latency: 2.4ms (L2 cache hit)
  • P99 latency: 4.1ms (L2 cache miss, sync to L1)
  • P99.9 latency: 52ms (L3 fallback)

Cost Optimization

Interesting side effect—cost-based rate limiting helped legitimate users too:

User behavior changes:

  • 34% reduction in expensive report generation requests
  • Users switched to cheaper incremental queries
  • Monthly infrastructure costs: down 23% ($47K/month)

Our pricing now reflects actual resource cost, and customers optimize their usage accordingly.

False Positive Rate

The hardest part was tuning behavioral analysis:

Initial deployment:

  • False positive rate: 8.7%
  • Legitimate traffic blocked: 140K requests/day
  • Customer complaints: 23/day

After 6 months of tuning:

  • False positive rate: 0.13%
  • Legitimate traffic blocked: 2.1K requests/day
  • Customer complaints: 0-1/day

We use machine learning to continuously refine the behavioral models.

What We Learned (The Expensive Way)

1. Token Buckets Are Necessary But Not Sufficient

Token bucket algorithms are great for smoothing traffic, but they treat all requests equally. In the real world:

  • Some requests cost 1000x more than others
  • Attackers understand your algorithms better than you do
  • “Industry standard” means “attackers have already found the weaknesses”

2. Rate Limiting Infrastructure Must Be Overprovisioned

We underestimated our rate limiting infrastructure needs by 10x:

Our mistake:

  • Sized Redis for “2x peak legitimate traffic”
  • Assumed attacks would be simple volumetric

Reality:

  • Attacks can be 20x peak traffic
  • Rate limiting itself becomes the bottleneck
  • Need 10x headroom on rate limiting infrastructure

3. Fail-Closed is Better Than Fail-Open

Our original design: “If we can’t determine rate limits, allow the request.”

Better approach (sketched after this list):

  • Temporary rate limit reduction on infrastructure failures
  • Cached rate limit decisions (stale data is better than no data)
  • Degraded service is better than no service
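
Concretely, the degraded path can look like the sketch below. rate_limiter, decision_cache, local_counter, get_tier, and RateLimiterUnavailable are placeholders for internal components, and the 25% factor is illustrative:

DEGRADED_LIMIT_FACTOR = 0.25  # much tighter limits while infrastructure is unhealthy
MAX_CACHED_DECISION_AGE = 30  # seconds; stale data beats no data, but not forever

def check_rate_limit(api_key: str, cost: float) -> bool:
    try:
        # Normal path: Redis-backed, cost-aware decision
        return rate_limiter.check(api_key, cost)
    except RateLimiterUnavailable:
        # Fail closed, but gracefully: reuse the last known decision if recent...
        cached = decision_cache.get(api_key)
        if cached is not None and cached.age_seconds < MAX_CACHED_DECISION_AGE:
            return cached.allowed
        # ...otherwise enforce a reduced local limit instead of letting everything through
        tier_limit = TIER_LIMITS[get_tier(api_key)]
        return local_counter.increment(api_key, cost) <= tier_limit * DEGRADED_LIMIT_FACTOR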

4. Behavioral Analysis Catches What Algorithms Miss

Our most effective defense wasn’t better algorithms—it was understanding normal behavior:

  • Legitimate users have consistent patterns
  • Bots exhibit statistical anomalies
  • Geographic spread, timing patterns, error tolerance

The key insight: Attackers optimize for your rate limiter. They can’t optimize for being indistinguishable from legitimate users.

5. Observability is Critical

We were blind during the attack. Now we monitor (instrumentation sketched after this list):

  • Rate limiting decision latency (per cache tier)
  • False positive/negative rates
  • Cost per API key (actual vs. allowed)
  • Behavioral anomaly scores
  • Infrastructure health (Redis, DynamoDB)
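
As an example of that instrumentation, a sketch using prometheus_client; the metric and label names are illustrative:

from prometheus_client import Counter, Histogram

RATE_LIMIT_DECISIONS = Counter(
    "rate_limit_decisions_total",
    "Rate limiting decisions by tier, outcome, and reason",
    ["tier", "outcome", "reason"],   # outcome: allowed|blocked; reason: quota|behavioral|degraded
)
DECISION_LATENCY = Histogram(
    "rate_limit_decision_seconds",
    "Latency of rate limiting decisions by cache tier",
    ["cache_tier"],                  # l1, l2, l3
)

# Inside the limiter's hot path:
with DECISION_LATENCY.labels(cache_tier="l2").time():
    allowed = check_rate_limit(api_key, cost)
RATE_LIMIT_DECISIONS.labels(
    tier="pro",
    outcome="allowed" if allowed else "blocked",
    reason="quota",
).inc()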

Dashboard metrics that saved us:

  • “Requests blocked by behavioral analysis”: Caught 3 attacks in first week
  • “Cost per API key trending”: Identified abuse before rate limits hit
  • “Cache hit rates by tier”: Optimized cache sizing

Practical Implementation Guide

Want to avoid our mistakes? Here’s what to do:

Start With This Architecture

  1. Multi-tier caching (local → Redis → persistent)
  2. Cost-based limits (not just request counts)
  3. Behavioral analysis (detect anomalies)
  4. Adaptive limits (reduce under load)
  5. Fail-closed (degrade gracefully)

Don’t Overthink It Initially

Start simple, add sophistication:

  • Week 1: Basic cost-weighted limits
  • Month 1: Add local caching
  • Month 2: Add behavioral analysis
  • Month 3: Tune false positive rates
  • Month 6: ML-based optimization

Load Test Your Rate Limiting

We didn’t. It cost $850K. You should (a minimal burst-pattern driver is sketched after this list):

  • Test at 10x peak legitimate traffic
  • Test with Redis failures
  • Test with network partitions
  • Test attack patterns (bursty, distributed, targeted)
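
For the bursty case, a minimal driver might look like the sketch below (aiohttp assumed; the staging URL and header name are placeholders). It reproduces the burst-then-idle rhythm the attackers used:

import asyncio
import aiohttp

API_URL = "https://staging.example.com/api/v2/reports/generate"  # staging, never production
API_KEY = "load-test-key"

async def one_burst(session, burst=1000, burst_window=10):
    # Fire `burst` requests spread across `burst_window` seconds.
    async def fire():
        async with session.post(API_URL, headers={"X-API-Key": API_KEY}) as resp:
            return resp.status

    tasks = []
    for _ in range(burst):
        tasks.append(asyncio.create_task(fire()))
        await asyncio.sleep(burst_window / burst)
    return await asyncio.gather(*tasks, return_exceptions=True)

async def main():
    for cycle in range(5):
        async with aiohttp.ClientSession() as session:
            statuses = await one_burst(session)
        blocked = sum(1 for s in statuses if s == 429)
        print(f"cycle {cycle}: {blocked}/{len(statuses)} requests blocked")
        await asyncio.sleep(50)  # go quiet so tokens refill, then burst again

asyncio.run(main())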

Monitor Everything

Critical metrics:

  • Rate limiting decision latency (P50, P95, P99)
  • Cache hit rates (L1, L2, L3)
  • False positive rate (daily)
  • Cost per user tier (actual vs. limit)
  • Infrastructure health (Redis CPU, memory, network)

The Bottom Line

Simple rate limiting is easy. Enterprise rate limiting is hard.

Our $850K lesson: Don’t wait for an attack to discover your rate limiting weaknesses. The attackers have already found them.

Build sophisticated rate limiting from the start:

  • Cost-aware limits
  • Behavioral analysis
  • Adaptive responses
  • Overprovisioned infrastructure
  • Comprehensive monitoring

Or learn the expensive way, like we did.


Building rate limiting for enterprise APIs? Let’s talk about implementation strategies before you learn these lessons the expensive way.