The $850K Rate Limiting Mistake: When Token Buckets Aren't Enough

How we learned the hard way that enterprise rate limiting requires more than basic algorithms—featuring bot attacks, Redis failures, and a very expensive Black Friday.

The Day Our Rate Limiting Failed Spectacularly

Black Friday 2024. 2:17 AM EST. Our monitoring dashboard lit up like a Christmas tree. API requests: 2.4M/minute. Normal peak traffic: 180K/minute.

We had rate limiting. Token bucket algorithm. Redis-backed. Industry standard. It was completely useless.

By the time we understood what was happening, the damage was done:

  • $127K in AWS overage charges (4 hours)
  • $340K in customer credits (SLA violations)
  • $383K in lost revenue (checkout failures)
  • 47% of customers affected
  • 12-hour incident response

Total cost: $850,000

This is the story of how “good enough” rate limiting nearly destroyed our business, and what we built to replace it.

What We Had: The Industry Standard Approach

Before that night, we were proud of our rate limiting:

-- Token bucket in Redis (Lua script)
-- KEYS[1] = bucket key; ARGV[1] = tokens requested; ARGV[2] = refill rate
-- (tokens/second); ARGV[3] = bucket capacity; ARGV[4] = current unix time
local key = KEYS[1]
local tokens = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local available = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

-- First request for this key: start with a full bucket
if available == nil then
  available = capacity
  last_refill = now
end

-- Refill based on time elapsed since the last request
available = math.min(capacity, available + (now - last_refill) * rate)

if available >= tokens then
  redis.call('HSET', key, 'tokens', available - tokens, 'ts', now)
  redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)  -- drop idle buckets
  return 1
end

redis.call('HSET', key, 'tokens', available, 'ts', now)
return 0
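
For context, a script like this is loaded once and then invoked on every request. A minimal sketch of that call path, assuming redis-py and a TOKEN_BUCKET_LUA string holding the Lua source above (the helper name and key prefix are illustrative, not our exact integration):

# Illustrative only: invoking the token bucket script via redis-py.
# TOKEN_BUCKET_LUA is assumed to hold the Lua source above.
import time
import redis

r = redis.Redis(host="localhost", port=6379)
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(api_key: str, tokens: int = 1,
                  rate: int = 100, capacity: int = 1000) -> bool:
    allowed = token_bucket(
        keys=[f"ratelimit:{api_key}"],
        args=[tokens, rate, capacity, time.time()],
    )
    return allowed == 1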

Our configuration:

  • 100 requests/second per API key
  • 1,000 burst capacity
  • 60-second refill window
  • Redis Cluster with 3 masters, 3 replicas

This worked perfectly for 18 months. Until it didn’t.

The Attack: Sophisticated Bot Network

The attackers were smart. They didn’t just overwhelm our rate limiter—they understood its weaknesses and exploited them systematically.

What They Did

1. Distributed Attack Pattern

  • 47,000 unique IP addresses
  • 890 compromised AWS accounts (stolen credentials)
  • 3,200+ unique API keys (legitimate trial accounts)
  • Geographic distribution across 23 countries

2. Burst Timing

They timed their bursts to align with our token refill windows:

  • Send 1,000 requests in 10 seconds (use burst capacity)
  • Wait 50 seconds (tokens refill)
  • Repeat

Our token bucket saw each burst as “legitimate traffic that occasionally spikes.”

3. Expensive Endpoints

They targeted our most resource-intensive endpoints:

  • /api/v2/reports/generate (15-30 second execution)
  • /api/v2/exports/bulk (ties up workers for 2+ minutes)
  • /api/v2/analytics/compute (CPU-intensive calculations)

The Real Problem

Our rate limiting only counted requests, not resource cost.

A single /reports/generate request consumed 300x the resources of a simple /users/me call, but our rate limiter treated them identically.

The math that broke us:

3,200 API keys × 1,000 burst capacity = 3.2M requests
× ~15 seconds of compute per request (targeting /reports/generate)
= 48M compute-seconds
= 800,000 compute-minutes
= 13,333 compute-hours
In 4 hours.

Our autoscaling couldn’t keep up. We hit AWS account limits. Services crashed.

Our Redis Cluster Also Failed

During the attack, our Redis cluster became the bottleneck:

The cascade:

  1. Rate limiting checks: 2.4M/minute → 40K/second
  2. Redis CPU: 45% → 94% (3 minutes)
  3. Memory pressure: 62% → 89%
  4. Network saturation: 12Gbps → 24Gbps
  5. Replication lag: 0.2s → 45s
  6. Query timeouts: 0% → 38%

When Redis queries started timing out, our application’s fail-open behavior (we couldn’t determine rate limits, so we allowed requests) made everything worse.

Critical mistake: We hadn’t load tested our rate limiting infrastructure at attack volumes.

What We Built: Context-Aware Multi-Tier Rate Limiting

It took us 6 weeks and 3 senior engineers to build what we should have had from the start.

Architecture: Hierarchical Rate Limiting

interface RateLimitPolicy {
  tier: 'free' | 'pro' | 'enterprise';
  limits: {
    requests: { rate: number; burst: number };
    compute: { units: number; window: number };
    cost: { dollars: number; window: number };
  };
  isolation: 'shared' | 'dedicated';
}

// Cost-based rate limiting: relative weight per endpoint
const endpointCosts: Record<string, number> = {
  'GET /users/me': 1,
  'POST /reports/generate': 300,
  'POST /exports/bulk': 500,
  'POST /analytics/compute': 750,
};

// Compute units charged = endpoint weight × observed execution time (seconds)
function calculateCost(endpoint: string, durationMs: number): number {
  const baseCost = endpointCosts[endpoint] ?? 1;
  const executionSeconds = durationMs / 1000;
  return baseCost * executionSeconds;
}

Key Changes

1. Multi-Dimensional Limits

Instead of just counting requests, we now enforce all of the following (a combined check is sketched after the list):

  • Request rate (requests/second)
  • Compute units (weighted by endpoint cost)
  • Concurrent executions (prevents resource exhaustion)
  • Dollar cost (estimated infrastructure spend)
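
As referenced above, a request is only admitted when every dimension has headroom. A minimal sketch of that combined gate (the dataclasses and field names are illustrative; the real counters live in Redis, not process memory):

from dataclasses import dataclass

@dataclass
class Usage:
    requests: int         # requests in the current window
    compute_units: float  # sum of endpoint costs consumed
    concurrent: int       # executions currently in flight
    dollars: float        # estimated infrastructure spend this window

@dataclass
class Limits:
    requests: int
    compute_units: float
    concurrent: int
    dollars: float

def admit(usage: Usage, limits: Limits, endpoint_cost: float) -> bool:
    # Reject if ANY dimension would be exceeded -- this is what stops
    # "few requests, very expensive endpoints" from slipping through.
    return (
        usage.requests + 1 <= limits.requests
        and usage.compute_units + endpoint_cost <= limits.compute_units
        and usage.concurrent + 1 <= limits.concurrent
        and usage.dollars < limits.dollars
    )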

2. Adaptive Limits Based on System Health

def get_rate_limit(user_tier, current_system_load):
    base_limit = TIER_LIMITS[user_tier]
    
    # Reduce limits when system is stressed
    if current_system_load > 0.8:
        return base_limit * 0.5
    elif current_system_load > 0.6:
        return base_limit * 0.75
    
    return base_limit

# System load metric combines:
# - API latency P99
# - Database connection pool usage
# - Worker queue depth
# - Error rate
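
For reference, a rough sketch of how current_system_load might be blended from those signals; the weights, budgets, and parameter names here are assumptions, not our tuned values:

def get_system_load(p99_latency_ms, db_pool_usage, queue_depth, error_rate,
                    latency_budget_ms=500, queue_budget=1000):
    # Normalize each signal to roughly 0..1, then take a weighted blend.
    signals = {
        "latency": min(p99_latency_ms / latency_budget_ms, 1.0),
        "db_pool": db_pool_usage,                 # already a 0..1 ratio
        "queue":   min(queue_depth / queue_budget, 1.0),
        "errors":  min(error_rate / 0.05, 1.0),   # 5% errors = fully stressed
    }
    weights = {"latency": 0.35, "db_pool": 0.25, "queue": 0.25, "errors": 0.15}
    return sum(weights[name] * value for name, value in signals.items())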

3. Behavioral Analysis

We added anomaly detection:

def is_suspicious(api_key, request_pattern):
    # Baseline: how this key has historically behaved
    normal_pattern = get_historical_pattern(api_key)

    # Check for anomalies against the baseline and fixed thresholds
    suspicious_signals = [
        request_pattern.burst_ratio > 10 * max(normal_pattern.burst_ratio, 1),  # Excessive bursting
        request_pattern.expensive_endpoints > 0.8,  # Mostly costly calls
        request_pattern.geographic_spread > 5,      # Too many regions
        request_pattern.failure_tolerance > 0.9,    # Continues despite errors
    ]

    return sum(suspicious_signals) >= 2

4. Multiple Redis Layers

Instead of one Redis cluster handling everything:

  • L1 Cache: Local in-memory (LRU, 10K keys, 1ms)
  • L2 Cache: Redis Cluster (distributed, 10M keys, 3ms)
  • L3 Fallback: DynamoDB (persistent, unlimited, 50ms)

When Redis is overloaded, we fall back to DynamoDB with cached values.
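
A simplified sketch of that lookup order; the clients, table name, and key layout are illustrative (cachetools for L1, redis-py for L2, boto3 for L3):

import boto3
import redis
from cachetools import TTLCache

local_lru = TTLCache(maxsize=10_000, ttl=1)               # L1: per-process
redis_client = redis.Redis(host="localhost", port=6379)   # L2: shared cluster
dynamo_table = boto3.resource("dynamodb").Table("rate-limit-state")  # L3: fallback

def get_rate_limit_state(key: str):
    # L1: process-local cache (fastest, but per-instance and short-lived)
    if key in local_lru:
        return local_lru[key]

    # L2: Redis (shared, low-latency)
    try:
        state = redis_client.get(key)
        if state is not None:
            local_lru[key] = state
            return state
    except redis.RedisError:
        pass  # Redis overloaded or unreachable: fall through to L3

    # L3: DynamoDB fallback (slower, but survives a Redis outage)
    item = dynamo_table.get_item(Key={"pk": key}).get("Item")
    state = item.get("state") if item else None
    if state is not None:
        local_lru[key] = state
    return state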

The Results: What Actually Works

We’ve been running this system for 9 months. Here’s what changed:

Attack Mitigation

Before:

  • Detected attacks: 12 minutes after start (manual)
  • Mitigation: 47 minutes (manual API key suspension)
  • Damage: $850K

After:

  • Detected attacks: 23 seconds (automated)
  • Mitigation: 1.2 minutes (automatic rate limit reduction)
  • Largest attack since: 890K req/min (blocked at 12% of normal capacity usage)
  • Cost of blocked attack: $0

Performance Impact

Our rate limiting overhead:

  • P50 latency: 0.8ms (L1 cache hit)
  • P95 latency: 2.4ms (L2 cache hit)
  • P99 latency: 4.1ms (L2 cache miss, sync to L1)
  • P99.9 latency: 52ms (L3 fallback)

Cost Optimization

Interesting side effect—cost-based rate limiting helped legitimate users too:

User behavior changes:

  • 34% reduction in expensive report generation requests
  • Users switched to cheaper incremental queries
  • Monthly infrastructure costs: down 23% ($47K/month)

Our pricing now reflects actual resource cost, and customers optimize their usage accordingly.

False Positive Rate

The hardest part was tuning behavioral analysis:

Initial deployment:

  • False positive rate: 8.7%
  • Legitimate traffic blocked: 140K requests/day
  • Customer complaints: 23/day

After 6 months of tuning:

  • False positive rate: 0.13%
  • Legitimate traffic blocked: 2.1K requests/day
  • Customer complaints: 0-1/day

We use machine learning to continuously refine the behavioral models.

What We Learned (The Expensive Way)

1. Token Buckets Are Necessary But Not Sufficient

Token bucket algorithms are great for smoothing traffic, but they treat all requests equally. In the real world:

  • Some requests cost 1000x more than others
  • Attackers understand your algorithms better than you do
  • “Industry standard” means “attackers have already found the weaknesses”

2. Rate Limiting Infrastructure Must Be Overprovisioned

We underestimated our rate limiting infrastructure needs by 10x:

Our mistake:

  • Sized Redis for “2x peak legitimate traffic”
  • Assumed attacks would be simple volumetric

Reality:

  • Attacks can be 20x peak traffic
  • Rate limiting itself becomes the bottleneck
  • Need 10x headroom on rate limiting infrastructure

3. Fail-Closed is Better Than Fail-Open

Our original design: “If we can’t determine rate limits, allow the request.”

Better approach (sketched after this list):

  • Temporary rate limit reduction on infrastructure failures
  • Cached rate limit decisions (stale data is better than no data)
  • Degraded service is better than no service
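
Concretely, the degraded path can look like the sketch below. rate_limiter, decision_cache, local_counter, get_tier, and RateLimiterUnavailable are placeholders for internal components, and the 25% factor is illustrative:

DEGRADED_LIMIT_FACTOR = 0.25  # much tighter limits while infrastructure is unhealthy
MAX_CACHED_DECISION_AGE = 30  # seconds; stale data beats no data, but not forever

def check_rate_limit(api_key: str, cost: float) -> bool:
    try:
        # Normal path: Redis-backed, cost-aware decision
        return rate_limiter.check(api_key, cost)
    except RateLimiterUnavailable:
        # Fail closed, but gracefully: reuse the last known decision if recent...
        cached = decision_cache.get(api_key)
        if cached is not None and cached.age_seconds < MAX_CACHED_DECISION_AGE:
            return cached.allowed
        # ...otherwise enforce a reduced local limit instead of letting everything through
        tier_limit = TIER_LIMITS[get_tier(api_key)]
        return local_counter.increment(api_key, cost) <= tier_limit * DEGRADED_LIMIT_FACTOR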

4. Behavioral Analysis Catches What Algorithms Miss

Our most effective defense wasn’t better algorithms—it was understanding normal behavior:

  • Legitimate users have consistent patterns
  • Bots exhibit statistical anomalies
  • Geographic spread, timing patterns, error tolerance

The key insight: Attackers optimize for your rate limiter. They can’t optimize for being indistinguishable from legitimate users.

5. Observability is Critical

We were blind during the attack. Now we monitor (instrumentation sketched after this list):

  • Rate limiting decision latency (per cache tier)
  • False positive/negative rates
  • Cost per API key (actual vs. allowed)
  • Behavioral anomaly scores
  • Infrastructure health (Redis, DynamoDB)
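
As an example of that instrumentation, a sketch using prometheus_client; the metric and label names are illustrative:

from prometheus_client import Counter, Histogram

RATE_LIMIT_DECISIONS = Counter(
    "rate_limit_decisions_total",
    "Rate limiting decisions by tier, outcome, and reason",
    ["tier", "outcome", "reason"],   # outcome: allowed|blocked; reason: quota|behavioral|degraded
)
DECISION_LATENCY = Histogram(
    "rate_limit_decision_seconds",
    "Latency of rate limiting decisions by cache tier",
    ["cache_tier"],                  # l1, l2, l3
)

# Inside the limiter's hot path:
with DECISION_LATENCY.labels(cache_tier="l2").time():
    allowed = check_rate_limit(api_key, cost)
RATE_LIMIT_DECISIONS.labels(
    tier="pro",
    outcome="allowed" if allowed else "blocked",
    reason="quota",
).inc()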

Dashboard metrics that saved us:

  • “Requests blocked by behavioral analysis”: Caught 3 attacks in first week
  • “Cost per API key trending”: Identified abuse before rate limits hit
  • “Cache hit rates by tier”: Optimized cache sizing

Practical Implementation Guide

Want to avoid our mistakes? Here’s what to do:

Start With This Architecture

  1. Multi-tier caching (local → Redis → persistent)
  2. Cost-based limits (not just request counts)
  3. Behavioral analysis (detect anomalies)
  4. Adaptive limits (reduce under load)
  5. Fail-closed (degrade gracefully)

Don’t Overthink It Initially

Start simple, add sophistication:

  • Week 1: Basic cost-weighted limits
  • Month 1: Add local caching
  • Month 2: Add behavioral analysis
  • Month 3: Tune false positive rates
  • Month 6: ML-based optimization

Load Test Your Rate Limiting

We didn’t. It cost $850K. You should (a minimal burst-pattern driver is sketched after this list):

  • Test at 10x peak legitimate traffic
  • Test with Redis failures
  • Test with network partitions
  • Test attack patterns (bursty, distributed, targeted)
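
For the bursty case, a minimal driver might look like the sketch below (aiohttp assumed; the staging URL and header name are placeholders). It reproduces the burst-then-idle rhythm the attackers used:

import asyncio
import aiohttp

API_URL = "https://staging.example.com/api/v2/reports/generate"  # staging, never production
API_KEY = "load-test-key"

async def one_burst(session, burst=1000, burst_window=10):
    # Fire `burst` requests spread across `burst_window` seconds.
    async def fire():
        async with session.post(API_URL, headers={"X-API-Key": API_KEY}) as resp:
            return resp.status

    tasks = []
    for _ in range(burst):
        tasks.append(asyncio.create_task(fire()))
        await asyncio.sleep(burst_window / burst)
    return await asyncio.gather(*tasks, return_exceptions=True)

async def main():
    for cycle in range(5):
        async with aiohttp.ClientSession() as session:
            statuses = await one_burst(session)
        blocked = sum(1 for s in statuses if s == 429)
        print(f"cycle {cycle}: {blocked}/{len(statuses)} requests blocked")
        await asyncio.sleep(50)  # go quiet so tokens refill, then burst again

asyncio.run(main())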

Monitor Everything

Critical metrics:

  • Rate limiting decision latency (P50, P95, P99)
  • Cache hit rates (L1, L2, L3)
  • False positive rate (daily)
  • Cost per user tier (actual vs. limit)
  • Infrastructure health (Redis CPU, memory, network)

The Bottom Line

Simple rate limiting is easy. Enterprise rate limiting is hard.

Our $850K lesson: Don’t wait for an attack to discover your rate limiting weaknesses. The attackers have already found them.

Build sophisticated rate limiting from the start:

  • Cost-aware limits
  • Behavioral analysis
  • Adaptive responses
  • Overprovisioned infrastructure
  • Comprehensive monitoring

Or learn the expensive way, like we did.


Building rate limiting for enterprise APIs? Let’s talk about implementation strategies before you learn these lessons the expensive way.