The Day Our Rate Limiting Failed Spectacularly
Black Friday 2024. 2:17 AM EST. Our monitoring dashboard lit up like a Christmas tree. API requests: 2.4M/minute. Normal peak traffic: 180K/minute.
We had rate limiting. Token bucket algorithm. Redis-backed. Industry standard. It was completely useless.
By the time we understood what was happening, the damage was done:
- $127K in AWS overage charges (4 hours)
- $340K in customer credits (SLA violations)
- $383K in lost revenue (checkout failures)
- 47% of customers affected
- 12-hour incident response
Total cost: $850,000
This is the story of how “good enough” rate limiting nearly destroyed our business, and what we built to replace it.
What We Had: The Industry Standard Approach
Before that night, we were proud of our rate limiting:
-- Token bucket in Redis (Lua script)
-- KEYS[1] = bucket key; ARGV = requested tokens, refill rate (tokens/sec), capacity, current unix time
local key = KEYS[1]
local requested = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'updated')
local tokens = tonumber(bucket[1])
local updated = tonumber(bucket[2])

-- First request for this key: start with a full bucket
if tokens == nil then
  tokens = capacity
  updated = now
end

-- Refill based on elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - updated) * rate)

if tokens >= requested then
  redis.call('HSET', key, 'tokens', tokens - requested, 'updated', now)
  redis.call('EXPIRE', key, 120)
  return 1
end

redis.call('HSET', key, 'tokens', tokens, 'updated', now)
redis.call('EXPIRE', key, 120)
return 0
Our configuration:
- 100 requests/second per API key
- 1,000 burst capacity
- 60-second refill window
- Redis Cluster with 3 masters, 3 replicas
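For context, here's roughly how a script like that gets invoked with those settings. This is a minimal sketch using redis-py; the key prefix, file name, and helper function are illustrative, not our production client:

import time
import redis

r = redis.Redis()

# Load the token bucket script shown above (assumed saved as token_bucket.lua)
with open("token_bucket.lua") as f:
    token_bucket = r.register_script(f.read())

def allow_request(api_key: str, tokens: int = 1) -> bool:
    # rate = 100 tokens/second, capacity = 1,000 burst, per the configuration above
    result = token_bucket(
        keys=[f"ratelimit:{api_key}"],
        args=[tokens, 100, 1000, int(time.time())],
    )
    return result == 1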
This worked perfectly for 18 months. Until it didn’t.
The Attack: Sophisticated Bot Network
The attackers were smart. They didn’t just overwhelm our rate limiter—they understood its weaknesses and exploited them systematically.
What They Did
1. Distributed Attack Pattern
- 47,000 unique IP addresses
- 890 compromised AWS accounts (stolen credentials)
- 3,200+ unique API keys (legitimate trial accounts)
- Geographic distribution across 23 countries
2. Burst Timing. They timed their bursts to align with our token refill windows:
- Send 1,000 requests in 10 seconds (use burst capacity)
- Wait 50 seconds (tokens refill)
- Repeat
Our token bucket saw each burst as “legitimate traffic that occasionally spikes.” At our settings, 1,000 requests every 60 seconds averages only about 17 requests/second per key, well under the 100/second limit, so the bucket refilled completely between bursts and nothing looked wrong.
3. Expensive Endpoints. They targeted our most resource-intensive endpoints:
- /api/v2/reports/generate (15-30 second execution)
- /api/v2/exports/bulk (ties up workers for 2+ minutes)
- /api/v2/analytics/compute (CPU-intensive calculations)
The Real Problem
Our rate limiting only counted requests, not resource cost.
A single /reports/generate request consumed 300x the resources of a simple /users/me call, but our rate limiter treated them identically.
The math that broke us:
3,200 API keys × 1,000 burst capacity = 3.2M requests
Each targeting 15-second endpoint
= 48M seconds of compute time
= 800,000 compute-minutes
= 13,333 compute-hours
In 4 hours.
Our autoscaling couldn’t keep up. We hit AWS account limits. Services crashed.
Our Redis Cluster Also Failed
During the attack, our Redis cluster became the bottleneck:
The cascade:
- Rate limiting checks: 40K/second sustained (2.4M/minute)
- Redis CPU: 45% → 94% (3 minutes)
- Memory pressure: 62% → 89%
- Network saturation: 12Gbps → 24Gbps
- Replication lag: 0.2s → 45s
- Query timeouts: 0% → 38%
When Redis queries started timing out, our application’s fail-open behavior (we couldn’t determine rate limits, so we allowed requests) made everything worse.
Critical mistake: We hadn’t load tested our rate limiting infrastructure at attack volumes.
What We Built: Context-Aware Multi-Tier Rate Limiting
It took us 6 weeks and 3 senior engineers to build what we should have had from the start.
Architecture: Hierarchical Rate Limiting
interface RateLimitPolicy {
  tier: 'free' | 'pro' | 'enterprise';
  limits: {
    requests: { rate: number; burst: number };   // requests/second plus burst
    compute: { units: number; window: number };  // weighted compute units per window (seconds)
    cost: { dollars: number; window: number };   // estimated spend per window (seconds)
  };
  isolation: 'shared' | 'dedicated';
}

// Cost-based rate limiting: relative weight per endpoint
const endpointCosts: Record<string, number> = {
  'GET /users/me': 1,
  'POST /reports/generate': 300,
  'POST /exports/bulk': 500,
  'POST /analytics/compute': 750,
};

function calculateCost(endpoint: string, durationMs: number): number {
  const baseCost = endpointCosts[endpoint] ?? 1;
  const executionCost = durationMs / 1000; // execution time in seconds
  return baseCost * executionCost;
}
Key Changes
1. Multi-Dimensional Limits
Instead of just counting requests, we now enforce limits along four dimensions (sketched below):
- Request rate (requests/second)
- Compute units (weighted by endpoint cost)
- Concurrent executions (prevents resource exhaustion)
- Dollar cost (estimated infrastructure spend)
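Here's a simplified sketch of what that check looks like when composed. The dataclass names, fields, and thresholds are illustrative; the real enforcement runs against the caching tiers described later:

from dataclasses import dataclass

@dataclass
class Limits:
    requests_per_sec: float       # request rate
    compute_units_per_min: float  # weighted by endpoint cost
    max_concurrent: int           # concurrent executions
    dollars_per_hour: float       # estimated infrastructure spend

@dataclass
class Usage:
    requests_per_sec: float
    compute_units_per_min: float
    concurrent: int
    dollars_per_hour: float

def admit(usage: Usage, limits: Limits) -> bool:
    # Admit only if every dimension has headroom; any single dimension can block.
    return (
        usage.requests_per_sec < limits.requests_per_sec
        and usage.compute_units_per_min < limits.compute_units_per_min
        and usage.concurrent < limits.max_concurrent
        and usage.dollars_per_hour < limits.dollars_per_hour
    )

The point is that a request can be rejected for exceeding any one dimension, even when its raw request count looks harmless.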
2. Adaptive Limits Based on System Health
# Example base limits per tier (requests/second); values here are illustrative
TIER_LIMITS = {'free': 10, 'pro': 100, 'enterprise': 1000}

def get_rate_limit(user_tier, current_system_load):
    base_limit = TIER_LIMITS[user_tier]
    # Reduce limits when system is stressed
    if current_system_load > 0.8:
        return base_limit * 0.5
    elif current_system_load > 0.6:
        return base_limit * 0.75
    return base_limit

# System load metric combines:
# - API latency P99
# - Database connection pool usage
# - Worker queue depth
# - Error rate
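The current_system_load input is a normalized blend of those signals. A rough sketch of such a composite score follows; the weights and saturation points are assumptions, not our exact formula:

def system_load(p99_latency_ms: float, db_pool_usage: float,
                queue_depth: int, error_rate: float) -> float:
    # Normalize each signal to 0..1, saturating at a "clearly unhealthy" point.
    latency_score = min(p99_latency_ms / 500.0, 1.0)  # saturate at 500 ms P99
    pool_score = min(db_pool_usage, 1.0)              # pool usage is already 0..1
    queue_score = min(queue_depth / 10_000, 1.0)      # saturate at 10k queued jobs
    error_score = min(error_rate / 0.05, 1.0)         # saturate at a 5% error rate
    # Weighted blend; the weights are illustrative and need tuning per system.
    return (0.35 * latency_score + 0.25 * pool_score
            + 0.25 * queue_score + 0.15 * error_score)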
3. Behavioral Analysis
We added anomaly detection:
def is_suspicious(api_key, request_pattern):
    # Historical pattern analysis
    normal_pattern = get_historical_pattern(api_key)
    # Check for anomalies
    suspicious_signals = [
        request_pattern.burst_ratio > 10,           # Excessive bursting
        request_pattern.expensive_endpoints > 0.8,  # Too many costly calls
        request_pattern.geographic_spread > 5,      # Too many regions
        request_pattern.failure_tolerance > 0.9,    # Continues despite errors
    ]
    return sum(suspicious_signals) >= 2
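The request_pattern object is a per-key feature summary over a rolling window of recent requests. A hypothetical sketch of that aggregation; the field names, request dict shape, and heuristics here are assumptions:

from dataclasses import dataclass

@dataclass
class RequestPattern:
    burst_ratio: float          # peak per-second rate / average per-second rate
    expensive_endpoints: float  # fraction of calls hitting high-cost endpoints
    geographic_spread: int      # distinct source regions in the window
    failure_tolerance: float    # fraction of calls sent after repeated errors

def build_pattern(requests: list, expensive: set) -> RequestPattern:
    # Each request is assumed to be a dict: {'ts', 'endpoint', 'region', 'status'}
    per_second, regions = {}, set()
    costly = errors_seen = after_errors = 0
    for req in requests:
        sec = int(req['ts'])
        per_second[sec] = per_second.get(sec, 0) + 1
        regions.add(req['region'])
        costly += req['endpoint'] in expensive
        after_errors += errors_seen >= 3
        errors_seen += req['status'] >= 400
    total = max(len(requests), 1)
    avg = total / max(len(per_second), 1)
    peak = max(per_second.values()) if per_second else 0
    return RequestPattern(
        burst_ratio=peak / avg if avg else 0.0,
        expensive_endpoints=costly / total,
        geographic_spread=len(regions),
        failure_tolerance=after_errors / total,
    )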
4. Multiple Redis Layers
Instead of one Redis cluster handling everything:
- L1 Cache: Local in-memory (LRU, 10K keys, 1ms)
- L2 Cache: Redis Cluster (distributed, 10M keys, 3ms)
- L3 Fallback: DynamoDB (persistent, unlimited, 50ms)
When Redis is overloaded, we fall back to DynamoDB with cached values.
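Conceptually the lookup walks the tiers in order and treats a Redis failure as a signal to fall back, not to fail open. A simplified sketch; the client objects, TTL, and key scheme are placeholders for our actual wiring:

import time

class TieredRateLimitStore:
    def __init__(self, redis_client, dynamo_table, l1_ttl=1.0):
        self.l1 = {}          # key -> (value, expires_at); a bounded LRU in practice
        self.redis = redis_client
        self.dynamo = dynamo_table
        self.l1_ttl = l1_ttl

    def get_counter(self, key: str) -> int:
        # L1: local in-memory cache (sub-millisecond)
        hit = self.l1.get(key)
        if hit and hit[1] > time.time():
            return hit[0]
        # L2: Redis Cluster (a few milliseconds)
        try:
            value = int(self.redis.get(key) or 0)
        except Exception:
            # L3: DynamoDB fallback (tens of milliseconds) when Redis is unhealthy
            item = self.dynamo.get_item(Key={'pk': key}).get('Item', {})
            value = int(item.get('counter', 0))
        self.l1[key] = (value, time.time() + self.l1_ttl)
        return value

This sketch only covers the read path; writes, LRU eviction, and staleness bounds on cached values are omitted.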
The Results: What Actually Works
We’ve been running this system for 9 months. Here’s what changed:
Attack Mitigation
Before:
- Detected attacks: 12 minutes after start (manual)
- Mitigation: 47 minutes (manual API key suspension)
- Damage: $850K
After:
- Detected attacks: 23 seconds (automated)
- Mitigation: 1.2 minutes (automatic rate limit reduction)
- Largest attack since: 890K req/min (blocked at 12% of normal capacity usage)
- Cost of blocked attack: $0
Performance Impact
Our rate limiting overhead:
- P50 latency: 0.8ms (L1 cache hit)
- P95 latency: 2.4ms (L2 cache hit)
- P99 latency: 4.1ms (L2 cache miss, sync to L1)
- P99.9 latency: 52ms (L3 fallback)
Cost Optimization
Interesting side effect—cost-based rate limiting helped legitimate users too:
User behavior changes:
- 34% reduction in expensive report generation requests
- Users switched to cheaper incremental queries
- Monthly infrastructure costs: down 23% ($47K/month)
Our pricing now reflects actual resource cost, and customers optimize their usage accordingly.
False Positive Rate
The hardest part was tuning behavioral analysis:
Initial deployment:
- False positive rate: 8.7%
- Legitimate traffic blocked: 140K requests/day
- Customer complaints: 23/day
After 6 months of tuning:
- False positive rate: 0.13%
- Legitimate traffic blocked: 2.1K requests/day
- Customer complaints: 0-1/day
We use machine learning to continuously refine the behavioral models.
What We Learned (The Expensive Way)
1. Token Buckets Are Necessary But Not Sufficient
Token bucket algorithms are great for smoothing traffic, but they treat all requests equally. In the real world:
- Some requests cost 1000x more than others
- Attackers understand your algorithms better than you do
- “Industry standard” means “attackers have already found the weaknesses”
2. Rate Limiting Infrastructure Must Be Overprovisioned
We underestimated our rate limiting infrastructure needs by 10x:
Our mistake:
- Sized Redis for “2x peak legitimate traffic”
- Assumed attacks would be simple volumetric
Reality:
- Attacks can be 20x peak traffic
- Rate limiting itself becomes the bottleneck
- Need 10x headroom on rate limiting infrastructure
3. Fail-Closed is Better Than Fail-Open
Our original design: “If we can’t determine rate limits, allow the request.”
Better approach (sketched below):
- Temporary rate limit reduction on infrastructure failures
- Cached rate limit decisions (stale data is better than no data)
- Degraded service is better than no service
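A minimal sketch of what that degraded path can look like, assuming a local cache of recent decisions (the limiter callable, cache, and reduction factor are illustrative):

# Fail-closed sketch: on infrastructure errors, fall back to the last cached
# decision at a reduced allowance instead of letting everything through.
import random

DEGRADED_FACTOR = 0.25  # fraction of normally-allowed traffic admitted while degraded

def check_limit(api_key: str, limiter, decision_cache: dict) -> bool:
    try:
        allowed = limiter(api_key)          # normal path, e.g. the Redis token-bucket call
        decision_cache[api_key] = allowed   # remember the last good decision
        return allowed
    except (ConnectionError, TimeoutError):
        # Degraded path: stale data beats no data, and a throttled allowance
        # beats fail-open. Keys with no history are denied until Redis recovers.
        last_allowed = decision_cache.get(api_key, False)
        return last_allowed and random.random() < DEGRADED_FACTOR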
4. Behavioral Analysis Catches What Algorithms Miss
Our most effective defense wasn’t better algorithms—it was understanding normal behavior:
- Legitimate users have consistent patterns
- Bots exhibit statistical anomalies
- Geographic spread, timing patterns, error tolerance
The key insight: Attackers optimize against your rate limiter, but they can't easily make their traffic indistinguishable from legitimate users.
5. Observability is Critical
We were blind during the attack. Now we monitor:
- Rate limiting decision latency (per cache tier)
- False positive/negative rates
- Cost per API key (actual vs. allowed)
- Behavioral anomaly scores
- Infrastructure health (Redis, DynamoDB)
Dashboard metrics that saved us:
- “Requests blocked by behavioral analysis”: Caught 3 attacks in first week
- “Cost per API key trending”: Identified abuse before rate limits hit
- “Cache hit rates by tier”: Optimized cache sizing
Practical Implementation Guide
Want to avoid our mistakes? Here’s what to do:
Start With This Architecture
- Multi-tier caching (local → Redis → persistent)
- Cost-based limits (not just request counts)
- Behavioral analysis (detect anomalies)
- Adaptive limits (reduce under load)
- Fail-closed (degrade gracefully)
Don’t Overthink It Initially
Start simple, add sophistication:
- Week 1: Basic cost-weighted limits
- Month 1: Add local caching
- Month 2: Add behavioral analysis
- Month 3: Tune false positive rates
- Month 6: ML-based optimization
Load Test Your Rate Limiting
We didn’t. It cost $850K. You should:
- Test at 10x peak legitimate traffic
- Test with Redis failures
- Test with network partitions
- Test attack patterns (bursty, distributed, targeted)
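Even a crude generator that replays the attackers' burst/refill rhythm against staging will surface most sizing problems. A sketch of that kind of test; the URL, key count, burst size, and timings are placeholders to adapt:

import asyncio
import aiohttp

# Placeholders: point these at a staging environment, never production.
STAGING_URL = "https://staging.example.com/api/v2/reports/generate"
API_KEYS = [f"loadtest-key-{i}" for i in range(50)]

async def one_request(session: aiohttp.ClientSession, api_key: str) -> int:
    async with session.post(STAGING_URL,
                            headers={"Authorization": f"Bearer {api_key}"}) as resp:
        return resp.status

async def burst(session: aiohttp.ClientSession, api_key: str, size: int = 1000) -> int:
    # One full burst for a single key, mirroring the burst/refill attack cycle.
    statuses = await asyncio.gather(
        *(one_request(session, api_key) for _ in range(size)), return_exceptions=True
    )
    return sum(1 for s in statuses if s == 429)  # count correctly rate-limited calls

async def main():
    async with aiohttp.ClientSession() as session:
        for cycle in range(5):
            limited = await asyncio.gather(*(burst(session, key) for key in API_KEYS))
            print(f"cycle {cycle}: {sum(limited)} requests answered with 429")
            await asyncio.sleep(50)  # wait for token refill, then burst again

if __name__ == "__main__":
    asyncio.run(main())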
Monitor Everything
Critical metrics:
- Rate limiting decision latency (P50, P95, P99)
- Cache hit rates (L1, L2, L3)
- False positive rate (daily)
- Cost per user tier (actual vs. limit)
- Infrastructure health (Redis CPU, memory, network)
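If you run Prometheus, most of this list maps to a handful of counters and histograms. A minimal instrumentation sketch; the metric names, labels, and buckets below are illustrative, not a standard:

import time
from prometheus_client import Counter, Gauge, Histogram

DECISION_LATENCY = Histogram(
    "rate_limit_decision_seconds",
    "Latency of rate limiting decisions",
    ["cache_tier"],  # l1, l2, l3
    buckets=(0.001, 0.0025, 0.005, 0.01, 0.05, 0.1),
)
CACHE_LOOKUPS = Counter(
    "rate_limit_cache_lookups_total", "Cache lookups by tier and outcome",
    ["cache_tier", "outcome"],  # hit / miss
)
BLOCKED = Counter(
    "rate_limit_blocked_total", "Requests blocked", ["reason"],  # quota / behavioral
)
COST_UNITS = Gauge(
    "rate_limit_cost_units", "Consumed vs. allowed cost units", ["tier", "kind"],
)

def record_decision(tier: str, started_at: float, outcome: str, blocked_reason=None):
    # Call this on every rate limiting decision, whichever cache tier served it.
    DECISION_LATENCY.labels(cache_tier=tier).observe(time.time() - started_at)
    CACHE_LOOKUPS.labels(cache_tier=tier, outcome=outcome).inc()
    if blocked_reason:
        BLOCKED.labels(reason=blocked_reason).inc()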
Resources That Helped Us
These resources saved us when building our replacement system:
- NIST Cybersecurity Framework - Rate Limiting - Foundation for security controls
- AWS API Gateway Rate Limiting - Reference architecture
- Redis Lua Scripting Best Practices - Atomic rate limiting operations
- Kong Rate Limiting Plugin - Production patterns
- Tyk API Gateway Docs - Policy management
- Google Cloud Rate Limiting Patterns - Architecture guidance
- NGINX Rate Limiting Guide - Leaky bucket implementation
- Istio Traffic Management - Service mesh patterns
- DataDog Rate Limiting Monitoring - Observability strategies
- Azure API Management Rate Limiting - Enterprise patterns
- Cloudflare Rate Limiting - DDoS protection
- Redis Enterprise Cluster Sizing - Infrastructure planning
- CrashBytes: Advanced Rate Limiting Patterns - Deep dive on enterprise implementation
The Bottom Line
Simple rate limiting is easy. Enterprise rate limiting is hard.
Our $850K lesson: Don’t wait for an attack to discover your rate limiting weaknesses. The attackers have already found them.
Build sophisticated rate limiting from the start:
- Cost-aware limits
- Behavioral analysis
- Adaptive responses
- Overprovisioned infrastructure
- Comprehensive monitoring
Or learn the expensive way, like we did.
Building rate limiting for enterprise APIs? Let’s talk about implementation strategies before you learn these lessons the expensive way.