The Problem: Gateway Sprawl Was Killing Us
By Q4 2024, our infrastructure had evolved into a nightmare: 47 independent API gateways across 15 data centers, each configured differently, none talking to each other.
Every new service deployment required:
- 4-6 hours of manual gateway configuration
- Updates to 12+ different gateway configs
- Prayer that nobody fat-fingered a regex
- Cross-team coordination meetings (death by Zoom)
Our SRE team was spending 30+ hours per week on gateway maintenance. Something had to change.
After reading about API gateway federation patterns, I pitched a radical plan: federate everything, support all protocols, one control plane.
My VP thought I was insane. Turns out, he was half right.
Decision Point: Build vs. Buy vs. Customize
We evaluated three paths:
Option 1: Envoy + Custom Control Plane
Pros: Total control, WebAssembly plugins, perfect Istio integration
Cons: 18-24 month build timeline, 4 FTE engineers
Verdict: Too risky for our timeline
Option 2: Kong Enterprise + Kuma Service Mesh
Pros: Battle-tested, strong GraphQL support, decent federation
Cons: License costs $380K/year, limited gRPC optimization
Verdict: Close second
Option 3: Hybrid Approach - Envoy + Istio + Apollo Router
Pros: Best-in-class for each protocol, open source, strong community
Cons: Complex integration, three systems to manage
Verdict: We chose this - and here’s why
The Architecture: How We Made It Work
Our final architecture looked like this:
┌─────────────────────────────────────────────┐
│        Global Control Plane (GitOps)        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Argo CD  │  │  Istio   │  │  Apollo  │   │
│  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────────────────────────┘
                       │
      ┌────────────────┼────────────────┐
      │                │                │
 ┌────▼────┐      ┌────▼────┐      ┌────▼────┐
 │ Region  │      │ Region  │      │ Region  │
 │  US-E   │      │  US-W   │      │  EU-W   │
 └─────────┘      └─────────┘      └─────────┘
      │                │                │
   [Envoy]          [Envoy]          [Envoy]
    ├─REST           ├─REST           ├─REST
    ├─gRPC           ├─gRPC           ├─gRPC
    ├─GraphQL        ├─GraphQL        ├─GraphQL
    └─WebSocket      └─WebSocket      └─WebSocket
Protocol-Specific Handling
- REST APIs: Handled directly by Envoy with custom rate limiting filters
- gRPC: Envoy’s native gRPC support + connection pooling optimizations
- GraphQL: Apollo Router federation with subgraph stitching
- WebSocket: Envoy with sticky sessions and connection draining
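To make that split concrete, here is a rough sketch of how per-protocol dispatch can look inside Envoy's HTTP connection manager; the cluster names and the /graphql prefix are placeholders, not our exact production config:

# Sketch: per-protocol routing inside Envoy's HTTP connection manager.
# Cluster names and the /graphql prefix are hypothetical.
upgrade_configs:
  - upgrade_type: websocket          # allow WebSocket upgrades on this listener
route_config:
  virtual_hosts:
    - name: api
      domains: ["*"]
      routes:
        # gRPC: matched by content-type, sent to the gRPC cluster
        - match:
            prefix: "/"
            headers:
              - name: content-type
                string_match: { prefix: application/grpc }
          route: { cluster: grpc-services }
        # GraphQL: routed to Apollo Router for federation
        - match: { prefix: "/graphql" }
          route: { cluster: apollo-router }
        # Everything else: plain REST traffic
        - match: { prefix: "/" }
          route: { cluster: rest-services }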
The magic was in the WebAssembly filters we wrote for:
- JWT validation (12µs overhead)
- Custom rate limiting (sub-millisecond)
- Protocol detection and routing
- Real-time fraud detection
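I can't publish the filters themselves, but for a sense of how something like the JWT validator gets attached, here is a sketch using Istio's WasmPlugin resource; the image URL, selector, and pluginConfig values are all placeholders:

# Sketch: deploying a custom Wasm filter via Istio's WasmPlugin CRD.
# Image URL, selector, issuer, and audience are placeholders.
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: jwt-validation
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway        # apply only on gateway workloads
  url: oci://registry.example.com/filters/jwt-validation:1.0.0
  phase: AUTHN                     # insert before Istio's own authn filters
  pluginConfig:
    issuer: https://auth.example.com
    audiences:
      - api.example.com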
Phase 1: The Prototype That Almost Failed
We started with a single cluster running 5 services. Week 1 was a disaster.
Problem 1: Certificate Chaos
Istio’s certificate rotation broke our mTLS connections every 12 hours. Services couldn’t talk to each other.
Root cause: We were running Istio 1.18 alongside a cert-manager version with a known bug.
Solution: Upgraded to Istio 1.19, implemented custom cert-manager webhooks, added 4-hour certificate overlap for rotation.
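If the "4-hour overlap" part is unclear: the idea is simply to renew certificates well before they expire, so the old and new certs are both valid while rotation propagates. A minimal cert-manager sketch of that pattern (the 24h lifetime and all names are illustrative, not our exact istiod wiring):

# Sketch: renewing 4 hours before expiry gives an overlap window during rotation.
# The 24h lifetime and all names are illustrative.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: workload-mtls
  namespace: istio-system
spec:
  secretName: workload-mtls-cert
  duration: 24h
  renewBefore: 4h                  # start rotation 4 hours before expiry
  issuerRef:
    name: regional-intermediate-ca
    kind: ClusterIssuer
  dnsNames:
    - workload.example.internal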
Problem 2: GraphQL Query Complexity
A single malicious query was taking down entire gateway pods.
Root cause: No query depth limiting, 15-level nested queries consuming 8GB memory.
Solution: Implemented Apollo’s depthLimit validation rule + custom complexity scoring:
// Apollo Server plugin: reject queries that exceed our complexity budget.
// calculateComplexity is our in-house scorer (referenced above, not shown).
import { GraphQLError } from 'graphql';

const complexityPlugin = {
  requestDidStart() {
    return {
      didResolveOperation({ request, document }) {
        const complexity = calculateComplexity({
          query: document,
          variables: request.variables,
          maxDepth: 10,
          complexityScoreThreshold: 1000
        });
        if (complexity > 1000) {
          throw new GraphQLError('Query too complex', {
            extensions: { code: 'QUERY_COMPLEXITY_EXCEEDED' }
          });
        }
      }
    };
  }
};
Problem 3: The gRPC Connection Pool Leak
Memory usage was growing at 2GB/hour per gateway pod.
Root cause: gRPC connections weren’t being properly closed when services scaled down.
Solution: Custom Envoy filter to track connection lifetimes + aggressive timeout policies:
clusters:
  - name: grpc-services
    connect_timeout: 5s
    type: STRICT_DNS
    http2_protocol_options:
      max_concurrent_streams: 100
      initial_stream_window_size: 65536
      initial_connection_window_size: 1048576
    common_http_protocol_options:
      idle_timeout: 30s                 # reap idle upstream connections quickly
      max_connection_duration: 300s     # force periodic reconnects so pools can't grow stale
    upstream_connection_options:
      tcp_keepalive:
        keepalive_time: 60
        keepalive_interval: 10
        keepalive_probes: 3
Phase 2: Multi-Region Rollout
Once we stabilized the prototype, we faced the real challenge: rolling out across 15 regions without downtime.
The Migration Strategy
We couldn’t do a big-bang migration. Our approach:
- Weeks 1-2: Shadow traffic to new federated gateways (0% production)
- Weeks 3-4: 10% production traffic split testing
- Weeks 5-8: Progressive rollout (10% → 50% → 90%)
- Week 9: Full cutover + 72-hour monitoring
- Week 10: Decommission old gateways
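In practice, a rollout like this can be driven by weight changes in an Istio VirtualService applied through GitOps. Here is a simplified sketch of what the 90/10 phase could look like; host and service names are hypothetical:

# Sketch: progressive traffic split between old and new gateways (the 90/10 phase).
# (A shadow phase can instead keep weights at 100/0 and use Istio's `mirror` field.)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-api
spec:
  hosts:
    - orders.example.internal
  http:
    - route:
        - destination:
            host: legacy-gateway.gateways.svc.cluster.local
          weight: 90
        - destination:
            host: federated-gateway.gateways.svc.cluster.local
          weight: 10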
Challenges We Hit
Challenge 1: WebSocket Connection Draining
Users got disconnected during deployments every 6 hours.
Solution: Implemented connection draining with 15-minute grace periods:
apiVersion: v1
kind: Service
metadata:
  annotations:
    # 900 seconds = the 15-minute grace period for in-flight WebSocket connections
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "900"
Challenge 2: Cross-Region Certificate Trust
Services in US-E couldn’t talk to services in EU-W due to certificate trust issues.
Solution: Implemented hierarchical PKI with cross-signing:
- Root CA in us-east-1
- Intermediate CAs in each region
- Automated cross-signing via Vault
- 4-hour certificate refresh cycle
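To show how the regional intermediates plug into the clusters, here is a sketch of a cert-manager issuer backed by a regional Vault PKI mount; the server URL, mount path, and auth details are placeholders:

# Sketch: a per-region cert-manager issuer backed by that region's Vault intermediate CA.
# Server URL, PKI mount path, and auth details are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: regional-intermediate-ca
spec:
  vault:
    server: https://vault.us-east-1.example.internal:8200
    path: pki_us_east_1/sign/workloads   # the regional intermediate's signing role
    auth:
      kubernetes:
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        secretRef:
          name: cert-manager-vault-token
          key: token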
Challenge 3: The Latency Spike
After migration, p99 latency increased from 45ms to 180ms.
Root cause: Default Envoy retry policies were cascading failures across regions.
Solution: Tuned retry policies aggressively:
route:
  retry_policy:
    retry_on: "5xx"
    num_retries: 1                      # a single retry, never a retry storm
    per_try_timeout: 0.05s              # 50ms per attempt (proto Duration syntax)
    retry_host_predicate:
      - name: envoy.retry_host_predicates.previous_hosts
    host_selection_retry_max_attempts: 3
Result: p99 dropped to 52ms - better than before!
The Performance Numbers
After full migration, our metrics transformed:
Gateway Configuration Time
- Before: 4-6 hours per service
- After: 3 minutes (automated GitOps)
- Improvement: 98% reduction
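For context on "automated GitOps": every region's gateway config lives in Git and Argo CD applies it, so adding a service is a pull request rather than a ticket. A minimal sketch of one such Application (repo URL, project, and paths are placeholders):

# Sketch: one Argo CD Application per region's gateway config.
# Repo URL, project, and paths are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gateway-us-east
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://git.example.com/platform/gateway-config.git
    targetRevision: main
    path: regions/us-east
  destination:
    server: https://kubernetes.default.svc
    namespace: istio-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true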
Cross-Region Latency (p99)
- Before: 215ms (multi-hop through legacy gateways)
- After: 52ms (direct federation)
- Improvement: 76% reduction
Operational Overhead
- Before: 30 hours/week (4 SRE engineers)
- After: 4 hours/week (1 SRE engineer)
- Improvement: 87% reduction
Cost Impact
- Before: $47K/month (gateway infrastructure + operational overhead)
- After: $18K/month (consolidated infrastructure)
- Improvement: 62% reduction
Throughput
- Sustained: 50,000 requests/second across all protocols
- Peak: 127,000 requests/second (Black Friday 2025)
- Error rate: 0.003% (mostly client timeouts)
The Hidden Costs Nobody Talks About
Federating API gateways isn’t free. Here’s what we didn’t expect:
1. Training and Knowledge Transfer
Cost: 240 engineering hours
Impact: Teams needed to learn Envoy, Istio, and Apollo Router
We ran weekly “Gateway Office Hours” for 3 months to onboard teams.
2. WebAssembly Filter Debugging
Cost: Countless hours of frustration
Challenge: No good debugging tools for Wasm filters running in Envoy
Our solution: Built custom logging and tracing into every filter, added e2e tests in Go.
3. Observability Complexity
Challenge: Three systems (Envoy, Istio, Apollo) = three telemetry stacks
Solution: Unified everything through OpenTelemetry:
- Distributed tracing across all protocols
- Unified metrics in Prometheus
- Log aggregation in Loki
- Custom Grafana dashboards per protocol type
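Most of that unification lives in the OpenTelemetry Collector pipelines. A trimmed-down sketch (endpoints are placeholders; the OTLP trace exporter destination is a stand-in, since the backend choice isn't the point here):

# Sketch: one Collector config fanning out to the stacks listed above.
# Endpoints are placeholders; the OTLP trace destination is an assumption.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"                     # scraped by Prometheus
  loki:
    endpoint: http://loki.observability:3100/loki/api/v1/push
  otlp/traces:
    endpoint: tracing-backend.observability:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]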
4. The “Magic” Problem
Developers complained: “It just works, but I don’t know how it works.”
Solution: Created comprehensive runbooks, architecture diagrams, and a “Gateway 101” internal course.
What I’d Do Differently
Looking back with 20/20 hindsight:
1. Start with OpenTelemetry Integration
We retrofitted observability. Should have been day-one priority. The debugging pain was immense before we had proper tracing.
2. Write More Wasm Filters Earlier
We waited until Phase 2 to write custom filters. Should have built them in the prototype phase. Custom rate limiting and circuit breaking would have prevented several production incidents.
3. Invest in Load Testing Infrastructure Earlier
We discovered our connection pool leak in production. A proper load testing environment would have caught it in staging.
Load test setup we eventually built:
- K6 scripts running 100K concurrent connections
- Mixed protocol tests (REST, gRPC, GraphQL, WebSocket)
- Chaos engineering scenarios (pod failures, network partitions)
- Automated performance regression tests in CI/CD
4. Document Everything Immediately
We lost knowledge during team transitions because documentation lagged reality by 6 weeks. Real-time runbook updates would have prevented several incidents.
Lessons for Teams Considering Federation
If you’re thinking about federating your API gateways:
✅ Do This:
- Start small - Federate 2-3 gateways first, learn, iterate
- Invest in observability - You can’t debug what you can’t see
- Automate from day one - Manual gateway configs will kill you at scale
- Build Wasm skills - Custom filters are your secret weapon
- Plan for multi-protocol - Even if you only use REST today
❌ Don’t Do This:
- Big-bang migrations - Shadow traffic first, validate everything
- Skip load testing - Production is not your test environment
- Ignore certificate lifecycle - mTLS management is complex
- Assume 100% compatibility - Test everything during migration
- Underestimate training - Teams need time to learn new systems
The ROI: Was It Worth It?
Bottom line: Yes. Absolutely worth it.
Quantifiable benefits:
- $348K annual savings (infrastructure + operational overhead)
- 98% faster service deployments
- 76% lower latency for cross-region traffic
- 87% less SRE time spent on gateway operations
Intangible benefits:
- Developer confidence (deploy without fear)
- Consistent security policies across all services
- Unified observability across protocols
- Foundation for future multi-cloud expansion
What’s Next?
We’re now exploring:
- WebAssembly-based canary deployments - Traffic splitting at the gateway layer
- AI-powered rate limiting - Adaptive limits based on user behavior patterns
- Multi-cloud federation - Extending our architecture to GCP and Azure
- Protocol translation - Automatic REST-to-gRPC conversion for legacy services
Federating API gateways was one of the most technically challenging projects I’ve led, but also one of the most rewarding. The architectural complexity is real, but the operational benefits are transformative.
For more on modern API gateway patterns, check out the comprehensive API gateway federation guide that helped inform our architectural decisions.
Questions about API gateway federation? Connect on LinkedIn or follow my journey on Twitter.