The API Sprawl Nightmare
June 2024. Our mobile team filed their 14th ticket this month requesting API changes.
The problem: They needed data from 5 different REST APIs to render a single screen.
User Profile Screen requires:
├─ GET /users/{id} (User Service)
├─ GET /orders/{userId} (Order Service)
├─ GET /payments/{userId} (Payment Service)
├─ GET /preferences/{userId} (Preference Service)
└─ GET /recommendations/{userId} (Recommendation Service)
Total: 5 sequential HTTP requests
Average load time: 2.3 seconds
Our mobile engineers were furious. And they were right.
We had 87 microservices with 87 separate REST APIs. Every new feature required:
- Coordinating with 3-5 backend teams
- Waiting for API changes in their backlogs
- Writing brittle orchestration code
- Dealing with version mismatches
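To make "brittle orchestration code" concrete, here is a minimal sketch of the kind of client-side glue this forced on the mobile team. The endpoint paths come from the screen breakdown above; `HttpGet`, `ProfileScreenData`, and `loadProfileScreen` are hypothetical stand-ins, not our actual client code.

```typescript
// Hypothetical transport: any function that GETs a path and returns parsed JSON.
type HttpGet = (path: string) => Promise<unknown>;

interface ProfileScreenData {
  user: unknown;
  orders: unknown;
  payments: unknown;
  preferences: unknown;
  recommendations: unknown;
}

// Five awaits in a row: each request starts only after the previous one
// finishes, so total latency is the sum of all five round trips.
async function loadProfileScreen(get: HttpGet, userId: string): Promise<ProfileScreenData> {
  const user = await get(`/users/${userId}`);
  const orders = await get(`/orders/${userId}`);
  const payments = await get(`/payments/${userId}`);
  const preferences = await get(`/preferences/${userId}`);
  const recommendations = await get(`/recommendations/${userId}`);
  return { user, orders, payments, preferences, recommendations };
}
```

Any change to one of the five response shapes means touching this function on every platform, which is where the coordination cost came from.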
After reading about GraphQL Federation at scale, I proposed a radical solution: Unify everything under a federated GraphQL API.
My CTO’s reaction: “That sounds like schema hell waiting to happen.”
Spoiler: It was. But we survived.
The Decision: Why Federation Over Monolithic GraphQL
We considered three approaches:
Option 1: BFF (Backend for Frontend) per Platform
Pros: Each team owns their BFF; simple to reason about.
Cons: Duplicates business logic; iOS, Android, and Web all reimplement the same patterns.
Verdict: No. Too much duplication.
Option 2: Monolithic GraphQL Gateway
Pros: Single codebase; easy to deploy.
Cons: Single-team bottleneck; can’t scale teams independently.
Verdict: Maybe, but it doesn’t solve the organizational issues.
Option 3: GraphQL Federation
Pros: Teams own their subgraphs; schema composition is automatic; scales with the org.
Cons: Complex setup, schema conflicts, distributed debugging.
Verdict: We chose this. Organizational scaling mattered more to us than technical complexity.
Phase 1: The Proof of Concept (Weeks 1-4)
We started with 3 services to prove federation worked:
Subgraphs:
- User Service - User profiles, authentication
- Order Service - Order history, tracking
- Product Service - Product catalog, inventory
The First Subgraph: Users
# users-subgraph/schema.graphql
extend schema
@link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key", "@shareable"])
# Custom scalar used by createdAt; a subgraph must declare every scalar it uses
scalar DateTime
type User @key(fields: "id") {
id: ID!
email: String!
name: String!
createdAt: DateTime!
}
type Query {
me: User
user(id: ID!): User
}
Implementation (Node.js + Apollo Server):
// users-subgraph/src/resolvers.ts
import { Resolvers } from './generated/graphql';
import { getUserById } from './services/userService';
export const resolvers: Resolvers = {
Query: {
me: async (_, __, { userId }) => {
if (!userId) throw new Error('Not authenticated');
return getUserById(userId);
},
user: async (_, { id }) => {
return getUserById(id);
},
},
// Federation reference resolver
User: {
__resolveReference: async (reference) => {
return getUserById(reference.id);
},
},
};
The Second Subgraph: Orders (with Type Extension)
# orders-subgraph/schema.graphql
extend schema
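@link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key"])
# Custom scalar used by Order.createdAt
scalar DateTime
# Contribute fields to the User entity owned by the users subgraph
type User @key(fields: "id") {
# Federation 2 does not require @external on key fields
id: ID!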
# Add orders field to User
orders: [Order!]!
}
type Order @key(fields: "id") {
id: ID!
userId: ID!
items: [OrderItem!]!
total: Float!
status: OrderStatus!
createdAt: DateTime!
}
enum OrderStatus {
PENDING
CONFIRMED
SHIPPED
DELIVERED
CANCELLED
}
type OrderItem {
productId: ID!
quantity: Int!
price: Float!
}
type Query {
order(id: ID!): Order
}
Implementation:
// orders-subgraph/src/resolvers.ts
import { Resolvers } from './generated/graphql';
import { getOrdersByUserId, getOrderById } from './services/orderService';
export const resolvers: Resolvers = {
Query: {
order: async (_, { id }) => {
return getOrderById(id);
},
},
// Extend User type with orders field
User: {
orders: async (user) => {
// 'user' only contains { id } from users subgraph
return getOrdersByUserId(user.id);
},
},
Order: {
__resolveReference: async (reference) => {
return getOrderById(reference.id);
},
},
};
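How the two subgraphs cooperate is easier to see with the plumbing exposed. The sketch below is a simplified stand-in for what a federation library does when the gateway asks a subgraph to resolve entity references. `Representation`, `ReferenceResolvers`, `resolveEntities`, and the in-memory `fakeOrders` store are illustrative names, not Apollo APIs.

```typescript
// A representation is what the gateway sends: the type name plus the key fields.
interface Representation {
  __typename: string;
  id: string;
}

type Entity = Record<string, unknown>;
type ReferenceResolver = (ref: Representation) => Promise<Entity | null>;
type ReferenceResolvers = Record<string, ReferenceResolver>;

// Dispatch each representation to the reference resolver registered for its type.
async function resolveEntities(
  representations: Representation[],
  resolvers: ReferenceResolvers,
): Promise<Array<Entity | null>> {
  return Promise.all(
    representations.map((ref) => {
      const resolve = resolvers[ref.__typename];
      if (!resolve) {
        throw new Error(`No reference resolver for ${ref.__typename}`);
      }
      return resolve(ref);
    }),
  );
}

// In-memory stand-in for getOrderById (assumption for the example).
const fakeOrders: Record<string, Entity> = {
  o1: { id: "o1", total: 25 },
};

const orderResolvers: ReferenceResolvers = {
  Order: async (ref) => fakeOrders[ref.id] ?? null,
};
```

The important detail is that the subgraph only ever receives the key fields, which is why the `User.orders` resolver above can rely on `user.id` and nothing else.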
Apollo Gateway Setup
// gateway/src/index.ts
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import { ApolloGateway, IntrospectAndCompose } from '@apollo/gateway';
const gateway = new ApolloGateway({
supergraphSdl: new IntrospectAndCompose({
subgraphs: [
{ name: 'users', url: 'http://users-service:4001/graphql' },
{ name: 'orders', url: 'http://orders-service:4002/graphql' },
{ name: 'products', url: 'http://products-service:4003/graphql' },
],
}),
});
// Apollo Server 4 has no `subscriptions` option, and @apollo/gateway does not
// serve subscriptions, so there is nothing to disable here.
const server = new ApolloServer({
gateway,
});
const { url } = await startStandaloneServer(server, {
listen: { port: 4000 },
});
console.log(`🚀 Gateway ready at ${url}`);
The First Federated Query
query UserDashboard {
me {
id
name
email
# This field is resolved by orders subgraph!
orders {
id
total
status
createdAt
}
}
}
It worked! Data from two services, one query, automatic stitching.
But then we tried to scale to 87 services…
The Schema Conflict Hell (Weeks 5-8)
Conflict 1: Type Name Collisions
Teams independently created Error types:
# payments-subgraph
type Error { # CONFLICT!
code: String!
message: String!
}
# shipping-subgraph
type Error { # SAME NAME!
errorCode: Int!
description: String!
}
Apollo Gateway: “Error: Type ‘Error’ is defined in multiple subgraphs with incompatible fields.”
Solution: Enforce naming conventions:
# New convention: prefix with service name
type PaymentError {
code: String!
message: String!
}
type ShippingError {
errorCode: Int!
description: String!
}
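A convention like this is easy to check mechanically before composition ever runs. Below is a rough sketch, assuming each subgraph's types have already been reduced to a map of type name to field names by some earlier step. `SubgraphTypes`, `Collision`, and `findTypeCollisions` are made-up names for illustration.

```typescript
// Subgraph name -> (type name -> field names). Assumed to be extracted elsewhere.
type SubgraphTypes = Record<string, Record<string, string[]>>;

interface Collision {
  typeName: string;
  subgraphs: string[];
}

// Report every type that appears in more than one subgraph with a different field set.
function findTypeCollisions(subgraphs: SubgraphTypes): Collision[] {
  const seen = new Map<string, { signature: string; owners: string[]; conflict: boolean }>();

  for (const [subgraph, types] of Object.entries(subgraphs)) {
    for (const [typeName, fields] of Object.entries(types)) {
      // Sort so that field order does not count as a difference.
      const signature = [...fields].sort().join(",");
      const entry = seen.get(typeName);
      if (!entry) {
        seen.set(typeName, { signature, owners: [subgraph], conflict: false });
      } else {
        entry.owners.push(subgraph);
        if (entry.signature !== signature) entry.conflict = true;
      }
    }
  }

  const collisions: Collision[] = [];
  seen.forEach((entry, typeName) => {
    if (entry.conflict) collisions.push({ typeName, subgraphs: entry.owners });
  });
  return collisions;
}
```

Run against the two `Error` definitions above, this would flag `Error` as defined by both payments and shipping; after the rename to `PaymentError` and `ShippingError` it reports nothing.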
Conflict 2: Shareable Scalar Conflicts
Multiple teams defined DateTime differently:
# subgraph-a
scalar DateTime # ISO 8601 string
# subgraph-b
scalar DateTime # Unix timestamp (number)
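Solution: Centralize the definitions of common scalars and keep them identical everywhere:
# shared scalar definitions
scalar DateTime
scalar JSON
scalar Email
scalar URL
One correction to our first attempt: @shareable applies to object types and fields, not scalars, and a subgraph cannot import a scalar from another subgraph. Each subgraph declares the scalars it uses, composition merges them by name, and every team has to serialize them the same way rather than inventing its own format.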
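One way to keep every subgraph honest is to share the serialization logic itself. A minimal sketch, assuming ISO 8601 strings in UTC are the agreed wire format for DateTime (the choice of format is an assumption here). `serializeDateTime` and `parseDateTime` are hypothetical helper names that a subgraph could wire into its own custom scalar.

```typescript
// Outgoing values: always emit an ISO 8601 string in UTC.
function serializeDateTime(value: Date | number | string): string {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new TypeError(`DateTime cannot represent value: ${String(value)}`);
  }
  return date.toISOString();
}

// Incoming values: accept only ISO 8601 strings and reject Unix timestamps,
// so the two conflicting conventions cannot silently coexist.
function parseDateTime(value: unknown): Date {
  if (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}T/.test(value)) {
    throw new TypeError("DateTime must be an ISO 8601 string");
  }
  const date = new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new TypeError(`Invalid DateTime: ${value}`);
  }
  return date;
}
```

Publishing helpers like these in a small internal package would let each subgraph keep its own scalar declaration while still agreeing on what goes over the wire.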
Conflict 3: Entity Key Mismatches
Two subgraphs tried to extend User with different keys:
# analytics-subgraph
type User @key(fields: "id") {
id: ID!
analyticsData: AnalyticsData!
}
# recommendations-subgraph
type User @key(fields: "email") { # DIFFERENT KEY!
email: String!
recommendations: [Product!]!
}
Apollo Gateway: Refused to compose. Keys must match.
Solution: Agreed on id as the canonical key, and added resolvers to fetch by email when needed.
Governance at Scale: The Schema Registry
After hitting our 20th schema conflict, we implemented strict governance.
Apollo Studio Schema Registry
# Install Rover CLI
npm install -g @apollo/rover
# Configure Studio
rover config auth
# Publish subgraph schemas
rover subgraph publish my-supergraph@main \
--name users \
--schema ./users-subgraph/schema.graphql \
--routing-url https://users-service.prod/graphql
Checks before deployment:
# .github/workflows/schema-check.yml
name: Schema Check
on:
  pull_request:
    paths:
      - '**/schema.graphql'
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Rover
        run: npm install -g @apollo/rover
      - name: Schema Check
        env:
          APOLLO_KEY: ${{ secrets.APOLLO_KEY }}
        run: |
          rover subgraph check my-supergraph@main \
            --name users \
            --schema ./users-subgraph/schema.graphql
          # Fails PR if breaking changes detected!
This CI check prevented 47 breaking changes from reaching production.
Schema Linting Rules
We created custom linting rules:
// schema-linter.js
// Custom rule configuration; nothing needs to be imported here
module.exports = {
rules: {
// Enforce naming conventions
'naming-convention': {
types: 'PascalCase',
FieldDefinition: 'camelCase',
EnumValueDefinition: 'UPPER_CASE',
},
// Require descriptions
'require-description': {
types: true,
FieldDefinition: true,
},
// Deprecation warnings
'require-deprecation-reason': true,
// No ID fields without @key
'require-id-key': {
types: ['User', 'Order', 'Product'],
},
},
};
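The three casing rules in that config boil down to simple pattern checks. Here is a sketch of what such checks might look like; the regexes, `NameKind`, and `checkName` are assumptions for illustration, not the internals of any particular linter.

```typescript
type NameKind = "type" | "field" | "enumValue";

interface CasingRule {
  pattern: RegExp;
  label: string;
}

// One casing rule per kind of schema name, mirroring the config above.
const CASING_RULES: Record<NameKind, CasingRule> = {
  type: { pattern: /^[A-Z][A-Za-z0-9]*$/, label: "PascalCase" },
  field: { pattern: /^[a-z][A-Za-z0-9]*$/, label: "camelCase" },
  enumValue: { pattern: /^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$/, label: "UPPER_CASE" },
};

// Returns null when the name passes, or a human-readable message when it does not.
function checkName(kind: NameKind, name: string): string | null {
  const rule = CASING_RULES[kind];
  return rule.pattern.test(name) ? null : `${kind} "${name}" should be ${rule.label}`;
}
```

A check like this is cheap enough to run on every pull request alongside the composition check.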
Performance: The N+1 Query Problem
The Problem
Query:
query {
users(limit: 100) {
id
name
orders { # N+1!
id
total
}
}
}
Execution:
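- Gateway fetches 100 users from users-subgraph
- Gateway sends those 100 User references to orders-subgraph in one batched _entities request
- Inside orders-subgraph, the User.orders resolver runs once per user
- 100 separate backend lookups, one per user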
Latency: 8.2 seconds for 100 users. Unacceptable.
Solution: DataLoader Pattern
// orders-subgraph/src/dataloaders.ts
import DataLoader from 'dataloader';
import { Order, Resolvers } from './generated/graphql';
import { getOrdersByUserIds } from './services/orderService';
// Batch and cache user order fetches
export const createOrdersDataLoader = () => {
return new DataLoader<string, Order[]>(
async (userIds: readonly string[]) => {
// Single database query for all userIds
const ordersMap = await getOrdersByUserIds([...userIds]);
// Return in same order as input
return userIds.map(userId => ordersMap[userId] || []);
},
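{
// Identity cache key (DataLoader's default); results are cached for the life of this loader instance
cacheKeyFn: (key) => key,
// Collect load() calls for 10ms, then dispatch them as one batch
batchScheduleFn: (callback) => setTimeout(callback, 10),
}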
);
};
// Use in resolver
export const resolvers: Resolvers = {
User: {
orders: async (user, _, { dataloaders }) => {
return dataloaders.orders.load(user.id);
},
},
};
After DataLoader:
- 100 users: 1 batch query instead of 100 individual queries
- Latency: 140ms (98% improvement!)
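To show what the batching buys without pulling in the library, here is a stripped-down sketch of the idea behind DataLoader: collect every key requested in the same tick, then issue one batch call. `TinyLoader` is an illustrative toy, not the `dataloader` package, and it omits caching entirely.

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (err: unknown) => void;
}

class TinyLoader<K, V> {
  private queue: Pending<K, V>[] = [];
  private readonly batchFn: BatchFn<K, V>;

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first key of a tick schedules one flush for the whole batch.
      if (this.queue.length === 1) {
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    try {
      // One call for all keys; results must come back in key order.
      const values = await this.batchFn(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(values[i]));
    } catch (err) {
      batch.forEach((item) => item.reject(err));
    }
  }
}
```

If a resolver calls `load()` once per user while resolving a list, all of those calls land in the same batch and the backing store sees a single query.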
Migrating 87 Services: The Strategy
We couldn’t migrate everything at once. Our approach:
Wave 1: High-Value, Low-Complexity (Months 2-3)
- User service
- Product catalog
- Order history
- Payment methods
- Shipping info
Why these first? Most commonly accessed by mobile apps, relatively simple schemas.
Wave 2: Medium Complexity (Months 4-5)
- Recommendations engine
- Search service
- Analytics
- Notifications
- Reviews/Ratings
Wave 3: Complex Domains (Months 6-7)
- Fraud detection
- Inventory management
- Pricing engine
- Promotions
- Tax calculation
Wave 4: Legacy Systems (Months 8-9)
- Mainframe integration
- Third-party vendor APIs
- Internal admin tools
The Migration Pattern
For each service:
1. Create GraphQL subgraph (federated schema)
↓
2. Implement resolvers (fetch from existing REST API)
↓
3. Publish schema to registry
↓
4. Deploy subgraph to staging
↓
5. Run schema checks & integration tests
↓
6. Canary deploy (5% → 25% → 50% → 100%)
↓
7. Monitor for 1 week
↓
8. Deprecate REST endpoints
Parallel migrations: Up to 6 teams migrating simultaneously.
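Step 6 is the one that tends to get hand-waved, so here is a rough sketch of percentage-based canary routing. It assumes traffic is split by hashing a stable request attribute such as a user ID. `CANARY_STAGES`, `bucketFor`, and `routeToGraphQL` are hypothetical names, and in practice this logic often lives in a load balancer or service mesh rather than in application code.

```typescript
// Rollout stages from the migration pattern, as percentages of traffic.
const CANARY_STAGES: readonly number[] = [5, 25, 50, 100];

// Map a stable ID to a bucket in [0, 100) using a simple 32-bit string hash.
function bucketFor(id: string): number {
  let hash = 0;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// A user is routed to the new subgraph when their bucket falls below the
// current stage's percentage. Buckets are stable, so a user who is in the
// canary at 5% stays in it at 25%, 50%, and 100%.
function routeToGraphQL(userId: string, stagePercent: number): boolean {
  return bucketFor(userId) < stagePercent;
}
```

Hashing instead of picking randomly per request keeps each user's experience consistent while the percentage ramps up.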
Production Rollout: Apollo Router
After 3 months, we hit Apollo Gateway performance limits:
- Single-threaded Node.js
- 15-25ms gateway overhead per query
- Couldn’t saturate a 16-core machine
Solution: Migrate to Apollo Router (Rust-based).
Performance Comparison
| Metric | Apollo Gateway (Node) | Apollo Router (Rust) |
|---|---|---|
| Latency overhead | 18ms | 1.2ms |
| Throughput/core | 1,200 req/sec | 12,000 req/sec |
| Memory/instance | 840MB | 95MB |
| CPU utilization | 1 core only | All cores |
Apollo Router config:
# router.yaml
# The supergraph schema itself is fetched from Apollo Studio using the
# APOLLO_KEY and APOLLO_GRAPH_REF environment variables set on the Deployment.
supergraph:
  introspection: false
  # Query plan caching
  query_planning:
    cache:
      redis:
        urls: ["redis://redis-cluster:6379"]
        ttl: 3600s
# Performance tuning
traffic_shaping:
  # Limit on client requests entering the router
  router:
    global_rate_limit:
      capacity: 10000
      interval: 1s
  # Per-subgraph limits
  subgraphs:
    users:
      timeout: 30s
      global_rate_limit:
        capacity: 5000
        interval: 1s
    orders:
      timeout: 60s
      global_rate_limit:
        capacity: 3000
        interval: 1s
# Distributed caching for automatic persisted queries (APQ)
apq:
  enabled: true
  router:
    cache:
      redis:
        urls: ["redis://redis-cluster:6379"]
        ttl: 300s
Deployment:
# k8s/apollo-router.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apollo-router
  namespace: graphql
spec:
  replicas: 3
  selector:
    matchLabels:
      app: apollo-router
  template:
    metadata:
      labels:
        app: apollo-router
    spec:
      containers:
        - name: router
          image: ghcr.io/apollographql/router:v1.28.0
          ports:
            - containerPort: 4000
          env:
            - name: APOLLO_KEY
              valueFrom:
                secretKeyRef:
                  name: apollo-credentials
                  key: key
            - name: APOLLO_GRAPH_REF
              value: "my-supergraph@main"
          resources:
            requests:
              cpu: "2"
              memory: "1Gi"
            limits:
              cpu: "4"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              # The Router serves /health on its health-check listener, which
              # defaults to 127.0.0.1:8088. Set health_check.listen to
              # 0.0.0.0:8088 in router.yaml so the kubelet can reach it.
              port: 8088
            initialDelaySeconds: 10
            periodSeconds: 10
The Results: 9 Months Later
Performance Metrics
API Response Times:
- Before (REST): Average 2.3s for complex screens
- After (GraphQL): Average 850ms
- Improvement: 63% faster
Mobile App Performance:
- Requests per screen: 5-8 → 1
- Network data transferred: 450KB → 120KB
- Battery impact: 35% improvement
Backend Load:
- Requests to microservices: Down 67% (deduplication)
- Cache hit rate: 84% (query plan + APQ caching)
Developer Velocity
Feature delivery time:
- Before: 2-3 weeks (coordinating multiple teams)
- After: 3-5 days (mobile team self-serves)
- Improvement: 70% faster
API change requests:
- Before: 14 per month (average)
- After: 2 per month (most changes handled client-side)
- Reduction: 86%
Infrastructure Costs
Gateway costs:
- Apollo Gateway (Node): $8,200/month (42 instances)
- Apollo Router (Rust): $1,400/month (3 instances)
- Savings: $6,800/month
Total cost impact (including reduced microservice load):
- Savings: $12,300/month ($147,600/year)
Lessons for Teams Considering Federation
✅ When Federation Makes Sense
- Multiple teams owning different domains
- Mobile/web apps making many API calls
- Microservices architecture already in place
- Need to move fast without backend bottlenecks
- Data relationships across services
❌ When Federation Doesn’t Make Sense
- Small team (<10 engineers) - overhead not worth it
- Monolithic backend - fix that first
- Simple CRUD - REST is fine
- No organizational scaling issues
- Team unfamiliar with GraphQL - learn GraphQL first
Migration Advice
If you’re implementing federation:
- Start small: 3-5 subgraphs maximum for POC
- Governance first: Schema registry + CI checks before scaling
- Use Apollo Router: Don’t waste time with Gateway
- DataLoaders everywhere: Prevent N+1 from day one
- Monitor query complexity: Set limits to prevent abuse
- Automated testing: Schema checks, integration tests, performance tests
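On the query-complexity point, even a crude depth limit catches the worst offenders. Here is a minimal sketch that measures selection-set nesting by counting braces. It assumes the query has no braces inside string literals, comments, or input-object arguments; a production check would walk the parsed AST instead. `queryDepth`, `MAX_DEPTH`, and `assertWithinDepth` are illustrative names, and the limit of 6 is an arbitrary example value.

```typescript
// Maximum nesting of `{ ... }` blocks in the query text.
function queryDepth(query: string): number {
  let depth = 0;
  let max = 0;
  for (let i = 0; i < query.length; i++) {
    const ch = query[i];
    if (ch === "{") {
      depth += 1;
      if (depth > max) max = depth;
    } else if (ch === "}") {
      depth -= 1;
    }
  }
  return max;
}

// Example limit only; tune it to the deepest query your clients legitimately send.
const MAX_DEPTH = 6;

function assertWithinDepth(query: string): void {
  const depth = queryDepth(query);
  if (depth > MAX_DEPTH) {
    throw new Error(`Query depth ${depth} exceeds limit of ${MAX_DEPTH}`);
  }
}
```

Rejecting over-deep queries before planning keeps one careless client from fanning out across every subgraph.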
Red flags to abort:
- Teams fighting over schema design (need better communication)
- No buy-in from backend teams (they’ll sabotage)
- Can’t articulate organizational benefit (why bother?)
- GraphQL is “just cool” (not a reason)
What’s Next?
We’re now exploring:
- Federated subscriptions - Real-time updates across subgraphs
- @defer/@stream directives - Progressive data loading
- Field-level authorization - Fine-grained security
- Subgraph deployment automation - GitOps for schemas
GraphQL Federation transformed how our mobile/web teams build features. The migration was harder than expected, but 147 fewer API coordination meetings per year = priceless.
For more on GraphQL Federation architecture patterns, see the comprehensive federation guide that helped inform our implementation.