GraphQL Federation at Scale: Stitching 87 Services Without Losing My Mind

The API Sprawl Nightmare

June 2024. Our mobile team filed their 14th ticket this month requesting API changes.

The problem: They needed data from 5 different REST APIs to render a single screen.

User Profile Screen requires:
├─ GET /users/{id}           (User Service)
├─ GET /orders/{userId}      (Order Service)
├─ GET /payments/{userId}    (Payment Service)
├─ GET /preferences/{userId} (Preference Service)
└─ GET /recommendations/{userId} (Recommendation Service)

Total: 5 sequential HTTP requests
Average load time: 2.3 seconds
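
To make that cost concrete, here is a minimal sketch of the orchestration code this screen forces onto the client. Only the five endpoint paths come from the list above; the `HttpGet` type, `loadProfileScreen`, and the recording transport are hypothetical names invented for illustration.

```typescript
// Hypothetical transport: takes a path, resolves with the parsed JSON body.
type HttpGet = (path: string) => Promise<unknown>;

interface ProfileScreenData {
  user: unknown;
  orders: unknown;
  payments: unknown;
  preferences: unknown;
  recommendations: unknown;
}

// Each await blocks the next call, so total latency is the SUM of all five
// round trips rather than the slowest single one.
async function loadProfileScreen(get: HttpGet, userId: string): Promise<ProfileScreenData> {
  const user = await get(`/users/${userId}`);
  const orders = await get(`/orders/${userId}`);
  const payments = await get(`/payments/${userId}`);
  const preferences = await get(`/preferences/${userId}`);
  const recommendations = await get(`/recommendations/${userId}`);
  return { user, orders, payments, preferences, recommendations };
}

// Fake transport that records every call, standing in for the real network.
function createRecordingGet(log: string[]): HttpGet {
  return async (path) => {
    log.push(path);
    return { path };
  };
}
```

Running it against the recording transport shows five separate round trips for one screen, which is the cost a single federated query is meant to remove.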

Our mobile engineers were furious. And they were right.

We had 87 microservices with 87 separate REST APIs. Every new feature required:

Coordinating with 3-5 backend teams
Waiting for API changes in their backlogs
Writing brittle orchestration code
Dealing with version mismatches

After reading about GraphQL Federation at scale, I proposed a radical solution: Unify everything under a federated GraphQL API.

My CTO’s reaction: “That sounds like schema hell waiting to happen.”

Spoiler: It was. But we survived.

The Decision: Why Federation Over Monolithic GraphQL

We considered three approaches:

Option 1: BFF (Backend for Frontend) per Platform

Pros: Each team owns their BFF, simple to reason about
Cons: Duplicates business logic, iOS/Android/Web all implement same patterns
Verdict: No. Too much duplication.

Option 2: Monolithic GraphQL Gateway

Pros: Single codebase, easy to deploy
Cons: Single team bottleneck, can’t scale teams independently
Verdict: Maybe, but doesn’t solve organizational issues.

Option 3: GraphQL Federation

Pros: Teams own their subgraphs, schema composition automatic, scales with org
Cons: Complex setup, schema conflicts, distributed debugging
Verdict: We chose this - organizational scaling was more important than technical complexity.

Phase 1: The Proof of Concept (Weeks 1-4)

We started with 3 services to prove federation worked:

Subgraphs:

User Service - User profiles, authentication
Order Service - Order history, tracking
Product Service - Product catalog, inventory

The First Subgraph: Users

# users-subgraph/schema.graphql
extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key", "@shareable"])

# Custom scalars must be declared in each subgraph that uses them
scalar DateTime

type User @key(fields: "id") {
  id: ID!
  email: String!
  name: String!
  createdAt: DateTime!
}

type Query {
  me: User
  user(id: ID!): User
}

Implementation (Node.js + Apollo Server):

// users-subgraph/src/resolvers.ts
import { Resolvers } from './generated/graphql';
import { getUserById } from './services/userService';

export const resolvers: Resolvers = {
  Query: {
    me: async (_, __, { userId }) => {
      if (!userId) throw new Error('Not authenticated');
      return getUserById(userId);
    },
    user: async (_, { id }) => {
      return getUserById(id);
    },
  },
  
  // Federation reference resolver
  User: {
    __resolveReference: async (reference) => {
      return getUserById(reference.id);
    },
  },
};

The Second Subgraph: Orders (with Type Extension)

# orders-subgraph/schema.graphql
extend schema
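  @link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key"])

# Custom scalars must be declared in each subgraph that uses them
scalar DateTime

# Contribute fields to the User entity owned by the users subgraph.
# Federation 2 no longer needs @external on the key field here.
type User @key(fields: "id") {
  id: ID!
  # Add orders field to User
  orders: [Order!]!
}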

type Order @key(fields: "id") {
  id: ID!
  userId: ID!
  items: [OrderItem!]!
  total: Float!
  status: OrderStatus!
  createdAt: DateTime!
}

enum OrderStatus {
  PENDING
  CONFIRMED
  SHIPPED
  DELIVERED
  CANCELLED
}

type OrderItem {
  productId: ID!
  quantity: Int!
  price: Float!
}

type Query {
  order(id: ID!): Order
}

Implementation:

// orders-subgraph/src/resolvers.ts
import { Resolvers } from './generated/graphql';
import { getOrdersByUserId, getOrderById } from './services/orderService';

export const resolvers: Resolvers = {
  Query: {
    order: async (_, { id }) => {
      return getOrderById(id);
    },
  },
  
  // Extend User type with orders field
  User: {
    orders: async (user) => {
      // 'user' is the entity representation sent by the gateway: just { __typename, id }
      return getOrdersByUserId(user.id);
    },
  },
  
  Order: {
    __resolveReference: async (reference) => {
      return getOrderById(reference.id);
    },
  },
};

Apollo Gateway Setup

// gateway/src/index.ts
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import { ApolloGateway, IntrospectAndCompose } from '@apollo/gateway';

const gateway = new ApolloGateway({
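  // IntrospectAndCompose is convenient for a proof of concept; Apollo recommends
  // against it in production, where a pre-composed supergraph schema is safer.
  supergraphSdl: new IntrospectAndCompose({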
    subgraphs: [
      { name: 'users', url: 'http://users-service:4001/graphql' },
      { name: 'orders', url: 'http://orders-service:4002/graphql' },
      { name: 'products', url: 'http://products-service:4003/graphql' },
    ],
  }),
});

const server = new ApolloServer({
  // Apollo Server 4 has no `subscriptions` option, so nothing needs disabling here
  gateway,
});

// Top-level await requires this file to run as an ES module
const { url } = await startStandaloneServer(server, {
  listen: { port: 4000 },
});

console.log(`🚀 Gateway ready at ${url}`);

The First Federated Query

query UserDashboard {
  me {
    id
    name
    email
    # This field is resolved by orders subgraph!
    orders {
      id
      total
      status
      createdAt
    }
  }
}

It worked! Data from two services, one query, automatic stitching.
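
Under the hood, the gateway answers this query in two steps: it asks the users subgraph for `me`, then sends the returned key to the orders subgraph as an entity representation. The sketch below imitates that flow with in-memory stand-ins. `usersSubgraph`, `ordersSubgraph`, and the sample data are hypothetical, and a real gateway does this over HTTP through each subgraph's `_entities` field, not through direct function calls.

```typescript
interface UserRepresentation {
  __typename: "User";
  id: string;
}

interface OrderSummary {
  id: string;
  total: number;
}

// Stand-in for the users subgraph: it owns id, name, and email.
const usersSubgraph = {
  me: () => ({ id: "u1", name: "Ada", email: "ada@example.com" }),
};

// Stand-in for the orders subgraph: it only knows orders keyed by user id.
const ordersByUserId: Record<string, OrderSummary[]> = {
  u1: [
    { id: "o1", total: 19.99 },
    { id: "o2", total: 5 },
  ],
};

const ordersSubgraph = {
  // Mirrors an _entities call: representations in, contributed fields out,
  // in the same order as the input.
  entities: (representations: UserRepresentation[]) =>
    representations.map((rep) => ({ orders: ordersByUserId[rep.id] || [] })),
};

function resolveUserDashboard() {
  // Step 1: fetch the fields the users subgraph owns.
  const me = usersSubgraph.me();
  // Step 2: send only the key fields to the orders subgraph.
  const contributed = ordersSubgraph.entities([{ __typename: "User", id: me.id }]);
  const first = contributed[0];
  // Step 3: merge both partial results into a single response.
  return { me: { ...me, orders: first ? first.orders : [] } };
}
```

The point of the sketch is that the orders side never sees `name` or `email`. It receives only the key, which is why `@key(fields: "id")` has to be declared on both sides.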

But then we tried to scale to 87 services…

The Schema Conflict Hell (Weeks 5-8)

Conflict 1: Type Name Collisions

Teams independently created Error types:

# payments-subgraph
type Error {  # CONFLICT!
  code: String!
  message: String!
}

# shipping-subgraph  
type Error {  # SAME NAME!
  errorCode: Int!
  description: String!
}

Apollo Gateway: “Error: Type ‘Error’ is defined in multiple subgraphs with incompatible fields.”

Solution: Enforce naming conventions:

# New convention: prefix with service name
type PaymentError {
  code: String!
  message: String!
}

type ShippingError {
  errorCode: Int!
  description: String!
}
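
A convention only helps if something enforces it. As a rough sketch of the kind of pre-merge check that catches these collisions early, the snippet below scans each subgraph's SDL for non-entity object types and reports names declared in more than one place. The function names are hypothetical, and the regex parsing is a deliberate simplification; a real check would parse the SDL properly or rely on composition itself.

```typescript
const ROOT_TYPES = ["Query", "Mutation", "Subscription"];

// Collect object types that are neither root types nor entities (@key),
// since entities are meant to appear in several subgraphs.
function declaredValueTypes(sdl: string): string[] {
  const names: string[] = [];
  const pattern = /^\s*type\s+([A-Za-z_]\w*)([^{]*)\{/gm;
  let match = pattern.exec(sdl);
  while (match !== null) {
    const name = match[1] || "";
    const directives = match[2] || "";
    if (name && ROOT_TYPES.indexOf(name) === -1 && directives.indexOf("@key") === -1) {
      names.push(name);
    }
    match = pattern.exec(sdl);
  }
  return names;
}

// Map each colliding type name to the subgraphs that declare it.
function findCollisions(subgraphs: Record<string, string>): Map<string, string[]> {
  const owners = new Map<string, string[]>();
  Object.keys(subgraphs).forEach((subgraph) => {
    declaredValueTypes(subgraphs[subgraph] || "").forEach((typeName) => {
      const list = owners.get(typeName) || [];
      list.push(subgraph);
      owners.set(typeName, list);
    });
  });
  const collisions = new Map<string, string[]>();
  owners.forEach((list, typeName) => {
    if (list.length > 1) collisions.set(typeName, list);
  });
  return collisions;
}
```

Fed the two `Error` definitions above, `findCollisions` would flag `Error` as owned by both payments and shipping, while leaving a shared `User @key(...)` entity alone.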

Conflict 2: Shared Scalar Conflicts

Multiple teams defined DateTime differently:

# subgraph-a
scalar DateTime  # ISO 8601 string

# subgraph-b
scalar DateTime  # Unix timestamp (number)

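Solution: Centralize common scalars in a “shared” subgraph:

# shared-types-subgraph
scalar DateTime
scalar JSON
scalar Email
scalar URL

All other subgraphs must declare these scalars exactly as they appear here and use the same serialization, not redefine them. Note that @shareable applies to object types and fields, not scalars, so it does not belong on these declarations. Composition merges scalars by name and cannot check how each subgraph serializes them, which is why the convention has to be enforced outside the schema.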
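
Because composition cannot see serialization, the agreement has to live in code that every subgraph shares. Here is a minimal sketch of such a shared helper. It assumes ISO 8601 UTC strings as the wire format and treats bare numbers as Unix seconds; both are illustrative assumptions, and the function names are hypothetical. In a real subgraph these functions would be wired into the scalar's serialize and parse hooks.

```typescript
// Serialize: accept a Date, an ISO 8601 string, or Unix seconds,
// and always emit an ISO 8601 UTC string.
function serializeDateTime(value: unknown): string {
  let date: Date;
  if (value instanceof Date) {
    date = value;
  } else if (typeof value === "number") {
    date = new Date(value * 1000); // assumption: numbers are Unix SECONDS
  } else if (typeof value === "string") {
    date = new Date(value);
  } else {
    throw new TypeError("DateTime cannot represent value: " + String(value));
  }
  if (isNaN(date.getTime())) {
    throw new TypeError("DateTime received an invalid date");
  }
  return date.toISOString();
}

// Parse: clients must send ISO 8601 strings; anything else is rejected.
function parseDateTime(value: unknown): Date {
  if (typeof value !== "string") {
    throw new TypeError("DateTime must be an ISO 8601 string");
  }
  const date = new Date(value);
  if (isNaN(date.getTime())) {
    throw new TypeError("DateTime received an invalid date string");
  }
  return date;
}
```

With one helper like this, a subgraph that stores timestamps as numbers and one that stores strings both emit the same shape on the wire.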

Conflict 3: Entity Key Mismatches

Two subgraphs tried to extend User with different keys:

# analytics-subgraph
type User @key(fields: "id") {
  id: ID!
  analyticsData: AnalyticsData!
}

# recommendations-subgraph
type User @key(fields: "email") {  # DIFFERENT KEY!
  email: String!
  recommendations: [Product!]!
}

Apollo Gateway: Refused to compose. Keys must match.

Solution: Agreed on id as canonical key, added resolvers to fetch by email when needed.
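
As a rough sketch of what "fetch by email when needed" can look like once `id` is the only key: the subgraph receives `{ __typename, id }` from the gateway and performs the email lookup internally. Everything here is hypothetical (the `emailById` and `recommendationsByEmail` stores, `lookupEmail`, and the resolver shape), and the in-memory maps stand in for whatever service actually holds the data.

```typescript
interface ProductRef {
  id: string;
}

interface UserReference {
  __typename: "User";
  id: string;
}

// Hypothetical stand-ins for real data sources.
const emailById: Record<string, string> = { u1: "ada@example.com" };
const recommendationsByEmail: Record<string, ProductRef[]> = {
  "ada@example.com": [{ id: "p7" }, { id: "p9" }],
};

// Translate the canonical key (id) into the legacy key (email).
async function lookupEmail(id: string): Promise<string | undefined> {
  return emailById[id];
}

const recommendationResolvers = {
  User: {
    // The gateway only ever sends the canonical key; the email lookup
    // stays an internal detail of this subgraph.
    recommendations: async (user: UserReference): Promise<ProductRef[]> => {
      const email = await lookupEmail(user.id);
      if (email === undefined) return [];
      return recommendationsByEmail[email] || [];
    },
  },
};
```

Federation's `@requires` directive is another option: the subgraph marks `email` as `@external` and lets the gateway supply it from the users subgraph, which avoids the extra lookup.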

Governance at Scale: The Schema Registry

After hitting our 20th schema conflict, we implemented strict governance.

Apollo Studio Schema Registry

# Install Rover CLI
npm install -g @apollo/rover

# Configure Studio
rover config auth

# Publish subgraph schemas
rover subgraph publish my-supergraph@main \
  --name users \
  --schema ./users-subgraph/schema.graphql \
  --routing-url https://users-service.prod/graphql

Checks before deployment:

# .github/workflows/schema-check.yml
name: Schema Check

on:
  pull_request:
    paths:
      - '**/schema.graphql'

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install Rover
        run: npm install -g @apollo/rover
      
      - name: Schema Check
        env:
          APOLLO_KEY: ${{ secrets.APOLLO_KEY }}
        run: |
          rover subgraph check my-supergraph@main \
            --name users \
            --schema ./users-subgraph/schema.graphql
      
      # Fails PR if breaking changes detected!

This CI check prevented 47 breaking changes from reaching production.
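
For a feel of what "breaking" means here, the classic case is a field that clients may still query being removed. The toy sketch below diffs field names per type between two SDL strings. It is not how `rover subgraph check` works internally; the function names are hypothetical and the regex parsing ignores many SDL features.

```typescript
// Map each "type Name { ... }" block to the set of field names it declares.
function fieldsByType(sdl: string): Map<string, Set<string>> {
  const result = new Map<string, Set<string>>();
  const typePattern = /type\s+(\w+)[^{]*\{([^}]*)\}/g;
  let typeMatch = typePattern.exec(sdl);
  while (typeMatch !== null) {
    const fields = new Set<string>();
    const body = typeMatch[2] || "";
    body.split("\n").forEach((line) => {
      // A field line starts with a name followed by "(" (arguments) or ":".
      const fieldMatch = /^\s*(\w+)\s*(\(|:)/.exec(line);
      if (fieldMatch && fieldMatch[1]) fields.add(fieldMatch[1]);
    });
    result.set(typeMatch[1] || "", fields);
    typeMatch = typePattern.exec(sdl);
  }
  return result;
}

// List every Type.field present before but missing after.
function removedFields(before: string, after: string): string[] {
  const next = fieldsByType(after);
  const removed: string[] = [];
  fieldsByType(before).forEach((fields, typeName) => {
    const remaining = next.get(typeName);
    fields.forEach((field) => {
      if (!remaining || !remaining.has(field)) removed.push(typeName + "." + field);
    });
  });
  return removed;
}
```

A non-empty result is the kind of change that should fail the pull request; additive changes produce an empty list and pass.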

Schema Linting Rules

We created custom linting rules:

// schema-linter.js
// The first three rule names mirror graphql-eslint rules;
// 'require-id-key' is our own custom rule.

module.exports = {
  rules: {
    // Enforce naming conventions
    'naming-convention': {
      types: 'PascalCase',
      FieldDefinition: 'camelCase',
      EnumValueDefinition: 'UPPER_CASE',
    },
    
    // Require descriptions
    'require-description': {
      types: true,
      FieldDefinition: true,
    },
    
    // Deprecation warnings
    'require-deprecation-reason': true,
    
    // No ID fields without @key
    'require-id-key': {
      types: ['User', 'Order', 'Product'],
    },
  },
};

Performance: The N+1 Query Problem

The Problem

Query:

query {
  users(limit: 100) {
    id
    name
    orders {  # N+1!
      id
      total
    }
  }
}

Execution:

Gateway fetches 100 users from users-subgraph
Gateway sends all 100 user keys to orders-subgraph in one batched _entities request
Inside orders-subgraph, the User.orders resolver runs once per user: 100 sequential backend lookups

Latency: 8.2 seconds for 100 users. Unacceptable.

Solution: DataLoader Pattern

// orders-subgraph/src/dataloaders.ts
import DataLoader from 'dataloader';
// Assumes codegen exports the Order type alongside Resolvers
import type { Order } from './generated/graphql';
import { getOrdersByUserIds } from './services/orderService';

// Batch user order fetches. Create one loader per request so its
// in-memory cache never leaks data between users.
export const createOrdersDataLoader = () => {
  return new DataLoader<string, Order[]>(
    async (userIds: readonly string[]) => {
      // Single database query for all userIds
      const ordersMap = await getOrdersByUserIds([...userIds]);
      
      // Return in same order as input
      return userIds.map(userId => ordersMap[userId] || []);
    },
    {
      // Wait 10ms before dispatching so more keys join the same batch.
      // Results are memoized for the lifetime of this loader instance.
      batchScheduleFn: (callback) => setTimeout(callback, 10),
    }
  );
};

// Use in resolver
export const resolvers: Resolvers = {
  User: {
    orders: async (user, _, { dataloaders }) => {
      return dataloaders.orders.load(user.id);
    },
  },
};

After DataLoader:

100 users: 1 batch query instead of 100 individual queries
Latency: 140ms (98% improvement!)
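
If the batching feels like magic, a stripped-down loader shows the core idea: every `load()` issued in the same synchronous pass is queued, and one batch call resolves them all. This is a teaching sketch, not the `dataloader` library. `TinyLoader` is a hypothetical name, and it leaves out caching, key deduplication, and per-key error handling.

```typescript
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
}

// Collects every load() issued before the next microtask and resolves
// them all with ONE call to the batch function.
class TinyLoader<K, V> {
  private batchFn: BatchFn<K, V>;
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first key in an empty queue schedules the flush to run
      // after the current synchronous code finishes.
      if (this.queue.length === 1) {
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  private flush(): void {
    const batch = this.queue;
    this.queue = [];
    this.batchFn(batch.map((item) => item.key)).then(
      // The batch function must return values in the same order as the keys.
      (values) => batch.forEach((item, index) => item.resolve(values[index] as V)),
      (error) => batch.forEach((item) => item.reject(error)),
    );
  }
}
```

When 100 `User.orders` resolvers each call `load(user.id)`, the batch function runs once with 100 keys, which is where the drop from 100 lookups to 1 comes from.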

Migrating 87 Services: The Strategy

We couldn’t migrate everything at once. Our approach:

Wave 1: High-Value, Low-Complexity (Months 2-3)

User service
Product catalog
Order history
Payment methods
Shipping info

Why these first? Most commonly accessed by mobile apps, relatively simple schemas.

Wave 2: Medium Complexity (Months 4-5)

Recommendations engine
Search service
Analytics
Notifications
Reviews/Ratings

Wave 3: Complex Domains (Months 6-7)

Fraud detection
Inventory management
Pricing engine
Promotions
Tax calculation

Wave 4: Legacy Systems (Months 8-9)

Mainframe integration
Third-party vendor APIs
Internal admin tools

The Migration Pattern

For each service:

1. Create GraphQL subgraph (federated schema)
   ↓
2. Implement resolvers (fetch from existing REST API)
   ↓
3. Publish schema to registry
   ↓
4. Deploy subgraph to staging
   ↓
5. Run schema checks & integration tests
   ↓
6. Canary deploy (5% → 25% → 50% → 100%)
   ↓
7. Monitor for 1 week
   ↓
8. Deprecate REST endpoints

Parallel migrations: Up to 6 teams migrating simultaneously.
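
The canary percentages in step 6 need some rule for deciding which traffic takes the new path. The pattern above does not say how the split was done, so treat the following as one common approach, not a description of our setup: hash a stable identifier into a bucket from 0 to 99 and compare it with the current rollout percentage. `canaryBucket` and `routeToGraphQL` are hypothetical names.

```typescript
// Deterministic 32-bit FNV-1a hash over UTF-16 code units, so a given
// user always lands in the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i += 1) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Bucket in the range 0..99.
function canaryBucket(userId: string): number {
  return fnv1a(userId) % 100;
}

// True when this user should be served by the new subgraph.
function routeToGraphQL(userId: string, rolloutPercent: number): boolean {
  return canaryBucket(userId) < rolloutPercent;
}
```

Because the bucket is stable, anyone included at 5% stays included at 25%, 50%, and 100%, so users do not flip between the old and new paths as the rollout widens.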

Production Rollout: Apollo Router

After 3 months, we hit Apollo Gateway performance limits:

Single-threaded Node.js
15-25ms gateway overhead per query
Couldn’t saturate a 16-core machine

Solution: Migrate to Apollo Router (Rust-based).

Performance Comparison

Metric	Apollo Gateway (Node)	Apollo Router (Rust)
Latency overhead	18ms	1.2ms
Throughput/core	1,200 req/sec	12,000 req/sec
Memory/instance	840MB	95MB
CPU utilization	1 core only	All cores

Apollo Router config:

# router.yaml
# Key names vary between Router releases; validate against your version's config schema.
supergraph:
  # Reject introspection queries in production.
  # The supergraph schema itself comes from Apollo Studio via APOLLO_KEY / APOLLO_GRAPH_REF.
  introspection: false
  # Query plan caching
  query_planning:
    cache:
      redis:
        urls: ["redis://redis-cluster:6379"]
        ttl: 3600s

# Performance tuning
traffic_shaping:
  # Limit on client requests entering the router
  router:
    global_rate_limit:
      capacity: 10000
      interval: 1s
  
  # Per-subgraph limits
  subgraphs:
    users:
      timeout: 30s
      global_rate_limit:
        capacity: 5000
        interval: 1s
    
    orders:
      timeout: 60s
      global_rate_limit:
        capacity: 3000
        interval: 1s

# Distributed caching of automatic persisted queries (APQ)
apq:
  enabled: true
  router:
    cache:
      redis:
        urls: ["redis://redis-cluster:6379"]
        ttl: 300s

Deployment:

# k8s/apollo-router.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apollo-router
  namespace: graphql
spec:
  replicas: 3
  selector:
    matchLabels:
      app: apollo-router
  template:
    metadata:
      labels:
        app: apollo-router
    spec:
      containers:
      - name: router
        image: ghcr.io/apollographql/router:v1.28.0
        ports:
        - containerPort: 4000
        env:
        - name: APOLLO_KEY
          valueFrom:
            secretKeyRef:
              name: apollo-credentials
              key: key
        - name: APOLLO_GRAPH_REF
          value: "my-supergraph@main"
        resources:
          requests:
            cpu: "2"
            memory: "1Gi"
          limits:
            cpu: "4"
            memory: "2Gi"
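        livenessProbe:
          httpGet:
            # The Router serves /health on its health-check listener (port 8088 by default),
            # not on the GraphQL port. The listener must bind 0.0.0.0 for the kubelet to reach it.
            path: /health
            port: 8088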
          initialDelaySeconds: 10
          periodSeconds: 10

The Results: 9 Months Later

Performance Metrics

API Response Times:

Before (REST): Average 2.3s for complex screens
After (GraphQL): Average 850ms
Improvement: 63% faster

Mobile App Performance:

Requests per screen: 5-8 → 1
Network data transferred: 450KB → 120KB
Battery impact: 35% improvement

Backend Load:

Requests to microservices: Down 67% (deduplication)
Cache hit rate: 84% (query plan + APQ caching)

Developer Velocity

Feature delivery time:

Before: 2-3 weeks (coordinating multiple teams)
After: 3-5 days (mobile team self-serves)
Improvement: 70% faster

API change requests:

Before: 14 per month (average)
After: 2 per month (most changes handled client-side)
Reduction: 86%

Infrastructure Costs

Gateway costs:

Apollo Gateway (Node): $8,200/month (42 instances)
Apollo Router (Rust): $1,400/month (3 instances)
Savings: $6,800/month

Total cost impact (including reduced microservice load):

Savings: $12,300/month ($147,600/year)
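
The headline percentages in this section are simple before/after ratios, and they can be re-derived from the raw numbers quoted in this post. The `percentChange` helper is introduced here purely for illustration.

```typescript
// Relative reduction from `before` to `after`, as a percentage.
function percentChange(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Complex-screen latency: 2.3s -> 850ms
const latencyImprovement = percentChange(2300, 850);

// N+1 fix: 8.2s -> 140ms
const dataLoaderImprovement = percentChange(8200, 140);

// API change requests: 14 -> 2 per month
const ticketReduction = percentChange(14, 2);

// Gateway cost: $8,200 -> $1,400 per month
const gatewaySavingsPerMonth = 8200 - 1400;

// Total savings: $12,300 per month, annualized
const totalSavingsPerYear = 12300 * 12;

console.log({
  latencyImprovement: Math.round(latencyImprovement),
  dataLoaderImprovement: Math.round(dataLoaderImprovement),
  ticketReduction: Math.round(ticketReduction),
  gatewaySavingsPerMonth,
  totalSavingsPerYear,
});
```

Rounded, these reproduce the 63%, 98%, and 86% figures, the $6,800 monthly gateway saving, and the $147,600 annual total.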

Lessons for Teams Considering Federation

✅ When Federation Makes Sense

Multiple teams owning different domains
Mobile/web apps making many API calls
Microservices architecture already in place
Need to move fast without backend bottlenecks
Data relationships across services

❌ When Federation Doesn’t Make Sense

Small team (<10 engineers) - overhead not worth it
Monolithic backend - fix that first
Simple CRUD - REST is fine
No organizational scaling issues
Team unfamiliar with GraphQL - learn GraphQL first

Migration Advice

If you’re implementing federation:

Start small: 3-5 subgraphs maximum for POC
Governance first: Schema registry + CI checks before scaling
Use Apollo Router: Don’t waste time with Gateway
DataLoaders everywhere: Prevent N+1 from day one
Monitor query complexity: Set limits to prevent abuse
Automated testing: Schema checks, integration tests, performance tests
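
On the query complexity point, nesting depth is one simple proxy that is cheap to enforce. Below is a minimal sketch that measures the deepest level of curly braces in an operation string while skipping comments and string literals. `queryDepth` and `assertDepthWithin` are hypothetical names, and because it counts braces on raw text, input-object arguments also count as nesting. A production limiter would work on the parsed AST and would likely weigh list fields too.

```typescript
// Depth = deepest nesting of "{ ... }" in the operation text.
function queryDepth(query: string): number {
  let depth = 0;
  let max = 0;
  let inString = false;
  let inComment = false;
  for (let i = 0; i < query.length; i += 1) {
    const ch = query.charAt(i);
    if (inComment) {
      // Comments run to the end of the line.
      if (ch === "\n") inComment = false;
      continue;
    }
    if (inString) {
      if (ch === "\\") i += 1; // skip the escaped character
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === "#") inComment = true;
    else if (ch === '"') inString = true;
    else if (ch === "{") {
      depth += 1;
      if (depth > max) max = depth;
    } else if (ch === "}") depth -= 1;
  }
  return max;
}

// Reject operations nested deeper than the configured limit.
function assertDepthWithin(query: string, maxDepth: number): void {
  const depth = queryDepth(query);
  if (depth > maxDepth) {
    throw new Error(`Query depth ${depth} exceeds limit ${maxDepth}`);
  }
}
```

A check like this would run before query planning, so a pathologically nested query is rejected before it fans out to any subgraph.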

Red flags to abort:

Teams fighting over schema design (need better communication)
No buy-in from backend teams (they’ll sabotage)
Can’t articulate organizational benefit (why bother?)
GraphQL is “just cool” (not a reason)

What’s Next?

We’re now exploring:

Federated subscriptions - Real-time updates across subgraphs
@defer/@stream directives - Progressive data loading
Field-level authorization - Fine-grained security
Subgraph deployment automation - GitOps for schemas

GraphQL Federation transformed how our mobile/web teams build features. The migration was harder than expected, but 147 fewer API coordination meetings per year = priceless.

For more on GraphQL Federation architecture patterns, see the comprehensive federation guide that helped inform our implementation.
