The Catalyst: Developer Experience Crisis
After reading about platform engineering as the future of DevOps, I knew we had a problem. Our 40-person engineering team was struggling with:
- 16 different deployment methods across microservices
- 5-day average for new engineer environment setup
- 142 internal tools with no central discovery
- 4+ hours per week per developer on infrastructure tasks
Our VP of Engineering gave us a mandate: “Make it as easy to deploy at our company as it is to deploy on Vercel.”
No pressure.
This is the story of how we built our Internal Developer Platform (IDP) in 6 months, the critical decisions we made, and the ROI that justified the investment.
Phase 1: Discovery & Platform Team Formation
The Team Structure Debate
We debated three approaches:
Option A: Embedded Platform Engineers
- Platform engineers embedded in product teams
- Pros: Domain knowledge, trust
- Cons: Inconsistent standards, hard to scale
Option B: Centralized Platform Team
- Dedicated platform team as service provider
- Pros: Focused expertise, consistency
- Cons: Risk of ivory tower syndrome
Option C: Hybrid Model
- Core platform team + embedded liaisons
- Pros: Best of both worlds
- Cons: Complex communication overhead
We chose Option B with a twist: 3-person core platform team with rotating “platform advocate” seats where product engineers joined for 1 quarter.
Defining Platform Team Scope
We clarified what the platform team would and wouldn’t own:
Platform Team Owns:
- Developer portal (Backstage)
- CI/CD pipeline templates
- Infrastructure provisioning automation
- Service scaffolding & templates
- Observability stack integration
- Secret management
Product Teams Own:
- Application code
- Service-specific configurations
- Database schemas
- API design
- Feature flags
This boundary definition prevented scope creep.
Phase 2: Technology Selection
The Build vs. Buy Decision
We evaluated the landscape:
| Solution | Build Effort | Maintenance | Flexibility | Cost |
|---|---|---|---|---|
| Build from scratch | 12+ months | High | Total | $0 software |
| Backstage (OSS) | 4-6 months | Medium | High | $0 software |
| Commercial IDP | 2-3 months | Low | Limited | $150k+/year |
We chose Backstage for three reasons:
- Open-source with strong Spotify backing
- Plugin ecosystem aligned with our stack
- Flexibility to customize without vendor lock-in
The Core Stack
```yaml
# Our platform architecture
developer_portal:
  core: backstage
  authentication: okta-saml
  catalog: service-catalog + component-discovery
cicd:
  orchestration: github-actions
  runners: kubernetes-self-hosted
  templates: cookiecutter + terraform-modules
infrastructure:
  iac: terraform
  provisioning: crossplane
  orchestration: kubernetes (EKS)
observability:
  metrics: prometheus + grafana
  logs: loki
  traces: tempo
  alerts: alertmanager
secrets:
  management: vault
  injection: external-secrets-operator
```
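The Vault-to-Kubernetes leg of that stack is worth a concrete illustration. With the External Secrets Operator, each service declares an `ExternalSecret` and the operator materializes a Kubernetes Secret from Vault. A minimal sketch — the store name `vault-backend` and the Vault path layout are illustrative, not our actual values:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: user-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # illustrative ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: user-service-env       # Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: services/user-service   # illustrative Vault path
        property: db_password
```

The operator keeps the target Secret in sync on the `refreshInterval`, so rotating a credential in Vault propagates without redeploying the service.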
Phase 3: Building the MVP (Months 1-3)
Golden Path: Service Scaffolding
Our first deliverable: one-command service creation.
We built a service template generator integrated with Backstage:
```yaml
# template.yaml - Backstage Software Template
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golang-microservice
  title: Go Microservice
  description: Create a new Go-based microservice with complete CI/CD
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Information
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z0-9-]+$'
        description:
          title: Description
          type: string
        owner:
          title: Team Owner
          type: string
          ui:field: OwnerPicker
    - title: Infrastructure Configuration
      properties:
        database:
          title: Needs Database?
          type: boolean
        database_type:
          title: Database Type
          type: string
          enum: ['postgresql', 'mysql', 'mongodb']
          ui:visible:
            required: ['database']
        cache:
          title: Needs Redis Cache?
          type: boolean
        replicas:
          title: Replica Count
          type: number
          default: 3
          minimum: 1
          maximum: 10
  steps:
    - id: fetch-base
      name: Fetch Base Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=our-org&repo=${{ parameters.name }}
        description: ${{ parameters.description }}
        defaultBranch: main
    - id: provision-infrastructure
      name: Provision Infrastructure
      action: terraform:apply   # custom action wrapping our Terraform modules
      input:
        module: platform-templates/microservice
        vars:
          service_name: ${{ parameters.name }}
          database_enabled: ${{ parameters.database }}
          database_type: ${{ parameters.database_type }}
          cache_enabled: ${{ parameters.cache }}
          replicas: ${{ parameters.replicas }}
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: CI/CD Pipeline
        url: ${{ steps['create-repo'].output.remoteUrl }}/actions
      - title: Service Dashboard
        url: https://grafana.ourcompany.com/d/${{ parameters.name }}
```
Result: New service setup dropped from 3 days to 8 minutes.
The TechDocs Integration
Documentation was scattered. We centralized it in Backstage using TechDocs (MkDocs backend):
```yaml
# mkdocs.yml in each service repo
site_name: User Service Documentation
plugins:
  - techdocs-core
nav:
  - Home: index.md
  - Architecture:
      - Overview: architecture/overview.md
      - Database Schema: architecture/database.md
      - API Endpoints: architecture/api.md
  - Operations:
      - Deployment: operations/deployment.md
      - Monitoring: operations/monitoring.md
      - Incident Response: operations/incidents.md
  - Development:
      - Local Setup: development/setup.md
      - Testing: development/testing.md
      - Contributing: development/contributing.md
```
Engineers could now find service documentation directly in the developer portal—no more Confluence archaeology.
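For TechDocs to pick those docs up, each repo's `catalog-info.yaml` points the portal at the MkDocs source via the standard `backstage.io/techdocs-ref` annotation. A minimal sketch — the service description and owning team are illustrative:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: user-service
  description: Handles user accounts and profiles   # illustrative
  annotations:
    backstage.io/techdocs-ref: dir:.   # build docs from this repo's mkdocs.yml
spec:
  type: service
  lifecycle: production
  owner: team-identity                 # illustrative team name
```

Because this file is scaffolded by the golden-path template, new services get discoverable docs on day one.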
The Kubernetes Abstraction Layer
We created a simplified Kubernetes abstraction using Crossplane to hide complexity:
```yaml
# ServiceDeployment CRD
apiVersion: platform.ourcompany.com/v1alpha1
kind: ServiceDeployment
metadata:
  name: user-service
spec:
  # Simple developer-facing spec
  image: user-service:v1.2.3
  replicas: 3
  resources:
    cpu: 500m
    memory: 512Mi
  database:
    type: postgresql
    version: "14"
    storage: 20Gi
  cache:
    enabled: true
    memory: 2Gi
  ingress:
    hostname: api.ourcompany.com
    path: /users
  monitoring:
    enabled: true
    alerts:
      - name: high-error-rate
        threshold: 5%
      - name: high-latency
        threshold: 1000ms
---
# This gets expanded by Crossplane into:
# - Deployment
# - Service
# - HorizontalPodAutoscaler
# - PodDisruptionBudget
# - PostgreSQL instance (via AWS RDS)
# - Redis instance (via ElastiCache)
# - Ingress with cert-manager
# - ServiceMonitor for Prometheus
# - PrometheusRules for alerts
# - NetworkPolicies
```

Developers now declared intent, not implementation.
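Behind that claim type sits a Crossplane CompositeResourceDefinition that registers `ServiceDeployment` as a namespaced claim. A trimmed sketch of the shape ours takes (the full OpenAPI schema is elided):

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xservicedeployments.platform.ourcompany.com
spec:
  group: platform.ourcompany.com
  names:
    kind: XServiceDeployment
    plural: xservicedeployments
  claimNames:                 # exposes the namespaced ServiceDeployment claim
    kind: ServiceDeployment
    plural: servicedeployments
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object    # image, replicas, database, cache, ... (elided)
```

A matching Composition then maps each claim onto the Deployment, RDS instance, ElastiCache cluster, and the rest of the resources listed above.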
Phase 4: Adoption Challenges (Months 4-5)
Challenge 1: The Legacy Service Problem
40% of our services predated the platform. Migrating them was painful.
Solution: We created a migration tool that automated 70% of the work:
```python
# migrate_to_platform.py
import yaml
from pathlib import Path


def migrate_service(service_path):
    """Migrate a legacy service to platform-managed deployment."""
    service_path = Path(service_path)
    print(f"Migrating service at {service_path}")

    # 1. Detect current deployment method
    deployment_type = detect_deployment(service_path)
    # 2. Extract configuration
    config = extract_config(service_path, deployment_type)
    # 3. Generate platform manifests
    manifests = generate_platform_manifests(config)
    # 4. Create PR with migration
    pr_url = create_migration_pr(service_path, manifests)
    # 5. Generate migration runbook
    runbook = generate_runbook(config, deployment_type)

    return {
        'status': 'migration_ready',
        'pr_url': pr_url,
        'runbook': runbook,
    }


def detect_deployment(service_path):
    """Detect legacy deployment method."""
    if (service_path / 'deploy.sh').exists():
        return 'bash_script'
    elif (service_path / 'Jenkinsfile').exists():
        return 'jenkins'
    elif (service_path / '.circleci').exists():
        return 'circleci'
    elif (service_path / 'docker-compose.yml').exists():
        return 'docker_compose'
    return 'unknown'


def extract_config(service_path, deployment_type):
    """Extract config from legacy deployment."""
    config = {'name': service_path.name}
    if deployment_type == 'docker_compose':
        compose = yaml.safe_load(
            (service_path / 'docker-compose.yml').read_text()
        )
        service = list(compose['services'].values())[0]
        config.update({
            'image': service.get('image', ''),
            'ports': service.get('ports', []),
            'environment': service.get('environment', {}),
            'volumes': service.get('volumes', []),
            # depends_on may be a list or a mapping in compose files
            'depends_on': list(service.get('depends_on', {})),
        })
    # Add more extractors for other deployment types
    return config


def generate_platform_manifests(config):
    """Generate platform CRD manifests."""
    return f"""
apiVersion: platform.ourcompany.com/v1alpha1
kind: ServiceDeployment
metadata:
  name: {config['name']}
spec:
  image: {config['image']}
  replicas: {config.get('replicas', 3)}
  resources:
    cpu: {config.get('cpu', '500m')}
    memory: {config.get('memory', '512Mi')}
  database:
    type: {infer_database_type(config)}
  ingress:
    path: {config.get('path', '/')}
"""

# create_migration_pr, generate_runbook, and infer_database_type are
# internal helpers (GitHub API calls, runbook templating) omitted here.

# Migration results: 28/40 legacy services migrated in 2 weeks
```
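The detection heuristic is easy to exercise in isolation. A self-contained sketch of the same marker-file logic, run against a throwaway directory (file contents are placeholders):

```python
import tempfile
from pathlib import Path

# Marker files checked in priority order, mirroring detect_deployment above
MARKERS = [
    ("deploy.sh", "bash_script"),
    ("Jenkinsfile", "jenkins"),
    (".circleci", "circleci"),
    ("docker-compose.yml", "docker_compose"),
]


def detect(repo: Path) -> str:
    """Return the first legacy deployment method whose marker file exists."""
    for filename, method in MARKERS:
        if (repo / filename).exists():
            return method
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    print(detect(repo))                                  # unknown: empty repo
    (repo / "docker-compose.yml").write_text("services: {}\n")
    print(detect(repo))                                  # docker_compose
    (repo / "Jenkinsfile").write_text("pipeline {}\n")
    print(detect(repo))                                  # jenkins: higher priority
```

Ordering matters: a repo with both a `Jenkinsfile` and a compose file is treated as a Jenkins migration, since that pipeline usually owns the real deployment.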
Challenge 2: Resistance to Change
Some senior engineers resisted the platform, preferring their custom setups.
What didn’t work: mandates from management.
What worked:
- Show, don’t tell: Live demos of the platform’s speed
- Metrics: Published deployment time comparisons
- Early wins: Migrated the most painful services first
- Champions: Empowered advocates in each team
Challenge 3: The “Magic” Problem
Abstraction is powerful but can feel like magic. When things broke, developers were confused.
Solution: Built a “platform explainer” command:
```
$ platform explain user-service

Service: user-service
Platform Version: v1.3.0

Generated Resources:
  ✓ Deployment (user-service)
  ✓ Service (user-service)
  ✓ HorizontalPodAutoscaler (user-service-hpa)
  ✓ PodDisruptionBudget (user-service-pdb)
  ✓ RDS Instance (user-service-db)
  ✓ Redis Instance (user-service-cache)
  ✓ Ingress (user-service-ingress)
  ✓ Certificate (user-service-cert)
  ✓ ServiceMonitor (user-service-metrics)
  ✓ PrometheusRule (user-service-alerts)

To see the full Kubernetes manifests:
  $ platform manifests user-service --output yaml

To debug a specific resource:
  $ kubectl describe deployment user-service
  $ kubectl logs -l app=user-service
```
Transparency built trust.
Phase 5: Production & Scale (Month 6)
Metrics That Mattered
After 6 months, we measured:
| Metric | Before Platform | After Platform | Improvement |
|---|---|---|---|
| Time to deploy new service | 3 days | 8 minutes | 99.8% |
| Deployment frequency | 2.3/week | 8.7/week | 278% |
| New engineer onboarding | 5 days | 1.8 hours | 95% |
| Mean time to recovery | 45 min | 12 min | 73% |
| Infrastructure incidents | 12/month | 3/month | 75% |
| Developer satisfaction | 6.2/10 | 8.9/10 | 44% |
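The improvement column is plain percentage math relative to the "before" value; a quick sanity check of a few rows:

```python
def pct_change(before, after):
    """Percentage improvement relative to the 'before' value."""
    return abs(after - before) / before * 100

# Deployment frequency: 2.3 -> 8.7 per week
print(round(pct_change(2.3, 8.7)))   # 278
# Mean time to recovery: 45 -> 12 minutes
print(round(pct_change(45, 12)))     # 73
# Infrastructure incidents: 12 -> 3 per month
print(round(pct_change(12, 3)))      # 75
# Developer satisfaction: 6.2 -> 8.9
print(round(pct_change(6.2, 8.9)))   # 44
```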
The Unexpected ROI
Direct cost savings:
- $12,000/month in reduced AWS costs (better resource utilization)
- $28,000/month in engineering time saved (4 hours/week per engineer)
Hidden value:
- Faster feature delivery → increased revenue
- Better developer experience → improved retention
- Standardized observability → fewer incidents
Total ROI: 340% in first year
The Architecture We Landed On
```
┌─────────────────────────────────────────────────────────────┐
│                      Backstage Portal                       │
│  ┌────────────────┬─────────────────┬───────────────────┐   │
│  │ Service Catalog│    TechDocs     │ Software Templates│   │
│  └────────────────┴─────────────────┴───────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│             Platform Control Plane (Kubernetes)             │
│                                                             │
│  ┌──────────────────┐        ┌──────────────────┐           │
│  │    Crossplane    │        │      ArgoCD      │           │
│  │  (IaC Operator)  │        │     (GitOps)     │           │
│  └──────────────────┘        └──────────────────┘           │
│                                                             │
│  ┌──────────────────┐        ┌──────────────────┐           │
│  │ External Secrets │        │    Prometheus    │           │
│  │   (Vault Sync)   │        │   (Monitoring)   │           │
│  └──────────────────┘        └──────────────────┘           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Cloud Resources (AWS)                    │
│  EKS • RDS • ElastiCache • S3 • ALB • Route53 • CloudWatch  │
└─────────────────────────────────────────────────────────────┘
```
Key Lessons Learned
1. Start with User Research
We spent 2 weeks interviewing developers before building anything. This surfaced pain points we wouldn’t have guessed.
2. Golden Paths, Not Guardrails
We made the easy path also the best path. Developers could still customize, but defaults were excellent.
3. Documentation is Infrastructure
TechDocs integration was as important as the deployment pipeline. Searchable, versioned, co-located docs were a game-changer.
4. Measure Developer Experience
We tracked:
- Time to Hello World (new service deployment)
- Time to Coffee (onboarding time)
- Toil hours (time spent on infrastructure tasks)
- NPS score (developer satisfaction)
5. Platform as a Product
We treated the platform like a product:
- Roadmap based on user feedback
- Regular demos and office hours
- Slack channel for support
- Quarterly user surveys
6. Incremental Adoption Works
We didn’t force migration. We made the platform so good that teams wanted to migrate.
What’s Next: The Roadmap
Our platform backlog includes:
Q2 2025:
- AI-powered cost optimization recommendations
- Automatic dependency updates via Renovate integration
- Service-level SLO tracking and alerting
Q3 2025:
- Multi-cloud support (AWS + GCP)
- Policy-as-code enforcement (OPA/Gatekeeper)
- Advanced canary deployment strategies
Q4 2025:
- Self-service disaster recovery testing
- Platform API for programmatic access
- ML-powered capacity planning
Final Thoughts
Building an internal developer platform was one of the highest-leverage investments we made. Six months of focused work yielded:
✅ 99.8% faster service deployment
✅ 278% increase in deployment frequency
✅ 95% reduction in onboarding time
✅ 340% first-year ROI
But the real win was cultural: developers now spend time building features, not fighting infrastructure.
Platform engineering isn’t about replacing DevOps—it’s about evolving it. As the CrashBytes article on platform engineering explains, this is the future of cloud operations.
If you’re considering building an IDP, my advice: start now, start small, and obsess over developer experience.
Want to discuss platform engineering strategies? Connect with me on LinkedIn or follow me on Twitter.