The Catalyst: Developer Experience Crisis
After reading about platform engineering as the future of DevOps, I knew we had a problem. Our 40-person engineering team was struggling with:
- 16 different deployment methods across microservices
- 5-day average for new engineer environment setup
- 142 internal tools with no central discovery
- 4+ hours per week per developer on infrastructure tasks
Our VP of Engineering gave us a mandate: “Make it as easy to deploy at our company as it is to deploy on Vercel.”
No pressure.
This is the story of how we built our Internal Developer Platform (IDP) in 6 months, the critical decisions we made, and the ROI that justified the investment.
Phase 1: Discovery & Platform Team Formation
The Team Structure Debate
We debated three approaches:
Option A: Embedded Platform Engineers
- Platform engineers embedded in product teams
- Pros: Domain knowledge, trust
- Cons: Inconsistent standards, hard to scale
Option B: Centralized Platform Team
- Dedicated platform team as service provider
- Pros: Focused expertise, consistency
- Cons: Risk of ivory tower syndrome
Option C: Hybrid Model
- Core platform team + embedded liaisons
- Pros: Best of both worlds
- Cons: Complex communication overhead
We chose Option B with a twist: 3-person core platform team with rotating “platform advocate” seats where product engineers joined for 1 quarter.
Defining Platform Team Scope
We clarified what the platform team would and wouldn’t own:
Platform Team Owns:
- Developer portal (Backstage)
- CI/CD pipeline templates
- Infrastructure provisioning automation
- Service scaffolding & templates
- Observability stack integration
- Secret management
Product Teams Own:
- Application code
- Service-specific configurations
- Database schemas
- API design
- Feature flags
This boundary definition prevented scope creep.
Phase 2: Technology Selection
The Build vs. Buy Decision
We evaluated the landscape:
| Solution | Build Effort | Maintenance | Flexibility | Cost |
|---|---|---|---|---|
| Build from scratch | 12+ months | High | Total | $0 software |
| Backstage (OSS) | 4-6 months | Medium | High | $0 software |
| Commercial IDP | 2-3 months | Low | Limited | $150k+/year |
We chose Backstage for three reasons:
- Open-source with strong Spotify backing
- Plugin ecosystem aligned with our stack
- Flexibility to customize without vendor lock-in
The Core Stack
```yaml
# Our platform architecture
developer_portal:
  core: backstage
  authentication: okta-saml
  catalog: service-catalog + component-discovery
cicd:
  orchestration: github-actions
  runners: kubernetes-self-hosted
  templates: cookiecutter + terraform-modules
infrastructure:
  iac: terraform
  provisioning: crossplane
  orchestration: kubernetes (EKS)
observability:
  metrics: prometheus + grafana
  logs: loki
  traces: tempo
  alerts: alertmanager
secrets:
  management: vault
  injection: external-secrets-operator
```
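The Vault-to-Kubernetes leg of that stack is worth a concrete illustration. With the External Secrets Operator, each service declares an `ExternalSecret` and the operator materializes a Kubernetes Secret from Vault. A minimal sketch — the store name `vault-backend` and the Vault path layout are illustrative, not our actual values:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: user-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # illustrative ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: user-service-env       # Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: services/user-service   # illustrative Vault path
        property: db_password
```

The operator keeps the target Secret in sync on the `refreshInterval`, so rotating a credential in Vault propagates without redeploying the service.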
Phase 3: Building the MVP (Months 1-3)
Golden Path: Service Scaffolding
Our first deliverable: one-command service creation.
We built a service template generator integrated with Backstage:
```yaml
# template.yaml - Backstage Software Template
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golang-microservice
  title: Go Microservice
  description: Create a new Go-based microservice with complete CI/CD
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Information
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z0-9-]+$'
        description:
          title: Description
          type: string
        owner:
          title: Team Owner
          type: string
          ui:field: OwnerPicker
    - title: Infrastructure Configuration
      properties:
        database:
          title: Needs Database?
          type: boolean
        database_type:
          title: Database Type
          type: string
          enum: ['postgresql', 'mysql', 'mongodb']
          ui:visible:
            required: ['database']
        cache:
          title: Needs Redis Cache?
          type: boolean
        replicas:
          title: Replica Count
          type: number
          default: 3
          minimum: 1
          maximum: 10
  steps:
    - id: fetch-base
      name: Fetch Base Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=our-org&repo=${{ parameters.name }}
        description: ${{ parameters.description }}
        defaultBranch: main
    - id: provision-infrastructure
      name: Provision Infrastructure
      action: terraform:apply   # custom action wrapping our Terraform modules
      input:
        module: platform-templates/microservice
        vars:
          service_name: ${{ parameters.name }}
          database_enabled: ${{ parameters.database }}
          database_type: ${{ parameters.database_type }}
          cache_enabled: ${{ parameters.cache }}
          replicas: ${{ parameters.replicas }}
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: CI/CD Pipeline
        url: ${{ steps['create-repo'].output.remoteUrl }}/actions
      - title: Service Dashboard
        url: https://grafana.ourcompany.com/d/${{ parameters.name }}
```
Result: New service setup dropped from 3 days to 8 minutes.
The TechDocs Integration
Documentation was scattered. We centralized it in Backstage using TechDocs (MkDocs backend):
```yaml
# mkdocs.yml in each service repo
site_name: User Service Documentation
plugins:
  - techdocs-core
nav:
  - Home: index.md
  - Architecture:
      - Overview: architecture/overview.md
      - Database Schema: architecture/database.md
      - API Endpoints: architecture/api.md
  - Operations:
      - Deployment: operations/deployment.md
      - Monitoring: operations/monitoring.md
      - Incident Response: operations/incidents.md
  - Development:
      - Local Setup: development/setup.md
      - Testing: development/testing.md
      - Contributing: development/contributing.md
```
Engineers could now find service documentation directly in the developer portal—no more Confluence archaeology.
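For TechDocs to pick those docs up, each repo's `catalog-info.yaml` points the portal at the MkDocs source via the standard `backstage.io/techdocs-ref` annotation. A minimal sketch — the service description and owning team are illustrative:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: user-service
  description: Handles user accounts and profiles   # illustrative
  annotations:
    backstage.io/techdocs-ref: dir:.   # build docs from this repo's mkdocs.yml
spec:
  type: service
  lifecycle: production
  owner: team-identity                 # illustrative team name
```

Because this file is scaffolded by the golden-path template, new services get discoverable docs on day one.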
The Kubernetes Abstraction Layer
We created a simplified Kubernetes abstraction using Crossplane to hide complexity:
```yaml
# ServiceDeployment CRD
apiVersion: platform.ourcompany.com/v1alpha1
kind: ServiceDeployment
metadata:
  name: user-service
spec:
  # Simple developer-facing spec
  image: user-service:v1.2.3
  replicas: 3
  resources:
    cpu: 500m
    memory: 512Mi
  database:
    type: postgresql
    version: "14"
    storage: 20Gi
  cache:
    enabled: true
    memory: 2Gi
  ingress:
    hostname: api.ourcompany.com
    path: /users
  monitoring:
    enabled: true
    alerts:
      - name: high-error-rate
        threshold: 5%
      - name: high-latency
        threshold: 1000ms
---
# This gets expanded by Crossplane into:
# - Deployment
# - Service
# - HorizontalPodAutoscaler
# - PodDisruptionBudget
# - PostgreSQL instance (via AWS RDS)
# - Redis instance (via ElastiCache)
# - Ingress with cert-manager
# - ServiceMonitor for Prometheus
# - PrometheusRules for alerts
# - NetworkPolicies
```

Developers now declared intent, not implementation.
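Behind that claim type sits a Crossplane CompositeResourceDefinition that registers `ServiceDeployment` as a namespaced claim. A trimmed sketch of the shape ours takes (the full OpenAPI schema is elided):

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xservicedeployments.platform.ourcompany.com
spec:
  group: platform.ourcompany.com
  names:
    kind: XServiceDeployment
    plural: xservicedeployments
  claimNames:                 # exposes the namespaced ServiceDeployment claim
    kind: ServiceDeployment
    plural: servicedeployments
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object    # image, replicas, database, cache, ... (elided)
```

A matching Composition then maps each claim onto the Deployment, RDS instance, ElastiCache cluster, and the rest of the resources listed above.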
Phase 4: Adoption Challenges (Months 4-5)
Challenge 1: The Legacy Service Problem
40% of our services predated the platform. Migrating them was painful.
Solution: We created a migration tool that automated 70% of the work:
```python
# migrate_to_platform.py
import yaml
from pathlib import Path


def migrate_service(service_path):
    """Migrate a legacy service to platform-managed deployment."""
    service_path = Path(service_path)
    print(f"Migrating service at {service_path}")

    # 1. Detect current deployment method
    deployment_type = detect_deployment(service_path)
    # 2. Extract configuration
    config = extract_config(service_path, deployment_type)
    # 3. Generate platform manifests
    manifests = generate_platform_manifests(config)
    # 4. Create PR with migration
    pr_url = create_migration_pr(service_path, manifests)
    # 5. Generate migration runbook
    runbook = generate_runbook(config, deployment_type)

    return {
        'status': 'migration_ready',
        'pr_url': pr_url,
        'runbook': runbook,
    }


def detect_deployment(service_path):
    """Detect legacy deployment method."""
    if (service_path / 'deploy.sh').exists():
        return 'bash_script'
    elif (service_path / 'Jenkinsfile').exists():
        return 'jenkins'
    elif (service_path / '.circleci').exists():
        return 'circleci'
    elif (service_path / 'docker-compose.yml').exists():
        return 'docker_compose'
    return 'unknown'


def extract_config(service_path, deployment_type):
    """Extract config from legacy deployment."""
    config = {'name': service_path.name}
    if deployment_type == 'docker_compose':
        compose = yaml.safe_load(
            (service_path / 'docker-compose.yml').read_text()
        )
        service = list(compose['services'].values())[0]
        config.update({
            'image': service.get('image', ''),
            'ports': service.get('ports', []),
            'environment': service.get('environment', {}),
            'volumes': service.get('volumes', []),
            # depends_on may be a list or a mapping in compose files
            'depends_on': list(service.get('depends_on', {})),
        })
    # Add more extractors for other deployment types
    return config


def generate_platform_manifests(config):
    """Generate platform CRD manifests."""
    return f"""
apiVersion: platform.ourcompany.com/v1alpha1
kind: ServiceDeployment
metadata:
  name: {config['name']}
spec:
  image: {config['image']}
  replicas: {config.get('replicas', 3)}
  resources:
    cpu: {config.get('cpu', '500m')}
    memory: {config.get('memory', '512Mi')}
  database:
    type: {infer_database_type(config)}
  ingress:
    path: {config.get('path', '/')}
"""

# create_migration_pr, generate_runbook, and infer_database_type are
# internal helpers (GitHub API calls, runbook templating) omitted here.

# Migration results: 28/40 legacy services migrated in 2 weeks
```
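The detection heuristic is easy to exercise in isolation. A self-contained sketch of the same marker-file logic, run against a throwaway directory (file contents are placeholders):

```python
import tempfile
from pathlib import Path

# Marker files checked in priority order, mirroring detect_deployment above
MARKERS = [
    ("deploy.sh", "bash_script"),
    ("Jenkinsfile", "jenkins"),
    (".circleci", "circleci"),
    ("docker-compose.yml", "docker_compose"),
]


def detect(repo: Path) -> str:
    """Return the first legacy deployment method whose marker file exists."""
    for filename, method in MARKERS:
        if (repo / filename).exists():
            return method
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    print(detect(repo))                                  # unknown: empty repo
    (repo / "docker-compose.yml").write_text("services: {}\n")
    print(detect(repo))                                  # docker_compose
    (repo / "Jenkinsfile").write_text("pipeline {}\n")
    print(detect(repo))                                  # jenkins: higher priority
```

Ordering matters: a repo with both a `Jenkinsfile` and a compose file is treated as a Jenkins migration, since that pipeline usually owns the real deployment.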
Challenge 2: Resistance to Change
Some senior engineers resisted the platform, preferring their custom setups.
What didn’t work: mandates from management.
What worked:
- Show, don’t tell: Live demos of the platform’s speed
- Metrics: Published deployment time comparisons
- Early wins: Migrated the most painful services first
- Champions: Empowered advocates in each team
Challenge 3: The “Magic” Problem
Abstraction is powerful but can feel like magic. When things broke, developers were confused.
Solution: Built a “platform explainer” command:
```
$ platform explain user-service

Service: user-service
Platform Version: v1.3.0

Generated Resources:
  ✓ Deployment (user-service)
  ✓ Service (user-service)
  ✓ HorizontalPodAutoscaler (user-service-hpa)
  ✓ PodDisruptionBudget (user-service-pdb)
  ✓ RDS Instance (user-service-db)
  ✓ Redis Instance (user-service-cache)
  ✓ Ingress (user-service-ingress)
  ✓ Certificate (user-service-cert)
  ✓ ServiceMonitor (user-service-metrics)
  ✓ PrometheusRule (user-service-alerts)

To see the full Kubernetes manifests:
  $ platform manifests user-service --output yaml

To debug a specific resource:
  $ kubectl describe deployment user-service
  $ kubectl logs -l app=user-service
```
Transparency built trust.
Phase 5: Production & Scale (Month 6)
Metrics That Mattered
After 6 months, we measured:
| Metric | Before Platform | After Platform | Improvement |
|---|---|---|---|
| Time to deploy new service | 3 days | 8 minutes | 99.8% |
| Deployment frequency | 2.3/week | 8.7/week | 278% |
| New engineer onboarding | 5 days | 1.8 hours | 95% |
| Mean time to recovery | 45 min | 12 min | 73% |
| Infrastructure incidents | 12/month | 3/month | 75% |
| Developer satisfaction | 6.2/10 | 8.9/10 | 44% |
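The improvement column is plain percentage math relative to the "before" value; a quick sanity check of a few rows:

```python
def pct_change(before, after):
    """Percentage improvement relative to the 'before' value."""
    return abs(after - before) / before * 100

# Deployment frequency: 2.3 -> 8.7 per week
print(round(pct_change(2.3, 8.7)))   # 278
# Mean time to recovery: 45 -> 12 minutes
print(round(pct_change(45, 12)))     # 73
# Infrastructure incidents: 12 -> 3 per month
print(round(pct_change(12, 3)))      # 75
# Developer satisfaction: 6.2 -> 8.9
print(round(pct_change(6.2, 8.9)))   # 44
```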
The Unexpected ROI
Direct cost savings:
- $12,000/month in reduced AWS costs (better resource utilization)
- $28,000/month in engineering time saved (4 hours/week per engineer)
Hidden value:
- Faster feature delivery → increased revenue
- Better developer experience → improved retention
- Standardized observability → fewer incidents
Total ROI: 340% in first year
The Architecture We Landed On
```
┌─────────────────────────────────────────────────────────────┐
│                      Backstage Portal                       │
│  ┌────────────────┬─────────────────┬───────────────────┐   │
│  │ Service Catalog│    TechDocs     │ Software Templates│   │
│  └────────────────┴─────────────────┴───────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│             Platform Control Plane (Kubernetes)             │
│                                                             │
│  ┌──────────────────┐        ┌──────────────────┐           │
│  │    Crossplane    │        │      ArgoCD      │           │
│  │  (IaC Operator)  │        │     (GitOps)     │           │
│  └──────────────────┘        └──────────────────┘           │
│                                                             │
│  ┌──────────────────┐        ┌──────────────────┐           │
│  │ External Secrets │        │    Prometheus    │           │
│  │   (Vault Sync)   │        │   (Monitoring)   │           │
│  └──────────────────┘        └──────────────────┘           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Cloud Resources (AWS)                    │
│  EKS • RDS • ElastiCache • S3 • ALB • Route53 • CloudWatch  │
└─────────────────────────────────────────────────────────────┘
```
Key Lessons Learned
1. Start with User Research
We spent 2 weeks interviewing developers before building anything. This surfaced pain points we wouldn’t have guessed.
2. Golden Paths, Not Guardrails
We made the easy path also the best path. Developers could still customize, but defaults were excellent.
3. Documentation is Infrastructure
TechDocs integration was as important as the deployment pipeline. Searchable, versioned, co-located docs were a game-changer.
4. Measure Developer Experience
We tracked:
- Time to Hello World (new service deployment)
- Time to Coffee (onboarding time)
- Toil hours (time spent on infrastructure tasks)
- NPS score (developer satisfaction)
5. Platform as a Product
We treated the platform like a product:
- Roadmap based on user feedback
- Regular demos and office hours
- Slack channel for support
- Quarterly user surveys
6. Incremental Adoption Works
We didn’t force migration. We made the platform so good that teams wanted to migrate.
What’s Next: The Roadmap
Our platform backlog includes:
Q2 2025:
- AI-powered cost optimization recommendations
- Automatic dependency updates via Renovate integration
- Service-level SLO tracking and alerting
Q3 2025:
- Multi-cloud support (AWS + GCP)
- Policy-as-code enforcement (OPA/Gatekeeper)
- Advanced canary deployment strategies
Q4 2025:
- Self-service disaster recovery testing
- Platform API for programmatic access
- ML-powered capacity planning
Final Thoughts
Building an internal developer platform was one of the highest-leverage investments we made. Six months of focused work yielded:
✅ 99.8% faster service deployment
✅ 278% increase in deployment frequency
✅ 95% reduction in onboarding time
✅ 340% first-year ROI
But the real win was cultural: developers now spend time building features, not fighting infrastructure.
Platform engineering isn’t about replacing DevOps—it’s about evolving it. As the CrashBytes article on platform engineering explains, this is the future of cloud operations.
If you’re considering building an IDP, my advice: start now, start small, and obsess over developer experience.
Want to discuss platform engineering strategies? Connect with me on LinkedIn or follow me on Twitter.