The Problem: Go Was “Fast Enough” Until It Wasn’t
Q4 2023. Our payment processing system was melting down every Monday morning.
The pattern was predictable:
- 9 AM: Traffic spike as businesses process weekend transactions
- 9:15 AM: Latency climbs from 45ms to 890ms
- 9:30 AM: Memory usage hits 95%, pods start OOMing
- 9:45 AM: Auto-scaler frantically launches 40+ new pods
- 10:30 AM: Traffic normalizes, we’re left with massive over-provisioning
Our weekly infrastructure dance:
- Monday AM: 60 pods @ $120/hour
- Rest of week: 15 pods @ $30/hour
- Monthly waste: ~$14,400 on peak capacity we only needed 4 hours/week
Our Go services were well-written. But garbage collection pauses and memory overhead were killing us at scale.
After reading about Rust transforming system design, I proposed something radical: Rewrite our hottest path services in Rust.
My VP’s response: “Do you have any idea how long that will take?”
Spoiler: Longer than we thought. But worth every painful moment.
The Business Case: Convincing Leadership
Before writing a single line of Rust, I needed executive buy-in.
The Financial Analysis
Current costs (Go implementation):
- Infrastructure: $62K/month (excessive memory usage)
- Developer time: $18K/month (debugging GC issues)
- Incident response: $12K/month (on-call + lost productivity)
- Total: $92K/month
Projected costs (Rust rewrite):
- Migration effort: $180K (3 engineers × 2 months)
- Infrastructure: $19K/month (70% reduction)
- Developer time: $8K/month (simpler debugging)
- Incident response: $3K/month (fewer crashes)
- Total: $30K/month + $180K one-time
Break-even: 3 months after migration.
3-year ROI: $2.2M in savings.
The VP: “You have 3 months to prove this works. Start with one service.”
Phase 1: The Proof of Concept (Weeks 1-4)
We chose our most problematic service: transaction-validator (2,500 req/sec at peak).
The Go Baseline
// Original Go implementation
type TransactionValidator struct {
    cache   *redis.Client
    db      *sql.DB
    mu      sync.RWMutex
    metrics *prometheus.Registry
}

func (v *TransactionValidator) Validate(ctx context.Context, tx Transaction) error {
    v.mu.RLock()
    defer v.mu.RUnlock()

    // Check cache
    cached, err := v.cache.Get(ctx, tx.ID).Result()
    if err == nil {
        return v.processCached(cached)
    }

    // Validate against rules
    if err := v.validateRules(tx); err != nil {
        return err
    }

    // Store in cache
    v.cache.Set(ctx, tx.ID, tx, 10*time.Minute)
    return nil
}
Performance baseline:
- p50 latency: 42ms
- p99 latency: 340ms
- Memory per pod: 850MB steady-state, 2.1GB peak
- GC pauses: 8-15ms every 2 seconds
The Rust Rewrite: Attempt 1 (Failed)
My first Rust code was… a disaster.
// My first terrible Rust attempt
use std::sync::Arc;
use tokio::sync::RwLock;

struct TransactionValidator {
    cache: Arc<RwLock<redis::Client>>,           // Wrong!
    db: Arc<RwLock<sqlx::Pool<sqlx::Postgres>>>, // Also wrong!
}

// This compiles but performs WORSE than Go!
impl TransactionValidator {
    async fn validate(&self, tx: Transaction) -> Result<(), Error> {
        let cache = self.cache.write().await; // Serial lock!
        // ... rest of implementation
    }
}
Problems:
- Over-locking: Every cache access acquired a write lock
- Arc everywhere: Fighting the borrow checker the wrong way
- Blocking calls in async: Mixed synchronous and async code poorly (see the sketch below)
- No connection pooling: Creating new connections on every request
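The blocking-in-async mistake is worth a concrete illustration. Here is a minimal sketch of the general fix, pushing synchronous, CPU-heavy work onto Tokio's blocking pool instead of running it on an async worker thread (the Transaction, ValidationError, and score_rules names below are hypothetical stand-ins, not our production code):

use tokio::task;

// Hypothetical stand-ins for the real service types
struct Transaction { amount_cents: i64 }
enum ValidationError { Internal }

// Synchronous, CPU-heavy rule scoring: it blocks whatever thread runs it
fn score_rules(tx: &Transaction) -> u32 {
    (tx.amount_cents % 97) as u32 // placeholder for expensive synchronous work
}

async fn validate(tx: Transaction) -> Result<u32, ValidationError> {
    // Calling score_rules(&tx) directly here would stall an async worker thread.
    // spawn_blocking moves it onto Tokio's dedicated blocking thread pool.
    let score = task::spawn_blocking(move || score_rules(&tx))
        .await
        .map_err(|_| ValidationError::Internal)?;
    Ok(score)
}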
Performance: Worse than Go
- p50: 68ms (61% slower!)
- p99: 520ms
- Memory: 420MB (better, but code was trash)
The Rust Rewrite: Attempt 2 (Success)
After a week of learning Rust’s ownership model properly:
// Proper Rust implementation
use std::sync::Arc;

use dashmap::DashMap; // Lock-free concurrent HashMap
use deadpool_redis::{redis::AsyncCommands, Config as RedisConfig, Pool as RedisPool, Runtime};
use sqlx::postgres::PgPool;

#[derive(Clone)]
struct TransactionValidator {
    // Arc only where needed, no mutex spam
    cache: RedisPool,                                     // Connection pooled
    db: PgPool,                                           // Connection pooled
    local_cache: Arc<DashMap<String, CachedTransaction>>, // Lock-free
    metrics: Arc<Metrics>,
}

impl TransactionValidator {
    async fn validate(&self, tx: Transaction) -> Result<(), ValidationError> {
        // Check lock-free local cache first (nanosecond access)
        if let Some(cached) = self.local_cache.get(&tx.id) {
            return self.process_cached(&cached);
        }

        // Check Redis (millisecond access)
        let mut conn = self.cache.get().await?;
        if let Ok(cached) = conn.get::<_, String>(&tx.id).await {
            let cached_tx: CachedTransaction = serde_json::from_str(&cached)?;
            self.local_cache.insert(tx.id.clone(), cached_tx.clone());
            return self.process_cached(&cached_tx);
        }

        // Validate rules (independent checks run concurrently);
        // boxing gives the futures one type so they can share a Vec
        let checks: Vec<futures::future::BoxFuture<'_, Result<(), ValidationError>>> = vec![
            Box::pin(self.validate_amount(&tx)),
            Box::pin(self.validate_merchant(&tx)),
            Box::pin(self.validate_card(&tx)),
            Box::pin(self.validate_risk_score(&tx)),
        ];
        let validation_results = futures::future::join_all(checks).await;

        // Check all validations passed
        for result in validation_results {
            result?;
        }

        // Cache result (600 s = the same 10-minute TTL as the Go version)
        let cached = CachedTransaction::from(tx.clone());
        let serialized = serde_json::to_string(&cached)?;
        let _: () = conn.set_ex(&tx.id, serialized, 600).await?;
        self.local_cache.insert(tx.id.clone(), cached);

        Ok(())
    }
}
Key improvements:
- DashMap: Lock-free concurrent HashMap (roughly 10x faster than our RwLock-guarded map under contention)
- Connection pooling: Reuse connections efficiently
- Parallel validation: Run independent checks concurrently
- Zero-copy where possible: Minimize allocations (sketch below)
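On the zero-copy point, the usual technique is borrowing string data out of the request buffer during deserialization instead of allocating fresh Strings. A minimal sketch of the idea with serde (TransactionView is an illustrative type, not our actual struct):

use std::borrow::Cow;

use serde::Deserialize;

// Illustrative request "view": string fields borrow from the incoming JSON
// buffer instead of allocating a new String per request.
#[derive(Deserialize)]
struct TransactionView<'a> {
    id: &'a str, // &str fields are borrowed automatically by serde_json::from_str
    #[serde(borrow)]
    merchant_name: Cow<'a, str>, // Cow borrows when the JSON contains no escapes
    amount_cents: i64,
}

fn parse(body: &str) -> Result<TransactionView<'_>, serde_json::Error> {
    serde_json::from_str(body)
}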
Performance after rewrite:
- p50 latency: 15ms (64% faster than Go!)
- p99 latency: 48ms (86% faster!)
- Memory per pod: 180MB steady-state, 240MB peak (71% less!)
- No GC pauses: Deterministic performance
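One caveat on these comparisons: they only hold if both implementations measure latency the same way. A minimal sketch of exporting a request-latency histogram from the Rust side, assuming the prometheus and once_cell crates (metric name and buckets are illustrative):

use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

// Mirror whatever histogram the Go service already exported so the
// p50/p99 dashboards compare like with like.
static VALIDATE_LATENCY: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "transaction_validate_seconds",
        "End-to-end latency of TransactionValidator::validate",
        vec![0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
    )
    .unwrap()
});

async fn timed_validate(v: &TransactionValidator, tx: Transaction) -> Result<(), ValidationError> {
    let timer = VALIDATE_LATENCY.start_timer();
    let result = v.validate(tx).await;
    timer.observe_duration(); // Records seconds into the histogram
    result
}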
The Migration: Phase 2 (Months 2-3)
After proving Rust worked, we migrated 11 more services.
Services We Migrated
| Service | Go Memory | Rust Memory | Latency Improvement | Status |
|---|---|---|---|---|
| transaction-validator | 850MB | 180MB | 64% faster | ✅ Production |
| fraud-detector | 1.2GB | 290MB | 71% faster | ✅ Production |
| payment-processor | 980MB | 210MB | 58% faster | ✅ Production |
| account-service | 640MB | 150MB | 52% faster | ✅ Production |
| notification-service | 420MB | 95MB | 48% faster | ✅ Production |
| analytics-aggregator | 2.1GB | 480MB | 79% faster | ✅ Production |
The Migration Strategy
Strangler Fig Pattern:
┌─────────────────────────────────────┐
│     Load Balancer (50/50 split)     │
└──────────────────┬──────────────────┘
                   │
          ┌────────┴────────┐
          │                 │
     ┌────▼─────┐      ┌────▼─────┐
     │    Go    │      │   Rust   │
     │ Service  │      │ Service  │
     └──────────┘      └──────────┘
Traffic migration:
- Week 1: 5% Rust, 95% Go (canary)
- Week 2: 25% Rust, 75% Go (if metrics good)
- Week 3: 50% Rust, 50% Go (split testing)
- Week 4: 90% Rust, 10% Go (final validation)
- Week 5: 100% Rust, decommission Go
Rollback plan: Single kubectl command to route 100% to Go.
The Challenges: What Almost Killed Us
Challenge 1: The Memory Leak We Didn’t Expect
Month 2, Week 3: Rust services started slowly leaking memory.
Day 1: 180MB → 185MB (normal variation)
Day 3: 185MB → 205MB (concerning)
Day 7: 205MB → 280MB (WTF?)
Day 10: 280MB → 420MB (PANIC!)
The hunt: We added extensive instrumentation.
// Added memory profiling (actix-web handler; stats come from the jemalloc-ctl crate)
use actix_web::{HttpResponse, Responder};
use jemalloc_ctl::{epoch, stats};
use serde_json::json;

async fn memory_stats_handler() -> impl Responder {
    // Trigger a stats refresh
    epoch::mib().unwrap().advance().unwrap();

    let allocated = stats::allocated::mib().unwrap().read().unwrap();
    let resident = stats::resident::mib().unwrap().read().unwrap();

    HttpResponse::Ok().json(json!({
        "allocated_mb": allocated / 1024 / 1024,
        "resident_mb": resident / 1024 / 1024
    }))
}
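One caveat: these counters only reflect reality if the binary actually uses jemalloc as its allocator. A minimal sketch of wiring that up, assuming the companion jemallocator crate:

use jemallocator::Jemalloc;

// Route all allocations through jemalloc so jemalloc-ctl's
// stats::allocated / stats::resident cover the whole process.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;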
The culprit: Connection pool not properly closing idle connections.
// BUG: Connections leaked when the Redis server restarted
let pool = RedisPool::builder(manager) // manager: deadpool_redis::Manager for the Redis URL
    .max_size(50)
    .build()
    .unwrap();

// FIX: Add connection recycling and health checks
let pool = RedisPool::builder(manager)
    .max_size(50)
    .runtime(Runtime::Tokio1)
    .recycle_timeout(Some(Duration::from_secs(30))) // KEY: Recycle old connections
    .wait_timeout(Some(Duration::from_secs(5)))
    .create_timeout(Some(Duration::from_secs(5)))
    .build()
    .unwrap();
After fix: Memory stable at 180MB for 30+ days.
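To catch this class of leak earlier next time, a periodic pool-status log is cheap insurance. A rough sketch using deadpool's Pool::status() (exact Status fields vary a little between deadpool versions):

use std::time::Duration;

// Log pool utilization every minute so a slow connection leak shows up
// on dashboards long before the pod OOMs.
fn spawn_pool_monitor(pool: deadpool_redis::Pool) {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(60));
        loop {
            interval.tick().await;
            let status = pool.status();
            tracing::info!(
                size = status.size,
                available = status.available,
                max_size = status.max_size,
                "redis pool status"
            );
        }
    });
}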
Challenge 2: The Tokio Runtime Tuning
Our initial Tokio runtime configuration destroyed our performance under load.

// BAD: the current-thread flavor runs everything on a single core
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // All tasks share one worker thread
}

// Fixed: multi-threaded runtime with an explicit worker count
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    // Tasks are now scheduled across 4 worker threads
}
// Even better: Custom runtime configuration
fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .thread_name("validator-worker")
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async {
        // Application code
    });
}
Impact: 3.2x throughput increase with proper runtime config.
Challenge 3: Error Handling Culture Shift
Go’s if err != nil versus Rust’s Result<T, E> required a team mindset change.
Go style:
result, err := doSomething()
if err != nil {
    log.Error(err) // Often just logged
    return nil     // Silently fails
}
Rust style:
let result = do_something()?; // Propagates the error up

// Or handle explicitly
let value = match do_something() {
    Ok(val) => val,
    Err(e) => {
        tracing::error!("Operation failed: {}", e);
        return Err(e.into()); // Must handle
    }
};
The shift: Rust forces you to handle errors. This found 47 bugs in our Go code we didn’t know existed.
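The ? operator only feels effortless once there is a shared error type that every failure converts into. A minimal sketch of what a ValidationError like the one used above could look like with the thiserror crate (variants are illustrative, not our exact type):

use thiserror::Error;

// #[from] generates the From impls that let `?` convert pool, Redis,
// SQL, and JSON errors into ValidationError without boilerplate.
#[derive(Debug, Error)]
enum ValidationError {
    #[error("redis pool error: {0}")]
    Pool(#[from] deadpool_redis::PoolError),
    #[error("redis error: {0}")]
    Redis(#[from] deadpool_redis::redis::RedisError),
    #[error("database error: {0}")]
    Db(#[from] sqlx::Error),
    #[error("serialization error: {0}")]
    Serde(#[from] serde_json::Error),
    #[error("rule violation: {0}")]
    Rule(String),
}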
The Results: 6 Months in Production
Performance Improvements
Latency (p99):
- Before (Go): 340ms average across services
- After (Rust): 52ms average
- Improvement: 85% faster
Memory Usage:
- Before (Go): 850MB average per pod
- After (Rust): 245MB average per pod
- Improvement: 71% reduction
Throughput:
- Before (Go): 2,500 req/sec per pod
- After (Rust): 8,200 req/sec per pod
- Improvement: 3.3x increase
Infrastructure Cost Savings
Pod count reduction:
- Before: 60 pods peak, 15 steady-state
- After: 12 pods peak, 5 steady-state
- Reduction: 80% fewer pods
Monthly costs:
- Before: $62K infrastructure
- After: $19K infrastructure
- Savings: $43K/month ($516K/year)
Developer Experience
Pros:
- Compile-time error checking caught bugs early
- No more GC-related debugging
- Performance predictability (no GC surprises)
- Memory safety prevented entire classes of bugs
Cons:
- Steeper learning curve (3-4 weeks to proficiency)
- Longer compile times (2-3 min vs 30 sec for Go)
- Smaller ecosystem (some crates less mature)
- Hiring harder (fewer Rust developers)
Incident Reduction
Production incidents (6-month comparison):
| Incident Type | Go (6 months) | Rust (6 months) |
|---|---|---|
| Memory leaks | 12 | 1 |
| Race conditions | 8 | 0 |
| Null pointer panics | 15 | 0 (impossible in safe Rust) |
| Performance degradation | 23 | 3 |
| Total | 58 | 4 |
93% reduction in production incidents.
Lessons for Teams Considering Rust
✅ Good Reasons to Use Rust
- Performance critical paths - APIs, data processing, real-time systems
- Memory-constrained environments - Edge devices, embedded systems
- Long-running services - Where GC pauses matter
- Safety-critical systems - Financial, healthcare, infrastructure
- High-scale systems - Where performance improvements = cost savings
❌ Bad Reasons to Use Rust
- “It’s trendy” - Not a reason. Learn it properly first.
- CRUD APIs - Go/Node/Python are fine for most APIs
- Rapid prototyping - Rust’s compile-time checks slow iteration
- Team has no Rust experience - Budget 2-3 months for learning
- Small applications - Rust’s benefits don’t materialize at small scale
Migration Advice
If you’re migrating to Rust:
- Start small: Pick ONE service, preferably hot-path with clear metrics
- Measure everything: Establish baseline before migration
- Learn ownership model: Don’t fight the borrow checker
- Use mature crates: Tokio, Axum, SQLx, Serde are battle-tested
- Profile early: Use cargo-flamegraph and perf from day one (see the benchmark sketch after this list)
- Plan for training: Budget 3-4 weeks per engineer for proficiency
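Flamegraphs are most useful when pointed at something repeatable. A minimal micro-benchmark sketch with the criterion crate that cargo flamegraph can then profile (the bench name and payload are made up for illustration):

// benches/validate.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn parse_benchmark(c: &mut Criterion) {
    let body = r#"{"id":"tx_123","merchant_id":"m_42","amount_cents":1999}"#;
    c.bench_function("parse_transaction", |b| {
        b.iter(|| serde_json::from_str::<serde_json::Value>(black_box(body)).unwrap())
    });
}

criterion_group!(benches, parse_benchmark);
criterion_main!(benches);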
Red flags to abort migration:
- Team resistance (forcing Rust on unwilling team = disaster)
- No clear performance problem to solve
- Lack of time for proper learning
- Can’t articulate ROI beyond “Rust is cool”
What’s Next?
We’re now exploring:
- WebAssembly compilation - Rust → WASM for edge deployment (sketch after this list)
- Async runtime optimization - Custom Tokio executors
- Zero-copy deserialization - Using rkyv for ultra-fast parsing
- GPU acceleration - CUDA bindings for ML inference
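On the WebAssembly idea, the appeal is that pure validation logic compiles to WASM largely unchanged. A rough sketch of exposing a single check with wasm-bindgen (the function and its rules are illustrative, not production code):

use wasm_bindgen::prelude::*;

// Pure, allocation-light validation compiles to WASM and can run at the
// edge before a transaction ever reaches the core service.
#[wasm_bindgen]
pub fn quick_validate(amount_cents: i64, currency: &str) -> bool {
    amount_cents > 0 && amount_cents <= 5_000_000 && matches!(currency, "USD" | "EUR" | "GBP")
}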
Rust transformed our infrastructure economics. The migration was harder than expected, but $516K annual savings + 93% fewer incidents = worth it.
For more on systems programming and performance optimization, see the comprehensive Rust guide that influenced our architecture decisions.
Considering Rust for your infrastructure? Connect on LinkedIn or share your migration stories on Twitter.