Rewriting Our Core Services in Rust: 64% Faster, 71% Less Memory, Worth the Pain

Why we rewrote 12 critical services from Go to Rust, the migration hell we endured, the memory leak that almost killed us, and why our infrastructure costs dropped $43K/month.

The Problem: Go Was “Fast Enough” Until It Wasn’t

Q4 2023. Our payment processing system was melting down every Monday morning.

The pattern was predictable:

  • 9 AM: Traffic spike as businesses process weekend transactions
  • 9:15 AM: Latency climbs from 45ms to 890ms
  • 9:30 AM: Memory usage hits 95%, pods start OOMing
  • 9:45 AM: Auto-scaler frantically launches 40+ new pods
  • 10:30 AM: Traffic normalizes, we’re left with massive over-provisioning

Our weekly infrastructure dance:

  • Monday AM: 60 pods @ $120/hour
  • Rest of week: 15 pods @ $30/hour
  • Monthly waste: ~$14,400 on peak capacity we only needed 4 hours/week

Our Go services were well-written. But garbage collection pauses and memory overhead were killing us at scale.

After reading about Rust transforming system design, I proposed something radical: rewrite our hottest-path services in Rust.

My VP’s response: “Do you have any idea how long that will take?”

Spoiler: Longer than we thought. But worth every painful moment.

The Business Case: Convincing Leadership

Before writing a single line of Rust, I needed executive buy-in.

The Financial Analysis

Current costs (Go implementation):

  • Infrastructure: $62K/month (excessive memory usage)
  • Developer time: $18K/month (debugging GC issues)
  • Incident response: $12K/month (on-call + lost productivity)
  • Total: $92K/month

Projected costs (Rust rewrite):

  • Migration effort: $180K (3 engineers × 2 months)
  • Infrastructure: $19K/month (70% reduction)
  • Developer time: $8K/month (simpler debugging)
  • Incident response: $3K/month (fewer crashes)
  • Total: $30K/month + $180K one-time

Break-even: ~3 months after migration ($62K/month in savings against the $180K one-time cost)
3-year ROI: ~$2.2M in savings

The VP: “You have 3 months to prove this works. Start with one service.”

Phase 1: The Proof of Concept (Weeks 1-4)

We chose our most problematic service: transaction-validator (2,500 req/sec at peak).

The Go Baseline

// Original Go implementation
import (
    "context"
    "database/sql"
    "encoding/json"
    "sync"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/redis/go-redis/v9"
)

type TransactionValidator struct {
    cache    *redis.Client
    db       *sql.DB
    mu       sync.RWMutex
    metrics  *prometheus.Registry
}

func (v *TransactionValidator) Validate(ctx context.Context, tx Transaction) error {
    // The read lock allows concurrent readers, but it's still held
    // across every network round-trip below
    v.mu.RLock()
    defer v.mu.RUnlock()
    
    // Check cache
    cached, err := v.cache.Get(ctx, tx.ID).Result()
    if err == nil {
        return v.processCached(cached)
    }
    
    // Validate against rules
    if err := v.validateRules(tx); err != nil {
        return err
    }
    
    // Store in cache (go-redis wants a marshaled value, not a raw struct)
    payload, err := json.Marshal(tx)
    if err != nil {
        return err
    }
    v.cache.Set(ctx, tx.ID, payload, 10*time.Minute)
    
    return nil
}

Performance baseline:

  • p50 latency: 42ms
  • p99 latency: 340ms
  • Memory per pod: 850MB steady-state, 2.1GB peak
  • GC pauses: 8-15ms every 2 seconds
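
Numbers like these need a repeatable harness. Below is a minimal, hypothetical sketch of one, assuming the reqwest and hdrhistogram crates; the endpoint, payload, and request count are made up, and a real run would add warm-up and concurrency:

// Hypothetical latency harness: hits the validator endpoint serially
// and reports p50/p99 from an HDR histogram
use hdrhistogram::Histogram;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    // Track latencies from 1µs to 60s with 3 significant digits
    let mut hist = Histogram::<u64>::new_with_bounds(1, 60_000_000, 3)?;

    for i in 0..10_000u32 {
        let start = Instant::now();
        client
            .post("http://localhost:8080/validate")
            .json(&serde_json::json!({ "id": format!("tx-{i}"), "amount": 100 }))
            .send()
            .await?
            .error_for_status()?;
        hist.record(start.elapsed().as_micros() as u64)?;
    }

    println!(
        "p50: {}µs  p99: {}µs",
        hist.value_at_quantile(0.50),
        hist.value_at_quantile(0.99)
    );
    Ok(())
}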

The Rust Rewrite: Attempt 1 (Failed)

My first Rust code was… a disaster.

// My first terrible Rust attempt
use std::sync::Arc;
use tokio::sync::RwLock;

struct TransactionValidator {
    cache: Arc<RwLock<redis::Client>>,  // Wrong!
    db: Arc<RwLock<sqlx::Pool<sqlx::Postgres>>>,  // Also wrong!
}

// This compiles but performs WORSE than Go!
impl TransactionValidator {
    async fn validate(&self, tx: Transaction) -> Result<(), Error> {
        let cache = self.cache.write().await;  // Serial lock!
        // ... rest of implementation
    }
}

Problems:

  1. Over-locking: Every cache access acquired a write lock
  2. Arc everywhere: Fighting the borrow checker the wrong way
  3. Blocking calls in async: Mixed sync and async code poorly (see the sketch below)
  4. No connection pooling: Creating new connections on every request
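
To make mistake #3 concrete, here's a minimal sketch of the anti-pattern and the standard fix (hypothetical function names, with std::thread::sleep standing in for synchronous work). Blocking work belongs on Tokio's dedicated blocking pool, not on async worker threads:

use std::time::Duration;

// BAD: a synchronous call on an async worker thread stalls every task
// scheduled on that thread, which is exactly what our first attempt did
async fn validate_blocking(tx_id: String) -> String {
    std::thread::sleep(Duration::from_millis(5)); // stands in for sync I/O
    format!("validated {tx_id}")
}

// BETTER: hand blocking work to Tokio's blocking thread pool so the
// async workers keep making progress
async fn validate_offloaded(tx_id: String) -> String {
    tokio::task::spawn_blocking(move || {
        std::thread::sleep(Duration::from_millis(5)); // same sync work
        format!("validated {tx_id}")
    })
    .await
    .expect("blocking task panicked")
}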

Performance: Worse than Go

  • p50: 68ms (61% slower!)
  • p99: 520ms
  • Memory: 420MB (better, but code was trash)

The Rust Rewrite: Attempt 2 (Success)

After a week of learning Rust’s ownership model properly:

// Proper Rust implementation
use std::sync::Arc;
use dashmap::DashMap;  // Sharded concurrent HashMap, no global lock
use deadpool_redis::{redis::AsyncCommands, Pool as RedisPool};
use sqlx::postgres::PgPool;

#[derive(Clone)]
struct TransactionValidator {
    // Arc only where needed, no mutex spam
    cache: RedisPool,           // Connection pooled
    db: PgPool,                 // Connection pooled
    // In production this map needs TTL/eviction or it grows without bound
    local_cache: Arc<DashMap<String, CachedTransaction>>,
    metrics: Arc<Metrics>,
}

impl TransactionValidator {
    async fn validate(&self, tx: Transaction) -> Result<(), ValidationError> {
        // Check the in-process cache first (sub-microsecond access)
        if let Some(cached) = self.local_cache.get(&tx.id) {
            return self.process_cached(cached.value());
        }
        
        // Check Redis (millisecond access)
        let mut conn = self.cache.get().await?;
        if let Ok(cached) = conn.get::<_, String>(&tx.id).await {
            let cached_tx: CachedTransaction = serde_json::from_str(&cached)?;
            self.local_cache.insert(tx.id.clone(), cached_tx.clone());
            return self.process_cached(&cached_tx);
        }
        
        // Run independent rule checks concurrently; try_join! fails fast.
        // (join_all over a Vec won't compile here: each async fn returns
        // its own distinct future type.)
        tokio::try_join!(
            self.validate_amount(&tx),
            self.validate_merchant(&tx),
            self.validate_card(&tx),
            self.validate_risk_score(&tx),
        )?;
        
        // Cache result
        let cached = CachedTransaction::from(tx.clone());
        let serialized = serde_json::to_string(&cached)?;
        let _: () = conn.set_ex(&tx.id, serialized, 600).await?;
        self.local_cache.insert(tx.id.clone(), cached);
        
        Ok(())
    }
}

Key improvements:

  1. DashMap: Sharded concurrent HashMap, roughly 10x faster than a single RwLock<HashMap> under our contention levels
  2. Connection pooling: Reuse Redis and Postgres connections instead of re-dialing per request
  3. Concurrent validation: Run independent rule checks at the same time with try_join!
  4. Zero-copy where possible: Minimize allocations (see the sketch below)
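
To illustrate point 4, here's a hypothetical sketch of borrowing from the input buffer during deserialization instead of allocating owned Strings. Note that serde can only borrow like this when the input outlives the value and the strings contain no escape sequences:

use serde::Deserialize;

// Hypothetical view type: id and merchant borrow directly from the
// incoming JSON buffer instead of allocating owned Strings
#[derive(Deserialize)]
struct TxView<'a> {
    id: &'a str,
    merchant: &'a str,
    amount_cents: u64,
}

fn parse(raw: &str) -> Result<TxView<'_>, serde_json::Error> {
    serde_json::from_str(raw) // zero-copy for the &str fields
}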

Performance after rewrite:

  • p50 latency: 15ms (64% faster than Go!)
  • p99 latency: 48ms (86% faster!)
  • Memory per pod: 180MB steady-state, 240MB peak (79% less steady-state memory!)
  • No GC pauses: Deterministic performance

The Migration: Phase 2 (Months 2-3)

After proving Rust worked, we migrated 11 more services.

Services We Migrated

Service                  Go Memory   Rust Memory   Latency Improvement   Status
transaction-validator    850MB       180MB         64% faster            ✅ Production
fraud-detector           1.2GB       290MB         71% faster            ✅ Production
payment-processor        980MB       210MB         58% faster            ✅ Production
account-service          640MB       150MB         52% faster            ✅ Production
notification-service     420MB       95MB          48% faster            ✅ Production
analytics-aggregator     2.1GB       480MB         79% faster            ✅ Production

The Migration Strategy

Strangler Fig Pattern:

┌─────────────────────────────────────┐
│      Load Balancer (50/50 split)    │
└──────────────┬──────────────────────┘
               │
        ┌──────┴──────┐
        │             │
    ┌───▼────┐   ┌────▼───┐
    │   Go   │   │  Rust  │
    │Service │   │Service │
    └────────┘   └────────┘

Traffic migration:

  • Week 1: 5% Rust, 95% Go (canary)
  • Week 2: 25% Rust, 75% Go (if metrics good)
  • Week 3: 50% Rust, 50% Go (split testing)
  • Week 4: 90% Rust, 10% Go (final validation)
  • Week 5: 100% Rust, decommission Go

Rollback plan: Single kubectl command to route 100% to Go.

The Challenges: What Almost Killed Us

Challenge 1: The Memory Leak We Didn’t Expect

Month 2, Week 3: Rust services started slowly leaking memory.

Day 1:  180MB → 185MB (normal variation)
Day 3:  185MB → 205MB (concerning)
Day 7:  205MB → 280MB (WTF?)
Day 10: 280MB → 420MB (PANIC!)

The hunt: We added extensive instrumentation.

// Added memory profiling (assumes jemalloc is the global allocator,
// e.g. via the jemallocator crate, so jemalloc_ctl has stats to read)
use actix_web::{HttpResponse, Responder};
use jemalloc_ctl::{epoch, stats};
use serde_json::json;

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

async fn memory_stats_handler() -> impl Responder {
    // Advance the epoch so the stats below are refreshed
    epoch::mib().unwrap().advance().unwrap();
    
    let allocated = stats::allocated::mib().unwrap().read().unwrap();
    let resident = stats::resident::mib().unwrap().read().unwrap();
    
    HttpResponse::Ok().json(json!({
        "allocated_mb": allocated / 1024 / 1024,
        "resident_mb": resident / 1024 / 1024
    }))
}

The culprit: Connection pool not properly closing idle connections.

// BUG: connections leaked when the Redis server restarted
let pool = RedisPool::builder(manager)  // manager: deadpool_redis::Manager
    .max_size(50)
    .build()
    .unwrap();

// FIX: add recycling and timeouts so dead connections can't linger
use deadpool_redis::Runtime;
use std::time::Duration;

let pool = RedisPool::builder(manager)
    .max_size(50)
    .runtime(Runtime::Tokio1)
    .recycle_timeout(Some(Duration::from_secs(30)))  // KEY: bound connection recycling
    .wait_timeout(Some(Duration::from_secs(5)))      // fail fast instead of queueing forever
    .create_timeout(Some(Duration::from_secs(5)))    // don't hang on connect
    .build()
    .unwrap();

After fix: Memory stable at 180MB for 30+ days.

Challenge 2: The Tokio Runtime Tuning

Default Tokio runtime settings destroyed our performance under load.

// Default: #[tokio::main] already starts a multi-threaded runtime with
// one worker per detected CPU core. In our pods, the detected core count
// didn't match the container CPU limit, so workers sat on throttled CPU.
#[tokio::main]
async fn main() {
    // Worker count chosen for us, based on the node, not our pod limits
}

// Fixed: pin worker_threads to the pod's CPU limit
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    // 4 workers for a pod limited to 4 CPUs
}

// Even better: Custom runtime configuration
fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .thread_name("validator-worker")
        .enable_all()
        .build()
        .unwrap();
        
    runtime.block_on(async {
        // Application code
    });
}

Impact: 3.2x throughput increase with proper runtime config.

Challenge 3: Error Handling Culture Shift

Go’s if err != nil vs Rust’s Result<T, E> required team mindset change.

Go style:

result, err := doSomething()
if err != nil {
    log.Error(err)  // Often just logged...
    return nil      // ...then swallowed: callers see success
}

Rust style:

let result = do_something()?;  // Propagates error up
// Or handle explicitly
match do_something() {
    Ok(val) => val,
    Err(e) => {
        tracing::error!("Operation failed: {}", e);
        return Err(e.into());  // Must handle
    }
}
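
The ? operator and e.into() above work because of From conversions into the function's error type. Here's a minimal sketch of what a ValidationError like the one in our validator could look like, using the thiserror crate (the variants are illustrative):

use thiserror::Error;

// Illustrative error type: #[from] generates the From impls that let
// the ? operator convert underlying errors automatically
#[derive(Debug, Error)]
enum ValidationError {
    #[error("cache error: {0}")]
    Cache(#[from] redis::RedisError),
    #[error("serialization error: {0}")]
    Serde(#[from] serde_json::Error),
    #[error("rule violation: {0}")]
    Rule(String),
}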

The shift: Rust forces you to handle errors. This found 47 bugs in our Go code we didn’t know existed.

The Results: 6 Months in Production

Performance Improvements

Latency (p99):

  • Before (Go): 340ms average across services
  • After (Rust): 52ms average
  • Improvement: 85% faster

Memory Usage:

  • Before (Go): 850MB average per pod
  • After (Rust): 245MB average per pod
  • Improvement: 71% reduction

Throughput:

  • Before (Go): 2,500 req/sec per pod
  • After (Rust): 8,200 req/sec per pod
  • Improvement: 3.3x increase

Infrastructure Cost Savings

Pod count reduction:

  • Before: 60 pods peak, 15 steady-state
  • After: 12 pods peak, 5 steady-state
  • Reduction: 80% fewer pods

Monthly costs:

  • Before: $62K infrastructure
  • After: $19K infrastructure
  • Savings: $43K/month ($516K/year)

Developer Experience

Pros:

  • Compile-time error checking caught bugs early
  • No more GC-related debugging
  • Performance predictability (no GC surprises)
  • Memory safety prevented entire classes of bugs

Cons:

  • Steeper learning curve (3-4 weeks to proficiency)
  • Longer compile times (2-3 min vs 30 sec for Go)
  • Smaller ecosystem (some crates less mature)
  • Hiring harder (fewer Rust developers)

Incident Reduction

Production incidents (6-month comparison):

Incident Type             Go (6 months)   Rust (6 months)
Memory leaks              12              1
Race conditions           8               0
Null pointer panics       15              0 (impossible in Rust)
Performance degradation   23              3
Total                     58              4

93% reduction in production incidents.

Lessons for Teams Considering Rust

✅ Good Reasons to Use Rust

  1. Performance critical paths - APIs, data processing, real-time systems
  2. Memory-constrained environments - Edge devices, embedded systems
  3. Long-running services - Where GC pauses matter
  4. Safety-critical systems - Financial, healthcare, infrastructure
  5. High-scale systems - Where performance improvements = cost savings

❌ Bad Reasons to Use Rust

  1. “It’s trendy” - Not a reason. Learn it properly first.
  2. CRUD APIs - Go/Node/Python are fine for most APIs
  3. Rapid prototyping - Rust’s compile-time checks slow iteration
  4. Team has no Rust experience - Budget 2-3 months for learning
  5. Small applications - Rust’s benefits don’t materialize at small scale

Migration Advice

If you’re migrating to Rust:

  1. Start small: Pick ONE service, preferably hot-path with clear metrics
  2. Measure everything: Establish baseline before migration
  3. Learn ownership model: Don’t fight the borrow checker
  4. Use mature crates: Tokio, Axum, SQLx, Serde are battle-tested
  5. Profile early: Use cargo-flamegraph and perf from day one
  6. Plan for training: Budget 3-4 weeks per engineer for proficiency

Red flags to abort migration:

  • Team resistance (forcing Rust on unwilling team = disaster)
  • No clear performance problem to solve
  • Lack of time for proper learning
  • Can’t articulate ROI beyond “Rust is cool”

What’s Next?

We’re now exploring:

  1. WebAssembly compilation - Rust → WASM for edge deployment
  2. Async runtime optimization - Custom Tokio executors
  3. Zero-copy deserialization - Using rkyv for ultra-fast parsing
  4. GPU acceleration - CUDA bindings for ML inference

Rust transformed our infrastructure economics. The migration was harder than expected, but $516K annual savings + 93% fewer incidents = worth it.

For more on systems programming and performance optimization, see the comprehensive Rust guide that influenced our architecture decisions.


Considering Rust for your infrastructure? Connect on LinkedIn or share your migration stories on Twitter.