Scaling Database Performance: From 2TB to 50TB Without Downtime

How we scaled our PostgreSQL database from 2TB to 50TB while maintaining 99.99% uptime, cutting query times by 85%, and learning painful lessons about what actually works at scale.

The Problem: When Success Becomes Your Biggest Challenge

Three years ago, our PostgreSQL database was 2TB. Last month, it hit 50TB. Our queries went from 200ms to 8+ seconds. Our monthly AWS bill went from $12K to $89K. Our customers started complaining.

We had two options:

  1. Throw more money at bigger instances
  2. Fundamentally rethink our database architecture

We chose option 2. This is the story of how we scaled our database 25x while cutting query times by 85% and keeping the system running 24/7.

Spoiler: It involved 7 major architectural changes, 4 failed approaches, $200K in migration costs, and one terrifying 18-hour cutover that kept the entire engineering team awake.

The Starting Point: What Got Us Here Won’t Get Us There

Our original architecture was textbook SaaS:

Database: PostgreSQL 13
Instance: AWS RDS db.r5.12xlarge
  - 48 vCPUs
  - 384 GB RAM
  - 20,000 IOPS provisioned
  
Data:
  - Main database: 2TB
  - 200M rows in orders table
  - 500M rows in events table
  - 1.2B rows in audit_logs table

Performance:
  - Average query time: 200ms
  - p95 query time: 850ms
  - Peak concurrent connections: 400
  - Monthly cost: $12,000

The Growth Curve That Broke Everything

# Our data growth pattern (simplified)
growth_data = {
    '2022-01': {'size_tb': 2.0, 'avg_query_ms': 200},
    '2022-06': {'size_tb': 4.2, 'avg_query_ms': 380},
    '2023-01': {'size_tb': 8.5, 'avg_query_ms': 920},
    '2023-06': {'size_tb': 18.0, 'avg_query_ms': 2400},
    '2024-01': {'size_tb': 32.0, 'avg_query_ms': 4800},
    '2024-06': {'size_tb': 45.0, 'avg_query_ms': 7200},
    '2025-01': {'size_tb': 50.0, 'avg_query_ms': 8500}
}

# The trend was unmistakable: query time grew faster than the data did,
# and each new interval degraded more steeply than the last.
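
A quick sanity check on those numbers (just the endpoints of growth_data above) shows how bad the trend was; a minimal sketch:

import math

size_growth = 50.0 / 2.0      # 25x more data between 2022-01 and 2025-01
latency_growth = 8500 / 200   # 42.5x slower average queries over the same period

# Fit latency ~ size^k to the endpoints to get the apparent scaling exponent
k = math.log(latency_growth) / math.log(size_growth)
print(f"{size_growth:.0f}x data -> {latency_growth:.1f}x latency (k ~= {k:.2f})")
# Prints k ~= 1.16: clearly worse than linear, and the later intervals in
# growth_data were degrading faster still.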

The breaking point: Our largest customer’s dashboard took 45 seconds to load. They threatened to leave.

Phase 1: The Failed “Just Optimize” Approach

Month 1: Index Everything!

Our first attempt was naive: add more indexes.

-- We added 47 new indexes across major tables
-- Example from our orders table

CREATE INDEX CONCURRENTLY idx_orders_customer_status 
  ON orders(customer_id, status);

CREATE INDEX CONCURRENTLY idx_orders_created_date 
  ON orders(created_at DESC);

CREATE INDEX CONCURRENTLY idx_orders_product_category 
  ON orders(product_id, category_id);

-- And 44 more...

Results after 2 weeks:

  • Query performance: Improved 15% (not nearly enough)
  • Index maintenance overhead: +40%
  • Write performance: Degraded 25%
  • Database size: Increased 8% just from indexes
  • Conclusion: ❌ Failed approach
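
If you go down this road, it helps to confirm which indexes are actually unused before dropping them. A sketch using PostgreSQL's standard pg_stat_user_indexes view (the connection string is a placeholder):

import psycopg2

UNUSED_INDEX_SQL = """
    SELECT schemaname,
           relname      AS table_name,
           indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0                    -- never scanned since the last stats reset
    ORDER BY pg_relation_size(indexrelid) DESC;
"""

with psycopg2.connect("dbname=app host=db.example.internal") as conn:
    with conn.cursor() as cur:
        cur.execute(UNUSED_INDEX_SQL)
        for schema, table, index, size in cur.fetchall():
            print(f"{schema}.{table}: {index} ({size}) has never been used")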

Month 2: Query Optimization Marathon

We spent 4 weeks rewriting our worst-performing queries:

-- BEFORE: The naive query that was killing us
SELECT 
  o.*,
  c.name as customer_name,
  p.name as product_name,
  s.name as seller_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN sellers s ON p.seller_id = s.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
AND o.status IN ('pending', 'processing')
ORDER BY o.created_at DESC
LIMIT 100;

-- Execution time: 8,500ms
-- Cost: 2,450,000


-- AFTER: Optimized with materialized view and partitioning
-- Step 1: Create materialized view for recent orders
CREATE MATERIALIZED VIEW mv_recent_orders_summary AS
SELECT 
  o.id,
  o.customer_id,
  o.product_id,
  o.status,
  o.created_at,
  o.total_amount,
  c.name as customer_name,
  p.name as product_name,
  s.name as seller_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN sellers s ON p.seller_id = s.id
WHERE o.created_at >= NOW() - INTERVAL '90 days';

CREATE UNIQUE INDEX ON mv_recent_orders_summary(id);
CREATE INDEX ON mv_recent_orders_summary(created_at DESC);

-- Step 2: Query the materialized view
SELECT * 
FROM mv_recent_orders_summary
WHERE status IN ('pending', 'processing')
ORDER BY created_at DESC
LIMIT 100;

-- Execution time: 45ms (99% improvement!)
-- Cost: 12,000

-- Step 3: Refresh strategy
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_recent_orders_summary;
-- Scheduled every 5 minutes via cron
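
The refresh in Step 3 is what keeps this honest: data in the view is at most a few minutes stale. A minimal sketch of the cron job behind that comment (psycopg2 and the DSN are assumptions):

#!/usr/bin/env python3
# refresh_matviews.py -- crontab entry: */5 * * * * /usr/bin/python3 refresh_matviews.py
import psycopg2

VIEWS = ["mv_recent_orders_summary"]

def refresh_all(dsn="dbname=app host=db.example.internal"):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # REFRESH ... CONCURRENTLY cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for view in VIEWS:
                # CONCURRENTLY keeps readers unblocked; it requires the unique index created above
                cur.execute(f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view}")
    finally:
        conn.close()

if __name__ == "__main__":
    refresh_all()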

Results:

  • Top 50 queries: 60-90% faster
  • Overall system performance: Still too slow
  • Time invested: 640 engineering hours
  • Conclusion: ⚠️ Helpful but insufficient

Phase 2: The Table Partitioning Breakthrough

The Realization: We Don’t Query All Data Equally

Analysis of our query patterns revealed the key insight:

# Query analysis results
query_patterns = {
    'last_7_days': '78% of all queries',
    'last_30_days': '18% of all queries',
    'last_90_days': '3% of all queries',
    'older_than_90_days': '1% of all queries'
}

# Most queries hit a tiny fraction of our data!
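
One lightweight way to produce a breakdown like the one above is to bucket each query by how far back it reaches, using logged query parameters. The helper below is a hypothetical sketch of that idea, not the exact tooling:

from collections import Counter
from datetime import datetime, timedelta

def bucket_for_range(requested_from, now=None):
    """Classify a query by the oldest created_at it asks for."""
    now = now or datetime.utcnow()
    age = now - requested_from
    if age <= timedelta(days=7):
        return "last_7_days"
    if age <= timedelta(days=30):
        return "last_30_days"
    if age <= timedelta(days=90):
        return "last_90_days"
    return "older_than_90_days"

def summarize(requested_starts):
    """requested_starts: iterable of datetimes pulled from logged query parameters."""
    counts = Counter(bucket_for_range(ts) for ts in requested_starts)
    total = sum(counts.values()) or 1
    return {bucket: f"{100 * n / total:.0f}% of all queries" for bucket, n in counts.items()}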

Implementing Time-Based Partitioning

-- Step 1: Create partitioned table structure
-- WARNING: This is a migration nightmare for large tables

CREATE TABLE orders_partitioned (
    id BIGSERIAL,
    customer_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    status VARCHAR(50),
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP,
    -- ... other columns
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Step 2: Create monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Created 36 partitions (3 years of monthly partitions)

-- Step 3: Partition-specific indexes
CREATE INDEX ON orders_2025_01(customer_id);
CREATE INDEX ON orders_2025_01(status);
-- Repeated for each partition

-- Step 4: Automatic partition creation via pg_partman
SELECT partman.create_parent(
    p_parent_table => 'public.orders_partitioned',
    p_control => 'created_at',
    p_type => 'native',
    p_interval => '1 month',
    p_premake => 3  -- Create 3 months ahead
);
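
pg_partman only creates the partitions it has been told to premake when its maintenance routine runs, so the create_parent() call above needs a recurring companion job. A minimal sketch (the scheduler and DSN are assumptions; partman.run_maintenance() is the extension's stock entry point):

# partition_maintenance.py -- run hourly via cron or pg_cron
import psycopg2

def run_partman_maintenance(dsn="dbname=app host=db.example.internal"):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            # Creates upcoming partitions per p_premake and applies any retention
            # policy configured in partman.part_config.
            cur.execute("SELECT partman.run_maintenance();")
    finally:
        conn.close()

if __name__ == "__main__":
    run_partman_maintenance()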

The Migration: 18 Hours of Terror

Migrating 200M rows from our monolithic orders table to the partitioned structure was the most stressful experience of my career.

# Migration script (simplified)
import time
from datetime import datetime, timedelta

class TablePartitionMigration:
    """
    Migrate massive table to partitioned structure with zero downtime.
    
    Strategy: Dual-write + background migration + cutover
    """
    
    def __init__(self):
        self.batch_size = 10000
        self.old_table = 'orders'
        self.new_table = 'orders_partitioned'
        self.migration_started = datetime.now()
    
    def execute_migration(self):
        """
        Multi-phase migration to minimize risk.
        """
        print("=== PHASE 1: Enable Dual Writes ===")
        self.enable_dual_writes()
        
        print("\n=== PHASE 2: Copy Historical Data ===")
        self.copy_historical_data()
        
        print("\n=== PHASE 3: Verify Data Consistency ===")
        self.verify_consistency()
        
        print("\n=== PHASE 4: Cutover to New Table ===")
        self.execute_cutover()
    
    def enable_dual_writes(self):
        """
        Modify application to write to both old and new tables.
        """
        # Deploy application change
        print("Deploying dual-write logic...")
        
        # Verify writes going to both tables
        time.sleep(60)
        
        old_count = self.get_recent_row_count(self.old_table, minutes=5)
        new_count = self.get_recent_row_count(self.new_table, minutes=5)
        
        assert abs(old_count - new_count) < 10, "Dual writes not working!"
        
        print(f"✅ Dual writes enabled: {old_count} rows in last 5 min")
    
    def copy_historical_data(self):
        """
        Copy historical data in batches to avoid locking.
        """
        total_rows = self.count_rows(self.old_table)
        migrated = 0
        
        print(f"Total rows to migrate: {total_rows:,}")
        
        # Find date range to migrate
        min_date = self.get_min_date(self.old_table)
        max_date = datetime.now() - timedelta(hours=1)  # Leave buffer
        
        current_date = min_date
        
        while current_date < max_date:
            batch_end = current_date + timedelta(days=1)
            
            rows_copied = self.copy_date_range(
                current_date,
                batch_end
            )
            
            migrated += rows_copied
            progress = (migrated / total_rows) * 100
            
            print(f"Progress: {progress:.1f}% ({migrated:,}/{total_rows:,} rows)")
            
            current_date = batch_end
            
            # Rate limiting to avoid overwhelming database
            time.sleep(2)
    
    def copy_date_range(self, start_date, end_date):
        """
        Copy a specific date range using batched INSERT.
        """
        query = f"""
        INSERT INTO {self.new_table}
        SELECT * FROM {self.old_table}
        WHERE created_at >= %s 
        AND created_at < %s
        AND id NOT IN (
            SELECT id FROM {self.new_table}
            WHERE created_at >= %s AND created_at < %s
        )
        """
        
        # Execute in transaction
        with self.db.transaction():
            cursor = self.db.execute(query, (start_date, end_date, start_date, end_date))
            rows_inserted = cursor.rowcount
        
        return rows_inserted
    
    def verify_consistency(self):
        """
        Verify old and new tables have same data.
        """
        print("Verifying data consistency...")
        
        # Check row counts
        old_count = self.count_rows(self.old_table)
        new_count = self.count_rows(self.new_table)
        
        print(f"Old table: {old_count:,} rows")
        print(f"New table: {new_count:,} rows")
        
        difference = abs(old_count - new_count)
        
        if difference > 1000:
            raise Exception(f"Row count mismatch: {difference:,} rows different")
        
        # Spot check random samples
        self.verify_random_samples(sample_size=1000)
        
        print("✅ Data consistency verified")
    
    def execute_cutover(self):
        """
        The scary part: switch all reads to new table.
        """
        print("=" * 50)
        print("EXECUTING CUTOVER - POINT OF NO RETURN")
        print("=" * 50)
        
        # Step 1: Rename tables (single transaction, so readers see an atomic swap).
        # After the swap the partitioned table answers to the production name; the
        # retired table takes over the old partitioned name so the still-deployed
        # dual-write code has a valid target until the Step 2 deploy removes it.
        with self.db.transaction():
            self.db.execute(f"ALTER TABLE {self.old_table} RENAME TO {self.old_table}_old")
            self.db.execute(f"ALTER TABLE {self.new_table} RENAME TO {self.old_table}")
            self.db.execute(f"ALTER TABLE {self.old_table}_old RENAME TO {self.new_table}")
        
        print("✅ Tables renamed atomically")
        
        # Step 2: Update application to disable dual-write
        print("Deploying application changes...")
        self.deploy_application_update()
        
        # Step 3: Monitor for issues
        print("Monitoring for 30 minutes...")
        self.monitor_cutover(duration_minutes=30)
        
        print("✅ Cutover completed successfully!")

The Migration Night: Hour by Hour

Hour 1 (6:00 PM): Start migration
  ✅ Dual-write deployed
  ✅ Monitoring confirmed working
  
Hour 3 (8:00 PM): Historical data copy begins
  - Copying at roughly 13M rows/hour
  - ETA: ~15 hours remaining
  
Hour 8 (1:00 AM): 40% complete
  - Team exhausted but system stable
  - Query performance monitoring: normal
  
Hour 12 (5:00 AM): 75% complete
  - Coffee consumption: critical levels
  - Production traffic increasing (morning in Europe)
  
Hour 16 (9:00 AM): 95% complete
  ⚠️  Warning: Replication lag spiking
  - Paused migration for 30 minutes
  - Let replication catch up
  
Hour 18 (11:00 AM): Migration complete
  ✅ 198,543,221 rows migrated
  ✅ Data verification passed
  ✅ Ready for cutover
  
Hour 18.5 (11:30 AM): THE CUTOVER
  - Tables renamed atomically
  - Application deployed
  - Holding breath...
  
Hour 19 (12:00 PM): Cutover successful!
  ✅ All systems green
  ✅ Query performance: 75% faster
  ✅ No customer impact
  ✅ Team collapses in relief

Phase 3: Sharding for Multi-Terabyte Scale

Partitioning bought us 6 months, but growth continued. At 35TB, we needed the next evolution: sharding.

The Sharding Strategy

# Our sharding architecture
import hashlib
from concurrent.futures import ThreadPoolExecutor

class DatabaseShardingStrategy:
    """
    Horizontal sharding for massive scale.
    
    Strategy: Shard by customer_id to keep customer data co-located.
    """
    
    def __init__(self):
        self.shard_count = 8  # Started with 8 shards
        self.shard_map = {}
        self.routing_algorithm = 'hash_mod_n'  # md5(customer_id) % shard_count
    
    def get_shard_for_customer(self, customer_id):
        """
        Determine which shard holds a customer's data.
        
        Deterministic hashing keeps the mapping stable across processes, so
        every service routes a given customer to the same shard.
        """
        hash_value = self.stable_hash(customer_id)
        shard_id = hash_value % self.shard_count
        
        return f"shard_{shard_id}"
    
    def stable_hash(self, key):
        """
        Stable md5-based hash (Python's built-in hash() is randomized per process).
        Note: modulo routing is not true consistent hashing; changing shard_count
        remaps most customers, so resharding requires a data migration.
        """
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    
    def route_query(self, query, customer_id=None):
        """
        Route query to appropriate shard(s).
        """
        if customer_id:
            # Single-shard query
            shard = self.get_shard_for_customer(customer_id)
            return self.execute_on_shard(shard, query)
        else:
            # Cross-shard query (expensive!)
            return self.execute_on_all_shards(query)
    
    def execute_on_all_shards(self, query):
        """
        Execute query across all shards and merge results.
        WARNING: This is slow and expensive. Avoid when possible.
        """
        results = []
        
        # Parallel execution across shards
        with ThreadPoolExecutor(max_workers=self.shard_count) as executor:
            futures = []
            
            for shard_id in range(self.shard_count):
                future = executor.submit(
                    self.execute_on_shard,
                    f"shard_{shard_id}",
                    query
                )
                futures.append(future)
            
            for future in futures:
                results.extend(future.result())
        
        return results
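
In application code the routing layer then looks like this; a usage sketch (execute_on_shard, the query text, and the customer id are placeholders):

router = DatabaseShardingStrategy()

# Single-shard path: the common case for tenant-scoped dashboards
orders = router.route_query(
    "SELECT * FROM orders WHERE customer_id = %(customer_id)s LIMIT 100",
    customer_id=421337,
)

# Cross-shard path: scatter-gather across all shards (expensive, use sparingly)
per_shard_counts = router.route_query("SELECT count(*) FROM orders")
# Returns one row per shard; aggregate client-side.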

The Sharding Implementation

# Sharding architecture
Shard 0 (customers 0-12.5%):
  Database: postgres-shard-0.cluster.us-east-1.rds.amazonaws.com
  Size: 6.2 TB
  Queries/sec: 3,200

Shard 1 (customers 12.5-25%):
  Database: postgres-shard-1.cluster.us-east-1.rds.amazonaws.com
  Size: 6.4 TB
  Queries/sec: 3,100

... (shards 2-6)

Shard 7 (customers 87.5-100%):
  Database: postgres-shard-7.cluster.us-east-1.rds.amazonaws.com
  Size: 6.1 TB
  Queries/sec: 3,300

Routing Layer:
  - PgBouncer connection pooling
  - Custom sharding proxy
  - Automatic failover
  - Read replica routing

The Results After Sharding

performance_improvements = {
    'before_sharding': {
        'db_size': '35 TB (single database)',
        'avg_query_time': '4,200 ms',
        'p95_query_time': '12,000 ms',
        'max_connections': '400',
        'monthly_cost': '$67,000'
    },
    'after_sharding': {
        'db_size': '8 × 6.2 TB (sharded)',
        'avg_query_time': '180 ms (96% improvement)',
        'p95_query_time': '850 ms (93% improvement)',
        'max_connections': '3,200 (8× more)',
        'monthly_cost': '$84,000 (26% increase for 8× capacity)'
    }
}

Phase 4: Advanced Optimizations

With the architecture in place, we focused on fine-tuning:

Connection Pooling Evolution

# PgBouncer configuration for enterprise scale
class ConnectionPoolManager:
    """
    Sophisticated connection pool management.
    """
    
    def __init__(self):
        self.pool_configs = {
            'transaction_pool': {
                'mode': 'transaction',
                'pool_size': 100,
                'max_client_conn': 10000,
                'default_pool_size': 20,
                'reserve_pool_size': 5
            },
            'session_pool': {
                'mode': 'session',
                'pool_size': 20,
                'max_client_conn': 500
            }
        }
    
    def configure_pgbouncer(self):
        """
        PgBouncer configuration for optimal performance.
        """
        config = """
        [databases]
        * = host=shard-{shard_id} port=5432 pool_size=100
        
        [pgbouncer]
        pool_mode = transaction
        max_client_conn = 10000
        default_pool_size = 100
        reserve_pool_size = 10
        server_idle_timeout = 600
        server_connect_timeout = 15
        query_timeout = 30
        
        # Connection recycling
        server_lifetime = 3600
        server_reset_query = DISCARD ALL
        """
        
        return config

Query Caching Strategy

import json
import hashlib
import redis

class IntelligentQueryCache:
    """
    Multi-layer caching for database queries.
    """
    
    def __init__(self):
        self.redis_client = redis.Redis()
        self.local_cache = {}  # per-process cache; bound it (e.g. with an LRU) in production
        self.cache_ttls = {
            'real_time': 30,       # 30 seconds
            'near_real_time': 300, # 5 minutes
            'analytical': 3600     # 1 hour
        }
    
    def generate_cache_key(self, query):
        """
        Deterministic cache key derived from the query text.
        """
        return 'qc:' + hashlib.sha256(query.strip().encode()).hexdigest()
    
    def get_cached_query(self, query, cache_type='near_real_time'):
        """
        Retrieve cached query results.
        """
        cache_key = self.generate_cache_key(query)
        
        # Check local cache first (fastest)
        if cache_key in self.local_cache:
            return self.local_cache[cache_key]
        
        # Check Redis (fast)
        cached = self.redis_client.get(cache_key)
        if cached:
            result = json.loads(cached)
            # Populate local cache
            self.local_cache[cache_key] = result
            return result
        
        # Cache miss - execute query
        return None
    
    def cache_query_result(self, query, result, cache_type='near_real_time'):
        """
        Cache query results with appropriate TTL.
        """
        cache_key = self.generate_cache_key(query)
        ttl = self.cache_ttls[cache_type]
        
        # Store in Redis
        self.redis_client.setex(
            cache_key,
            ttl,
            json.dumps(result)
        )
        
        # Store in local cache
        self.local_cache[cache_key] = result
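
Wired into the data layer, the cache acts as a read-through wrapper around query execution; a sketch of the calling pattern (the db handle is a placeholder):

cache = IntelligentQueryCache()

def run_cached(db, query, cache_type="near_real_time"):
    """Serve from cache when possible, otherwise hit the database and cache the result."""
    result = cache.get_cached_query(query, cache_type)
    if result is not None:
        return result

    result = db.execute(query).fetchall()  # placeholder DB call
    cache.cache_query_result(query, result, cache_type)
    return result

# Dashboard reads tolerate 5 minutes of staleness; use 'real_time' for anything that doesn't.
# rows = run_cached(db, "SELECT * FROM mv_recent_orders_summary LIMIT 100")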

The Final Results: 25x Scale Achievement

Performance Metrics

final_results = {
    'database_size': {
        'before': '2 TB (single)',
        'after': '50 TB (8 shards)',
        'growth': '25x'
    },
    'query_performance': {
        'avg_query_time': {
            'before': '200 ms',
            'after': '35 ms',
            'improvement': '82%'
        },
        'p95_query_time': {
            'before': '850 ms',
            'after': '120 ms',
            'improvement': '86%'
        },
        'p99_query_time': {
            'before': '2,400 ms',
            'after': '380 ms',
            'improvement': '84%'
        }
    },
    'capacity': {
        'concurrent_connections': {
            'before': 400,
            'after': 3200,
            'increase': '8x'
        },
        'queries_per_second': {
            'before': 8000,
            'after': 45000,
            'increase': '5.6x'
        }
    },
    'costs': {
        'monthly_database_cost': {
            'before': '$12,000',
            'after': '$84,000',
            'increase': '7x'
        },
        'cost_per_tb': {
            'before': '$6,000/TB',
            'after': '$1,680/TB',
            'improvement': '72%'
        },
        'cost_per_query': {
            'before': '$0.000062',
            'after': '$0.000021',
            'improvement': '66%'
        }
    }
}

The Lessons: What Actually Matters at Scale

1. Vertical Scaling Has Limits

No amount of RAM or CPU can overcome fundamental architectural issues. We learned this after spending $150K on larger instances that only bought us 6 months.

2. Partitioning Before Sharding

Always implement time-based partitioning before moving to sharding. It’s less complex and solves 80% of performance problems for 20% of the effort.

3. Measure Everything

We built comprehensive monitoring before making changes:

class DatabaseMetricsTracker:
    """
    Comprehensive database performance tracking.
    """
    
    metrics_to_track = [
        'query_execution_time',
        'connection_pool_utilization',
        'cache_hit_rate',
        'index_usage',
        'table_bloat',
        'replication_lag',
        'disk_io_utilization',
        'cpu_usage',
        'memory_usage',
        'lock_contention',
        'deadlock_rate'
    ]
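
Most of these come straight from PostgreSQL's statistics views. For example, cache hit rate and replication lag can be sampled like this (the DSN is a placeholder):

import psycopg2

METRIC_QUERIES = {
    # Share of block reads served from shared_buffers across all databases
    "cache_hit_rate": """
        SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0)
        FROM pg_stat_database
    """,
    # Worst replica lag in seconds (NULL when no streaming replicas are attached)
    "replication_lag_seconds": """
        SELECT max(EXTRACT(EPOCH FROM replay_lag))
        FROM pg_stat_replication
    """,
}

def sample_metrics(dsn="dbname=app host=db.example.internal"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        samples = {}
        for name, sql in METRIC_QUERIES.items():
            cur.execute(sql)
            samples[name] = cur.fetchone()[0]
        return samples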

4. Zero-Downtime Migrations Are Possible

Our dual-write strategy worked flawlessly:

  1. Deploy dual-write logic
  2. Copy historical data in background
  3. Verify consistency
  4. Atomic table swap
  5. Remove old table

5. Cost Per Query Matters More Than Total Cost

While our monthly costs increased 7x, our cost per query dropped 66%, we now store 25x the data, and we handle roughly 8x the concurrent load. Cost per query is the metric that actually matters.

Key Takeaways

  • Start with partitioning for time-series data
  • Measure performance before optimizing
  • Implement comprehensive monitoring
  • Use dual-write for zero-downtime migrations
  • Shard by customer/tenant for SaaS applications
  • Connection pooling is non-negotiable at scale
  • Multi-layer caching reduces database load 60-80%
  • Plan for 3x resource overhead with sharding
  • Test migrations in staging with production-like data
  • Have rollback plans for every major change

Conclusion: Database Scaling is a Journey

Scaling from 2TB to 50TB took us 18 months, cost $200K in migration expenses, and required fundamental architectural changes. But the results speak for themselves:

  • 85% faster queries
  • 8x more capacity
  • 72% better cost efficiency
  • 99.99% uptime maintained

The key insight: database scaling requires architecture changes, not just bigger servers.

For more on advanced database performance patterns, database replication strategies, sharding patterns, and zero-downtime migrations, check out CrashBytes.

This post is part of my implementation series, sharing real-world lessons from scaling production systems. For more on distributed caching strategies and multi-layer caching architectures, visit CrashBytes.