The Problem: When Success Becomes Your Biggest Challenge
Three years ago, our PostgreSQL database was 2TB. Last month, it hit 50TB. Our queries went from 200ms to 8+ seconds. Our monthly AWS bill went from $12K to $89K. Our customers started complaining.
We had two options:
- Throw more money at bigger instances
- Fundamentally rethink our database architecture
We chose option 2. This is the story of how we scaled our database 25x while cutting query times by 85% and keeping the system running 24/7.
Spoiler: It involved 7 major architectural changes, 4 failed approaches, $200K in migration costs, and one terrifying 18-hour cutover that kept the entire engineering team awake.
The Starting Point: What Got Us Here Won’t Get Us There
Our original architecture was textbook SaaS:
Database: PostgreSQL 13
Instance: AWS RDS db.r5.12xlarge
- 48 vCPUs
- 384 GB RAM
- 20,000 IOPS provisioned
Data:
- Main database: 2TB
- 200M rows in orders table
- 500M rows in events table
- 1.2B rows in audit_logs table
Performance:
- Average query time: 200ms
- p95 query time: 850ms
- Peak concurrent connections: 400
- Monthly cost: $12,000
The Growth Curve That Broke Everything
# Our data growth pattern (simplified)
growth_data = {
    '2022-01': {'size_tb': 2.0, 'avg_query_ms': 200},
    '2022-06': {'size_tb': 4.2, 'avg_query_ms': 380},
    '2023-01': {'size_tb': 8.5, 'avg_query_ms': 920},
    '2023-06': {'size_tb': 18.0, 'avg_query_ms': 2400},
    '2024-01': {'size_tb': 32.0, 'avg_query_ms': 4800},
    '2024-06': {'size_tb': 45.0, 'avg_query_ms': 7200},
    '2025-01': {'size_tb': 50.0, 'avg_query_ms': 8500}
}
# The trend was clear: degradation was super-linear.
# 25x more data brought roughly 42x slower average queries.
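A quick back-of-the-envelope check quantifies how bad the degradation was; this snippet simply reuses the first and last data points from the table above:
import math

size_ratio = 50.0 / 2.0      # 25x more data (2022-01 -> 2025-01)
time_ratio = 8500 / 200      # 42.5x slower average queries
exponent = math.log(time_ratio) / math.log(size_ratio)
print(f"degradation exponent: {exponent:.2f}")  # ~1.2, i.e. clearly worse than linear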
The breaking point: Our largest customer’s dashboard took 45 seconds to load. They threatened to leave.
Phase 1: The Failed “Just Optimize” Approach
Month 1: Index Everything!
Our first attempt was naive: add more indexes.
-- We added 47 new indexes across major tables
-- Example from our orders table
CREATE INDEX CONCURRENTLY idx_orders_customer_status
ON orders(customer_id, status);
CREATE INDEX CONCURRENTLY idx_orders_created_date
ON orders(created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_product_category
ON orders(product_id, category_id);
-- And 44 more...
Results after 2 weeks:
- Query performance: Improved 15% (not nearly enough)
- Index maintenance overhead: +40%
- Write performance: Degraded 25%
- Database size: Increased 8% just from indexes
- Conclusion: ❌ Failed approach
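If you find yourself heading down the same path, it pays to check which indexes actually earn their keep before adding more. A query along these lines against pg_stat_user_indexes (a generic sketch, not our exact tooling) surfaces the dead weight:
-- Find indexes that are rarely scanned, largest first
SELECT schemaname,
       relname        AS table_name,
       indexrelname   AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 50
ORDER BY pg_relation_size(indexrelid) DESC;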
Month 2: Query Optimization Marathon
We spent 4 weeks rewriting our worst-performing queries:
-- BEFORE: The naive query that was killing us
SELECT
o.*,
c.name as customer_name,
p.name as product_name,
s.name as seller_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN sellers s ON p.seller_id = s.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
AND o.status IN ('pending', 'processing')
ORDER BY o.created_at DESC
LIMIT 100;
-- Execution time: 8,500ms
-- Cost: 2,450,000
-- AFTER: Optimized with materialized view and partitioning
-- Step 1: Create materialized view for recent orders
CREATE MATERIALIZED VIEW mv_recent_orders_summary AS
SELECT
o.id,
o.customer_id,
o.product_id,
o.status,
o.created_at,
o.total_amount,
c.name as customer_name,
p.name as product_name,
s.name as seller_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN sellers s ON p.seller_id = s.id
WHERE o.created_at >= NOW() - INTERVAL '90 days';
CREATE UNIQUE INDEX ON mv_recent_orders_summary(id);
CREATE INDEX ON mv_recent_orders_summary(created_at DESC);
-- Step 2: Query the materialized view
SELECT *
FROM mv_recent_orders_summary
WHERE status IN ('pending', 'processing')
ORDER BY created_at DESC
LIMIT 100;
-- Execution time: 45ms (99% improvement!)
-- Cost: 12,000
-- Step 3: Refresh strategy
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_recent_orders_summary;
-- Scheduled every 5 minutes via cron
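The cron half of that is a one-line crontab entry on a utility host, something along these lines (DATABASE_URL is a placeholder for your connection string):
*/5 * * * * psql "$DATABASE_URL" -c "REFRESH MATERIALIZED VIEW CONCURRENTLY mv_recent_orders_summary;"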
Results:
- Top 50 queries: 60-90% faster
- Overall system performance: Still too slow
- Time invested: 640 engineering hours
- Conclusion: ⚠️ Helpful but insufficient
Phase 2: The Table Partitioning Breakthrough
The Realization: We Don’t Query All Data Equally
Analysis of our query patterns revealed the key insight:
# Query analysis results
query_patterns = {
    'last_7_days': '78% of all queries',
    'last_30_days': '18% of all queries',
    'last_90_days': '3% of all queries',
    'older_than_90_days': '1% of all queries'
}
# Most queries hit a tiny fraction of our data!
Implementing Time-Based Partitioning
-- Step 1: Create partitioned table structure
-- WARNING: This is a migration nightmare for large tables
CREATE TABLE orders_partitioned (
    id BIGSERIAL,
    customer_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    status VARCHAR(50),
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP,
    -- ... other columns
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Step 2: Create monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Created 36 partitions (3 years of monthly partitions)
-- Step 3: Partition-specific indexes
CREATE INDEX ON orders_2025_01(customer_id);
CREATE INDEX ON orders_2025_01(status);
-- Repeated for each partition
-- Step 4: Automatic partition creation via pg_partman
SELECT partman.create_parent(
    p_parent_table => 'public.orders_partitioned',
    p_control => 'created_at',
    p_type => 'native',
    p_interval => '1 month',
    p_premake => 3  -- Create 3 months ahead
);
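One nicety worth knowing: on PostgreSQL 11 and later, an index created on the partitioned parent is propagated to every partition automatically, so the per-partition CREATE INDEX statements above can be collapsed into a single statement on orders_partitioned. Either way, the payoff shows up in the query plan: a filter on created_at should touch only the relevant partitions. A quick sanity check (an illustrative query, not from our runbook):
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM orders_partitioned
WHERE created_at >= NOW() - INTERVAL '7 days'
  AND status = 'pending';
-- The plan should show scans of only the most recent partition(s),
-- with the other ~35 partitions pruned away.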
The Migration: 18 Hours of Terror
Migrating 200M rows from our monolithic orders table to the partitioned structure was the most stressful experience of my career.
# Migration script (simplified)
from datetime import datetime, timedelta
import time


class TablePartitionMigration:
    """
    Migrate a massive table to a partitioned structure with zero downtime.

    Strategy: dual-write + background backfill + atomic cutover.
    """

    def __init__(self, db):
        self.db = db  # database client wrapper (connection handling elided)
        self.batch_size = 10000
        self.old_table = 'orders'
        self.new_table = 'orders_partitioned'
        self.migration_started = datetime.now()

    def execute_migration(self):
        """Multi-phase migration to minimize risk."""
        print("=== PHASE 1: Enable Dual Writes ===")
        self.enable_dual_writes()

        print("\n=== PHASE 2: Copy Historical Data ===")
        self.copy_historical_data()

        print("\n=== PHASE 3: Verify Data Consistency ===")
        self.verify_consistency()

        print("\n=== PHASE 4: Cutover to New Table ===")
        self.execute_cutover()

    def enable_dual_writes(self):
        """Modify the application to write to both the old and new tables."""
        # Deploy application change
        print("Deploying dual-write logic...")

        # Verify writes are landing in both tables
        time.sleep(60)
        old_count = self.get_recent_row_count(self.old_table, minutes=5)
        new_count = self.get_recent_row_count(self.new_table, minutes=5)

        assert abs(old_count - new_count) < 10, "Dual writes not working!"
        print(f"✅ Dual writes enabled: {old_count} rows in last 5 min")

    def copy_historical_data(self):
        """Copy historical data in batches to avoid locking."""
        total_rows = self.count_rows(self.old_table)
        migrated = 0
        print(f"Total rows to migrate: {total_rows:,}")

        # Find the date range to migrate
        min_date = self.get_min_date(self.old_table)
        max_date = datetime.now() - timedelta(hours=1)  # Leave a buffer

        current_date = min_date
        while current_date < max_date:
            batch_end = current_date + timedelta(days=1)

            rows_copied = self.copy_date_range(current_date, batch_end)
            migrated += rows_copied

            progress = (migrated / total_rows) * 100
            print(f"Progress: {progress:.1f}% ({migrated:,}/{total_rows:,} rows)")

            current_date = batch_end

            # Rate limiting to avoid overwhelming the database
            time.sleep(2)

    def copy_date_range(self, start_date, end_date):
        """Copy a specific date range using a batched INSERT ... SELECT."""
        query = f"""
            INSERT INTO {self.new_table}
            SELECT * FROM {self.old_table}
            WHERE created_at >= %s
              AND created_at < %s
              AND id NOT IN (
                  SELECT id FROM {self.new_table}
                  WHERE created_at >= %s AND created_at < %s
              )
        """
        # Execute in a transaction
        with self.db.transaction():
            cursor = self.db.execute(query, (start_date, end_date, start_date, end_date))
            rows_inserted = cursor.rowcount

        return rows_inserted

    def verify_consistency(self):
        """Verify the old and new tables contain the same data."""
        print("Verifying data consistency...")

        # Check row counts
        old_count = self.count_rows(self.old_table)
        new_count = self.count_rows(self.new_table)

        print(f"Old table: {old_count:,} rows")
        print(f"New table: {new_count:,} rows")

        difference = abs(old_count - new_count)
        if difference > 1000:
            raise Exception(f"Row count mismatch: {difference:,} rows different")

        # Spot-check random samples
        self.verify_random_samples(sample_size=1000)

        print("✅ Data consistency verified")

    def execute_cutover(self):
        """The scary part: switch all reads to the new table."""
        print("=" * 50)
        print("EXECUTING CUTOVER - POINT OF NO RETURN")
        print("=" * 50)

        # Step 1: Rename tables (atomic within a single transaction).
        # The old monolithic table is parked under the former staging name.
        with self.db.transaction():
            self.db.execute(f"ALTER TABLE {self.old_table} RENAME TO {self.old_table}_old")
            self.db.execute(f"ALTER TABLE {self.new_table} RENAME TO {self.old_table}")
            self.db.execute(f"ALTER TABLE {self.old_table}_old RENAME TO {self.new_table}")

        print("✅ Tables renamed atomically")

        # Step 2: Update the application to disable dual-write
        print("Deploying application changes...")
        self.deploy_application_update()

        # Step 3: Monitor for issues
        print("Monitoring for 30 minutes...")
        self.monitor_cutover(duration_minutes=30)

        print("✅ Cutover completed successfully!")

    # Helpers (count_rows, get_min_date, get_recent_row_count,
    # verify_random_samples, deploy_application_update, monitor_cutover)
    # are elided for brevity.
The Migration Night: Hour by Hour
Hour 1 (6:00 PM): Start migration
✅ Dual-write deployed
✅ Monitoring confirmed working
Hour 3 (8:00 PM): Historical data copy begins
- Copying at 5M rows/hour
- ETA: 18 hours remaining
Hour 8 (1:00 AM): 40% complete
- Team exhausted but system stable
- Query performance monitoring: normal
Hour 12 (5:00 AM): 75% complete
- Coffee consumption: critical levels
- Production traffic increasing (morning in Europe)
Hour 16 (9:00 AM): 95% complete
⚠️ Warning: Replication lag spiking
- Paused migration for 30 minutes
- Let replication catch up
Hour 18 (11:00 AM): Migration complete
✅ 198,543,221 rows migrated
✅ Data verification passed
✅ Ready for cutover
Hour 18.5 (11:30 AM): THE CUTOVER
- Tables renamed atomically
- Application deployed
- Holding breath...
Hour 19 (12:00 PM): Cutover successful!
✅ All systems green
✅ Query performance: 75% faster
✅ No customer impact
✅ Team collapses in relief
Phase 3: Sharding for Multi-Terabyte Scale
Partitioning bought us 6 months, but growth continued. At 35TB, we needed the next evolution: sharding.
The Sharding Strategy
# Our sharding architecture
import hashlib
from concurrent.futures import ThreadPoolExecutor


class DatabaseShardingStrategy:
    """
    Horizontal sharding for massive scale.

    Strategy: shard by customer_id so each customer's data stays co-located.
    """

    def __init__(self):
        self.shard_count = 8  # Started with 8 shards
        self.shard_map = {}
        self.routing_algorithm = 'consistent_hashing'

    def get_shard_for_customer(self, customer_id):
        """
        Determine which shard holds a customer's data.

        Hashing the customer_id gives an even distribution across shards.
        """
        hash_value = self.consistent_hash(customer_id)
        shard_id = hash_value % self.shard_count
        return f"shard_{shard_id}"

    def consistent_hash(self, key):
        """
        MD5-based hash for even key distribution.

        Note: combined with the modulo above, this is simple hash routing
        rather than a true consistent-hash ring; changing shard_count remaps
        most keys, so adding shards means migrating data.
        """
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def route_query(self, query, customer_id=None):
        """Route a query to the appropriate shard(s)."""
        if customer_id:
            # Single-shard query
            shard = self.get_shard_for_customer(customer_id)
            return self.execute_on_shard(shard, query)
        else:
            # Cross-shard query (expensive!)
            return self.execute_on_all_shards(query)

    def execute_on_all_shards(self, query):
        """
        Execute a query across all shards and merge the results.

        WARNING: This is slow and expensive. Avoid when possible.
        """
        results = []

        # Parallel execution across shards
        with ThreadPoolExecutor(max_workers=self.shard_count) as executor:
            futures = []
            for shard_id in range(self.shard_count):
                futures.append(
                    executor.submit(self.execute_on_shard, f"shard_{shard_id}", query)
                )
            for future in futures:
                results.extend(future.result())

        return results

    # execute_on_shard (per-shard connection handling) is elided here;
    # see the sketch below for one possible shape.
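For completeness, here is a minimal sketch of what execute_on_shard could look like with psycopg2 at the call site. The DSN map, the PgBouncer-style port, and the example query are illustrative assumptions, not our actual topology:
import psycopg2

# Hypothetical DSNs; in production these came from service discovery.
SHARD_DSNS = {
    f"shard_{i}": f"postgresql://app@postgres-shard-{i}.internal:6432/app"
    for i in range(8)
}

def execute_on_shard(shard, query, params=None):
    """Run a query against a single shard and return all rows."""
    with psycopg2.connect(SHARD_DSNS[shard]) as conn:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()

# Usage: resolve the shard, then query it.
router = DatabaseShardingStrategy()
shard = router.get_shard_for_customer(customer_id=42)  # e.g. "shard_3"
rows = execute_on_shard(shard, "SELECT * FROM orders WHERE customer_id = %s", (42,))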
The Sharding Implementation
# Sharding architecture
Shard 0 (customers 0-12.5%):
  Database: postgres-shard-0.cluster.us-east-1.rds.amazonaws.com
  Size: 6.2 TB
  Queries/sec: 3,200

Shard 1 (customers 12.5-25%):
  Database: postgres-shard-1.cluster.us-east-1.rds.amazonaws.com
  Size: 6.4 TB
  Queries/sec: 3,100

... (shards 2-6)

Shard 7 (customers 87.5-100%):
  Database: postgres-shard-7.cluster.us-east-1.rds.amazonaws.com
  Size: 6.1 TB
  Queries/sec: 3,300

Routing Layer:
  - PgBouncer connection pooling
  - Custom sharding proxy
  - Automatic failover
  - Read replica routing
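The read-replica routing item deserves a sketch of its own: writes and anything transactionally sensitive go to the shard primary, while read-only dashboard traffic fans out across that shard's replicas. Roughly like this (the hostnames and the random replica choice are illustrative assumptions):
import random

# Hypothetical endpoints per shard: one primary, N read replicas.
SHARD_ENDPOINTS = {
    "shard_0": {
        "primary": "postgres-shard-0.cluster.us-east-1.rds.amazonaws.com",
        "replicas": [
            "postgres-shard-0-ro-a.cluster.us-east-1.rds.amazonaws.com",
            "postgres-shard-0-ro-b.cluster.us-east-1.rds.amazonaws.com",
        ],
    },
    # ... shard_1 through shard_7
}

def pick_host(shard, readonly=False):
    """Route writes to the primary; spread read-only traffic across replicas."""
    endpoints = SHARD_ENDPOINTS[shard]
    if readonly and endpoints["replicas"]:
        return random.choice(endpoints["replicas"])
    return endpoints["primary"]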
The Results After Sharding
performance_improvements = {
    'before_sharding': {
        'db_size': '35 TB (single database)',
        'avg_query_time': '4,200 ms',
        'p95_query_time': '12,000 ms',
        'max_connections': '400',
        'monthly_cost': '$67,000'
    },
    'after_sharding': {
        'db_size': '8 × 6.2 TB (sharded)',
        'avg_query_time': '180 ms (96% improvement)',
        'p95_query_time': '850 ms (93% improvement)',
        'max_connections': '3,200 (8× more)',
        'monthly_cost': '$84,000 (26% increase for 8× capacity)'
    }
}
Phase 4: Advanced Optimizations
With the architecture in place, we focused on fine-tuning:
Connection Pooling Evolution
# PgBouncer configuration for enterprise scale
class ConnectionPoolManager:
    """
    Sophisticated connection pool management.
    """

    def __init__(self):
        self.pool_configs = {
            'transaction_pool': {
                'mode': 'transaction',
                'pool_size': 100,
                'max_client_conn': 10000,
                'default_pool_size': 20,
                'reserve_pool_size': 5
            },
            'session_pool': {
                'mode': 'session',
                'pool_size': 20,
                'max_client_conn': 500
            }
        }

    def configure_pgbouncer(self):
        """
        PgBouncer configuration for optimal performance.
        """
        config = """
        [databases]
        * = host=shard-{shard_id} port=5432 pool_size=100

        [pgbouncer]
        pool_mode = transaction
        max_client_conn = 10000
        default_pool_size = 100
        reserve_pool_size = 10
        server_idle_timeout = 600
        server_connect_timeout = 15
        query_timeout = 30

        # Connection recycling
        server_lifetime = 3600
        server_reset_query = DISCARD ALL
        """
        return config
Query Caching Strategy
import hashlib
import json

import redis


class IntelligentQueryCache:
    """
    Multi-layer caching for database queries.
    """

    def __init__(self):
        self.redis_client = redis.Redis()
        self.local_cache = {}
        self.cache_ttls = {
            'real_time': 30,        # 30 seconds
            'near_real_time': 300,  # 5 minutes
            'analytical': 3600      # 1 hour
        }

    def generate_cache_key(self, query):
        """Stable cache key derived from the query text."""
        return 'qc:' + hashlib.sha256(query.encode()).hexdigest()

    def get_cached_query(self, query, cache_type='near_real_time'):
        """
        Retrieve cached query results, or None on a miss.
        """
        cache_key = self.generate_cache_key(query)

        # Check the local in-process cache first (fastest)
        if cache_key in self.local_cache:
            return self.local_cache[cache_key]

        # Check Redis (fast)
        cached = self.redis_client.get(cache_key)
        if cached:
            result = json.loads(cached)
            # Populate the local cache
            self.local_cache[cache_key] = result
            return result

        # Cache miss: the caller executes the query
        return None

    def cache_query_result(self, query, result, cache_type='near_real_time'):
        """
        Cache query results with the appropriate TTL.
        """
        cache_key = self.generate_cache_key(query)
        ttl = self.cache_ttls[cache_type]

        # Store in Redis
        self.redis_client.setex(cache_key, ttl, json.dumps(result))

        # Store in the local cache
        self.local_cache[cache_key] = result
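Wiring this into the data layer is a read-through pattern: try the cache, fall back to the database, then populate the cache. A minimal sketch, where run_on_database is a placeholder for whatever executes SQL in your stack:
cache = IntelligentQueryCache()

def cached_query(sql, cache_type='near_real_time'):
    """Read-through cache around a database call."""
    result = cache.get_cached_query(sql, cache_type)
    if result is None:
        result = run_on_database(sql)  # placeholder for the real executor
        cache.cache_query_result(sql, result, cache_type)
    return result

# Dashboards tolerate 5-minute staleness; billing queries do not.
recent = cached_query("SELECT * FROM mv_recent_orders_summary LIMIT 100")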
The Final Results: 25x Scale Achievement
Performance Metrics
final_results = {
    'database_size': {
        'before': '2 TB (single)',
        'after': '50 TB (8 shards)',
        'growth': '25x'
    },
    'query_performance': {
        'avg_query_time': {
            'before': '200 ms',
            'after': '35 ms',
            'improvement': '82%'
        },
        'p95_query_time': {
            'before': '850 ms',
            'after': '120 ms',
            'improvement': '86%'
        },
        'p99_query_time': {
            'before': '2,400 ms',
            'after': '380 ms',
            'improvement': '84%'
        }
    },
    'capacity': {
        'concurrent_connections': {
            'before': 400,
            'after': 3200,
            'increase': '8x'
        },
        'queries_per_second': {
            'before': 8000,
            'after': 45000,
            'increase': '5.6x'
        }
    },
    'costs': {
        'monthly_database_cost': {
            'before': '$12,000',
            'after': '$84,000',
            'increase': '7x'
        },
        'cost_per_tb': {
            'before': '$6,000/TB',
            'after': '$1,680/TB',
            'improvement': '72%'
        },
        'cost_per_query': {
            'before': '$0.000062',
            'after': '$0.000021',
            'improvement': '66%'
        }
    }
}
The Lessons: What Actually Matters at Scale
1. Vertical Scaling Has Limits
No amount of RAM or CPU can overcome fundamental architectural issues. We learned this after spending $150K on larger instances that only bought us 6 months.
2. Partitioning Before Sharding
Always implement time-based partitioning before moving to sharding. It’s less complex and solves 80% of performance problems for 20% of the effort.
3. Measure Everything
We built comprehensive monitoring before making changes:
class DatabaseMetricsTracker:
    """
    Comprehensive database performance tracking.
    """

    metrics_to_track = [
        'query_execution_time',
        'connection_pool_utilization',
        'cache_hit_rate',
        'index_usage',
        'table_bloat',
        'replication_lag',
        'disk_io_utilization',
        'cpu_usage',
        'memory_usage',
        'lock_contention',
        'deadlock_rate'
    ]
4. Zero-Downtime Migrations Are Possible
Our dual-write strategy worked flawlessly (a minimal sketch of the write path follows the list):
- Deploy dual-write logic
- Copy historical data in background
- Verify consistency
- Atomic table swap
- Remove old table
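Here is a stripped-down version of that dual-write step, assuming a psycopg2-style connection and the table names from Phase 2; the savepoint keeps a failed mirror write from ever breaking the user-facing insert:
import logging

log = logging.getLogger("dual_write")

def insert_order(conn, order):
    """
    Dual-write an order row: the legacy table stays the source of truth,
    the partitioned table gets a best-effort mirror write.
    Assumes `order` is a dict of trusted column names mapped to values.
    """
    cols = ", ".join(order)
    placeholders = ", ".join(["%s"] * len(order))
    values = list(order.values())

    with conn.cursor() as cur:
        # 1. Primary write to the legacy table.
        cur.execute(f"INSERT INTO orders ({cols}) VALUES ({placeholders})", values)

        # 2. Mirror write, isolated behind a savepoint so a failure here can
        #    never roll back the primary write; gaps are repaired by the
        #    background backfill job.
        cur.execute("SAVEPOINT dual_write")
        try:
            cur.execute(
                f"INSERT INTO orders_partitioned ({cols}) VALUES ({placeholders})",
                values,
            )
        except Exception:
            log.exception("dual-write to orders_partitioned failed")
            cur.execute("ROLLBACK TO SAVEPOINT dual_write")

    conn.commit()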
5. Cost Per Query Matters More Than Total Cost
While our monthly database cost rose 7x, our cost per query fell 66% and we now store 25x the data. Cost per unit of work is the metric that actually matters.
Key Takeaways
✅ Start with partitioning for time-series data
✅ Measure performance before optimizing
✅ Implement comprehensive monitoring
✅ Use dual-write for zero-downtime migrations
✅ Shard by customer/tenant for SaaS applications
✅ Connection pooling is non-negotiable at scale
✅ Multi-layer caching reduces database load 60-80%
✅ Plan for 3x resource overhead with sharding
✅ Test migrations in staging with production-like data
✅ Have rollback plans for every major change
Conclusion: Database Scaling is a Journey
Scaling from 2TB to 50TB took us 18 months, cost $200K in migration expenses, and required fundamental architectural changes. But the results speak for themselves:
- 85% faster queries
- 8x more capacity
- 72% better cost efficiency
- 99.99% uptime maintained
The key insight: database scaling requires architecture changes, not just bigger servers.
For more on advanced database performance patterns, database replication strategies, sharding patterns, and zero-downtime migrations, check out CrashBytes.
Additional Resources
These resources were invaluable during our scaling journey:
- PostgreSQL Performance Tuning Guide
- AWS RDS Best Practices
- Database Sharding Patterns
- PgBouncer Documentation
- Patroni High Availability
- pg_partman Extension
- PostgreSQL Monitoring with pgwatch2
- Citus Data Sharding
- TimescaleDB for Time-Series
- PostgreSQL at Scale (Uber Engineering)
- Scaling PostgreSQL (Instagram Engineering)
- Database Reliability Engineering (O’Reilly)
This post is part of my implementation series, sharing real-world lessons from scaling production systems. For more on distributed caching strategies and multi-layer caching architectures, visit CrashBytes.