The Database Migration That Cost Us $800K in Lost Revenue

We planned for 6 hours of downtime. We got 43 hours. Here's the painful story of our database migration disaster and the zero-downtime patterns we use now.

Hour 1: “This Is Going Great”

Saturday, 8:00 AM. Our database migration was starting right on schedule.

We had 6 hours to migrate our primary PostgreSQL database from an aging on-premise server to AWS RDS. The plan was solid. We’d practiced on staging. We’d done our checklist.

By 2 PM, we’d be back online with a faster, more reliable database.

By Sunday afternoon, we’d still be down. By Monday morning, we’d have lost $800,000 in revenue and our CTO would be drafting his resignation letter.

This is the story of what went catastrophically wrong—and how we rebuilt our approach to database migrations so it would never happen again.

For the technical framework we should have used from the start, check out the comprehensive guide on zero-downtime database migrations on CrashBytes.

The Context: Why We Were Migrating

Our SaaS platform ran on a single PostgreSQL 11 instance hosted on a physical server in a data center we were vacating. The timeline was non-negotiable—the data center lease expired in 3 weeks.

Our database was:

  • 850 GB of data
  • ~2,000 transactions per second during peak hours
  • The backbone of a platform serving 15,000 paying customers
  • Running 24/7 with zero tolerance for extended downtime (customers processed financial transactions)

We’d decided on a “big bang” migration approach: take the service down, migrate everything, bring it back up. Simple.

The plan:

  1. Saturday 8 AM: Put site in maintenance mode (6-hour window)
  2. Take final backup of source database
  3. Restore to AWS RDS
  4. Run validation tests
  5. Update application connection strings
  6. Saturday 2 PM: Bring site back online

What could possibly go wrong?

Hour 3: The First Sign of Trouble

The backup completed. 103 GB compressed dump file. All looking good.

We started the restore to AWS RDS. PostgreSQL’s pg_restore utility chugged away.

Time estimate: 4 hours.

Wait. Four hours?

We’d tested this on staging. Staging took 45 minutes. Why was production taking 4 hours?

The answer, we’d later learn: Our staging database was 6 months old and had 1/10th the data volume of production. We’d massively underestimated restore time.

First mistake: Testing with staging data that didn’t reflect production scale.

Hour 7: “We Need to Extend the Window”

2 PM came and went. Restore was at 68%.

We sent an email to customers: “Maintenance extended to 6 PM.”

Immediately, our support inbox exploded. Customers were furious. We were a financial services platform; six hours of downtime meant they couldn't process transactions, and they were losing money.

We told ourselves we'd be done in two more hours. We'd be fine.

Second mistake: Underestimating how critical our uptime was to our customers’ business operations.

Hour 11: The Restore Completes. The Nightmare Begins.

6:30 PM. The restore finally finished.

We started our validation tests:

# Check row counts
psql -h new-db.amazonaws.com -c "SELECT COUNT(*) FROM users;"
# 58,293 users

psql -h old-db.local -c "SELECT COUNT(*) FROM users;"  
# 58,293 users

# Perfect! Counts match!

We ran through our validation checklist:

  • ✅ User count matches
  • ✅ Transaction count matches
  • ✅ All tables present
  • ✅ Foreign keys valid

Everything looked good. We updated the application config to point to the new database. Deployed. Started the application.

7:15 PM: Site was back online.

Customers started logging in. Transactions started flowing. We were monitoring dashboards nervously.

7:22 PM: Error rate spiked to 45%.

What. The. Hell.

Hour 12: The Investigation

Errors everywhere in our logs:

PSQLException: column "user_preference" does not exist
PSQLException: relation "audit_log" does not exist
PSQLException: invalid input syntax for type uuid

Wait. Those columns definitely exist. I just checked.

Our senior engineer started digging. What he found was horrifying:

The columns existed. But the data types were wrong. And some indexes were missing. And some constraints were disabled.

It turned out pg_restore had partially failed. By default, pg_restore logs each failed command and keeps going (it only stops if you pass --exit-on-error), then prints a count of the errors it ignored at the end. We hadn't checked the restore logs carefully.

Result: We had a database that looked complete but was actually subtly corrupted.

Third mistake: Not comprehensively validating the restore beyond simple row counts.
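
In hindsight, the fix is cheap: force the restore to fail loudly and read its log before trusting it. A minimal sketch of that kind of wrapper, with host and file paths as placeholders rather than our actual tooling:

# Run pg_restore so that any failed command aborts the restore instead of
# being silently skipped, and keep the full log for review.
import subprocess
import sys

result = subprocess.run(
    [
        "pg_restore",
        "--host", "new-db.amazonaws.com",
        "--dbname", "production",
        "--exit-on-error",              # abort on the first failed command
        "--verbose",
        "/backups/production.dump",     # placeholder path
    ],
    capture_output=True,
    text=True,
)

with open("/backups/restore.log", "w") as fh:   # always keep the log
    fh.write(result.stderr)

if result.returncode != 0:
    print("pg_restore reported errors; do NOT cut over. See restore.log.", file=sys.stderr)
    sys.exit(1)

With --exit-on-error, the restore would have aborted at the first failed command instead of producing a database that merely looked complete.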

Hour 14: The Emergency Rollback Decision

8:30 PM. We had two options:

  1. Try to fix the new database (unknown time to resolution)
  2. Roll back to the old database (maybe 2 hours)

We chose rollback. Get customers back up ASAP.

But here’s the problem: we’d already processed 90 minutes of transactions on the new database.

When we rolled back to the old database, we’d lose those transactions.

We made the call: “Roll back. We’ll deal with the lost data later.”

9:00 PM: Rolled back to old database. Site back online.

Immediate problem: We’d processed $127,000 worth of customer transactions on the corrupted database. Those transactions were gone.

Our CTO started drafting communication to affected customers. Our legal team got involved. It was going to be a long night.

Fourth mistake: No rollback plan that accounted for transactions processed during the failed migration.

The Weekend From Hell

We spent Sunday:

  1. Reconstructing lost transactions from application logs and email receipts
  2. Manually re-processing 847 transactions
  3. Apologizing to customers
  4. Giving refunds/credits to angry clients

Monday morning, we calculated the damage:

  • $127K in lost transactions (mostly recovered)
  • $680K in lost revenue from 43 hours of downtime/degraded service
  • 47 customers canceled (representing $890K annual recurring revenue)
  • Team morale destroyed
  • CTO offered resignation (CEO refused, which was the right call)

Total cost of our “simple” database migration: ~$2M when you factor in customer churn.

What We Should Have Done: The Zero-Downtime Approach

After the disaster, we brought in a database consultant to help us rebuild our approach. Here’s what we learned:

Approach 1: Change Data Capture (CDC) with Gradual Cutover

What we should have done:

  1. Set up an AWS RDS instance in parallel with production
  2. Use CDC tools (like Debezium or AWS DMS) to continuously replicate from old to new
  3. Let replication catch up (could run for weeks if needed)
  4. Run validation continuously while both systems are live
  5. When ready, flip application traffic with zero downtime

How this solves our problems:

✅ No maintenance window needed
✅ Continuous validation of data integrity
✅ Can pause/resume migration anytime
✅ No lost transactions
✅ Can instantly roll back by flipping traffic back

Time to implement: 2-3 weeks of setup, but zero customer downtime

Approach 2: Blue-Green Database Migration

What we should have done:

  1. Create new database (Green) alongside old (Blue)
  2. Application writes to both databases (dual-write pattern; sketched below)
  3. Validate data consistency between Blue and Green
  4. When validated, flip reads to Green
  5. Monitor for issues, can instantly switch back to Blue
  6. After validation period, stop dual-writes

How this solves our problems:

✅ Zero downtime cutover
✅ Instant rollback capability
✅ No data loss
✅ Can validate thoroughly before committing

Time to implement: 1-2 weeks application changes, but zero customer downtime
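
To make the dual-write step concrete, here is a minimal sketch using psycopg2; the host, table, and column names are placeholders, not our schema:

# Dual-write: Blue (old) stays the source of truth; Green (new) gets a
# best-effort mirror write that is reconciled asynchronously if it fails.
import psycopg2

blue = psycopg2.connect(host="old-db.local", dbname="production")
green = psycopg2.connect(host="new-db.amazonaws.com", dbname="production")

INSERT_SQL = "INSERT INTO payments (id, amount_cents) VALUES (%s, %s)"

def record_payment(payment_id, amount_cents):
    # Commit to Blue first; this is the write that must succeed.
    with blue, blue.cursor() as cur:
        cur.execute(INSERT_SQL, (payment_id, amount_cents))
    # Mirror to Green; on failure, record the gap so a reconciliation
    # job can replay it before cutover.
    try:
        with green, green.cursor() as cur:
            cur.execute(INSERT_SQL, (payment_id, amount_cents))
    except Exception as exc:
        print(f"dual-write to Green failed for payment {payment_id}: {exc}")

The asymmetry is the point: a failed write to Green never blocks the customer; it only creates a gap for the reconciliation job to close before cutover.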

Approach 3: Read Replica Promotion (For Simpler Cases)

What we should have done (simplest option for PostgreSQL → RDS):

  1. Set up AWS RDS as a read replica of the on-premise database (sketched below)
  2. Let replication catch up and stabilize
  3. During low-traffic window, promote replica to standalone
  4. Update application (< 5 minute downtime)

How this solves our problems:

✅ Minimal downtime (< 5 minutes)
✅ Validation via replica lag monitoring
✅ Can roll back by pointing the app back to the old master
✅ Built-in PostgreSQL feature (no special tools)

Time to implement: 2-3 days, < 5 minutes customer downtime
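
A note on mechanics: RDS for PostgreSQL can't act as a physical standby of an external server, so in practice this approach means built-in logical replication into RDS. A rough sketch, assuming the schema has already been loaded on the target (pg_dump --schema-only) and using placeholder hosts, roles, and publication names:

# Source (on-premise): publish every table.
import psycopg2

src = psycopg2.connect(host="old-db.local", dbname="production")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION migration_pub FOR ALL TABLES")

# Target (RDS): subscribing copies the existing rows, then streams changes.
dst = psycopg2.connect(host="new-db.amazonaws.com", dbname="production")
dst.autocommit = True   # CREATE SUBSCRIPTION cannot run inside a transaction
with dst.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION migration_sub
        CONNECTION 'host=old-db.local dbname=production user=replicator'
        PUBLICATION migration_pub
    """)

# On the source: replication lag per subscriber, in bytes of WAL.
with src.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    print(cur.fetchall())

The "promotion" in step 3 then amounts to stopping writes on the source, waiting for the lag query to hit zero, dropping the subscription, and repointing the application.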

What We Actually Implemented: CDC with Gradual Cutover

Three months later, we finally completed our database migration. Here’s how:

Week 1: Set Up CDC Infrastructure

# AWS DMS Configuration
source_endpoint:
  engine: postgres
  server: old-db.local
  port: 5432
  database: production

target_endpoint:
  engine: postgres  
  server: new-db.amazonaws.com
  database: production

replication_task:
  migration_type: full-load-and-cdc
  table_mappings:
    - schema: public
      table: "*"
  validation: 
    enabled: true
    failure_handling: suspend

We set up AWS DMS to:

  1. Do initial full load (took 6 hours, didn’t affect production)
  2. Start continuous CDC replication
  3. Monitor replication lag continuously

Week 2: Validation and Monitoring

We built comprehensive validation:

# Continuous data validation. query_old_db, query_new_db, and get_indexes are
# thin wrappers around psycopg2 connections to the old and new databases.
import time

def validate_databases():
    tables = [
        'users', 'transactions', 'accounts', 
        'audit_logs', 'settings', 'payments'
    ]
    
    for table in tables:
        # Row counts
        old_count = query_old_db(f"SELECT COUNT(*) FROM {table}")
        new_count = query_new_db(f"SELECT COUNT(*) FROM {table}")
        assert old_count == new_count, f"{table} counts don't match"
        
        # Sample data validation  
        old_sample = query_old_db(f"SELECT * FROM {table} ORDER BY id DESC LIMIT 1000")
        new_sample = query_new_db(f"SELECT * FROM {table} ORDER BY id DESC LIMIT 1000")
        assert old_sample == new_sample, f"{table} data doesn't match"
        
        # Check for missing indexes
        old_indexes = get_indexes(old_db, table)
        new_indexes = get_indexes(new_db, table)
        assert old_indexes == new_indexes, f"{table} indexes don't match"

# Run every 5 minutes
while True:
    validate_databases()
    check_replication_lag()  # Alert if > 5 seconds (sketched below)
    time.sleep(300)
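
The check_replication_lag() helper can be backed by the DMS latency metrics in CloudWatch. A sketch, assuming the CDCLatencyTarget metric and the ReplicationTaskIdentifier/ReplicationInstanceIdentifier dimensions (verify the exact names against the current DMS documentation) and placeholder task and instance identifiers:

# Alert when AWS DMS CDC latency exceeds a threshold.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def check_replication_lag(task_id="our-dms-task",
                          instance_id="our-dms-instance",
                          threshold_seconds=5):
    now = datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName="CDCLatencyTarget",   # seconds the target lags behind
        Dimensions=[
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
        ],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    worst = max((p["Maximum"] for p in resp["Datapoints"]), default=0)
    if worst > threshold_seconds:
        raise RuntimeError(
            f"DMS replication lag {worst}s exceeds {threshold_seconds}s")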

This validation ran continuously for 2 weeks. We caught and fixed:

  • 3 missing indexes
  • 2 column type mismatches
  • 1 foreign key constraint issue

By the time we were ready to cut over, we had 100% confidence that the data was identical.

Week 3: The 5-Minute Cutover

During a low-traffic window (Tuesday 3 AM), we did the cutover:

# 1. Put application in read-only mode (30 seconds downtime)
# 2. Wait for replication lag to hit zero (2 minutes)
# 3. Stop DMS replication
# 4. Update application config to point to new database
# 5. Restart application (90 seconds)
# 6. Monitor for 30 minutes

# Total downtime: 4 minutes 30 seconds

Success metrics:

  • ⏱️ Downtime: 4.5 minutes (not 43 hours)
  • 💰 Revenue lost: $0
  • 😡 Customer complaints: 0
  • ✅ Data integrity: 100%

The Real Lessons

Lesson 1: Zero-Downtime Isn’t Optional

Old thinking: “We can take 6 hours of downtime for critical migrations.”

Reality: Six hours of downtime cost us $800K in a SaaS business. Zero-downtime migrations aren’t a nice-to-have—they’re mandatory for business continuity.

Lesson 2: Test With Production-Scale Data

Old thinking: “Staging is good enough for testing.”

Reality: Staging doesn’t reflect production complexity. Performance issues, data integrity problems, and edge cases only appear at production scale.

Now we:

  • Test migrations with production-scale data (anonymized clones)
  • Load test at 2x peak production traffic
  • Practice rollback procedures under stress
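
The "anonymized clones" part sounds heavier than it is: the scrub can be a handful of UPDATE statements run on the restored copy before anyone touches it. A sketch with illustrative host, table, and column names:

# Scrub personally identifiable columns on a restored production clone
# before handing it to staging.
import psycopg2

SCRUB_STATEMENTS = [
    """UPDATE users
       SET email = 'user_' || id || '@example.test',
           full_name = 'Customer ' || id""",
    "UPDATE payments SET card_last4 = '0000'",
]

clone = psycopg2.connect(host="staging-clone.internal", dbname="production_clone")
with clone, clone.cursor() as cur:   # one transaction for the whole scrub
    for stmt in SCRUB_STATEMENTS:
        cur.execute(stmt)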

Lesson 3: Validation Is Everything

Old thinking: “Check row counts, we’re good.”

Reality: Data can be subtly corrupted in ways row counts don’t catch.

Now we validate:

  • ✅ Row counts (table-level)
  • ✅ Column data types
  • ✅ Indexes (names, types, uniqueness)
  • ✅ Constraints (foreign keys, checks, defaults)
  • ✅ Sequences (for auto-increment fields)
  • ✅ Triggers and functions
  • ✅ Sample data comparison (hash-based)
  • ✅ Query performance benchmarks
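
The hash-based sample comparison is the check most teams skip because it sounds expensive; it doesn't have to be. A sketch that compares a primary-key range by digest so only two hashes cross the network (psycopg2 connections are assumed; table and key names are illustrative):

# Each side hashes the same primary-key range in the same order; only the
# resulting 32-character digests are compared.
CHUNK_DIGEST_SQL = """
    SELECT md5(string_agg(md5(t::text), ',' ORDER BY t.id))
    FROM (SELECT * FROM {table} WHERE id BETWEEN %s AND %s) AS t
"""

def chunk_digest(conn, table, lo, hi):
    with conn.cursor() as cur:
        cur.execute(CHUNK_DIGEST_SQL.format(table=table), (lo, hi))
        return cur.fetchone()[0]

def compare_chunk(old_conn, new_conn, table, lo, hi):
    if chunk_digest(old_conn, table, lo, hi) != chunk_digest(new_conn, table, lo, hi):
        raise AssertionError(f"{table} rows {lo}-{hi} differ between databases")

A mismatch tells you which chunk diverged; halving the range a few times pinpoints the offending rows.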

Lesson 4: Rollback Plans Must Account for New Data

Old thinking: “Rollback = point application back to old database.”

Reality: If you’ve processed transactions on the new database, rolling back loses that data.

Now our rollback plans include:

  • Transaction log preservation
  • Application-level transaction queuing
  • Data reconciliation procedures
  • Customer communication templates
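
For the first two items, a simplified sketch of what "preserve and queue" can mean at the application level; the journal path and write_to_database() are placeholders for the application's real write path:

# During a cutover window, every write is also appended to a durable local
# journal, so that after a rollback we can replay anything the old primary
# never saw instead of reconstructing it from emails and access logs.
import json
import os
import time

JOURNAL_PATH = "/var/log/app/txn-journal.ndjson"   # placeholder

def write_to_database(txn: dict) -> None:
    """Placeholder for the application's real, idempotent write path."""
    raise NotImplementedError

def record_transaction(txn: dict) -> None:
    txn["recorded_at"] = time.time()
    with open(JOURNAL_PATH, "a") as fh:
        fh.write(json.dumps(txn) + "\n")
        fh.flush()
        os.fsync(fh.fileno())           # journal entry survives a crash
    write_to_database(txn)

def replay_since(cutover_started_at: float) -> None:
    """After rolling back, re-apply journal entries the old primary missed."""
    with open(JOURNAL_PATH) as fh:
        for line in fh:
            txn = json.loads(line)
            if txn["recorded_at"] >= cutover_started_at:
                write_to_database(txn)  # idempotency (e.g. upsert by id) matters here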

Lesson 5: Communication Is As Important As Technology

Old thinking: “Tell customers we’ll be down 6 hours, extend if needed.”

Reality: Poor communication destroyed customer trust and led to churn.

Now we:

  • Under-promise, over-deliver on timelines
  • Communicate proactively when things go wrong
  • Provide detailed status updates every 30 minutes during migrations
  • Offer compensation proactively when we miss SLAs

The Current State: Continuous Migration Capability

18 months later, we’ve done 23 more database migrations using our CDC-based approach. Zero customer-impacting incidents.

We can now:

  • Migrate databases with < 5 minutes downtime
  • Validate continuously before committing
  • Roll back instantly if issues arise
  • Run migrations during business hours (not just weekends)

The cost of that disaster: ~$2M in direct costs + probably another $1M in lost trust and opportunity cost.

The value of learning from it: Priceless. We now have a migration capability that’s a competitive advantage.

For the complete framework and patterns we use now, check out the Zero-Downtime Database Migration Guide on CrashBytes.

The bottom line: You can learn from our expensive mistakes, or you can make them yourself. We highly recommend the former.


Have you lived through a database migration disaster? I’d love to hear your war stories. Reach out at michael@michaeleakins.com or share in the comments.