Building Production MLOps: The Pipeline That Survived 47M Predictions/Day

How we built an MLOps pipeline processing 47M daily predictions, the automated retraining that saved us, and why our model deployment time dropped from 3 weeks to 4 hours.

The Crisis: When Manual ML Ops Breaks Down

August 2024, 2:47 AM. Our fraud detection model stopped working.

Not crashed. Not erroring. Just… wrong.

  • Fraud detection rate: Dropped from 87% to 34% overnight
  • False positives: Spiked 340%
  • Legitimate transactions blocked: 12,000+, leaving customers furious
  • Customer support: Melting down
  • Estimated loss: $2.3M in a single day

Root cause: Our model was trained on pre-pandemic purchase patterns. The world changed. Our model didn’t.

The painful truth: We had no automated retraining. No drift detection. No monitoring. Our “MLOps” was:

1. Data scientist trains model on laptop
2. Model exported to S3
3. DevOps manually deploys to production
4. Hope it keeps working
5. (Repeat when it breaks)

Last retrained: 18 months ago.

After reading the MLOps pipeline tutorial, I realized we needed industrial-grade ML infrastructure, not duct tape and prayers.

My VP’s mandate: “Build a real MLOps platform. You have 4 months.”

Spoiler: We built it in 3 months and now process 47 million predictions per day flawlessly.

Phase 1: Understanding What We Actually Needed (Weeks 1-2)

Before touching Kubeflow, we mapped our requirements.

The ML Lifecycle Pain Points

Training:

  • ❌ Manual Jupyter notebook execution
  • ❌ No experiment tracking (which hyperparameters worked?)
  • ❌ Inconsistent Python environments
  • ❌ No reproducibility (couldn’t recreate results)

Validation:

  • ❌ Manual model evaluation
  • ❌ No A/B testing framework
  • ❌ No performance thresholds
  • ❌ No bias detection

Deployment:

  • ❌ Manual kubectl commands
  • ❌ No canary deployments
  • ❌ No automated rollback
  • ❌ Deployment took 2-3 weeks

Monitoring:

  • ❌ No drift detection
  • ❌ No performance alerts
  • ❌ No model lineage tracking
  • ❌ Guesswork for retraining triggers

The System We Designed

┌──────────────────────────────────────────────────────┐
│              Data Pipeline (Airflow)                  │
│  ┌────────┐  ┌────────┐  ┌────────┐                 │
│  │ Ingest │→ │Transform│→ │Validate│                 │
│  └────────┘  └────────┘  └────────┘                 │
└────────────────┬─────────────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│         Training Pipeline (Kubeflow)                  │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐    │
│  │ Train  │→ │Validate│→ │Register│→ │ Deploy │    │
│  └────────┘  └────────┘  └────────┘  └────────┘    │
│       ↓          ↓           ↓           ↓          │
│   [MLflow]  [Metrics]  [Registry]  [KServe]        │
└──────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│         Production Serving (KServe)                   │
│  ┌────────┐  ┌────────┐  ┌────────┐                │
│  │ Model A│  │ Model B│  │ Model C│                │
│  │ (90%)  │  │ (10%)  │  │(Shadow) │                │
│  └────────┘  └────────┘  └────────┘                │
└──────────────┬───────────────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│         Monitoring (Prometheus + Custom)              │
│  ┌────────┐  ┌────────┐  ┌────────┐                │
│  │ Drift  │  │Perf Mon│  │ Alerts │                │
│  └────────┘  └────────┘  └────────┘                │
└──────────────────────────────────────────────────────┘
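
The Airflow layer at the top hands validated data off to the training pipeline. A simplified sketch of that handoff (illustrative only: the task names and the trigger_training body are placeholders, not our production DAG):

# data_pipeline_dag.py (illustrative sketch)
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(): ...            # pull raw transactions into the lake
def transform(): ...         # feature engineering
def validate(): ...          # schema + volume checks
def trigger_training(): ...  # kick off the Kubeflow training pipeline (see Phase 3)

with DAG(
    dag_id="fraud_data_pipeline",
    start_date=datetime(2024, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    trigger_task = PythonOperator(task_id="trigger_training", python_callable=trigger_training)

    ingest_task >> transform_task >> validate_task >> trigger_task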

Phase 2: Building the Foundation (Weeks 3-6)

Kubeflow Installation Hell

Day 1: “This will be easy, right?”

# Naive attempt
kubectl apply -k "github.com/kubeflow/manifests/example?ref=v1.8.0"

Result: 47 pods failing, 23 CRDs conflicting, Istio not working.

Day 5: After reading 200+ GitHub issues, we figured it out:

# What actually worked
# 1. Prerequisites
kubectl create namespace kubeflow
kubectl create namespace cert-manager

# 2. Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml

# 3. Wait for cert-manager to be ready
kubectl wait --for=condition=Available --timeout=600s \
  deployment/cert-manager -n cert-manager

# 4. Install Kubeflow iteratively (not all at once!)
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
kustomize build common/kubeflow-roles/base | kubectl apply -f -
kustomize build common/istio-1-17/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-17/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-17/istio-install/base | kubectl apply -f -

# ... (repeat for all 25+ components)

# 5. Final verification
kubectl get pods -n kubeflow

Time to working Kubeflow: 5 days (should be 30 minutes with better docs).

MLflow for Experiment Tracking

We needed centralized experiment tracking across 8 data scientists.

# mlflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:v2.9.0
        command:
        - mlflow
        - server
        - --host
        - 0.0.0.0
        - --port
        - "5000"
        - --backend-store-uri
        - postgresql://mlflow:$(DB_PASSWORD)@postgres:5432/mlflow
        - --default-artifact-root
        - s3://ml-artifacts/
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-key
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: password
        ports:
        - containerPort: 5000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"

Key decision: PostgreSQL for metadata, S3 for artifacts. Don’t use SQLite in production!
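
With the server up, pointing a training script at it is one line. A minimal sketch of a run from the data scientist's side (parameter and metric values here are illustrative):

# example_run.py (illustrative)
import mlflow

mlflow.set_tracking_uri("http://mlflow.mlops:5000")   # the Service fronting the Deployment above
mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    # ... train the model ...
    mlflow.log_metric("auc", 0.91)   # metadata lands in Postgres; artifacts land in s3://ml-artifacts/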

Phase 3: The Training Pipeline (Weeks 7-10)

Our First Kubeflow Pipeline

# training_pipeline.py
from kfp import dsl
from kfp.dsl import component, pipeline

@component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas==2.1.0", "scikit-learn==1.3.0"]
)
def load_and_validate_data(
    data_path: str,
    min_samples: int = 10000
) -> dict:
    """Load and validate training data"""
    import pandas as pd
    
    # Load data
    df = pd.read_csv(data_path)
    
    # Validation checks (cast to plain Python types so KFP can serialize the output)
    checks = {
        "sample_count": int(len(df)),
        "has_nulls": bool(df.isnull().sum().sum() > 0),
        "class_balance": float(abs(df['label'].mean() - 0.5)),
    }
    
    # Fail if validation fails
    if checks["sample_count"] < min_samples:
        raise ValueError(f"Insufficient samples: {checks['sample_count']}")
    
    if checks["has_nulls"]:
        raise ValueError("Data contains null values")
    
    return checks

@component(
    base_image="tensorflow/tensorflow:2.14.0-gpu",
    packages_to_install=["mlflow==2.9.0", "pandas==2.1.0"]
)
def train_model(
    data_path: str,
    epochs: int,
    learning_rate: float,
    mlflow_tracking_uri: str
) -> dict:
    """Train fraud detection model"""
    import mlflow
    import tensorflow as tf
    from tensorflow import keras
    import json
    
    # Configure MLflow
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment("fraud-detection")
    
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params({
            "epochs": epochs,
            "learning_rate": learning_rate,
            "optimizer": "adam"
        })
        
        # Load data: CSV with a binary `label` column, split 80/20 into train/validation
        import pandas as pd
        df = pd.read_csv(data_path)
        labels = df.pop('label').values
        features = df.values
        n_val = int(0.2 * len(features))
        val_data = tf.data.Dataset.from_tensor_slices(
            (features[:n_val], labels[:n_val])).batch(256)
        train_data = tf.data.Dataset.from_tensor_slices(
            (features[n_val:], labels[n_val:])).batch(256)
        
        # Build model
        model = keras.Sequential([
            keras.layers.Dense(128, activation='relu'),
            keras.layers.Dropout(0.3),
            keras.layers.Dense(64, activation='relu'),
            keras.layers.Dropout(0.3),
            keras.layers.Dense(32, activation='relu'),
            keras.layers.Dense(1, activation='sigmoid')
        ])
        
        model.compile(
            optimizer=keras.optimizers.Adam(learning_rate),
            loss='binary_crossentropy',
            metrics=['accuracy', 'AUC', 'Precision', 'Recall']
        )
        
        # Train
        history = model.fit(
            train_data,
            validation_data=val_data,
            epochs=epochs,
            callbacks=[
                keras.callbacks.EarlyStopping(patience=5),
                keras.callbacks.ReduceLROnPlateau(patience=3)
            ]
        )
        
        # Log metrics
        final_metrics = {
            "accuracy": float(history.history['val_accuracy'][-1]),
            "auc": float(history.history['val_auc'][-1]),
            "precision": float(history.history['val_precision'][-1]),
            "recall": float(history.history['val_recall'][-1])
        }
        
        for metric, value in final_metrics.items():
            mlflow.log_metric(metric, value)
        
        # Save model
        mlflow.tensorflow.log_model(model, "model")
        
        return final_metrics

@component(base_image="python:3.11-slim")
def validate_model_performance(
    metrics: dict,
    min_accuracy: float = 0.85,
    min_auc: float = 0.90
) -> bool:
    """Validate model meets performance requirements"""
    if metrics["accuracy"] < min_accuracy:
        raise ValueError(f"Accuracy {metrics['accuracy']} below threshold {min_accuracy}")
    
    if metrics["auc"] < min_auc:
        raise ValueError(f"AUC {metrics['auc']} below threshold {min_auc}")
    
    return True

@pipeline(
    name="fraud-detection-training",
    description="End-to-end fraud detection model training"
)
def fraud_training_pipeline(
    data_path: str,
    mlflow_uri: str = "http://mlflow.mlops:5000",
    epochs: int = 50,
    learning_rate: float = 0.001
):
    """
    Complete training pipeline with validation gates
    """
    # Step 1: Validate data
    validation = load_and_validate_data(data_path=data_path)
    
    # Step 2: Train model
    training = train_model(
        data_path=data_path,
        epochs=epochs,
        learning_rate=learning_rate,
        mlflow_tracking_uri=mlflow_uri
    ).after(validation)  # only train once data validation has passed
    
    # Step 3: Validate performance
    performance_check = validate_model_performance(
        metrics=training.output
    )
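
To actually run this, we compile the decorated pipeline and submit it to the cluster. A minimal sketch (the Kubeflow Pipelines host below is an example; use the endpoint exposed by your Kubeflow ingress):

# submit_pipeline.py (illustrative)
import kfp
from kfp import compiler
from training_pipeline import fraud_training_pipeline

# Compile the @pipeline function into a package KFP can execute
compiler.Compiler().compile(
    pipeline_func=fraud_training_pipeline,
    package_path="fraud_training_pipeline.yaml",
)

# Submit a run
client = kfp.Client(host="http://ml-pipeline-ui.kubeflow")  # example endpoint
client.create_run_from_pipeline_package(
    "fraud_training_pipeline.yaml",
    arguments={"data_path": "s3://training-data/latest.csv"},
)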

Automated Hyperparameter Tuning with Katib

Problem: Manually trying different hyperparameters wastes time.

Solution: Katib automated hyperparameter optimization.

# katib-experiment.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: fraud-detection-hpo
  namespace: kubeflow-user
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: auc
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 4
  maxTrialCount: 20
  maxFailedTrialCount: 3
  
  parameters:
  - name: learning-rate
    parameterType: double
    feasibleSpace:
      min: "0.0001"
      max: "0.01"
  
  - name: dropout-rate
    parameterType: double
    feasibleSpace:
      min: "0.1"
      max: "0.5"
  
  - name: hidden-layer-size
    parameterType: int
    feasibleSpace:
      min: "64"
      max: "256"
  
  trialTemplate:
    primaryContainerName: training
    trialParameters:
    - name: learningRate
      reference: learning-rate
    - name: dropoutRate
      reference: dropout-rate
    - name: hiddenLayerSize
      reference: hidden-layer-size
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - name: training
              image: fraud-training:latest
              command:
              - python
              - train.py
              - --learning-rate=${trialParameters.learningRate}
              - --dropout-rate=${trialParameters.dropoutRate}
              - --hidden-layer-size=${trialParameters.hiddenLayerSize}
            restartPolicy: Never

Result: Found optimal hyperparameters in 2 hours vs. 2 weeks of manual tuning.
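
The trial Job invokes train.py with the sampled values. A rough sketch of that entrypoint (the training itself is elided; what Katib cares about is printing the objective metric in a form its default StdOut metrics collector can parse, matching objectiveMetricName above):

# train.py (illustrative Katib trial entrypoint)
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, required=True)
    parser.add_argument("--dropout-rate", type=float, required=True)
    parser.add_argument("--hidden-layer-size", type=int, required=True)
    args = parser.parse_args()

    # ... build and train the model with args.learning_rate, args.dropout_rate, args.hidden_layer_size ...
    auc = 0.0  # placeholder: set to the validation AUC after training

    # Katib's StdOut metrics collector parses "name=value" lines
    print(f"auc={auc}")

if __name__ == "__main__":
    main()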

Phase 4: Production Serving with KServe (Weeks 11-12)

The Canary Deployment Pattern

# fraud-model-inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: production
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 20
    scaleTarget: 70  # Target 70% CPU utilization
    scaleMetric: cpu
    
    # Canary deployment: 90% stable, 10% canary
    canaryTrafficPercent: 10
    
    tensorflow:
      storageUri: "s3://ml-models/fraud-detector/v2.3.0"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: "8Gi"
          nvidia.com/gpu: "1"
      
      # Model configuration
      env:
      - name: MLFLOW_TRACKING_URI
        value: "http://mlflow.mlops:5000"
      - name: MODEL_VERSION
        value: "2.3.0"

    # Per-replica request concurrency target (a predictor-level field,
    # like the autoscaling settings above)
    containerConcurrency: 10
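
Clients hit the predictor over KServe's TensorFlow-compatible V1 REST API. A minimal sketch (URL and feature vector are illustrative; in practice we read the endpoint from the InferenceService status):

# call_fraud_detector.py (illustrative)
import requests

url = "http://fraud-detector.production.example.com/v1/models/fraud-detector:predict"
payload = {"instances": [[0.12, 0.87, 1.0, 0.0, 42.5]]}

resp = requests.post(url, json=payload, timeout=1.0)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [[0.03]]} -> fraud probability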

Automated Model Promotion

# model_promoter.py
from kserve import KServeClient
from mlflow.tracking import MlflowClient
import time

class ModelPromoter:
    def __init__(self):
        self.kserve = KServeClient()
        self.mlflow = MlflowClient()
    
    def promote_if_better(self, model_name: str, candidate_version: str):
        """
        Automatically promote canary to production if metrics are better
        """
        # Get current production metrics
        prod_metrics = self.get_production_metrics(model_name)
        
        # Get canary metrics
        canary_metrics = self.get_canary_metrics(model_name)
        
        # Compare
        if self.is_better(canary_metrics, prod_metrics):
            print(f"✅ Canary outperforms production. Promoting...")
            self.promote_to_production(model_name, candidate_version)
        else:
            print(f"❌ Canary underperforms. Rolling back...")
            self.rollback_canary(model_name)
    
    def is_better(self, canary, prod):
        """Check if canary is better than production"""
        return (
            canary['accuracy'] > prod['accuracy'] and
            canary['latency_p99'] < prod['latency_p99'] * 1.1 and  # Within 10%
            canary['error_rate'] < prod['error_rate']
        )
    
    def promote_to_production(self, model_name, version):
        """Gradually shift traffic to canary"""
        # 10% → 25% → 50% → 75% → 100%
        for traffic_percent in [25, 50, 75, 100]:
            self.kserve.patch(
                model_name,
                {"spec": {"predictor": {"canaryTrafficPercent": traffic_percent}}},
                namespace="production"
            )
            
            print(f"Shifted {traffic_percent}% traffic to new model")
            
            # Wait 10 minutes, monitor metrics
            time.sleep(600)
            
            # Check for issues
            if self.has_issues(model_name):
                print(f"Issues detected! Rolling back...")
                self.rollback_canary(model_name)
                return
        
        print(f"🎉 Successfully promoted {model_name} v{version} to production!")
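
We run the promoter from a scheduled job once a canary has accumulated enough traffic to compare. Illustrative usage (the version string is an example):

promoter = ModelPromoter()
promoter.promote_if_better("fraud-detector", candidate_version="2.3.0")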

Phase 5: Monitoring & Drift Detection (Ongoing)

Custom Drift Detection

Problem: Models degrade over time as data distribution shifts.

Solution: Continuous drift monitoring.

# drift_detector.py
import numpy as np
from scipy import stats
from prometheus_client import Gauge, Counter

# Prometheus metrics
drift_score = Gauge('model_drift_score', 'Statistical drift score', ['model_name'])
drift_alert = Counter('model_drift_alert', 'Drift alert triggered', ['model_name'])

class DriftDetector:
    def __init__(self, model_name, baseline_data):
        self.model_name = model_name
        self.baseline_data = np.asarray(baseline_data)  # keep raw samples for the KS tests
        self.baseline_mean = np.mean(self.baseline_data, axis=0)
        self.baseline_std = np.std(self.baseline_data, axis=0)
        
    def detect_drift(self, production_data):
        """
        Detect drift using Kolmogorov-Smirnov test
        """
        # Calculate statistics on production data
        prod_mean = np.mean(production_data, axis=0)
        prod_std = np.std(production_data, axis=0)
        
        # KS test for each feature
        drift_scores = []
        for i in range(len(self.baseline_mean)):
            baseline_feature = self.baseline_data[:, i]
            prod_feature = production_data[:, i]
            
            # Kolmogorov-Smirnov test
            statistic, p_value = stats.ks_2samp(baseline_feature, prod_feature)
            
            drift_scores.append({
                'feature': i,
                'statistic': statistic,
                'p_value': p_value,
                'drift_detected': p_value < 0.05
            })
        
        # Calculate overall drift score
        overall_drift = np.mean([s['statistic'] for s in drift_scores])
        
        # Update Prometheus metrics
        drift_score.labels(model_name=self.model_name).set(overall_drift)
        
        # Alert if significant drift
        if overall_drift > 0.3:  # Threshold
            drift_alert.labels(model_name=self.model_name).inc()
            self.trigger_retraining()
        
        return drift_scores
    
    def trigger_retraining(self):
        """Automatically trigger retraining pipeline"""
        # Call the Kubeflow Pipelines API (pipeline defined in training_pipeline.py)
        import kfp
        from training_pipeline import fraud_training_pipeline

        client = kfp.Client()
        
        client.create_run_from_pipeline_func(
            fraud_training_pipeline,
            arguments={
                "data_path": "s3://training-data/latest.csv",
                "epochs": 50
            }
        )
        
        print(f"🔄 Automatic retraining triggered for {self.model_name}")
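
The detector runs as a scheduled job over a sliding window of recent inference inputs. Illustrative usage (the feature files are placeholders for however you snapshot baseline and production features):

import numpy as np

baseline = np.load("baseline_features.npy")   # features the current model was trained on
recent = np.load("last_24h_features.npy")     # features observed in production

detector = DriftDetector("fraud-detector", baseline)
scores = detector.detect_drift(recent)
drifted = [s for s in scores if s["drift_detected"]]
print(f"{len(drifted)}/{len(scores)} features show significant drift")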

Monitoring Dashboard

We built comprehensive Grafana dashboards:

# grafana-dashboard.json (simplified)
{
  "dashboard": {
    "title": "ML Model Performance",
    "panels": [
      {
        "title": "Prediction Latency (p99)",
        "targets": [{
          "expr": "histogram_quantile(0.99, rate(model_prediction_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Prediction Rate",
        "targets": [{
          "expr": "rate(model_predictions_total[5m])"
        }]
      },
      {
        "title": "Model Accuracy (Live)",
        "targets": [{
          "expr": "model_accuracy"
        }]
      },
      {
        "title": "Drift Score",
        "targets": [{
          "expr": "model_drift_score"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(model_prediction_errors_total[5m])"
        }]
      }
    ]
  }
}
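
The PromQL above assumes the serving path exports these metrics (model_accuracy and model_drift_score come from the evaluation and drift jobs). A sketch of the request-path instrumentation with prometheus_client (metric names match the dashboard; the wrapper itself is illustrative):

# metrics.py (illustrative instrumentation around the prediction path)
import time
from prometheus_client import Counter, Histogram

PREDICTIONS = Counter("model_predictions_total", "Predictions served", ["model_name"])
ERRORS = Counter("model_prediction_errors_total", "Failed predictions", ["model_name"])
LATENCY = Histogram("model_prediction_duration_seconds", "Prediction latency", ["model_name"])

def predict_with_metrics(model_name, predict_fn, features):
    """Wrap a model's predict call with the counters/histogram the dashboard reads."""
    start = time.perf_counter()
    try:
        result = predict_fn(features)
        PREDICTIONS.labels(model_name=model_name).inc()
        return result
    except Exception:
        ERRORS.labels(model_name=model_name).inc()
        raise
    finally:
        LATENCY.labels(model_name=model_name).observe(time.perf_counter() - start)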

The Results: 3 Months in Production

Performance Metrics

Prediction Volume:

  • 47 million predictions per day
  • Peak: 3,200 predictions/second
  • Average latency: 8ms (p99: 23ms)

Model Deployment:

  • Before: 2-3 weeks (manual)
  • After: 4 hours (automated)
  • Improvement: 99% faster

Retraining Frequency:

  • Before: Every 18 months (manual)
  • After: Weekly (automated when drift detected)
  • Improvement: 78x more frequent

Reliability

Model Performance Degradation:

  • Before: Undetected until catastrophic failure
  • After: Detected within 2 hours, auto-retraining triggered
  • Incidents prevented: 8 in 3 months

Deployment Success Rate:

  • Before: 67% (manual process)
  • After: 98.5% (automated with rollback)
  • Improvement: 47% increase

Business Impact

Fraud Detection:

  • Detection rate: Maintained >87% (vs. 34% during the crisis)
  • False positives: Down 72%
  • Customer complaints: Down 89%

Development Velocity:

  • Model iterations per month: 3 → 24 (8x faster)
  • Experiment tracking: 0 → 240+ experiments logged
  • Data scientist productivity: Up 340%

Cost Efficiency:

  • Infrastructure: $14K/month (optimized autoscaling)
  • Prevented losses: $2.3M+ (avoided another crisis)
  • ROI: 16,000% in first year

Lessons for Teams Building MLOps

✅ What Worked

  1. Start with manual process - Automate what you understand
  2. Iterate quickly - Don’t build everything at once
  3. Monitor from day one - Can’t improve what you don’t measure
  4. Automate retraining - Models degrade, automation saves you
  5. Canary deployments - Always have rollback capability
  6. Focus on drift detection - Prevents catastrophic failures

❌ What Failed

  1. Over-engineering - We tried to build “perfect” pipeline first (wasted 2 weeks)
  2. Ignoring ops - Data scientists alone can’t run production ML
  3. No monitoring - Flew blind for first month (scary!)
  4. Manual deployments - Error-prone and slow
  5. No experiment tracking - Couldn’t reproduce results

Critical Success Factors

If you’re building MLOps:

  1. Get Kubernetes expertise - MLOps lives on K8s
  2. Invest in monitoring - Drift detection is non-negotiable
  3. Automate everything - Manual processes don’t scale
  4. Start simple - Single model, single pipeline, iterate
  5. Cross-functional team - DS + ML Eng + DevOps + SRE

What’s Next?

We’re expanding our ML platform:

  1. Multi-model serving - A/B test 10+ models simultaneously
  2. Feature store - Centralized feature management (Feast)
  3. Model explainability - SHAP/LIME integration for interpretability
  4. Federated learning - Train on decentralized data
  5. AutoML - Automated architecture search with NAS

Building production MLOps transformed our ML organization from “science experiments” to “reliable software systems.” 47 million daily predictions with <1% error rate proves it works.

For the complete implementation guide, see the detailed MLOps pipeline tutorial with code examples and architecture diagrams.


Building MLOps infrastructure? Connect on LinkedIn or share your MLOps journey on Twitter.