The Audit That Changed Everything
March 2024. FDA audit notification arrives. Our team has 45 days to demonstrate AI governance for 47 production ML models in our healthcare platform.
Problem: We had no formal governance framework. Models were deployed ad-hoc. No centralized model registry. Compliance documentation scattered across Google Docs and Slack threads.
Stakes: $2.3M in potential fines. Possible shutdown of AI features serving 2.4 million patients.
Time pressure: 45 days to build governance from scratch.
After reading about AI governance frameworks for regulated industries, I realized we needed more than technical controls - we needed organizational transformation.
This is the story of how we built governance that worked, the mistakes that almost killed us, and the patterns that saved our AI program.
Phase 1: The “Compliance Theater” Mistake (Weeks 1-3)
My first instinct was wrong: create a massive policy document and declare victory.
Our First Framework: Beautiful and Useless
I spent 2 weeks creating a 127-page “AI Governance Framework” document covering:
- Model development lifecycle
- Risk assessment matrices
- Approval workflows
- Monitoring requirements
- Incident response procedures
Result: Zero adoption. Engineers ignored it. Compliance team couldn’t understand it.
Why it failed:
- Too complex: Nobody knew where to start
- Not actionable: Vague requirements like “ensure model fairness”
- No automation: Everything required manual process
- Disconnected from reality: Designed for how we should work, not how we actually work
The audit was in 3 weeks. We needed to pivot.
Phase 2: The Minimum Viable Governance (Weeks 4-6)
I threw away the 127-page document. We started with three questions:
- Can we find every model in production?
- Can we prove each model was validated?
- Can we show we’re monitoring for issues?
Building the Model Registry (48 Hours)
We built a dead-simple registry in Airtable:
// Model registration webhook (Express + Airtable)
const express = require('express');
const Airtable = require('airtable');

const app = express();
app.use(express.json());

// Airtable base that holds the "Model Registry" table
const airtable = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
  .base(process.env.AIRTABLE_BASE_ID);

app.post('/api/models/register', async (req, res) => {
  const {
    modelName,
    version,
    purpose,
    trainingData,
    validationMetrics,
    owner,
    riskLevel
  } = req.body;

  // Create registry entry
  const record = await airtable('Model Registry').create({
    'Model Name': modelName,
    'Version': version,
    'Purpose': purpose,
    'Training Data Source': trainingData,
    'Validation Accuracy': validationMetrics.accuracy,
    'Validation Date': new Date().toISOString(),
    'Owner': owner,
    'Risk Level': riskLevel,
    'Status': 'Pending Approval',
    'Created At': new Date().toISOString()
  });

  // Auto-trigger approval workflow based on risk
  // (triggerApprovalWorkflow notifies the listed approvers -- defined elsewhere)
  if (riskLevel === 'HIGH') {
    await triggerApprovalWorkflow(record.id, ['medical-director', 'compliance-lead']);
  } else if (riskLevel === 'MEDIUM') {
    await triggerApprovalWorkflow(record.id, ['technical-lead']);
  }

  res.json({ registryId: record.id });
});
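From the training pipeline's side, registration was a single HTTP call. A minimal sketch in Python (the endpoint and field names match the webhook above; the host and metric values are illustrative, not our real ones):

# Register a newly validated model with the governance registry (illustrative values)
import requests

payload = {
    "modelName": "sepsis-risk-predictor",
    "version": "2.3.0",
    "purpose": "ICU sepsis risk scoring (clinical decision support)",
    "trainingData": "MIMIC-III ICU admissions, 2012-2016",
    "validationMetrics": {"accuracy": 0.84},
    "owner": "clinical-ml-team",
    "riskLevel": "HIGH",
}

resp = requests.post("https://internal.example.com/api/models/register", json=payload, timeout=10)
resp.raise_for_status()
print("Registry ID:", resp.json()["registryId"])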
Key insight: We didn’t build perfect tooling. We built minimal tooling that integrated with existing workflows.
Retroactive Model Documentation (Week 5)
We had 47 models deployed. Zero had proper documentation.
We created a “Model Card” template and ran a 3-day sprint where every team documented their models:
# Model Card: Sepsis Risk Predictor v2.3
## Intended Use
Predict sepsis risk in ICU patients based on vital signs and lab values.
Assists (not replaces) clinical decision-making.
## Training Data
- Dataset: MIMIC-III ICU database (40,000 admissions)
- Time period: 2012-2016
- Demographic distribution:
- Age: Mean 65 (SD 18), Range 18-95
- Gender: 54% male, 46% female
- Race: 73% white, 16% black, 7% Hispanic, 4% other/unknown
## Model Details
- Architecture: XGBoost gradient boosting
- Input features: 23 vital signs + lab values
- Output: Risk score 0-1 (probability of sepsis in next 6 hours)
- Training: 80/20 train/test split, 5-fold cross-validation
## Performance Metrics
- AUROC: 0.87 (95% CI: 0.85-0.89)
- Sensitivity: 0.82 at specificity 0.80
- Calibration: Brier score 0.15
## Limitations & Biases
- Trained on single hospital data (generalization unknown)
- Under-represents Hispanic and Asian populations
- Performance degrades for age < 25 (limited training data)
- Does NOT account for medication history
## Monitoring Plan
- Weekly performance checks (AUROC, calibration)
- Daily demographic distribution checks
- Monthly bias audits across race/gender groups
Result: We documented all 47 models in 5 days (the "3-day" sprint ran long). Quality wasn't perfect, but we had something to show the auditors.
The Audit: Day 1-3
The FDA auditors arrived. Three days of intense scrutiny.
Day 1: They Found Everything We Missed
Auditor: “Show me how you monitor for model drift.”
Me: “We… um… check dashboards weekly?”
Auditor: “Show me the checks from last month.”
Me: *frantically searches Grafana* "Well, we… we don't have automated tracking of those checks…"
Finding #1: No systematic drift monitoring. Major deficiency.
Day 2: The Bias Audit That Exposed Us
Auditor: “Your sepsis model. Show me performance across demographic groups.”
We pulled up our dashboard showing overall AUROC: 0.87.
Auditor: “Now break it down by race.”
We ran the analysis live:
- White patients: AUROC 0.88
- Black patients: AUROC 0.79
- Hispanic patients: AUROC 0.71
Finding #2: Significant performance disparities. Critical deficiency.
We had never checked this. Our “governance” document said to check for bias, but we never actually did it.
Day 3: The One Thing That Saved Us
Auditor: “Walk me through your incident response process.”
We had one real example: A model had started misfiring in production 3 months prior. We:
- Detected the issue through alerts
- Immediately rolled back to previous version
- Root-caused the problem (data pipeline bug)
- Implemented fixes and safeguards
- Documented everything in a post-mortem
Auditor: “This is exactly what we want to see. Why didn’t you do this for everything?”
Finding #3: Good incident response, but inconsistent application. Observation (not a deficiency).
The 90-Day Remediation Plan
We had 90 days to fix the major deficiencies or face shutdown.
Fix 1: Automated Drift Monitoring
We built a monitoring system that actually worked:
# Automated drift detection
import numpy as np
from scipy.stats import ks_2samp, chisquare


class ModelDriftMonitor:
    def __init__(self, model_id, baseline_data):
        self.model_id = model_id
        # _compute_distribution splits features into numerical samples and
        # categorical frequency counts (helper defined elsewhere)
        self.baseline_dist = self._compute_distribution(baseline_data)

    def check_drift(self, production_data, window='1d'):
        """Check for distribution drift in production data"""
        prod_dist = self._compute_distribution(production_data)
        drift_scores = {}

        # Kolmogorov-Smirnov test for numerical features
        for feature in self.baseline_dist['numerical']:
            statistic, p_value = ks_2samp(
                self.baseline_dist['numerical'][feature],
                prod_dist['numerical'][feature]
            )
            drift_scores[feature] = {
                'statistic': statistic,
                'p_value': p_value,
                'drift_detected': p_value < 0.05
            }

        # Chi-square test for categorical features
        for feature in self.baseline_dist['categorical']:
            observed = np.asarray(prod_dist['categorical'][feature], dtype=float)
            expected = np.asarray(self.baseline_dist['categorical'][feature], dtype=float)
            # Rescale expected counts to the observed total (chisquare requires matching sums)
            expected = expected * observed.sum() / expected.sum()
            statistic, p_value = chisquare(observed, expected)
            drift_scores[feature] = {
                'statistic': statistic,
                'p_value': p_value,
                'drift_detected': p_value < 0.05
            }

        # Alert if drift detected
        drifted_features = [
            f for f, score in drift_scores.items()
            if score['drift_detected']
        ]
        if drifted_features:
            # _send_alert posts to our alerting pipeline (defined elsewhere)
            self._send_alert({
                'model_id': self.model_id,
                'drifted_features': drifted_features,
                'window': window,
                'severity': 'HIGH' if len(drifted_features) > 3 else 'MEDIUM'
            })

        return drift_scores
Deployment: Automated checks every 6 hours for all 47 models.
Result: We detected drift in 8 models within the first week. All were investigated and resolved.
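The 6-hour cadence itself was nothing fancy: a scheduled job that walked the registry and ran the monitor against recent production data. A minimal sketch of that runner, assuming cron invokes it every 6 hours (fetch_registered_models, load_baseline_dataset, and load_recent_predictions are hypothetical helpers standing in for our registry API and feature store):

# Cron-style drift runner: invoked every 6 hours for every production model.
# The three load/fetch helpers below are hypothetical placeholders.

def run_drift_checks():
    for model in fetch_registered_models(status="Production"):
        monitor = ModelDriftMonitor(
            model_id=model["id"],
            baseline_data=load_baseline_dataset(model["id"]),
        )
        recent = load_recent_predictions(model["id"], window="6h")
        scores = monitor.check_drift(recent, window="6h")
        drifted = [f for f, s in scores.items() if s["drift_detected"]]
        print(f"{model['id']}: {len(drifted)} drifted feature(s)")

if __name__ == "__main__":
    run_drift_checks()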
Fix 2: Systematic Bias Auditing
We built demographic performance tracking:
# Fairness metrics across groups
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)


class FairnessAuditor:
    def __init__(self, model, protected_attributes):
        self.model = model
        self.protected_attributes = protected_attributes

    def audit(self, X, y_true, y_pred):
        """Compute fairness metrics across protected groups.

        X is a DataFrame containing the protected attributes; y_pred holds
        predicted probabilities (thresholded at 0.5 for the binary metrics).
        """
        results = {}
        for attr in self.protected_attributes:
            groups = X[attr].unique()
            results[attr] = {}

            for group in groups:
                mask = (X[attr] == group).values
                y_hat = (y_pred[mask] >= 0.5).astype(int)

                # Performance metrics by group
                results[attr][group] = {
                    'n': int(mask.sum()),
                    'accuracy': accuracy_score(y_true[mask], y_hat),
                    'precision': precision_score(y_true[mask], y_hat),
                    'recall': recall_score(y_true[mask], y_hat),
                    'auroc': roc_auc_score(y_true[mask], y_pred[mask])
                }

            # Compute fairness metrics across groups
            aurocs = [results[attr][g]['auroc'] for g in groups]
            results[attr]['disparity_ratio'] = min(aurocs) / max(aurocs)
            results[attr]['max_disparity'] = max(aurocs) - min(aurocs)

            # Alert if disparity > threshold (10% AUROC gap)
            # (_send_fairness_alert posts to our alerting pipeline -- defined elsewhere)
            if results[attr]['max_disparity'] > 0.10:
                self._send_fairness_alert({
                    'model_id': self.model.id,
                    'attribute': attr,
                    'disparity': results[attr]['max_disparity'],
                    'groups': list(groups)
                })

        return results
Schedule: Monthly audits for all high-risk models, quarterly for medium-risk.
Result: Found and fixed 4 models with significant disparities.
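Running an audit looked roughly like this (a sketch: load_validation_set and sepsis_model are placeholders for our feature store and the deployed estimator):

# Run a fairness audit on the sepsis model's validation set (illustrative helpers)
X, y_true = load_validation_set("sepsis-risk-predictor")   # hypothetical helper
y_pred = sepsis_model.predict_proba(X)[:, 1]               # predicted sepsis probability

auditor = FairnessAuditor(sepsis_model, protected_attributes=["race", "gender"])
report = auditor.audit(X, y_true, y_pred)

print(report["race"]["max_disparity"])   # e.g. 0.17 before remediation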
Building the Long-Term Governance Framework (Months 4-18)
After passing the audit, we built sustainable governance.
The 5 Pillars of Our Framework
1. Model Development Standards
Pre-deployment checklist (automated in CI/CD):
# .github/workflows/model-governance.yml
name: Model Governance Checks

on:
  pull_request:
    paths:
      - 'models/**'

jobs:
  governance-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate Model Card Exists
        run: |
          if [ ! -f "models/${{ github.event.pull_request.head.ref }}/model_card.md" ]; then
            echo "ERROR: Model card required"
            exit 1
          fi

      - name: Check Training Data Documentation
        run: python scripts/validate_data_documentation.py

      - name: Validate Performance Metrics
        run: |
          python scripts/check_minimum_performance.py \
            --min-auroc 0.75 \
            --min-samples 1000

      - name: Bias Audit
        run: |
          python scripts/audit_fairness.py \
            --max-disparity 0.15

      - name: Explainability Check
        run: python scripts/validate_explainability.py
Deployment gates: Models can’t deploy without passing all checks.
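The scripts the workflow calls are small. As one example, a sketch of what check_minimum_performance.py might look like (the metrics.json location and format are assumptions for illustration, not our exact layout):

# scripts/check_minimum_performance.py -- fail the build if validation metrics
# fall below governance minimums. Assumes the PR includes a metrics.json produced
# during validation (format is an assumption for this sketch).
import argparse
import json
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-auroc", type=float, required=True)
    parser.add_argument("--min-samples", type=int, required=True)
    parser.add_argument("--metrics-file", default="metrics.json")
    args = parser.parse_args()

    with open(args.metrics_file) as f:
        metrics = json.load(f)

    failures = []
    if metrics["auroc"] < args.min_auroc:
        failures.append(f"AUROC {metrics['auroc']:.3f} < {args.min_auroc}")
    if metrics["validation_samples"] < args.min_samples:
        failures.append(f"only {metrics['validation_samples']} validation samples")

    if failures:
        print("Governance check failed:", "; ".join(failures))
        sys.exit(1)
    print("Performance checks passed.")

if __name__ == "__main__":
    main()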
2. Risk-Based Approval Workflows
Three risk tiers:
| Risk Level | Examples | Approval Required | Monitoring Frequency |
|---|---|---|---|
| HIGH | Clinical decision support, automated diagnosis | Medical Director + Compliance Lead + Technical Lead | Daily |
| MEDIUM | Patient triage, appointment scheduling | Technical Lead + Domain Expert | Weekly |
| LOW | Search ranking, content recommendations | Technical Lead | Monthly |
3. Continuous Monitoring
Real-time dashboards for each model showing:
- Prediction volume (requests/hour)
- Performance metrics (accuracy, AUROC, calibration)
- Drift scores (feature distributions)
- Fairness metrics (performance by demographic)
- Alert history
Example alert configuration:
# Model monitoring alerts
alerts = {
    'performance_degradation': {
        'metric': 'auroc',
        'threshold': 0.75,        # Alert if AUROC drops below 0.75
        'window': '7d',
        'severity': 'CRITICAL'
    },
    'calibration_drift': {
        'metric': 'brier_score',
        'threshold': 0.20,
        'window': '24h',
        'severity': 'HIGH'
    },
    'prediction_volume_anomaly': {
        'metric': 'request_count',
        'threshold': '3_sigma',   # 3 standard deviations from mean
        'window': '1h',
        'severity': 'MEDIUM'
    },
    'fairness_violation': {
        'metric': 'demographic_disparity',
        'threshold': 0.15,        # Max 15% disparity
        'window': '7d',
        'severity': 'HIGH'
    }
}
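The evaluator that consumes these rules isn't shown above; a minimal sketch of how the numeric thresholds might be checked each cycle (compute_windowed_metric and page are hypothetical helpers, and the sigma-based volume rule is skipped for brevity):

# Evaluate numeric alert rules against the latest windowed metric values.
# compute_windowed_metric() and page() are hypothetical helpers for this sketch.
DEGRADE_WHEN_BELOW = {'auroc'}   # for AUROC, lower is worse; for the others, higher is worse

def evaluate_alerts(model_id, alerts):
    for name, rule in alerts.items():
        if not isinstance(rule['threshold'], (int, float)):
            continue  # sigma-based rules are handled by a separate anomaly job
        value = compute_windowed_metric(model_id, rule['metric'], rule['window'])
        if rule['metric'] in DEGRADE_WHEN_BELOW:
            breached = value < rule['threshold']
        else:
            breached = value > rule['threshold']
        if breached:
            page(model_id, name, severity=rule['severity'], value=value)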
4. Incident Response Procedures
Defined escalation paths:
Level 1: Automated Alert → On-call engineer investigates
↓ (if unresolved in 30 min)
Level 2: Page team lead + disable model if critical
↓ (if unresolved in 2 hours)
Level 3: Emergency response team + executive notification
Post-incident requirements:
- Root cause analysis within 48 hours
- Remediation plan within 5 business days
- Model registry updated with incident details
- Lessons learned shared with all teams
5. Documentation & Audit Trail
Everything tracked in our registry:
- Model metadata (architecture, features, training data)
- Validation results (performance metrics, bias audits)
- Approval history (who approved, when, why)
- Deployment history (versions, rollbacks, incidents)
- Monitoring data (drift detection, performance trends)
Retention: 7 years (regulatory requirement).
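For a sense of scale, a single audit-trail entry carried roughly this shape (the field names below are illustrative, not our exact schema):

# Illustrative shape of one audit-trail record in the registry (not the exact schema)
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AuditRecord:
    model_id: str
    version: str
    event_type: str                 # 'validation', 'approval', 'deployment', 'incident', 'drift_alert'
    actor: str                      # who validated / approved / responded
    details: dict = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.utcnow)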
The Results: 18 Months Later
Governance Metrics
Model Registry:
- 47 models documented and monitored
- 100% compliance with documentation standards
- Zero models deployed without approval
Monitoring Coverage:
- 100% of production models monitored 24/7
- Average detection time for issues: 8 minutes (was >24 hours)
- 127 drift alerts triggered, 89 investigated, 23 required action
Bias Audits:
- 564 monthly fairness checks completed
- 4 significant disparities found and remediated
- Average disparity reduced from 0.14 to 0.07
Incident Response:
- 12 model incidents in 18 months
- Average time to resolution: 2.1 hours (was >48 hours)
- Zero patient safety events attributed to ML models
Business Impact
Audit Performance:
- Passed second FDA audit with zero findings
- Audit duration reduced from 3 days to 4 hours
- Auditor feedback: “Best-in-class AI governance”
Development Velocity:
- Model deployment time: 3.5 days (was 2-3 weeks)
- Approval cycle time: 18 hours (was 5-7 days)
- Documentation time: 2 hours per model (was 16+ hours)
Cost Efficiency:
- Compliance overhead: 4% of engineering time (was 18%)
- No regulatory fines: $0 (avoided $2.3M)
- Reduced manual monitoring: $340K annual savings
The Lessons: What Actually Matters
1. Start Minimal, Iterate Fast
Don’t build the perfect framework. Build the minimal viable governance that solves your immediate pain points.
Our 127-page policy document: Useless. Our Airtable registry + automated checks: Game-changer.
2. Automate or Die
Manual governance doesn’t scale. Every check must be automated.
If a governance requirement can’t be automated, it won’t be followed.
3. Integrate with Existing Workflows
Don’t create separate “governance tools.” Embed governance into existing workflows.
- Model registration: Part of deployment pipeline
- Bias audits: Automated in CI/CD
- Monitoring: Integrated with existing dashboards
4. Risk-Based Approach
Not all models need the same governance. High-risk models get scrutiny. Low-risk models get automation.
Treating every model the same creates bottlenecks.
5. Documentation is Code
Model cards, audit logs, incident reports - treat documentation like code.
- Version controlled (Git)
- Reviewed in PRs
- Automatically generated when possible
- Kept up-to-date or CI fails (a minimal staleness check is sketched below)
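A sketch of that staleness check, assuming each model lives in models/<name>/ with its model_card.md beside the code (the layout is an assumption; the check compares last-commit timestamps, so CI needs full git history, e.g. actions/checkout with fetch-depth: 0):

# CI check: fail if a model's code has commits newer than its model card.
import pathlib
import subprocess
import sys

def last_commit_ts(path):
    """Unix timestamp of the most recent commit touching `path` (0 if none)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", str(path)],
        capture_output=True, text=True, check=True
    ).stdout.strip()
    return int(out) if out else 0

def main():
    stale = []
    for model_dir in pathlib.Path("models").iterdir():
        if not model_dir.is_dir():
            continue
        card = model_dir / "model_card.md"
        if not card.exists():
            stale.append(f"{model_dir.name}: missing model card")
        elif last_commit_ts(model_dir) > last_commit_ts(card):
            stale.append(f"{model_dir.name}: code changed after last model card update")
    if stale:
        print("Stale documentation:\n" + "\n".join(stale))
        sys.exit(1)

if __name__ == "__main__":
    main()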
What We’re Building Next
1. Federated Learning Governance
Challenge: How do you govern models trained on data you can’t see?
We’re exploring:
- Differential privacy guarantees
- Federated model cards
- Distributed audit trails
2. LLM-Specific Governance
Challenge: Traditional governance doesn’t work for foundation models.
New requirements:
- Prompt injection monitoring
- Hallucination detection
- Reasoning transparency
3. Real-Time Compliance Reporting
Challenge: Auditors want instant access to governance data.
Building:
- Self-service audit dashboard
- Automated compliance reports
- Real-time alert notifications to regulators
The Bottom Line
AI governance in regulated industries is hard, but not impossible.
Keys to success:
- Start simple, iterate based on real needs
- Automate everything possible
- Integrate governance into development workflows
- Risk-based approach (not one-size-fits-all)
- Treat documentation like code
18 months later:
- Zero regulatory fines
- Faster model deployment
- Better model quality
- Engineers who actually follow governance (because it helps them)
For more on AI governance frameworks, see the comprehensive VP implementation guide that helped shape our approach.
Building AI governance? Connect on LinkedIn or share your governance challenges on Twitter.