The Breach That Changed Everything
February 2024, 3:42 AM. Security Operations Center detects anomalous lateral movement.
An attacker had compromised a single developer laptop through a phishing email. Within 18 minutes, they:
- Accessed our internal wiki (no authentication required)
- Discovered database credentials (hardcoded in wiki documentation)
- Pivoted to production database servers (network-level trust)
- Exfiltrated 2.3 million customer records
Total time from initial compromise to data exfiltration: 47 minutes.
Our “security” model: “Inside the network = trusted”
This archaic model almost destroyed our company.
After reading about Zero Trust architecture, our CISO declared: “We’re going Zero Trust. Six months. No excuses.”
This is the story of that migration - the painful transition, the unexpected benefits, and why we should have done this years ago.
The Traditional Model: A House of Cards
Before Zero Trust, our security looked like this:
┌─────────────────────────────────────┐
│       Corporate VPN (Sacred)        │
│ ┌─────────────────────────────────┐ │
│ │   Internal Network (Trusted)    │ │
│ │  ┌──────────┐   ┌──────────┐    │ │
│ │  │   Wiki   │   │    DB    │    │ │
│ │  │  (HTTP)  │   │ (no auth)│    │ │
│ │  └──────────┘   └──────────┘    │ │
│ │  ┌──────────┐   ┌──────────┐    │ │
│ │  │ Jenkins  │   │ K8s API  │    │ │
│ │  │ (no MFA) │   │ (network)│    │ │
│ │  └──────────┘   └──────────┘    │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────┘
The problem: Once an attacker breached the VPN, they owned everything.
Decision Point: Zero Trust Architecture
What Zero Trust Actually Means
Traditional security: “Trust, then verify”
Zero Trust: “Never trust, always verify”
Core principles:
- Verify explicitly: Authenticate and authorize every request
- Least privilege: Minimum access necessary, just-in-time
- Assume breach: Minimize blast radius through microsegmentation
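As a rough illustration of the first two principles, the sketch below evaluates every request against identity, device health, and a per-resource group requirement, and denies anything it does not recognize. The `Request` type, the `REQUIRED_GROUP` mapping, and the resource names are hypothetical and exist only for this example; they are not part of our actual tooling.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Request:
    user: Optional[str]      # verified identity, or None if unauthenticated
    groups: FrozenSet[str]   # group memberships asserted by the identity provider
    device_healthy: bool     # result of a device posture check
    resource: str            # logical name of the resource being accessed


# Hypothetical mapping of resource -> the one group allowed to reach it
REQUIRED_GROUP = {
    "wiki": "engineering",
    "db-admin": "database-admins",
}


def authorize(req: Request) -> bool:
    """Deny by default; allow only when every check passes."""
    if req.user is None:
        return False              # verify explicitly: no identity, no access
    if not req.device_healthy:
        return False              # assume breach: unhealthy devices are untrusted
    required = REQUIRED_GROUP.get(req.resource)
    if required is None:
        return False              # unknown resource: no implicit trust
    return required in req.groups  # least privilege: group must match exactly


if __name__ == "__main__":
    dev = Request("dev@example.com", frozenset({"engineering"}), True, "wiki")
    print(authorize(dev))
```

Note that network location never appears as an input: being "inside" grants nothing on its own.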
Our Implementation Strategy
We chose a phased approach:
Phase 1 (Months 1-2): Identity & Access
- Implement SSO everywhere
- Deploy MFA universally
- Build identity-aware proxy
Phase 2 (Months 3-4): Network Segmentation
- Implement service mesh with mTLS
- Deploy network policies
- Remove network-level trust
Phase 3 (Months 5-6): Continuous Verification
- Deploy device posture checks
- Implement session monitoring
- Build automated threat response
Phase 1: Identity & Access (The Hardest Part)
Challenge 1: 850 Services, Zero SSO
We had 850 internal services with authentication ranging from “excellent” to “nonexistent”:
- 23% had SSO integration
- 41% had username/password (often shared credentials)
- 36% had no authentication (network-level trust)
The migration nightmare:
# Services with no authentication (--no-headers keeps the header row out of the count)
$ kubectl get services -A -l auth=none --no-headers | wc -l
307

# Services with shared passwords
$ grep -r "SHARED_PASSWORD" . | wc -l
189

# Services with hardcoded credentials
$ grep -r "password=" . | wc -l
441
Solution: Identity-Aware Proxy
We deployed Pomerium as our identity-aware proxy:
# Pomerium configuration for internal wiki
routes:
  - from: https://wiki.internal.example.com
    to: http://wiki-backend.default.svc.cluster.local
    policy:
      - allow:
          and:
            - domain:
                is: example.com          # Only company emails ("email: is" would match one exact address)
            - groups:
                has: "engineering"       # Must be in engineering group
    pass_identity_headers: true          # Forward user identity

  - from: https://db-admin.internal.example.com
    to: http://db-admin.default.svc.cluster.local
    policy:
      - allow:
          and:
            - domain:
                is: example.com
            - groups:
                has: "database-admins"   # Restricted group
            # The posture and MFA settings below are shown schematically;
            # the exact keys depend on your Pomerium version and identity provider
            - device_posture:
                - is: "healthy"          # Device must be compliant
    mfa: true                            # Require MFA for sensitive resources
Migration pattern for services:
// Before: No authentication
app.get('/admin/users', (req, res) => {
  const users = db.query('SELECT * FROM users');
  res.json(users);
});

// After: Identity from proxy headers
app.get('/admin/users', (req, res) => {
  // Pomerium injects identity headers. They are only trustworthy if the
  // service cannot be reached except through the proxy.
  const userEmail = req.headers['x-pomerium-claim-email'];

  // Verify identity was provided
  if (!userEmail) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // Parse groups only after the identity check, and treat a missing or
  // malformed header as "no groups" instead of letting JSON.parse throw
  let userGroups = [];
  try {
    userGroups = JSON.parse(req.headers['x-pomerium-claim-groups'] || '[]');
  } catch (err) {
    userGroups = [];
  }

  // Check authorization
  if (!Array.isArray(userGroups) || !userGroups.includes('admin')) {
    return res.status(403).json({ error: 'Forbidden' });
  }

  const users = db.query('SELECT * FROM users');
  res.json(users);
});
Results after 2 months:
- 850 services behind identity-aware proxy
- 100% SSO coverage
- Zero services without authentication
Challenge 2: MFA Fatigue
After deploying universal MFA, we got massive pushback from engineers:
“I have to MFA 12 times per day just to do my job!”
The problem: Traditional MFA required re-authentication every session.
The solution: Risk-based MFA + session management:
# Risk-based MFA policy
mfa_policy:
  # Low risk: No MFA required
  - condition:
      location: office_network
      device: managed
      recent_mfa: within_8_hours
    action: allow
  # Medium risk: MFA required
  - condition:
      location: home_network
      device: managed
    action: require_mfa
    mfa_frequency: 8_hours
  # High risk: MFA + additional verification
  - condition:
      location: unknown
      OR:
        device: unmanaged
    action: require_mfa
    mfa_frequency: 1_hour
    additional_verification: true
Result: MFA prompts dropped from 12/day to 1-2/day. Engineer happiness: restored.
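The decision logic behind a policy like this can be sketched in a few lines. This is an illustrative approximation, not our production code: `LoginContext`, `mfa_max_age`, and `needs_mfa` are hypothetical names, the sketch assumes an office login with an MFA older than 8 hours is prompted again (the policy above does not spell that case out), and it leaves out the additional verification step for high-risk logins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LoginContext:
    location: str                  # "office_network", "home_network", or "unknown"
    device_managed: bool
    last_mfa: Optional[datetime]   # time of the last successful MFA, if any


def mfa_max_age(ctx: LoginContext) -> timedelta:
    """How long a previous MFA stays valid for this context."""
    if ctx.location == "unknown" or not ctx.device_managed:
        return timedelta(hours=1)  # high risk: re-prompt hourly
    return timedelta(hours=8)      # office or home on a managed device


def needs_mfa(ctx: LoginContext, now: datetime) -> bool:
    """True when the user must complete MFA before continuing."""
    if ctx.last_mfa is None:
        return True                # never verified in this context
    return now - ctx.last_mfa > mfa_max_age(ctx)


if __name__ == "__main__":
    now = datetime(2024, 6, 1, 12, 0)
    office = LoginContext("office_network", True, now - timedelta(hours=2))
    print(needs_mfa(office, now))
```

Keeping the freshness window as the only variable is what turns a dozen daily prompts into one or two for a typical managed-device user.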
Phase 2: Network Segmentation (The Technical Challenge)
Implementing Service Mesh with mTLS
Goal: Every service-to-service communication must be mutually authenticated and encrypted.
We chose Istio for our service mesh:
# Istio PeerAuthentication - enforce mTLS everywhere
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT # Reject all non-mTLS traffic
The migration problem: Enforcing STRICT mode immediately would break 40% of services.
The solution: Progressive migration with monitoring:
# Phase 1: PERMISSIVE mode (allow both mTLS and plaintext)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system # Root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: PERMISSIVE # Accept mTLS or plaintext
---
# Phase 2: Monitor non-mTLS traffic
# (istio_requests_total already carries a connection_security_policy label
# by default; the override below makes the intent explicit)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mtls-monitoring
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            connection_security_policy:
              value: 'connection.mtls | "unknown"'
Results after monitoring:
- Identified 127 services making non-mTLS calls
- Fixed all service-to-service communication
- Switched to STRICT mode with zero downtime
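To show how the offending callers might be pulled out of the monitoring data, here is a minimal sketch. It assumes samples shaped like a Prometheus instant-query result for `istio_requests_total`, each with a `metric` label dict and a `value` pair, and it treats any `connection_security_policy` other than `mutual_tls` as plaintext. The function name and the sample workloads are made up for the example.

```python
from collections import defaultdict


def plaintext_callers(samples):
    """Map each destination workload to the sources reaching it without mTLS."""
    offenders = defaultdict(set)
    for sample in samples:
        labels = sample["metric"]
        rate = float(sample["value"][1])  # Prometheus encodes sample values as strings
        if rate > 0 and labels.get("connection_security_policy") != "mutual_tls":
            offenders[labels["destination_workload"]].add(labels["source_workload"])
    return {dest: sorted(sources) for dest, sources in offenders.items()}


if __name__ == "__main__":
    # Hypothetical samples for illustration only
    samples = [
        {"metric": {"source_workload": "web", "destination_workload": "api",
                    "connection_security_policy": "mutual_tls"},
         "value": [1717243200, "12.5"]},
        {"metric": {"source_workload": "cron", "destination_workload": "api",
                    "connection_security_policy": "none"},
         "value": [1717243200, "0.4"]},
    ]
    print(plaintext_callers(samples))
```

A report like this gives each owning team a concrete list of callers to fix before the switch to STRICT.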
Network Policies: Microsegmentation
We implemented default-deny network policies:
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow for web → api communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
The challenge: Documenting 850 services’ communication patterns.
The solution: Automated network policy generation from Istio telemetry:
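import requests

# Assumed Prometheus endpoint (placeholder); replace with your own
PROMETHEUS_URL = "http://prometheus.istio-system.svc:9090"


# Generate network policies from Istio metrics
def generate_network_policies(namespace):
    # Query Prometheus for service-to-service traffic
    query = f'''
    sum by (source_workload, destination_workload, destination_port) (
      rate(istio_tcp_connections_opened_total{{
        destination_namespace="{namespace}"
      }}[7d])
    ) > 0
    '''
    # Plain HTTP API call, standing in for whichever Prometheus client you use
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=30
    )
    response.raise_for_status()
    results = response.json()["data"]["result"]

    policies = []
    for result in results:
        # Labels live under the "metric" key of each result
        labels = result["metric"]
        source = labels["source_workload"]
        dest = labels["destination_workload"]
        port = labels["destination_port"]

        # Generate network policy YAML
        # (assumes each workload's pods carry an "app" label equal to the workload name)
        policy = {
            'apiVersion': 'networking.k8s.io/v1',
            'kind': 'NetworkPolicy',
            'metadata': {
                # Include the port so two flows between the same pair don't collide
                'name': f'{source}-to-{dest}-{port}',
                'namespace': namespace
            },
            'spec': {
                'podSelector': {'matchLabels': {'app': dest}},
                'policyTypes': ['Ingress'],
                'ingress': [{
                    'from': [{'podSelector': {'matchLabels': {'app': source}}}],
                    'ports': [{'protocol': 'TCP', 'port': int(port)}]
                }]
            }
        }
        policies.append(policy)
    return policies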
Result: Generated 2,347 network policies automatically. Reviewing them took 1 week, versus an estimated 6 months to write them by hand.
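One way to keep the generated set reviewable is to emit a single policy per destination and list every allowed source and port inside it. The sketch below is a possible refinement and not the generator we ran: `merge_by_destination` and the `allow-ingress-` naming are hypothetical, flows are assumed to arrive as `(source, destination, port)` tuples, and pods are assumed to carry an `app` label matching the workload name.

```python
from collections import defaultdict


def merge_by_destination(flows, namespace):
    """Build one ingress NetworkPolicy per destination from (source, dest, port) flows."""
    grouped = defaultdict(lambda: defaultdict(set))  # dest -> port -> {sources}
    for source, dest, port in flows:
        grouped[dest][int(port)].add(source)

    policies = []
    for dest in sorted(grouped):
        rules = []
        for port in sorted(grouped[dest]):
            rules.append({
                "from": [
                    {"podSelector": {"matchLabels": {"app": src}}}
                    for src in sorted(grouped[dest][port])
                ],
                "ports": [{"protocol": "TCP", "port": port}],
            })
        policies.append({
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": f"allow-ingress-{dest}", "namespace": namespace},
            "spec": {
                "podSelector": {"matchLabels": {"app": dest}},
                "policyTypes": ["Ingress"],
                "ingress": rules,
            },
        })
    return policies


if __name__ == "__main__":
    flows = [("web", "api", 8080), ("worker", "api", 8080), ("api", "db", 5432)]
    for policy in merge_by_destination(flows, "production"):
        print(policy["metadata"]["name"])
```

Grouping this way means a reviewer reads one object to answer "who can talk to this service?", at the cost of larger individual policies.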
Phase 3: Continuous Verification (The Game-Changer)
Device Posture Checks
Requirement: Only compliant devices can access internal resources.
Implementation: Osquery + Kolide integration:
# Device compliance policy
compliance_policy:
  required_checks:
    - name: "Disk Encryption"
      query: "SELECT encrypted FROM disk_encryption WHERE encrypted = 1"
    - name: "OS Patches"
      # os_version exposes numeric major/minor/patch columns; compare those,
      # since comparing version strings sorts "10.9" above "10.15"
      query: "SELECT * FROM os_version WHERE major > 10 OR (major = 10 AND (minor > 15 OR (minor = 15 AND patch >= 7)))"
    - name: "Firewall Enabled"
      query: "SELECT * FROM alf WHERE global_state >= 1"
    - name: "No Jailbreak/Root"
      query: "SELECT * FROM system_info WHERE hardware_vendor != 'unknown'"
    - name: "Antivirus Running"
      query: "SELECT * FROM processes WHERE name LIKE '%antivirus%'"
Enforcement: Devices failing checks → blocked from accessing internal resources.
Exceptions: Emergency access with additional verification + automatic ticket for remediation.
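The enforcement rule can be sketched as follows, assuming the convention the queries above imply: a check passes when its query returns at least one row. `evaluate_device` and the sample results are hypothetical, and how the rows are collected from each device is outside the scope of the sketch.

```python
def evaluate_device(results, required_checks):
    """Decide compliance from query results.

    results: dict mapping check name -> list of rows returned by that check's query.
    required_checks: names of the checks every device must pass.
    A missing entry or an empty row list counts as a failure.
    """
    failed = [name for name in required_checks if not results.get(name)]
    return {"compliant": not failed, "failed_checks": failed}


if __name__ == "__main__":
    required = ["Disk Encryption", "OS Patches", "Firewall Enabled"]
    # Hypothetical results from one laptop
    results = {
        "Disk Encryption": [{"encrypted": "1"}],
        "OS Patches": [{"major": "14", "minor": "5", "patch": "0"}],
        "Firewall Enabled": [],  # query returned no rows, so the check fails
    }
    print(evaluate_device(results, required))
```

Treating a missing result as a failure matters: a device that stops reporting should lose access, not keep it.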
Session Monitoring & Threat Response
We implemented continuous session monitoring:
# Real-time session risk scoring
class SessionRiskScorer:
    def calculate_risk_score(self, session):
        risk = 0

        # Location risk
        if session.location not in self.known_locations:
            risk += 30
        if session.location.country != session.user.home_country:
            risk += 20

        # Device risk
        if not session.device.is_managed:
            risk += 40
        if session.device.posture_status != 'healthy':
            risk += 30

        # Behavior risk
        if session.access_pattern != session.user.normal_pattern:
            risk += 25
        if session.accessed_sensitive_resource:
            risk += 15

        # Time risk
        if session.time not in session.user.normal_hours:
            risk += 10

        return min(risk, 100)

    def take_action(self, session, risk_score):
        if risk_score >= 80:
            # High risk: terminate session immediately
            self.terminate_session(session)
            self.alert_security_team(session, 'HIGH')
        elif risk_score >= 60:
            # Medium risk: require re-authentication
            self.require_step_up_auth(session)
            self.alert_security_team(session, 'MEDIUM')
        elif risk_score >= 40:
            # Low risk: increase monitoring
            self.increase_monitoring(session)
Real-world example: Detected compromised developer account within 4 minutes:
- Login from unusual location (New Jersey, user normally in California)
- Access to resources user rarely touches (database admin panel)
- Unusual time (2:47 AM, user normally 9-5)
- Risk score: 87/100
- Action: Session terminated, security team alerted, user contacted
Outcome: Prevented breach. User confirmed account compromise. Credentials rotated.
The Performance Impact: Our Biggest Surprise
We expected Zero Trust to slow things down. We were wrong.
Latency Measurements
Before Zero Trust (VPN-based):
- VPN handshake: 1,200-2,400ms
- Average request latency: 450ms
- p99 latency: 2,100ms
After Zero Trust (identity-aware proxy + service mesh):
- Identity verification: 8-15ms (cached identity)
- Average request latency: 180ms
- p99 latency: 420ms
Result: Average request latency dropped 60% compared with VPN-based access, and p99 latency dropped 80%. Stunned everyone.
Why faster:
- No VPN overhead (direct access through cloud edge)
- Optimized routing (service mesh load balancing)
- Connection pooling (persistent mTLS connections)
- Regional edge deployment (identity verification at edge)
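The latency comparison is simple arithmetic on the measurements listed above; the sketch below reproduces it so the same calculation can be rerun on other percentiles. `percent_reduction` is a helper written for this example.

```python
def percent_reduction(before_ms, after_ms):
    """Relative latency reduction, as a percentage of the 'before' value."""
    return 100 * (before_ms - after_ms) / before_ms


avg_reduction = percent_reduction(450, 180)    # average request latency
p99_reduction = percent_reduction(2100, 420)   # p99 latency

print(f"average: {avg_reduction:.0f}% lower")  # prints "average: 60% lower"
print(f"p99: {p99_reduction:.0f}% lower")      # prints "p99: 80% lower"
```

The tail improved more than the average, which is consistent with the VPN handshake having dominated the slowest requests.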
The Security Improvements: Quantified
Metrics After 6 Months
Lateral Movement:
- Before: Attacker moved from wiki → database in 18 minutes
- After: Impossible - every service requires authentication
Breach Detection:
- Before: Average detection time 4.2 hours
- After: Average detection time 4 minutes (98% improvement)
Attack Surface:
- Before: 307 services with no authentication
- After: 0 services without authentication
Credential Exposure:
- Before: 441 hardcoded credentials in code
- After: 0 credentials in code (all externalized to secrets management)
Phishing Resilience:
- Before: Compromised laptop = compromised network
- After: Compromised laptop + device posture + MFA = limited blast radius
Cost Impact
Security infrastructure:
- Identity-aware proxy: $45K/year
- Service mesh: $120K/year (mostly personnel)
- Secrets management: $80K/year
- Device management: $85K/year
- Total: $330K/year
Breaches prevented (calculated risk):
- Previous breach cost: $8.7M (incident response, customer notification, regulatory fines)
- Estimated breach probability without Zero Trust: 60% over 3 years
- Estimated breach probability with Zero Trust: 5% over 3 years
- Risk reduction value: $4.8M over 3 years ($8.7M × (60% − 5%))
ROI: about 383% over 3 years ($4.8M in avoided expected loss against $990K of spend).
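The expected-loss arithmetic can be recomputed from the inputs listed above. This sketch assumes ROI means net benefit divided by total spend over the same three years, and that the $330K annual cost stays flat; both are assumptions about the intended definition.

```python
def expected_loss(breach_cost, probability):
    """Expected loss from a breach with the given probability over the period."""
    return breach_cost * probability


BREACH_COST = 8_700_000   # cost of the previous breach
ANNUAL_COST = 330_000     # yearly Zero Trust spend
YEARS = 3

risk_reduction = expected_loss(BREACH_COST, 0.60) - expected_loss(BREACH_COST, 0.05)
total_cost = ANNUAL_COST * YEARS
roi_percent = 100 * (risk_reduction - total_cost) / total_cost

print(f"risk reduction: ${risk_reduction:,.0f}")  # prints "risk reduction: $4,785,000"
print(f"total cost: ${total_cost:,.0f}")          # prints "total cost: $990,000"
print(f"ROI: {roi_percent:.0f}%")                 # prints "ROI: 383%"
```

Dividing the three-year benefit by a single year of cost instead would inflate the figure by roughly a factor of three and a half, so it is worth stating which denominator is used.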
The Organizational Challenge: People > Technology
Engineer Resistance
Initial reaction: “This is security theater that will slow us down!”
Our approach:
- Demo the breach: Showed engineers how the attack happened
- Measure performance: Proved Zero Trust was actually faster
- Improve DevEx: Made security transparent to developers
- Listen to feedback: Iterated on MFA policies based on user experience
Result: 89% engineer satisfaction (up from 34% initially).
Security Culture Shift
Before: “Security is IT’s problem”
After: “Security is everyone’s responsibility”
How we shifted culture:
- Security champions: Embedded security engineers in product teams
- Blameless post-mortems: Focused on systems, not individuals
- Automated security: Made secure defaults the easy path
- Education: Monthly security training, gamified learning
Lessons for Teams Considering Zero Trust
✅ Do This:
- Start with identity: SSO + MFA everywhere, before anything else
- Measure first: Monitor before enforcing policies
- Progressive rollout: PERMISSIVE → monitor → STRICT
- Automate policy generation: Don’t manually create 2,000+ policies
- Focus on user experience: Security that frustrates users will be bypassed
❌ Don’t Do This:
- Big bang migration: You will break things
- Skip monitoring phase: You don’t know what you don’t know
- Ignore user feedback: If MFA is annoying, people will find workarounds
- Forget about devices: Compromised devices = compromised sessions
- Neglect incident response: Zero Trust detects threats, but humans respond
What’s Next?
We’re now exploring:
- Workload identity: Extend Zero Trust to service-to-service communication
- API gateway integration: Zero Trust for external APIs
- AI-powered threat detection: ML models for anomaly detection
- Zero Trust for data: Encrypt data at rest with access policies
Zero Trust transformed our security posture from “perimeter defense” to “assume breach.” It’s harder to implement than traditional security, but the risk reduction is worth every hour.
For more on Zero Trust architecture patterns, see the comprehensive cloud security guide that helped inform our strategy.