Zero Trust Journey: From 'Trust the Network' to 'Trust No One' in 6 Months

Migrating 850 services to Zero Trust architecture - the ransomware attack that forced our hand, the $470K security upgrade, and why lateral movement dropped 99.7%.

The Breach That Changed Everything

February 2024, 3:42 AM. Security Operations Center detects anomalous lateral movement.

An attacker had compromised a single developer laptop through a phishing email. They moved from the wiki to the production database in 18 minutes. In sequence, they:

  • Accessed our internal wiki (no authentication required)
  • Discovered database credentials (hardcoded in wiki documentation)
  • Pivoted to production database servers (network-level trust)
  • Exfiltrated 2.3 million customer records

Total time from initial compromise to data exfiltration: 47 minutes.

Our “security” model: “Inside the network = trusted”

This archaic model almost destroyed our company.

After reading about Zero Trust architecture, our CISO declared: “We’re going Zero Trust. Six months. No excuses.”

This is the story of that migration - the painful transition, the unexpected benefits, and why we should have done this years ago.

The Traditional Model: A House of Cards

Before Zero Trust, our security looked like this:

┌─────────────────────────────────────┐
│       Corporate VPN (Sacred)        │
│  ┌────────────────────────────────┐ │
│  │   Internal Network (Trusted)   │ │
│  │  ┌──────────┐   ┌──────────┐  │ │
│  │  │   Wiki   │   │   DB     │  │ │
│  │  │  (HTTP)  │   │ (no auth)│  │ │
│  │  └──────────┘   └──────────┘  │ │
│  │  ┌──────────┐   ┌──────────┐  │ │
│  │  │  Jenkins │   │  K8s API │  │ │
│  │  │  (no MFA)│   │(network) │  │ │
│  │  └──────────┘   └──────────┘  │ │
│  └────────────────────────────────┘ │
└─────────────────────────────────────┘

The problem: Once an attacker breached the VPN, they owned everything.

Decision Point: Zero Trust Architecture

What Zero Trust Actually Means

Traditional security: “Trust, then verify”

Zero Trust: “Never trust, always verify”

Core principles:

  1. Verify explicitly: Authenticate and authorize every request
  2. Least privilege: Minimum access necessary, just-in-time
  3. Assume breach: Minimize blast radius through microsegmentation
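
The three principles can be read as a per-request check. A minimal sketch, assuming a hypothetical in-memory grant table (`Grant`, `authorize`, and the resource names are illustrative, not part of our stack): every request is denied unless a verified identity holds an unexpired grant for that exact resource.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Grant:
    """A just-in-time grant: one user, one resource, with an expiry."""
    user: str
    resource: str
    expires_at: datetime


def authorize(user: Optional[str], resource: str,
              grants: Sequence[Grant], now: datetime) -> bool:
    """Deny by default; allow only a verified user with a live grant."""
    if not user:  # Verify explicitly: no identity, no access
        return False
    for grant in grants:
        # Least privilege: the grant must name this exact resource
        if grant.user == user and grant.resource == resource:
            # Just-in-time: expired grants no longer count
            if now < grant.expires_at:
                return True
    return False


now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
grants = [Grant("alice@example.com", "db-admin", now + timedelta(hours=1))]

print(authorize("alice@example.com", "db-admin", grants, now))  # True
print(authorize("alice@example.com", "wiki", grants, now))      # False
print(authorize(None, "db-admin", grants, now))                 # False
```

Because the function returns False unless a grant matches, a service that is missing from the table is unreachable rather than open, which is the "assume breach" default.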

Our Implementation Strategy

We chose a phased approach:

Phase 1 (Months 1-2): Identity & Access

  • Implement SSO everywhere
  • Deploy MFA universally
  • Build identity-aware proxy

Phase 2 (Months 3-4): Network Segmentation

  • Implement service mesh with mTLS
  • Deploy network policies
  • Remove network-level trust

Phase 3 (Months 5-6): Continuous Verification

  • Deploy device posture checks
  • Implement session monitoring
  • Build automated threat response

Phase 1: Identity & Access (The Hardest Part)

Challenge 1: 850 Services, Zero SSO

We had 850 internal services with authentication ranging from “excellent” to “nonexistent”:

  • 23% had SSO integration
  • 41% had username/password (often shared credentials)
  • 36% had no authentication (network-level trust)

The migration nightmare:

# Services with no authentication
$ kubectl get services -A -l auth=none --no-headers | wc -l
307

# Services with shared passwords
$ grep -r "SHARED_PASSWORD" . | wc -l
189

# Services with hardcoded credentials
$ grep -r "password=" . | wc -l
441
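
A plain `grep` count includes comments and unrelated matches. A rough sketch of a slightly stricter scanner, assuming credentials appear as `password=` or `SHARED_PASSWORD:` style assignments; the regex and the function name are illustrative and will miss other formats.

```python
import re
from typing import List, Tuple

# Matches assignments such as password=hunter2 or SHARED_PASSWORD: "x"
CREDENTIAL_PATTERN = re.compile(
    r"\b(\w*password\w*)\s*[=:]\s*[\"']?([^\s\"']+)",
    re.IGNORECASE,
)


def find_hardcoded_credentials(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, key) for each line that assigns a password."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        match = CREDENTIAL_PATTERN.search(line)
        if match:
            findings.append((line_number, match.group(1)))
    return findings


sample = """\
db_host=10.0.0.5
password=hunter2
SHARED_PASSWORD: "s3cret"
# no secrets on this line
"""
print(find_hardcoded_credentials(sample))
# [(2, 'password'), (3, 'SHARED_PASSWORD')]
```

Reporting the key and line number, rather than the value, keeps the scan output itself free of secrets.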

Solution: Identity-Aware Proxy

We deployed Pomerium as our identity-aware proxy:

# Pomerium configuration for internal wiki
routes:
  - from: https://wiki.internal.example.com
    to: http://wiki-backend.default.svc.cluster.local
    policy:
      - allow:
          and:
            - domain:
                is: "example.com"   # Only company email domain
            - groups:
                has: "engineering"   # Must be in engineering group
    pass_identity_headers: true  # Forward user identity
    
  - from: https://db-admin.internal.example.com
    to: http://db-admin.default.svc.cluster.local
    policy:
      - allow:
          and:
            - domain:
                is: "example.com"
            - groups:
                has: "database-admins"  # Restricted group
            - device_posture:
                - is: "healthy"         # Device must be compliant
    mfa: true  # Require MFA for sensitive resources

Migration pattern for services:

// Before: No authentication
app.get('/admin/users', (req, res) => {
  const users = db.query('SELECT * FROM users');
  res.json(users);
});

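// After: Identity from proxy headers
app.get('/admin/users', (req, res) => {
  // Pomerium injects identity headers. They are only trustworthy if the
  // backend cannot be reached except through the proxy; otherwise also
  // verify the signed X-Pomerium-Jwt-Assertion header.
  const userEmail = req.headers['x-pomerium-claim-email'];

  // Verify identity was provided before touching any other header
  if (!userEmail) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // Parse groups defensively: a missing header means no groups
  let userGroups;
  try {
    userGroups = JSON.parse(req.headers['x-pomerium-claim-groups'] || '[]');
  } catch (err) {
    return res.status(400).json({ error: 'Malformed identity headers' });
  }

  // Check authorization
  if (!Array.isArray(userGroups) || !userGroups.includes('admin')) {
    return res.status(403).json({ error: 'Forbidden' });
  }

  const users = db.query('SELECT * FROM users');
  res.json(users);
});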

Results after 2 months:

  • 850 services behind identity-aware proxy
  • 100% SSO coverage
  • Zero services without authentication

Challenge 2: MFA Fatigue

After deploying universal MFA, we got massive pushback from engineers:

“I have to MFA 12 times per day just to do my job!”

The problem: Traditional MFA required re-authentication every session.

The solution: Risk-based MFA + session management:

# Risk-based MFA policy
mfa_policy:
  # Low risk: No MFA required
  - condition:
      location: office_network
      device: managed
      recent_mfa: within_8_hours
    action: allow
    
  # Medium risk: MFA required
  - condition:
      location: home_network
      device: managed
    action: require_mfa
    mfa_frequency: 8_hours
    
  # High risk: MFA + additional verification
  - condition:
      any_of:
        - location: unknown
        - device: unmanaged
    action: require_mfa
    mfa_frequency: 1_hour
    additional_verification: true

Result: MFA prompts dropped from 12/day to 1-2/day. Engineer happiness: restored.
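
The risk-based policy above boils down to a small decision function. A minimal sketch under one reading of that policy: managed devices on the office or home network are prompted at most once per 8 hours, while unknown locations or unmanaged devices are prompted hourly with an extra verification step. The function name and return values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def mfa_decision(location: str, device_managed: bool,
                 last_mfa: Optional[datetime], now: datetime) -> str:
    """Return 'allow', 'require_mfa', or 'require_mfa_plus_verification'."""
    def fresh(max_age: timedelta) -> bool:
        # True when the user completed MFA recently enough
        return last_mfa is not None and now - last_mfa <= max_age

    # High risk: unknown location or unmanaged device
    if location == "unknown" or not device_managed:
        if fresh(timedelta(hours=1)):
            return "allow"
        return "require_mfa_plus_verification"

    # Low and medium risk: managed device on a known network
    if fresh(timedelta(hours=8)):
        return "allow"
    return "require_mfa"


now = datetime(2024, 3, 1, 12, 0)
print(mfa_decision("office_network", True, now - timedelta(hours=2), now))
print(mfa_decision("home_network", True, now - timedelta(hours=9), now))
print(mfa_decision("unknown", True, now - timedelta(hours=2), now))
```

Checking the high-risk branch first matters: a recent MFA on a managed laptop should not carry over once the session shows up from an unknown location.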

Phase 2: Network Segmentation (The Technical Challenge)

Implementing Service Mesh with mTLS

Goal: Every service-to-service communication must be mutually authenticated and encrypted.

We chose Istio for our service mesh:

# Istio PeerAuthentication - enforce mTLS everywhere
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT  # Reject all non-mTLS traffic

The migration problem: Enforcing STRICT mode immediately would break 40% of services.

The solution: Progressive migration with monitoring:

# Phase 1: PERMISSIVE mode (allow both mTLS and plaintext)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: PERMISSIVE  # Accept mTLS or plaintext

---
# Phase 2: Monitor non-mTLS traffic
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mtls-monitoring
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            connection_security_policy:
              value: connection.mtls | "unknown"

Results after monitoring:

  • Identified 127 services making non-mTLS calls
  • Fixed all service-to-service communication
  • Switched to STRICT mode with zero downtime
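
Finding the plaintext callers is mostly a grouping exercise. A sketch, assuming the label sets have already been fetched from Prometheus for `istio_requests_total` as reported by the destination sidecar, and relying on the `connection_security_policy` label that Istio's standard metrics carry (`mutual_tls` for mTLS traffic). The function name and the sample data are illustrative.

```python
from collections import defaultdict


def find_plaintext_callers(samples):
    """Group non-mTLS traffic by destination workload.

    Each sample is a dict of metric labels, as found under "metric" in a
    Prometheus query result.
    """
    offenders = defaultdict(set)
    for labels in samples:
        if labels.get("connection_security_policy") != "mutual_tls":
            offenders[labels["destination_workload"]].add(
                labels.get("source_workload", "unknown"))
    return {dest: sorted(sources) for dest, sources in offenders.items()}


samples = [
    {"source_workload": "web", "destination_workload": "api",
     "connection_security_policy": "mutual_tls"},
    {"source_workload": "cron", "destination_workload": "api",
     "connection_security_policy": "none"},
    {"source_workload": "legacy", "destination_workload": "billing",
     "connection_security_policy": "none"},
]
print(find_plaintext_callers(samples))
# {'api': ['cron'], 'billing': ['legacy']}
```

An empty result over a full traffic cycle is a reasonable precondition for switching a namespace to STRICT.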

Network Policies: Microsegmentation

We implemented default-deny network policies:

# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

---
# Explicit allow for web → api communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080

The challenge: Documenting 850 services’ communication patterns.

The solution: Automated network policy generation from Istio telemetry:

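# Generate network policies from Istio metrics
import requests

# Assumed in-cluster Prometheus address; adjust for your environment
PROMETHEUS_URL = 'http://prometheus.istio-system:9090'

def generate_network_policies(namespace):
    # Query Prometheus for service-to-service traffic over the last 7 days.
    # destination_port is not a default Istio metric label; this assumes it
    # has been added as a custom tag via the Telemetry API.
    query = f'''
        sum by (source_workload, destination_workload, destination_port) (
            rate(istio_tcp_connections_opened_total{{
                destination_workload_namespace="{namespace}"
            }}[7d])
        ) > 0
    '''

    response = requests.get(
        f'{PROMETHEUS_URL}/api/v1/query', params={'query': query}, timeout=30
    )
    response.raise_for_status()
    results = response.json()['data']['result']

    policies = []
    for result in results:
        # Label values live under the 'metric' key of each result
        labels = result['metric']
        source = labels['source_workload']
        dest = labels['destination_workload']
        port = labels['destination_port']

        # One ingress policy per (source, destination, port). Assumes pods
        # carry an app=<workload> label and that the source runs in the
        # same namespace as the destination.
        policy = {
            'apiVersion': 'networking.k8s.io/v1',
            'kind': 'NetworkPolicy',
            'metadata': {
                'name': f'{source}-to-{dest}-{port}',
                'namespace': namespace
            },
            'spec': {
                'podSelector': {'matchLabels': {'app': dest}},
                'policyTypes': ['Ingress'],
                'ingress': [{
                    'from': [{'podSelector': {'matchLabels': {'app': source}}}],
                    'ports': [{'protocol': 'TCP', 'port': int(port)}]
                }]
            }
        }
        policies.append(policy)

    return policies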

Result: Generated 2,347 network policies automatically. Reviewing them took 1 week, versus an estimated 6 months to write them by hand.
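
One thing worth checking with generated policies: the default-deny policy shown earlier also blocks Egress, so an ingress rule on the destination does not by itself let the source connect out. A minimal sketch of the matching egress policy, assuming workloads are labeled `app=<workload>` as in the generator and that cluster DNS runs in `kube-system`; `egress_policy` is a hypothetical helper, not part of our tooling.

```python
import json


def egress_policy(namespace, source, dest, port):
    """Build an egress NetworkPolicy letting `source` reach `dest` and DNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{source}-egress-to-{dest}-{port}",
                     "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": source}},
            "policyTypes": ["Egress"],
            "egress": [
                {   # Allow the observed source -> dest connection
                    "to": [{"podSelector": {"matchLabels": {"app": dest}}}],
                    "ports": [{"protocol": "TCP", "port": int(port)}],
                },
                {   # Allow DNS lookups, which default-deny egress also blocks
                    "to": [{"namespaceSelector": {"matchLabels": {
                        "kubernetes.io/metadata.name": "kube-system"}}}],
                    "ports": [{"protocol": "UDP", "port": 53},
                              {"protocol": "TCP", "port": 53}],
                },
            ],
        },
    }


policy = egress_policy("production", "web", "api", "8080")
print(json.dumps(policy, indent=2))  # kubectl apply -f accepts JSON as well as YAML
```

This is a starting point only; a mesh with sidecars may need further egress rules (for example to the control plane) that this sketch does not cover.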

Phase 3: Continuous Verification (The Game-Changer)

Device Posture Checks

Requirement: Only compliant devices can access internal resources.

Implementation: Osquery + Kolide integration:

# Device compliance policy
compliance_policy:
  required_checks:
    - name: "Disk Encryption"
      query: "SELECT encrypted FROM disk_encryption WHERE encrypted = 1"
      
    - name: "OS Patches"
      query: "SELECT * FROM os_version WHERE major > 10 OR (major = 10 AND minor > 15) OR (major = 10 AND minor = 15 AND patch >= 7)"
      
    - name: "Firewall Enabled"
      query: "SELECT * FROM alf WHERE global_state >= 1"
      
    - name: "No Jailbreak/Root"
      query: "SELECT * FROM system_info WHERE hardware_vendor != 'unknown'"
      
    - name: "Antivirus Running"
      query: "SELECT * FROM processes WHERE name LIKE '%antivirus%'"

Enforcement: Devices failing checks → blocked from accessing internal resources.

Exceptions: Emergency access with additional verification + automatic ticket for remediation.
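
The enforcement and exception rules can be expressed as a small function. A sketch, assuming each check is reduced to the number of rows its osquery query returned (at least one row means the check passed); `evaluate_device` and the result shape are hypothetical, not the Kolide API.

```python
REQUIRED_CHECKS = ["Disk Encryption", "OS Patches", "Firewall Enabled",
                   "No Jailbreak/Root", "Antivirus Running"]


def evaluate_device(row_counts, emergency_approved=False):
    """Decide access from per-check osquery row counts.

    A check that is missing from row_counts is treated as failed.
    """
    failed = [name for name in REQUIRED_CHECKS if row_counts.get(name, 0) < 1]
    if not failed:
        return {"access": "allow", "failed": [], "open_ticket": False}
    if emergency_approved:
        # Emergency access is granted, but a remediation ticket is opened
        return {"access": "emergency", "failed": failed, "open_ticket": True}
    return {"access": "block", "failed": failed, "open_ticket": False}


all_pass = {name: 1 for name in REQUIRED_CHECKS}
failing = dict(all_pass, **{"Disk Encryption": 0})

print(evaluate_device(all_pass)["access"])                          # allow
print(evaluate_device(failing)["access"])                           # block
print(evaluate_device(failing, emergency_approved=True)["access"])  # emergency
```

Treating a missing check as a failure means a device that stops reporting loses access instead of keeping it.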

Session Monitoring & Threat Response

We implemented continuous session monitoring:

# Real-time session risk scoring
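class SessionRiskScorer:
    def __init__(self, known_locations):
        # Locations previously seen for legitimate sessions
        self.known_locations = known_locations

    def calculate_risk_score(self, session):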
        risk = 0
        
        # Location risk
        if session.location not in self.known_locations:
            risk += 30
        if session.location.country != session.user.home_country:
            risk += 20
            
        # Device risk  
        if not session.device.is_managed:
            risk += 40
        if session.device.posture_status != 'healthy':
            risk += 30
            
        # Behavior risk
        if session.access_pattern != session.user.normal_pattern:
            risk += 25
        if session.accessed_sensitive_resource:
            risk += 15
            
        # Time risk
        if session.time not in session.user.normal_hours:
            risk += 10
            
        return min(risk, 100)
        
    def take_action(self, session, risk_score):
        if risk_score >= 80:
            # High risk: terminate session immediately
            self.terminate_session(session)
            self.alert_security_team(session, 'HIGH')
            
        elif risk_score >= 60:
            # Medium risk: require re-authentication
            self.require_step_up_auth(session)
            self.alert_security_team(session, 'MEDIUM')
            
        elif risk_score >= 40:
            # Low risk: increase monitoring
            self.increase_monitoring(session)

Real-world example: Detected compromised developer account within 4 minutes:

  1. Login from unusual location (New Jersey, user normally in California)
  2. Access to resources user rarely touches (database admin panel)
  3. Unusual time (2:47 AM, user normally 9-5)
  4. Risk score: 87/100
  5. Action: Session terminated, security team alerted, user contacted

Outcome: Prevented breach. User confirmed account compromise. Credentials rotated.

The Performance Impact: Our Biggest Surprise

We expected Zero Trust to slow things down. We were wrong.

Latency Measurements

Before Zero Trust (VPN-based):

  • VPN handshake: 1,200-2,400ms
  • Average request latency: 450ms
  • p99 latency: 2,100ms

After Zero Trust (identity-aware proxy + service mesh):

  • Identity verification: 8-15ms (cached identity)
  • Average request latency: 180ms
  • p99 latency: 420ms

Result: Average request latency fell 60% (450ms → 180ms) and p99 latency fell 80% (2,100ms → 420ms) compared with VPN-based access. Stunned everyone.

Why faster?

  1. No VPN overhead (direct access through cloud edge)
  2. Optimized routing (service mesh load balancing)
  3. Connection pooling (persistent mTLS connections)
  4. Regional edge deployment (identity verification at edge)
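
The latency figures above work out as follows. A small sketch of the arithmetic, using only the before and after numbers from the measurements; `percent_reduction` is an illustrative helper.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


print(round(percent_reduction(450, 180)))    # average latency: 60
print(round(percent_reduction(2100, 420)))   # p99 latency: 80
```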

The Security Improvements: Quantified

Metrics After 6 Months

Lateral Movement:

  • Before: Attacker moved from wiki → database in 18 minutes
  • After: Blocked by default - every service requires authentication and authorization

Breach Detection:

  • Before: Average detection time 4.2 hours
  • After: Average detection time 4 minutes (98% improvement)

Attack Surface:

  • Before: 307 services with no authentication
  • After: 0 services without authentication

Credential Exposure:

  • Before: 441 hardcoded credentials in code
  • After: 0 credentials in code (all externalized to secrets management)

Phishing Resilience:

  • Before: Compromised laptop = compromised network
  • After: Compromised laptop + device posture + MFA = limited blast radius

Cost Impact

Security infrastructure:

  • Identity-aware proxy: $45K/year
  • Service mesh: $120K/year (mostly personnel)
  • Secrets management: $80K/year
  • Device management: $85K/year
  • Total: $330K/year

Breaches prevented (calculated risk):

  • Previous breach cost: $8.7M (incident response, customer notification, regulatory fines)
  • Estimated breach probability without Zero Trust: 60% over 3 years
  • Estimated breach probability with Zero Trust: 5% over 3 years
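  • Risk reduction value: about $4.8M over 3 years ($8.7M × (60% − 5%))

ROI: roughly 380% over 3 years (about $4.8M of avoided loss against $990K of spend at $330K/year).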
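
The calculation can be reproduced from the inputs listed above. A sketch, assuming expected loss is breach cost times breach probability, that the $330K/year run rate holds for all three years, and that ROI means net benefit divided by total spend; one-time migration costs are not included, and other ROI definitions (such as benefit over a single year of spend) give larger figures.

```python
breach_cost = 8_700_000          # previous breach cost
p_without, p_with = 0.60, 0.05   # breach probability over 3 years
annual_cost = 330_000            # security infrastructure per year
years = 3

# Expected loss avoided = cost of a breach x reduction in its probability
avoided_loss = breach_cost * (p_without - p_with)
total_cost = annual_cost * years
roi_percent = (avoided_loss - total_cost) / total_cost * 100

print(f"Expected loss avoided: ${avoided_loss:,.0f}")
print(f"Three-year cost:       ${total_cost:,.0f}")
print(f"ROI:                   {roi_percent:.0f}%")
```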

The Organizational Challenge: People > Technology

Engineer Resistance

Initial reaction: “This is security theater that will slow us down!”

Our approach:

  1. Demo the breach: Showed engineers how the attack happened
  2. Measure performance: Proved Zero Trust was actually faster
  3. Improve DevEx: Made security transparent to developers
  4. Listen to feedback: Iterated on MFA policies based on user experience

Result: 89% engineer satisfaction (up from 34% initially).

Security Culture Shift

Before: “Security is IT’s problem”

After: “Security is everyone’s responsibility”

How we shifted culture:

  1. Security champions: Embedded security engineers in product teams
  2. Blameless post-mortems: Focused on systems, not individuals
  3. Automated security: Made secure defaults the easy path
  4. Education: Monthly security training, gamified learning

Lessons for Teams Considering Zero Trust

✅ Do This:

  1. Start with identity: SSO + MFA everywhere, before anything else
  2. Measure first: Monitor before enforcing policies
  3. Progressive rollout: PERMISSIVE → monitor → STRICT
  4. Automate policy generation: Don’t manually create 2,000+ policies
  5. Focus on user experience: Security that frustrates users will be bypassed

❌ Don’t Do This:

  1. Big bang migration: You will break things
  2. Skip monitoring phase: You don’t know what you don’t know
  3. Ignore user feedback: If MFA is annoying, people will find workarounds
  4. Forget about devices: Compromised devices = compromised sessions
  5. Neglect incident response: Zero Trust detects threats, but humans respond

What’s Next?

We’re now exploring:

  1. Workload identity: Extend Zero Trust to service-to-service communication
  2. API gateway integration: Zero Trust for external APIs
  3. AI-powered threat detection: ML models for anomaly detection
  4. Zero Trust for data: Encrypt data at rest with access policies

Zero Trust transformed our security posture from “perimeter defense” to “assume breach.” It’s harder to implement than traditional security, but the risk reduction is worth every hour.

For more on Zero Trust architecture patterns, see the comprehensive cloud security guide that helped inform our strategy.

