We Recovered a Compromised AWS Account in Under 24 Hours: Here's the 14-Step Lockdown Protocol


⚡ Incident Response · DevSecOps


AWS estimated 4 days. We restored full operations in under 24 and hardened the environment in 48 — a protocol later presented at AWS re:Inforce. This is the playbook, with the actual IAM policies, CloudTrail queries, and S3 lockdown patterns.

By Brian H. Hough · AWS DevTools Hero · 14 min read · Security · IaC · IAM
console · cloudtrail-alert.sh — ssh admin@tsp-incident-response
$ aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=root
[2026-04-19 02:14:07 UTC] Retrieving CloudTrail events...
⚠ ALERT: Root account login detected — source: 185.220.101.47 (Tor exit node)
⚠ ALERT: 247,382 SES SendEmail API calls in past 4 hours
⚠ ALERT: 14 S3 PutObject events to s3://client-redacted/malware/
⚠ AWS Account Suspension triggered by AWS Trust & Safety
[2026-04-19 02:14:11 UTC] AWS estimated resolution: 4 business days
$ page tsp-security-oncall --severity=P0
✔ Paged. Response team mobilizing.
<24h Full Recovery · 14 Lockdown Steps · 75% Faster Than AWS Est. · re:Inforce Protocol Published
TL;DR

An executive advisory firm's AWS root account was compromised — attacker launched a quarter-million-email abuse campaign, uploaded malware to S3, and got the account suspended. Tech Stack Playbook's 48-hour incident response restored operations in under 24 hours against AWS's 4-day estimate.

The lockdown protocol that followed — 14 steps, from root-credential elimination to JWT-gated API re-architecture — is below. Every step includes the specific controls we'd ship today. Presented at AWS re:Inforce as a reference framework for developer-led incident response.

01 / ANATOMY · How the Breach Actually Unfolded

The breach followed the oldest playbook in cloud security: unprotected root credentials, no MFA, broad blast radius. What made it memorable was the speed of escalation — from first unauthorized API call to full account suspension in under four hours. We've seen plenty of security engagements; this one was instructive because every step the attacker took exploited a default that should never have been the default.

◉ Attack Escalation — T+0h to T+4h

- Stage 01 · Root Key Compromise: Unsecured credentials exposed. No MFA gate.
- Stage 02 · SES Abuse Campaign: 247K+ unauthorized emails in one evening.
- Stage 03 · S3 Malware Upload: Payloads dropped in 14 buckets. Public ACLs.
- Stage 04 · Account Suspended: AWS Trust & Safety pulls the plug.

The tell at Stage 02 was volume: legitimate SES usage for this workload averaged ~400 emails per day. A CloudTrail lookup across a 4-hour window surfaced the divergence immediately — and it's the first query we run in any modern incident response engagement:

bash
# Surface SES SendEmail volume per source IP over the last 4 hours
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=SendEmail \
  --start-time "$(date -u -d '4 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Events[].CloudTrailEvent' \
  --output text \
  | grep -o '"sourceIPAddress": *"[^"]*"' \
  | sort | uniq -c | sort -rn

02 / TIMELINE · The 48-Hour Recovery Arc

Each checkpoint in the 48-hour recovery arc was a real decision point, the kind where most incident responses either compound or contain.

◉ Recovery Timeline
T+24:00h · Full Operations Restored

Platform back online. All client-facing APIs serving traffic. Root credentials retired, initial MFA controls live, S3 bucket policies audited and locked. 75% ahead of AWS's stated recovery estimate.

03 / PROTOCOL · The 14-Step Security Lockdown

With operations restored, the real work began: hardening the environment so the same class of breach couldn't happen twice. Each step below lists the specific controls, policies, and code we'd ship today. This protocol pulls directly from lessons learned on our multi-account AWS governance work and adjacent DevSecOps engagements.

Step 01 · Root Account Lockdown

The root account is not an operator account. First 60 minutes: rotate the root password, force-retire every access key under root, enable MFA (hardware preferred), and enforce an SCP that denies root usage for anything except the handful of tasks that genuinely require it (billing signup, account close, root-only service endpoints).

json · SCP
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyRootUserActions",
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
      "StringLike": {
        "aws:PrincipalArn": "arn:aws:iam::*:root"
      }
    }
  }]
}

Step 02 · Federated Access, Short-Lived Credentials

Long-lived IAM users are an anti-pattern in 2026. Every human operator gets federated access through IAM Identity Center (formerly AWS SSO) with short-lived credentials. Every remaining IAM user gets deleted or disabled. Every long-lived access key gets rotated to a role-based assumption pattern.

bash · audit
# Find every active access key older than 90 days
# (ISO-8601 timestamps compare correctly as strings in the CLI's JMESPath)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%SZ)
aws iam list-users --query 'Users[*].UserName' --output text \
  | tr '\t' '\n' \
  | xargs -I {} aws iam list-access-keys --user-name {} \
      --query "AccessKeyMetadata[?Status=='Active' && CreateDate<='${CUTOFF}']" \
      --output table

Step 03 · Least-Privilege Rewrite

The attacker had AdministratorAccess because the compromised identity had it. Rewrite every policy to grant the minimum set of actions on the minimum set of resources. Restrict IAM creation itself to a small, named admin group — protected by both MFA and an SCP boundary.

Fastest path: use IAM Access Analyzer's policy generation to diff intended vs. actual privileges, then prune.
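To make "minimum set of actions on the minimum set of resources" concrete, here is a sketch of what a pruned policy might look like for a hypothetical notification service. The account ID, SES identity, and bucket ARN are placeholders, not the client's actual resources:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "NotifierMinimumPrivileges",
    "Effect": "Allow",
    "Action": ["ses:SendEmail", "s3:GetObject"],
    "Resource": [
      "arn:aws:ses:us-east-1:111122223333:identity/example.com",
      "arn:aws:s3:::example-templates-bucket/*"
    ]
  }]
}
```

No wildcard actions, no wildcard resources: if this identity is compromised, the blast radius is one sending domain and one read-only bucket prefix.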

Step 04 · S3 Lockdown

Enable Block Public Access at the account level, not just per-bucket. Audit every bucket policy for "Principal": "*" and kill it on sight. Every object upload triggers a malware scan (GuardDuty Malware Protection or an S3 EventBridge → Lambda → ClamAV pipeline).

bash · block public
aws s3control put-public-access-block \
  --account-id $AWS_ACCOUNT_ID \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,\
BlockPublicPolicy=true,RestrictPublicBuckets=true

Step 05 · Network Isolation and JWT-Gated APIs

Databases move to private subnets with no public ingress. API Gateway endpoints enforce JWT authorization at the edge, validated against a managed auth layer (Cognito or equivalent). Client data is never served through an unauthenticated path.
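On API Gateway HTTP APIs, the edge JWT control is a single authorizer attachment. A sketch, where the API ID, authorizer name, audience, and Cognito user pool issuer are all placeholder values:

```shell
# Attach a JWT authorizer to an HTTP API (placeholder IDs and issuer).
# Requests without a valid, correctly-issued token are rejected at the
# gateway — before any backend compute runs.
aws apigatewayv2 create-authorizer \
  --api-id a1b2c3d4 \
  --name client-jwt-authorizer \
  --authorizer-type JWT \
  --identity-source '$request.header.Authorization' \
  --jwt-configuration Audience=my-app-client-id,Issuer=https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE
```

Routes then reference the authorizer ID, so an unauthenticated path has to be created deliberately rather than existing by default.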

Step 06 · SES Sending Controls

Sandbox SES. Explicit sending authorization policies. Configuration sets with per-send quotas and bounce/complaint tracking. Anomaly detection on send volume (in this case, anything above 2× rolling 7-day baseline pages on-call immediately).
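The paging rule itself is just arithmetic, and can be sketched in plain shell. The counts below are made-up stand-ins for values you would pull from CloudWatch's SES Send metric:

```shell
#!/usr/bin/env bash
# Page on-call when today's SES send count exceeds 2x the rolling
# 7-day average. Counts are hypothetical stand-ins for CloudWatch data.
baseline_counts=(380 412 395 401 388 420 405)   # last 7 days of sends
today=1650                                      # today's send count

sum=0
for c in "${baseline_counts[@]}"; do sum=$((sum + c)); done
avg=$((sum / ${#baseline_counts[@]}))
threshold=$((avg * 2))

if [ "$today" -gt "$threshold" ]; then
  echo "ALERT: $today sends exceed 2x baseline ($threshold) — page on-call"
else
  echo "OK: $today sends within baseline"
fi
```

In the incident above, the attacker's four-hour burst was three orders of magnitude over baseline; even a crude threshold like this would have paged within minutes.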

Step 07 · Organization-Wide CloudTrail

Organization-wide CloudTrail, multi-region, log file validation enabled, delivered to a dedicated log-archive account with S3 Object Lock in compliance mode. If an attacker can rewrite your audit trail, you don't have an audit trail.
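A sketch of the trail setup from the management (or delegated admin) account. The trail and bucket names are placeholders, and the bucket is assumed to already exist in the log-archive account with Object Lock enabled and the required bucket policy:

```shell
# Organization trail: every account, every region, tamper-evident logs.
# Bucket lives in a separate log-archive account (placeholder name).
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name example-org-log-archive \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation

aws cloudtrail start-logging --name org-audit-trail
```

Log file validation generates signed digest files, so deleted or rewritten log files are detectable after the fact — which is exactly the property an attacker with account access wants to deny you.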

Step 08 · GuardDuty and Security Hub

Continuous threat detection via GuardDuty. Findings aggregated in Security Hub across every account. EventBridge rules route high-severity findings to PagerDuty with a documented on-call rotation. No silent findings.
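The routing rule can be sketched with EventBridge's numeric matching on finding severity. The rule name and SNS target ARN are placeholders:

```shell
# Route high-severity GuardDuty findings (severity >= 7) to on-call.
# The SNS topic (placeholder ARN) feeds the paging integration.
aws events put-rule \
  --name guardduty-high-severity \
  --event-pattern '{
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": { "severity": [{ "numeric": [">=", 7] }] }
  }'

aws events put-targets \
  --rule guardduty-high-severity \
  --targets 'Id=oncall,Arn=arn:aws:sns:us-east-1:111122223333:oncall-topic'
```

Low-severity findings still land in Security Hub for the quarterly review; only the ones that can't wait cross the paging threshold.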

Step 09 · KMS Encryption Everywhere

Customer-managed KMS keys for every tier of sensitive data. Key policies scoped to specific IAM principals. Automatic annual rotation. Client transcripts, PII, and payment data never stored in plaintext anywhere.
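Provisioning one such key tier is a short sequence; the description and alias below are placeholders:

```shell
# Create a customer-managed key for one data tier and turn on
# automatic annual rotation (alias and description are placeholders).
KEY_ID=$(aws kms create-key \
  --description "Client transcripts - tier 1" \
  --query KeyMetadata.KeyId --output text)

aws kms enable-key-rotation --key-id "$KEY_ID"

aws kms create-alias \
  --alias-name alias/transcripts-tier1 \
  --target-key-id "$KEY_ID"
```

Applications reference the alias, so key rotation and even key replacement never touch application config.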

Step 10 · Secrets Management

No plaintext secrets in env vars, config files, or repos. Every credential lives in Secrets Manager with Lambda-backed rotation (we default to 58-day cycles to stay comfortably inside a 60-day rotation policy). Applications pull secrets at runtime, not at deploy.
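Once the rotation Lambda exists, wiring the cycle is one call. The secret name and function ARN below are placeholders:

```shell
# Enable automatic rotation on a 58-day cycle
# (secret ID and Lambda ARN are placeholders).
aws secretsmanager rotate-secret \
  --secret-id prod/db/credentials \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:111122223333:function:rotate-db-creds \
  --rotation-rules AutomaticallyAfterDays=58
```

The call also triggers an immediate rotation, which doubles as a smoke test that the Lambda and its permissions actually work.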

Step 11 · Infrastructure as Code

Every infrastructure change lands through Terraform/Terragrunt with peer review. Console access is read-mostly. Drift detection runs nightly. This is the same IaC-first pattern we implemented during our multi-account AWS modernization — it pays compounding dividends the minute an incident happens.
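The nightly drift check can be as simple as Terraform's detailed exit code; a minimal sketch:

```shell
# Nightly drift check: with -detailed-exitcode, `terraform plan`
# exits 0 when live state matches code and 2 when it has drifted.
terraform plan -detailed-exitcode -no-color -input=false > /dev/null
case $? in
  0) echo "no drift" ;;
  2) echo "DRIFT DETECTED: paging on-call" ;;
  *) echo "plan failed" ;;
esac
```

Drift after an incident is especially telling: anything the console shows that the code doesn't is either an emergency fix that needs to be codified, or a change nobody made on purpose.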

Step 12 · AWS WAF at the Edge

AWS WAF in front of every internet-facing API with managed rule groups (OWASP Top 10, bad bots, known-bad IPs) and custom rate-limit rules scoped per authenticated identity.

Step 13 · Incident Runbooks

The first IR engagement is always the most expensive, because there's no runbook. We left the client with documented runbooks for root compromise, data exfiltration, SES abuse, and account suspension — plus quarterly tabletop drills that test them.

Step 14 · Quarterly Security Reviews

Security is never done. Quarterly reviews walk the full attack surface, audit new IAM policies added since last cycle, re-validate that GuardDuty/Security Hub have no stale findings, and update the threat model as the platform evolves.

◉ Key Insight

Client coaching transcripts — private strategy sessions with nine- and ten-figure entrepreneurs — had been stored in publicly accessible S3 buckets and served through unauthenticated API endpoints. The breach was the forcing function. The lockdown was the product.

04 / ARCHITECTURE · Before & After — What Actually Changed

The post-lockdown architecture isn't exotic. It's what a security-first AWS account looks like when every default has been explicitly hardened instead of inherited.

BEFORE · Pre-Breach Architecture

- IAM: root keys active, no MFA
- S3 Buckets: public ACLs, no scanning
- SES: out of sandbox, no quota
- CloudTrail: single region, no validation
- API Gateway: unauthenticated endpoints

AFTER · Post-Lockdown Architecture

- IAM Identity Center: SSO, short-lived creds
- S3 + BPA: Block Public Access + GuardDuty scanning
- SES + Quotas: sandboxed, rate-limited
- Org CloudTrail: multi-region, Object Lock
- API + JWT + WAF: authenticated, rate-limited

05 / OUTCOMES · What Shipped

- <24h · Recovery Time: Full platform restored against AWS's 4-day estimate, minimizing downtime for ultra-high-net-worth clientele.
- 14/14 · Lockdown Complete: All hardening steps implemented and validated within 48 hours of initial engagement.
- 0 · Exposed Transcripts: Previously public client data locked down with KMS envelope encryption and JWT-gated delivery.
- re:Inforce · Methodology Published: Protocol later presented at AWS re:Inforce as a framework for developer-led incident response.

Stack

IAM Identity Center · CloudTrail · GuardDuty · Security Hub · S3 Block Public Access · KMS · Secrets Manager · AWS WAF · JWT / OIDC · Terraform · Terragrunt · CloudWatch

06 / TAKEAWAY · Why Developer-Led IR Beats a Ticket Queue

AWS's 4-day estimate isn't a judgment on AWS — it's a function of the generic queue every support request sits in. Developer-led incident response is faster because the people triaging the breach are the same people who can rewrite an SCP, ship a JWT middleware, and redeploy infrastructure as code before lunch. That's the difference between a ticket and a response team. It's also the argument for having security-minded engineers embedded before anything goes wrong, not after.

If you're reading this and thinking about your own root account MFA status — that's the right instinct. Go check right now. We'll wait.

Need an incident-response-ready AWS foundation?

We partner with enterprise teams to design, build, and ship production-grade cloud systems that are secure-by-design from day one — so breaches are contained, not catastrophic.

Book a strategy call  
Explore more
