Session log — 2026-04-22 / 2026-04-23 UTC — Phase 8 prod AWS mirror

Scope

Build the entire AWS prod stack in askflorence-prod (039624954211), mirroring staging 1:1 with prod-scoped values + HA. Get the current Next.js app (consumer marketplace, agent flows, APIs) running behind a CloudFront + WAF edge on a private canary URL (prod-canary.askflorence.health) for end-to-end validation before Phase 10 cutover. Vercel prod (askflorence.health + www) continues to serve real user traffic throughout. The apex does not move in this phase.

Actor

Human: Taha Abbasi.
Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.

Tickets

Advances Issue #47 Phase 8.
Identifies a pre-existing prod Vercel bug worth a follow-up: MONGODB_WRITE_URI on Vercel prod is set to an empty string, so writes from Vercel have been failing for ~6 days. Taha sets the new URI on Vercel post-session to restore Vercel writes until cutover.
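
An empty-string value like this is easy to miss because the variable still exists. A minimal sketch of a pre-flight check, assuming the env has been pulled into a dotenv-style file (for example by vercel env pull). The function name find_empty_vars and the sample variable MONGODB_URI are hypothetical, chosen for illustration, and are not part of the repo:

```python
def find_empty_vars(dotenv_text: str) -> list[str]:
    """Return the names of variables whose value is empty once quotes are stripped."""
    empty = []
    for raw in dotenv_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not NAME=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        value = value.strip()
        # Strip one layer of matching single or double quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        if value == "":
            empty.append(name.strip())
    return empty


# MONGODB_URI is a hypothetical neighbour; MONGODB_WRITE_URI mirrors the bug found.
sample = 'MONGODB_URI="mongodb+srv://example"\nMONGODB_WRITE_URI=""\n'
print(find_empty_vars(sample))
```

Run against the pulled env file before populating secrets, a non-empty result would flag the same condition this session found by hand.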

External systems touched

AWS (prod account `039624954211`)

Network module applied — VPC 10.20.0.0/16, 2 AZs (us-east-1a/b), 2 NAT gateways (HA vs staging's 1), 6 VPC endpoints multi-AZ (kms, secretsmanager, bedrock-runtime interface + S3 gateway + ECR api/dkr), 90-day flow-log retention. Disjoint from staging (10.40.0.0/16) and the future log-archive CIDR for eventual org-wide peering.
KMS — new CMK alias/askflorence-prod, rotation on, 30-day deletion window. Same service-principal grants as staging (Secrets Manager + CloudWatch Logs).
Secrets Manager — 15 prod shells under prod/*. Populated during the session:
- prod/mongodb/app-read — from Vercel prod env (the app-read user's SRV URI).
- prod/mongodb/app-write — freshly generated URI using a rotated app-write password (via atlas dbusers update). Safe rotation because Vercel's MONGODB_WRITE_URI was empty, so nothing on Vercel currently used app-write credentials.
- prod/mongodb/{waitlist,survey,agents}-write + prod/mongodb/agents-admin — stopgap-populated with the same app-write URI until a follow-up #56 prod session creates narrow-scoped users on the prod Atlas project.
- prod/mongodb/audit-read — populated with the app-read URI.
- prod/cms-api-key + prod/resend-api-key + prod/unsubscribe-token-secret + prod/posthog-key — copied from Vercel prod env.
- Florence / Bedrock / Whisper shells left as PLACEHOLDER.
ACM cert — askflorence.health + www.askflorence.health + *.askflorence.health in us-east-1 (required for CloudFront). DNS validation via 2 CNAMEs Taha added at Cloudflare. Status ISSUED in ~3 min after records landed.
SES identity — updates.askflorence.health verified (6 records added at Cloudflare by Taha: 3 DKIM CNAMEs + MX for MAIL FROM + SPF TXT + DMARC TXT p=quarantine). DKIM SUCCESS, MAIL FROM SUCCESS, VerifiedForSending true. SES account still in sandbox (production-access request filed separately).
ECR — askflorence-app repo with immutable tags (prod-strict — each tag can only be written once, no :latest drift), scan-on-push, 50-image lifecycle retention, CMK-encrypted.
ECS — cluster askflorence-prod, task definition family askflorence-prod-app-task (0.5 vCPU / 1 GB per task), service askflorence-prod-app desired 2 (HA across AZs), SES-send inline policy on task role, 90-day CloudWatch Logs retention.
ALB — askflorence-prod-alb-1177205004.us-east-1.elb.amazonaws.com. HTTPS listener with the prod cert, HTTP→HTTPS redirect. Deletion protection ON (prevents accidental terraform destroy from nuking the hostname CloudFront points at).
CloudFront distribution E9RU8LOGSYL9I (d1pnfyzua893hx.cloudfront.net). Serves 3 aliases: askflorence.health, www.askflorence.health, prod-canary.askflorence.health. PriceClass_All, HTTP/2+HTTP/3, TLSv1.2_2021, the same WAFv2 rule set used on staging (5 managed groups + rate rule 2000 req/5min/IP). Same response-headers policy — HSTS + CSP + X-Frame-Options DENY + Server override to AskFlorence.
Atlas prod peering pcx-0cefe999865679045 — Atlas-initiated, accepted on AWS, AllowDnsResolutionFromRemoteVpc=true on accepter side, routes added in both prod private route tables (192.168.248.0/21 → pcx). Atlas IP access list adds 10.20.0.0/16. Legacy 0.0.0.0/0 entry tagged dev stays in place until Phase 10 cutover (Vercel still needs reachability).
deploy-prod.yml GitHub Actions workflow — manual workflow_dispatch trigger (GitHub Team plan doesn't support required-reviewers on private-repo environments; workflow_dispatch is the approval surrogate). OIDC federation to arn:aws:iam::039624954211:role/GitHubActionsDeployRole. Smokes origin.askflorence.health/api/health (direct ALB, not CloudFront) because WAF managed rule groups false-positive-block GitHub Actions runner IPs on the prod-canary.* path.
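
The network entry above relies on the prod VPC, the staging VPC, and the Atlas peer range being disjoint. A small sketch of that check using Python's standard ipaddress module. The CIDRs are the ones recorded in this log; the dictionary labels and the helper name overlapping_pairs are illustrative, and the future log-archive CIDR is left out because it is not stated here:

```python
from ipaddress import ip_network
from itertools import combinations

# CIDRs as recorded in this log; the labels are illustrative only.
CIDRS = {
    "prod-vpc": ip_network("10.20.0.0/16"),
    "staging-vpc": ip_network("10.40.0.0/16"),
    "atlas-prod": ip_network("192.168.248.0/21"),
}


def overlapping_pairs(cidrs):
    """Return every pair of labels whose networks share at least one address."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(cidrs.items(), 2)
        if net_a.overlaps(net_b)
    ]


# An empty list means the ranges are safe to peer with each other.
print(overlapping_pairs(CIDRS))
```

Adding a new range to CIDRS before allocating it is a cheap way to catch a collision ahead of any peering work.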

Cloudflare (zone `askflorence.health`)

DNS records added manually by Taha during the session. All DNS-only (proxy OFF):

2 × CNAME for ACM validation (_<hex>.askflorence.health, _<hex>.www.askflorence.health)
3 × CNAME for SES DKIM (<token>._domainkey.updates)
1 × MX + 1 × TXT for SES MAIL FROM (mail.updates)
1 × TXT for DMARC (_dmarc.updates)
1 × CNAME origin.askflorence.health → ALB DNS (CloudFront origin handshake target)
1 × CNAME prod-canary.askflorence.health → CloudFront d1pnfyzua893hx.cloudfront.net (private canary URL for validation)

Apex + www CNAMEs stay pointed at Vercel through this phase. Phase 10 is the swap.

MongoDB Atlas (prod project `69dc20c64005b222804dafa4`)

Peering connection from prod project to the new prod AWS VPC. Added project-level route to 10.20.0.0/16 in prod VPC.
app-write user password rotated via atlas dbusers update — because Vercel's MONGODB_WRITE_URI is empty (pre-existing Vercel bug), nothing consumer-facing was using the old password so rotation is a pure no-op from Vercel's perspective.
IP access list — added 10.20.0.0/16. Kept the legacy 0.0.0.0/0 entry (tag dev) because Vercel still serves real traffic and uses unpredictable egress IPs.
Prod cluster data — untouched.
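
Why the peering route matters can be sketched as a longest-prefix match over a private route table. This is an assumed shape, not an export of the real tables: only the 192.168.248.0/21 → pcx route is taken from this log, while the local route and the default route to a NAT gateway (labelled "nat-gateway" here) are assumptions about a typical private subnet. The destination addresses are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Assumed private route table. Only the peering route is taken from the log.
ROUTES = [
    (ip_network("10.20.0.0/16"), "local"),
    (ip_network("192.168.248.0/21"), "pcx-0cefe999865679045"),
    (ip_network("0.0.0.0/0"), "nat-gateway"),
]


def next_hop(dest: str) -> str:
    """Pick the route with the longest prefix that contains dest."""
    addr = ip_address(dest)
    matches = [(net, target) for net, target in ROUTES if addr in net]
    _, target = max(matches, key=lambda m: m[0].prefixlen)
    return target


print(next_hop("192.168.248.10"))  # hypothetical Atlas-side address
print(next_hop("8.8.8.8"))         # arbitrary public address
```

Under these assumptions, traffic to the Atlas range resolves to the peering connection rather than the default route, which is the path the waitlist write in Verification is expected to take.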

Vercel

Not touched. Apex DNS unchanged, env vars unchanged. Vercel continues serving real production traffic through this phase.

GitHub

Repo variables NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN + NEXT_PUBLIC_POSTHOG_HOST already set during Phase 5.5 — the prod workflow reuses them.
No production environment protection rule created — GitHub Team plan limits required-reviewers to public repos. workflow_dispatch is the approval gate instead.

What shipped (chronological)

Terraform scaffolding applied in prod. Phase 3 had already planted versions.tf, providers.tf, github-oidc.tf for the prod env — they came up clean on terraform plan. Added network.tf, kms.tf.
terraform apply 1 — 28 resources: VPC + subnets + NAT HA + 6 VPC endpoints + KMS CMK + flow logs + IGW + RTs.
terraform apply 2 — 31 resources: 15 Secrets Manager shells + ACM cert (request only; validation pending DNS). Paused for Taha to add ACM validation CNAMEs at Cloudflare.
SES identity applied alongside ACM. Paused again for Taha to add the 6 SES records at Cloudflare (3 DKIM CNAMEs + MX + SPF TXT + DMARC TXT).
Polled ACM + SES status until all four signals green (ACM ISSUED, DKIM SUCCESS, MAIL FROM SUCCESS, VerifiedForSending true). ~5 min total propagation after DNS.
terraform apply 3 — 24 resources: ECR + ECS cluster/task-def/service (desired=0) + ALB + CloudFront distribution (slow first-create) + WAFv2 web ACL + response-headers policy + CloudWatch log groups.
Secrets populated with values pulled from vercel env pull. Discovered MONGODB_WRITE_URI="" on Vercel; rotated app-write password via Atlas CLI, populated prod/mongodb/app-write, applied the same URI to the 4 narrow-scoped secrets ({waitlist,survey,agents}-write + agents-admin) as a stopgap until #56 prod session.
Atlas prod peering handshake via atlas networking peering create aws + aws ec2 accept-vpc-peering-connection. Routes added in both private RTs, allowlist entry added. Terraform-imported the accepter + routes into peering.tf.
deploy-prod.yml workflow written and pushed to main. First invocation ran, image built, ECS deployed, tasks came up, smoke step blocked by WAF on the runner IP (false-positive from AnonymousIpList/AmazonIpReputationList). Fix: switched smoke target from prod-canary.* (CloudFront + WAF) to origin.* (direct ALB). Second invocation failed on ECR immutable-tag rejection of :latest. Fix: stopped pushing :latest on prod + switched buildx cache from inline-in-ECR to GHA cache backend. Third invocation fully green.
Full canary validation — all endpoints serve correctly on prod-canary.askflorence.health with parity against Vercel responses; Mongo write over the peered network path succeeds with a real waitlist_submission_id; WAF blocks SQLi probes; CloudFront security headers + server override all correct.
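
The ACM + SES polling step in the list above can be sketched as a generic wait loop. This is a sketch under assumptions, not the script used in the session: the four target values (ISSUED, SUCCESS, SUCCESS, true) come from this log, but the dictionary keys, the wait_until_green name, and the fetch callable are hypothetical. In practice fetch would wrap whatever AWS CLI or SDK calls report the statuses.

```python
import time
from typing import Callable, Dict

# Target values are from the log; the key names are hypothetical.
EXPECTED = {
    "acm_status": "ISSUED",
    "dkim_status": "SUCCESS",
    "mail_from_status": "SUCCESS",
    "verified_for_sending": True,
}


def wait_until_green(
    fetch: Callable[[], Dict[str, object]],
    interval_s: float = 15.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Call fetch() until every expected signal matches, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch()
        pending = {k: state.get(k) for k, v in EXPECTED.items() if state.get(k) != v}
        if not pending:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still pending: {pending}")
        sleep(interval_s)


# Demo with a fake fetch that turns green on the second poll (no real waiting).
_fake = iter([{"acm_status": "PENDING_VALIDATION"}, dict(EXPECTED)])
print(wait_until_green(lambda: next(_fake), sleep=lambda _s: None))
```

Injecting fetch and sleep keeps the loop testable without touching AWS; the timeout keeps a mistyped DNS record from stalling the session indefinitely.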

Two gotchas worth preserving

(1) workflow_dispatch as approval gate on private Team-plan repos. The original plan had push: branches: [main] + a production GitHub environment with required-reviewers. Confirmed GitHub Team does not support required-reviewers on environments attached to private repos; that is an Enterprise ($21/user) feature. workflow_dispatch gives us the same "nothing deploys without Taha clicking a button" guarantee without a plan upgrade or making the repo public. The Vercel-era "release on merge" habit no longer applies: merging to main does not deploy prod.

(2) Immutable ECR tags + inline buildx cache are incompatible. --cache-to type=inline embeds the cache metadata in the image itself, so the cache only updates by pushing the image again under the same tag. Fine with staging's immutable_tags=false. On prod with immutable_tags=true, every cached rebuild attempts a tag rewrite and gets rejected. The resolution: ditch :latest entirely on prod (task defs pin :<sha>, so no one consumes :latest) and use GitHub Actions' layer cache (type=gha) instead. Side benefit: the GHA cache is scoped to the repository (not the account or org), so cached layers are not readable from other repos.
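
The failure mode in (2) can be illustrated with a toy in-memory model of a tag-immutable repository. This is not the ECR API: ToyRegistry, TagImmutableError, the tag names, and the digests are all hypothetical, and the model only rejects a tag that would move to a different digest (how a real registry treats an identical re-push is not modelled).

```python
class TagImmutableError(Exception):
    """Raised when a push would move an existing tag on an immutable repo."""


class ToyRegistry:
    """Toy model of an image repo with optional tag immutability."""

    def __init__(self, immutable: bool):
        self.immutable = immutable
        self.tags: dict[str, str] = {}

    def push(self, tag: str, digest: str) -> None:
        existing = self.tags.get(tag)
        if self.immutable and existing is not None and existing != digest:
            raise TagImmutableError(f"tag {tag!r} already exists")
        self.tags[tag] = digest


prod = ToyRegistry(immutable=True)
prod.push("sha-1111", "sha256:aaa")   # per-commit tag, written once
prod.push("latest", "sha256:aaa")     # first write of :latest succeeds
prod.push("sha-2222", "sha256:bbb")   # next build, new per-commit tag
try:
    prod.push("latest", "sha256:bbb")  # moving :latest is rejected
except TagImmutableError as err:
    print("rejected:", err)
```

Per-commit tags never collide, which is why pinning task definitions to :<sha> and dropping :latest sidesteps the problem entirely.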

Verification

From operator laptop, direct public internet, against https://prod-canary.askflorence.health (via Cloudflare CNAME → CloudFront edge → origin.askflorence.health CNAME → ALB → ECS):

GET /api/health → 200 {"status":"ok","commit":"a189041…","env":"prod"}
GET /api/counties?state=TX&zip=75001 → 200 identical JSON to Vercel prod (CMS proxy)
GET /api/counties?state=NY&zip=10001 → 200 identical JSON to Vercel prod (owned-data path)
POST /api/waitlist → 200 with waitlist_submission_id (Mongo write via peering — NAT never touched)
Response headers from CloudFront: server: AskFlorence, strict-transport-security: max-age=31536000; includeSubDomains; preload, x-frame-options: DENY, content-security-policy: …
GET /?id=1' OR '1'='1 → 403 blocked by WAF SQLiRuleSet
ECS state: desired 2, running 2, rollout COMPLETED, task def revision :4 (after the narrow-user secret populate + force-new-deployment)
CloudWatch aws-waf-logs-askflorence-prod-web-acl receiving WAF logs
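
The response-header check in the list above could be automated along these lines. A sketch only: the expected values are copied from this log, the content-security-policy value is elided in the log so only its presence is checked, and header_problems is a hypothetical helper name.

```python
# Expected values copied from the verification list in this log.
EXPECTED_HEADERS = {
    "server": "AskFlorence",
    "strict-transport-security": "max-age=31536000; includeSubDomains; preload",
    "x-frame-options": "DENY",
}
# The CSP value is elided in the log, so only presence is checked.
REQUIRED_PRESENT = ("content-security-policy",)


def header_problems(headers: dict[str, str]) -> list[str]:
    """Return human-readable mismatches; an empty list means all checks pass."""
    lowered = {k.lower(): v.strip() for k, v in headers.items()}
    problems = []
    for name, want in EXPECTED_HEADERS.items():
        got = lowered.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    for name in REQUIRED_PRESENT:
        if not lowered.get(name):
            problems.append(f"{name}: missing")
    return problems


# Hypothetical response with x-frame-options dropped, to show a failure.
print(header_problems({
    "Server": "AskFlorence",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
    "Content-Security-Policy": "default-src 'self'",
}))
```

The input can be built from any HTTP client's response headers, for example dict(resp.headers.items()) with urllib.request. Header names are lower-cased first because CloudFront and clients differ in casing.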

The SES send path exercised by the /api/waitlist flow returned the expected sandbox rejection: recipient taha@askflorence.health is verified in the staging SES account, not prod. Non-blocker: the app returns 200 because sendEmail is fire-and-forget; the code path is exercised and will work once SES production access is approved (or, while still sandboxed, once the recipient is verified on the prod side).

What this session does NOT do

Does not move production DNS. Cloudflare apex still points at Vercel.
Does not retire Vercel. Vercel continues to serve real users exactly as before.
Does not populate prod secrets that Florence + Bedrock + Whisper will eventually use — those shells stay as PLACEHOLDER until the relevant workloads ship.
Does not provision narrow-scoped prod Atlas users. As a stopgap, the narrow-scoped secrets (prod/mongodb/{waitlist,survey,agents}-write + prod/mongodb/agents-admin) point at the broad app-write URI. The proper scoped users land in a follow-up #56 prod session.
Does not remove the legacy 0.0.0.0/0 entry from the prod Atlas allowlist. Removing that would break Vercel right now. Phase 10 cutover is where it comes out.
Does not request SES production access from the prod account. Separate manual request; not blocking because nothing real sends email from the prod AWS stack yet.

Next phases

Phase 9 — canary bake. Real-ish synthetic traffic against prod-canary.askflorence.health for 48 h. Full audit tier 1-5 parity run. GuardDuty + Security Hub clean. Nothing in Phase 9 touches apex DNS.
Phase 10 — Cloudflare apex CNAME flip askflorence.health + www from Vercel edges to d1pnfyzua893hx.cloudfront.net. After 48 h of clean cutover: pull 0.0.0.0/0 from prod Atlas allowlist, retire Vercel prod.
Phase 11 — post-cutover hardening. Resend retirement, PostHog self-host/replace decision, Drata read-only role activation, annual pen test vendor selection.
Phase 12 — SOC 2 + HIPAA + EDE control mapping docs closed out against the operating state established from Phase 2 onward.