AskFlorence — System Architecture

Status: Living document. Last updated 2026-05-11. Version: 0.4.0 (post-Phase-11 architecture: AWS-only frontend + cross-cluster Atlas PrivateLink + Agent terminology). Recent changes: see the session log and infrastructure change log for timestamped history.

TL;DR

AskFlorence is an ACA marketplace with real-time subsidized pricing for the 30 federal-exchange states plus NY (own-data path). The 19 SBE states + DC are handled via redirect. A doctor + Rx coverage layer calls the CMS Marketplace API directly, with a non-PHI §1311 reference dataset on a separate Atlas cluster as fallback.

The whole platform runs on AWS commercial us-east-1. The frontend (askflorence.health, www.askflorence.health) is served from ECS Fargate behind CloudFront + WAFv2. Vercel was retired at the Phase 10 cutover (2026-04-23). Compliance posture: HIPAA-active (AWS Org BAA signed 2026-04-18 + Atlas HIPAA tier on the prod cluster); SOC 2 Type II evidence collection in progress (window through January 2027); CMS EDE Phase 3 submission targeted for September 2026, with the entire stack inside the FedRAMP Moderate boundary.

This doc is the first thing a new contributor, agent, or auditor reads. The diagrams below are authoritative — when they disagree with code, the code is wrong.

AWS Organizations topology

Account responsibilities:

Account	Account ID	Responsibility
mgmt	`778477254880`	Org root, SCPs, SSO Identity Center, AWS Artifact BAA + reports, Terraform state, `askflorence-data` S3 bucket (source-file audit trail)
prod	`039624954211`	Production workloads — apex `askflorence.health`, prod ECS Fargate, prod Atlas peering, prod SES
staging	`549136075525`	Staging workloads — `stage.askflorence.health`, staging ECS, §1311 reference data ingest pipeline + staging Atlas cluster
log-archive	`754660694122`	Organization CloudTrail, Config aggregator, WAF logs (Kinesis Firehose), Security Hub findings, VPC Flow Logs export. Object Lock 7-year retention (HIPAA 6yr + buffer)

SCP ScpBaseline (p-oy7xxdzz, attached to all 3 workload OUs):

Deny root user actions (with carve-out for one-time bootstrap actions like MFA setup)
Deny organizations:LeaveOrganization
Deny disabling CloudTrail / Config / GuardDuty / Security Hub
Region lock to us-east-1 (global services exempted)
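
The four guardrails above map onto a standard SCP document. The sketch below is illustrative only: the statement IDs, the bootstrap carve-out actions, and the list of exempted global services are assumptions, not the contents of the deployed ScpBaseline policy. It is written as a typed TypeScript constant so the shape can be checked before it is serialized to JSON.

```typescript
// Hypothetical sketch of an SCP shaped like ScpBaseline. Sids and action
// lists are assumptions; only the four guardrail intents come from the doc.
type ScpStatement = {
  Sid: string;
  Effect: "Deny";
  Action?: string[];
  NotAction?: string[];
  Resource: "*";
  Condition?: Record<string, Record<string, string | string[]>>;
};

const scpBaselineSketch: { Version: "2012-10-17"; Statement: ScpStatement[] } = {
  Version: "2012-10-17",
  Statement: [
    {
      // Deny everything to the root user except a small MFA bootstrap carve-out.
      Sid: "DenyRootExceptBootstrap",
      Effect: "Deny",
      NotAction: ["iam:CreateVirtualMFADevice", "iam:EnableMFADevice", "iam:ListVirtualMFADevices"],
      Resource: "*",
      Condition: { StringLike: { "aws:PrincipalArn": "arn:aws:iam::*:root" } },
    },
    {
      Sid: "DenyLeaveOrganization",
      Effect: "Deny",
      Action: ["organizations:LeaveOrganization"],
      Resource: "*",
    },
    {
      // Block the calls that would switch off the audit and detection baseline.
      Sid: "DenyDisablingSecurityServices",
      Effect: "Deny",
      Action: [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "config:StopConfigurationRecorder",
        "config:DeleteConfigurationRecorder",
        "guardduty:DeleteDetector",
        "securityhub:DisableSecurityHub",
      ],
      Resource: "*",
    },
    {
      // Region lock: deny any regional call outside us-east-1, exempting
      // global services (illustrative exemption list).
      Sid: "RegionLockUsEast1",
      Effect: "Deny",
      NotAction: ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "sts:*", "support:*"],
      Resource: "*",
      Condition: { StringNotEquals: { "aws:RequestedRegion": ["us-east-1"] } },
    },
  ],
};

// Serialize for use as the policy content.
const scpBaselineJson: string = JSON.stringify(scpBaselineSketch, null, 2);
```

Every statement is a Deny because an SCP only narrows what account-level IAM can grant; it never grants anything itself.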

Request flow — apex prod

Key contract: the application uses two distinct MongoDB connection pools:

getDb() — primary cluster (MONGODB_URI). Reads + writes for app data (plans, ZIPs, agents, audit log, surveys).
getReferenceDb() — reference cluster via PrivateLink (MONGODB_REFERENCE_URI). Required — no silent fallback. Read-only access to formularies_staging, providers_staging, plans (audit purposes), mrpuf_issuers_staging.
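
A minimal sketch of the "required, no silent fallback" contract, assuming the two URIs are resolved once at startup. The helper names `requireUri` and `resolveMongoUris` are hypothetical; only the environment variable names come from this document. The environment is passed in as a plain object so the logic can be tested without touching the real process environment.

```typescript
type Env = Record<string, string | undefined>;

// Throw loudly if a required connection string is missing or blank.
function requireUri(env: Env, name: string): string {
  const raw = env[name];
  const value = raw === undefined ? "" : raw.trim();
  if (value.length === 0) {
    throw new Error(`${name} is required but not set`);
  }
  return value;
}

// Resolve both connection strings. The reference URI deliberately does NOT
// fall back to MONGODB_URI: a missing value must fail startup rather than
// quietly route reference reads to the primary cluster.
function resolveMongoUris(env: Env): { primary: string; reference: string } {
  return {
    primary: requireUri(env, "MONGODB_URI"),
    reference: requireUri(env, "MONGODB_REFERENCE_URI"),
  };
}
```

Failing at resolution time, rather than on first query, surfaces a misconfigured task definition during deploy instead of during a user request.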

The CloudFront response-headers policy strips upstream identifiers (Server, X-Powered-By, Via, X-Amz-*, X-Cache) and overrides Server: AskFlorence. CSP + HSTS + X-Frame-Options DENY + X-Content-Type-Options nosniff + Referrer-Policy strict-origin-when-cross-origin are enforced at the edge.
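
The edge behavior described above can be modeled as a pure function, which is handy for a smoke test that compares expected and observed headers. This is a sketch of the policy's effect, not the CloudFront configuration itself; the function name is hypothetical, and the CSP and HSTS values are omitted because this document does not state them.

```typescript
const STRIPPED_EXACT = new Set(["server", "x-powered-by", "via", "x-cache"]);
const STRIPPED_PREFIXES = ["x-amz-"];

// Model of the response-headers policy: drop upstream identifiers, then
// apply the overrides and the fixed security headers named in the doc.
function scrubResponseHeaders(upstream: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(upstream)) {
    const value = upstream[name];
    if (value === undefined) continue;
    const lower = name.toLowerCase();
    if (STRIPPED_EXACT.has(lower)) continue;
    if (STRIPPED_PREFIXES.some((prefix) => lower.startsWith(prefix))) continue;
    out[lower] = value;
  }
  out["server"] = "AskFlorence";
  out["x-frame-options"] = "DENY";
  out["x-content-type-options"] = "nosniff";
  out["referrer-policy"] = "strict-origin-when-cross-origin";
  return out;
}
```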

State routing

State universe:

Set	Count	Source
Federal-30	30	CMS PUF age-rated premiums, own DB
NY	1	NY DFS / NYSOH community-rated, own DB
SBE redirect	19 + DC	`sbeRedirect` docs in `zip_county`, MongoDB-first since v0.34 (commit `ccad089`)

Source of truth for the state list: src/lib/constants.ts → STATE_BASED_MARKETPLACES. Federal-30 ∪ {NY} = the "own-data" universe; everything else in STATE_BASED_MARKETPLACES redirects.
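
The routing rule can be sketched as a small lookup. The sets below are truncated, hypothetical samples for illustration; the real lists live in src/lib/constants.ts, and the sketch assumes NY appears in STATE_BASED_MARKETPLACES but is carved out into the own-data universe.

```typescript
// Illustrative samples only, not the production lists.
const FEDERAL_STATES_SAMPLE = new Set(["TX", "FL", "OH"]);
const STATE_BASED_MARKETPLACES_SAMPLE = new Set(["NY", "CA", "DC"]);
const OWN_DATA_SBE = new Set(["NY"]);

type StateRoute = "own-data" | "sbe-redirect" | "unknown";

// Federal states and NY are served from our own DB; every other
// state-based marketplace redirects.
function routeForState(state: string): StateRoute {
  const code = state.trim().toUpperCase();
  if (FEDERAL_STATES_SAMPLE.has(code) || OWN_DATA_SBE.has(code)) return "own-data";
  if (STATE_BASED_MARKETPLACES_SAMPLE.has(code)) return "sbe-redirect";
  return "unknown";
}
```

Checking the own-data carve-out before the SBE set is what keeps NY from being redirected.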

zip_county collection coverage is at TRUE 100% match against CMS for federal + NY (Tier 1 audit) and for SBE redirect docs (Tier 1.5 audit), following the Tier 0.5 USPS-completeness audit (2026-05-01) and the drive-to-100% completion work.

Atlas topology — two projects, two clusters, one cross-cluster path

Projects are isolated per ADR 0001: users, roles, allowlists, and alerts are project-scoped, so a mistake on staging cannot widen prod's surface. Cross-cluster reads use AWS PrivateLink per ADR 0004, which saves ~$326/mo versus duplicating reference data onto a prod M30 and keeps PHI / non-PHI audit boundaries clean.

MongoDB collections

Where each collection lives

Prod cluster (askflorence-prod-01, M10 HIPAA):

App data: plan_years, plans, regions, zip_county, audit_log
Agent platform: agent_waitlist_submissions, agent_survey_responses, agent_unsubscribes, agents, agencies, agent_sessions, admins, agent_audit_log, agent_nipr_records, agent_id_verifications, hubspot_sync_log

Staging cluster (askflorence-staging, M30, non-PHI reference data):

formularies_staging — 12,557 RxCUIs / ~30M drug-plan tuples (§1311 MRF ingest)
providers_staging — 2.14M NPI provider docs
drug_search_index — ENG-425 derived drug-name search read-model (~6,100 ingredient+form groups; full strength union + brand/generic names + coverage-breadth commonality). Search-only (NO coverage). Rebuilt post-ingest from formularies_staging by scripts/db/derive-drug-search-index.js --apply.
mrpuf_issuers_staging — issuer reference data for §1311 reconciliation
plans — staging copy of the §1311 plan reference (for audit harnesses)

Prod app reads from staging cluster read-only via getReferenceDb() (PrivateLink). The §1311 ingest pipeline runs in the staging AWS account and writes to the staging cluster only — prod picks up refreshes automatically through PrivateLink, no double-ingest.

Append-only audit log

agent_audit_log is enforced append-only at the DB role layer per ADR 0002. Both role_writer_agents and role_admin_agents have FIND + INSERT only on this collection — no UPDATE, no REMOVE. Application code cannot bypass this — the trust boundary is at the Atlas role JSON, not at the Node layer.
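
As a rough illustration, an append-only grant can be expressed and checked in the action/resources shape that Atlas custom roles use. The database name `askflorence` is inferred from `read@askflorence`, and the grants on `agents` are assumptions made for contrast; neither is copied from the live role JSON.

```typescript
type RoleResource = { db: string; collection: string };
type RoleAction = { action: string; resources: RoleResource[] };
type CustomRole = { roleName: string; actions: RoleAction[] };

const DB = "askflorence"; // assumed database name

// Trimmed, hypothetical version of role_writer_agents.
const roleWriterAgentsSketch: CustomRole = {
  roleName: "role_writer_agents",
  actions: [
    { action: "FIND", resources: [{ db: DB, collection: "agents" }, { db: DB, collection: "agent_audit_log" }] },
    { action: "INSERT", resources: [{ db: DB, collection: "agents" }, { db: DB, collection: "agent_audit_log" }] },
    { action: "UPDATE", resources: [{ db: DB, collection: "agents" }] },
    { action: "REMOVE", resources: [{ db: DB, collection: "agents" }] },
  ],
};

// List every action the role grants on one collection.
function grantedActions(role: CustomRole, db: string, collection: string): string[] {
  return role.actions
    .filter((a) => a.resources.some((r) => r.db === db && r.collection === collection))
    .map((a) => a.action);
}

// Append-only means: at least one grant, and nothing beyond FIND and INSERT.
function isAppendOnly(role: CustomRole, db: string, collection: string): boolean {
  const granted = grantedActions(role, db, collection);
  return granted.length > 0 && granted.every((a) => a === "FIND" || a === "INSERT");
}
```

A check like this could run against the exported role definition in CI, so a stray UPDATE or REMOVE grant is caught before it reaches Atlas.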

Narrow-scoped MongoDB users

Per ADR 0003 — a compromise of any single app-tier credential is bounded by that credential's collection scope. Production retirement of the legacy app-write user (broad readWrite) is a pre-launch gate, tracked at #56.

User	Role	Env var	Purpose
`app-read`	built-in `read@askflorence`	`MONGODB_URI`	Read for all paths (legacy, replaced on staging by `app_read_local_staging`)
`app_read_staging`	custom `role_reader_reference` (4 collections: `formularies_staging`, `providers_staging`, `plans`, `mrpuf_issuers_staging`; ENG-425 adds `drug_search_index` as a 5th — grant the live role FIND on it in lock-step with `infra/atlas/access-matrix.ts` BEFORE pointing `/plans`/Florence at `/api/drugs/search`)	`MONGODB_REFERENCE_URI` (PrivateLink endpoint)	Cross-cluster reference reads from prod into staging cluster
`app_writer_survey`	`role_writer_survey` (FIND/INSERT/UPDATE/REMOVE on `agent_survey_responses`)	`MONGODB_URI_SURVEY_WRITE`	Discovery survey writes
`app_writer_waitlist`	`role_writer_waitlist` (`agent_waitlist_submissions`, `agent_unsubscribes`)	`MONGODB_URI_WAITLIST_WRITE`	Waitlist + unsubscribe writes
`app_writer_plans`	`role_writer_plans` (`plans`, `zip_county`, `regions`, `plan_years`, `audit_log` + index management)	`MONGODB_URI_PLANS_WRITE`	Ingest scripts
`app_writer_agents`	`role_writer_agents` (`agents`, `agencies`, `agent_sessions`, FIND+INSERT only on `agent_audit_log`, NO access to `admins`)	`MONGODB_URI_AGENTS_WRITE`	Agent platform runtime writes
`app_admin_agents`	`role_admin_agents` (everything in `role_writer_agents` + FIND/INSERT/UPDATE/REMOVE on `admins`, still append-only on `agent_audit_log`)	`MONGODB_URI_AGENTS_ADMIN`	Admin promotion path
`audit_reader`	`role_audit_reader` (FIND on `agent_audit_log` only)	`MONGODB_URI_AUDIT_READ`	Compliance review access
`app_admin_schema`	`role_admin_schema` (index ensure across collections)	`MONGODB_URI_ADMIN_SCHEMA`	Deploy-time index ensure via in-VPC ECS RunTask (per ENG-266)
`hubspot-sync-write`	scoped writer on `agents`, `agencies`, `hubspot_sync_log`	`MONGODB_URI_HUBSPOT_SYNC_WRITE`	HubSpot ↔ Mongo bidirectional sync

Open follow-ups:

#56 — retire app-write from prod cluster; rotate app-read + app-write passwords (they sat in long-lived local files).
#166 — pattern critique: the narrow-scoped user model has produced silent regressions when env vars bind to credentials that lack the consuming code path's permissions. "Broaden first, narrow with discipline" alternative under consideration.
#271 — completed: split staging cross-cluster reader (app_read_staging) from staging deployment-local reader (app_read_local_staging) so cross-cluster role narrowing doesn't break staging's own app.

CI guard chain (per #100) catches drift in two complementary layers:

Static check — scripts/audit/staging-collections-guard.ts runs on every PR; fails build if getReferenceDb() is called against any collection not in STAGING_ALLOWED_COLLECTIONS.
Live nightly drift check — scripts/audit/staging-cluster-drift.ts runs at 08:00 UTC daily; audits the actual Atlas state of app_read_staging against the canonical 4-collection matrix; opens a P1 GitHub issue on drift.
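
The static layer can be approximated with a source scan. This is a simplified sketch, not the contents of scripts/audit/staging-collections-guard.ts: it assumes call sites look like `getReferenceDb()` followed (optionally after a closing parenthesis from an `await`) by `.collection("name")`, and it would miss collection names passed through variables.

```typescript
const STAGING_ALLOWED_COLLECTIONS = new Set([
  "formularies_staging",
  "providers_staging",
  "plans",
  "mrpuf_issuers_staging",
]);

// Return every collection name read via getReferenceDb() that is not on
// the allowlist. An empty result means the file passes the guard.
function findDisallowedReferenceReads(source: string): string[] {
  const pattern = /getReferenceDb\(\)\s*\)?\s*\.collection\(\s*["']([^"']+)["']\s*\)/g;
  const disallowed: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const name = match[1];
    if (name !== undefined && !STAGING_ALLOWED_COLLECTIONS.has(name)) {
      disallowed.push(name);
    }
  }
  return disallowed;
}
```

A regex scan is cheap enough to run on every PR; the nightly drift check then covers what static analysis cannot see, namely the grants actually present in Atlas.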

Cross-cluster reference reads — Phase 11 detail

Live since 2026-05-08. Architecturally documented in ADR 0004; session log at 2026-05-08-phase-11-cross-cluster-privatelink.md.

Network posture (verified for HIPAA §164.312(e)(1) Transmission Security):

No public internet path — traffic stays on the AWS backbone end-to-end
Identity-bound — the endpoint connection requires the prod AWS account credentials AND the Atlas user password
Layered transit protection: AWS-backbone-only network path + TLS 1.2+ encryption at the app layer
Read-only — role_reader_reference grants only FIND action on the canonical 4 collections; CI guard verifies daily

Revisit triggers (from ADR 0004):

Staging cluster cost > $500/mo sustained for >2 months → evaluate M20 with delta refresh (#98)
Cross-cluster read p99 > 250ms → evaluate Path A (duplicate to prod cluster)
Auditor flags cross-cluster path under EDE Phase 3 → migrate both clusters to Atlas for Government (same architecture transfers; PrivateLink stays)
Any PHI ever needs to land on staging cluster → immediate cutover (the CI guard at #100 is the early-warning system)

Doctor + Rx coverage layer

Pivot decision 2026-05-03 (docs/decisions/2026-05-03-pivot-cms-api-direct.md): the consumer-facing coverage flow calls CMS Marketplace API at query time, not the §1311 owned-data mirror.

CMS API limits (verified 2026-05-03):

Limit	Value	Source
Rate limit (per second)	200 RPS	`x-ratelimit-limit-second` header
Rate limit (per minute)	1,000 / min	`x-ratelimit-limit-minute` header
Per-call max (NPIs × plans)	100 combos	tested
Per-call max NPIs alone	10	tested
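
One way to stay inside the per-call limits is to split a coverage check into batches before calling CMS. The sketch below assumes the two limits combine as "at most 10 NPIs per call, and NPIs × plans at most 100"; the function names are hypothetical.

```typescript
const MAX_NPIS_PER_CALL = 10;
const MAX_COMBOS_PER_CALL = 100;

type CoverageCall = { npis: string[]; planIds: string[] };

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Produce call batches where each batch has <= 10 NPIs and
// NPIs x plans <= 100.
function planCoverageCalls(npis: readonly string[], planIds: readonly string[]): CoverageCall[] {
  const calls: CoverageCall[] = [];
  for (const npiGroup of chunk(npis, MAX_NPIS_PER_CALL)) {
    // npiGroup is never empty, so this is between 10 and 100 plans per call.
    const plansPerCall = Math.floor(MAX_COMBOS_PER_CALL / npiGroup.length);
    for (const planGroup of chunk(planIds, plansPerCall)) {
      calls.push({ npis: npiGroup, planIds: planGroup });
    }
  }
  return calls;
}
```

For example, 12 NPIs against 25 plans yields four calls: three for the first 10 NPIs (10, 10, and 5 plans) and one for the remaining 2 NPIs with all 25 plans.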

Open Enrollment scale ceiling: the single-key 1,000/min limit binds before the per-second limit, since 1,000/min is only ~16.7 requests/s sustained against a 200 RPS cap. Mitigations under design:

1. Multiple CMS API keys rotated server-side (each carries its own 1,000/min budget)
2. Server-side coverage cache keyed on (plan_id, npi) and (plan_id, rxcui) with hours-long TTL
3. CloudFront edge cache on proxy routes
4. Tier the coverage check: top 3 plans immediately, lazy-load the rest
5. Fall back to §1311 ingested data behind a feature flag at high load (99.94% accuracy, degraded but live)

Mitigations 1–3 should land before OE 2027. Mitigation 5 is the long-term bridge if §1311 ingest resumes.
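
Server-side key rotation might look like the following sketch. It assumes CMS enforces the per-minute limit as a fixed one-minute window per key, which this document does not confirm, and `createKeyRotator` is a hypothetical helper. The clock is passed in so the behavior is deterministic under test.

```typescript
const PER_KEY_PER_MINUTE = 1000;

// Hand out API keys while tracking each key's usage in the current minute.
function createKeyRotator(keys: readonly string[], perMinute: number = PER_KEY_PER_MINUTE) {
  let currentWindow = -1;
  let used: number[] = keys.map(() => 0);
  return {
    // Returns a key with budget left, or null when every key is exhausted
    // for this minute (the caller should then serve from cache or fallback).
    acquire(nowMs: number): string | null {
      const window = Math.floor(nowMs / 60000);
      if (window !== currentWindow) {
        currentWindow = window;
        used = keys.map(() => 0);
      }
      for (let i = 0; i < keys.length; i++) {
        const count = used[i] ?? 0;
        if (count < perMinute) {
          used[i] = count + 1;
          return keys[i] ?? null;
        }
      }
      return null;
    },
  };
}
```

With N keys the sustained ceiling scales to roughly N × 16.7 requests/s. A null return is the natural hook for the cache or the feature-flagged §1311 fallback.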

Email — AWS SES

Email-provider abstraction at src/lib/email.ts. Single typed sendEmail() API; provider hardcoded to AWS SES post-v0.33.0 (Resend retired).

Surface	Path
Waitlist confirmations	`/api/waitlist` → `sendEmail()` → AWS SES
Agent waitlist	`/api/waitlist` (with `interest: "agent"`) → SES
Agent discovery survey	`/api/agents/discovery` → SES
Resume email (partial save)	inline from partial-save handler → SES
15-min reminder	EventBridge Scheduler → Lambda → SES (per ADR 0005)
Unsubscribe	`/api/unsubscribe` with HMAC token
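
For the unsubscribe path, an HMAC token lets the route verify a link without storing per-email state. The sketch below uses Node's built-in crypto module; the token layout (HMAC-SHA256 over the lowercased email, hex-encoded) and the function names are assumptions, not the format the application actually uses.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Derive a deterministic token for an email address.
function signUnsubscribeToken(email: string, secret: string): string {
  return createHmac("sha256", secret).update(email.trim().toLowerCase()).digest("hex");
}

// Verify a token in constant time. timingSafeEqual throws on length
// mismatch, so lengths are compared first.
function verifyUnsubscribeToken(email: string, token: string, secret: string): boolean {
  const expected = Buffer.from(signUnsubscribeToken(email, secret), "utf8");
  const provided = Buffer.from(token, "utf8");
  if (expected.length !== provided.length) return false;
  return timingSafeEqual(expected, provided);
}
```

Constant-time comparison avoids leaking how many leading characters of a guessed token were correct.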

DKIM + SPF + DMARC verified on askflorence.health and stage.askflorence.health. SES sends from hello@updates.askflorence.health. Production access enabled (50,000 messages / 24h, 14 msg/sec).

Email brand polish per ENG-243: CAN-SPAM compliance + wordmark + em-dash sweep applied to all transactional templates.

Delayed jobs — EventBridge Scheduler + Lambda + SQS

Per ADR 0005 — AWS-native primitives for "execute work at a specific later time, conditional on intermediate state."

Applied:

15-min discovery survey reminder (ENG-242) — wired 2026-05-09
24h / 72h / 7d marketing nudges — same pattern, different window
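
The "specific later time, conditional on intermediate state" pattern has two halves: a one-time schedule created when the work is queued, and a state check when the Lambda fires. A minimal sketch follows, with hypothetical type and field names. The `at(yyyy-mm-ddThh:mm:ss)` form is EventBridge Scheduler's one-time schedule expression, evaluated in UTC unless a schedule timezone is set.

```typescript
type SurveyState = {
  startedAt: Date;
  completedAt: Date | null;
  unsubscribed: boolean;
};

// Format a Date as a one-time schedule expression (UTC, no milliseconds).
function toAtExpression(runAt: Date): string {
  return `at(${runAt.toISOString().slice(0, 19)})`;
}

// Schedule time: when to fire, relative to when the survey was started.
function reminderExpression(state: SurveyState, delayMinutes: number = 15): string {
  return toAtExpression(new Date(state.startedAt.getTime() + delayMinutes * 60000));
}

// Fire time: the Lambda re-reads state and sends only if still warranted.
function shouldSendReminder(state: SurveyState): boolean {
  return state.completedAt === null && !state.unsubscribed;
}
```

The same two functions cover the 24h / 72h / 7d nudges by changing `delayMinutes`. Re-checking state at fire time means a completed survey needs no schedule cancellation.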

Future: agent activation email (Phase 5), renewal alerts, personalized nudges from Florence AI.

Rejected alternatives: Inngest (BAA enterprise-only, $500-1,500/mo+), Trigger.dev managed cloud (no HIPAA), Convex (BAA enterprise-only), Temporal Cloud ($100-500/mo floor — too heavy for our stage). The AWS-native path uses services already on the AWS Org BAA.

Florence AI (planned, Phase 1 reserved)

Foundation in place per v0.19.0 (2026-04-21). Phase 1 ships on Amazon Bedrock from day one (under AWS Org BAA + FedRAMP Moderate inheritance). The direct Anthropic API will not be called from prod.

VPC endpoint for Bedrock Runtime is provisioned in prod VPC (Phase 4 reservation)
IAM role permissions for bedrock:InvokeModel are stubbed in the ECS task role policy
Secrets Manager has prod/anthropic-api-key and prod/openai-whisper-api-key slots reserved from Phase 4 — both remain empty (unused; we don't call those services in prod)
Architecture brief: docs/briefs/BRIEF_FLORENCE_AI_ARCHITECTURE.md
Research track: #61 (Linear: ENG-204)

Voice (if/when it ships): all three AWS-hosted options stay open (Amazon Transcribe managed ASR, self-hosted Whisper on SageMaker / ECS / EC2, Bedrock Marketplace if available). All under the AWS BAA. No direct OpenAI API path.

Compliance posture

Observability + control evidence running since Phase 2 (2026-04-17):

Control	Service	Status
Audit trail (org-wide)	CloudTrail org trail → log-archive S3, Object Lock 7-year	✅ since 2026-04-17
Configuration drift	AWS Config + aggregator on log-archive	✅ since 2026-04-17
Threat detection	GuardDuty org-wide (S3 data events, EBS malware, ECS Fargate runtime)	✅ since 2026-04-17
Continuous compliance	Security Hub org-wide (CIS + AWS FSBP + HIPAA conformance pack)	✅ since 2026-04-17
WAF	WAFv2 managed rule sets + rate limit + scope-down exemptions for PostHog `/ingest/*` + crawler UAs	✅ since Phase 6
VPC Flow Logs	exported to log-archive	✅ since Phase 4
Encryption at rest	CMK per account; Secrets Manager + KMS-encrypted; Atlas AES-256 default	✅
Encryption in transit	TLS 1.2+ throughout; PrivateLink for cross-cluster (HIPAA §164.312(e)(1))	✅
SSO / no long-lived keys	OIDC federation for GitHub Actions; SSO Identity Center for human access	✅
Root MFA	virtual MFA on all 4 account roots; SCP carve-out limits root actions	✅ since 2026-04-18
Vendor BAAs	see `docs/security-compliance/vendor-register.md`	active — Atlas signed-copy PDF + HubSpot DPA pending (#145, vendor register)

EDE Phase 3 / SOC 2 Type II / HIPAA framework mapping is in the Phase 12 compliance docs (foundation landed 2026-05-12 per #71).

EDE readiness — current row-by-row

Requirement	Status	Notes
HIPAA BAA chain (compute + data)	✅	AWS Org BAA 2026-04-18; Atlas HIPAA-tier active
FedRAMP Moderate boundary	✅	commercial `us-east-1`; Bedrock + Transcribe + SageMaker in scope
Identity verification	❌	Phase 5 agent platform — vendor TBD
CMS Marketplace API integration	✅	proxied through our routes; opacity layer enforced
Hardware MFA on root + privileged	⚠️	virtual MFA today; YubiKey enrollment pending (#67)
Vendor BAA register	✅	`docs/security-compliance/vendor-register.md` is canonical
Audit trail	✅	org CloudTrail + agent_audit_log append-only
Pen test	❌	annual scheduled — vendor selection in flight (#143)
Compliance automation	❌	Drata / Vanta / Sprinto selection in flight (#142)

CI/CD

GitHub Actions with OIDC federation per AWS account — zero long-lived keys
Branch staging → deploys to AWS staging account (stage.askflorence.health)
Branch main → deploys to AWS prod account (askflorence.health) with environment-protection approval gate
Container build: multi-stage Dockerfile, Next.js standalone output, non-root user
Image scan: ECR enhanced scanning + Inspector
Pre-deploy CI guards:
- validate-secrets.yml — fail build on whitespace/newline/empty bug class (the Resend \n incident)
- staging-collections-guard.yml — fail build if getReferenceDb() is called against a collection not in STAGING_ALLOWED_COLLECTIONS
- GET-handler discipline (#109) — fail build on side-effect-triggering GET endpoints without auth gate
Post-deploy smoke: in-progress per #156 — expand smoke + add PR-level full-flow tests after 3 silent-failure incidents (ENG-242, 272, 274)
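
The whitespace/newline/empty bug class targeted by validate-secrets.yml reduces to a small predicate. The sketch below shows the kind of check involved, not the workflow's actual implementation; the type and function names are hypothetical.

```typescript
type SecretProblem = "empty" | "newline" | "whitespace";

// Classify a secret value; null means it looks clean.
function checkSecretValue(value: string): SecretProblem | null {
  if (value.trim().length === 0) return "empty";
  if (/[\r\n]/.test(value)) return "newline";
  if (value !== value.trim()) return "whitespace";
  return null;
}

// Collect the names of all secrets that fail the check.
function findBadSecrets(secrets: Record<string, string>): string[] {
  return Object.keys(secrets).filter((name) => {
    const value = secrets[name];
    return value === undefined || checkSecretValue(value) !== null;
  });
}
```

A trailing newline is the easiest of these to introduce (for example, by piping `echo` output into a secret store), which is why it gets its own category.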

Cost (actuals, 2026-05)

Component	Monthly	Notes
Atlas prod cluster (M10 HIPAA)	$56	App data, PHI-eligible
Atlas staging cluster (M30)	$382	§1311 reference data, non-PHI
AWS ECS Fargate (prod, 2 tasks)	~$30	Multi-AZ, on-demand
AWS ECS Fargate (staging, 1 task FARGATE_SPOT)	~$10
AWS NAT Gateways (prod, 2 AZ)	~$65	$0.045/h × 2
AWS NAT Gateways (staging)	$0	VPC Endpoints instead — flat $35/mo
VPC Endpoints (Secrets, KMS, ECR, Logs, S3 Gateway, PrivateLink)	~$45 prod + ~$35 staging
AWS PrivateLink endpoint (cross-cluster)	~$10	Interface endpoint + data egress
CloudFront prod (PriceClass_All)	~$15	Volume-dependent
CloudFront staging (PriceClass_100)	~$3
WAFv2 per env	~$10	$5/web-ACL + rule cost
Secrets Manager	~$5	$0.40/secret × ~12 secrets
KMS CMK	~$3	$1/key × 3
Observability (CloudTrail / Config / GuardDuty / Security Hub / WAF logs)	~$15	Org-wide
SES	~$0.10/1K messages	At current volume: ~$2
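Total	~$690/mo	Sum of the rows above (WAFv2 counted once); pre-launch / pre-OE scale

vs Path A (duplicate reference data onto a prod M30): Atlas spend would have been ~$764/mo (two M30 clusters at $382 each) instead of $438/mo today (M10 at $56 + M30 at $382). ADR 0004 saves $326/mo recurring.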

Scaling forward at OE 2027: M20 prod, M30 staging (delta-aware refresh per #98), Fargate 4 tasks. Projected ~$800/mo.

Migration path — done / in progress / planned

Done

[x] Phase 1 (2026-04-17) — AWS Organizations + 4 accounts + SSO + SCPs + budgets
[x] Phase 2 (2026-04-17) — Org observability baseline (CloudTrail / Config / GuardDuty / Security Hub)
[x] Phase 3 (2026-04-18) — Terraform scaffolding + OIDC federation
[x] Phase 4 (2026-04-20) — Staging VPC + KMS + Secrets Manager + ACM cert
[x] Phase 5 (2026-04-21) — Staging ECS + ALB + ECR + first deploy (session log)
[x] Phase 6 (2026-04-22) — Staging CloudFront + WAF + DNS (session log)
[x] Phase 7 (2026-04-22) — Staging Atlas VPC peering + CI/CD (session log)
[x] Phase 8 (2026-04-22) — Prod account mirror (session log)
[x] Phase 9 (2026-04-23) — Prod canary validation
[x] Phase 10 (2026-04-23) — DNS cutover, Vercel retired (session log)
[x] Phase 11 (2026-05-08) — Cross-cluster Atlas reads via AWS PrivateLink (session log, ADR 0004)
[x] Phase D (2026-05-08) — Provider-network-tier fallback via cross-cluster (session log)
[x] Resend retired (2026-04-30) — SES-only email
[x] Tier 0 / Tier 0.5 / Tier 1 / Tier 1.5 ZIP audits — all at TRUE 100% match vs CMS

In progress

Phase 5 agent platform (auth, NIPR, ID verify) — #54
Hardware MFA (YubiKey) enrollment — #67
PostHog → OpenPanel + GlitchTip self-hosted + observability — sub-A shipped (#75); build at #342 / ENG-347 (ADR 0009)
Atlas IP allowlist cleanup (drop Vercel-era CIDRs) — #141
MongoDB Atlas signed BAA PDF collection — #145
Delta-aware MRF refresh pipeline — #98 / ENG-236
Compliance automation vendor selection (Drata / Vanta / Sprinto) — #142
Annual pen test vendor selection — #143
Mongo scoped-user simplification pattern review — #166
Post-deploy smoke + PR-level full-flow tests — #156

Planned

Atlas commercial → Atlas for Government migration at EDE Phase 3 cutover (~September 2026)
Florence AI Phase 1 on Bedrock — #61
Member portal — #44
Bedrock multi-region on staging to unblock Florence A0 — #62

IP opacity — hard rules enforced across the stack

A competitor inspecting network traffic, HTML, headers, or response bodies should not be able to identify our data sources. Enforced at multiple layers:

Every data call goes through /api/* on askflorence.health — no browser ever calls marketplace.api.healthcare.gov, Atlas, CMS, any SBE source, or any future vendor directly. Future integrations that physically cannot work that way (e.g., redirect-based ID verification) are documented exceptions and live on subdomains of our domain wherever possible.
Upstream response headers stripped at the edge — CloudFront response-headers policy removes Server, X-Powered-By, Via, X-Amz-*, X-Cache; overrides Server: AskFlorence.
Response bodies scrubbed — data_source field removed from client-facing responses; no CMS / Healthcare.gov URLs in JSON; carrier-hosted PDFs (SBC, formulary) proxied through /api/docs/[id] with askflorence.health URLs.
Generic error messages — API routes never surface upstream error text. 5xx returns { error: "temporary_upstream_issue", request_id } — detail in CloudWatch only.
No source maps in production builds — Dockerfile sets GENERATE_SOURCEMAP=false.
Security headers via CloudFront response-headers policy — strict CSP, X-Content-Type-Options nosniff, Referrer-Policy strict-origin-when-cross-origin, X-Frame-Options DENY, HSTS with preload.
No _cms / _healthcare_gov / _nysoh strings in public JS chunks, CSS classes, or route names.
Future vendor flows (Phase 5 ID verify, NIPR, etc.) must follow the same rule: proxied through our API by default.
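
Two of the rules above, generic errors and scrubbed bodies, are easy to centralize in helpers that every /api/* route calls. A minimal sketch with hypothetical function names follows. The error envelope and the `data_source` field come from this document; the logger is injected so upstream detail goes only to server-side logs.

```typescript
type ClientError = { error: "temporary_upstream_issue"; request_id: string };

// Log the real upstream failure server-side, return only the opaque envelope.
function toClientError(upstreamError: unknown, requestId: string, log: (detail: string) => void): ClientError {
  log(`request_id=${requestId} upstream_error=${String(upstreamError)}`);
  return { error: "temporary_upstream_issue", request_id: requestId };
}

// Remove the source-identifying field from a client-facing response body.
function scrubBody(body: Record<string, unknown>): Record<string, unknown> {
  const copy: Record<string, unknown> = { ...body };
  delete copy["data_source"];
  return copy;
}
```

Returning the `request_id` gives support a handle for finding the full detail in CloudWatch without exposing any upstream text to the browser.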

Recent architecture decisions

ADR	Decision	Status
0001	Atlas project isolation for staging vs prod	Accepted 2026-04-17
0002	`agent_audit_log` enforced append-only at DB role layer	Accepted 2026-04-17
0003	Narrow-scoped MongoDB users per Issue #56	Accepted 2026-04-17 (pattern critique in flight per #166)
0004	Cross-cluster Atlas reads from prod via AWS PrivateLink	Accepted 2026-05-08, amended 2026-05-11 (ENG-257 — 4-collection canonical scope)
0005	Delayed-job architecture: EventBridge Scheduler + Lambda + SES (AWS-native)	Accepted 2026-05-09

Cross-references

Consumer & Agent Flow — end-to-end journey from quote to enrollment (Agent terminology applied)
Security & Compliance — auditor entry point: policies, control mappings, runbooks
MongoDB Setup Runbook — cluster details, indexes, peering, PrivateLink
AWS Setup Runbook — deploy, rollback, rotate, scale
Pivot decision — CMS API direct — why doctor + Rx coverage calls CMS at query time
Vendor / subprocessor register — canonical BAA / DPA / FedRAMP source of truth
Session log index — what each agent session shipped
Infrastructure change log — timestamped change history for SOC 2 CC8.1 evidence