AskFlorence — System Architecture
Status: Living document. Last updated 2026-05-11. Version: 0.4.0 (post-Phase-11 architecture: AWS-only frontend + cross-cluster Atlas PrivateLink + Agent terminology) Recent changes: see session log and infrastructure change log for timestamped history.
TL;DR
AskFlorence is an ACA marketplace covering the 30 federal-marketplace states plus NY with real-time subsidized pricing (own-data path). The 19 SBE states + DC are handled via redirect. A doctor + Rx coverage layer calls the CMS Marketplace API directly, with a non-PHI §1311 reference dataset on a separate Atlas cluster as fallback.
The whole platform runs on AWS commercial us-east-1. Frontend (askflorence.health, www.askflorence.health) is served from ECS Fargate behind CloudFront + WAFv2. Vercel was retired at Phase 10 cutover (2026-04-23). Compliance posture: HIPAA-active (AWS Org BAA signed 2026-04-18 + Atlas HIPAA tier on prod cluster); SOC 2 Type II evidence collection in progress (window through January 2027); CMS EDE Phase 3 submission targeted for September 2026, with the entire stack inside the FedRAMP Moderate boundary.
This doc is the first thing a new contributor, agent, or auditor reads. The diagrams below are authoritative — when they disagree with code, the code is wrong.
AWS Organizations topology
Account responsibilities:
| Account | Account ID | Responsibility |
|---|---|---|
| mgmt | 778477254880 | Org root, SCPs, SSO Identity Center, AWS Artifact BAA + reports, Terraform state, askflorence-data S3 bucket (source-file audit trail) |
| prod | 039624954211 | Production workloads — apex askflorence.health, prod ECS Fargate, prod Atlas peering, prod SES |
| staging | 549136075525 | Staging workloads — stage.askflorence.health, staging ECS, §1311 reference data ingest pipeline + staging Atlas cluster |
| log-archive | 754660694122 | Organization CloudTrail, Config aggregator, WAF logs (Kinesis Firehose), Security Hub findings, VPC Flow Logs export. Object Lock 7-year retention (HIPAA 6yr + buffer) |
SCP ScpBaseline (p-oy7xxdzz, attached to all 3 workload OUs):
- Deny root user actions (with carve-out for one-time bootstrap actions like MFA setup)
- Deny `organizations:LeaveOrganization`
- Deny disabling CloudTrail / Config / GuardDuty / Security Hub
- Region lock to `us-east-1` (global services exempted)
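As a rough illustration of the rules listed above, the policy document could look something like the following, written here as a TypeScript object that serializes to SCP JSON. This is a sketch, not the contents of `p-oy7xxdzz`: the statement IDs, the exact list of denied actions, and the set of exempted global services are assumptions, and the root-user statement with its bootstrap carve-out is omitted.

```typescript
// Simplified shape of an SCP statement (a subset of the IAM policy grammar).
type ScpStatement = {
  Sid: string;
  Effect: "Deny";
  Action?: string[];
  NotAction?: string[];
  Resource: "*";
  Condition?: Record<string, Record<string, string | string[]>>;
};

// Assumption: Sids, denied actions, and the exempted global services are
// illustrative; the live ScpBaseline policy is not reproduced in this doc.
const scpBaselineSketch: { Version: "2012-10-17"; Statement: ScpStatement[] } = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "DenyLeaveOrganization",
      Effect: "Deny",
      Action: ["organizations:LeaveOrganization"],
      Resource: "*",
    },
    {
      Sid: "DenyDisablingDetectiveControls",
      Effect: "Deny",
      Action: [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "config:StopConfigurationRecorder",
        "config:DeleteConfigurationRecorder",
        "guardduty:DeleteDetector",
        "securityhub:DisableSecurityHub",
      ],
      Resource: "*",
    },
    {
      // Deny everything outside us-east-1 except the listed global services.
      Sid: "RegionLock",
      Effect: "Deny",
      NotAction: ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "sts:*", "support:*"],
      Resource: "*",
      Condition: { StringNotEquals: { "aws:RequestedRegion": ["us-east-1"] } },
    },
  ],
};
```

`JSON.stringify(scpBaselineSketch, null, 2)` yields a document in the shape AWS Organizations expects for an SCP.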
Request flow — apex prod
Key contract: the application uses two distinct MongoDB connection pools:
- `getDb()`: primary cluster (`MONGODB_URI`). Reads + writes for app data (plans, ZIPs, agents, audit log, surveys).
- `getReferenceDb()`: reference cluster via PrivateLink (`MONGODB_REFERENCE_URI`). Required, with no silent fallback. Read-only access to `formularies_staging`, `providers_staging`, `plans` (audit purposes), `mrpuf_issuers_staging`.
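The "no silent fallback" part of this contract can be sketched as a fail-fast resolver. The helper names below (`requireEnv`, `resolveMongoUris`) are hypothetical and the real `getDb()` / `getReferenceDb()` implementations are not shown in this doc; only the two env var names come from the text above.

```typescript
// Hypothetical helpers illustrating the contract; not the project's code.
type Env = Record<string, string | undefined>;

interface MongoUris {
  primary: string;   // MONGODB_URI: reads + writes for app data
  reference: string; // MONGODB_REFERENCE_URI: read-only, via PrivateLink
}

function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`${name} is required but not set`);
  }
  return value;
}

// Resolve both URIs up front. There is deliberately no
// `reference ?? primary` fallback: a missing reference URI is a hard error,
// so reference reads can never quietly land on the primary cluster.
function resolveMongoUris(env: Env): MongoUris {
  return {
    primary: requireEnv(env, "MONGODB_URI"),
    reference: requireEnv(env, "MONGODB_REFERENCE_URI"),
  };
}
```

Calling `resolveMongoUris` once at startup would surface a missing `MONGODB_REFERENCE_URI` at deploy time rather than on the first coverage lookup.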
The CloudFront response-headers policy strips upstream identifiers (Server, X-Powered-By, Via, X-Amz-*, X-Cache) and overrides Server: AskFlorence. CSP + HSTS + X-Frame-Options DENY + X-Content-Type-Options nosniff + Referrer-Policy strict-origin-when-cross-origin are enforced at the edge.
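The effect of that policy on a single response can be modeled in a few lines. This is only a model of the behavior described above, useful for a smoke test against captured headers; it is not the CloudFront configuration itself, and the function name is hypothetical.

```typescript
// Header names the policy removes, lowercased for case-insensitive matching.
const STRIPPED_EXACT = new Set(["server", "x-powered-by", "via", "x-cache"]);
const STRIPPED_PREFIXES = ["x-amz-"];

// Hypothetical helper: returns the headers a client would see after the
// edge policy runs. Output header names are lowercased.
function scrubUpstreamHeaders(upstream: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(upstream)) {
    const lower = name.toLowerCase();
    if (STRIPPED_EXACT.has(lower)) continue;
    if (STRIPPED_PREFIXES.some((prefix) => lower.startsWith(prefix))) continue;
    out[lower] = value;
  }
  // The policy overrides Server rather than leaving it absent.
  out["server"] = "AskFlorence";
  return out;
}
```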
State routing
State universe:
| Set | Count | Source |
|---|---|---|
| Federal-30 | 30 | CMS PUF age-rated premiums, own DB |
| NY | 1 | NY DFS / NYSOH community-rated, own DB |
| SBE redirect | 19 + DC | sbeRedirect docs in zip_county, MongoDB-first since v0.34 (commit ccad089) |
Source of truth for the state list: src/lib/constants.ts → STATE_BASED_MARKETPLACES. Federal-30 ∪ {NY} = the "own-data" universe; everything else in STATE_BASED_MARKETPLACES redirects.
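The routing rule reads naturally as a small function. The sketch below uses a four-entry sample in place of the real `STATE_BASED_MARKETPLACES` constant (which is not reproduced here), and `routeForState` is a hypothetical name. It also assumes the input is already a valid state code.

```typescript
// Illustrative subset only: the authoritative list lives in
// src/lib/constants.ts (STATE_BASED_MARKETPLACES).
const STATE_BASED_MARKETPLACES_SAMPLE: readonly string[] = ["NY", "CA", "MA", "DC"];

type Route = "own-data" | "sbe-redirect";

// NY is a state-based marketplace but is served from our own data, so it is
// the single exception to the "SBE means redirect" rule.
function routeForState(
  state: string,
  sbeStates: readonly string[] = STATE_BASED_MARKETPLACES_SAMPLE,
): Route {
  const code = state.trim().toUpperCase();
  if (code === "NY") return "own-data";
  return sbeStates.includes(code) ? "sbe-redirect" : "own-data";
}
```

With the sample list, `routeForState("TX")` and `routeForState("NY")` both resolve to the own-data path, while `routeForState("CA")` redirects.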
The zip_county collection matches CMS at 100% for federal + NY (Tier 1 audit) and for SBE redirect docs (Tier 1.5 audit). Both results follow the Tier 0.5 USPS-completeness audit (2026-05-01) and the subsequent drive-to-100% work.
Atlas topology — two projects, two clusters, one cross-cluster path
Project isolation per ADR 0001 — project-scoped users, roles, allowlists, alerts. A mistake on staging cannot widen prod's surface. Cross-cluster reads use AWS PrivateLink per ADR 0004 — saves ~$326/mo vs duplicating reference data onto a prod M30, keeps PHI / non-PHI audit boundaries clean.
MongoDB collections
Where each collection lives
Prod cluster (askflorence-prod-01, M10 HIPAA):
- App data: `plan_years`, `plans`, `regions`, `zip_county`, `audit_log`
- Agent platform: `agent_waitlist_submissions`, `agent_survey_responses`, `agent_unsubscribes`, `agents`, `agencies`, `agent_sessions`, `admins`, `agent_audit_log`, `agent_nipr_records`, `agent_id_verifications`, `hubspot_sync_log`
Staging cluster (askflorence-staging, M30, non-PHI reference data):
- `formularies_staging`: 12,557 RxCUIs / ~30M drug-plan tuples (§1311 MRF ingest)
- `providers_staging`: 2.14M NPI provider docs
- `drug_search_index`: ENG-425 derived drug-name search read-model (~6,100 ingredient+form groups; full strength union + brand/generic names + coverage-breadth commonality). Search-only (NO coverage). Rebuilt post-ingest from `formularies_staging` by `scripts/db/derive-drug-search-index.js --apply`.
- `mrpuf_issuers_staging`: issuer reference data for §1311 reconciliation
- `plans`: staging copy of the §1311 plan reference (for audit harnesses)
Prod app reads from staging cluster read-only via getReferenceDb() (PrivateLink). The §1311 ingest pipeline runs in the staging AWS account and writes to the staging cluster only — prod picks up refreshes automatically through PrivateLink, no double-ingest.
Append-only audit log
agent_audit_log is enforced append-only at the DB role layer per ADR 0002. Both role_writer_agents and role_admin_agents have FIND + INSERT only on this collection — no UPDATE, no REMOVE. Application code cannot bypass this — the trust boundary is at the Atlas role JSON, not at the Node layer.
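To make the "role JSON is the trust boundary" point concrete, here is a sketch of how an append-only grant could be represented and checked. The type shapes are a simplified assumption modeled on Atlas custom-role privilege actions, the database name `askflorence` is assumed, and the other grants held by `role_writer_agents` are left out.

```typescript
type AtlasAction = "FIND" | "INSERT" | "UPDATE" | "REMOVE";

interface RoleGrant {
  action: AtlasAction;
  resources: { db: string; collection: string }[];
}

interface CustomRole {
  roleName: string;
  actions: RoleGrant[];
}

// Assumption: only the agent_audit_log grants are shown.
const roleWriterAgentsSketch: CustomRole = {
  roleName: "role_writer_agents",
  actions: [
    { action: "FIND", resources: [{ db: "askflorence", collection: "agent_audit_log" }] },
    { action: "INSERT", resources: [{ db: "askflorence", collection: "agent_audit_log" }] },
  ],
};

// Collect every action the role holds on one collection.
function actionsOn(role: CustomRole, db: string, collection: string): AtlasAction[] {
  return role.actions
    .filter((g) => g.resources.some((r) => r.db === db && r.collection === collection))
    .map((g) => g.action);
}

// Append-only means: can insert, but can neither update nor remove.
function isAppendOnly(role: CustomRole, db: string, collection: string): boolean {
  const granted = actionsOn(role, db, collection);
  return granted.includes("INSERT") && !granted.includes("UPDATE") && !granted.includes("REMOVE");
}
```

A check like `isAppendOnly` could run against exported role definitions in CI, so a role edit that adds `UPDATE` or `REMOVE` fails review before it reaches Atlas.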
Narrow-scoped MongoDB users
Per ADR 0003 — a compromise of any single app-tier credential is bounded by that credential's collection scope. Production retirement of the legacy app-write user (broad readWrite) is a pre-launch gate, tracked at #56.
| User | Role | Env var | Purpose |
|---|---|---|---|
| app-read | built-in read@askflorence | MONGODB_URI | Read for all paths (legacy, replaced on staging by app_read_local_staging) |
| app_read_staging | custom role_reader_reference (4 collections: formularies_staging, providers_staging, plans, mrpuf_issuers_staging; ENG-425 adds drug_search_index as a 5th; grant the live role FIND on it in lock-step with infra/atlas/access-matrix.ts BEFORE pointing /plans/Florence at /api/drugs/search) | MONGODB_REFERENCE_URI (PrivateLink endpoint) | Cross-cluster reference reads from prod into staging cluster |
| app_writer_survey | role_writer_survey (FIND/INSERT/UPDATE/REMOVE on agent_survey_responses) | MONGODB_URI_SURVEY_WRITE | Discovery survey writes |
| app_writer_waitlist | role_writer_waitlist (agent_waitlist_submissions, agent_unsubscribes) | MONGODB_URI_WAITLIST_WRITE | Waitlist + unsubscribe writes |
| app_writer_plans | role_writer_plans (plans, zip_county, regions, plan_years, audit_log + index management) | MONGODB_URI_PLANS_WRITE | Ingest scripts |
| app_writer_agents | role_writer_agents (agents, agencies, agent_sessions, FIND+INSERT only on agent_audit_log, NO access to admins) | MONGODB_URI_AGENTS_WRITE | Agent platform runtime writes |
| app_admin_agents | role_admin_agents (everything in role_writer_agents + FIND/INSERT/UPDATE/REMOVE on admins, still append-only on agent_audit_log) | MONGODB_URI_AGENTS_ADMIN | Admin promotion path |
| audit_reader | role_audit_reader (FIND on agent_audit_log only) | MONGODB_URI_AUDIT_READ | Compliance review access |
| app_admin_schema | role_admin_schema (index ensure across collections) | MONGODB_URI_ADMIN_SCHEMA | Deploy-time index ensure via in-VPC ECS RunTask (per ENG-266) |
| hubspot-sync-write | scoped writer on agents, agencies, hubspot_sync_log | MONGODB_URI_HUBSPOT_SYNC_WRITE | HubSpot ↔ Mongo bidirectional sync |
Open follow-ups:
- #56: retire `app-write` from prod cluster; rotate `app-read` + `app-write` passwords (they sat in long-lived local files).
- #166: pattern critique. The narrow-scoped user model has produced silent regressions when env vars bind to credentials that lack the consuming code path's permissions. "Broaden first, narrow with discipline" alternative under consideration.
- #271: completed. Split staging cross-cluster reader (`app_read_staging`) from staging deployment-local reader (`app_read_local_staging`) so cross-cluster role narrowing doesn't break staging's own app.
CI guard chain (per #100) catches drift in two complementary layers:
- Static check: `scripts/audit/staging-collections-guard.ts` runs on every PR; fails build if `getReferenceDb()` is called against any collection not in `STAGING_ALLOWED_COLLECTIONS`.
- Live nightly drift check: `scripts/audit/staging-cluster-drift.ts` runs at 08:00 UTC daily; audits the actual Atlas state of `app_read_staging` against the canonical 4-collection matrix; opens a P1 GitHub issue on drift.
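The comparison at the heart of the nightly drift check is a two-way set difference. The sketch below shows that step only, assuming the live grants have already been fetched from Atlas as a list of collection names; the names `diffGrants` and `DriftReport` are hypothetical and this is not the contents of `staging-cluster-drift.ts`.

```typescript
// The canonical 4-collection matrix named in this doc.
const CANONICAL_REFERENCE_COLLECTIONS: readonly string[] = [
  "formularies_staging",
  "providers_staging",
  "plans",
  "mrpuf_issuers_staging",
];

interface DriftReport {
  missing: string[];    // canonical collections the live role cannot read
  unexpected: string[]; // collections the live role can read but should not
  drifted: boolean;
}

function diffGrants(
  liveCollections: readonly string[],
  canonical: readonly string[] = CANONICAL_REFERENCE_COLLECTIONS,
): DriftReport {
  const live = new Set(liveCollections);
  const want = new Set(canonical);
  const missing = canonical.filter((c) => !live.has(c));
  const unexpected = Array.from(live).filter((c) => !want.has(c)).sort();
  return { missing, unexpected, drifted: missing.length > 0 || unexpected.length > 0 };
}
```

Either direction counts as drift: a missing grant breaks reads at runtime, while an unexpected grant widens what a leaked reference credential could see.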
Cross-cluster reference reads — Phase 11 detail
Live since 2026-05-08. Architecturally documented in ADR 0004; session log at 2026-05-08-phase-11-cross-cluster-privatelink.md.
Network posture (verified for HIPAA §164.312(e)(1) Transmission Security):
- No public internet path — traffic stays on the AWS backbone end-to-end
- Identity-bound — the endpoint connection requires the prod AWS account credentials AND the Atlas user password
- Doubly-protected encryption — AWS-backbone-only network layer + TLS 1.2+ at app layer
- Read-only: `role_reader_reference` grants only the `FIND` action on the canonical 4 collections; CI guard verifies daily
Revisit triggers (from ADR 0004):
- Staging cluster cost > $500/mo sustained for >2 months → evaluate M20 with delta refresh (#98)
- Cross-cluster read p99 > 250ms → evaluate Path A (duplicate to prod cluster)
- Auditor flags cross-cluster path under EDE Phase 3 → migrate both clusters to Atlas for Government (same architecture transfers; PrivateLink stays)
- Any PHI ever needs to land on staging cluster → immediate cutover (the CI guard at #100 is the early-warning system)
Doctor + Rx coverage layer
Pivot decision 2026-05-03 (docs/decisions/2026-05-03-pivot-cms-api-direct.md): the consumer-facing coverage flow calls CMS Marketplace API at query time, not the §1311 owned-data mirror.
CMS API limits (verified 2026-05-03):
| Limit | Value | Source |
|---|---|---|
| Rate limit (per second) | 200 RPS | x-ratelimit-limit-second header |
| Rate limit (per minute) | 1,000 / min | x-ratelimit-limit-minute header |
| Per-call max (NPIs × plans) | 100 combos | tested |
| Per-call max NPIs alone | 10 | tested |
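A request builder has to respect both per-call ceilings in the table at once: at most 10 NPIs, and at most 100 NPI × plan combinations. One way to do that is sketched below. `planCoverageCalls` and `CoverageCall` are hypothetical names, and the sketch says nothing about how the real proxy routes batch their calls.

```typescript
const MAX_NPIS_PER_CALL = 10;
const MAX_COMBOS_PER_CALL = 100;

interface CoverageCall {
  npis: string[];
  planIds: string[];
}

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Group NPIs first (max 10), then fit as many plans per call as the
// 100-combination ceiling allows for that NPI group size.
function planCoverageCalls(npis: readonly string[], planIds: readonly string[]): CoverageCall[] {
  const calls: CoverageCall[] = [];
  for (const npiGroup of chunk(npis, MAX_NPIS_PER_CALL)) {
    const plansPerCall = Math.floor(MAX_COMBOS_PER_CALL / npiGroup.length);
    for (const planGroup of chunk(planIds, plansPerCall)) {
      calls.push({ npis: npiGroup, planIds: planGroup });
    }
  }
  return calls;
}
```

For 12 NPIs and 25 plans this produces four calls: three for the first ten NPIs (10, 10, and 5 plans) and one for the remaining two NPIs with all 25 plans.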
Open Enrollment scale ceiling: the single-key 1,000/min limit binds before per-second. Mitigations under design:
1. Multiple CMS API keys rotated server-side (each carries its own 1,000/min budget)
2. Server-side coverage cache on `(plan_id, npi)` and `(plan_id, rxcui)` with hours-long TTL
3. CloudFront edge cache on proxy routes
4. Tier the coverage check: top 3 plans immediately, lazy-load the rest
5. Fall back to §1311 ingested data behind a feature flag at high load (99.94% accuracy, degraded but live)
Mitigations 1–3 should land before OE 2027. Mitigation 5 is the long-term bridge if §1311 ingest resumes.
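For scale, 1,000 requests per minute is roughly 16.7 per second, far below the 200 RPS ceiling, which is why the per-minute budget is the one that binds. The first mitigation, rotating several keys, can be sketched as a round-robin with a per-key counter. The class below is a hypothetical illustration: it assumes a fixed 60-second window per key, which may not match how CMS actually meters requests.

```typescript
interface KeyWindow {
  key: string;
  windowStartMs: number;
  used: number;
}

class KeyRotator {
  private readonly windows: KeyWindow[];
  private readonly perMinute: number;
  private next = 0;

  constructor(keys: readonly string[], perMinute: number = 1000) {
    if (keys.length === 0) throw new Error("at least one API key is required");
    this.perMinute = perMinute;
    this.windows = keys.map((key) => ({ key, windowStartMs: 0, used: 0 }));
  }

  // Returns a key with budget left in its current minute, or null when every
  // key is exhausted (the caller should then serve from cache or back off).
  acquire(nowMs: number): string | null {
    for (let i = 0; i < this.windows.length; i++) {
      const w = this.windows[(this.next + i) % this.windows.length];
      // Assumption: fixed-window accounting; reset once a minute has passed.
      if (nowMs - w.windowStartMs >= 60_000) {
        w.windowStartMs = nowMs;
        w.used = 0;
      }
      if (w.used < this.perMinute) {
        w.used += 1;
        this.next = (this.next + i + 1) % this.windows.length;
        return w.key;
      }
    }
    return null;
  }
}
```

A `null` result is the natural hook for the other mitigations: serve from the coverage cache, or switch to the §1311 fallback behind its feature flag.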
Email — AWS SES
Email-provider abstraction at src/lib/email.ts. Single typed sendEmail() API; provider hardcoded to AWS SES post-v0.33.0 (Resend retired).
| Surface | Path |
|---|---|
| Waitlist confirmations | /api/waitlist → sendEmail() → AWS SES |
| Agent waitlist | /api/waitlist (with interest: "agent") → SES |
| Agent discovery survey | /api/agents/discovery → SES |
| Resume email (partial save) | inline from partial-save handler → SES |
| 15-min reminder | EventBridge Scheduler → Lambda → SES (per ADR 0005) |
| Unsubscribe | /api/unsubscribe with HMAC token |
DKIM + SPF + DMARC verified on askflorence.health and stage.askflorence.health. SES sends from hello@updates.askflorence.health. Production access enabled (50,000 messages / 24h, 14 msg/sec).
Email brand polish per ENG-243: CAN-SPAM compliance + wordmark + em-dash sweep applied to all transactional templates.
Delayed jobs — EventBridge Scheduler + Lambda + SQS
Per ADR 0005 — AWS-native primitives for "execute work at a specific later time, conditional on intermediate state."
Applied:
- 15-min discovery survey reminder (ENG-242) — wired 2026-05-09
- 24h / 72h / 7d marketing nudges — same pattern, different window
Future: agent activation email (Phase 5), renewal alerts, personalized nudges from Florence AI.
Rejected alternatives: Inngest (BAA enterprise-only, $500-1,500/mo+), Trigger.dev managed cloud (no HIPAA), Convex (BAA enterprise-only), Temporal Cloud ($100-500/mo floor — too heavy for our stage). The AWS-native path uses services already on the AWS Org BAA.
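The "conditional on intermediate state" part of this pattern is the check the Lambda makes when the schedule fires. A minimal sketch of that decision for the 15-minute reminder follows. The state fields and function name are assumptions for illustration (in particular, that reminders respect unsubscribe state); ADR 0005 remains the authority on the real handler.

```typescript
// Hypothetical view of the state the handler would load when it fires.
interface SurveyState {
  startedAtMs: number;
  completedAtMs: number | null;
  unsubscribed: boolean; // assumption: reminders respect unsubscribe state
}

type ReminderDecision =
  | { send: true }
  | { send: false; reason: "completed" | "unsubscribed" | "too_early" };

const REMINDER_DELAY_MS = 15 * 60 * 1000;

// Re-check state at fire time: the schedule was created 15 minutes ago and
// the user may have finished or unsubscribed in the meantime.
function decideReminder(state: SurveyState, firedAtMs: number): ReminderDecision {
  if (state.completedAtMs !== null) return { send: false, reason: "completed" };
  if (state.unsubscribed) return { send: false, reason: "unsubscribed" };
  if (firedAtMs - state.startedAtMs < REMINDER_DELAY_MS) {
    return { send: false, reason: "too_early" };
  }
  return { send: true };
}
```

Keeping the decision a pure function of state and fire time makes it easy to unit-test, and the same shape would extend to the 24h / 72h / 7d windows by swapping the delay constant.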
Florence AI (planned, Phase 1 reserved)
Foundation in place per v0.19.0 (2026-04-21). Phase 1 ships on Amazon Bedrock from day one (under AWS Org BAA + FedRAMP Moderate inheritance). The direct Anthropic API will not be called from prod.
- VPC endpoint for Bedrock Runtime is provisioned in prod VPC (Phase 4 reservation)
- IAM role permissions for `bedrock:InvokeModel` are stubbed in the ECS task role policy
- Secrets Manager has `prod/anthropic-api-key` and `prod/openai-whisper-api-key` slots reserved from Phase 4; both remain empty (unused; we don't call those services in prod)
- Architecture brief: `docs/briefs/BRIEF_FLORENCE_AI_ARCHITECTURE.md`
- Research track: #61 (Linear: ENG-204)
Voice (if/when it ships): all three AWS-hosted options stay open (AWS Transcribe managed ASR, self-hosted Whisper on SageMaker / ECS / EC2, Bedrock Marketplace if available). All under the AWS BAA. No direct OpenAI API path.
Compliance posture
Observability + control evidence running since Phase 2 (2026-04-17):
| Control | Service | Status |
|---|---|---|
| Audit trail (org-wide) | CloudTrail org trail → log-archive S3, Object Lock 7-year | ✅ since 2026-04-17 |
| Configuration drift | AWS Config + aggregator on log-archive | ✅ since 2026-04-17 |
| Threat detection | GuardDuty org-wide (S3 data events, EBS malware, ECS Fargate runtime) | ✅ since 2026-04-17 |
| Continuous compliance | Security Hub org-wide (CIS + AWS FSBP + HIPAA conformance pack) | ✅ since 2026-04-17 |
| WAF | WAFv2 managed rule sets + rate limit + scope-down exemptions for PostHog /ingest/* + crawler UAs | ✅ since Phase 6 |
| VPC Flow Logs | exported to log-archive | ✅ since Phase 4 |
| Encryption at rest | CMK per account; Secrets Manager + KMS-encrypted; Atlas AES-256 default | ✅ |
| Encryption in transit | TLS 1.2+ throughout; PrivateLink for cross-cluster (HIPAA §164.312(e)(1)) | ✅ |
| SSO / no long-lived keys | OIDC federation for GitHub Actions; SSO Identity Center for human access | ✅ |
| Root MFA | virtual MFA on all 4 account roots; SCP carve-out limits root actions | ✅ since 2026-04-18 |
| Vendor BAAs | see docs/security-compliance/vendor-register.md | active — Atlas signed-copy PDF + HubSpot DPA pending (#145, vendor register) |
EDE Phase 3 / SOC 2 Type II / HIPAA framework mapping is in the Phase 12 compliance docs (foundation landed 2026-05-12 per #71).
EDE readiness — current row-by-row
| Requirement | Status | Notes |
|---|---|---|
| HIPAA BAA chain (compute + data) | ✅ | AWS Org BAA 2026-04-18; Atlas HIPAA-tier active |
| FedRAMP Moderate boundary | ✅ | commercial us-east-1; Bedrock + Transcribe + SageMaker in scope |
| Identity verification | ❌ | Phase 5 agent platform — vendor TBD |
| CMS Marketplace API integration | ✅ | proxied through our routes; opacity layer enforced |
| Hardware MFA on root + privileged | ⚠️ | virtual MFA today; YubiKey enrollment pending (#67) |
| Vendor BAA register | ✅ | docs/security-compliance/vendor-register.md is canonical |
| Audit trail | ✅ | org CloudTrail + agent_audit_log append-only |
| Pen test | ❌ | annual scheduled — vendor selection in flight (#143) |
| Compliance automation | ❌ | Drata / Vanta / Sprinto selection in flight (#142) |
CI/CD
- GitHub Actions with OIDC federation per AWS account — zero long-lived keys
- Branch `staging` → deploys to AWS staging account (stage.askflorence.health)
- Branch `main` → deploys to AWS prod account (askflorence.health) with environment-protection approval gate
- Container build: multi-stage Dockerfile, Next.js standalone output, non-root user
- Image scan: ECR enhanced scanning + Inspector
- Pre-deploy CI guards:
  - `validate-secrets.yml`: fail build on whitespace/newline/empty bug class (the Resend `\n` incident)
  - `staging-collections-guard.yml`: fail build if `getReferenceDb()` is called against a collection not in `STAGING_ALLOWED_COLLECTIONS`
  - GET-handler discipline (#109): fail build on side-effect-triggering GET endpoints without auth gate
- Post-deploy smoke: in-progress per #156 — expand smoke + add PR-level full-flow tests after 3 silent-failure incidents (ENG-242, 272, 274)
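The bug class the `validate-secrets.yml` guard targets (empty values, stray newlines, surrounding whitespace) is simple to check for. The function below is a sketch of such a check, not the workflow's actual script, and `secretProblems` is a hypothetical name. Messages name the secret but never echo its value.

```typescript
// Returns a list of problems for one secret; an empty list means it passes.
function secretProblems(name: string, value: string | undefined): string[] {
  const problems: string[] = [];
  if (value === undefined || value.length === 0) {
    problems.push(`${name}: empty or missing`);
    return problems;
  }
  // A trailing "\n" pasted along with a key is the classic failure here.
  if (/[\r\n]/.test(value)) problems.push(`${name}: contains a newline`);
  if (value !== value.trim()) problems.push(`${name}: leading or trailing whitespace`);
  return problems;
}
```

A CI step could run this over every secret it is about to deploy and fail the build if any list is non-empty.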
Cost (actuals, 2026-05)
| Component | Monthly | Notes |
|---|---|---|
| Atlas prod cluster (M10 HIPAA) | $56 | App data, PHI-eligible |
| Atlas staging cluster (M30) | $382 | §1311 reference data, non-PHI |
| AWS ECS Fargate (prod, 2 tasks) | ~$30 | Multi-AZ, on-demand |
| AWS ECS Fargate (staging, 1 task FARGATE_SPOT) | ~$10 | |
| AWS NAT Gateways (prod, 2 AZ) | ~$65 | $0.045/h × 2 |
| AWS NAT Gateways (staging) | $0 | VPC Endpoints instead — flat $35/mo |
| VPC Endpoints (Secrets, KMS, ECR, Logs, S3 Gateway, PrivateLink) | ~$45 prod + ~$35 staging | |
| AWS PrivateLink endpoint (cross-cluster) | ~$10 | Interface endpoint + data egress |
| CloudFront prod (PriceClass_All) | ~$15 | Volume-dependent |
| CloudFront staging (PriceClass_100) | ~$3 | |
| WAFv2 per env | ~$10 | $5/web-ACL + rule cost |
| Secrets Manager | ~$5 | $0.40/secret × ~12 secrets |
| KMS CMK | ~$3 | $1/key × 3 |
| Observability (CloudTrail / Config / GuardDuty / Security Hub / WAF logs) | ~$15 | Org-wide |
| SES | ~$0.10/1K messages | At current volume: ~$2 |
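| Total | ~$690/mo | Pre-launch / pre-OE scale; sum of the rows above, of which Atlas is ~$438 |
vs Path A (duplicate reference data onto prod M30): Atlas spend would have been ~$764/mo (prod and staging both on M30 at $382) versus ~$438/mo today. ADR 0004 saves $326/mo recurring.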
Scaling forward at OE 2027: M20 prod, M30 staging (delta-aware refresh per #98), Fargate 4 tasks. Projected ~$800/mo.
Migration path — done / in progress / planned
Done
- [x] Phase 1 (2026-04-17) — AWS Organizations + 4 accounts + SSO + SCPs + budgets
- [x] Phase 2 (2026-04-17) — Org observability baseline (CloudTrail / Config / GuardDuty / Security Hub)
- [x] Phase 3 (2026-04-18) — Terraform scaffolding + OIDC federation
- [x] Phase 4 (2026-04-20) — Staging VPC + KMS + Secrets Manager + ACM cert
- [x] Phase 5 (2026-04-21) — Staging ECS + ALB + ECR + first deploy (session log)
- [x] Phase 6 (2026-04-22) — Staging CloudFront + WAF + DNS (session log)
- [x] Phase 7 (2026-04-22) — Staging Atlas VPC peering + CI/CD (session log)
- [x] Phase 8 (2026-04-22) — Prod account mirror (session log)
- [x] Phase 9 (2026-04-23) — Prod canary validation
- [x] Phase 10 (2026-04-23) — DNS cutover, Vercel retired (session log)
- [x] Phase 11 (2026-05-08) — Cross-cluster Atlas reads via AWS PrivateLink (session log, ADR 0004)
- [x] Phase D (2026-05-08) — Provider-network-tier fallback via cross-cluster (session log)
- [x] Resend retired (2026-04-30) — SES-only email
- [x] Tier 0 / Tier 0.5 / Tier 1 / Tier 1.5 ZIP audits — all at TRUE 100% match vs CMS
In progress
- Phase 5 agent platform (auth, NIPR, ID verify) — #54
- Hardware MFA (YubiKey) enrollment — #67
- PostHog → OpenPanel + GlitchTip self-hosted + observability — sub-A shipped (#75); build at #342 / ENG-347 (ADR 0009)
- Atlas IP allowlist cleanup (drop Vercel-era CIDRs) — #141
- MongoDB Atlas signed BAA PDF collection — #145
- Delta-aware MRF refresh pipeline — #98 / ENG-236
- Compliance automation vendor selection (Drata / Vanta / Sprinto) — #142
- Annual pen test vendor selection — #143
- Mongo scoped-user simplification pattern review — #166
- Post-deploy smoke + PR-level full-flow tests — #156
Planned
- Atlas commercial → Atlas for Government migration at EDE Phase 3 cutover (~September 2026)
- Florence AI Phase 1 on Bedrock — #61
- Member portal — #44
- Bedrock multi-region on staging to unblock Florence A0 — #62
IP opacity — hard rules enforced across the stack
A competitor inspecting network traffic, HTML, headers, or response bodies should not be able to identify our data sources. Enforced at multiple layers:
- Every data call goes through `/api/*` on `askflorence.health`: no browser ever calls `marketplace.api.healthcare.gov`, Atlas, CMS, any SBE source, or any future vendor directly. Future integrations that physically cannot work that way (e.g., redirect-based ID verification) are documented exceptions and live on subdomains of our domain wherever possible.
- Upstream response headers stripped at the edge: CloudFront response-headers policy removes `Server`, `X-Powered-By`, `Via`, `X-Amz-*`, `X-Cache`; overrides `Server: AskFlorence`.
- Response bodies scrubbed: `data_source` field removed from client-facing responses; no CMS / Healthcare.gov URLs in JSON; carrier-hosted PDFs (SBC, formulary) proxied through `/api/docs/[id]` with `askflorence.health` URLs.
- Generic error messages: API routes never surface upstream error text. 5xx returns `{ error: "temporary_upstream_issue", request_id }`, with detail in CloudWatch only.
- No source maps in production builds: Dockerfile sets `GENERATE_SOURCEMAP=false`.
- Security headers via CloudFront response-headers policy: strict CSP, X-Content-Type-Options nosniff, Referrer-Policy strict-origin-when-cross-origin, X-Frame-Options DENY, HSTS with preload.
- No `_cms` / `_healthcare_gov` / `_nysoh` strings in public JS chunks, CSS classes, or route names.
- Future vendor flows (Phase 5 ID verify, NIPR, etc.) must follow the same rule: proxied through our API by default.
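Two of the rules above, generic error bodies and `data_source` scrubbing, lend themselves to small pure helpers. The sketch below is one possible shape: the function names are hypothetical, the log-entry fields are assumed, and actually writing the log entry to CloudWatch is out of scope here.

```typescript
interface ClientError {
  error: "temporary_upstream_issue";
  request_id: string;
}

interface ErrorLogEntry {
  request_id: string;
  upstream_detail: string;
}

// Split one upstream failure into what the client sees and what gets logged.
// The upstream text goes to the log entry only, never into the body.
function toClientError(
  upstreamDetail: string,
  requestId: string,
): { body: ClientError; log: ErrorLogEntry } {
  return {
    body: { error: "temporary_upstream_issue", request_id: requestId },
    log: { request_id: requestId, upstream_detail: upstreamDetail },
  };
}

// Recursively drop every `data_source` field from a JSON-like value.
function stripDataSource(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => stripDataSource(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      if (key === "data_source") continue;
      out[key] = stripDataSource(inner);
    }
    return out;
  }
  return value;
}
```

The shared `request_id` is what lets support correlate a client report with the CloudWatch entry without exposing upstream details.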
Recent architecture decisions
| ADR | Decision | Status |
|---|---|---|
| 0001 | Atlas project isolation for staging vs prod | Accepted 2026-04-17 |
| 0002 | agent_audit_log enforced append-only at DB role layer | Accepted 2026-04-17 |
| 0003 | Narrow-scoped MongoDB users per Issue #56 | Accepted 2026-04-17 (pattern critique in flight per #166) |
| 0004 | Cross-cluster Atlas reads from prod via AWS PrivateLink | Accepted 2026-05-08, amended 2026-05-11 (ENG-257 — 4-collection canonical scope) |
| 0005 | Delayed-job architecture: EventBridge Scheduler + Lambda + SES (AWS-native) | Accepted 2026-05-09 |
Cross-references
- Consumer & Agent Flow — end-to-end journey from quote to enrollment (Agent terminology applied)
- Security & Compliance — auditor entry point: policies, control mappings, runbooks
- MongoDB Setup Runbook — cluster details, indexes, peering, PrivateLink
- AWS Setup Runbook — deploy, rollback, rotate, scale
- Pivot decision — CMS API direct — why doctor + Rx coverage calls CMS at query time
- Vendor / subprocessor register — canonical BAA / DPA / FedRAMP source of truth
- Session log index — what each agent session shipped
- Infrastructure change log — timestamped change history for SOC 2 CC8.1 evidence