Data Classification Policy
Status: Active. Last updated April 12, 2026. Purpose: SOC 2 evidence for CC6.1 (Logical Access), CC6.5 (Data Protection), A1.2 (Availability)
Classification Levels
| Level | Definition | Examples | Encryption | Retention |
|---|---|---|---|---|
| Public | No restrictions. Intentionally published. | Plan names, metal levels, issuer names, premium amounts | At rest (AES-256) | Indefinite |
| Internal | Business-sensitive. Not for external sharing. | SLCSP calculations, data source URLs, API keys | At rest + in transit (TLS) | Duration of use |
| PII | Personally identifiable information. | Email, name, phone, address | At rest + in transit + field-level (CSFLE) | Per purpose + 7yr audit |
| PHI | Protected health information (HIPAA). | SSN, DOB, income (with health context), enrollment records | At rest + in transit + field-level (CSFLE + KMS) | Per purpose + 7yr audit |
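
The table above can also be kept as a machine-readable lookup so code and policy stay in step. The following is a minimal sketch, not the project's actual implementation: the names `LEVELS`, `HANDLING`, and `strictest` are hypothetical, and two things are assumptions rather than statements from this policy: the ordering of levels from least to most sensitive, and the rule that mixed data takes the most sensitive level.

```typescript
// Classification levels from the table, assumed ordered least to most sensitive.
const LEVELS = ["Public", "Internal", "PII", "PHI"] as const;
type Level = (typeof LEVELS)[number];

interface Handling {
  encryption: string[]; // controls named in the Encryption column
  retention: string; // text of the Retention column
}

// Direct transcription of the Encryption and Retention columns.
const HANDLING: Record<Level, Handling> = {
  Public: { encryption: ["at-rest"], retention: "Indefinite" },
  Internal: { encryption: ["at-rest", "in-transit"], retention: "Duration of use" },
  PII: {
    encryption: ["at-rest", "in-transit", "csfle"],
    retention: "Per purpose + 7yr audit",
  },
  PHI: {
    encryption: ["at-rest", "in-transit", "csfle", "kms"],
    retention: "Per purpose + 7yr audit",
  },
};

// Assumption: a record mixing several levels is handled at the most sensitive one.
// An empty list falls back to "Public".
function strictest(levels: Level[]): Level {
  return levels.reduce<Level>(
    (max, level) => (LEVELS.indexOf(level) > LEVELS.indexOf(max) ? level : max),
    "Public",
  );
}
```

For example, `strictest(["Public", "PII", "Internal"])` evaluates to `"PII"`, and `HANDLING[strictest(...)]` then gives the controls that record would need under this sketch.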
Collection Classification
Phase 1 Collections (Active)
| Collection | Classification | Contains PII/PHI? | Encryption | Retention | Access |
|---|---|---|---|---|---|
| plan_years | Public | No | At rest (Atlas default) | Per plan year (keep all years) | app-read, app-write |
| plans | Public | No | At rest (Atlas default) | Per plan year (keep all years) | app-read, app-write |
| regions | Public | No | At rest (Atlas default) | Per plan year (keep all years) | app-read, app-write |
| zip_county | Public | No | At rest (Atlas default) | Indefinite (geographic data) | app-read, app-write |
| audit_log | Internal | May contain IP addresses | At rest (Atlas default) | 7 years (TTL index) | audit-write (insert), admin (read) |
Key: Phase 1 collections contain NO PII or PHI; the only potentially identifying data is the IP addresses that audit_log may record. All plan data is publicly available information from government sources (DFS filings, marketplace data, CMS PUF).
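
The 7-year TTL index on audit_log translates into a concrete `expireAfterSeconds` value. The sketch below works through that arithmetic and shows the general shape of a MongoDB TTL index definition. It makes two assumptions that this policy does not state: a year is counted as 365 days (leap days ignored), and the indexed date field is called `createdAt`, which is a hypothetical name.

```typescript
// Assumption: one year = 365 days; leap days are ignored.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400
const AUDIT_RETENTION_YEARS = 7;

// 7 * 365 * 86,400 = 220,752,000 seconds
const auditTtlSeconds = AUDIT_RETENTION_YEARS * 365 * SECONDS_PER_DAY;

// A TTL index is a single-field index on a date field plus expireAfterSeconds.
// "createdAt" is a hypothetical field name, not taken from this policy.
const auditLogTtlIndex = {
  key: { createdAt: 1 },
  options: { expireAfterSeconds: auditTtlSeconds },
};
```

With the MongoDB Node.js driver, a definition like this would typically be applied with `collection.createIndex(auditLogTtlIndex.key, auditLogTtlIndex.options)`. MongoDB removes expired documents in a background task, so deletion happens shortly after the 7-year mark, not at the exact second.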
Cross-cluster reference collections (live on staging Atlas, read by prod via AWS PrivateLink — Phase 11)
| Collection | Classification | Contains PII/PHI? | Encryption | Retention | Access |
|---|---|---|---|---|---|
| formularies_staging | Public | No (CMS §1311 MRF formulary data: RxCUI → plan tier mappings) | At rest (Atlas default) + TLS in transit + AWS PrivateLink (network layer) | Per plan year | app_read_staging (read-only, prod) + ingest pipeline (write, staging account) |
| providers_staging | Public | No (NPPES public NPI directory: provider name, NPI, specialty, network membership) | At rest (Atlas default) + TLS in transit + AWS PrivateLink (network layer) | Per refresh cycle | app_read_staging (read-only, prod) + ingest pipeline (write, staging account) |
Where these live + read path: these collections live ONLY on the staging Atlas cluster (askflorence-staging, project_id 69e31af12fd2c0aef51bbb41). The prod app (askflorence.health) reads them via AWS PrivateLink endpoint vpce-0c81aea11e29bb928 using the read-only app_read_staging Atlas user. The §1311 ingest pipeline writes them from the staging AWS account; nothing on prod ever writes to these collections.
Why the staging cluster, not the prod cluster: this keeps the prod cluster on the M10 HIPAA tier ($56/mo) instead of upgrading it to M30 ($382/mo) to handle the 2.14M-doc + 30M-tuple footprint. It saves ~$326/mo recurring while keeping prod's audit boundary clean (only PHI processing on the prod cluster). See ADR 0004 for the full decision.
Drift guard: #100 / ENG-239. Two-phase enforcement, both shipped:
- Phase 1 (static CI guard), shipped 2026-05-08: scripts/audit/staging-collections-guard.ts enforces the data-classification contract at PR time. It fails the build if any getReferenceDb() call references a collection not on STAGING_ALLOWED_COLLECTIONS (defined in src/lib/db.ts). Workflow at .github/workflows/staging-collections-guard.yml. The allow-list is duplicated in the script (defense-in-depth).
- Phase 2 (live nightly drift check), shipped 2026-05-09: scripts/audit/staging-cluster-drift.ts audits the actual Atlas state of app_read_staging (the cross-cluster reader) at 08:00 UTC daily via .github/workflows/staging-cluster-drift.yml. It verifies that the user has exactly one custom role (role_reader_reference@admin) granting only FIND on exactly the expected 2 collections (formularies_staging + providers_staging), and opens a P1 GitHub issue on drift. As part of Phase 2 the user's role was tightened from built-in read@askflorence (whole-DB scope) to the per-collection custom role role_reader_reference; prod cross-cluster reads were verified to remain healthy after the tightening.
Together these protect the classification claim above: Phase 1 catches code-level drift at PR time; Phase 2 catches runtime drift (privilege escalation via Atlas Admin UI, out-of-band role changes, etc.).
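
The two checks described above can be sketched as pure functions. This is an illustration of the idea only, not the contents of the actual guard scripts. The function names `findDisallowedCollections` and `findDrift`, the `RoleGrant` and `Privilege` shapes, and the assumption that calls look like `getReferenceDb().collection("name")` are all hypothetical; the real scripts and the Atlas API responses may be structured differently.

```typescript
// Allow-list from the policy: the only collections read across clusters.
const STAGING_ALLOWED_COLLECTIONS = ["formularies_staging", "providers_staging"];

// Phase 1 idea: scan source text for collection names used with getReferenceDb().
// Assumption: call sites look like getReferenceDb().collection("name").
function findDisallowedCollections(source: string): string[] {
  const pattern = /getReferenceDb\(\)\s*\.collection\(\s*["']([^"']+)["']/g;
  const disallowed: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const name = match[1];
    if (name !== undefined && STAGING_ALLOWED_COLLECTIONS.indexOf(name) === -1) {
      disallowed.push(name);
    }
  }
  return disallowed;
}

// Hypothetical, simplified view of what a role audit might return.
interface Privilege {
  collection: string;
  actions: string[];
}
interface RoleGrant {
  roleName: string;
  db: string;
  privileges: Privilege[];
}

// Phase 2 idea: exactly one role, granting only FIND on exactly the allowed collections.
// Returns a list of problems; an empty list means no drift was detected.
function findDrift(roles: RoleGrant[]): string[] {
  const role = roles[0];
  if (
    roles.length !== 1 ||
    role === undefined ||
    role.roleName !== "role_reader_reference" ||
    role.db !== "admin"
  ) {
    return ["expected exactly one role: role_reader_reference@admin"];
  }
  const problems: string[] = [];
  for (const p of role.privileges) {
    if (STAGING_ALLOWED_COLLECTIONS.indexOf(p.collection) === -1) {
      problems.push("unexpected collection: " + p.collection);
    }
    if (p.actions.length !== 1 || p.actions[0] !== "FIND") {
      problems.push("unexpected actions on " + p.collection + ": " + p.actions.join(","));
    }
  }
  for (const expected of STAGING_ALLOWED_COLLECTIONS) {
    if (!role.privileges.some((p) => p.collection === expected)) {
      problems.push("missing grant: " + expected);
    }
  }
  return problems;
}
```

In a CI or nightly job, a non-empty result from either function would be the signal to fail the build or open an issue. Returning a list of problems instead of a boolean keeps the failure message specific about what drifted.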
Phase 2 Collections (Future — Not Yet Created)
| Collection | Classification | Contains PII/PHI? | Encryption | Retention | Access |
|---|---|---|---|---|---|
| consumers | PHI | Yes (SSN, name, DOB, address) | At rest + CSFLE + KMS | Per purpose + 7yr audit trail | Scoped (per-consumer access) |
| enrollments | PHI | Yes (links consumer to health plan) | At rest + CSFLE | Per purpose + 7yr audit trail | Broker (assigned only), consumer (own) |
| broker_assignments | Internal | No (broker business info only) | At rest | Duration of relationship | Admin |
Phase 2 requires: MongoDB Client-Side Field Level Encryption (CSFLE) with AWS KMS before these collections are created. See docs/security-compliance/encryption-policy.md for the encryption policy + CSFLE roadmap.
Data Flow Classification
| Data Flow | Classification | Handling |
|---|---|---|
| User enters zip + age + income | Not stored | Stateless; used for calculation only; not persisted |
| Plan search results | Public | Returned to client; no PII |
| Waitlist email submission | PII | Stored via Resend API; not in MongoDB |
| Enrollment application (future) | PHI | Field-level encrypted in MongoDB; audit logged |
| Broker view of consumer data (future) | PHI access event | Decrypted on-demand; time-limited session; audit logged |
Source File Classification
| Source | Classification | Storage | Retention |
|---|---|---|---|
| DFS Final Exhibit ZIPs | Public (government filings) | S3 + local backup | Indefinite |
| NYSOH scraped HTML | Public (public marketplace data) | S3 + local backup | Indefinite |
| CMS PUF CSVs | Public (government data) | S3 + local backup | Indefinite |
| Official NY documents (PDFs) | Public | S3 + local backup | Indefinite |
| Data ingestion manifests | Internal | S3 (with source file checksums) | Indefinite |
Role-to-Collection Access Matrix
| Role | plan_years | plans | regions | zip_county | audit_log |
|---|---|---|---|---|---|
| app-read | Read | Read | Read | Read | — |
| app-write | Read/Write | Read/Write | Read/Write | Read/Write | — |
| audit-write | — | — | — | — | Insert only |
| Atlas admin | Full | Full | Full | Full | Full |
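
The matrix above can be expressed as a typed lookup, which is handy for unit-testing access expectations. This is a sketch under stated assumptions, not the project's authorization code: the names `MATRIX` and `isAllowed` are hypothetical, "Full" is modeled as read plus write, and "write" is assumed to include inserts while "Insert only" does not include reads or updates.

```typescript
type Collection = "plan_years" | "plans" | "regions" | "zip_county" | "audit_log";
type Role = "app-read" | "app-write" | "audit-write" | "atlas-admin";
type Action = "read" | "write" | "insert";

const READ: Action[] = ["read"];
const READ_WRITE: Action[] = ["read", "write"];
const INSERT_ONLY: Action[] = ["insert"];
const FULL: Action[] = ["read", "write"]; // assumption: "Full" = read + write

// Transcription of the matrix; a missing entry corresponds to a dash (no access).
const MATRIX: Record<Role, Partial<Record<Collection, Action[]>>> = {
  "app-read": { plan_years: READ, plans: READ, regions: READ, zip_county: READ },
  "app-write": {
    plan_years: READ_WRITE,
    plans: READ_WRITE,
    regions: READ_WRITE,
    zip_county: READ_WRITE,
  },
  "audit-write": { audit_log: INSERT_ONLY },
  "atlas-admin": {
    plan_years: FULL,
    plans: FULL,
    regions: FULL,
    zip_county: FULL,
    audit_log: FULL,
  },
};

// Deny by default: anything not listed in the matrix is refused.
function isAllowed(role: Role, collection: Collection, action: Action): boolean {
  const granted = MATRIX[role][collection] ?? [];
  if (action === "insert") {
    // Assumption: a "write" grant covers inserts as well.
    return granted.indexOf("insert") !== -1 || granted.indexOf("write") !== -1;
  }
  return granted.indexOf(action) !== -1;
}
```

For instance, `isAllowed("audit-write", "audit_log", "insert")` is `true` while `isAllowed("audit-write", "audit_log", "read")` is `false`, which matches the "Insert only" cell in the table.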
SOC 2 Control Mapping
| Control | Evidence |
|---|---|
| CC6.1 (Logical Access) | Role-to-collection matrix, minimum necessary access |
| CC6.5 (Data Protection) | Classification levels, encryption requirements per level |
| A1.2 (Availability) | Retention policies, backup configuration |
| P6.1 (Privacy — Data Use) | Data flow classification, "not stored" for anonymous queries |