CA Phase C/D — provider + drug coverage ingestion playbook
Linear: ENG-395 under M7 (CA Phase C/D).
Status (2026-05-26): discovery complete, methodology validated, ready to execute. POC PDF parse pending.
Reusable for other SBEs: the patterns documented below — CalHEERS-style anonymous APIs, SBP-standardized formulary PDFs, Symphony-style statewide provider directory utilities — are common across SBE states. NY has its own equivalents (NYSOH + DFS); WA / OR / CO / CT each have similar structures. Treat this playbook as the template for per-state execution.
Architecture overview
┌──────────────────────────────────────────────┐
│ CalHEERS plan-detail endpoint │
│ (gethealthplansbyids, anon, needs handOffId)│
│ → formularyUrl + providerUrl + tier copays │
└──────────────┬───────────────────────────────┘
│
one POST per CA rating area (~19)
│
v
┌───────────────────────────────────────┐
│ scripts/db/data/ca-plan-documents-2026.json │
│ map: planNumber → {formularyUrl, copays, ...} │
└────────┬──────────────────┬───────────┘
│ │
│ v
│ ┌────────────────────────────┐
│ │ Carrier formulary PDFs │
│ │ (~15-20 unique URLs) │
│ │ Molina, Kaiser, Anthem... │
│ └─────────┬──────────────────┘
│ │
│ v parse (pdftotext -layout + regex)
│ │
│ ┌────────────────────────────┐
│ │ formularies_staging │
│ │ {rxcui, planIds[], tier, │
│ │ requirements} │
│ └────────────────────────────┘
│
v
┌────────────────────────────────────┐
│ Update prod CA plan docs with │
│ puf.formularyId + puf.networkId │
│ (NEW additive fields only) │
└─────┬──────────────────────────────┘
│
v
┌─────────────────────────────────────┐
│ CalHEERS anon provider endpoint │
│ (getproviderdetails, no auth) │ ◄── Symphony Provider Directory (IHA/Availity)
│ → 16,944 SF providers in one call │ (SB 137 statewide source-of-truth)
└────────────────┬────────────────────┘
│
v in-VPC proxy (Fargate or Next.js route handler)
│
┌─────────────────────────────────────┐
│ providers_staging │ ◄── optional cache; the live proxy may be enough
│ {npi, name, specialty, networkIds[]} │
└─────────────────────────────────────┘
│
v
┌─────────────────────────────────────┐
│ /api/providers/search (CA branch) │
│ /api/drugs/search (CA branch) │
└─────────────────────────────────────┘
│
v
┌─────────────────────────────────────┐
│ <CoveragePanel /> ⚕ + ℞ live for CA │
│ Plan-card coverage pills │
└─────────────────────────────────────┘
Hard constraints
- PROD CA plan collection is touched ONLY for `puf.formularyId` + `puf.networkId` foreign-key population (Phase 4). All formulary + provider data lives in staging Mongo (cost design: formularies + providers are 10+ GB; prod uses PrivateLink to staging for these queries per ADR 0004).
- Existing FFM drug + provider entries must not be touched. `providers_staging` (2.14M docs, 9.27 GB) and `formularies_staging` (12.5K docs, 919 MB) keep all existing entries byte-identical. CA writes are additive via MongoDB `$addToSet`: the operator literally cannot modify existing array elements (deep-equality compare on insertion). Single unified collection, source-tagged per entry (`source: "ca_<carrier>_<year>_marketplace_formulary"` vs FFM's `source: "ffm_1311_mrf"`). Year is part of the `_id` natural key (`<rxcui>:<year>` / `<npi>:<year>`) so 2026 vs 2027 formularies are separate docs by construction. Pre/post FFM-entry count assertion in the ingest script catches any deviation immediately.
- Snapshot staging Atlas BEFORE any writes. Atlas Cloud Backup enabled 2026-05-26; on-demand snapshot taken pre-ingest; snapshot ID logged in the ingest run.
- Cluster identity guard in every script. Refuses to run if MongoDB URI host does not match expected env.
- AWS Fargate RunTask for bulk operations. Crawls run in-VPC and Atlas writes go via PrivateLink, so there is no Starlink/home-bandwidth dependency. Local POC parse of one PDF is allowed for validation.
- No HubSpot egress for any data this pipeline produces. Provider/drug data is not PII per HIPAA in the form CalHEERS exposes (provider business addresses + drug catalog metadata).
- Rebuild `drug_search_index` after the formulary ingest completes (ENG-425). After CA (or any) formulary docs land in `formularies_staging`, run `node scripts/db/derive-drug-search-index.js --apply` so the drug-name search read-model reflects the new drugs. It re-derives from the WHOLE collection (FFM + CA), so CA meds become searchable with brand/generic strength parity + commonality ranking. Search-only; coverage stays per-rxcui. See `docs/decisions/2026-05-09-refresh-cadence.md` § "Post-ingest: rebuild derived collections".
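A minimal sketch of how the cluster identity guard and the additive write could fit together. Everything named here is an assumption for illustration: the helper names, the `entries` array field, the `planId` field, and the example hostnames are hypothetical, and only the `source` tag format, the `<rxcui>:<year>` key, and the use of `$addToSet` come from the constraints above.

```typescript
// Hypothetical entry shape; only `source` is taken from the constraints above.
interface CoverageEntry {
  source: string; // e.g. "ca_molina_2026_marketplace_formulary"
  planId: string; // assumed field name
  tier: number;
}

// Extract the host portion of a mongodb:// or mongodb+srv:// URI.
// For multi-host URIs this returns the whole comma-separated host list.
function mongoHost(uri: string): string {
  const m = /^mongodb(?:\+srv)?:\/\/(?:[^@/]*@)?([^/?]+)/.exec(uri);
  const host = m ? m[1] : undefined;
  if (!host) throw new Error("Not a MongoDB URI");
  return host.toLowerCase();
}

// Cluster identity guard: refuse to continue on an unexpected host.
function assertExpectedCluster(uri: string, expectedHost: string): void {
  const host = mongoHost(uri);
  if (host !== expectedHost.toLowerCase()) {
    throw new Error(`Refusing to run: host ${host} is not ${expectedHost}`);
  }
}

// Source tag in the ca_<carrier>_<year>_marketplace_formulary format.
function caSourceTag(carrier: string, year: number): string {
  return `ca_${carrier}_${year}_marketplace_formulary`;
}

// Build an additive upsert keyed on the <rxcui>:<year> natural key.
// $addToSet appends the entry only if no identical element exists yet,
// and it never rewrites elements already in the array.
function buildAdditiveUpsert(rxcui: string, year: number, entry: CoverageEntry) {
  return {
    filter: { _id: `${rxcui}:${year}` },
    update: { $addToSet: { entries: entry } }, // "entries" is an assumed array field
    options: { upsert: true },
  };
}

// Pre/post assertion: the FFM entry count must not change across the ingest.
function assertFfmCountUnchanged(before: number, after: number): void {
  if (before !== after) {
    throw new Error(`FFM entry count changed: ${before} -> ${after}`);
  }
}
```

The returned `filter` / `update` / `options` triple is shaped so it could be passed to a driver's `updateOne` call; the sketch stops short of touching a database so it can be exercised locally.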
Source-of-truth references
Provider directory data
- Upstream: Symphony Provider Directory operated by IHA (Integrated Healthcare Association), tech-partnered with Availity. Mandated by CA SB 137 (Hernandez, 2015). All 12 CA carriers contribute. Single statewide source-of-truth.
- Operational source for us: CalHEERS anonymous endpoint `POST https://apply.coveredca.com/enrollment/enrldriver/v1/alt/anon/getproviderdetails?size=50000`
  - Headers required: `Content-Type: application/json`, `Origin: https://apply.coveredca.com`, `Referer: https://apply.coveredca.com/static/lw-enrollment/anon/preferences/plan-preferences/`
  - Request body: `{"providerType":"P","zip":"94102","radius":"10","year":"2026"}` (providerType `"P"` = Physician, `"D"` = Dentist)
  - Response shape (JSON):

        {
          "metaData": {"currentPageItems":16944, "totalItems":16944, "totalPages":1},
          "providers": [
            {
              "providerId": 8478620,
              "firstName": "Jonathan",
              "lastName": "Huynh",
              "networkId": "70285CAN011-2600|40513CAN001-2600",  // pipe-delimited HIOS network IDs
              "specialty": "Surgery",
              "address": "365 Hawthorne Ave",
              "city": "Oakland",
              "state": "CA",
              "zip": "94609",
              "latitude": 37.820616223766,
              "longitude": -122.263393337961
            }
          ]
        }

  - `networkId` field maps directly to our existing `puf.networkId` field on CA plan docs. No mapping translation required.
- Legal: unintentionally-public endpoint, scraped without IHA license. Acceptable as backend interim solution. NOT marketable as "powered by Symphony" until we license directly from IHA. Symphony customer login at `symphony.iha.org`; contact IHA Oakland 510-208-1740 for downstream-data-consumer subscription pricing (no public price sheet; expect $5-20K/yr).
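Because `networkId` arrives pipe-delimited, matching providers to a plan comes down to splitting that string and comparing against the plan's `puf.networkId`. The sketch below illustrates this under stated assumptions: the interface lists only a subset of the response fields shown above, and the function names are hypothetical rather than existing helpers in the repo.

```typescript
// Subset of the provider fields from the response shape documented above.
interface CalheersProvider {
  providerId: number;
  firstName: string;
  lastName: string;
  networkId: string; // pipe-delimited HIOS network IDs
  specialty: string;
  zip: string;
}

// Build the request body in the documented shape (all values are strings).
function buildProviderRequest(zip: string, radius: number, year: number, providerType: "P" | "D") {
  return { providerType, zip, radius: String(radius), year: String(year) };
}

// Split "A|B|C" into distinct, trimmed network IDs, dropping empty segments.
function splitNetworkIds(raw: string): string[] {
  const ids: string[] = [];
  for (const part of raw.split("|")) {
    const id = part.trim();
    if (id !== "" && ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}

// Keep only the providers whose network list contains the plan's network ID.
function providersInNetwork(providers: CalheersProvider[], planNetworkId: string): CalheersProvider[] {
  return providers.filter((p) => splitNetworkIds(p.networkId).indexOf(planNetworkId) !== -1);
}
```

Splitting on read keeps the raw CalHEERS value intact; if `providers_staging` is used as a cache, the split result is what would populate `networkIds[]`.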
Drug formulary data
- Per carrier, per metal tier: each CA carrier publishes a marketplace formulary PDF. CC's Standard Benefit Design mandate means plans within a metal tier (Silver / Bronze / Gold / Platinum / Catastrophic) share the same formulary structure for a given carrier.
- URL source: extracted from CalHEERS `gethealthplansbyids` response's `formularyUrl` field per plan.
- Discovery path (one-time, per CA rating area):

      POST https://apply.coveredca.com/shopandcompare/screening
        → returns handOffId + planNumbers list
      POST https://apply.coveredca.com/enrollment/enrollment-shopping/v1/alt/anon/gethealthplansbyids
        Body: {"planNumbers":["..."],"handOffId":"..."}
        → returns full plan-detail array including formularyUrl, providerUrl, brochureUrl, sbcDocName,
          drugs.{mostGenericDrugsInNetwork, preferredBrandDrugInNetwork, nonPreferredBrandDrugsInNetwork, specialtyDrugsInNetwork}

- Expected unique PDF count: ~15-20 total across all 12 carriers
- PDF format (validated against Molina CA 2026 Marketplace formulary, 166 pages, 2.5 MB):

      Drug Name                                              Tier     Requirements/Limits
      acetaminophen rectal suppository 120 mg                Tier 1   OTC
      abacavir sulfate oral tablet 300 mg                    Tier 1   QL (2 EA per 1 day)
      VIREAD ORAL POWDER 40 MG/GM (Tenofovir Disoproxil)     Tier 2   QL (7.5 GM per 1 day)

  - BRAND DRUGS in ALL CAPS (per the PDF's own stated convention)
  - generic drugs in lowercase (italic-bold visually, plain lowercase in text extraction)
  - Generic equivalent in parentheses after brand
  - Tier 1-5 (Tier 5 = preventative with $0 copay per ACA)
  - Requirements: `PA` (prior auth), `ST` (step therapy), `QL N EA per N days` (quantity limit), `MAIL` (mail order required), `OTC`, `AGE LIMIT`, `LD` (limited distribution)
  - Section headers between drug classes: `*Antiretrovirals - Rti-Nucleoside Analogues-Pyrimidines*`
- Parse approach: `pdftotext -layout` + ~50 lines of regex. RxCUI resolution via existing FFM cache at `scripts/db/data/rxcui-resolution-cache.json` + CMS autocomplete for misses.
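To make the regex approach concrete, here is a rough sketch of a per-line parser for `pdftotext -layout` output in the three-column format described above. It is not the validated POC parser: the regexes, the `FormularyRow` shape, and the function name are assumptions, and it has only been reasoned through against the three sample rows. RxCUI resolution is left out.

```typescript
interface FormularyRow {
  drugName: string;           // full first column, including any parenthetical
  isBrand: boolean;           // ALL CAPS name, per the PDF's stated convention
  genericName: string | null; // parenthetical after a brand name, if present
  tier: number;               // 1-5
  requirements: string[];     // e.g. ["PA", "QL (2 EA per 1 day)"]
}

// <drug name> <whitespace> "Tier N" <whitespace> <optional requirements>
const ROW = /^\s*(.+?)\s+Tier\s+([1-5])(?:\s+(.*\S))?\s*$/;
// A trailing "(...)" on the name column, used for the generic equivalent.
const TRAILING_PAREN = /\s*\(([^)]+)\)\s*$/;
// Requirement codes listed in the format notes; QL carries a parenthesized limit.
const REQUIREMENT = /QL\s*\([^)]*\)|\b(?:PA|ST|MAIL|OTC|AGE LIMIT|LD)\b/g;

// Returns null for anything that is not a drug row (column header,
// drug-class section headers, blank lines).
function parseFormularyLine(line: string): FormularyRow | null {
  const m = ROW.exec(line);
  if (!m) return null;

  const drugName = (m[1] ?? "").trim();
  const paren = TRAILING_PAREN.exec(drugName);
  const base = paren ? drugName.slice(0, paren.index) : drugName;
  // Brand rows are upper case throughout; generic rows contain lower case.
  const isBrand = /[A-Z]/.test(base) && base === base.toUpperCase();

  const found = (m[3] ?? "").match(REQUIREMENT);
  const requirements = found ? found.map((r) => r.replace(/\s+/g, " ")) : [];

  return {
    drugName,
    isBrand,
    genericName: isBrand && paren ? (paren[1] ?? null) : null,
    tier: Number(m[2]),
    requirements,
  };
}
```

Rows that wrap across multiple lines in the PDF are not handled here; in practice a pre-pass that joins continuation lines (or a list of unparsed lines for manual review) would likely be needed before this step.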
Plan tier copays (bonus data)
- Embedded in same CalHEERS `gethealthplansbyids` response
- Already standardized per carrier (Tier 1 generic $/copay, Preferred Brand $/copay, etc.)
- Can populate `puf.copays.{primaryCare, specialist, genericDrugs, ...}` for richer plan-detail pages without per-PDF parse
SBC + brochure URLs
- Also embedded in same response (`sbcDocName`, `brochureUrl`)
- Populate `puf.urls.{sbc, brochure}` on prod plan docs for plan-detail page Documents section
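Since formulary, provider, SBC, and brochure links all come from the same `gethealthplansbyids` response, one pass over the plan-detail array can produce both the `planNumber → {...}` map and the deduplicated list of formulary PDFs to download. The following is a sketch only: the `planNumber` key on each plan-detail object and the output shape are assumptions, not the confirmed response or file schema.

```typescript
// Assumed subset of one plan-detail object; the URL field names come from the
// documented response, while `planNumber` as the key field is an assumption.
interface PlanDetail {
  planNumber: string;
  formularyUrl?: string;
  providerUrl?: string;
  brochureUrl?: string;
  sbcDocName?: string;
}

interface PlanDocuments {
  formularyUrl: string | null;
  providerUrl: string | null;
  brochureUrl: string | null;
  sbcDocName: string | null;
}

// Collapse plan details from all rating areas into one map keyed by plan number.
// A plan seen in several rating areas simply overwrites itself.
function buildPlanDocumentsMap(details: PlanDetail[]): Record<string, PlanDocuments> {
  const out: Record<string, PlanDocuments> = {};
  for (const d of details) {
    out[d.planNumber] = {
      formularyUrl: d.formularyUrl ?? null,
      providerUrl: d.providerUrl ?? null,
      brochureUrl: d.brochureUrl ?? null,
      sbcDocName: d.sbcDocName ?? null,
    };
  }
  return out;
}

// Distinct formulary PDF URLs, sorted so repeated runs give a stable download list.
function uniqueFormularyUrls(map: Record<string, PlanDocuments>): string[] {
  const urls: string[] = [];
  for (const planNumber of Object.keys(map)) {
    const entry = map[planNumber];
    const url = entry ? entry.formularyUrl : null;
    if (url && urls.indexOf(url) === -1) urls.push(url);
  }
  return urls.sort();
}
```

Deduplicating by URL is what would bring the download list down to the expected ~15-20 PDFs, since plans sharing a carrier formulary point at the same file.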
Compute path: AWS Fargate vs local
| Operation | Where | Why |
|---|---|---|
| One-PDF POC parse | Local (mac) | Quick iteration, validates the regex |
| 19 CA rating-area screenings + handOffId harvest | Fargate RunTask in-VPC | Many short HTTPS POSTs, throttled to avoid CalHEERS rate limits |
| Download ~15-20 PDFs | Fargate (or local — they're small, ~2-5 MB each) | Negligible bandwidth |
| Parse all PDFs | Fargate | CPU-bound, parallelizable across PDFs |
| Mongo upsert to staging | Fargate via PrivateLink | Atlas writes in-VPC are fastest + most reliable |
| One-time prod plan doc update (foreign-key population) | Fargate with explicit prod approval gate | Touch prod only when explicitly authorized |
Fargate task definition pattern
Reuse the ENG-325 esbuild bundle pattern (scripts/preflight.ts already does this for ensure-indexes):
- esbuild the ingest script + helpers into a single CJS bundle
- COPY into the existing `askflorence-app` ECR image at `/app/scripts/ca-ingest.cjs`
- New `workflow_dispatch` GitHub Action `ca-ingest.yml` with phase input (harvest / parse / upsert / linkfk)
- ECS RunTask invokes the bundle with the phase argument
- Atlas writes go through the existing staging PrivateLink endpoint
- Secrets sourced from existing AWS Secrets Manager (`staging/mongodb/app-write`)
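One way the bundle's entry point could validate the phase argument before doing any work is sketched below. The four phase names come from the workflow input listed above; the function names, the `--approve-prod` flag, and the assumption that `linkfk` is the only phase touching prod are hypothetical and would need to match the real script.

```typescript
// Phase names as listed for the workflow_dispatch input.
const PHASES = ["harvest", "parse", "upsert", "linkfk"] as const;
type Phase = (typeof PHASES)[number];

// Validate the first CLI argument (e.g. process.argv.slice(2)) as a phase name.
function parsePhase(args: string[]): Phase {
  const raw = args[0];
  const idx = raw === undefined ? -1 : (PHASES as readonly string[]).indexOf(raw);
  const phase = idx === -1 ? undefined : PHASES[idx];
  if (phase === undefined) {
    throw new Error(`Unknown phase "${String(raw)}"; expected one of ${PHASES.join(", ")}`);
  }
  return phase;
}

// Assumption: linkfk is the foreign-key population step, the only one writing to prod.
function requiresProdApproval(phase: Phase): boolean {
  return phase === "linkfk";
}

// Approval gate: a prod-touching phase needs an explicit flag (hypothetical name).
function assertApproved(phase: Phase, args: string[]): void {
  if (requiresProdApproval(phase) && args.indexOf("--approve-prod") === -1) {
    throw new Error(`Phase ${phase} touches prod and requires --approve-prod`);
  }
}
```

Failing fast on an unknown phase or a missing approval flag keeps a mistyped workflow input from starting a Fargate task that does nothing, or worse, the wrong thing.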
Per-state reusability
This playbook is CA-specific in execution but the patterns repeat across SBEs:
| State | Marketplace | Equivalent of CalHEERS | Statewide provider directory | Per-carrier formulary PDFs |
|---|---|---|---|---|
| CA | Covered California | CalHEERS (Accenture-built) | Symphony (IHA + Availity, SB 137) | Yes — SBP-standardized |
| NY | NY State of Health | NYSOH | NYDFS provider directory + DOH plan adequacy reports | Yes — per carrier |
| MA | Massachusetts Health Connector | MMIS/HIX | MA DOI provider data + Sapphire-backed search | Yes — per carrier |
| WA | Washington Healthplanfinder | WAhealthPlanFinder (Deloitte) | OneHealthPort provider directory | Yes — per carrier |
| CO | Connect for Health Colorado | CFHCO | Colorado DOI files | Yes — per carrier |
| CT | Access Health CT | AHCT | CT InsuranceDept files | Yes — per carrier |
| MD | Maryland Health Connection | MHBE | Maryland Insurance Administration | Yes — per carrier |
| NJ | Get Covered NJ | GetCoveredNJ (Accenture, sibling to CalHEERS) | NJ DOI files | Yes — per carrier |
| PA | Pennie | Pennie (Deloitte) | PA Insurance Dept files | Yes — per carrier |
For each new state we add:
- Probe the state marketplace's SPA bundles for the equivalent anon endpoints (same technique as CalHEERS — pull the production config JSON, read the API URL constants)
- Identify the statewide provider directory utility (varies — some states have a centralized one like Symphony; others have per-carrier portals)
- Verify the formulary PDF format follows the same general 3-column pattern (most do — federal ACA guidance creates de-facto standardization)
- Reuse this playbook's parse code + RxCUI resolution layer
Phase-by-phase execution checklist
See ENG-395 for the live checklist. The phases mirror the architecture diagram above (Phase 0 pre-flight → Phase 1 provider proxy → Phase 2 URL harvest → Phase 3 PDF parse → Phase 4 staging writes → Phase 5 query layer + UI flip → Phase 6 docs + cleanup → Phase 7 Symphony license track).
Operational notes captured during execution
(this section appended as the work progresses — patterns, gotchas, deviations from plan, retroactive lessons)