New York Phase C/D ingestion playbook
Companion to ca-phase-c-d-ingestion-playbook.md and sbe-state-watchouts.md. This is the NY-specific execution plan for bringing NY to end-to-end doctor + Rx coverage parity (ENG-412). Phase 0 (source discovery) results are captured here; Phases 1-4 execute against them.
Goal
A NY consumer enters a ZIP → gets plans (✅ already works) → searches their doctor by name → sees which plans cover them → searches medications → sees which plans cover them. Same flow as FFM + the completed CA work (ENG-395/408).
Verified starting state (2026-05-28)
| Capability | Status |
|---|---|
| ZIP → county, plans, pricing, subsidy | ✅ works (282 NY 2026 plans, calculateNyEligibility) |
| Doctor coverage | ❌ 0 NY providers in providers_staging |
| Drug coverage | ❌ 0 of 282 NY plans cover atorvastatin; formularies not ingested |
The NY advantage over CA
NY is NPPES-NPI-native. Provider identity across NY's directory (PNDS), §1311 MRFs, and our /api/providers/autocomplete (NPPES) is the SAME 10-digit NPI. So once NY providers land in providers_staging keyed by _id: npi (FFM-style), the autocomplete → coverage join works with no bridge gap — the CA limitation (ENG-410, NPPES↔Symphony) does not exist for NY. NY becomes the first fully-complete SBE provider surface.
Also: NY puf is populated (CA's is empty — see CA decision #1). So NY provider-plan mapping can use puf.networkId for per-network precision instead of CA's HIOS-prefix coarsening.
No route changes needed. NY is in OWNED_COVERAGE_STATES (ENG-411); /api/{drugs,providers}/covered already dispatch NY to the owned-data path. The flow lights up the moment data lands.
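The NPI-keyed join described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the route's actual code: the in-memory dict stands in for providers_staging, and the function name, the NPI, and the plan IDs are hypothetical.

```python
# Hypothetical stand-in for providers_staging: documents keyed by NPPES NPI (_id).
PROVIDERS_STAGING = {
    "1234567890": {
        "_id": "1234567890",
        "plans": [{"plan_id": "PLAN-A"}, {"plan_id": "PLAN-B"}],
    },
}


def covered_plan_ids(autocomplete_npi: str, candidate_plan_ids: list[str]) -> list[str]:
    """Return the candidate plans whose stored network includes this NPI."""
    # The NPI returned by autocomplete is already the storage key, so the
    # lookup is a direct key match with no ID-translation (bridge) step.
    doc = PROVIDERS_STAGING.get(autocomplete_npi)
    if doc is None:
        return []
    in_network = {entry["plan_id"] for entry in doc["plans"]}
    return [pid for pid in candidate_plan_ids if pid in in_network]
```

For CA, the same lookup would first need a mapping from the NPPES NPI to a Symphony provider ID; that mapping step is what ENG-410 tracks and what NY avoids.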
Phase 0 — source discovery (DONE; decisions locked)
Provider directory
| Candidate | Verdict |
|---|---|
| PNDS: pndslookup.health.ny.gov (Provider Network Data System, NY DOH, operated by IPRO) | Confirms data exists + is NPI-native + statewide (the NY analog of CA's Symphony: enter provider → which plans cover them; updated every 3 months). BUT: reCAPTCHA-gated jQuery form-POST tool, NOT a clean anon JSON API. Per our security rules we do NOT bypass CAPTCHA → not a scrape target. Potential future licensed data feed from NY DOH / IPRO (the Symphony-license analog). |
| §1311 Transparency-in-Coverage MRFs (per NY carrier) | ✅ CHOSEN. Federally mandated, machine-readable, NPI-keyed, no CAPTCHA. NY carriers publish these regardless of marketplace type. Same pipeline we already run for FFM (scripts/db/ingest-mrf-providers.js) — NY providers slot into providers_staging keyed by NPPES NPI exactly like FFM, with per-plan network membership. This is the clean technical + legal path. |
| Per-carrier provider-directory portals | Fallback for carriers whose MRF is unusable; higher per-carrier effort. |
Decision: ingest NY providers from §1311 MRFs, reusing the FFM MRF provider pipeline. _id: npi (NOT a namespaced ny-sym: id — NY is NPI-native, unlike CA). PNDS is the consumer-facing proof + a future licensed-feed option, not the ingest source.
Drug formularies
Per-carrier formulary PDFs/files — same approach as CA (CA carrier-PDF parser playbook is the template). Confirmed sources for major NY medical carriers (verify + expand the full ~13-issuer list during Phase 1):
| Carrier (HIOS) | Formulary source |
|---|---|
| Fidelis / Ambetter (25303) | fideliscare.org/Portals/0/Formularies/QHP-2026-formulary-Fidelis-Care.pdf (QHP) + EP-2026-Formulary-Fidelis-Care.pdf (Essential Plan) |
| Healthfirst (91237) | healthfirst.org/formularies (landing page → per-plan PDFs) |
| MetroPlus (11177) | metroplus.org/wp-content/uploads/... per-plan PDFs |
| (various) | fm.formularynavigator.com/FBO/... — MMIT-hosted, SAME host as CA Anthem (e.g. NY_Essential_Formulary.pdf, 2026_QHP_Formulary.pdf). The CA FormularyNavigator handling (browser UA + backoff on 429) applies. |
| Excellus/Highmark (78124), MVP (56184), EmblemHealth (88582), CDPHP (94788), Oscar (74289), UnitedHealthcare (54235), Anthem (41046/44113) | TBD — harvest in Phase 1; same per-carrier-PDF approach. |
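The "browser UA + backoff on 429" handling mentioned for FormularyNavigator might look roughly like the sketch below. The retry count, base delay, User-Agent string, and function names are assumptions for illustration; the actual CA implementation may differ.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed browser-like User-Agent; the value used by the CA harness is not shown here.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)


def _http_get(url: str) -> bytes:
    """Plain GET with a browser-like User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def fetch_with_backoff(
    url: str,
    fetch: Callable[[str], bytes] = _http_get,
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
    base_delay: float = 2.0,
) -> bytes:
    """Retry on HTTP 429 with exponential backoff; re-raise any other error."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only 429 is retried; give up once the last attempt has failed.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

The `fetch` and `sleep` parameters are injectable so the retry logic can be exercised without network access or real waiting.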
Decision: parse NY carrier formulary PDFs → resolve to RxCUI (reuse scripts/db/data/rxcui-resolution-cache.json + CMS autocomplete for misses) → upsert to formularies_staging keyed _id: "<rxcui>:<year>", plans[] entries for NY plan_ids, source: "ny_<carrier>_2026_marketplace_formulary". Note NY's Essential Plan (EP) is a NY-specific program with its own formulary — include EP plan_ids where applicable.
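As a rough sketch of the upsert shape this decision implies, the helper below builds a MongoDB filter and update document. The helper name, the `drug_tier` field, and the choice to stamp `source` on each plans[] entry (rather than on the document) are assumptions; only the `_id` format, the plans[] array, and the source string pattern come from the decision above.

```python
def build_formulary_upsert(
    rxcui: str,
    year: int,
    carrier_slug: str,
    plan_entries: list[dict],
) -> tuple[dict, dict]:
    """Build (filter, update) documents for an additive formulary upsert."""
    source = f"ny_{carrier_slug}_{year}_marketplace_formulary"
    # Assumption: provenance is recorded per plan entry, because one
    # <rxcui>:<year> document can hold plans from FFM, CA, and NY.
    tagged = [{**entry, "source": source} for entry in plan_entries]
    filter_doc = {"_id": f"{rxcui}:{year}"}
    # $addToSet with $each appends only entries not already present,
    # so existing FFM and CA plan entries are left untouched.
    update_doc = {"$addToSet": {"plans": {"$each": tagged}}}
    return filter_doc, update_doc


# Placeholder RxCUI and plan ID, for illustration only.
filter_doc, update_doc = build_formulary_upsert(
    "12345", 2026, "fidelis", [{"plan_id": "PLAN-A", "drug_tier": "GENERIC"}]
)
```

With pymongo, the pair would be applied as `collection.update_one(filter_doc, update_doc, upsert=True)`. Note that `$addToSet` treats two embedded documents as duplicates only when they match exactly, field order included, so entries should be built with a stable field order.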
Phase 1 — drug formulary ingest → formularies_staging
Mirror scripts/db/ingest-ca-formularies.py + ca-phase-cd-runner.cjs:
- [ ] Harvest NY carrier formulary URLs (full ~13-issuer inventory) → scripts/db/data/ny-carrier-formularies-2026.ts
- [ ] Download + parse (pdfplumber table-aware; positional fallback for single-column tier-digit layouts; both parsers from the CA work are reusable)
- [ ] Resolve RxCUIs (reuse FFM cache)
- [ ] Map carrier → NY plan_ids by HIOS prefix (or per-network via puf.networkId since NY puf is populated)
- [ ] Upsert (additive $addToSet, source: "ny_*", cluster guard, pre/post FFM+CA count assertion, dry-run default)
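The HIOS-prefix mapping step in the checklist above can be sketched as a simple grouping, relying on the fact that a HIOS plan ID begins with the 5-digit issuer ID. The function name and the sample plan IDs are hypothetical (the issuer IDs are taken from the carrier table); the per-network alternative via puf.networkId is not shown.

```python
from collections import defaultdict

HIOS_ISSUER_ID_LENGTH = 5  # a HIOS plan ID starts with the 5-digit issuer ID


def group_plan_ids_by_issuer(plan_ids: list[str]) -> dict[str, list[str]]:
    """Group plan IDs by their leading HIOS issuer ID."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for plan_id in plan_ids:
        grouped[plan_id[:HIOS_ISSUER_ID_LENGTH]].append(plan_id)
    return dict(grouped)


# Made-up plan IDs using issuer prefixes from the carrier table.
sample = ["25303NY0000001", "25303NY0000002", "91237NY0000001"]
by_issuer = group_plan_ids_by_issuer(sample)
```

Prefix mapping assigns a carrier's formulary to every plan that issuer offers, which is the coarsening CA had to accept; where a carrier publishes different formularies per product line, the per-network route is the more precise option.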
Phase 2 — provider directory ingest → providers_staging
- [ ] Identify each NY carrier's §1311 MRF index URL (CMS TiC index → per-issuer provider-reference files). NY carriers publish at their TiC disclosure URLs.
- [ ] Reuse the FFM MRF provider pipeline (scripts/db/ingest-mrf-providers.js): _id: npi, plans[] keyed by NY plan_id + network_tier, source: "ny_1311_mrf", af_state_scope: ["NY"].
- [ ] Run via AWS Fargate RunTask in-VPC (reuse the ENG-408 ecs-smoke-runner task-family + bundle pattern). MRFs are large; no Starlink dependency.
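To make the target document shape concrete, here is a minimal sketch that folds already-normalized (npi, plan_id, network_tier) rows into one document per NPI. The row format and function name are assumptions; the real pipeline (scripts/db/ingest-mrf-providers.js) is JavaScript and parses the carrier files itself, so this only illustrates the output shape listed above.

```python
import re
from collections import defaultdict

NPI_RE = re.compile(r"\d{10}")  # NPIs are 10-digit identifiers


def build_provider_docs(rows):
    """Fold (npi, plan_id, network_tier) rows into one document per NPI."""
    plans_by_npi = defaultdict(list)
    for npi, plan_id, network_tier in rows:
        npi = str(npi)  # some source files carry NPIs as integers
        if not NPI_RE.fullmatch(npi):
            continue  # skip malformed identifiers rather than store them
        entry = {"plan_id": plan_id, "network_tier": network_tier}
        if entry not in plans_by_npi[npi]:  # de-duplicate repeated rows
            plans_by_npi[npi].append(entry)
    return [
        {
            "_id": npi,
            "plans": plans,
            "source": "ny_1311_mrf",
            "af_state_scope": ["NY"],
        }
        for npi, plans in plans_by_npi.items()
    ]
```

How a provider who already exists in the FFM cohort under the same NPI should be merged is not decided here; the byte-identical covenant in Phase 3 suggests that case needs explicit handling before any write.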
Phase 3 — safety + verify (every covenant from ENG-395/408)
- [ ] Atlas snapshot before any write; log snapshot ID.
- [ ] FFM and CA cohorts byte-identical (pre/post counts: FFM _id-range + CA ca-sym: range both unchanged; NY adds new _id: npi provider docs + new formularies_staging plan entries).
- [ ] Cluster identity guard (refuse non-staging) + collection allowlist (formularies_staging + providers_staging only).
- [ ] Apex smoke: NY ZIP 10001 / 11201 + a real NY doctor by name → correct per-plan coverage (NPPES NPI joins directly, no bridge); NY Lipitor → correct per-plan coverage.
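The guard checks in this phase could be sketched as below. How the real harness identifies the staging cluster is not stated in this playbook, so the hostname-substring test, the cohort labels, and all function names are assumptions made for illustration.

```python
ALLOWED_COLLECTIONS = frozenset({"formularies_staging", "providers_staging"})


class SafetyError(RuntimeError):
    """Raised when a write would violate a standing covenant."""


def assert_write_allowed(cluster_host: str, collection: str) -> None:
    """Refuse writes to non-staging clusters or non-allowlisted collections."""
    # Assumption: staging clusters are recognizable by hostname.
    if "staging" not in cluster_host:
        raise SafetyError(f"refusing non-staging cluster: {cluster_host}")
    if collection not in ALLOWED_COLLECTIONS:
        raise SafetyError(f"collection not allowlisted: {collection}")


def assert_cohorts_unchanged(pre: dict, post: dict, protected=("ffm", "ca")) -> None:
    """Fail if any protected cohort's document count changed across the run."""
    for cohort in protected:
        if pre[cohort] != post[cohort]:
            raise SafetyError(
                f"{cohort} count changed: {pre[cohort]} -> {post[cohort]}"
            )
```

A count comparison is a weaker check than a byte-level one: it catches inserts and deletes in the FFM and CA cohorts but not in-place modifications, so a content hash over the protected ranges would be needed to verify "byte-identical" literally.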
Phase 4 — docs
- [ ] Update sbe-state-watchouts.md NY section: locked decisions + verified E2E matrix (replace the Phase-0 "broken" matrix with the post-ingest "works" matrix).
- [ ] Florence tooling: NY "just works" given NPPES-native (autocomplete NPI = stored NPI). Confirm + note in docs/florence-ai/tool-surface.md SBE-vs-FFM matrix that NY check_provider works WITHOUT the CA caveat. ENG-410's CA-only gap does NOT apply to NY.
Standing covenants (same as ENG-395/408)
Snapshot before writes · FFM + CA byte-identical · cluster identity guard · collection allowlist · $addToSet additive · dry-run default · Fargate in-VPC for bulk crawl/ingest · no prod-cluster writes for coverage data (staging via PrivateLink, ADR 0004).
Reusable-for-other-SBE notes
The PNDS-is-CAPTCHA-gated → use-§1311-MRF decision is likely the common SBE pattern: most states have a consumer provider-lookup tool (often CAPTCHA-protected) AND federally-mandated §1311 MRFs. Default to MRFs for ingestion; treat the state tool as proof-of-data + a potential licensed feed. NPI-native states (NY and most non-CA SBEs) avoid CA's Symphony-providerId bridge problem entirely.
Cross-references
- ENG-412 (this work) · ENG-395 (CA Phase C/D template) · ENG-408 (CA provider ingest harness + Fargate pattern) · ENG-407 (state-aware route dispatch) · ENG-411 (coverage-dispatch predicate; NY rides on OWNED_COVERAGE_STATES) · ENG-410 (CA-only NPI bridge; N/A for NY)
- scripts/db/ingest-mrf-providers.js (FFM MRF provider pipeline, reused for NY) · scripts/db/ingest-ca-formularies.py + parse-ca-formulary*.py (CA formulary parsers, reused for NY)