2026-05-09: LARK - the data refresh engine for AskFlorence
Decision record. Defines LARK (a multi-source data refresh engine) that keeps the staging mirror (and any future enriched datasets) current with upstream sources at acceptable cost, with a runtime fallback ordering of staging-first / CMS-on-miss. Collapses Linear issues ENG-227, ENG-228, ENG-231, ENG-232, ENG-233 into one decision. Implementation lands in ENG-236 / #98.
Name. LARK. Loosely "Layer-Aware Refresh Kit" but mostly because larks sing at dawn and ours wakes up at 06:00 UTC to walk the data. AskFlorence's family of engines uses light/morning/Florence-Nightingale-themed names; "Athena" (after Florence's pet owl) is reserved for a future heavier-weight engine (subsidy logic or similar).
Status
- Decision date: 2026-05-09
- Audit verdict referenced: `~/Developer/ask-florence-audit-revalidation/docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md` (99.97% overall match, 148,475/148,520 checks)
- Pivot precedent: docs/decisions/2026-05-03-pivot-cms-api-direct.md
- Cluster topology precedent: docs/adr/0004-cross-cluster-atlas-privatelink.md
- Implementation follow-on: ENG-236 / #98
- Spawned issues: ENG-251 (NPPES enrichment evaluation), ENG-252 (plan-catalog refresh cadence investigation) - detail in §7
Decision
Build LARK as a four-layer change-detection pipeline that runs daily and refreshes our staging mirror only when upstream content materially changes. The four layers stack from the cheapest, coarsest signal to the most precise:
- Layer 1 - source version watcher. Polls upstream "what's new" signals (CMS `/versions` for CMS-Marketplace-API-derived datasets; NPPES file-listing pages for monthly enrichment files; equivalent for any future source).
- Layer 2 - file-level change probe. Per-source HEAD with `If-None-Match` + `If-Modified-Since` against issuer / file URLs; skip on 304.
- Layer 3 - record-level semantic diff. Streaming-parses changed files, computes per-record diffs against stored canonical state, classifies whether each diff is material or cosmetic, writes only material diffs to Atlas. Built day one. (Reframed from the original "byte-hash diff" - see §"Why Layer 3 day-one" below.)
- Layer 4 - drift monitor + recovery floor. Tier-6 audit harness re-purposed as continuous monitor with Slack alert on >5% drift; quarterly full re-baseline as recovery sweep for whatever Layers 1-3 missed.
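The Layer 2 probe can be sketched as a conditional HEAD request. This is a minimal illustration, not the ENG-236 implementation: the function names are hypothetical, the state fields borrow the `layer2_etag` / `layer2_last_modified` names from the state-store schema, and treating a 200 with unchanged validators as "not changed" is an assumption about how non-cooperative servers should be handled.

```javascript
// Build conditional request headers from the stored Layer 2 state.
function conditionalHeaders(state) {
  const headers = {};
  if (state.layer2_etag) headers["If-None-Match"] = state.layer2_etag;
  if (state.layer2_last_modified) headers["If-Modified-Since"] = state.layer2_last_modified;
  return headers;
}

// Pure interpretation step, kept separate so it can be tested without a network.
function interpretProbe(state, { status, etag, lastModified }) {
  if (status === 304) return { changed: false };
  if (status !== 200) return { changed: false, error: `unexpected status ${status}` };
  // Some servers answer 200 even when nothing changed, so compare validators too.
  const unchanged = etag
    ? etag === state.layer2_etag
    : Boolean(lastModified) && lastModified === state.layer2_last_modified;
  return { changed: !unchanged, etag: etag ?? null, lastModified: lastModified ?? null };
}

// Async wrapper; `fetchImpl` is injectable (defaults to the global fetch in Node 18+).
async function probeFile(url, state, fetchImpl = fetch) {
  const res = await fetchImpl(url, { method: "HEAD", headers: conditionalHeaders(state) });
  return interpretProbe(state, {
    status: res.status,
    etag: res.headers.get("etag"),
    lastModified: res.headers.get("last-modified"),
  });
}
```

Splitting the header construction and response interpretation from the network call keeps the skip-on-304 rule easy to unit-test per issuer.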
The engine is source-pluggable. Day-one sources: §1311 MRFs (formularies_staging + providers_staging). Day-two sources: CMS Marketplace API plan-catalog data (the plans collection, whose refresh cadence was wrongly assumed to be annual; see §7), NPPES enrichment for taxonomy fields, anything else with the same change-source / file-fetch / record-upsert shape.
Runtime fallback ordering flipped to Option A: staging-first, CMS-on-miss. Reasoning: CMS-API as primary cannot survive OE 2027 scale (1k concurrent visitors exhausts the per-minute budget in <3s per pivot doc); we already have a 99.97%-match staging mirror and a sub-second cross-cluster PrivateLink read path. Use what we have as primary; fall back to CMS only when staging doesn't have the answer.
Tolerance band for staleness: up to 1 week is acceptable, 1 month is the worst-case fallback, daily-or-better is the goal. LARK delivers daily comfortably; hourly during OE 2027 high-churn is one cron-line edit.
Why Layer 3 day-one (the diff classifier)
The original recommendation was to defer Layer 3 (per-record content-hash) until daily cadence wasn't enough. That was wrong. Reframed:
A pure byte-hash Layer 3 doesn't earn its complexity. But a semantic diff classifier does, because real-world MRF files contain non-material churn that would force unnecessary upserts:
- An issuer's file may bump `last_published_at` on every publish, even when no records changed. Byte-hash flags every record as "changed"; classifier sees "only `last_published_at` field changed" and marks it cosmetic.
- Field ordering inside JSON objects can drift between publishes. Byte-hash flags everything; canonicalized hash + diff classifier sees "same content, different key order" and skips.
- Nested arrays may re-order without semantic change (e.g., `plans: ["X", "Y"]` vs `plans: ["Y", "X"]`). Byte-hash flags; classifier with set-equality on known-unordered fields skips.
Without the classifier, Layer 3 saves bandwidth (we already downloaded the file) but doesn't save Atlas write IOPS, which is the actual cost driver during typical-change days. With the classifier, Layer 3 saves both.
Implementation shape (similar to git diff in spirit, simpler in scope):
- Per-collection canonicalization rule: lowercase keys, sort object keys, normalize string whitespace, treat known-unordered arrays as sets.
- SHA-1 the canonicalized form per record - "semantic hash" - stored on the doc as `af_semantic_hash` and in `mrf_file_state_staging` for fast lookup.
- On detected file change (Layer 2 trip), streaming-parse the new file, compute new semantic hash per record, compare against stored.
- For changed records: optionally compute a per-field diff (which fields changed, old vs new) for audit logging. Cheap; same parse pass.
- Classifier rule set per collection: which fields are "material" (changes warrant upsert) vs "cosmetic" (changes are skipped). Examples for `providers_staging`: NPI/name/specialty/network = material; `last_published_at` / record-id metadata = cosmetic. For `formularies_staging`: drug-tier / formulary-id / coverage = material; ordering of plans array = cosmetic.
- Material change → upsert. Cosmetic change → update only `af_semantic_hash` + `last_seen_at`, no write to data fields.
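The canonicalize-then-hash step above can be illustrated with a short sketch. It assumes records are plain JSON-parsed objects and that `plans` is the only known-unordered array; `UNORDERED_FIELDS`, `canonicalize`, and `semanticHash` are illustrative names, not the ENG-236 API.

```javascript
const { createHash } = require("node:crypto");

// Assumption: per-collection list of array fields whose order carries no meaning.
const UNORDERED_FIELDS = new Set(["plans"]);

function canonicalize(value, key = "") {
  if (Array.isArray(value)) {
    const items = value.map((v) => canonicalize(v));
    if (!UNORDERED_FIELDS.has(key)) return items;
    // Treat as a set: de-duplicate and sort by serialized form.
    const bySerialized = new Map(items.map((v) => [JSON.stringify(v), v]));
    return [...bySerialized.keys()].sort().map((k) => bySerialized.get(k));
  }
  if (value !== null && typeof value === "object") {
    // Lowercase keys, then sort them so key order never affects the hash.
    const entries = Object.entries(value).map(([k, v]) => {
      const lower = k.toLowerCase();
      return [lower, canonicalize(v, lower)];
    });
    entries.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return Object.fromEntries(entries);
  }
  // Normalize string whitespace; leave other scalars untouched.
  return typeof value === "string" ? value.trim().replace(/\s+/g, " ") : value;
}

function semanticHash(record) {
  return createHash("sha1").update(JSON.stringify(canonicalize(record))).digest("hex");
}

const a = { NPI: "123", name: " Jane   Doe ", plans: ["X", "Y"] };
const b = { plans: ["Y", "X"], name: "Jane Doe", npi: "123" };
console.log(semanticHash(a) === semanticHash(b)); // same content, different layout
```

Two records that differ only in key order, key case, whitespace, or the order of a known-unordered array hash identically, so they never reach the classifier as "changed".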
Cost-benefit math:
- Without Layer 3 + classifier: every Layer-2-triggered shard re-fetch upserts every record in the shard, regardless of whether content changed. On a typical-change day with 10-30 issuers updating, that's millions of needless upserts. Atlas write-IOPS pressure.
- With Layer 3 + classifier: same shard re-fetch, only material-diff records upsert. Likely <5% of records per typical-change day. ~20x reduction in Atlas write IOPS.
Engineering cost: ~2-3 engineer-days for the classifier framework + per-collection rule sets. Lower than the cost of one M60 burst event. Worth it day-one.
Rationale
Cost vs achievable cadence
Today's full-ingest cost: ~22 GB compressed download + ~32M Atlas upserts on every run. The 2026-05-06 incident proved sustaining this is ~$2,800/mo at M60. Full nightly re-ingest is off the table. The 2026-05-03 audit at 99.97% match shows we don't need that level of drift mitigation.
| Architecture | Cost-per-run shape | Cadence sustainable | Worst-case staleness | Why we did or didn't pick it |
|---|---|---|---|---|
| Full nightly re-ingest | $2,800/mo if sustained | Daily at huge cost | <24h | Rejected. Cost-prohibitive. |
| Layer 1 only (`/versions` poll, no HEAD, no hash) | Cheap poll + full-fetch all 183 issuers when triggered | Daily safe; sub-daily wasteful | ~24h on trigger days | Rejected. Same cost shape as today's burst on every trigger day. |
| Layers 1+2 (`/versions` + per-issuer ETag/Last-Modified HEAD) | Cheap poll + 183 cheap HEADs + selective re-fetch only changed issuers | Daily comfortable | <24h baseline | Insufficient. Saves bandwidth but still re-upserts every record in every changed shard - misses the Atlas IOPS driver. |
| Layers 1+2+3 with byte-hash | + streaming hash + targeted upserts | Hourly viable | <1h | Insufficient. Cosmetic file churn forces upserts for non-material changes. |
| Layers 1+2+3 with diff classifier (LARK) | + canonicalized record diff + material/cosmetic classifier | Daily default; hourly for OE 2027 | <24h baseline; <1h achievable | Selected. Saves bandwidth AND Atlas IOPS. ~20x reduction on typical-change days vs byte-hash alone. |
| Drift-detection-only (Tier-6 audit triggers ad-hoc re-ingest) | Zero cost until threshold crossed; full cost on recovery | Reactive only | Worst-case = audit-interval + recovery-time | Rejected as primary. Recovery is the rare, and therefore under-tested, path. Kept as Layer 4 monitor/alert + quarterly recovery floor. |
| Quarterly full re-baseline (recovery floor only) | Predictable, expensive | Quarterly | n/a (back-of-house) | Selected as Layer 4 recovery floor. ~$200/yr per pivot doc. Catches whatever Layers 1-3 missed. |
OE 2027 concurrency math (drives the fallback flip)
From the pivot doc:
1,000 concurrent /plans visitors × ~10 calls per visit in a 30s window = 333 calls/sec. CMS budget: 200 RPS / 1,000 per minute. Single-key budget exhausts in <3 seconds.
5,000 concurrent (the flip threshold) = 1,667 calls/sec. CMS-direct breaks here even with multi-key.
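The arithmetic above can be reproduced with a few lines. This is a back-of-envelope model using only the figures quoted from the pivot doc (10 calls per visit, 30s window, 1,000 calls per minute); the function name and parameter names are illustrative.

```javascript
// Back-of-envelope load model for CMS-API-first at a given concurrency.
function cmsLoad({ visitors, callsPerVisit = 10, windowSec = 30, budgetPerMinute = 1000 }) {
  const callsPerSec = (visitors * callsPerVisit) / windowSec;
  // How long a single key's per-minute budget lasts at that call rate.
  const secondsToExhaust = budgetPerMinute / callsPerSec;
  return { callsPerSec, secondsToExhaust };
}

for (const visitors of [1000, 5000]) {
  const { callsPerSec, secondsToExhaust } = cmsLoad({ visitors });
  console.log(
    `${visitors} concurrent: ~${Math.round(callsPerSec)} calls/sec, ` +
      `budget gone in ~${secondsToExhaust.toFixed(1)}s`
  );
}
```

At 1,000 concurrent visitors the per-minute budget lasts roughly three seconds; at 5,000 it lasts well under one.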
The pivot doc's framing was "CMS-API-first as the MVP transitional state, flip to owned-first when scale demands." The math says we should flip now. Reasons:
- CMS-API-first is a known scaling dead-end. Multi-key rotation buys ~5x but doesn't change the architecture; CDN edge cache helps but only on hot tuples. The fundamental shape is: every user's lookup hits CMS over the network, every time, until cache hit. At 1k concurrent the hit rate matters; at 5k it falls over.
- Staging-first with sub-second cross-cluster PrivateLink reads (per ADR 0004) is faster on the hot path AND survives OE.
- We already have 99.97%-match staging data; LARK keeps it within 24h of CMS upstream. Drift risk is well-bounded.
- CMS API stays in the loop as the FALLBACK on staging miss (new plan we haven't ingested yet, NPI not in our directory yet, edge cases). Best of both: speed + survivability of staging-first, authoritative answer of CMS-on-miss.
What we're NOT doing and why
- Full nightly re-ingest. Cost-prohibitive. Doesn't scale.
- Weekly hybrid (full + daily delta). Quarterly recovery floor catches the same gaps at 4x lower cost.
- Layer 3 byte-hash only (without classifier). Saves bandwidth not IOPS. Half the value at almost the same engineering cost.
- Drift-detection-only as primary refresh trigger. Recovery is the under-tested path. Kept as monitor + alert.
- Switching primary data source to an alt vendor. No source displaces §1311 (see §6).
- Staying on CMS-API-first runtime. OE 2027 math is the load-bearing constraint; the pivot was transitional and the transition is now.
/versions endpoint integration (resolves ENG-228)
Per Phase A4 finding, /versions returns 10 dataset stamps. CMS docs do not publish refresh cadence; daily-ish is observational.
Per-dataset role across both data sets LARK manages:
| Dataset | Updates (observed) | LARK role - §1311 MRFs | LARK role - plan catalog |
|---|---|---|---|
| `coverage` | Daily | Layer-1 trigger | n/a |
| `npis` | Daily | Layer-1 trigger | n/a |
| `drugs` | Daily | Layer-1 trigger | n/a |
| `taxonomies` | Daily | Out of scope | n/a |
| `plans-etl` | ~Weekly | n/a | Layer-1 trigger for plan-catalog refresh |
| `plans` | ~Weekly | n/a | Layer-1 trigger for plan-catalog refresh |
| `issuers` | ~Weekly | Trigger for `mrpuf_issuers_staging` re-pull | Adjacent |
| `zipcodes` | ~Biweekly | Out of scope | Out of scope |
| `plan-urls`, `etl-id` | Tied to `plans-etl` | Out of scope | Bookkeeping |
Note that plans-etl and plans advance roughly weekly, NOT annually. Today's plans collection is reingested seasonally (HIOS submission cycle); LARK should track these per-week refreshes. Spec'd in §7.
Granularity caveat: CMS only exposes per-dataset, not per-issuer or per-plan. They can bump coverage even when only 5 issuers' rows changed. Layers 2 + 3 absorb this overhead because HEAD is cheap and the classifier skips no-op records.
Two independent triggers per source: LARK fans out when either Layer 1 (CMS-side /versions advance) or Layer 2 (issuer-side / file-side ETag advance) signals change. This handles the "issuer published, CMS hasn't indexed yet" gap (the 3 FAIL issuers in A1).
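A sketch of the either-trigger rule follows. It assumes the `/versions` response can be reduced to a map from dataset name to an opaque stamp string; that shape, and both function names, are assumptions for illustration rather than the documented CMS payload.

```javascript
// Layer 1: which watched datasets carry a different stamp than last time?
function advancedDatasets(lastSeen, current, watched) {
  return watched.filter(
    (name) => current[name] !== undefined && current[name] !== lastSeen[name]
  );
}

// Fan out when EITHER trigger fires: CMS-side advance or issuer-side ETag change.
function shouldFanOut({ layer1Advanced, layer2Changed }) {
  return layer1Advanced.length > 0 || layer2Changed;
}

const lastSeen = { coverage: "2026-05-08", npis: "2026-05-08", drugs: "2026-05-08" };
const current = { coverage: "2026-05-09", npis: "2026-05-08", drugs: "2026-05-08" };
const advanced = advancedDatasets(lastSeen, current, ["coverage", "npis", "drugs"]);
console.log(advanced, shouldFanOut({ layer1Advanced: advanced, layer2Changed: false }));
```

Keeping the two signals independent means an issuer-side change still fans out on days when `/versions` has not moved.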
Audit-harness pinning: record coverage.updated at audit start so re-runs are interpretable. Port probeVersionsEndpoint(cms) from scripts/audit/investigate-failing-npis-and-errored-issuers.js:40 into scripts/audit/tier-6-mrf-coverage-validation.js.
Stale-data UX (plan card freshness): plan cards / coverage check show "Coverage data refreshed Xh ago" by reading per-source last_seen_material_change_at. Implementation lands in ENG-236 alongside LARK.
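The freshness label could be derived as below. This is only a sketch: the function name is hypothetical, and rounding down to whole hours is an assumption about the desired UX.

```javascript
// Render "Coverage data refreshed Xh ago" from a stored timestamp.
function freshnessLabel(lastSeenMaterialChangeAt, now = new Date()) {
  const elapsedMs = now - new Date(lastSeenMaterialChangeAt);
  // Clamp at zero so clock skew never yields a negative age.
  const hours = Math.max(0, Math.floor(elapsedMs / 3_600_000));
  return `Coverage data refreshed ${hours}h ago`;
}

console.log(freshnessLabel("2026-05-09T06:00:00Z", new Date("2026-05-09T11:30:00Z")));
```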
Implementation sketch (for ENG-236 / #98)
High-level. Detailed implementation lives in ENG-236.
State store
New collection mrf_file_state_staging (already allowlisted in src/lib/db.ts:172, currently empty). Schema:
```js
{
  source: "cms_versions" | "issuer_index" | "issuer_shard" | "nppes" | "cms_plans_api",
  source_key: string,                  // e.g., issuer_hios + file_url
  layer1_signal: string | null,        // /versions timestamp captured at last check
  layer2_etag: string | null,
  layer2_last_modified: string | null,
  layer2_last_check_at: ISODate,
  layer3_records_seen: number,
  layer3_records_material_changed: number,
  layer3_records_cosmetic_changed: number,
  last_seen_material_change_at: ISODate,
  per_record_semantic_hashes: { [_id]: sha1 }, // for diff lookup
}
```
Daily run (cron schedule)
GitHub Actions workflow at .github/workflows/rdre-daily.yml. Cron 0 6 * * * UTC. One workflow with parallel jobs per source:
Job: §1311 MRFs
- Layer 1: poll `/versions`, compare against last-seen for `coverage`, `npis`, `drugs`. If none advanced, no-op + heartbeat write.
- Layer 2: per-issuer (`mrpuf_issuers_staging` ~183 rows), HEAD `index_url` with conditionals; 304-skip; on 200 OK with new ETag, fan-out to changed shards (HEAD each, 304-skip, full-fetch only changed).
- Layer 3: streaming-parse changed shards; canonicalize + semantic-hash per record; diff against stored hash; classify; upsert only material-changed records.
- Per-issuer atomic isolation already in place; reuse.
Job: plan catalog (CMS Marketplace API)
- Layer 1: poll `/versions`, compare against last-seen for `plans-etl`, `plans`. If none advanced, no-op.
- Layer 2: not applicable (no per-source files; CMS API is the source). Replaced by per-plan freshness query.
- Layer 3: incremental fetch (which plan IDs changed since last-seen `plans-etl`), classify, upsert. Detail spec in ENG-237 (TBD).
Job: NPPES enrichment (designed; build trigger separate)
- Layer 1: HEAD the NPPES file index page; check posted last-updated date.
- Layer 2: HEAD the differential file URLs; 304-skip on no change.
- Layer 3: streaming-parse the differential, classify against existing `providers_staging` (taxonomy + practice-address fields are material; rest is enrichment-only).
- Build trigger: separate issue.
Post-ingest: rebuild derived collections (ENG-425)
- After the §1311 MRF job (and any SBE/CA formulary ingest) writes `formularies_staging`, run `node scripts/db/derive-drug-search-index.js --apply` to rebuild `drug_search_index` (the drug-name search read-model). It reads the WHOLE collection (FFM + SBE/CA), so it covers every state/source. Idempotent; additive collection; rollback = `--rollback` / drop.
- Sequencing: run AFTER formulary Layer 3 completes, BEFORE the drift monitor. A stale `drug_search_index` only affects search ranking/strength lists, never coverage (coverage stays per-rxcui in `formularies_staging`), so a brief lag is non-critical.
- Any future derived read-model added on the reference cluster lands here too.
Diff classifier framework
Pluggable per collection:
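```js
// Sketch: per-collection rules are plain data; comparison logic is shared.
const classifier = {
  providers_staging: {
    material_fields: ["npi", "name", "facility_name", "specialty", "plans"],
    cosmetic_fields: ["last_published_at", "_metadata"],
    set_fields: ["plans"], // arrays treated as sets
  },
  formularies_staging: { /* analogous; tier + plans_covered are material; ordering is cosmetic */ },
};

// Order-insensitive array comparison for set_fields.
const setEqual = (a, b) => {
  const sa = new Set(a ?? []);
  const sb = new Set(b ?? []);
  return sa.size === sb.size && [...sa].every((x) => sb.has(x));
};

// Lowercase + sort top-level keys, collapse whitespace in string values.
const canonicalize = (rec) =>
  Object.fromEntries(
    Object.entries(rec)
      .map(([k, v]) => [k.toLowerCase(), typeof v === "string" ? v.trim().replace(/\s+/g, " ") : v])
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)),
  );

// True when any material field differs between the stored and incoming record.
// Strict comparison assumes scalar values; array-valued fields belong in set_fields.
const isMaterial = (collection, oldRec, newRec) => {
  const { material_fields = [], set_fields = [] } = classifier[collection];
  const [a, b] = [canonicalize(oldRec), canonicalize(newRec)];
  return material_fields.some((field) =>
    set_fields.includes(field) ? !setEqual(a[field], b[field]) : a[field] !== b[field],
  );
};
```
The framework is small (<200 LOC). Per-collection rules are explicit data, easy to audit + adjust as we learn what issuers actually publish.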
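How a classification turns into an Atlas write can be sketched as follows. The operation objects use the MongoDB `bulkWrite` `updateOne` shape; the function name, the `"material"` / `"cosmetic"` / `"unchanged"` labels, and the decision to upsert only on material changes are illustrative assumptions based on the rule "material change → upsert, cosmetic change → update only `af_semantic_hash` + `last_seen_at`".

```javascript
// Map one classified record to a bulkWrite operation, or null when nothing changed.
function toWriteOp(record, semanticHash, classification, now = new Date()) {
  if (classification === "unchanged") return null;

  const { _id, ...fields } = record; // never $set the immutable _id
  const bookkeeping = { af_semantic_hash: semanticHash, last_seen_at: now };

  // Material: write data fields + bookkeeping. Cosmetic: bookkeeping only.
  const $set = classification === "material" ? { ...fields, ...bookkeeping } : bookkeeping;

  return {
    updateOne: {
      filter: { _id },
      update: { $set },
      upsert: classification === "material",
    },
  };
}

const rec = { _id: "p1", npi: "123", name: "Jane Doe" };
const ops = [
  toWriteOp(rec, "hash-a", "material"),
  toWriteOp(rec, "hash-b", "cosmetic"),
  toWriteOp(rec, "hash-b", "unchanged"),
].filter(Boolean);
console.log(ops.length); // operations that would be passed to collection.bulkWrite(ops)
```

Cosmetic operations touch two small fields and never create documents, which is where the write-IOPS saving is expected to come from.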
Drift monitor + recovery
- Tier-6 audit harness (`scripts/audit/tier-6-mrf-coverage-validation.js`) re-purposed as continuous monitor.
- Daily run after LARK completes (cheap; uses CMS API not local data).
- Slack alert on >5% drift (overall match rate falls below 95%) - distinguishes from the 99.5% SLO.
- On alert: trigger ad-hoc LARK run (no waiting for next cron) + investigation hook for the affected source/issuer.
- Quarterly full re-baseline: first Sunday of each quarter, 03:00 UTC. Existing `scripts/db/ingest-mrf-{providers,formularies}.js` run unchanged.
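The alert rule reduces to a threshold check on the audit's overall match rate. A minimal sketch, assuming the harness reports matched and total check counts; `driftStatus` and `MIN_MATCH_RATE` are hypothetical names.

```javascript
// Alert when the overall match rate falls below 95% (i.e. >5% drift).
const MIN_MATCH_RATE = 0.95;

function driftStatus(matched, total) {
  if (total <= 0) throw new RangeError("total must be positive");
  const matchRate = matched / total;
  return { matchRate, alert: matchRate < MIN_MATCH_RATE };
}

// Figures from the 2026-05-03 audit: 148,475 of 148,520 checks matched.
const audit = driftStatus(148475, 148520);
console.log(`${(audit.matchRate * 100).toFixed(2)}% match, alert=${audit.alert}`);
```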
Schema-integrity follow-ups (audit-flagged)
- 868 docs with unmapped tiers (`NONPREFERRED-SPECIALTY-DRUGS`, `NONPREFERRED-BRAND-AND-SPECIALTY`): extend `TIER_MAP` in `scripts/db/lib/mrf-helpers.js:97-262` with two entries. Tier-map auto-extend wired into LARK: emit a side-channel "unmapped tier seen" event rather than hard-failing.
- 167,776 provider records with invalid type field: parsing variance; out of cascade scope, separate fix.
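The "emit an event rather than hard-fail" behavior might look like the sketch below. The map entries and their values are placeholders (the real `TIER_MAP` lives in `scripts/db/lib/mrf-helpers.js` and is not reproduced here), and the event shape and callback are assumptions.

```javascript
// Placeholder subset for illustration only; not the real TIER_MAP contents.
const TIER_MAP = {
  "GENERIC": "generic",
  "PREFERRED-BRAND": "preferred_brand",
};

// Look up a raw tier; on a miss, report through a side channel and keep going.
function mapTier(rawTier, onUnmapped) {
  const key = String(rawTier).trim().toUpperCase();
  if (Object.hasOwn(TIER_MAP, key)) return TIER_MAP[key];
  onUnmapped({ event: "unmapped_tier_seen", tier: key });
  return null; // caller decides how to store an unmapped tier
}

const events = [];
const mapped = mapTier("generic", (e) => events.push(e));
const unmapped = mapTier("NONPREFERRED-SPECIALTY-DRUGS", (e) => events.push(e));
console.log(mapped, unmapped, events);
```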
Alt data sources investigated (resolves ENG-232)
Researched 2026-05-09 via vendor pages + targeted searches. Honest about unknowns: where pricing is sales-contact-only or membership-gated, that's stated.
| Source | Role for our use case | Recommendation |
|---|---|---|
| §1311 MRFs (CMS direct file feed) | Primary feed; what we already ingest | Stay (primary). LARK refreshes this. |
| NPPES (NPI Registry) | Free public CMS file: NPI + name + taxonomy/specialty + practice address. No plan/network data. Refresh cadence: monthly (latest 2026-04-13, file ~1073 MB v.2). | Pursue as enrichment (separate issue). LARK-pluggable. Could enrich providers_staging with cleaner specialty + practice-address data. |
| Stedi (healthcare APIs) | Eligibility (270/271) + Claims clearinghouse. $0.15/check after 100 free; Developer $500/mo. SOC 2 + HIPAA-compliant. | Not pursuing. Wrong primitive for marketplace coverage display. Eligibility = "is patient enrolled with payer," not "is drug covered." Possibly later for member-portal eligibility check post-enrollment. |
| Serif Health Signal (signal) | MRF rate-transparency benchmarking; 200+ commercial payers; monthly refresh. Pricing sales-only. | Not pursuing. Wrong file scope (negotiated rates, not formulary or directory). |
| Surescripts (overview) | Real-time eligibility + Rx history + e-prescribing + RTBC. Enterprise EHR/payer ecosystem. Sales-only. | Not pursuing. Wrong access tier (institutional, not consumer apps). |
| NCPDP dataQ (NCPDP) | Pharmacy directory + prescriber file + formulary standards. Membership-gated. | Defer. Revisit if/when we ship pharmacy-network-tier feature (#106). Not relevant to LARK. |
| NIPR (nipr.com) | Insurance producer/agent licensing; NPN validation. | Out of scope. Different workstream (agent platform Phase 5). |
| Per-issuer REST APIs (Cigna, Aetna, BCBS, etc.) | Member-portal-quality data per issuer. Engineering cost ~1 week/issuer × 183 issuers = unaffordable. | Not pursuing as primary. Possible enrichment for top 3-5 issuers post-OE 2027 if §1311 has gaps after LARK. |
| CMS `/coverage/search` | Combined provider+drug autocomplete on CMS Marketplace API. | Already on the runtime path - underused endpoint. Worth a separate ticket for typeahead UX. |
| GoodRx, RxRevu | Retail Rx discount pricing or EHR-integrated RTBC. | Not pursuing. Different problem space. |
Net: No alt source displaces §1311 as the primary marketplace coverage feed. NPPES is a valuable monthly enrichment for taxonomy + practice-address; tracked as a separate issue. LARK is source-pluggable so adding NPPES later is mechanical.
Fallback ordering decision (resolves ENG-233) - REVISED
Flip to Option A: staging-first, CMS-on-miss. The pivot doc framed CMS-API-first as the MVP transitional state. The math now says staging-first is the right ordering at OE-baseline traffic, not just past 5k concurrent.
Rationale
- OE 2027 scale. CMS-API-as-primary cannot survive 1k concurrent /plans visitors (per pivot doc math: budget exhausts in <3s). Staging-first sidesteps the rate-limit problem entirely on the hot path.
- Latency. Staging-via-PrivateLink is faster than CMS-via-public-network. We've measured ~225-465ms p99 for the gap-fill path on 2026-05-08; pure staging-only reads should be similar or faster.
- Freshness is good enough. LARK keeps staging within 24h of CMS upstream. 99.97% match. The 0.03% drift becomes the CMS-fallback path (small fraction of lookups).
- CMS stays in the loop as authoritative fallback. When staging lacks the plan_id, NPI, or RxCUI needed to answer (new plan added since last refresh, brand-new NPI, etc.), runtime falls through to CMS. Best of both.
How the runtime changes
Today (Option B): CMS-first → staging fills gaps where CMS returns Covered but no drug_tier / network_tier (per src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts).
After (Option A): Staging first for the canonical answer (covered/not + tier/network + carrier metadata). CMS only on miss (staging doesn't have the plan/NPI/RxCUI tuple). The fallback shapes:
- Lookup hits staging → return immediately
- Lookup misses staging → call CMS API; if CMS has the answer, return that AND opportunistically write back to staging (lazy enrichment so the next user gets a hit) - tracked as a follow-up
- Lookup misses both → return "no data available" with appropriate UX
Implementation: invert the order in drug-tier-fallback.ts and provider-network-fallback.ts. Add an opportunistic-writeback pattern. Feature-flag the flip so we can toggle if anything misbehaves.
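The three fallback shapes can be sketched as a small resolver. This is an illustration under stated assumptions, not the contents of `drug-tier-fallback.ts`: `findInStaging` and `fetchFromCms` are hypothetical injected lookups, the `writeback` field only signals that a lazy enrichment should be scheduled, and the feature flag is omitted for brevity.

```javascript
// Pure decision step: given what each source returned, pick the answer.
function pickAnswer(stagingDoc, cmsDoc) {
  if (stagingDoc) return { source: "staging", answer: stagingDoc, writeback: false };
  // Staging miss, CMS hit: answer from CMS and flag an opportunistic writeback.
  if (cmsDoc) return { source: "cms", answer: cmsDoc, writeback: true };
  return { source: "none", answer: null, writeback: false };
}

// Async wrapper: CMS is only called when staging misses.
async function lookupCoverage(key, { findInStaging, fetchFromCms }) {
  const stagingDoc = await findInStaging(key);
  const cmsDoc = stagingDoc ? null : await fetchFromCms(key);
  return pickAnswer(stagingDoc, cmsDoc);
}
```

Because the CMS call sits behind the staging miss, the CMS budget is consumed only on the miss path, which is what the SLO on budget consumption relies on.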
Triggers to flip BACK to Option B (CMS-first)
Explicit and named. Should never trip given current math + LARK design:
- Audit match rate sustained below 99.0% for >2 weeks (cascade lag too severe; CMS-first would be a workaround until LARK is fixed)
- LARK reliability incidents (cron failures + Layer-3 false-positives) > 2 per quarter sustained
- Atlas cluster outage > 15 min during business hours (CMS-first as a circuit-breaker is a separate question - probably solved via cached-response fallback rather than a primary flip)
- Compliance / audit finding that staging data lag > 24h is unacceptable for a specific user surface (would scope flip to that surface only)
SLOs (revised for staging-first)
- Runtime p99 latency: ≤200ms for staging-hit; ≤500ms for staging-miss-then-CMS path
- Staging staleness: ≤24h baseline, ≤1h during OE 2027 peak windows
- Staging hit rate: ≥99% on `(plan_id, NPI)` / `(plan_id, RxCUI)` lookups for plans known to CMS
- CMS API budget consumption at 1k concurrent: ≤5% of single-key budget (only consumed on staging-miss path - massive headroom)
- Match-rate drift alert: Slack ping on >5% drift in tier-6 audit
Implementation note for ENG-236
Runtime flip is part of this milestone, behind a feature flag. Code paths to invert: src/lib/drug-tier-fallback.ts, src/lib/provider-network-fallback.ts, plus the /api/drugs/covered and /api/providers/covered route handlers that consume them. Opportunistic writeback can land in a follow-up; the inversion itself is small.
Open follow-ups + spawned issues
Spawned issues (logged separately)
- ENG-251 - NPPES enrichment evaluation. Investigate using NPPES monthly differential file (~1073 MB v.2, last updated 2026-04-13) to enrich `providers_staging` with cleaner taxonomy + practice-address data. Build path is LARK-pluggable (Layer 1 = posted-date check, Layer 2 = HEAD differential URL, Layer 3 = canonicalize NPPES record vs ours). Could backfill the 167,776 invalid-type + 151,036 missing-name records in `providers_staging`. Effort estimate: ~1 engineer-week after LARK framework lands. Priority: Medium.
- ENG-252 - Plan-catalog refresh cadence investigation. The `plans` collection is currently re-ingested seasonally (HIOS submission cycle). But CMS `/versions` shows `plans-etl` and `plans` advancing roughly weekly. Need to: (a) verify the actual cadence of changes the CMS Marketplace API serves; (b) determine which plan attributes change mid-year (URL fixes? plan adds/drops? service-area shifts?) and which are pinned by SI submission cycle (copays, deductibles, MOOPs, premiums); (c) wire LARK Layer-1+ for the plan catalog if sub-annual changes are material. Investigation: ~1 engineer-week. Priority: High.
Carry-over follow-ups from §1311 audit
- Per-issuer cooperative behavior unknown. Do all 183 issuers honor `If-None-Match` / `If-Modified-Since`? Subset that returns 200 OK on every request gets a fallback path: fetch + compare-shard-content-hash. Measure during ENG-236 implementation; flag in the PR.
- 3 FAIL issuers in latest audit (`42326`, `50305`, `25896`). All `stale_or_missing_in_cms` - issuer published, CMS hasn't indexed. No LARK fixes. Daily refresh should self-heal within 24-48h once CMS catches up. Worth a dashboard + alert for sustained FAIL state per issuer.
- 19 hard-error issuers in `mrpuf_issuers_staging`. Cross-correlate with the per-state coverage gaps (IN 78%, MS 79%, AZ/FL/NC 92%). IP-allowlist breakers (Medica MO, Dean WI, BCBSFL) noted in ENG-232. Solved via VPC egress; verify after ENG-236 lands.
- Tier-map extension. Add `NONPREFERRED-SPECIALTY-DRUGS` + `NONPREFERRED-BRAND-AND-SPECIALTY` entries to TIER_MAP. Trivial follow-up.
- Audit-harness `coverage.updated` pinning. Small port from `investigate-failing-npis-and-errored-issuers.js` into `tier-6-mrf-coverage-validation.js`. Could ship independently of ENG-236.
- Stale-data UX on plan card. "Coverage data refreshed Xh ago" surface; ships alongside ENG-236.
- CMS-on-miss opportunistic writeback to staging. Lazy enrichment pattern when staging-miss falls through to CMS. Follow-up to the runtime flip.
- CDN edge cache + multi-key CMS rotation. Largely obviated by the staging-first flip. Multi-key only useful for the staging-miss fallback path. Revisit if that path becomes hot.
Acceptance checklist
- [x] What cadence? Daily as the default. LARK built day-one with all 4 layers, including diff classifier. Hourly possible with one cron-line edit when OE 2027 demands.
- [x] How is `/versions` integrated? Daily poll for `coverage` / `npis` / `drugs` (§1311 path) + `plans-etl` / `plans` (plan-catalog path). Per-dataset semantics in §3 + §4.
- [x] Alt sources eval - recommendation per source. Table in §6; no alt source displaces §1311. NPPES pursued separately as enrichment.
- [x] Fallback ordering pick. Flipped to Option A: staging-first, CMS-on-miss. Triggers to flip back documented in §7.
- [x] First implementation milestone (with rough effort). ENG-236. LARK 4-layer + diff classifier + plan-catalog parallel job + runtime flip behind feature flag + drift monitor + tier-map extension. Estimated 2-3 engineer-weeks (up from 1-2 in v1).
- [x] Stale-data UX. "Coverage data refreshed Xh ago" on plan card via `last_seen_material_change_at`.
- [x] Cross-references. ADR 0004, pivot decision doc, audit-revalidation report, ENG-236 (linked above).
- [x] NPPES + plan-catalog scope expansion. Both spawned as separate issues (linked above).
Related
- docs/decisions/2026-05-03-pivot-cms-api-direct.md - pivot doc; performance ceiling math
- docs/adr/0004-cross-cluster-atlas-privatelink.md - cluster topology; PrivateLink
- `docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md` (in audit-revalidation worktree, not yet on main) - 99.97% match audit verdict
- docs/session-log/2026-05-08-phase-d-and-ci-guard.md - drug-tier + provider-network fallback shipped 2026-05-08
- docs/session-log/2026-05-08-phase-11-cross-cluster-privatelink.md - PrivateLink cutover
- scripts/db/lib/mrf-helpers.js - ingest helper library; cluster guard, tier map, CMS API wrapper
- scripts/db/ingest-mrf-providers.js + scripts/db/ingest-mrf-formularies.js - current ingest scripts; LARK replaces these on the daily path
- scripts/audit/tier-6-mrf-coverage-validation.js - drift monitor source
- src/lib/db.ts - `STAGING_ALLOWED_COLLECTIONS` includes `mrf_file_state_staging`
- src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts - runtime fallback consumers; inverted in this decision
- Linear: ENG-227, ENG-228, ENG-231, ENG-232, ENG-233
- GitHub: #86, #87, #93, #94, #95, #98
Revision history
- 2026-05-09 v1 (commit `df1882f`): initial decision; Layers 1+2 only, Layer 3 deferred, Option B fallback (CMS-first). Superseded.
- 2026-05-09 v2 (commit `ebc5f0d`): engine reframe + Layer 3 day-one with diff classifier + fallback flip to Option A (staging-first, CMS-on-miss) + scope expanded to plan-catalog + spawned NPPES (ENG-251) + plan-cadence (ENG-252) issues. Engine working name was "Reference Data Refresh Engine (RDRE)". Superseded by v3 naming.
- 2026-05-09 v3 (this commit): renamed engine to LARK. "Reference Data Refresh Engine" was technically descriptive but pitch-flat; LARK is a Florence-tied name (larks sing at dawn; our daily cron runs 06:00 UTC) with its own personality. Reserves the "Athena" name (after Florence's pet owl) for a future heavier-weight engine. ENG-251 / ENG-252 metadata also corrected here: assigned to current cycle (`5b739f96-...`) with 2026-05-11 due dates matching the original 5 issues. Current.