2026-05-09: LARK - the data refresh engine for AskFlorence

Decision record. Defines LARK (a multi-source data refresh engine) that keeps the staging mirror (and any future enriched datasets) current with upstream sources at acceptable cost, with a runtime fallback ordering of staging-first / CMS-on-miss. Collapses Linear issues ENG-227, ENG-228, ENG-231, ENG-232, ENG-233 into one decision. Implementation lands in ENG-236 / #98.

Name. LARK. Loosely "Layer-Aware Refresh Kit" but mostly because larks sing at dawn and ours wakes up at 06:00 UTC to walk the data. AskFlorence's family of engines uses light/morning/Florence-Nightingale-themed names; "Athena" (after Florence's pet owl) is reserved for a future heavier-weight engine (subsidy logic or similar).

Status

Decision date: 2026-05-09
Audit verdict referenced: ~/Developer/ask-florence-audit-revalidation/docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md (99.97% overall match, 148,475/148,520 checks)
Pivot precedent: docs/decisions/2026-05-03-pivot-cms-api-direct.md
Cluster topology precedent: docs/adr/0004-cross-cluster-atlas-privatelink.md
Implementation follow-on: ENG-236 / #98
Spawned issues: ENG-251 (NPPES enrichment evaluation), ENG-252 (plan-catalog refresh cadence investigation) - detail in §7

Decision

Build LARK as a four-layer change-detection pipeline that runs daily and refreshes our staging mirror only when upstream content materially changes. The four layers stack from the cheapest, coarsest signal to the most precise:

Layer 1 - source version watcher. Polls upstream "what's new" signals (CMS /versions for CMS-Marketplace-API-derived datasets; NPPES file-listing pages for monthly enrichment files; equivalent for any future source).
Layer 2 - file-level change probe. Per-source HEAD with If-None-Match + If-Modified-Since against issuer / file URLs; skip on 304.
Layer 3 - record-level semantic diff. Streaming-parses changed files, computes per-record diffs against stored canonical state, classifies whether each diff is material or cosmetic, writes only material diffs to Atlas. Built day one. (Reframed from the original "byte-hash diff" - see §"Why Layer 3 day-one" below.)
Layer 4 - drift monitor + recovery floor. Tier-6 audit harness re-purposed as continuous monitor with Slack alert on >5% drift; quarterly full re-baseline as recovery sweep for whatever Layers 1-3 missed.

The engine is source-pluggable. Day-one sources: §1311 MRFs (formularies_staging + providers_staging). Day-two sources: CMS Marketplace API plan-catalog data (the plans collection, whose refresh cadence has been wrongly assumed to be annual; see §7), NPPES enrichment for taxonomy fields, and anything else with the same change-source / file-fetch / record-upsert shape.

Runtime fallback ordering flipped to Option A: staging-first, CMS-on-miss. Reasoning: CMS-API as primary cannot survive OE 2027 scale (1k concurrent visitors exhausts the per-minute budget in <3s per pivot doc); we already have a 99.97%-match staging mirror and a sub-second cross-cluster PrivateLink read path. Use what we have as primary; fall back to CMS only when staging doesn't have the answer.

Tolerance band for staleness: up to 1 week is acceptable, 1 month is the worst-case fallback, daily-or-better is the goal. LARK delivers daily comfortably; hourly during OE 2027 high-churn is one cron-line edit.

Why Layer 3 day-one (the diff classifier)

The original recommendation was to defer Layer 3 (per-record content-hash) until daily cadence wasn't enough. That was wrong. Reframed:

A pure byte-hash Layer 3 doesn't earn its complexity. But a semantic diff classifier does, because real-world MRF files contain non-material churn that would force unnecessary upserts:

An issuer's file may bump last_published_at on every publish, even when no records changed. Byte-hash flags every record as "changed"; classifier sees "only last_published_at field changed" and marks it cosmetic.
Field ordering inside JSON objects can drift between publishes. Byte-hash flags everything; canonicalized hash + diff classifier sees "same content, different key order" and skips.
Nested arrays may re-order without semantic change (e.g., plans: ["X", "Y"] vs plans: ["Y", "X"]). Byte-hash flags; classifier with set-equality on known-unordered fields skips.

Without the classifier, Layer 3 saves bandwidth (we already downloaded the file) but doesn't save Atlas write IOPS, which is the actual cost driver during typical-change days. With the classifier, Layer 3 saves both.

Implementation shape (similar to git diff in spirit, simpler in scope):

Per-collection canonicalization rule: lowercase keys, sort object keys, normalize string whitespace, treat known-unordered arrays as sets.
SHA-1 the canonicalized form per record - "semantic hash" - stored on the doc as af_semantic_hash and in mrf_file_state_staging for fast lookup.
On detected file change (Layer 2 trip), streaming-parse the new file, compute new semantic hash per record, compare against stored.
For changed records: optionally compute a per-field diff (which fields changed, old vs new) for audit logging. Cheap; same parse pass.
Classifier rule set per collection: which fields are "material" (changes warrant upsert) vs "cosmetic" (changes are skipped). Examples for providers_staging: NPI/name/specialty/network = material; last_published_at / record-id metadata = cosmetic. For formularies_staging: drug-tier / formulary-id / coverage = material; ordering of plans array = cosmetic.
Material change → upsert. Cosmetic change → update only af_semantic_hash + last_seen_at, no write to data fields.
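
The canonicalize-then-hash steps above can be sketched as follows. This is a minimal illustration, not the ENG-236 implementation: the function names, the UNORDERED_FIELDS rule, and the sample records are assumptions made for the example.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical rule: array fields whose element order carries no meaning.
const UNORDERED_FIELDS = new Set(["plans"]);

// Recursively canonicalize: lowercase + sort object keys, collapse string
// whitespace, and sort arrays that the rules mark as unordered.
function canonicalize(value, fieldName = "") {
  if (Array.isArray(value)) {
    const items = value.map((v) => canonicalize(v));
    if (!UNORDERED_FIELDS.has(fieldName)) return items;
    // Sort by serialized form so element order no longer affects the hash.
    return items.sort((a, b) => {
      const sa = JSON.stringify(a);
      const sb = JSON.stringify(b);
      return sa < sb ? -1 : sa > sb ? 1 : 0;
    });
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value).map(([k, v]) => {
      const key = k.toLowerCase();
      return [key, canonicalize(v, key)];
    });
    entries.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return Object.fromEntries(entries);
  }
  return typeof value === "string" ? value.trim().replace(/\s+/g, " ") : value;
}

// SHA-1 over the canonical JSON form: the "semantic hash" stored per record.
function semanticHash(record) {
  return createHash("sha1")
    .update(JSON.stringify(canonicalize(record)))
    .digest("hex");
}

// Same content, different key order, whitespace, and plans order (made-up data).
const a = { NPI: "1234567890", name: "Jane  Doe ", plans: ["X", "Y"] };
const b = { plans: ["Y", "X"], name: "Jane Doe", npi: "1234567890" };
console.log(semanticHash(a) === semanticHash(b)); // true
```

A record whose semantic hash matches the stored one needs no further work; only records with a differing hash go on to the material/cosmetic classifier.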

Cost-benefit math:

Without Layer 3 + classifier: every Layer-2-triggered shard re-fetch upserts every record in the shard, regardless of whether content changed. On a typical-change day with 10-30 issuers updating, that's millions of needless upserts. Atlas write-IOPS pressure.
With Layer 3 + classifier: same shard re-fetch, only material-diff records upsert. Likely <5% of records per typical-change day. ~20x reduction in Atlas write IOPS.

Engineering cost: ~2-3 engineer-days for the classifier framework + per-collection rule sets. Lower than the cost of one M60 burst event. Worth it day-one.

Rationale

Cost vs achievable cadence

Today's full-ingest cost: ~22 GB compressed download + ~32M Atlas upserts on every run. The 2026-05-06 incident showed that sustaining this costs ~$2,800/mo at M60, so full nightly re-ingest is off the table. The 2026-05-03 audit at 99.97% match shows we don't need that level of drift mitigation.

Architecture	Cost-per-run shape	Cadence sustainable	Worst-case staleness	Why we did or didn't pick it
Full nightly re-ingest	$2,800/mo if sustained	Daily at huge cost	<24h	Rejected. Cost-prohibitive.
Layer 1 only (`/versions` poll, no HEAD, no hash)	Cheap poll + full-fetch all 183 issuers when triggered	Daily safe; sub-daily wasteful	~24h on trigger days	Rejected. Same cost shape as today's burst on every trigger day.
Layers 1+2 (`/versions` + per-issuer ETag/Last-Modified HEAD)	Cheap poll + 183 cheap HEADs + selective re-fetch only changed issuers	Daily comfortable	<24h baseline	Insufficient. Saves bandwidth but still re-upserts every record in every changed shard - misses the Atlas IOPS driver.
Layers 1+2+3 with byte-hash	+ streaming hash + targeted upserts	Hourly viable	<1h	Insufficient. Cosmetic file churn forces upserts for non-material changes.
Layers 1+2+3 with diff classifier (LARK)	+ canonicalized record diff + material/cosmetic classifier	Daily default; hourly for OE 2027	<24h baseline; <1h achievable	Selected. Saves bandwidth AND Atlas IOPS. ~20x reduction on typical-change days vs byte-hash alone.
Drift-detection-only (Tier-6 audit triggers ad-hoc re-ingest)	Zero cost until threshold crossed; full cost on recovery	Reactive only	Worst-case = audit-interval + recovery-time	Rejected as primary. Recovery is the rare, and therefore under-tested, path. Kept as Layer 4 monitor/alert + quarterly recovery floor.
Quarterly full re-baseline (recovery floor only)	Predictable, expensive	Quarterly	n/a (back-of-house)	Selected as Layer 4 recovery floor. ~$200/yr per pivot doc. Catches whatever Layers 1-3 missed.

OE 2027 concurrency math (drives the fallback flip)

From the pivot doc:

1,000 concurrent /plans visitors × ~10 calls per visit in a 30s window = 333 calls/sec. CMS budget: 200 RPS / 1,000 per minute. Single-key budget exhausts in <3 seconds.
5,000 concurrent (the flip threshold) = 1,667 calls/sec. CMS-direct breaks here even with multi-key.
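
The arithmetic behind those two figures can be reproduced with a few lines. This is only a worked calculation using the numbers quoted above (about 10 calls per visit, a 30-second window, 1,000 calls per minute per key); the function names are made up for the example, and a steady call rate is assumed.

```javascript
// Figures from the text: ~10 calls per visit in a 30 s window,
// and a CMS budget of 1,000 calls per minute per key.
const CALLS_PER_VISIT = 10;
const WINDOW_SECONDS = 30;
const BUDGET_PER_MINUTE = 1000;

// Aggregate call rate generated by a given number of concurrent visitors.
function callsPerSecond(concurrentVisitors) {
  return (concurrentVisitors * CALLS_PER_VISIT) / WINDOW_SECONDS;
}

// Seconds until one key's per-minute budget is spent at a steady call rate.
function secondsToExhaust(concurrentVisitors) {
  return BUDGET_PER_MINUTE / callsPerSecond(concurrentVisitors);
}

for (const visitors of [1000, 5000]) {
  console.log(
    `${visitors} visitors: ${Math.round(callsPerSecond(visitors))} calls/sec, ` +
      `budget gone in ~${secondsToExhaust(visitors).toFixed(1)} s`
  );
}
```

At 1,000 visitors the per-minute budget lasts roughly 3 seconds; at 5,000 it lasts well under a second.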

The pivot doc's framing was "CMS-API-first as the MVP transitional state, flip to owned-first when scale demands." The math says we should flip now. Reasons:

CMS-API-first is a known scaling dead-end. Multi-key rotation buys ~5x but doesn't change the architecture; CDN edge cache helps but only on hot tuples. The fundamental shape is: every user's lookup hits CMS over the network, every time, until cache hit. At 1k concurrent the hit rate matters; at 5k it falls over.
Staging-first with sub-second cross-cluster PrivateLink reads (per ADR 0004) is faster on the hot path AND survives OE.
We already have 99.97%-match staging data; LARK keeps it within 24h of CMS upstream. Drift risk is well-bounded.
CMS API stays in the loop as the FALLBACK on staging miss (new plan we haven't ingested yet, NPI not in our directory yet, edge cases). Best of both: speed + survivability of staging-first, authoritative answer of CMS-on-miss.

What we're NOT doing and why

Full nightly re-ingest. Cost-prohibitive. Doesn't scale.
Weekly hybrid (full + daily delta). Quarterly recovery floor catches the same gaps at 4x lower cost.
Layer 3 byte-hash only (without classifier). Saves bandwidth not IOPS. Half the value at almost the same engineering cost.
Drift-detection-only as primary refresh trigger. Recovery is the under-tested path. Kept as monitor + alert.
Switching primary data source to an alt vendor. No source displaces §1311 (see §6).
Staying on CMS-API-first runtime. OE 2027 math is the load-bearing constraint; the pivot was transitional and the transition is now.

/versions endpoint integration (resolves ENG-228)

Per Phase A4 finding, /versions returns 10 dataset stamps. CMS docs do not publish refresh cadence; daily-ish is observational.

Per-dataset role across both data sets LARK manages:

Dataset	Updates (observed)	LARK role - §1311 MRFs	LARK role - plan catalog
`coverage`	Daily	Layer-1 trigger	n/a
`npis`	Daily	Layer-1 trigger	n/a
`drugs`	Daily	Layer-1 trigger	n/a
`taxonomies`	Daily	Out of scope	n/a
`plans-etl`	~Weekly	n/a	Layer-1 trigger for plan-catalog refresh
`plans`	~Weekly	n/a	Layer-1 trigger for plan-catalog refresh
`issuers`	~Weekly	Trigger for `mrpuf_issuers_staging` re-pull	Adjacent
`zipcodes`	~Biweekly	Out of scope	Out of scope
`plan-urls`, `etl-id`	Tied to plans-etl	Out of scope	Bookkeeping

Note that plans-etl and plans advance roughly weekly, NOT annually. Today's plans collection is reingested seasonally (HIOS submission cycle); LARK should track these per-week refreshes. Spec'd in §7.

Granularity caveat: CMS exposes version stamps per dataset only, not per issuer or per plan, so it can bump coverage even when only 5 issuers' rows changed. Layers 2 + 3 absorb this overhead because HEAD is cheap and the classifier skips no-op records.

Two independent triggers per source: LARK fans out when either Layer 1 (CMS-side /versions advance) or Layer 2 (issuer-side / file-side ETag advance) signals change. This handles the "issuer published, CMS hasn't indexed yet" gap (the 3 FAIL issuers in A1).
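
A rough sketch of the Layer 1 comparison and the either-trigger rule follows. The real /versions response shape is not described here, so the code assumes it has already been normalized into a plain { datasetName: stamp } map; the function names and sample stamps are illustrative only.

```javascript
// Datasets that act as Layer-1 triggers per job, per the table above.
const WATCHED = {
  mrf: ["coverage", "npis", "drugs"],
  plan_catalog: ["plans-etl", "plans"],
};

// Returns the watched datasets whose stamp differs from the last one stored.
// `lastSeen` and `current` are assumed to be { datasetName: stampString } maps.
function advancedDatasets(job, lastSeen, current) {
  return WATCHED[job].filter(
    (name) => current[name] !== undefined && current[name] !== lastSeen[name]
  );
}

// Fan out when EITHER trigger fires: CMS-side advance or issuer-side ETag change.
function shouldFanOut({ layer1Advanced, layer2Changed }) {
  return layer1Advanced.length > 0 || layer2Changed;
}

// Illustrative stamps only.
const lastSeen = { coverage: "2026-05-08", npis: "2026-05-08", drugs: "2026-05-08" };
const current = { coverage: "2026-05-09", npis: "2026-05-08", drugs: "2026-05-08" };
console.log(advancedDatasets("mrf", lastSeen, current)); // [ 'coverage' ]
```

Keeping the two signals separate means an issuer-side ETag change can still trigger a fan-out on a day when no CMS stamp moved.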

Audit-harness pinning: record coverage.updated at audit start so re-runs are interpretable. Port probeVersionsEndpoint(cms) from scripts/audit/investigate-failing-npis-and-errored-issuers.js:40 into scripts/audit/tier-6-mrf-coverage-validation.js.

Stale-data UX (plan card freshness): plan cards / coverage check show "Coverage data refreshed Xh ago" by reading per-source last_seen_material_change_at. Implementation lands in ENG-236 alongside LARK.
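
One possible shape for the freshness label, assuming the timestamp arrives as an ISO string or Date. The function name, the "<1h" wording, and the "unavailable" fallback text are assumptions; the actual copy is a product decision.

```javascript
const MS_PER_HOUR = 3600000;

// Formats the "Coverage data refreshed Xh ago" label from
// last_seen_material_change_at, rounding down to whole hours.
function freshnessLabel(lastSeenMaterialChangeAt, now = new Date()) {
  const elapsedMs = now.getTime() - new Date(lastSeenMaterialChangeAt).getTime();
  // Missing, unparseable, or future timestamps get a neutral fallback.
  if (!Number.isFinite(elapsedMs) || elapsedMs < 0) {
    return "Coverage data refresh time unavailable";
  }
  const hours = Math.floor(elapsedMs / MS_PER_HOUR);
  return hours < 1
    ? "Coverage data refreshed <1h ago"
    : `Coverage data refreshed ${hours}h ago`;
}

console.log(freshnessLabel("2026-05-09T06:00:00Z", new Date("2026-05-09T13:30:00Z")));
```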

Implementation sketch (for ENG-236 / #98)

High-level. Detailed implementation lives in ENG-236.

State store

New collection mrf_file_state_staging (already allowlisted in src/lib/db.ts:172, currently empty). Schema:

{
  source: "cms_versions" | "issuer_index" | "issuer_shard" | "nppes" | "cms_plans_api",
  source_key: string,                  // e.g., issuer_hios + file_url
  layer1_signal: string | null,        // /versions timestamp captured at last check
  layer2_etag: string | null,
  layer2_last_modified: string | null,
  layer2_last_check_at: ISODate,
  layer3_records_seen: number,
  layer3_records_material_changed: number,
  layer3_records_cosmetic_changed: number,
  last_seen_material_change_at: ISODate,
  per_record_semantic_hashes: { [_id]: sha1 },  // for diff lookup
}
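
To illustrate how one Layer 3 pass might update a document of this shape, here is a small pure function. It is a sketch under assumptions: the `results` array shape ({ id, hash, kind }) and the function name are hypothetical, and only the schema field names come from the definition above.

```javascript
// Folds one Layer 3 pass into a state document shaped like the schema above.
// `results` is assumed to be [{ id, hash, kind }] where kind is
// "material", "cosmetic", or "unchanged".
function applyLayer3Results(state, results, now = new Date()) {
  const hashes = { ...state.per_record_semantic_hashes };
  let material = 0;
  let cosmetic = 0;
  for (const { id, hash, kind } of results) {
    hashes[id] = hash; // always remember the latest hash, even for cosmetic changes
    if (kind === "material") material += 1;
    else if (kind === "cosmetic") cosmetic += 1;
  }
  return {
    ...state,
    layer3_records_seen: results.length,
    layer3_records_material_changed: material,
    layer3_records_cosmetic_changed: cosmetic,
    // Only a material change moves the freshness timestamp.
    last_seen_material_change_at:
      material > 0 ? now : state.last_seen_material_change_at,
    per_record_semantic_hashes: hashes,
  };
}
```

Returning a new object rather than mutating `state` keeps the function easy to test; the caller decides how to persist it.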

Daily run (cron schedule)

GitHub Actions workflow at .github/workflows/rdre-daily.yml. Cron 0 6 * * * UTC. One workflow with parallel jobs per source:

Job: §1311 MRFs
- Layer 1: poll /versions, compare against last-seen for coverage, npis, drugs. If none advanced, no-op + heartbeat write.
- Layer 2: per-issuer (mrpuf_issuers_staging ~183 rows), HEAD index_url with conditionals; 304-skip; on 200 OK with new ETag, fan-out to changed shards (HEAD each, 304-skip, full-fetch only changed).
- Layer 3: streaming-parse changed shards; canonicalize + semantic-hash per record; diff against stored hash; classify; upsert only material-changed records.
- Per-issuer atomic isolation already in place; reuse.
Job: plan catalog (CMS Marketplace API)
- Layer 1: poll /versions, compare against last-seen for plans-etl, plans. If none advanced, no-op.
- Layer 2: not applicable (no per-source files; CMS API is the source). Replaced by per-plan freshness query.
- Layer 3: incremental fetch (which plan IDs changed since last-seen plans-etl), classify, upsert. Detail spec in ENG-237 (TBD).
Job: NPPES enrichment (designed; build trigger separate)
- Layer 1: HEAD the NPPES file index page; check posted last-updated date.
- Layer 2: HEAD the differential file URLs; 304-skip on no change.
- Layer 3: streaming-parse the differential, classify against existing providers_staging (taxonomy + practice-address fields are material; rest is enrichment-only).
- Build trigger: separate issue.
Post-ingest: rebuild derived collections (ENG-425)
- After the §1311 MRF job (and any SBE/CA formulary ingest) writes formularies_staging, run node scripts/db/derive-drug-search-index.js --apply to rebuild drug_search_index (the drug-name search read-model). It reads the WHOLE collection (FFM + SBE/CA), so it covers every state/source. Idempotent; additive collection; rollback = --rollback / drop.
- Sequencing: run AFTER formulary Layer 3 completes, BEFORE the drift monitor. A stale drug_search_index only affects search ranking/strength lists, never coverage (coverage stays per-rxcui in formularies_staging), so a brief lag is non-critical.
- Any future derived read-model added on the reference cluster lands here too.
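
The Layer 2 step in the §1311 job could look roughly like the probe below. It is a sketch, not the ENG-236 code: the function name and return shape are assumptions, `fetchImpl` is injectable so the logic can be tested without network access, and the default relies on the global fetch available in Node 18+.

```javascript
// Conditional HEAD probe. `prev` holds the validators stored at the last check
// (layer2_etag / layer2_last_modified from mrf_file_state_staging).
async function probeFile(url, prev, fetchImpl = fetch) {
  const headers = {};
  if (prev.layer2_etag) headers["If-None-Match"] = prev.layer2_etag;
  if (prev.layer2_last_modified) {
    headers["If-Modified-Since"] = prev.layer2_last_modified;
  }

  const res = await fetchImpl(url, { method: "HEAD", headers });
  if (res.status === 304) {
    // Server confirmed nothing changed; keep the stored validators.
    return {
      changed: false,
      etag: prev.layer2_etag,
      lastModified: prev.layer2_last_modified,
    };
  }
  if (!res.ok) throw new Error(`HEAD ${url} failed with status ${res.status}`);

  const etag = res.headers.get("etag");
  const lastModified = res.headers.get("last-modified");
  // Some servers answer 200 even when nothing changed, so compare validators too.
  // With no validators at all we cannot tell, and treat the file as changed.
  const hasValidator = etag !== null || lastModified !== null;
  const sameValidators =
    hasValidator &&
    etag === prev.layer2_etag &&
    lastModified === prev.layer2_last_modified;
  return { changed: !sameValidators, etag, lastModified };
}
```

A caller would run this once per issuer index_url, then once per shard for indexes that report `changed: true`, fetching in full only the shards that also report a change.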

Diff classifier framework

Pluggable per collection:

// pseudocode: sortKeys, lowercase, stripWhitespace, and setEqual are helpers assumed to exist elsewhere
const classifier = {
  providers_staging: {
    material_fields: ["npi", "name", "facility_name", "specialty", "plans"],
    cosmetic_fields: ["last_published_at", "_metadata"],
    set_fields: ["plans"],          // arrays treated as sets
    // lowercase keys, sort keys, normalize string whitespace
    canonicalize: (rec) => sortKeys(lowercase(stripWhitespace(rec))),
    // Method syntax so `this` resolves to this collection's rule set.
    isMaterial(oldRec, newRec) {
      for (const field of this.material_fields) {
        if (this.set_fields.includes(field)) {
          if (!setEqual(oldRec[field], newRec[field])) return true;
        } else if (oldRec[field] !== newRec[field]) {
          return true;
        }
      }
      return false;
    },
  },
  formularies_staging: { /* analogous; tier + plans_covered are material; ordering is cosmetic */ },
};

The framework is small (<200 LOC). Per-collection rules are explicit data, easy to audit + adjust as we learn what issuers actually publish.
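
The optional per-field diff for audit logging could be as small as the sketch below. Function names and the sample records are illustrative; the comparison assumes both records have already been canonicalized, so serialized values are directly comparable.

```javascript
// Lists which top-level fields differ between two canonicalized records.
// Values are compared by JSON serialization, which is only reliable once
// canonicalization has fixed key order and set ordering (assumed here).
function fieldDiff(oldRec, newRec) {
  const fields = new Set([...Object.keys(oldRec), ...Object.keys(newRec)]);
  const changes = [];
  for (const field of [...fields].sort()) {
    if (JSON.stringify(oldRec[field]) !== JSON.stringify(newRec[field])) {
      changes.push({ field, old: oldRec[field], new: newRec[field] });
    }
  }
  return changes;
}

// A diff is material when any changed field appears in the material list.
function classifyDiff(changes, materialFields) {
  if (changes.length === 0) return "unchanged";
  return changes.some((c) => materialFields.includes(c.field))
    ? "material"
    : "cosmetic";
}

// Made-up records: only the publish timestamp moved.
const before = { npi: "1234567890", name: "jane doe", last_published_at: "2026-05-08" };
const after = { npi: "1234567890", name: "jane doe", last_published_at: "2026-05-09" };
const changes = fieldDiff(before, after);
console.log(changes.map((c) => c.field), classifyDiff(changes, ["npi", "name"]));
```

The same `changes` array serves both purposes: it feeds the classifier and can be written to an audit log as is.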

Drift monitor + recovery

Tier-6 audit harness (scripts/audit/tier-6-mrf-coverage-validation.js) re-purposed as continuous monitor.
Daily run after LARK completes (cheap; uses CMS API not local data).
Slack alert on >5% drift (overall match rate falls below 95%). This alert threshold is deliberately separate from the 99.5% SLO.
On alert: trigger ad-hoc LARK run (no waiting for next cron) + investigation hook for the affected source/issuer.
Quarterly full re-baseline: first Sunday of each quarter, 03:00 UTC. Existing scripts/db/ingest-mrf-{providers,formularies}.js run unchanged.
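
As a sketch of the two checks above: the drift threshold and a guard for the quarterly schedule, which a plain cron expression cannot express on its own. Names are illustrative, and the guard assumes calendar quarters starting in January, April, July, and October.

```javascript
const DRIFT_ALERT_THRESHOLD = 0.95; // alert when overall match rate falls below 95%

// Overall match rate from audit counts; an empty audit counts as 0 so it alerts.
function matchRate(passed, total) {
  return total === 0 ? 0 : passed / total;
}

function shouldAlert(passed, total) {
  return matchRate(passed, total) < DRIFT_ALERT_THRESHOLD;
}

// True on the first Sunday of January, April, July, or October (UTC).
// A weekly Sunday 03:00 UTC cron could call this and exit early otherwise.
function isFirstSundayOfQuarter(date) {
  const isQuarterStartMonth = date.getUTCMonth() % 3 === 0;
  return isQuarterStartMonth && date.getUTCDay() === 0 && date.getUTCDate() <= 7;
}

// The referenced audit: 148,475 of 148,520 checks matched.
console.log((matchRate(148475, 148520) * 100).toFixed(2)); // 99.97
console.log(shouldAlert(148475, 148520)); // false
```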

Schema-integrity follow-ups (audit-flagged)

868 docs with unmapped tiers (NONPREFERRED-SPECIALTY-DRUGS, NONPREFERRED-BRAND-AND-SPECIALTY): extend TIER_MAP in scripts/db/lib/mrf-helpers.js:97-262 with two entries. Tier-map auto-extend wired into LARK: emit a side-channel "unmapped tier seen" event rather than hard-failing.
167,776 provider records with invalid type field: parsing variance; out of cascade scope, separate fix.
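
The "emit a side-channel event rather than hard-failing" behavior for unmapped tiers might be sketched like this. EXAMPLE_TIER_MAP, its entries, and the event name are hypothetical stand-ins; the real TIER_MAP lives in scripts/db/lib/mrf-helpers.js and is not reproduced here.

```javascript
const { EventEmitter } = require("node:events");

// Side channel for ingest anomalies; a listener could log or post to Slack.
const ingestEvents = new EventEmitter();

// Hypothetical stand-in for TIER_MAP (the real entries are in mrf-helpers.js).
const EXAMPLE_TIER_MAP = { GENERIC: "generic", "PREFERRED-BRAND": "preferred_brand" };

// Maps a raw tier label, emitting an event (not throwing) when it is unknown.
function mapTier(rawTier, tierMap = EXAMPLE_TIER_MAP) {
  const key = String(rawTier).trim().toUpperCase();
  if (Object.prototype.hasOwnProperty.call(tierMap, key)) return tierMap[key];
  ingestEvents.emit("unmapped_tier", { rawTier: key });
  return null; // caller keeps the record and treats the tier as unmapped
}

ingestEvents.on("unmapped_tier", ({ rawTier }) => {
  console.warn(`unmapped tier seen: ${rawTier}`);
});
mapTier("NONPREFERRED-SPECIALTY-DRUGS"); // warns instead of failing the run
```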

Alt data sources investigated (resolves ENG-232)

Researched 2026-05-09 via vendor pages + targeted searches. Honest about unknowns: where pricing is sales-contact-only or membership-gated, that's stated.

Source	Role for our use case	Recommendation
§1311 MRFs (CMS direct file feed)	Primary feed; what we already ingest	Stay (primary). LARK refreshes this.
NPPES (NPI Registry)	Free public CMS file: NPI + name + taxonomy/specialty + practice address. No plan/network data. Refresh cadence: monthly (latest 2026-04-13, file ~1073 MB v.2).	Pursue as enrichment (separate issue). LARK-pluggable. Could enrich `providers_staging` with cleaner specialty + practice-address data.
Stedi (healthcare APIs)	Eligibility (270/271) + Claims clearinghouse. $0.15/check after 100 free; Developer $500/mo. SOC 2 + HIPAA-compliant.	Not pursuing. Wrong primitive for marketplace coverage display. Eligibility = "is patient enrolled with payer," not "is drug covered." Possibly later for member-portal eligibility check post-enrollment.
Serif Health Signal	MRF rate-transparency benchmarking; 200+ commercial payers; monthly refresh. Pricing sales-only.	Not pursuing. Wrong file scope (negotiated rates, not formulary or directory).
Surescripts	Real-time eligibility + Rx history + e-prescribing + RTBC. Enterprise EHR/payer ecosystem. Sales-only.	Not pursuing. Wrong access tier (institutional, not consumer apps).
NCPDP dataQ	Pharmacy directory + prescriber file + formulary standards. Membership-gated.	Defer. Revisit if/when we ship pharmacy-network-tier feature (#106). Not relevant to LARK.
NIPR (nipr.com)	Insurance producer/agent licensing; NPN validation.	Out of scope. Different workstream (agent platform Phase 5).
Per-issuer REST APIs (Cigna, Aetna, BCBS, etc.)	Member-portal-quality data per issuer. Engineering cost ~1 week/issuer × 183 issuers = unaffordable.	Not pursuing as primary. Possible enrichment for top 3-5 issuers post-OE 2027 if §1311 has gaps after LARK.
CMS `/coverage/search`	Combined provider+drug autocomplete on CMS Marketplace API.	Already on the runtime path - underused endpoint. Worth a separate ticket for typeahead UX.
GoodRx, RxRevu	Retail Rx discount pricing or EHR-integrated RTBC.	Not pursuing. Different problem space.

Net: No alt source displaces §1311 as the primary marketplace coverage feed. NPPES is a valuable monthly enrichment for taxonomy + practice-address; tracked as a separate issue. LARK is source-pluggable so adding NPPES later is mechanical.

Fallback ordering decision (resolves ENG-233) - REVISED

Flip to Option A: staging-first, CMS-on-miss. The pivot doc framed CMS-API-first as the MVP transitional state. The math now says staging-first is the right ordering at OE-baseline traffic, not just past 5k concurrent.

Rationale

OE 2027 scale. CMS-API-as-primary cannot survive 1k concurrent /plans visitors (per pivot doc math: budget exhausts in <3s). Staging-first sidesteps the rate-limit problem entirely on the hot path.
Latency. Staging-via-PrivateLink is faster than CMS-via-public-network. We've measured ~225-465ms p99 for the gap-fill path on 2026-05-08; pure staging-only reads should be similar or faster.
Freshness is good enough. LARK keeps staging within 24h of CMS upstream. 99.97% match. The 0.03% drift becomes the CMS-fallback path (small fraction of lookups).
CMS stays in the loop as authoritative fallback. When staging lacks the plan_id, NPI, or RxCUI needed to answer (new plan added since last refresh, brand-new NPI, etc.), runtime falls through to CMS. Best of both.

How the runtime changes

Today (Option B): CMS-first → staging fills gaps where CMS returns Covered but no drug_tier / network_tier (per src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts).

After (Option A): Staging first for the canonical answer (covered/not + tier/network + carrier metadata). CMS only on miss (staging doesn't have the plan/NPI/RxCUI tuple). The fallback shapes:

Lookup hits staging → return immediately
Lookup misses staging → call CMS API; if CMS has the answer, return that AND opportunistically write back to staging (lazy enrichment so the next user gets a hit) - tracked as a follow-up
Lookup misses both → return "no data available" with appropriate UX

Implementation: invert the order in drug-tier-fallback.ts and provider-network-fallback.ts. Add an opportunistic-writeback pattern. Feature-flag the flip so we can toggle if anything misbehaves.
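
The three fallback shapes and the feature flag can be sketched as one function. Everything here is hypothetical: `staging`, `cms`, and `writeBack` stand for async functions the caller injects, `stagingFirst` stands for the feature flag, and with the flag off the sketch simply reverses the lookup order, which simplifies today's gap-fill behavior.

```javascript
// Staging-first lookup with CMS-on-miss and best-effort writeback.
async function lookupCoverage(key, { staging, cms, writeBack, stagingFirst = true }) {
  const primary = stagingFirst ? staging : cms;
  const secondary = stagingFirst ? cms : staging;
  const primaryName = stagingFirst ? "staging" : "cms";
  const secondaryName = stagingFirst ? "cms" : "staging";

  // 1. Primary hit: return immediately.
  const hit = await primary(key);
  if (hit != null) return { source: primaryName, data: hit };

  // 2. Primary miss: ask the fallback source.
  const fallback = await secondary(key);
  // 3. Miss on both: the caller renders the "no data available" UX.
  if (fallback == null) return { source: "none", data: null };

  // Opportunistic writeback only applies when CMS filled a staging miss.
  // It is fire-and-forget so a slow or failing write never delays the response.
  if (stagingFirst && writeBack) {
    Promise.resolve()
      .then(() => writeBack(key, fallback))
      .catch(() => {});
  }
  return { source: secondaryName, data: fallback };
}
```

Returning the `source` alongside the data makes the staging hit rate and the CMS-fallback rate directly measurable from request logs.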

Triggers to flip BACK to Option B (CMS-first)

Explicit and named. Should never trip given current math + LARK design:

Audit match rate sustained below 99.0% for >2 weeks (cascade lag too severe; CMS-first would be a workaround until LARK is fixed)
LARK reliability incidents (cron failures + Layer-3 false-positives) > 2 per quarter sustained
Atlas cluster outage > 15 min during business hours (CMS-first as a circuit-breaker is a separate question - probably solved via cached-response fallback rather than a primary flip)
Compliance / audit finding that staging data lag > 24h is unacceptable for a specific user surface (would scope flip to that surface only)

SLOs (revised for staging-first)

Runtime p99 latency: ≤200ms for staging-hit; ≤500ms for staging-miss-then-CMS path
Staging staleness: ≤24h baseline, ≤1h during OE 2027 peak windows
Staging hit rate: ≥99% on (plan_id, NPI) / (plan_id, RxCUI) lookups for plans known to CMS
CMS API budget consumption at 1k concurrent: ≤5% of single-key budget (only consumed on staging-miss path - massive headroom)
Match-rate drift alert: Slack ping on >5% drift in tier-6 audit

Implementation note for ENG-236

Runtime flip is part of this milestone, behind a feature flag. Code paths to invert: src/lib/drug-tier-fallback.ts, src/lib/provider-network-fallback.ts, plus the /api/drugs/covered and /api/providers/covered route handlers that consume them. Opportunistic writeback can land in a follow-up; the inversion itself is small.

Open follow-ups + spawned issues

Spawned issues (logged separately)

ENG-251 - NPPES enrichment evaluation. Investigate using NPPES monthly differential file (~1073 MB v.2, last updated 2026-04-13) to enrich providers_staging with cleaner taxonomy + practice-address data. Build path is LARK-pluggable (Layer 1 = posted-date check, Layer 2 = HEAD differential URL, Layer 3 = canonicalize NPPES record vs ours). Could backfill the 167,776 invalid-type + 151,036 missing-name records in providers_staging. Effort estimate: ~1 engineer-week after LARK framework lands. Priority: Medium.
ENG-252 - Plan-catalog refresh cadence investigation. The plans collection is currently re-ingested seasonally (HIOS submission cycle). But CMS /versions shows plans-etl and plans advancing roughly weekly. Need to: (a) verify the actual cadence of changes the CMS Marketplace API serves; (b) determine which plan attributes change mid-year (URL fixes? plan adds/drops? service-area shifts?) and which are pinned by SI submission cycle (copays, deductibles, MOOPs, premiums); (c) wire LARK Layer-1+ for the plan catalog if sub-annual changes are material. Investigation: ~1 engineer-week. Priority: High.

Carry-over follow-ups from §1311 audit

Per-issuer cooperative behavior unknown. Do all 183 issuers honor If-None-Match / If-Modified-Since? Subset that returns 200 OK on every request gets a fallback path: fetch + compare-shard-content-hash. Measure during ENG-236 implementation; flag in the PR.
3 FAIL issuers in latest audit (42326, 50305, 25896). All stale_or_missing_in_cms: the issuer published, but CMS hasn't indexed yet. Nothing for LARK to fix; daily refresh should self-heal within 24-48h once CMS catches up. Worth a dashboard + alert for sustained FAIL state per issuer.
19 hard-error issuers in mrpuf_issuers_staging. Cross-correlate with the per-state coverage gaps (IN 78%, MS 79%, AZ/FL/NC 92%). IP-allowlist breakers (Medica MO, Dean WI, BCBSFL) noted in ENG-232. Solved via VPC egress; verify after ENG-236 lands.
Tier-map extension. Add NONPREFERRED-SPECIALTY-DRUGS + NONPREFERRED-BRAND-AND-SPECIALTY entries to TIER_MAP. Trivial follow-up.
Audit-harness coverage.updated pinning. Small port from investigate-failing-npis-and-errored-issuers.js into tier-6-mrf-coverage-validation.js. Could ship independently of ENG-236.
Stale-data UX on plan card. "Coverage data refreshed Xh ago" surface; ships alongside ENG-236.
CMS-on-miss opportunistic writeback to staging. Lazy enrichment pattern when staging-miss falls through to CMS. Follow-up to the runtime flip.
CDN edge cache + multi-key CMS rotation. Largely obviated by the staging-first flip. Multi-key only useful for the staging-miss fallback path. Revisit if that path becomes hot.
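
For the first carry-over item (issuers that return 200 OK on every request), the fetch-and-compare fallback could be sketched as a streaming content hash. SHA-256 and the function names are choices made for this example, and where the previous shard hash is stored is left open; the source does not define a field for it.

```javascript
const { createHash } = require("node:crypto");

// Hashes a byte stream chunk by chunk, without buffering the whole shard.
async function hashStream(readable) {
  const hash = createHash("sha256");
  for await (const chunk of readable) hash.update(chunk);
  return hash.digest("hex");
}

// For issuers that ignore conditional headers: compare the content hash of
// the freshly fetched shard against the hash stored from the previous fetch.
async function shardChanged(readable, previousHash) {
  const currentHash = await hashStream(readable);
  return { changed: currentHash !== previousHash, currentHash };
}
```

This path still pays the download cost, so it only saves the parse and upsert work; that is why it is a fallback for non-cooperative issuers rather than the default.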

Acceptance checklist

[x] What cadence? Daily as the default. LARK built day-one with all 4 layers, including diff classifier. Hourly possible with one cron-line edit when OE 2027 demands.
[x] How is /versions integrated? Daily poll for coverage / npis / drugs (§1311 path) + plans-etl / plans (plan-catalog path). Per-dataset semantics in §3 + §4.
[x] Alt sources eval - recommendation per source. Table in §6; no alt source displaces §1311. NPPES pursued separately as enrichment.
[x] Fallback ordering pick. Flipped to Option A: staging-first, CMS-on-miss. Triggers to flip back documented in §7.
[x] First implementation milestone (with rough effort). ENG-236. LARK 4-layer + diff classifier + plan-catalog parallel job + runtime flip behind feature flag + drift monitor + tier-map extension. Estimated 2-3 engineer-weeks (up from 1-2 in v1).
[x] Stale-data UX. "Coverage data refreshed Xh ago" on plan card via last_seen_material_change_at.
[x] Cross-references. ADR 0004, pivot decision doc, audit-revalidation report, ENG-236 (linked above).
[x] NPPES + plan-catalog scope expansion. Both spawned as separate issues (linked above).

Cross-references

docs/decisions/2026-05-03-pivot-cms-api-direct.md - pivot doc; performance ceiling math
docs/adr/0004-cross-cluster-atlas-privatelink.md - cluster topology; PrivateLink
docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md (in audit-revalidation worktree, not yet on main) - 99.97% match audit verdict
docs/session-log/2026-05-08-phase-d-and-ci-guard.md - drug-tier + provider-network fallback shipped 2026-05-08
docs/session-log/2026-05-08-phase-11-cross-cluster-privatelink.md - PrivateLink cutover
scripts/db/lib/mrf-helpers.js - ingest helper library; cluster guard, tier map, CMS API wrapper
scripts/db/ingest-mrf-providers.js + scripts/db/ingest-mrf-formularies.js - current ingest scripts; LARK replaces these on the daily path
scripts/audit/tier-6-mrf-coverage-validation.js - drift monitor source
src/lib/db.ts - STAGING_ALLOWED_COLLECTIONS includes mrf_file_state_staging
src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts - runtime fallback consumers; inverted in this decision
Linear: ENG-227, ENG-228, ENG-231, ENG-232, ENG-233
GitHub: #86, #87, #93, #94, #95, #98

Revision history

2026-05-09 v1 (commit df1882f): initial decision; Layers 1+2 only, Layer 3 deferred, Option B fallback (CMS-first). Superseded.
2026-05-09 v2 (commit ebc5f0d): engine reframe + Layer 3 day-one with diff classifier + fallback flip to Option A (staging-first, CMS-on-miss) + scope expanded to plan-catalog + spawned NPPES (ENG-251) + plan-cadence (ENG-252) issues. Engine working name was "Reference Data Refresh Engine (RDRE)". Superseded by v3 naming.
2026-05-09 v3 (this commit): renamed engine to LARK. "Reference Data Refresh Engine" was technically descriptive but pitch-flat; LARK is a Florence-tied name (larks sing at dawn; our daily cron runs 06:00 UTC) with its own personality. Reserves the "Athena" name (after Florence's pet owl) for a future heavier-weight engine. ENG-251 / ENG-252 metadata also corrected here: assigned to current cycle (5b739f96-...) with 2026-05-11 due dates matching the original 5 issues. Current.