2026-05-09: LARK - the data refresh engine for AskFlorence ​

Decision record. Defines LARK (a multi-source data refresh engine) that keeps the staging mirror (and any future enriched datasets) current with upstream sources at acceptable cost, with a runtime fallback ordering of staging-first / CMS-on-miss. Collapses Linear issues ENG-227, ENG-228, ENG-231, ENG-232, ENG-233 into one decision. Implementation lands in ENG-236 / #98.

Name. LARK. Loosely "Layer-Aware Refresh Kit" but mostly because larks sing at dawn and ours wakes up at 06:00 UTC to walk the data. AskFlorence's family of engines uses light/morning/Florence-Nightingale-themed names; "Athena" (after Florence's pet owl) is reserved for a future heavier-weight engine (subsidy logic or similar).

Status ​

  • Decision date: 2026-05-09
  • Audit verdict referenced: ~/Developer/ask-florence-audit-revalidation/docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md (99.97% overall match, 148,475/148,520 checks)
  • Pivot precedent: docs/decisions/2026-05-03-pivot-cms-api-direct.md
  • Cluster topology precedent: docs/adr/0004-cross-cluster-atlas-privatelink.md
  • Implementation follow-on: ENG-236 / #98
  • Spawned issues: ENG-251 (NPPES enrichment evaluation), ENG-252 (plan-catalog refresh cadence investigation) - detail in §7

Decision ​

Build LARK as a four-layer change-detection pipeline that runs daily and refreshes our staging mirror only when upstream content materially changes. The four layers stack from the cheapest, coarsest signal to the most precise:

  1. Layer 1 - source version watcher. Polls upstream "what's new" signals (CMS /versions for CMS-Marketplace-API-derived datasets; NPPES file-listing pages for monthly enrichment files; equivalent for any future source).
  2. Layer 2 - file-level change probe. Per-source HEAD with If-None-Match + If-Modified-Since against issuer / file URLs; skip on 304.
  3. Layer 3 - record-level semantic diff. Streaming-parses changed files, computes per-record diffs against stored canonical state, classifies whether each diff is material or cosmetic, writes only material diffs to Atlas. Built day one. (Reframed from the original "byte-hash diff" - see §"Why Layer 3 day-one" below.)
  4. Layer 4 - drift monitor + recovery floor. Tier-6 audit harness re-purposed as continuous monitor with Slack alert on >5% drift; quarterly full re-baseline as recovery sweep for whatever Layers 1-3 missed.

The engine is source-pluggable. Day-one sources: §1311 MRFs (formularies_staging + providers_staging). Day-two sources: CMS Marketplace API plan-catalog data (the plans collection, whose refresh cadence has been wrongly assumed to be annual; see §7), NPPES enrichment for taxonomy fields, and anything else with the same change-source / file-fetch / record-upsert shape.

Runtime fallback ordering flipped to Option A: staging-first, CMS-on-miss. Reasoning: CMS-API as primary cannot survive OE 2027 scale (1k concurrent visitors exhausts the per-minute budget in <3s per pivot doc); we already have a 99.97%-match staging mirror and a sub-second cross-cluster PrivateLink read path. Use what we have as primary; fall back to CMS only when staging doesn't have the answer.

Tolerance band for staleness: up to 1 week is acceptable, 1 month is the worst-case fallback, daily-or-better is the goal. LARK delivers daily comfortably; hourly during OE 2027 high-churn is one cron-line edit.

Why Layer 3 day-one (the diff classifier) ​

The original recommendation was to defer Layer 3 (per-record content-hash) until daily cadence wasn't enough. That was wrong. Reframed:

A pure byte-hash Layer 3 doesn't earn its complexity. But a semantic diff classifier does, because real-world MRF files contain non-material churn that would force unnecessary upserts:

  • An issuer's file may bump last_published_at on every publish, even when no records changed. Byte-hash flags every record as "changed"; classifier sees "only last_published_at field changed" and marks it cosmetic.
  • Field ordering inside JSON objects can drift between publishes. Byte-hash flags everything; canonicalized hash + diff classifier sees "same content, different key order" and skips.
  • Nested arrays may re-order without semantic change (e.g., plans: ["X", "Y"] vs plans: ["Y", "X"]). Byte-hash flags; classifier with set-equality on known-unordered fields skips.

Without the classifier, Layer 3 saves bandwidth (we already downloaded the file) but doesn't save Atlas write IOPS, which is the actual cost driver during typical-change days. With the classifier, Layer 3 saves both.

Implementation shape (similar to git diff in spirit, simpler in scope):

  • Per-collection canonicalization rule: lowercase keys, sort object keys, normalize string whitespace, treat known-unordered arrays as sets.
  • SHA-1 the canonicalized form per record - "semantic hash" - stored on the doc as af_semantic_hash and in mrf_file_state_staging for fast lookup.
  • On detected file change (Layer 2 trip), streaming-parse the new file, compute new semantic hash per record, compare against stored.
  • For changed records: optionally compute a per-field diff (which fields changed, old vs new) for audit logging. Cheap; same parse pass.
  • Classifier rule set per collection: which fields are "material" (changes warrant upsert) vs "cosmetic" (changes are skipped). Examples for providers_staging: NPI/name/specialty/network = material; last_published_at / record-id metadata = cosmetic. For formularies_staging: drug-tier / formulary-id / coverage = material; ordering of plans array = cosmetic.
  • Material change → upsert. Cosmetic change → update only af_semantic_hash + last_seen_at, no write to data fields.
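The canonicalize-then-hash step in the list above can be sketched as follows. This is a minimal illustration, not the ENG-236 implementation: `UNORDERED_FIELDS`, `canonicalize`, and `semanticHash` are hypothetical names, and the choice of `plans` as the only set-like field is taken from the examples in this section.

```js
const { createHash } = require("node:crypto");

// Hypothetical rule: array fields whose element order carries no meaning.
const UNORDERED_FIELDS = new Set(["plans"]);

// Plain code-unit comparison, so ordering does not depend on locale.
const byCodeUnit = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Recursively canonicalize a record: lowercase keys, sort keys,
// normalize string whitespace, and sort arrays that are treated as sets.
function canonicalize(value, fieldName = "") {
  if (Array.isArray(value)) {
    const items = value.map((v) => canonicalize(v));
    return UNORDERED_FIELDS.has(fieldName)
      ? items.sort((a, b) => byCodeUnit(JSON.stringify(a), JSON.stringify(b)))
      : items;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value).map(([k, v]) => [k.toLowerCase(), v]);
    entries.sort(([a], [b]) => byCodeUnit(a, b));
    const out = {};
    for (const [k, v] of entries) out[k] = canonicalize(v, k);
    return out;
  }
  if (typeof value === "string") return value.trim().replace(/\s+/g, " ");
  return value;
}

// SHA-1 over the canonical JSON form: the "semantic hash" stored as af_semantic_hash.
function semanticHash(record) {
  return createHash("sha1").update(JSON.stringify(canonicalize(record))).digest("hex");
}
```

Two publishes of the same record that differ only in key order, key case, whitespace, or the order of `plans` hash to the same value, so Layer 3 can skip them before the classifier runs.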

Cost-benefit math:

  • Without Layer 3 + classifier: every Layer-2-triggered shard re-fetch upserts every record in the shard, regardless of whether content changed. On a typical-change day with 10-30 issuers updating, that's millions of needless upserts. Atlas write-IOPS pressure.
  • With Layer 3 + classifier: same shard re-fetch, only material-diff records upsert. Likely <5% of records per typical-change day. ~20x reduction in Atlas write IOPS.

Engineering cost: ~2-3 engineer-days for the classifier framework + per-collection rule sets. Lower than the cost of one M60 burst event. Worth it day-one.

Rationale ​

Cost vs achievable cadence ​

Today's full-ingest cost: ~22 GB compressed download + ~32M Atlas upserts on every run. The 2026-05-06 incident showed that sustaining this costs ~$2,800/mo at M60. Full nightly re-ingest is off the table. The 2026-05-03 audit at 99.97% match shows we don't need that level of drift mitigation.

| Architecture | Cost-per-run shape | Cadence sustainable | Worst-case staleness | Why we did or didn't pick it |
| --- | --- | --- | --- | --- |
| Full nightly re-ingest | $2,800/mo if sustained | Daily at huge cost | <24h | Rejected. Cost-prohibitive. |
| Layer 1 only (/versions poll, no HEAD, no hash) | Cheap poll + full-fetch all 183 issuers when triggered | Daily safe; sub-daily wasteful | ~24h on trigger days | Rejected. Same cost shape as today's burst on every trigger day. |
| Layers 1+2 (/versions + per-issuer ETag/Last-Modified HEAD) | Cheap poll + 183 cheap HEADs + selective re-fetch only changed issuers | Daily comfortable | <24h baseline | Insufficient. Saves bandwidth but still re-upserts every record in every changed shard - misses the Atlas IOPS driver. |
| Layers 1+2+3 with byte-hash | + streaming hash + targeted upserts | Hourly viable | <1h | Insufficient. Cosmetic file churn forces upserts for non-material changes. |
| Layers 1+2+3 with diff classifier (LARK) | + canonicalized record diff + material/cosmetic classifier | Daily default; hourly for OE 2027 | <24h baseline; <1h achievable | Selected. Saves bandwidth AND Atlas IOPS. ~20x reduction on typical-change days vs byte-hash alone. |
| Drift-detection-only (Tier-6 audit triggers ad-hoc re-ingest) | Zero cost until threshold crossed; full cost on recovery | Reactive only | Worst-case = audit-interval + recovery-time | Rejected as primary. Recovery is the rare = under-tested path. Kept as Layer 4 monitor/alert + quarterly recovery floor. |
| Quarterly full re-baseline (recovery floor only) | Predictable, expensive | Quarterly | n/a (back-of-house) | Selected as Layer 4 recovery floor. ~$200/yr per pivot doc. Catches whatever Layers 1-3 missed. |

OE 2027 concurrency math (drives the fallback flip) ​

From the pivot doc:

1,000 concurrent /plans visitors × ~10 calls per visit in a 30s window = 333 calls/sec. CMS budget: 200 RPS / 1,000 per minute. Single-key budget exhausts in <3 seconds.

5,000 concurrent (the flip threshold) = 1,667 calls/sec. CMS-direct breaks here even with multi-key.
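The arithmetic behind those two figures is easy to check. The sketch below only restates the pivot doc's inputs (10 calls per visit, a 30s window, a 1,000-per-minute single-key budget); the function names are illustrative.

```js
// Calls per second generated by a given number of concurrent /plans visitors.
function callsPerSecond(concurrentVisitors, callsPerVisit = 10, windowSeconds = 30) {
  return (concurrentVisitors * callsPerVisit) / windowSeconds;
}

// Seconds until a per-minute budget is used up at a steady call rate.
function secondsToExhaust(perMinuteBudget, callsPerSec) {
  return perMinuteBudget / callsPerSec;
}

const at1k = callsPerSecond(1000); // about 333 calls/sec
const at5k = callsPerSecond(5000); // about 1,667 calls/sec

console.log(Math.round(at1k), Math.round(at5k), secondsToExhaust(1000, at1k).toFixed(1));
```

At 1,000 concurrent visitors the call rate also exceeds the 200 RPS figure on its own, so the per-second limit is breached before the per-minute budget runs out.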

The pivot doc's framing was "CMS-API-first as the MVP transitional state, flip to owned-first when scale demands." The math says we should flip now. Reasons:

  1. CMS-API-first is a known scaling dead-end. Multi-key rotation buys ~5x but doesn't change the architecture; CDN edge cache helps but only on hot tuples. The fundamental shape is: every user's lookup hits CMS over the network, every time, until cache hit. At 1k concurrent the hit rate matters; at 5k it falls over.
  2. Staging-first with sub-second cross-cluster PrivateLink reads (per ADR 0004) is faster on the hot path AND survives OE.
  3. We already have 99.97%-match staging data; LARK keeps it within 24h of CMS upstream. Drift risk is well-bounded.
  4. CMS API stays in the loop as the FALLBACK on staging miss (new plan we haven't ingested yet, NPI not in our directory yet, edge cases). Best of both: speed + survivability of staging-first, authoritative answer of CMS-on-miss.

What we're NOT doing and why ​

  • Full nightly re-ingest. Cost-prohibitive. Doesn't scale.
  • Weekly hybrid (full + daily delta). Quarterly recovery floor catches the same gaps at 4x lower cost.
  • Layer 3 byte-hash only (without classifier). Saves bandwidth not IOPS. Half the value at almost the same engineering cost.
  • Drift-detection-only as primary refresh trigger. Recovery is the under-tested path. Kept as monitor + alert.
  • Switching primary data source to an alt vendor. No source displaces §1311 (see §6).
  • Staying on CMS-API-first runtime. OE 2027 math is the load-bearing constraint; the pivot was transitional and the transition is now.

/versions endpoint integration (resolves ENG-228) ​

Per Phase A4 finding, /versions returns 10 dataset stamps. CMS docs do not publish refresh cadence; daily-ish is observational.

Per-dataset role across both data sets LARK manages:

| Dataset | Updates (observed) | LARK role - §1311 MRFs | LARK role - plan catalog |
| --- | --- | --- | --- |
| coverage | Daily | Layer-1 trigger | n/a |
| npis | Daily | Layer-1 trigger | n/a |
| drugs | Daily | Layer-1 trigger | n/a |
| taxonomies | Daily | Out of scope | n/a |
| plans-etl | ~Weekly | n/a | Layer-1 trigger for plan-catalog refresh |
| plans | ~Weekly | n/a | Layer-1 trigger for plan-catalog refresh |
| issuers | ~Weekly | Trigger for mrpuf_issuers_staging re-pull | Adjacent |
| zipcodes | ~Biweekly | Out of scope | Out of scope |
| plan-urls, etl-id | Tied to plans-etl | Out of scope | Bookkeeping |

Note that plans-etl and plans advance roughly weekly, NOT annually. Today's plans collection is reingested seasonally (HIOS submission cycle); LARK should track these per-week refreshes. Spec'd in §7.

Granularity caveat: CMS only exposes per-dataset, not per-issuer or per-plan. They can bump coverage even when only 5 issuers' rows changed. Layers 2 + 3 absorb this overhead because HEAD is cheap and the classifier skips no-op records.

Two independent triggers per source: LARK fans out when either Layer 1 (CMS-side /versions advance) or Layer 2 (issuer-side / file-side ETag advance) signals change. This handles the "issuer published, CMS hasn't indexed yet" gap (the 3 FAIL issuers in A1).
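A minimal sketch of the Layer 1 comparison and the either-trigger fan-out rule. The shape of the /versions response is not documented here, so the `{ datasetName: stamp }` map used below is an assumption, as are the names `WATCHED`, `advancedDatasets`, and `shouldFanOut`.

```js
// Datasets that act as Layer-1 triggers for each job, per the role table above.
const WATCHED = {
  mrf: ["coverage", "npis", "drugs"],
  planCatalog: ["plans-etl", "plans"],
};

// Returns the watched datasets whose stamp differs from the last-seen stamp.
// `lastSeen` and `current` are assumed to be { datasetName: stamp } maps.
function advancedDatasets(job, lastSeen, current) {
  return WATCHED[job].filter(
    (name) => current[name] !== undefined && current[name] !== lastSeen[name]
  );
}

// LARK fans out when either the CMS-side or the issuer-side signal fires.
function shouldFanOut({ layer1Advanced, layer2Changed }) {
  return layer1Advanced.length > 0 || layer2Changed;
}
```

Keeping the two signals separate is what covers the "issuer published, CMS hasn't indexed yet" case: `layer2Changed` can be true while `layer1Advanced` is still empty.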

Audit-harness pinning: record coverage.updated at audit start so re-runs are interpretable. Port probeVersionsEndpoint(cms) from scripts/audit/investigate-failing-npis-and-errored-issuers.js:40 into scripts/audit/tier-6-mrf-coverage-validation.js.

Stale-data UX (plan card freshness): plan cards / coverage check show "Coverage data refreshed Xh ago" by reading per-source last_seen_material_change_at. Implementation lands in ENG-236 alongside LARK.
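The freshness string could be derived along these lines. This is a sketch only: `freshnessLabel` is a hypothetical helper, and rounding down to whole hours is an assumption about the desired display.

```js
// Builds the plan-card freshness string from a per-source
// last_seen_material_change_at value (ISO string or Date).
function freshnessLabel(lastSeenMaterialChangeAt, now = new Date()) {
  const elapsedMs = now.getTime() - new Date(lastSeenMaterialChangeAt).getTime();
  // Clamp at zero so clock skew never produces a negative age.
  const hours = Math.max(0, Math.floor(elapsedMs / 3_600_000));
  return `Coverage data refreshed ${hours}h ago`;
}
```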

Implementation sketch (for ENG-236 / #98) ​

High-level. Detailed implementation lives in ENG-236.

State store ​

New collection mrf_file_state_staging (already allowlisted in src/lib/db.ts:172, currently empty). Schema:

js
{
  source: "cms_versions" | "issuer_index" | "issuer_shard" | "nppes" | "cms_plans_api",
  source_key: string,                  // e.g., issuer_hios + file_url
  layer1_signal: string | null,        // /versions timestamp captured at last check
  layer2_etag: string | null,
  layer2_last_modified: string | null,
  layer2_last_check_at: ISODate,
  layer3_records_seen: number,
  layer3_records_material_changed: number,
  layer3_records_cosmetic_changed: number,
  last_seen_material_change_at: ISODate,
  per_record_semantic_hashes: { [_id]: sha1 },  // for diff lookup
}

Daily run (cron schedule) ​

GitHub Actions workflow at .github/workflows/rdre-daily.yml. Cron 0 6 * * * UTC. One workflow with parallel jobs per source:

  1. Job: §1311 MRFs

    • Layer 1: poll /versions, compare against last-seen for coverage, npis, drugs. If none advanced, no-op + heartbeat write.
    • Layer 2: per-issuer (mrpuf_issuers_staging ~183 rows), HEAD index_url with conditionals; 304-skip; on 200 OK with new ETag, fan-out to changed shards (HEAD each, 304-skip, full-fetch only changed).
    • Layer 3: streaming-parse changed shards; canonicalize + semantic-hash per record; diff against stored hash; classify; upsert only material-changed records.
    • Per-issuer atomic isolation already in place; reuse.
  2. Job: plan catalog (CMS Marketplace API)

    • Layer 1: poll /versions, compare against last-seen for plans-etl, plans. If none advanced, no-op.
    • Layer 2: not applicable (no per-source files; CMS API is the source). Replaced by per-plan freshness query.
    • Layer 3: incremental fetch (which plan IDs changed since last-seen plans-etl), classify, upsert. Detail spec in ENG-237 (TBD).
  3. Job: NPPES enrichment (designed; build trigger separate)

    • Layer 1: HEAD the NPPES file index page; check posted last-updated date.
    • Layer 2: HEAD the differential file URLs; 304-skip on no change.
    • Layer 3: streaming-parse the differential, classify against existing providers_staging (taxonomy + practice-address fields are material; rest is enrichment-only).
    • Build trigger: separate issue.
  4. Post-ingest: rebuild derived collections (ENG-425)

    • After the §1311 MRF job (and any SBE/CA formulary ingest) writes formularies_staging, run node scripts/db/derive-drug-search-index.js --apply to rebuild drug_search_index (the drug-name search read-model). It reads the WHOLE collection (FFM + SBE/CA), so it covers every state/source. Idempotent; additive collection; rollback = --rollback / drop.
    • Sequencing: run AFTER formulary Layer 3 completes, BEFORE the drift monitor. A stale drug_search_index only affects search ranking/strength lists, never coverage (coverage stays per-rxcui in formularies_staging), so a brief lag is non-critical.
    • Any future derived read-model added on the reference cluster lands here too.
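The Layer 2 step of the §1311 MRF job (conditional HEAD, skip on 304) might look like the sketch below. `probeFile` is a hypothetical name; the `state` argument is assumed to carry the `layer2_etag` and `layer2_last_modified` fields from mrf_file_state_staging, and `fetchImpl` is injectable so the probe can be exercised without network access.

```js
// Conditional HEAD against an issuer index or shard URL.
// Returns { changed: false } when the server reports nothing new.
async function probeFile(url, state, fetchImpl = fetch) {
  const headers = {};
  if (state.layer2_etag) headers["If-None-Match"] = state.layer2_etag;
  if (state.layer2_last_modified) headers["If-Modified-Since"] = state.layer2_last_modified;

  const res = await fetchImpl(url, { method: "HEAD", headers });
  if (res.status === 304) return { changed: false };
  if (!res.ok) throw new Error(`HEAD ${url} failed with status ${res.status}`);

  const etag = res.headers.get("etag") ?? null;
  const lastModified = res.headers.get("last-modified") ?? null;

  // Some servers may answer 200 even when the validators are unchanged;
  // treat identical validators as "no change" rather than re-fetching.
  const sameValidators =
    (etag !== null && etag === state.layer2_etag) ||
    (etag === null && lastModified !== null && lastModified === state.layer2_last_modified);

  return { changed: !sameValidators, etag, lastModified };
}
```

On `changed: true` the caller would store the new validators and fan out to the shard URLs with the same probe before any full fetch.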

Diff classifier framework ​

Pluggable per collection:

js
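// Sketch: sortKeys, lowercase, and stripWhitespace are assumed canonicalization helpers.
const setEqual = (a = [], b = []) => {
  const sa = new Set(a);
  const sb = new Set(b);
  return sa.size === sb.size && [...sa].every((x) => sb.has(x));
};

const classifier = {
  providers_staging: {
    material_fields: ["npi", "name", "facility_name", "specialty", "plans"],
    cosmetic_fields: ["last_published_at", "_metadata"],
    set_fields: ["plans"],          // arrays treated as sets
    canonicalize: (rec) => sortKeys(lowercase(stripWhitespace(rec))),
    // Method syntax (not an arrow function) so `this` is this rule set.
    isMaterial(oldRec, newRec) {
      for (const field of this.material_fields) {
        if (this.set_fields.includes(field)) {
          if (!setEqual(oldRec[field], newRec[field])) return true;
        } else if (JSON.stringify(oldRec[field]) !== JSON.stringify(newRec[field])) {
          // Compare by value so object-valued fields are not flagged by identity;
          // assumes both records are already canonicalized.
          return true;
        }
      }
      return false;
    },
  },
  formularies_staging: { /* analogous; tier + plans_covered are material; ordering is cosmetic */ },
};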

The framework is small (<200 LOC). Per-collection rules are explicit data, easy to audit + adjust as we learn what issuers actually publish.
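To show how a rule set turns into a write decision, here is a self-contained sketch of the three outcomes (unchanged, cosmetic, material) and the resulting update document. The helper names and the "no write when unchanged" behavior are assumptions; the field names `af_semantic_hash` and `last_seen_at` come from this document, and `$set` is the standard MongoDB update operator.

```js
// Hypothetical rule set for one collection.
const rules = { material_fields: ["npi", "name", "specialty", "plans"], set_fields: ["plans"] };

const sameSet = (a = [], b = []) => {
  const sa = new Set(a);
  const sb = new Set(b);
  return sa.size === sb.size && [...sa].every((x) => sb.has(x));
};

// Per-field diff: names of fields whose values differ (set-aware for set_fields).
function changedFields(rules, oldRec, newRec) {
  const fields = new Set([...Object.keys(oldRec), ...Object.keys(newRec)]);
  return [...fields].filter((f) =>
    rules.set_fields.includes(f)
      ? !sameSet(oldRec[f], newRec[f])
      : JSON.stringify(oldRec[f]) !== JSON.stringify(newRec[f])
  );
}

function classify(rules, oldRec, newRec) {
  const changed = changedFields(rules, oldRec, newRec);
  if (changed.length === 0) return { kind: "unchanged", changed };
  const material = changed.some((f) => rules.material_fields.includes(f));
  return { kind: material ? "material" : "cosmetic", changed };
}

// Material: write data fields. Cosmetic: touch only hash + last_seen_at.
function planWrite(kind, newRec, newHash, now) {
  if (kind === "unchanged") return null; // assumption: nothing to write
  const bookkeeping = { af_semantic_hash: newHash, last_seen_at: now };
  return kind === "material" ? { $set: { ...newRec, ...bookkeeping } } : { $set: bookkeeping };
}
```

The `changed` list doubles as the per-field diff for audit logging, produced in the same pass as the classification.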

Drift monitor + recovery ​

  • Tier-6 audit harness (scripts/audit/tier-6-mrf-coverage-validation.js) re-purposed as continuous monitor.
  • Daily run after LARK completes (cheap; uses CMS API not local data).
  • Slack alert on >5% drift (overall match rate falls below 95%) - distinguishes from the 99.5% SLO.
  • On alert: trigger ad-hoc LARK run (no waiting for next cron) + investigation hook for the affected source/issuer.
  • Quarterly full re-baseline: first Sunday of each quarter, 03:00 UTC. Existing scripts/db/ingest-mrf-{providers,formularies}.js run unchanged.
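The alert threshold and the quarterly schedule above reduce to two small checks. This is a sketch with hypothetical function names; the 95% threshold and the "first Sunday of each quarter" rule are the ones stated in the list.

```js
// Overall match rate from audit counts, e.g. 148,475 of 148,520 checks.
function matchRate(passed, total) {
  if (total <= 0) throw new RangeError("total checks must be positive");
  return passed / total;
}

// Slack alert fires when the match rate falls below 95% (>5% drift).
function shouldAlert(rate, threshold = 0.95) {
  return rate < threshold;
}

// True on the first Sunday of January, April, July, or October (UTC).
// Useful as a guard because a plain cron line cannot express "first Sunday".
function isQuarterlyRebaselineDay(date) {
  const firstMonthOfQuarter = date.getUTCMonth() % 3 === 0;
  return firstMonthOfQuarter && date.getUTCDay() === 0 && date.getUTCDate() <= 7;
}
```

One way to use the guard is to schedule the workflow for every Sunday at 03:00 UTC and exit early unless `isQuarterlyRebaselineDay(new Date())` is true.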

Schema-integrity follow-ups (audit-flagged) ​

  • 868 docs with unmapped tiers (NONPREFERRED-SPECIALTY-DRUGS, NONPREFERRED-BRAND-AND-SPECIALTY): extend TIER_MAP in scripts/db/lib/mrf-helpers.js:97-262 with two entries. Tier-map auto-extend wired into LARK: emit a side-channel "unmapped tier seen" event rather than hard-failing.
  • 167,776 provider records with invalid type field: parsing variance; out of cascade scope, separate fix.
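The "unmapped tier seen" side channel from the first item above could be sketched like this. The map contents are placeholders, not the real TIER_MAP in scripts/db/lib/mrf-helpers.js, and `mapTier` / `onUnmapped` are hypothetical names.

```js
// Looks up a raw tier label; on a miss, reports it instead of throwing.
// `tierMap` is a Map of normalized raw label -> internal tier value.
function mapTier(rawTier, tierMap, onUnmapped) {
  const key = String(rawTier).trim().toUpperCase();
  if (tierMap.has(key)) return tierMap.get(key);
  onUnmapped({ tier: key }); // side-channel event; ingest continues
  return null;
}

// Illustrative placeholder entries only.
const exampleTierMap = new Map([
  ["GENERIC", "generic"],
  ["PREFERRED-BRAND", "preferred_brand"],
]);

const unmappedSeen = [];
const tier = mapTier("NONPREFERRED-SPECIALTY-DRUGS", exampleTierMap, (e) => unmappedSeen.push(e));
```

Collecting the events per run gives a list of labels to add to the map, without a single unknown label aborting a shard.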

Alt data sources investigated (resolves ENG-232) ​

Researched 2026-05-09 via vendor pages + targeted searches. Honest about unknowns: where pricing is sales-contact-only or membership-gated, that's stated.

| Source | Role for our use case | Recommendation |
| --- | --- | --- |
| §1311 MRFs (CMS direct file feed) | Primary feed; what we already ingest | Stay (primary). LARK refreshes this. |
| NPPES (NPI Registry) | Free public CMS file: NPI + name + taxonomy/specialty + practice address. No plan/network data. Refresh cadence: monthly (latest 2026-04-13, file ~1073 MB v.2). | Pursue as enrichment (separate issue). LARK-pluggable. Could enrich providers_staging with cleaner specialty + practice-address data. |
| Stedi (healthcare APIs) | Eligibility (270/271) + Claims clearinghouse. $0.15/check after 100 free; Developer $500/mo. SOC 2 + HIPAA-compliant. | Not pursuing. Wrong primitive for marketplace coverage display. Eligibility = "is patient enrolled with payer," not "is drug covered." Possibly later for member-portal eligibility check post-enrollment. |
| Serif Health Signal (signal) | MRF rate-transparency benchmarking; 200+ commercial payers; monthly refresh. Pricing sales-only. | Not pursuing. Wrong file scope (negotiated rates, not formulary or directory). |
| Surescripts (overview) | Real-time eligibility + Rx history + e-prescribing + RTBC. Enterprise EHR/payer ecosystem. Sales-only. | Not pursuing. Wrong access tier (institutional, not consumer apps). |
| NCPDP dataQ (NCPDP) | Pharmacy directory + prescriber file + formulary standards. Membership-gated. | Defer. Revisit if/when we ship pharmacy-network-tier feature (#106). Not relevant to LARK. |
| NIPR (nipr.com) | Insurance producer/agent licensing; NPN validation. | Out of scope. Different workstream (agent platform Phase 5). |
| Per-issuer REST APIs (Cigna, Aetna, BCBS, etc.) | Member-portal-quality data per issuer. Engineering cost ~1 week/issuer × 183 issuers = unaffordable. | Not pursuing as primary. Possible enrichment for top 3-5 issuers post-OE 2027 if §1311 has gaps after LARK. |
| CMS /coverage/search | Combined provider+drug autocomplete on CMS Marketplace API. | Already on the runtime path - underused endpoint. Worth a separate ticket for typeahead UX. |
| GoodRx, RxRevu | Retail Rx discount pricing or EHR-integrated RTBC. | Not pursuing. Different problem space. |

Net: No alt source displaces §1311 as the primary marketplace coverage feed. NPPES is a valuable monthly enrichment for taxonomy + practice-address; tracked as a separate issue. LARK is source-pluggable so adding NPPES later is mechanical.

Fallback ordering decision (resolves ENG-233) - REVISED ​

Flip to Option A: staging-first, CMS-on-miss. The pivot doc framed CMS-API-first as the MVP transitional state. The math now says staging-first is the right ordering at OE-baseline traffic, not just past 5k concurrent.

Rationale ​

  1. OE 2027 scale. CMS-API-as-primary cannot survive 1k concurrent /plans visitors (per pivot doc math: budget exhausts in <3s). Staging-first sidesteps the rate-limit problem entirely on the hot path.
  2. Latency. Staging-via-PrivateLink is faster than CMS-via-public-network. We've measured ~225-465ms p99 for the gap-fill path on 2026-05-08; pure staging-only reads should be similar or faster.
  3. Freshness is good enough. LARK keeps staging within 24h of CMS upstream. 99.97% match. The 0.03% drift becomes the CMS-fallback path (small fraction of lookups).
  4. CMS stays in the loop as authoritative fallback. When staging doesn't have a plan_id, NPI, or RxCUI we can answer (new plan added since last refresh, brand-new NPI, etc.), runtime falls through to CMS. Best of both.

How the runtime changes ​

Today (Option B): CMS-first → staging fills gaps where CMS returns Covered but no drug_tier / network_tier (per src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts).

After (Option A): Staging first for the canonical answer (covered/not + tier/network + carrier metadata). CMS only on miss (staging doesn't have the plan/NPI/RxCUI tuple). The fallback shapes:

  • Lookup hits staging → return immediately
  • Lookup misses staging → call CMS API; if CMS has the answer, return that AND opportunistically write back to staging (lazy enrichment so the next user gets a hit) - tracked as a follow-up
  • Lookup misses both → return "no data available" with appropriate UX

Implementation: invert the order in drug-tier-fallback.ts and provider-network-fallback.ts. Add an opportunistic-writeback pattern. Feature-flag the flip so we can toggle if anything misbehaves.
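A minimal sketch of the Option A lookup order with the feature flag and a best-effort writeback. Everything here is illustrative: `lookupCoverage`, the `staging` / `cms` adapter methods, `flags.stagingFirst`, and `legacyCmsFirst` are hypothetical names, not the interfaces in drug-tier-fallback.ts or provider-network-fallback.ts.

```js
// Staging-first, CMS-on-miss. `deps` carries injectable adapters:
//   staging.find(key) / staging.writeBack(key, data), cms.fetch(key),
//   flags.stagingFirst, and legacyCmsFirst(key) for the Option B path.
async function lookupCoverage(key, deps) {
  const { staging, cms, flags, legacyCmsFirst } = deps;

  // Feature flag off: keep today's CMS-first behavior untouched.
  if (!flags.stagingFirst) return legacyCmsFirst(key);

  const hit = await staging.find(key);
  if (hit) return { source: "staging", data: hit };

  const fromCms = await cms.fetch(key);
  if (fromCms) {
    // Opportunistic writeback: fire-and-forget so a failed write
    // never delays or fails the user-facing response.
    Promise.resolve()
      .then(() => staging.writeBack(key, fromCms))
      .catch(() => {});
    return { source: "cms", data: fromCms };
  }

  return { source: "none", data: null }; // caller renders "no data available"
}
```

Returning the `source` alongside the data makes the staging hit rate and the CMS budget consumption on the miss path directly measurable.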

Triggers to flip BACK to Option B (CMS-first) ​

Explicit and named. Should never trip given current math + LARK design:

  • Audit match rate sustained below 99.0% for >2 weeks (cascade lag too severe; CMS-first would be a workaround until LARK is fixed)
  • LARK reliability incidents (cron failures + Layer-3 false-positives) > 2 per quarter sustained
  • Atlas cluster outage > 15 min during business hours (CMS-first as a circuit-breaker is a separate question - probably solved via cached-response fallback rather than a primary flip)
  • Compliance / audit finding that staging data lag > 24h is unacceptable for a specific user surface (would scope flip to that surface only)

SLOs (revised for staging-first) ​

  • Runtime p99 latency: ≤200ms for staging-hit; ≤500ms for staging-miss-then-CMS path
  • Staging staleness: ≤24h baseline, ≤1h during OE 2027 peak windows
  • Staging hit rate: ≥99% on (plan_id, NPI) / (plan_id, RxCUI) lookups for plans known to CMS
  • CMS API budget consumption at 1k concurrent: ≤5% of single-key budget (only consumed on staging-miss path - massive headroom)
  • Match-rate drift alert: Slack ping on >5% drift in tier-6 audit

Implementation note for ENG-236 ​

Runtime flip is part of this milestone, behind a feature flag. Code paths to invert: src/lib/drug-tier-fallback.ts, src/lib/provider-network-fallback.ts, plus the /api/drugs/covered and /api/providers/covered route handlers that consume them. Opportunistic writeback can land in a follow-up; the inversion itself is small.

Open follow-ups + spawned issues ​

Spawned issues (logged separately) ​

  • ENG-251 - NPPES enrichment evaluation. Investigate using NPPES monthly differential file (~1073 MB v.2, last updated 2026-04-13) to enrich providers_staging with cleaner taxonomy + practice-address data. Build path is LARK-pluggable (Layer 1 = posted-date check, Layer 2 = HEAD differential URL, Layer 3 = canonicalize NPPES record vs ours). Could backfill the 167,776 invalid-type + 151,036 missing-name records in providers_staging. Effort estimate: ~1 engineer-week after LARK framework lands. Priority: Medium.
  • ENG-252 - Plan-catalog refresh cadence investigation. The plans collection is currently re-ingested seasonally (HIOS submission cycle). But CMS /versions shows plans-etl and plans advancing roughly weekly. Need to: (a) verify the actual cadence of changes the CMS Marketplace API serves; (b) determine which plan attributes change mid-year (URL fixes? plan adds/drops? service-area shifts?) and which are pinned by SI submission cycle (copays, deductibles, MOOPs, premiums); (c) wire LARK Layer-1+ for the plan catalog if sub-annual changes are material. Investigation: ~1 engineer-week. Priority: High.

Carry-over follow-ups from §1311 audit ​

  • Per-issuer cooperative behavior unknown. Do all 183 issuers honor If-None-Match / If-Modified-Since? Subset that returns 200 OK on every request gets a fallback path: fetch + compare-shard-content-hash. Measure during ENG-236 implementation; flag in the PR.
  • 3 FAIL issuers in latest audit (42326, 50305, 25896). All stale_or_missing_in_cms - issuer published, CMS hasn't indexed. Nothing for LARK to fix. Daily refresh should self-heal within 24-48h once CMS catches up. Worth a dashboard + alert for sustained FAIL state per issuer.
  • 19 hard-error issuers in mrpuf_issuers_staging. Cross-correlate with the per-state coverage gaps (IN 78%, MS 79%, AZ/FL/NC 92%). IP-allowlist breakers (Medica MO, Dean WI, BCBSFL) noted in ENG-232. Solved via VPC egress; verify after ENG-236 lands.
  • Tier-map extension. Add NONPREFERRED-SPECIALTY-DRUGS + NONPREFERRED-BRAND-AND-SPECIALTY entries to TIER_MAP. Trivial follow-up.
  • Audit-harness coverage.updated pinning. Small port from investigate-failing-npis-and-errored-issuers.js into tier-6-mrf-coverage-validation.js. Could ship independently of ENG-236.
  • Stale-data UX on plan card. "Coverage data refreshed Xh ago" surface; ships alongside ENG-236.
  • CMS-on-miss opportunistic writeback to staging. Lazy enrichment pattern when staging-miss falls through to CMS. Follow-up to the runtime flip.
  • CDN edge cache + multi-key CMS rotation. Largely obviated by the staging-first flip. Multi-key only useful for the staging-miss fallback path. Revisit if that path becomes hot.
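For the first item in this list (issuers that return 200 OK on every request), the fetch + compare-shard-content-hash fallback could be sketched as below. SHA-256 and the idea of a stored per-shard hash are assumptions; the document does not name a hash or a field for it.

```js
const { createHash } = require("node:crypto");

// Hashes a shard body as it streams in, so the file never has to sit in memory.
// `chunks` is any iterable or async iterable of Buffers or strings.
async function hashStream(chunks) {
  const hash = createHash("sha256");
  for await (const chunk of chunks) hash.update(chunk);
  return hash.digest("hex");
}

// Compares the freshly computed hash against the one stored from the last run.
async function shardChanged(chunks, storedHash) {
  const hash = await hashStream(chunks);
  return { changed: hash !== storedHash, hash };
}
```

This path still pays for the download, so it only makes sense for the subset of issuers measured as non-cooperative during ENG-236.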

Acceptance checklist ​

  • [x] What cadence? Daily as the default. LARK built day-one with all 4 layers, including diff classifier. Hourly possible with one cron-line edit when OE 2027 demands.
  • [x] How is /versions integrated? Daily poll for coverage / npis / drugs (§1311 path) + plans-etl / plans (plan-catalog path). Per-dataset semantics in §3 + §4.
  • [x] Alt sources eval - recommendation per source. Table in §6; no alt source displaces §1311. NPPES pursued separately as enrichment.
  • [x] Fallback ordering pick. Flipped to Option A: staging-first, CMS-on-miss. Triggers to flip back documented in §7.
  • [x] First implementation milestone (with rough effort). ENG-236. LARK 4-layer + diff classifier + plan-catalog parallel job + runtime flip behind feature flag + drift monitor + tier-map extension. Estimated 2-3 engineer-weeks (up from 1-2 in v1).
  • [x] Stale-data UX. "Coverage data refreshed Xh ago" on plan card via last_seen_material_change_at.
  • [x] Cross-references. ADR 0004, pivot decision doc, audit-revalidation report, ENG-236 (linked above).
  • [x] NPPES + plan-catalog scope expansion. Both spawned as separate issues (linked above).

Related ​

  • docs/decisions/2026-05-03-pivot-cms-api-direct.md - pivot doc; performance ceiling math
  • docs/adr/0004-cross-cluster-atlas-privatelink.md - cluster topology; PrivateLink
  • docs/audits/mrf-ingest-staging-audit-2026-05-03T20-07-11.md (in audit-revalidation worktree, not yet on main) - 99.97% match audit verdict
  • docs/session-log/2026-05-08-phase-d-and-ci-guard.md - drug-tier + provider-network fallback shipped 2026-05-08
  • docs/session-log/2026-05-08-phase-11-cross-cluster-privatelink.md - PrivateLink cutover
  • scripts/db/lib/mrf-helpers.js - ingest helper library; cluster guard, tier map, CMS API wrapper
  • scripts/db/ingest-mrf-providers.js + scripts/db/ingest-mrf-formularies.js - current ingest scripts; LARK replaces these on the daily path
  • scripts/audit/tier-6-mrf-coverage-validation.js - drift monitor source
  • src/lib/db.ts - STAGING_ALLOWED_COLLECTIONS includes mrf_file_state_staging
  • src/lib/drug-tier-fallback.ts + src/lib/provider-network-fallback.ts - runtime fallback consumers; inverted in this decision
  • Linear: ENG-227, ENG-228, ENG-231, ENG-232, ENG-233
  • GitHub: #86, #87, #93, #94, #95, #98

Revision history ​

  • 2026-05-09 v1 (commit df1882f): initial decision; Layers 1+2 only, Layer 3 deferred, Option B fallback (CMS-first). Superseded.
  • 2026-05-09 v2 (commit ebc5f0d): engine reframe + Layer 3 day-one with diff classifier + fallback flip to Option A (staging-first, CMS-on-miss) + scope expanded to plan-catalog + spawned NPPES (ENG-251) + plan-cadence (ENG-252) issues. Engine working name was "Reference Data Refresh Engine (RDRE)". Superseded by v3 naming.
  • 2026-05-09 v3 (this commit): renamed engine to LARK. "Reference Data Refresh Engine" was technically descriptive but pitch-flat; LARK is a Florence-tied name (larks sing at dawn; our daily cron runs 06:00 UTC) with its own personality. Reserves the "Athena" name (after Florence's pet owl) for a future heavier-weight engine. ENG-251 / ENG-252 metadata also corrected here: assigned to current cycle (5b739f96-...) with 2026-05-11 due dates matching the original 5 issues. Current.

AskFlorence Internal Documentation. Not for public distribution.
