Session brief — ENG-279 Mongo user simplification ​

Linear: ENG-279 · Priority: High · Branch: eng-279-mongo-user-simplification · Worktree: ~/Developer/ask-florence-eng-279-mongo-user-simplification

Goal: Roll back the speculative narrow-scoped Mongo user model (10+ custom roles, recurring silent-regression source) to a clean 3-user functional model per cluster (app_read, app_write, app_audit_writer), aligned with MongoDB's documented best practices. Re-narrow only at Phase 5 PHI introduction, using views + JWT for tenant filtering instead of per-tenant DB users.

ZERO-DOWNTIME CONSTRAINT (read this twice): https://stage.askflorence.health is the Y Combinator demo link; Y Combinator and prospective investors visit this URL. Staging must stay online and functionally correct AT ALL TIMES throughout this work. Same constraint as production: never degrade, never break, never show "covered" → "not covered" or "$X" → "error." A single broken probe on staging during this work is a STOP-the-line incident: roll back the change that caused it before continuing. For readability, this brief sometimes writes "staging" as if it were a sandbox; treat every staging mention as if it said "STAGING IS YC-FACING PROD-EQUIVALENT" instead.


What you are walking into ​

The previous session (ENG-214 compliance docs) discovered that the canonical .env.local has MONGODB_URI bound to the wrong narrow user (app_read_staging — the cross-cluster reference reader, not the deployment-local reader). That was the third silent-regression bug in this Mongo-user category in a single week. The pattern:

  • ENG-239 — narrowed app_read_staging role (intended; clean)
  • ENG-271 — narrowed staging deployment MONGODB_URI to a new user, but getReferenceDb() silently fell back to it; /api/providers/covered returned HTTP 500 silently on apex
  • ENG-272 — added app_read_local_staging + removed the silent fallback; fixed deployed staging but missed canonical .env.local
  • This session (ENG-279) — found the local-side miss

The cycle has consumed multiple emergency sessions and caused at least one user-facing prod outage (ENG-271).

MongoDB itself names this anti-pattern (over-segmentation; custom roles before exhausting built-in; collection-level when DB-level would do). We are explicitly fighting the documented model. ENG-279's plan is to stop.


Required reading (~45 min, IN THIS ORDER) ​

  1. ENG-279 issue body + every comment — full problem statement, current-state matrix across 3 envs, proposal, MongoDB pattern reference, acceptance criteria.
  2. ENG-214 issue body + the ENG-279 cross-link comment — context on the compliance work that surfaced this.
  3. docs/infrastructure/atlas-access-matrix.md — authoritative current state of all 12 Mongo users + env-var bindings + consumers per cluster. Read end-to-end.
  4. infra/atlas/access-matrix.ts — TypeScript manifest that generates the doc; this is what you'll be editing for the user model change.
  5. ADR 0003 — original narrow-scoped users decision (this issue supersedes it).
  6. ADR 0002 — append-only audit log (PRESERVED; app_audit_writer keeps the narrow append-only role).
  7. ADR 0004 — cross-cluster PrivateLink (PRESERVED; just the role + user that backs the reader changes).
  8. src/lib/db.ts — the dual-pool getDb() + getReferenceDb() connection helpers. Read header comments fully — they document the env-var contract.
  9. infra/envs/prod/ecs.tf + infra/envs/prod/secrets.tf — prod ECS task definition env binding + Secrets Manager declarations.
  10. infra/envs/staging/ecs.tf + infra/envs/staging/secrets.tf — same for staging.
  11. docs/security-compliance/access-control-policy.md DB section + docs/security-compliance/hipaa-control-mapping.md §164.308(a)(4) row — the compliance narrative that the simplification must preserve.

Skim: ENG-239, ENG-249/PR91, ENG-271, ENG-272 — the historical narrowing-then-bug cycles. You don't need to fully internalize each; you need to know the failure mode.


What you are building (the end state) ​

Three users per cluster (6 total), with MongoDB best-practices baked in:

Per cluster (askflorence-prod-01 + askflorence-staging) ​

| User | Role | Privileges | Env var bindings |
| --- | --- | --- | --- |
| app_read | Built-in read@askflorence | DB-wide read on askflorence | MONGODB_URI (prod + staging + local) + MONGODB_REFERENCE_URI (prod via PrivateLink to staging cluster's app_read) |
| app_write | Built-in readWrite@askflorence | DB-wide readWrite on askflorence | MONGODB_WRITE_URI; until Phase 5 lands, the legacy MONGODB_URI_PLANS_WRITE / _SURVEY_WRITE / _WAITLIST_WRITE / _HUBSPOT_SYNC_WRITE env vars also point at this user (no behavioral change for app code; only the underlying user is consolidated) |
| app_audit_writer | Custom role_audit_writer | FIND + INSERT on agent_audit_log only (append-only) | MONGODB_AUDIT_WRITE_URI (new env var; consumed by Phase 5 audit-log writers; falls back to MONGODB_WRITE_URI if unset, but unset is a config bug) |

Total users post-migration: 6 (3 per cluster × 2 clusters). Down from 12 today. Down from the projected 15-20 at Phase 5 launch.
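
As a rough illustration of the one custom role in the table above, the sketch below models role_audit_writer as data and checks that it grants nothing beyond FIND + INSERT on agent_audit_log. The object shape is an assumption, loosely modeled on how Atlas custom roles list actions and resources; the type names and `appendOnlyViolations` are hypothetical and not part of the repo.

```typescript
// Hypothetical shape, loosely modeled on an Atlas custom-role definition.
// Verify field names against the Atlas documentation before relying on it.
type RoleAction = {
  action: string;
  resources: { db: string; collection: string }[];
};

type CustomRole = { roleName: string; actions: RoleAction[] };

// Sketch of the append-only role described in the table above.
const roleAuditWriter: CustomRole = {
  roleName: "role_audit_writer",
  actions: ["FIND", "INSERT"].map((action) => ({
    action,
    resources: [{ db: "askflorence", collection: "agent_audit_log" }],
  })),
};

// Returns a list of violations. An empty list means the role grants only
// FIND + INSERT, and only on askflorence.agent_audit_log.
function appendOnlyViolations(role: CustomRole): string[] {
  const allowed = new Set(["FIND", "INSERT"]);
  const violations: string[] = [];
  for (const { action, resources } of role.actions) {
    if (!allowed.has(action)) {
      violations.push(`action ${action} is not append-only`);
    }
    for (const r of resources) {
      if (r.db !== "askflorence" || r.collection !== "agent_audit_log") {
        violations.push(`${action} granted on ${r.db}.${r.collection}`);
      }
    }
  }
  return violations;
}
```

A check like this only inspects the declared role; the live denial probes for UPDATE/REMOVE in Phase B and D.0 remain the actual proof that the append-only property holds.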

Why this works ​

  • Pre-PHI today: all askflorence DB content is public CMS marketplace data + agent waitlist PII. No row-level filtering needed at the DB layer. app_read whole-DB read is the right grant.
  • Audit-log integrity preserved: app_audit_writer keeps the FIND+INSERT-only role per ADR 0002 — this is the one custom role that's compliance-critical.
  • Re-narrowing playbook at Phase 5: when PHI lands, use views for per-agent / per-member filtering + JWT in app middleware for tenant identity. Do NOT add per-tenant Mongo users. Cap total users at 5 per cluster (per ENG-279 acceptance criteria).
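
One way the "JWT in app middleware for tenant identity" half of the re-narrowing playbook could look is sketched below: the middleware wraps every query filter with the tenant taken from already-verified claims. This is a minimal sketch under stated assumptions; the claim name `agentId`, the document field `agent_id`, and `scopeToTenant` are all hypothetical, and the view definitions themselves are not shown.

```typescript
// Claims are assumed to come from a JWT that has already been verified.
// The claim name `agentId` and the field `agent_id` are hypothetical.
type VerifiedClaims = { sub: string; agentId: string };
type Filter = Record<string, unknown>;

// Wrap any caller-supplied filter so the tenant constraint cannot be
// overridden: $and keeps both conditions even if the caller passes its
// own agent_id.
function scopeToTenant(claims: VerifiedClaims, filter: Filter = {}): Filter {
  if (!claims.agentId) throw new Error("missing tenant claim");
  return { $and: [{ agent_id: claims.agentId }, filter] };
}
```

The point of the design is that tenant identity lives in the request, not in the database credential, so the user count per cluster stays fixed as tenants are added.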

Execution plan — STAGED, additive-first ​

This is the part that determines whether prod stays up. Atlas state changes first (additive only), then Terraform (additive only), then ECS task definitions (rolling deploy to use new env bindings), THEN verification, THEN deprecation. Never delete an old user until the new task revisions have rolled successfully across all tasks AND verification has passed.

Phase A — Plan + capture pre-change baselines ​

STEP 1: Capture today's YC-demo baseline on staging + prod BEFORE READING ANYTHING ELSE. The "Tyler Wood + Synthroid + Lipitor covered on 14 plans" contract is the success criterion — you cannot measure success without first knowing the starting state. Run the 8-gate probe (described under Verification Gates) against both https://stage.askflorence.health and https://askflorence.health and save the responses verbatim:

bash
mkdir -p /tmp/eng-279-baselines
# Capture staging baseline
for gate in eligibility plans providers-tyler-wood drugs-synthroid drugs-lipitor; do
  # ... probe and save response to /tmp/eng-279-baselines/staging-<gate>-pre.json
  : # no-op so the loop body parses until the probe command is filled in
done
# Same for prod

If today's plan count is NOT 14, that's interesting — capture whatever number it IS, that becomes the session contract. If "Tyler Wood" or "Synthroid" or "Lipitor" don't return covered on the current staging today, STOP and ask the user — something is already broken pre-session and that needs to be resolved before any Mongo-user change can ship.

STEP 2: Now do the planning work:

  1. Open this brief + the required reading in order.
  2. Write a plan file at ~/.claude/plans/eng-279-mongo-user-simplification-<adjective-noun-adjective>.md (the runtime gives you a name). Plan must include:
    • Reference the captured pre-change baselines (paths under /tmp/eng-279-baselines/)
    • Exact list of Atlas users to create (3 per cluster, 6 total) with the role definitions verbatim
    • Exact list of users to deprecate (12 total — every existing custom + legacy user)
    • Exact env-var binding changes per env (prod ECS, staging ECS, local) — table format
    • Verification probes at each gate (the 8-gate YC-demo smoke + calculator regression + dev probe)
    • Rollback plan for each phase, with the snapshotted task-def paths referenced
    • The session-contract plan count (14 if confirmed; whatever today's number is otherwise)
  3. Use AskUserQuestion to clarify any ambiguity. Examples that should NOT be assumed:
    • Whether to keep app_admin_schema (it's used for index creation by scripts/db/setup-collections.js — likely needs migration to app_write or stays as a 4th user; ask)
    • Whether MONGODB_AUDIT_WRITE_URI is wired today or only at Phase 5 (probably Phase 5 — confirm before adding env binding)
    • Whether MONGODB_REFERENCE_URI on prod stays pointed at the staging cluster's NEW app_read (via PrivateLink) or gets a dedicated cross-cluster reader user (clean architectural call — recommend reusing app_read since narrowness no longer the priority; ADR 0004 PrivateLink architecture is preserved either way)
    • Whether today's staging plan count for the YC default scenario is actually 14, or whether the contract should be the actually-captured-today number (likely the latter; ask after capturing the baseline)
  4. ExitPlanMode for user approval.

Phase B — Atlas changes (additive) ​

Order matters. Each step is verifiable before the next.

  1. Staging cluster first. Create app_read@staging (built-in read@askflorence); create app_write@staging (built-in readWrite@askflorence); create app_audit_writer@staging with the custom append-only role (clone from existing role_audit_reader / role_writer_agents definitions). Verify each user can do exactly what they should via direct mongo probes from your laptop (FIND + INSERT smoke tests, plus a denial probe for app_audit_writer UPDATE/REMOVE attempts to prove the append-only property holds).
  2. Prod cluster second. Same three users created on askflorence-prod-01.
  3. Do NOT touch any existing user yet. Old users continue to work for old ECS task revisions.

Phase C — Secrets Manager (additive) ​

  1. Add new secret entries:
    • staging/mongodb/app-read-v2 (or whatever naming; consider future-proof) ← app_read@staging URI
    • staging/mongodb/app-write-v2 ← app_write@staging URI
    • staging/mongodb/audit-write ← app_audit_writer@staging URI
    • prod/mongodb/app-read-v2 ← app_read@prod URI
    • prod/mongodb/app-write-v2 ← app_write@prod URI
    • prod/mongodb/audit-write ← app_audit_writer@prod URI
  2. Verify each secret resolves correctly via aws secretsmanager get-secret-value (test from your laptop with active AWS auth).
  3. Do NOT delete the old secrets yet.

Phase D — Terraform + ECS task definition rev ​

Verify-new-users-WORK before swapping any env binding. The cycle that has burned us repeatedly: assume a new credential works → wire it in → discover at runtime it doesn't have the privileges the code needs. Break that cycle by exercising every new user against every consuming code path BEFORE any task-def revision swaps the live binding.

D.0 — Pre-swap verification (NO ECS changes yet) ​

For each new user created in Phase B, run a direct probe from your laptop using the corresponding secret value:

bash
# Test app_read@staging can do everything getDb() does
MONGODB_URI="<app_read@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collections "plans,zip_county,regions,plan_years,audit_log,agent_waitlist_submissions,formularies_staging,providers_staging" \
  --action FIND

# Test app_write@staging can do everything getDb() write paths do
MONGODB_URI="<app_write@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collections "plans,agent_waitlist_submissions,agent_survey_responses,hubspot_sync_log,plan_years,zip_county,regions" \
  --action FIND_INSERT_UPDATE_REMOVE

# Test app_audit_writer@staging has FIND + INSERT on agent_audit_log AND IS DENIED UPDATE/REMOVE
MONGODB_URI="<app_audit_writer@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collection agent_audit_log \
  --expect-allow FIND,INSERT \
  --expect-deny UPDATE,REMOVE

(You'll need to write scripts/debug-probe-new-user.ts — small wrapper. Delete after the migration.)

ALL probes must pass on the new users before swapping any task-def env binding. If a probe fails, fix the Atlas role in Phase B; do not proceed.

Repeat for prod cluster's new users.
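
The core of the pre-swap probe is comparing what each user could actually do against what it should be allowed and denied. The sketch below covers only that comparison and leaves the driver calls out; `Outcome`, `evaluateProbe`, and the way outcomes are collected are assumptions about how scripts/debug-probe-new-user.ts might be structured, not its actual design. It relies on MongoDB reporting authorization failures with error code 13 (Unauthorized), so a denial can be told apart from an unrelated failure such as a network error.

```typescript
type Action = "FIND" | "INSERT" | "UPDATE" | "REMOVE";

// What happened when the probe attempted one action against a collection.
type Outcome = { action: Action; ok: boolean; errorCode?: number };

// MongoDB reports authorization failures with error code 13 (Unauthorized).
const UNAUTHORIZED = 13;

// Returns human-readable failures; an empty list means the user behaves
// exactly as expected.
function evaluateProbe(
  outcomes: Outcome[],
  expectAllow: Action[],
  expectDeny: Action[],
): string[] {
  const failures: string[] = [];
  const byAction = new Map(outcomes.map((o) => [o.action, o] as const));

  for (const action of expectAllow) {
    const o = byAction.get(action);
    if (!o) failures.push(`${action}: not attempted`);
    else if (!o.ok) failures.push(`${action}: expected allow, got error ${o.errorCode}`);
  }

  for (const action of expectDeny) {
    const o = byAction.get(action);
    if (!o) failures.push(`${action}: not attempted`);
    else if (o.ok) failures.push(`${action}: expected deny, but it succeeded`);
    else if (o.errorCode !== UNAUTHORIZED) {
      // A timeout or network error is not evidence of a denial.
      failures.push(`${action}: failed with ${o.errorCode}, not an auth denial`);
    }
  }
  return failures;
}
```

Treating a non-13 failure as a probe failure, not as a successful denial, guards against "passing" the append-only check simply because the cluster was unreachable.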

D.1 — Staging ECS task-def swap ​

  1. Update infra/envs/staging/ecs.tf so the env-var bindings point at the new secrets. Keep the env var NAMES the same (MONGODB_URI, MONGODB_WRITE_URI, MONGODB_REFERENCE_URI) so app code doesn't need to change.
  2. Add MONGODB_AUDIT_WRITE_URI binding (new env var; safe to add — falls back to MONGODB_WRITE_URI in src/lib/db.ts if unset).
  3. terraform plan — review carefully.
  4. Snapshot the current ECS task-def revision (aws ecs describe-task-definition --task-definition <name> > /tmp/staging-task-def-prerollback.json) — this is your rollback artifact.
  5. terraform apply against staging. ECS rolling deploy fires.
  6. Watch the staging service stabilize (aws ecs describe-services until runningCount === desiredCount AND rolloutState=COMPLETED on PRIMARY).
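
The stabilization condition in step 6 can be expressed as a small check over the `aws ecs describe-services` JSON. This is a sketch, not the repo's tooling: the types list only the fields the check reads, `isRolloutComplete` is a hypothetical name, and the extra requirement that only one deployment remains is an assumption that is stricter than the brief's wording.

```typescript
// Subset of one entry in `services[]` from `aws ecs describe-services`.
type Deployment = {
  status: string; // e.g. "PRIMARY" or "ACTIVE"
  rolloutState?: string; // e.g. "IN_PROGRESS", "COMPLETED", "FAILED"
  desiredCount: number;
  runningCount: number;
};

type Service = {
  desiredCount: number;
  runningCount: number;
  deployments: Deployment[];
};

function isRolloutComplete(service: Service): boolean {
  const primary = service.deployments.find((d) => d.status === "PRIMARY");
  if (!primary) return false;
  return (
    // Assumption beyond the brief: no older deployment is still draining.
    service.deployments.length === 1 &&
    primary.rolloutState === "COMPLETED" &&
    primary.runningCount === primary.desiredCount &&
    service.runningCount === service.desiredCount
  );
}
```

For a blocking wait from the shell, `aws ecs wait services-stable` is the built-in alternative; the explicit check is useful when the result has to be logged or combined with the smoke gates.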

D.2 — Staging YC-demo smoke gate (MUST PASS before D.3) ​

Run the YC-demo smoke test on https://stage.askflorence.health. This is the contract the YC reviewer would hit:

  1. Calculator basic flow — POST /api/eligibility + POST /api/plans with the demo-default inputs (ZIP 84094 / married couple ages 35+30 / appropriate non-Medicaid income that returns plan results). Expect: plans JSON returned, no 500s.

  2. Plan count baseline: the default calculator inputs that exercise the doctor + Rx coverage path (the "YC default scenario") must return 14 plans. The exact ZIP / income / household composition is in scripts/audit/fixtures/calculator-baseline.json; if it is not there, derive it from current staging behavior BEFORE making any change, using the pre-change baseline captured in Phase A under /tmp/eng-279-baselines/. 14 is the contract; if pre-change staging returns a different count, that captured number becomes the session baseline.

  3. Provider coverage probe — Tyler Wood — POST /api/providers/covered searching for "Tyler Wood" (a real provider with known coverage across UT plans). Expected: provider returned as covered on the 14 plans the default calculator returns. Specifically the response payload must show covered: true for the doctor across every plan in the YC default scenario plan list.

  4. Medication coverage probe — Synthroid — POST /api/drugs/covered with "Synthroid" (levothyroxine sodium, a high-volume formulary entry). Expected: drug returned as covered on the 14 plans, with a tier classification populated (drug_tier=PreferredBrand or Generic per plan).

  5. Medication coverage probe — Lipitor — POST /api/drugs/covered with "Lipitor" (atorvastatin calcium, another high-volume formulary entry). Expected: drug returned as covered on the 14 plans, with a tier classification populated.

  6. End-to-end member flow — full happy-path walkthrough in the browser at https://stage.askflorence.health: open the home page → demo calculator → land on plans → click "Check doctor + Rx coverage" → search for Tyler Wood, Synthroid, Lipitor → verify the per-plan coverage indicators render correctly → no console errors, no broken renders.

  7. Calculator regression — BASELINE_BASE=https://stage.askflorence.health npx tsx scripts/audit/calculator-baseline-diff.ts → ZERO DIFFS.

  8. Agent flow — submit a synthetic agent waitlist signup at https://stage.askflorence.health/agent-onboarding (use tahaabbasi+yctest-eng279-<timestamp>@me.com). Confirm: Mongo write succeeded (verify via Atlas Admin UI or aws ecs execute-command into the task), SES sent the confirmation email, HubSpot mirror created the contact, no errors in CloudWatch logs. Clean up the test row + HubSpot contact after.

If ANY of these eight gates fails, immediately roll back by re-registering the snapshotted prior task-def revision (aws ecs register-task-definition --cli-input-json file:///tmp/staging-task-def-prerollback.json + aws ecs update-service) and DO NOT proceed to D.3 until the root cause is found and fixed in Phase B (Atlas role) or Phase C (Secrets Manager value).
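
One practical caveat with the rollback recipe: the raw output of `aws ecs describe-task-definition` wraps the definition in a `taskDefinition` key and includes read-only fields that `register-task-definition` does not accept as input, so the snapshot usually needs a small transformation before it can be passed to `--cli-input-json`. The sketch below shows that transformation; `toRegisterInput` is a hypothetical helper, and the field list should be checked against the current AWS CLI reference.

```typescript
// Fields returned by describe-task-definition that register-task-definition
// does not accept as input.
const READ_ONLY_FIELDS = [
  "taskDefinitionArn",
  "revision",
  "status",
  "requiresAttributes",
  "compatibilities",
  "registeredAt",
  "registeredBy",
  "deregisteredAt",
] as const;

// Shape of the saved snapshot file (raw describe-task-definition output).
type Snapshot = { taskDefinition: Record<string, unknown>; tags?: unknown[] };

// Unwrap the snapshot and drop read-only fields so the result can be
// written to a file and passed to --cli-input-json.
function toRegisterInput(snapshot: Snapshot): Record<string, unknown> {
  const input: Record<string, unknown> = { ...snapshot.taskDefinition };
  for (const field of READ_ONLY_FIELDS) delete input[field];
  return input;
}
```

Preparing and dry-reading the converted file at snapshot time, before the Terraform apply, keeps the rollback path from being exercised for the first time during an incident.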

D.3 — Soak staging for ≥ 30 min after D.2 passes ​

Hands off. Watch CloudWatch logs (aws logs tail) for any error spike. Watch Atlas Admin UI for any auth-failure spike. After 30 min of clean operation, proceed to D.4.

D.4 — Prod ECS task-def swap ​

Mirror D.1 + D.2 against prod (infra/envs/prod/ecs.tf + apply + watch + run the YC-demo smoke test against https://askflorence.health).

Same eight-gate smoke test. Same rollback recipe. Same 30-min soak.

Phase E — Canonical .env.local update ​

  1. Update /Users/tahaabbasi/Developer/askflorence/.env.local:
    • MONGODB_URI ← app_read@staging connection string
    • MONGODB_WRITE_URI ← app_write@staging connection string
    • MONGODB_REFERENCE_URI ← app_read@staging connection string (same cluster + user; can be identical to MONGODB_URI, but kept as a distinct env var so getReferenceDb() doesn't silently lose its explicit binding contract per ENG-272)
    • MONGODB_AUDIT_WRITE_URI ← app_audit_writer@staging connection string
    • Remove all legacy MONGODB_URI_*_WRITE entries (no consumer will use them after Phase F)
  2. Restart local dev server. Probe /api/plans + /api/eligibility + /api/providers/covered + /api/drugs/covered against http://localhost:3000 (or whatever port).
  3. Run calculator regression: npx tsx scripts/audit/calculator-baseline-diff.ts — must be ZERO DIFFS, no env-var juggling required.
  4. Update docs/briefs/SESSION_BRIEF_NEW_WORKTREE.md or whatever the worktree-setup brief is, so future worktrees pick up the right env-var convention. Also update .env.example if one exists.
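
Since the bug that triggered this work was MONGODB_URI bound to the wrong user, a quick local check of which user each connection string authenticates as may be worth running after the .env.local update. The sketch below assumes the final user names are exactly app_read, app_write, and app_audit_writer (naming is still an open decision for plan approval); `uriUsername` and `bindingProblems` are hypothetical helpers.

```typescript
// Expected user per env var, assuming the user names proposed in this brief.
const EXPECTED_USER: Record<string, string> = {
  MONGODB_URI: "app_read",
  MONGODB_WRITE_URI: "app_write",
  MONGODB_REFERENCE_URI: "app_read",
  MONGODB_AUDIT_WRITE_URI: "app_audit_writer",
};

// Extract the username from a mongodb:// or mongodb+srv:// connection string.
function uriUsername(uri: string): string | null {
  const match = /^mongodb(?:\+srv)?:\/\/([^:@\/]+)(?::[^@]*)?@/.exec(uri);
  const user = match?.[1];
  return user ? decodeURIComponent(user) : null;
}

// Returns problems found in an env map; an empty list means every binding
// points at the expected user and no legacy *_WRITE variable remains.
function bindingProblems(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const [name, expected] of Object.entries(EXPECTED_USER)) {
    const uri = env[name];
    if (!uri) {
      problems.push(`${name} is unset`);
      continue;
    }
    const user = uriUsername(uri);
    if (user !== expected) {
      problems.push(`${name} is bound to ${user ?? "no user"}, expected ${expected}`);
    }
  }
  for (const name of Object.keys(env)) {
    if (/^MONGODB_URI_.+_WRITE$/.test(name)) problems.push(`${name} is a legacy binding`);
  }
  return problems;
}
```

The check never prints the connection string itself, only the variable name and username, which keeps passwords out of terminal scrollback and logs.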

Phase F — Code consolidation (remove dead env-var references) ​

  1. Grep src/, scripts/, infra/ for the deprecated env-var names: MONGODB_URI_PLANS_WRITE, MONGODB_URI_SURVEY_WRITE, MONGODB_URI_AGENTS_WRITE, MONGODB_URI_AGENTS_ADMIN, MONGODB_URI_AUDIT_READ, MONGODB_URI_WAITLIST_WRITE, MONGODB_URI_HUBSPOT_SYNC_WRITE.
  2. Replace each call site with MONGODB_WRITE_URI (or MONGODB_AUDIT_WRITE_URI for audit-log writers when those ship in Phase 5).
  3. Run npx tsc --noEmit clean.
  4. Run calculator regression. Run a full smoke test of every API route that does a DB write.
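
The grep in step 1 can also be expressed as a small scanner, which is convenient if the check should later run in CI. This is a sketch only: `findDeprecated` is a hypothetical name, the list is copied from step 1, and reading files from src/, scripts/, and infra/ is left to the caller.

```typescript
// Deprecated env-var names, copied from step 1 of Phase F.
const DEPRECATED_ENV_VARS = [
  "MONGODB_URI_PLANS_WRITE",
  "MONGODB_URI_SURVEY_WRITE",
  "MONGODB_URI_AGENTS_WRITE",
  "MONGODB_URI_AGENTS_ADMIN",
  "MONGODB_URI_AUDIT_READ",
  "MONGODB_URI_WAITLIST_WRITE",
  "MONGODB_URI_HUBSPOT_SYNC_WRITE",
];

type Hit = { line: number; name: string };

// Scan one file's text and report each deprecated name with its 1-based line.
function findDeprecated(source: string): Hit[] {
  const hits: Hit[] = [];
  source.split("\n").forEach((text, index) => {
    for (const name of DEPRECATED_ENV_VARS) {
      // Word boundaries avoid matching a longer identifier that merely
      // contains the deprecated name.
      if (new RegExp(`\\b${name}\\b`).test(text)) {
        hits.push({ line: index + 1, name });
      }
    }
  });
  return hits;
}
```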

Phase G — Deprecate old Mongo users (DELETE last) ​

Only after Phase F is verified. Three rolling deploys later (so no in-flight ECS tasks reference old creds) AND a clean nightly drift check AND a clean calculator regression.

  1. Delete custom roles: role_writer_survey, role_writer_plans, role_writer_agents, role_admin_agents, role_admin_schema, role_audit_reader, role_reader_reference, role_reader_local_staging, plus role_writer_waitlist and role_writer_hubspot_sync if they exist as separate roles.
  2. Delete users: app_read_local_staging, app_read_staging, app_writer_survey, app_writer_plans, app_writer_waitlist, app_writer_hubspot_sync, app_writer_agents, app_admin_agents, app_admin_schema, audit_reader.
  3. Delete prod app-write (Issue #56 exit criterion).
  4. Delete the corresponding Secrets Manager secrets.
  5. Final Terraform pass to remove the deleted secret definitions from infra/envs/{prod,staging}/secrets.tf.

Phase H — Documentation ​

  1. New ADR docs/adr/0005-mongo-user-simplification.md that supersedes ADR 0003. References MongoDB's documented anti-pattern guidance from the ENG-279 comment. Lays out the 3-user model + the Phase 5 re-narrowing playbook (views + JWT).
  2. Update infra/atlas/access-matrix.ts to the new 6-user manifest. Run npm run docs:atlas to regenerate docs/infrastructure/atlas-access-matrix.md.
  3. Touch up docs/security-compliance/access-control-policy.md DB section. The §164.308(a)(4) row in docs/security-compliance/hipaa-control-mapping.md stays accurate (least-privilege is still met at the role-tier level).
  4. Touch up the now-pointer docs/infrastructure/access-control.md if it still has stale Mongo-user details.
  5. Add a "What shipped" entry to CLAUDE.md under today's date, no version bump per cadence policy.
  6. Close out ENG-279 — comment with summary + tick all acceptance-criteria checkboxes + move status to In Review.
  7. Update ENG-214 close-out comment to note the access-control-policy + atlas-access-matrix refreshes shipped.

Verification gates (NON-NEGOTIABLE) ​

At each gate, all probes must pass before proceeding to the next phase:

Probe set — the YC-demo smoke test ​

This is the contract. Run all 8 gates against https://stage.askflorence.health after every Atlas / Secrets / Terraform change. Run all 8 against https://askflorence.health after every prod change. Local-only probes (calculator regression on http://localhost:3000) supplement but do not replace the staging + prod gates — staging is the YC link and prod is prod; both are live external-facing surfaces.

| # | Gate | What it tests | Pass criteria |
| --- | --- | --- | --- |
| 1 | POST /api/eligibility (UT 84094 + non-Medicaid income) | ZIP lookup + APTC / CSR calc | HTTP 200, eligibility payload returned, no error |
| 2 | POST /api/plans (same inputs as #1) | Plan search + scoring | HTTP 200, plans JSON returned, plan count matches the "YC default scenario" baseline (target 14 plans; capture pre-change baseline if different from 14 and use that as the contract for this session) |
| 3 | POST /api/providers/covered (search "Tyler Wood") | Cross-cluster providers_staging read | HTTP 200, response shows covered: true on each of the 14 plans the calculator returns |
| 4 | POST /api/drugs/covered (search "Synthroid") | Cross-cluster formularies_staging read | HTTP 200, response shows covered: true on each of the 14 plans with drug_tier populated |
| 5 | POST /api/drugs/covered (search "Lipitor") | Same path as #4, different drug | HTTP 200, response shows covered: true on each of the 14 plans with drug_tier populated |
| 6 | End-to-end browser flow (home → calculator → plans → coverage check for the three test items) | Real user journey | All renders correct, no console errors, no broken states; "Tyler Wood + Synthroid + Lipitor all show covered on the 14 plans" |
| 7 | BASELINE_BASE=<env-url> npx tsx scripts/audit/calculator-baseline-diff.ts | Full 12-scenario pipeline regression | ZERO DIFFS |
| 8 | Agent flow: synthetic waitlist signup at /agent-onboarding with tahaabbasi+yctest-eng279-<ts>@me.com | Mongo write + SES send + HubSpot mirror | All three side effects fire correctly; clean up test row + HubSpot contact post-test |

Plus the standing CI guards:

  • nightly staging-cluster-drift workflow passes (trigger manually via gh workflow run staging-cluster-drift after Atlas changes)
  • staging-collections-guard CI workflow passes on the PR
  • atlas-env-var-coverage CI workflow passes on the PR
  • atlas-docs-sync CI workflow passes on the PR (after npm run docs:atlas regen)
  • validate-secrets CI workflow passes on the PR
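
The pass criteria for gates 2 through 5 reduce to two checks: the plan count equals the session contract, and every plan in that list shows covered: true (with drug_tier populated for the drug gates). A minimal sketch of that comparison follows; the response shapes (`planId`, `covered`, `drug_tier` as a flat row per plan) are assumptions, since the brief does not show the real payloads, and `coverageGateFailures` is a hypothetical helper.

```typescript
// Hypothetical response row; the real /api/*/covered payload may differ.
type CoverageRow = { planId: string; covered: boolean; drug_tier?: string };

// Returns failures for one coverage gate; an empty list means the gate passes.
function coverageGateFailures(
  planIds: string[], // plan IDs returned by POST /api/plans
  rows: CoverageRow[], // per-plan rows from the coverage endpoint
  contractCount: number, // 14, or the captured session baseline
  requireTier = false, // true for the drug gates (#4, #5)
): string[] {
  const failures: string[] = [];
  if (planIds.length !== contractCount) {
    failures.push(`expected ${contractCount} plans, got ${planIds.length}`);
  }
  const byPlan = new Map(rows.map((r) => [r.planId, r] as const));
  for (const id of planIds) {
    const row = byPlan.get(id);
    if (!row || row.covered !== true) failures.push(`${id}: not covered`);
    else if (requireTier && !row.drug_tier) failures.push(`${id}: drug_tier missing`);
  }
  return failures;
}
```

A plan missing from the coverage response counts as a failure here, which matches the "covered on each of the plans" wording; a silently shorter response should not pass the gate.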

When ANY probe fails ​

STOP THE LINE. Roll back the change that caused the failure immediately:

| Failure point | Rollback action |
| --- | --- |
| Gate 1-8 fails against staging after a Terraform apply | aws ecs register-task-definition --cli-input-json file:///tmp/staging-task-def-prerollback.json + aws ecs update-service. Watch rollout. Re-probe gates. Investigate root cause in Atlas / Secrets layer. |
| Gate 1-8 fails against prod after a Terraform apply | Same as above, against prod task def + service. Page Asad if customer-visible. |
| Pre-swap probe (D.0) fails | Fix the Atlas role grant in the Atlas Admin UI. Re-probe before Terraform changes. |
| validate-secrets CI fails | Inspect; likely a malformed connection string in Secrets Manager. Fix the secret value before re-running. |
| atlas-cluster-drift fails after Phase B (additive create) | Should NOT happen: additive create doesn't change existing user privileges. If it does, the new user's role definition is wider than declared. Inspect role JSON; tighten in Atlas Admin UI. |

NEVER proceed past a failed gate. NEVER. Each Mongo-user regression in our history reached prod or staging because someone "fixed it in the next step" instead of stopping.

When a probe fails ​

STOP. Roll back the change that caused the failure (Terraform apply of the prior task-def revision; revert the Atlas user role; etc.). Do NOT proceed.

Rollback recipes per phase ​

  • Phase B (new users created): roll back = delete the new users in Atlas Admin UI. No app traffic touches them yet.
  • Phase C (new secrets created): roll back = delete the new Secrets Manager secrets. No app traffic uses them yet.
  • Phase D (ECS task-def points at new secrets): roll back = re-register the prior task-def revision (aws ecs register-task-definition from saved :N-1 snapshot) + aws ecs update-service. Old secrets are still alive so old creds still work.
  • Phase E (canonical .env.local update): roll back = git restore .env.local or hand-revert.
  • Phase F (code env-var consolidation): roll back = git revert <commit>.
  • Phase G (delete old users + secrets): NOT rollback-safe without a fresh Atlas user create + secret regenerate. Treat as point of no return — do not enter Phase G unless Phase A–F are all 100% verified across all 3 envs for ≥ 24 hours.

Hard rules ​

  • Staging is the YC demo link — it is prod-equivalent for downtime tolerance. Never degrade, never break, never invalidate the "Tyler Wood + Synthroid + Lipitor covered on 14 plans" smoke test. A failed staging gate is a STOP-the-line incident even if prod is fine. Roll back BEFORE investigating root cause.
  • Verify new users WORK before swapping any live binding. Phase D.0 pre-swap probe is non-negotiable — exercise each new user against every code path it will serve, from your laptop, BEFORE any task-def revision change. This is the principle that breaks the historical bug cycle.
  • Never delete a Mongo user that has any task definition revision still bound to it. ECS rolling deploys take time; new task revisions stabilize over minutes. Wait for desiredCount === runningCount AND rolloutState=COMPLETED on PRIMARY for the new revision before considering the old credential safe to delete.
  • Never commit a Mongo connection string (with password) to git. Pull from AWS Secrets Manager or .env.local (gitignored). Per CLAUDE.md Security rules.
  • Never run git add . or git add -A. Stage specific paths (infra/, docs/, CLAUDE.md, etc.). Per CLAUDE.md Security rules.
  • The append-only audit-log property is sacred. app_audit_writer keeps FIND+INSERT-only on agent_audit_log. Any change to this role requires a separate ADR + Asad/Taha sign-off. Per ADR 0002.
  • Atlas BAA enumeration unchanged. Both askflorence-prod-01 + askflorence-staging stay in scope. No new clusters, no project changes — the simplification is users + roles within existing projects only.
  • Calculator regression must pass at EVERY phase boundary. Plus the full 8-gate YC-demo smoke test against staging after every change touching staging, and against prod after every change touching prod.
  • Capture pre-change baselines before any change. The "14 plans" + "Tyler Wood + Synthroid + Lipitor covered" contract is what the YC reviewer sees today on staging. If today's staging returns a different number (e.g., 13 or 15 plans), capture that as the session-start baseline FIRST, so you don't discover mid-session that you changed an output you couldn't measure against. Save baselines under /tmp/eng-279-baselines/ (the Phase A location) before Phase B even starts.
  • Snapshot task-def revisions before each Terraform apply. aws ecs describe-task-definition --task-definition <name> > /tmp/<env>-task-def-prerollback.json. Your rollback artifact.

When a deploy is needed ​

This work touches prod ECS task definitions, so it requires real terraform apply runs against the prod account. Per CLAUDE.md "Deploy + release cadence policy":

  • Wait for explicit Taha approval before EACH prod-side change
  • Staging-side changes can ship as part of normal workflow (Taha reviewing after the fact is fine)
  • Hotfixes (rollback) can ship immediately if a probe fails

This is a multi-deploy operation. Probably 2 prod deploys (Phase D task-def update + Phase G secrets cleanup). Confirm scope with Taha at plan-approval time.


What success looks like ​

  • All 8 YC-demo verification gates pass against staging at every phase boundary AND at the end. Staging stayed online + correct throughout — Tyler Wood + Synthroid + Lipitor remained covered: true across all 14 (or session-contract-N) plans for the entire session.
  • All 8 YC-demo verification gates pass against prod at every phase boundary AND at the end. Same contract on prod.
  • ENG-279 acceptance criteria all ticked
  • 6 total Mongo users across both clusters (down from 12)
  • MONGODB_URI, MONGODB_WRITE_URI, MONGODB_REFERENCE_URI, MONGODB_AUDIT_WRITE_URI are the only Mongo env vars in active code
  • Local dev works with vanilla MONGODB_URI + MONGODB_REFERENCE_URI — no env-var juggling required for calculator regression
  • ADR 0005 supersedes ADR 0003 documenting the new model + the Phase 5 re-narrowing discipline (views + JWT)
  • Atlas access matrix regenerated + clean
  • agent_audit_log append-only property still verifiable via the same probe pattern as today
  • CLAUDE.md has the "What shipped" entry
  • ENG-279 closes (status In Review awaiting Taha sign-off)
  • ENG-214 close-out comment updated noting the access-control-policy.md refresh
  • Pre-change baselines preserved at /tmp/eng-279-baselines/ until ENG-279 PR merges — proof artifact that nothing degraded

Estimated effort ​

| Phase | Effort |
| --- | --- |
| A: Plan (reading + writing plan file) | 1 hr |
| B: Atlas additive user creation | 30 min |
| C: Secrets Manager additive | 30 min |
| D: Terraform + ECS rolling deploy (staging + prod with 30-min stable wait between) | 1.5 hr |
| E: .env.local update + local verification | 30 min |
| F: Code env-var consolidation | 1 hr |
| G: Deprecation (only after 24h verified) | 1 hr (next session if 24h elapsed) |
| H: Documentation | 1 hr |
| Total | ~7 hr active + 24h wait between F and G |

Realistically 2 sessions: one to ship Phases A–F (~5 hr), one to do Phase G + H after the 24h soak.


At plan approval time, surface these decisions to Taha ​

  1. Naming of new users + secrets — app_read vs app_read_v2 vs appReader. Consistency matters; pick once.
  2. Whether to consolidate the legacy MONGODB_URI_*_WRITE env vars now (Phase F) or leave them as aliases for MONGODB_WRITE_URI. Recommend consolidating — fewer env vars = less drift surface.
  3. Whether the prod app_read@prod cluster user should exist or whether prod reads only from staging cluster via PrivateLink — current Terraform has prod/mongodb/app-read pointing at prod cluster. Re-confirm the cluster split today is "prod has its own plan data; staging has its own plan data; cross-cluster is only formularies + providers". (Spoiler: that's what the access matrix says + what the live probes confirmed.)
  4. MONGODB_AUDIT_WRITE_URI introduction now vs at Phase 5 — recommend introducing now as a wired-up env var even if no code consumer exists yet, so Phase 5 work can drop the binding straight in.
  5. Whether to file a separate issue for the access-control-policy.md doc refresh or include it in this PR. Recommend including it — keeps the doc + the implementation atomically synced.

Cross-references ​

  • Issue: ENG-279
  • Related: ENG-214 (the work that surfaced this)
  • ADRs: 0001 (project isolation — preserved), 0002 (audit log — preserved), 0003 (narrow-scoped users — SUPERSEDED by this work), 0004 (PrivateLink — preserved)
  • Docs to update: docs/infrastructure/atlas-access-matrix.md (auto-regen), docs/security-compliance/access-control-policy.md (DB section), CLAUDE.md (What shipped)
  • MongoDB pattern reference: built-in roles, custom roles at DB level, views for filtering, JWT for tenant identity — see ENG-279 comment for the full citation

AskFlorence Internal Documentation. Not for public distribution.
