Runbook — Provision Atlas users and roles (Issue #56)
🔴 LIVE STATE → see Atlas Access Matrix (auto-generated from infra/atlas/access-matrix.ts, CI-guarded for accuracy). This runbook documents the ORIGINAL provisioning steps for #56 plus the Phase 11 cross-cluster reader, but the matrix is the canonical view of every user that exists today (including post-runbook additions like app_read_local_staging from ENG-271, app_writer_hubspot_sync from PR91, and app_admin_schema from ENG-266). When provisioning a new user, update the matrix in lock-step; CI enforces it.
This runbook reproduces the user/role setup executed against the askflorence-staging Atlas project on 2026-04-17. Use it to bring a second project (prod, a future DR region, or a throwaway sandbox) to the same state. It assumes an empty Atlas project — no pre-existing role names collide.
Design rationale lives in ADR 0003. Project-isolation rationale in ADR 0001. Append-only audit log rationale in ADR 0002.
Cross-cluster read user (app_read_staging) — the read-only user prod uses to read non-PHI public CMS reference data from the staging cluster via AWS PrivateLink. Provisioning steps for this user are documented in the dedicated section at the bottom of this runbook ("Cross-cluster app_read_staging user — Phase 11"). Decision rationale: ADR 0004.
Prerequisites
- atlas CLI v1.x with an authenticated session (atlas auth login).
- mongosh installed (for verification probes).
- openssl installed (for password generation).
- Project Owner role in the Atlas project you're targeting.
- The target project ID recorded: atlas projects list.
Variables you'll substitute
bash
PROJECT_ID=<target Atlas project ID> # e.g. 69e31af12fd2c0aef51bbb41 for staging
CLUSTER_HOST=<cluster SRV host> # e.g. askflorence-staging.efsikmv.mongodb.net
DB_NAME=askflorence # never changes
Step 1 — Confirm no role name collisions
bash
atlas customDbRoles list --projectId $PROJECT_ID --output json
Expected on a fresh project: []. If any of the five role names (role_writer_survey, role_writer_plans, role_writer_agents, role_admin_agents, role_audit_reader) already exists, stop and investigate.
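The collision check can be scripted. The following is a minimal sketch, assuming the existing role names have already been extracted one per line from the JSON output; find_role_collisions is a hypothetical helper name, not part of the atlas CLI.

```bash
# The five role names this runbook creates.
RUNBOOK_ROLES=(role_writer_survey role_writer_plans role_writer_agents role_admin_agents role_audit_reader)

# Hypothetical helper: reads existing custom role names on stdin (one per
# line) and prints every name that collides with a runbook role.
find_role_collisions() {
  local existing wanted
  while IFS= read -r existing; do
    for wanted in "${RUNBOOK_ROLES[@]}"; do
      if [[ "$existing" == "$wanted" ]]; then
        printf '%s\n' "$existing"
      fi
    done
  done
}
```

One way to feed it, assuming jq is installed and that each role object in the JSON carries a roleName field: atlas customDbRoles list --projectId "$PROJECT_ID" --output json | jq -r '.[].roleName' | find_role_collisions. Empty output means no collisions.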
Step 2 — Create the five custom roles
role_writer_survey
bash
atlas customDbRoles create role_writer_survey \
--privilege FIND@${DB_NAME}.agent_survey_responses,INSERT@${DB_NAME}.agent_survey_responses,UPDATE@${DB_NAME}.agent_survey_responses,REMOVE@${DB_NAME}.agent_survey_responses \
--projectId $PROJECT_ID
role_writer_plans
bash
ACTIONS=(FIND INSERT UPDATE REMOVE CREATE_INDEX DROP_INDEX COLL_MOD)
COLLS=(plans zip_county regions plan_years audit_log)
PRIV=""
for a in "${ACTIONS[@]}"; do for c in "${COLLS[@]}"; do PRIV="${PRIV}${a}@${DB_NAME}.${c},"; done; done
PRIV="${PRIV%,}"
atlas customDbRoles create role_writer_plans --privilege "$PRIV" --projectId $PROJECT_ID
role_writer_agents — append-only on agent_audit_log
bash
AGENTS_RW="FIND@${DB_NAME}.agents,INSERT@${DB_NAME}.agents,UPDATE@${DB_NAME}.agents,REMOVE@${DB_NAME}.agents,FIND@${DB_NAME}.agencies,INSERT@${DB_NAME}.agencies,UPDATE@${DB_NAME}.agencies,REMOVE@${DB_NAME}.agencies,FIND@${DB_NAME}.agent_sessions,INSERT@${DB_NAME}.agent_sessions,UPDATE@${DB_NAME}.agent_sessions,REMOVE@${DB_NAME}.agent_sessions"
AUDIT_APPEND="FIND@${DB_NAME}.agent_audit_log,INSERT@${DB_NAME}.agent_audit_log"
atlas customDbRoles create role_writer_agents --privilege "${AGENTS_RW},${AUDIT_APPEND}" --projectId $PROJECT_ID
role_admin_agents — same as above plus admins
bash
ADMINS_RW="FIND@${DB_NAME}.admins,INSERT@${DB_NAME}.admins,UPDATE@${DB_NAME}.admins,REMOVE@${DB_NAME}.admins"
atlas customDbRoles create role_admin_agents --privilege "${AGENTS_RW},${AUDIT_APPEND},${ADMINS_RW}" --projectId $PROJECT_ID
Why not --inheritedRole? Atlas rejected --inheritedRole role_writer_agents@askflorence with ATLAS_INVALID_CUSTOM_ROLE_INHERITED_SCOPE. Custom-role inheritance in Atlas is strict about the scope, so explicit enumeration is more robust, and the privilege list is still short.
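Because the privileges are enumerated explicitly, the long ACTION@db.collection strings are easy to mistype. A small generator, sketched here under the assumption that the order of pairs does not matter to Atlas, produces them from two lists. build_privileges is a hypothetical helper name.

```bash
# Hypothetical helper: prints ACTION@db.collection pairs, comma-separated,
# for every combination of the given actions and collections.
# $1 = database, $2 = space-separated actions, $3 = space-separated collections
build_privileges() {
  local db="$1" action coll out=""
  for action in $2; do
    for coll in $3; do
      out+="${action}@${db}.${coll},"
    done
  done
  # Drop the trailing comma before printing.
  printf '%s\n' "${out%,}"
}
```

For example, AGENTS_RW=$(build_privileges "$DB_NAME" "FIND INSERT UPDATE REMOVE" "agents agencies agent_sessions") yields the same twelve pairs as the hand-written string, grouped by action instead of by collection.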
role_audit_reader
bash
atlas customDbRoles create role_audit_reader \
--privilege FIND@${DB_NAME}.agent_audit_log \
--projectId $PROJECT_ID
Step 3 — Create the six users
Important: when assigning a custom role to a user, the role must be referenced as role_name@admin, not role_name@${DB_NAME}. Atlas rejects the latter with UNSUPPORTED_ROLE: Custom role X must scoped to admin database. The role's privileges target askflorence.* collections; the role itself is assigned via @admin.
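The scoping rule can be checked before any atlas call is made. This is a sketch only: it assumes every custom role follows this runbook's role_ naming prefix and treats anything else as a built-in role. check_role_scope is a hypothetical helper name.

```bash
# Hypothetical pre-flight check. Assumes custom roles start with "role_";
# anything else is treated as a built-in role (e.g. read@askflorence).
# Returns 0 when the role reference is acceptable, 1 otherwise.
check_role_scope() {
  local ref="$1"
  local name="${ref%@*}" scope="${ref##*@}"
  if [[ "$ref" != *@* ]]; then
    echo "missing @scope: $ref" >&2
    return 1
  fi
  if [[ "$name" == role_* && "$scope" != "admin" ]]; then
    echo "custom role must be assigned via @admin: $ref" >&2
    return 1
  fi
  return 0
}
```

Calling check_role_scope "role_writer_plans@${DB_NAME}" fails locally with a readable message instead of waiting for Atlas to return UNSUPPORTED_ROLE.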
Generate a 32-char alphanumeric password per user (openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32). Do not reuse passwords across users.
bash
declare -a USERS=(
"app_read_staging:read@${DB_NAME}:MONGODB_URI"
"app_writer_survey:role_writer_survey@admin:MONGODB_URI_SURVEY_WRITE"
"app_writer_plans:role_writer_plans@admin:MONGODB_URI_PLANS_WRITE"
"app_writer_agents:role_writer_agents@admin:MONGODB_URI_AGENTS_WRITE"
"app_admin_agents:role_admin_agents@admin:MONGODB_URI_AGENTS_ADMIN"
"audit_reader:role_audit_reader@admin:MONGODB_URI_AUDIT_READ"
)
PW_FILE=/tmp/.atlas-provision-pws # mode 600, deleted at the end
: > "$PW_FILE"
chmod 600 "$PW_FILE"
for entry in "${USERS[@]}"; do
USER="${entry%%:*}"
rest="${entry#*:}"
ROLE="${rest%%:*}"
ENV_NAME="${rest#*:}"
PW=$(openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32)
atlas dbusers create --username "$USER" --password "$PW" --role "$ROLE" --projectId $PROJECT_ID
echo "${ENV_NAME}=mongodb+srv://${USER}:${PW}@${CLUSTER_HOST}/${DB_NAME}?retryWrites=true&w=majority" >> "$PW_FILE"
done
For the prod rollout, rename the first user from app_read_staging to app_read_prod (or just keep the existing app-read and skip that entry).
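Before running the loop against a real project, it can help to validate its inputs offline. The sketch below assumes the user:role:ENV_NAME entry format shown above; parse_user_entry and is_valid_password are hypothetical helper names.

```bash
# Hypothetical helper: splits "user:role:ENV_NAME" into three lines.
# Fails unless there are exactly three non-empty fields.
parse_user_entry() {
  local entry="$1" user role env_name
  IFS=: read -r user role env_name <<< "$entry"
  [[ -n "$user" && -n "$role" && -n "$env_name" && "$env_name" != *:* ]] || return 1
  printf '%s\n%s\n%s\n' "$user" "$role" "$env_name"
}

# Hypothetical helper: true only for exactly 32 alphanumeric characters,
# the shape the openssl | tr | head pipeline is expected to produce.
is_valid_password() {
  [[ "$1" =~ ^[A-Za-z0-9]{32}$ ]]
}
```

The password check guards against the unlikely case where stripping non-alphanumeric characters leaves fewer than 32 characters; calling is_valid_password "$PW" || exit 1 inside the loop would stop before a short password reaches Atlas.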
Step 4 — Write creds to the env file
Move the contents of $PW_FILE to the appropriate local env file:
- Staging cluster: .env.staging.local (mode 600, gitignored).
- Prod cluster: do not write to .env.local on a dev machine. Move directly to Vercel env (or AWS Secrets Manager post-migration) and securely share with only the engineers who need local prod access.
Then:
bash
rm "$PW_FILE"
Step 5 — Verify (positive + negative probes)
Source the env file with the line-by-line loader (raw source breaks on & in SRV strings):
bash
while IFS= read -r line; do
[[ -z "$line" || "$line" == \#* ]] && continue
k="${line%%=*}"; v="${line#*=}"
export "$k=$v"
done < .env.staging.local
Run all 12 probes and expect 6 positive ACKs and 6 "user is not allowed to do action" denials. The full probe script is in docs/session-log/2026-04-17-atlas-staging.md. Any unexpected outcome means a role privilege is wrong; stop and fix it before handing the creds to any consumer.
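Raw source breaks because the shell treats & as a control operator: the assignment before it runs in the background, so the variable never reaches the current shell. The loader above avoids that by never evaluating the line as shell code. A reusable variant is sketched here; load_env_file is a hypothetical name, and the extra test on the read line is an addition that keeps a final line without a trailing newline from being dropped.

```bash
# Hypothetical wrapper around the line-by-line loader.
# $1 = path to an env file of KEY=value lines
load_env_file() {
  local file="$1" line key value
  while IFS= read -r line || [[ -n "$line" ]]; do
    # Skip blank lines and comments.
    [[ -z "$line" || "$line" == \#* ]] && continue
    key="${line%%=*}"    # text before the first "="
    value="${line#*=}"   # text after the first "=", kept verbatim
    export "$key=$value"
  done < "$file"
}
```

Usage: load_env_file .env.staging.local, after which each MONGODB_URI* variable holds its full SRV string, query parameters included.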
Step 6 — Probe-row hygiene
The positive probe against role_writer_agents inserts {_probe: true} into agent_audit_log. By design of ADR 0002, this row cannot be deleted by any app-tier user. Leave it; it ages out with retention.
Step 7 — Cleanup
- Ensure $PW_FILE was deleted.
- If a temp restore admin was used for a seeded cluster, delete it (atlas dbusers delete tmp_restore_admin --projectId $PROJECT_ID --force).
- Remove any local mongodump output (rm -rf ./tmp/prod-snapshot/).
When you're done — handoff
Produce a session brief with: project ID, cluster host, six usernames, env file location, region/tier/version. Never include passwords. Briefs live in docs/briefs/ — the staging handoff is at docs/briefs/SESSION_BRIEF_2026-04-17_atlas.md; for the prod rollout use the same filename pattern (SESSION_BRIEF_<YYYY-MM-DD>_<topic>.md).
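The filename pattern and the no-passwords rule can both be checked mechanically. This is a rough sketch; brief_path and brief_leaks_credentials are hypothetical helper names, and the credential pattern only catches connection strings with an embedded password.

```bash
# Hypothetical helper: prints the brief path for a date (YYYY-MM-DD) and topic.
brief_path() {
  local day="$1" topic="$2"
  [[ "$day" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || return 1
  printf 'docs/briefs/SESSION_BRIEF_%s_%s.md\n' "$day" "$topic"
}

# Hypothetical helper: succeeds when the file contains a connection string
# of the form mongodb+srv://user:password@host, which a brief must never do.
brief_leaks_credentials() {
  grep -Eq 'mongodb(\+srv)?://[^:/@[:space:]]+:[^@[:space:]]+@' "$1"
}
```

For example, brief_leaks_credentials "$(brief_path 2026-04-17 atlas)" && echo "remove credentials before committing" flags a brief that still carries a full URI.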
Rollback
For a newly-minted project, one command drops everything:
bash
atlas projects delete $PROJECT_ID
For an in-flight provisioning against an existing project (e.g. prod), revert by deleting each user created this session and each custom role created this session, in that order (users first, so the roles are unused).
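The users-then-roles ordering is easy to get wrong under pressure, so a dry run that only prints the commands may be useful. This sketch assumes atlas customDbRoles delete accepts the same --projectId and --force flags as the atlas dbusers delete command used elsewhere in this runbook; print_rollback_plan is a hypothetical helper name.

```bash
# Hypothetical dry-run helper: prints the delete commands in rollback order
# (users first, then roles) without running any of them.
# $1 = project ID, $2 = space-separated users, $3 = space-separated roles
print_rollback_plan() {
  local project_id="$1" name
  for name in $2; do
    printf 'atlas dbusers delete %s --projectId %s --force\n' "$name" "$project_id"
  done
  for name in $3; do
    printf 'atlas customDbRoles delete %s --projectId %s --force\n' "$name" "$project_id"
  done
}
```

Review the printed plan first; only list users and roles that were created in this session.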
Cross-cluster app_read_staging user — Phase 11
This user lives on the staging Atlas project (askflorence-staging, project_id 69e31af12fd2c0aef51bbb41) and is read by the prod app over AWS PrivateLink to fetch non-PHI public CMS reference data (formularies_staging, providers_staging). Decision rationale: ADR 0004.
The user is distinct from the staging-side app_writer_* / app_admin_* users defined earlier in this runbook — those scope writes to staging-app collections; app_read_staging is a narrow read-only consumer used by external (prod) callers only.
Step A — Generate a strong password
bash
PW_FILE=$(mktemp)
openssl rand -base64 32 > "$PW_FILE"
chmod 600 "$PW_FILE"
The password is later loaded into AWS Secrets Manager as part of the connection string (prod/mongodb/reference-uri) on the prod AWS account. Never commit it, paste it into chat, or store it outside Secrets Manager.
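One caveat when this password is embedded in a connection string: base64 output can contain +, / and =, and MongoDB connection strings require reserved characters in the password to be percent-encoded. A minimal sketch of that encoding step follows; urlencode and build_mongodb_uri are hypothetical helper names, and the encoder is only intended for ASCII input such as base64 text.

```bash
# Hypothetical helper: percent-encodes every character except the RFC 3986
# unreserved set (letters, digits, ".", "_", "~", "-").
urlencode() {
  local s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9._~-]) out+="$c" ;;
      *) printf -v hex '%%%02X' "'$c"; out+="$hex" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: assembles a URI in the format used by this runbook.
# $1 = user, $2 = raw password, $3 = host, $4 = database
build_mongodb_uri() {
  printf 'mongodb+srv://%s:%s@%s/%s?retryWrites=true&w=majority\n' \
    "$1" "$(urlencode "$2")" "$3" "$4"
}
```

With the private host in a variable (PRIVATE_HOST is an assumed name), CONNECTION_STRING=$(build_mongodb_uri app_read_staging "$(cat "$PW_FILE")" "$PRIVATE_HOST" "$DB_NAME") gives a value suitable for Secrets Manager.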
Step B — Create the Atlas user with the custom role_reader_reference role
bash
PROJECT_ID=69e31af12fd2c0aef51bbb41 # staging
DB_NAME=askflorence
# First, ensure the custom role exists (one-time per project — idempotent).
# If it already exists this returns a 409; safe to ignore.
# Canonical scope: 4 collections (see ADR 0004 amendment 2026-05-11 /
# ENG-257 closeout). Two for runtime tier-fallback (formularies_staging +
# providers_staging) and two for audit re-validation (plans +
# mrpuf_issuers_staging) — all part of the §1311 / MRF reference dataset.
atlas customDbRoles create role_reader_reference \
--privilege FIND@${DB_NAME}.formularies_staging,FIND@${DB_NAME}.providers_staging,FIND@${DB_NAME}.plans,FIND@${DB_NAME}.mrpuf_issuers_staging \
--projectId $PROJECT_ID
# Then create the user with that role.
atlas dbusers create \
--username app_read_staging \
--password "$(cat "$PW_FILE")" \
--role role_reader_reference@admin \
--projectId $PROJECT_ID
The custom role role_reader_reference grants ONLY FIND on the four §1311 / MRF reference collections: formularies_staging, providers_staging, plans, mrpuf_issuers_staging. That is tighter than the built-in read role (which grants read on the entire askflorence DB). Per ADR 0004 amendment 2026-05-11 (ENG-257 closeout), the four-collection scope is the canonical permanent posture: two collections back the runtime tier-fallback path on prod, and two back periodic audit re-validation cycles (ENG-230, future ENG-231 refresh cycles). All four share the same non-PHI data classification and the same AWS PrivateLink network path. Two reasons we ship the tight version:
- Defense in depth. Even though the data classification policy + Phase 1 static guard (scripts/audit/staging-collections-guard.ts) keep the staging cluster non-PHI, the principle of least privilege says the cross-cluster reader should only see what it needs.
- Phase 2 nightly drift check (scripts/audit/staging-cluster-drift.ts) audits this exact role shape every night via the Atlas Admin API. Any drift (extra collection grant, wider action like INSERT, additional role on the user, escalation to a built-in like read/readWrite) opens a P1 GitHub issue.
If you ever need to widen the role (new cross-cluster collection, etc.), update both the role on Atlas AND the constants STAGING_REFERENCE_READ_COLLECTIONS (src/lib/db.ts) + EXPECTED_READ_COLLECTIONS_SCRIPT_COPY (scripts/audit/staging-cluster-drift.ts) in lock-step. Otherwise the next nightly run flags the change as drift.
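To make the notion of drift concrete, the comparison can be sketched as a set difference between the expected FIND grants and whatever the role currently holds. This is an illustration only, not the logic of scripts/audit/staging-cluster-drift.ts; detect_role_drift is a hypothetical helper, and the comma-separated ACTION@db.collection input format is an assumption borrowed from the --privilege flag.

```bash
DB_NAME=askflorence
EXPECTED_COLLECTIONS=(formularies_staging providers_staging plans mrpuf_issuers_staging)

# Hypothetical helper: prints one line per deviation from the expected
# FIND-only grants; prints nothing when the role matches exactly.
# $1 = comma-separated ACTION@db.collection privileges actually granted
detect_role_drift() {
  local actual_csv="$1" want have p c
  want=$(for c in "${EXPECTED_COLLECTIONS[@]}"; do printf 'FIND@%s.%s\n' "$DB_NAME" "$c"; done)
  have=$(tr ',' '\n' <<< "$actual_csv")
  # Anything granted that is not in the expected set.
  while IFS= read -r p; do
    [[ -n "$p" ]] || continue
    grep -qxF -- "$p" <<< "$want" || printf 'EXTRA %s\n' "$p"
  done <<< "$have"
  # Anything expected that is no longer granted.
  while IFS= read -r p; do
    grep -qxF -- "$p" <<< "$have" || printf 'MISSING %s\n' "$p"
  done <<< "$want"
}
```

Widening the role on Atlas without updating the expected list shows up as an EXTRA line, which mirrors why the constants must change in lock-step.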
Emergency rollback (revert to the historical built-in read@askflorence, e.g. if Phase 2 audit ever breaks production cross-cluster reads):
bash
atlas dbusers update app_read_staging \
--role read@${DB_NAME} \
--projectId $PROJECT_ID
This is reversible and the cross-cluster reader keeps working immediately. It does disable the data-classification posture, though: the user can read anything in the askflorence DB on the staging cluster, so investigate and revert the rollback as soon as the audit is fixed.
Step C — Resolve the private connection string
After AWS creates the VPC endpoint and Atlas approves the connection (per infra/envs/prod/atlas-staging-privatelink.tf), Atlas issues a private SRV connection string:
bash
atlas privateEndpoints aws describe 69fe75c5b02c024f32d2af50 \
--projectId 69e31af12fd2c0aef51bbb41
# Look for endpointServiceName + connection string under interfaceEndpoints
The connection string format is:
mongodb+srv://app_read_staging:<password>@askflorence-staging-pl-0.<random>.mongodb.net/askflorence?retryWrites=true&w=majority
Step D — Store in prod Secrets Manager (project CMK encrypted)
bash
AWS_PROFILE=askflorence-prod aws secretsmanager put-secret-value \
--secret-id prod/mongodb/reference-uri \
--secret-string "$CONNECTION_STRING"
The secret shell is created by Terraform (infra/envs/prod/secrets.tf), which sets kms_key_arn to the prod project CMK so the value is encrypted with our key, not the AWS-default key. Do not create the secret directly via aws secretsmanager create-secret; by default that uses the AWS-default KMS key and is inconsistent with the rest of our secret hygiene.
Step E — Wire the env binding
Already declared in infra/envs/prod/ecs.tf:
hcl
secrets_from_manager = {
...
MONGODB_REFERENCE_URI = module.secrets.secret_arns["mongodb/reference-uri"]
}
The next ECS deploy bakes the binding into a fresh task def. Application code (getReferenceDb() in src/lib/db.ts) routes via MONGODB_REFERENCE_URI and falls back to MONGODB_URI when unset, so dev + staging keep working without code changes.
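The fallback described above can be reproduced in the shell to see which URI a given environment would resolve to. This is a sketch of the described behaviour, not the TypeScript in src/lib/db.ts; resolve_reference_uri is a hypothetical name, and treating an empty value the same as an unset one is an assumption.

```bash
# Hypothetical helper: prints MONGODB_REFERENCE_URI when it is set and
# non-empty, otherwise MONGODB_URI. Returns 1 when neither is available.
resolve_reference_uri() {
  local uri="${MONGODB_REFERENCE_URI:-${MONGODB_URI:-}}"
  [[ -n "$uri" ]] || return 1
  printf '%s\n' "$uri"
}
```

On dev and staging, where only MONGODB_URI is set, this prints the primary URI; on prod it prints the cross-cluster one.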
Step F — Verify end-to-end
bash
# From a host that can reach prod ECS — typically a smoke test against
# askflorence.health that exercises the cross-cluster read path.
curl -sS -X POST https://askflorence.health/api/drugs/covered \
-H 'Content-Type: application/json' \
-d '{"rxcuis":["1364441"],"plan_id":"42261UT0060023","year":2026}'
Expected response: coverage=Covered, drug_tier=PreferredBrand. The drug_tier field is only populated by lookupStagingDrugTiers() reading from formularies_staging via the cross-cluster path, so its presence is the proof.
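The presence check can be automated in a smoke test. The sketch below assumes the response body is JSON with a string field named drug_tier; the actual response shape is not shown in this runbook, so treat the pattern as a placeholder. has_drug_tier is a hypothetical helper name.

```bash
# Hypothetical check: succeeds when the body contains a "drug_tier" field
# with a non-empty string value. The JSON shape is an assumption.
# $1 = response body
has_drug_tier() {
  grep -Eq '"drug_tier"[[:space:]]*:[[:space:]]*"[^"]+"' <<< "$1"
}
```

Capture the curl output in a variable and pass it in, for example has_drug_tier "$BODY" || echo "cross-cluster read path not exercised", where BODY is an assumed variable name.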
Step G — Cleanup
- Ensure $PW_FILE was deleted (rm -f "$PW_FILE").
- Confirm the Secrets Manager secret has a populated value (aws secretsmanager get-secret-value --secret-id prod/mongodb/reference-uri --query 'SecretString' --output text should return non-empty; never paste the value into chat / commit / log).
Step H — API key for nightly drift check (Phase 2)
Phase 2 of #100 / ENG-239 adds a nightly GitHub Actions workflow (.github/workflows/staging-cluster-drift.yml) that audits the live Atlas state of app_read_staging via the Atlas Admin API. To run the audit in CI, the workflow needs an Atlas Programmatic API key bound to GitHub Actions secrets.
One-time provisioning:
1. Create the API key in Atlas Org settings. Requires Org Owner role.
Atlas UI → Organization Settings → Access Manager → Applications → Create API Key
Name: gh-actions-staging-drift-check
Org Permission: leave at default (no Org-level role)
On the next page, when prompted for project access:
Project: askflorence-staging (69e31af12fd2c0aef51bbb41)
Project permission: Project Read Only
Do NOT grant Org-level permissions, and do NOT grant any other project. Project Read Only on staging is the only access this key needs (it reads app_read_staging's role + the role_reader_reference definition, nothing else).
2. Capture the public + private key pair. The private key is shown ONCE at creation. Save to 1Password under Atlas — gh-actions-staging-drift-check.
3. Bind to GitHub Actions secrets:
bash
gh secret set ATLAS_DRIFT_CHECK_PUBLIC_KEY --body "<public-key>"
gh secret set ATLAS_DRIFT_CHECK_PRIVATE_KEY --body "<private-key>"
The workflow exposes these as MONGODB_ATLAS_PUBLIC_API_KEY + MONGODB_ATLAS_PRIVATE_API_KEY env vars at runtime; the atlas CLI consumes those automatically (no atlas auth login step needed in CI).
4. Manually trigger to verify. First run should pass (the role is in canonical state):
bash
gh workflow run staging-cluster-drift.yml
gh run list --workflow staging-cluster-drift.yml --limit 1
Rotating the key: Atlas does not auto-rotate Programmatic Keys. Rotate annually as part of the same quarterly review that touches STAGING_ALLOWED_COLLECTIONS. To rotate:
Atlas UI → Org Settings → Access Manager → Applications → gh-actions-staging-drift-check → ... → Rotate
Then re-run gh secret set for both keys with the new values. The old key auto-deactivates after the rotation grace period.
Drift-check rollback (emergency): if the nightly audit ever produces false positives (e.g. Atlas API schema change), disable the schedule by editing .github/workflows/staging-cluster-drift.yml and removing the schedule: block (keeping workflow_dispatch: for manual runs while debugging). The workflow itself does not modify Atlas state — it's a read-only audit — so disabling it carries zero security blast radius beyond losing the nightly check.
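The drift workflow depends on both MONGODB_ATLAS_* variables being present at runtime, and a missing secret binding is one cause of failure worth ruling out before suspecting the audit itself. A pre-flight check could look like the sketch below; require_env is a hypothetical helper, and nothing here is taken from the actual workflow file.

```bash
# Hypothetical pre-flight check: names every listed variable that is unset
# or empty on stderr, and returns 1 if any are missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [[ -z "${!name:-}" ]]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running require_env MONGODB_ATLAS_PUBLIC_API_KEY MONGODB_ATLAS_PRIVATE_API_KEY as an early workflow step would fail fast with the variable name rather than an opaque authentication error.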
Rollback
bash
# Atlas:
atlas dbusers delete app_read_staging \
--projectId 69e31af12fd2c0aef51bbb41 --force
# AWS Secrets Manager (30-day recovery window):
AWS_PROFILE=askflorence-prod aws secretsmanager delete-secret \
--secret-id prod/mongodb/reference-uri \
--recovery-window-in-days 30
To fully tear down the cross-cluster path including the AWS PrivateLink endpoint + security group, see the rollback section in the Phase 11 session log.