Runbook — Provision Atlas users and roles (Issue #56)

🔴 LIVE STATE → see Atlas Access Matrix (auto-generated from infra/atlas/access-matrix.ts, CI-guarded for accuracy). This runbook documents the ORIGINAL provisioning steps for #56 plus the Phase 11 cross-cluster reader, but the matrix is the canonical view of every user that exists today (including post-runbook additions like app_read_local_staging from ENG-271, app_writer_hubspot_sync from PR91, and app_admin_schema from ENG-266). When provisioning a new user, update the matrix in lock-step — CI enforces it.

This runbook reproduces the user/role setup executed against the askflorence-staging Atlas project on 2026-04-17. Use it to bring a second project (prod, a future DR region, or a throwaway sandbox) to the same state. It assumes an empty Atlas project — no pre-existing role names collide.

Design rationale lives in ADR 0003. Project-isolation rationale in ADR 0001. Append-only audit log rationale in ADR 0002.

Cross-cluster read user (app_read_staging) — the read-only user prod uses to read non-PHI public CMS reference data from the staging cluster via AWS PrivateLink. Provisioning steps for this user are documented in the dedicated section at the bottom of this runbook ("Cross-cluster app_read_staging user — Phase 11"). Decision rationale: ADR 0004.

Prerequisites

atlas CLI v1.x with an authenticated session (atlas auth login).
mongosh installed (for verification probes).
openssl installed (for password generation).
Project Owner role in the Atlas project you're targeting.
The target project ID, recorded from atlas projects list.

Variables you'll substitute

bash

PROJECT_ID=<target Atlas project ID>          # e.g. 69e31af12fd2c0aef51bbb41 for staging
CLUSTER_HOST=<cluster SRV host>               # e.g. askflorence-staging.efsikmv.mongodb.net
DB_NAME=askflorence                           # never changes

Step 1 — Confirm no role name collisions

bash

atlas customDbRoles list --projectId $PROJECT_ID --output json

Expected on a fresh project: []. If any of the five role names (role_writer_survey, role_writer_plans, role_writer_agents, role_admin_agents, role_audit_reader) already exists, stop and investigate before continuing.
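
As a rough sketch of automating that check, the function below scans the JSON for the five names. It assumes each role object carries a roleName field (worth confirming against real output), and find_role_collisions is a hypothetical helper name, not part of the atlas CLI.

```shell
# Sketch: report which of the five expected role names already exist.
# Assumption: each role object in the JSON has a "roleName" field.
EXPECTED_ROLES="role_writer_survey role_writer_plans role_writer_agents role_admin_agents role_audit_reader"

# find_role_collisions JSON_TEXT -> prints one colliding role name per line
find_role_collisions() {
  for role in $EXPECTED_ROLES; do
    # Match the quoted name exactly so role_writer_agents never matches role_admin_agents.
    if printf '%s' "$1" | grep -Eq "\"roleName\"[[:space:]]*:[[:space:]]*\"${role}\""; then
      printf '%s\n' "$role"
    fi
  done
}

# Demo with canned JSON. In practice, pass the output of the list command:
#   find_role_collisions "$(atlas customDbRoles list --projectId "$PROJECT_ID" --output json)"
sample='[{"roleName":"role_audit_reader","actions":[]}]'
find_role_collisions "$sample"
```

Empty output means no collisions; any printed name is a reason to stop.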

Step 2 — Create the five custom roles

`role_writer_survey`

bash

atlas customDbRoles create role_writer_survey \
  --privilege FIND@${DB_NAME}.agent_survey_responses,INSERT@${DB_NAME}.agent_survey_responses,UPDATE@${DB_NAME}.agent_survey_responses,REMOVE@${DB_NAME}.agent_survey_responses \
  --projectId $PROJECT_ID

`role_writer_plans`

bash

ACTIONS=(FIND INSERT UPDATE REMOVE CREATE_INDEX DROP_INDEX COLL_MOD)
COLLS=(plans zip_county regions plan_years audit_log)
PRIV=""
for a in "${ACTIONS[@]}"; do for c in "${COLLS[@]}"; do PRIV="${PRIV}${a}@${DB_NAME}.${c},"; done; done
PRIV="${PRIV%,}"
atlas customDbRoles create role_writer_plans --privilege "$PRIV" --projectId $PROJECT_ID

`role_writer_agents` — append-only on `agent_audit_log`

bash

AGENTS_RW="FIND@${DB_NAME}.agents,INSERT@${DB_NAME}.agents,UPDATE@${DB_NAME}.agents,REMOVE@${DB_NAME}.agents,FIND@${DB_NAME}.agencies,INSERT@${DB_NAME}.agencies,UPDATE@${DB_NAME}.agencies,REMOVE@${DB_NAME}.agencies,FIND@${DB_NAME}.agent_sessions,INSERT@${DB_NAME}.agent_sessions,UPDATE@${DB_NAME}.agent_sessions,REMOVE@${DB_NAME}.agent_sessions"
AUDIT_APPEND="FIND@${DB_NAME}.agent_audit_log,INSERT@${DB_NAME}.agent_audit_log"
atlas customDbRoles create role_writer_agents --privilege "${AGENTS_RW},${AUDIT_APPEND}" --projectId $PROJECT_ID
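
The hand-enumerated AGENTS_RW string follows a regular pattern (every action on every collection), so it could also be generated, in the spirit of the role_writer_plans loop. This is a minimal sketch: build_privileges is a hypothetical helper, and it only builds strings without calling Atlas.

```shell
# build_privileges DB "ACTION ..." "COLLECTION ..."
# Prints ACTION@DB.COLLECTION pairs, comma-separated, grouped by collection.
build_privileges() {
  db="$1"; actions="$2"; colls="$3"
  out=""
  for c in $colls; do
    for a in $actions; do
      out="${out}${a}@${db}.${c},"
    done
  done
  printf '%s\n' "${out%,}"   # drop the trailing comma
}

DB_NAME=askflorence
AGENTS_RW=$(build_privileges "$DB_NAME" "FIND INSERT UPDATE REMOVE" "agents agencies agent_sessions")
# Append-only: FIND and INSERT, with no UPDATE or REMOVE on the audit log.
AUDIT_APPEND=$(build_privileges "$DB_NAME" "FIND INSERT" "agent_audit_log")
printf '%s\n' "${AGENTS_RW},${AUDIT_APPEND}"
```

Printing the combined string before passing it to --privilege makes it easy to eyeball that agent_audit_log never receives UPDATE or REMOVE.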

`role_admin_agents` — same as above plus `admins`

bash

ADMINS_RW="FIND@${DB_NAME}.admins,INSERT@${DB_NAME}.admins,UPDATE@${DB_NAME}.admins,REMOVE@${DB_NAME}.admins"
atlas customDbRoles create role_admin_agents --privilege "${AGENTS_RW},${AUDIT_APPEND},${ADMINS_RW}" --projectId $PROJECT_ID

Why not --inheritedRole? Atlas rejected --inheritedRole role_writer_agents@askflorence with ATLAS_INVALID_CUSTOM_ROLE_INHERITED_SCOPE. Custom-role inheritance in Atlas is strict about the scope — explicit enumeration is more robust and the privilege list is still short.

`role_audit_reader`

bash

atlas customDbRoles create role_audit_reader \
  --privilege FIND@${DB_NAME}.agent_audit_log \
  --projectId $PROJECT_ID

Step 3 — Create the six users

Important: when assigning a custom role to a user, the role must be referenced as role_name@admin, not role_name@${DB_NAME}. Atlas rejects the latter with UNSUPPORTED_ROLE: Custom role X must scoped to admin database. The role's privileges target askflorence.* collections; the role itself is assigned via @admin.

Generate a 32-char alphanumeric password per user (openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32). Do not reuse passwords across users.

bash

declare -a USERS=(
  "app_read_staging:read@${DB_NAME}:MONGODB_URI"
  "app_writer_survey:role_writer_survey@admin:MONGODB_URI_SURVEY_WRITE"
  "app_writer_plans:role_writer_plans@admin:MONGODB_URI_PLANS_WRITE"
  "app_writer_agents:role_writer_agents@admin:MONGODB_URI_AGENTS_WRITE"
  "app_admin_agents:role_admin_agents@admin:MONGODB_URI_AGENTS_ADMIN"
  "audit_reader:role_audit_reader@admin:MONGODB_URI_AUDIT_READ"
)

PW_FILE=/tmp/.atlas-provision-pws   # mode 600, deleted at the end
( umask 077; : > "$PW_FILE" )       # create with restrictive permissions from the start
chmod 600 "$PW_FILE"                # tighten if the file already existed

for entry in "${USERS[@]}"; do
  DB_USER="${entry%%:*}"            # not USER: that would clobber the shell's login-name variable
  rest="${entry#*:}"
  ROLE="${rest%%:*}"
  ENV_NAME="${rest#*:}"
  PW=$(openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32)
  atlas dbusers create --username "$DB_USER" --password "$PW" --role "$ROLE" --projectId "$PROJECT_ID"
  echo "${ENV_NAME}=mongodb+srv://${DB_USER}:${PW}@${CLUSTER_HOST}/${DB_NAME}?retryWrites=true&w=majority" >> "$PW_FILE"
done
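
Before running the loop for real, it can help to dry-run the two pieces that are easy to get wrong: the user:role:env splitting and the password shape. The sketch below does that without touching Atlas. parse_entry and gen_password are hypothetical helper names, and the password check needs openssl on the PATH.

```shell
# gen_password: 32 alphanumeric characters, same pipeline as the provisioning loop.
gen_password() {
  openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32
}

# parse_entry "user:role:env" -> prints the three fields, one per line.
parse_entry() {
  entry="$1"
  user="${entry%%:*}"      # text before the first colon
  rest="${entry#*:}"       # text after the first colon
  role="${rest%%:*}"
  env_name="${rest#*:}"
  printf '%s\n%s\n%s\n' "$user" "$role" "$env_name"
}

parse_entry "app_writer_plans:role_writer_plans@admin:MONGODB_URI_PLANS_WRITE"

# Guard against a short or non-alphanumeric password before handing it to Atlas.
pw=$(gen_password)
case "$pw" in
  *[!A-Za-z0-9]*) echo "unexpected character in password" >&2 ;;
esac
[ "${#pw}" -eq 32 ] || echo "password is ${#pw} chars, expected 32" >&2
```

Alphanumeric-only passwords matter here because the value is embedded in a connection URI, where characters such as / or @ would need percent-encoding.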

For the prod rollout, rename the first user from app_read_staging to app_read_prod (or just keep the existing app-read and skip that entry).

Step 4 — Write creds to the env file

Move the contents of $PW_FILE to the appropriate local env file:

Staging cluster: .env.staging.local (mode 600, gitignored).
Prod cluster: do not write to .env.local on a dev machine. Move directly to Vercel env (or AWS Secrets Manager post-migration) and securely share with only the engineers who need local prod access.

Then:

bash

rm "$PW_FILE"

Step 5 — Verify (positive + negative probes)

Source the env file with the line-by-line loader (raw source breaks on & in SRV strings):

bash

while IFS= read -r line; do
  [[ -z "$line" || "$line" == \#* ]] && continue
  k="${line%%=*}"; v="${line#*=}"
  export "$k=$v"
done < .env.staging.local
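
To see why the line-by-line approach is safe, here is the same loader wrapped in a function and exercised against a throwaway file. This is a sketch with a made-up variable name (MONGODB_URI_DEMO) and a fake URI; load_env_file is a hypothetical helper.

```shell
# load_env_file FILE: export KEY=VALUE lines verbatim, skipping blanks and comments.
load_env_file() {
  # The "|| [ -n ... ]" clause also handles a final line with no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    # Split on the first "=" only, so "=" and "&" inside the value survive untouched.
    export "${line%%=*}=${line#*=}"
  done < "$1"
}

# Demo against a temp file holding a fake credential.
demo=$(mktemp)
printf '%s\n' '# comment' 'MONGODB_URI_DEMO=mongodb+srv://u:p@example.invalid/db?retryWrites=true&w=majority' > "$demo"
load_env_file "$demo"
rm -f "$demo"
printf '%s\n' "$MONGODB_URI_DEMO"
```

Because the value is never evaluated by the shell, the & in the query string is not treated as a background operator.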

Run all 12 probes — expect 6 positive ACKs and 6 "user is not allowed to do action" denials. The full probe script is in docs/session-log/2026-04-17-atlas-staging.md. Any unexpected outcome means a role privilege is wrong — stop and fix before handing the creds to any consumer.
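
The full probe script lives in the session log. Purely as an illustration, a wrapper could classify each probe's outcome so the 6/6 split is easy to tally. classify_probe is a hypothetical helper, and the assumption that mongosh exits non-zero when an --eval expression throws should be confirmed against the mongosh version in use.

```shell
# classify_probe EXIT_STATUS OUTPUT_TEXT -> prints ack | denied | error
classify_probe() {
  if [ "$1" -eq 0 ]; then
    echo ack
  else
    case "$2" in
      *"not allowed to do action"*) echo denied ;;   # the expected negative result
      *) echo error ;;                               # auth, network, or anything else
    esac
  fi
}

# Intended use (assumption: a failed probe makes mongosh exit non-zero):
#   out=$(mongosh "$MONGODB_URI_AUDIT_READ" --quiet --eval 'PROBE_EXPRESSION' 2>&1)
#   classify_probe "$?" "$out"

# Demo with canned outputs instead of a live cluster.
classify_probe 0 '{ acknowledged: true }'
classify_probe 1 'MongoServerError: user is not allowed to do action [remove] on [askflorence.agent_audit_log]'
```

Separating denied from error keeps a connectivity failure from being miscounted as a successful negative probe.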

Step 6 — Probe-row hygiene

The positive probe against role_writer_agents inserts {_probe: true} into agent_audit_log. By design of ADR 0002, this row cannot be deleted by any app-tier user. Leave it; it ages out with retention.

Step 7 — Cleanup

Ensure $PW_FILE was deleted.
If a temp restore admin was used for a seeded cluster, delete it (atlas dbusers delete tmp_restore_admin --projectId $PROJECT_ID --force).
Remove any local mongodump output (rm -rf ./tmp/prod-snapshot/).

When you're done — handoff

Produce a session brief with: project ID, cluster host, six usernames, env file location, region/tier/version. Never include passwords. Briefs live in docs/briefs/ — the staging handoff is at docs/briefs/SESSION_BRIEF_2026-04-17_atlas.md; for the prod rollout use the same filename pattern (SESSION_BRIEF_<YYYY-MM-DD>_<topic>.md).

Rollback

For a newly-minted project, one command drops everything:

bash

atlas projects delete $PROJECT_ID

For an in-flight provisioning against an existing project (e.g. prod), revert by deleting each user created this session and each custom role created this session, in that order (users first, so the roles are unused).
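
One way to keep the ordering straight is to print the delete commands first and review them before running anything. The sketch below only prints; print_rollback_plan is a hypothetical helper, and the user and role lists should be trimmed to what this session actually created.

```shell
# print_rollback_plan PROJECT_ID "USER ..." "ROLE ..."
# Prints delete commands, users first and roles second. Nothing is executed.
print_rollback_plan() {
  project="$1"
  for u in $2; do
    printf 'atlas dbusers delete %s --projectId %s --force\n' "$u" "$project"
  done
  for r in $3; do
    printf 'atlas customDbRoles delete %s --projectId %s --force\n' "$r" "$project"
  done
}

print_rollback_plan "${PROJECT_ID:-your-project-id}" \
  "app_writer_survey app_writer_plans app_writer_agents app_admin_agents audit_reader" \
  "role_writer_survey role_writer_plans role_writer_agents role_admin_agents role_audit_reader"
```

After reviewing the printed plan, the lines can be run one at a time so a failure on any user stops the process before roles are touched.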

Cross-cluster `app_read_staging` user — Phase 11

This user lives on the staging Atlas project (askflorence-staging, project_id 69e31af12fd2c0aef51bbb41). The prod app uses it over AWS PrivateLink to fetch non-PHI public CMS reference data (formularies_staging, providers_staging). Decision rationale: ADR 0004.

The user is distinct from the staging-side app_writer_* / app_admin_* users defined earlier in this runbook — those scope writes to staging-app collections; app_read_staging is a narrow read-only consumer used by external (prod) callers only.

Step A — Generate a strong password

bash

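PW_FILE=$(mktemp)        # mktemp creates the file with mode 600
# Alphanumeric only: raw base64 can contain + / = which would need
# percent-encoding inside the connection string built in Step C.
openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32 > "$PW_FILE"
chmod 600 "$PW_FILE"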

The password is later loaded into AWS Secrets Manager as part of the connection string (prod/mongodb/reference-uri) on the prod AWS account. Never commit, paste into chat, or store outside Secrets Manager.

Step B — Create the Atlas user with the custom `role_reader_reference` role

bash

PROJECT_ID=69e31af12fd2c0aef51bbb41   # staging
DB_NAME=askflorence

# First, ensure the custom role exists (one-time per project — idempotent).
# If it already exists this returns a 409; safe to ignore.
# Canonical scope: 4 collections (see ADR 0004 amendment 2026-05-11 /
# ENG-257 closeout). Two for runtime tier-fallback (formularies_staging +
# providers_staging) and two for audit re-validation (plans +
# mrpuf_issuers_staging) — all part of the §1311 / MRF reference dataset.
atlas customDbRoles create role_reader_reference \
  --privilege FIND@${DB_NAME}.formularies_staging,FIND@${DB_NAME}.providers_staging,FIND@${DB_NAME}.plans,FIND@${DB_NAME}.mrpuf_issuers_staging \
  --projectId $PROJECT_ID

# Then create the user with that role.
atlas dbusers create \
  --username app_read_staging \
  --password "$(cat "$PW_FILE")" \
  --role role_reader_reference@admin \
  --projectId $PROJECT_ID

The custom role role_reader_reference grants only FIND on the four §1311 / MRF reference collections: formularies_staging, providers_staging, plans, and mrpuf_issuers_staging. That is tighter than the built-in read role, which grants read on the entire askflorence DB. Per ADR 0004 amendment 2026-05-11 (ENG-257 closeout), the four-collection scope is the canonical permanent posture: two collections back the runtime tier-fallback path on prod, and two back periodic audit re-validation cycles (ENG-230, future ENG-231 refresh cycles). All four share the same non-PHI data classification and the same AWS PrivateLink network path. Two reasons we ship the tight version:

Defense in depth. Even though the data classification policy + Phase 1 static guard (scripts/audit/staging-collections-guard.ts) keep the staging cluster non-PHI, the principle of least privilege says the cross-cluster reader should only see what it needs.
Auditability. The Phase 2 nightly drift check (scripts/audit/staging-cluster-drift.ts) audits this exact role shape every night via the Atlas Admin API. Any drift (an extra collection grant, a wider action such as INSERT, an additional role on the user, or escalation to a built-in like read/readWrite) opens a P1 GitHub issue.

If you ever need to widen the role (new cross-cluster collection, etc.), update both the role on Atlas AND the constants STAGING_REFERENCE_READ_COLLECTIONS (src/lib/db.ts) + EXPECTED_READ_COLLECTIONS_SCRIPT_COPY (scripts/audit/staging-cluster-drift.ts) in lock-step. Otherwise the next nightly run flags the change as drift.

Emergency rollback (revert to the historical built-in read@askflorence, e.g. if Phase 2 audit ever breaks production cross-cluster reads):

bash

atlas dbusers update app_read_staging \
  --role read@${DB_NAME} \
  --projectId $PROJECT_ID

This is reversible, and the cross-cluster reader keeps working immediately. It does disable the data-classification posture, though: the user can read anything in the askflorence DB on the staging cluster. Investigate and revert the rollback as soon as the audit is fixed.

Step C — Resolve the private connection string

After AWS creates the VPC endpoint and Atlas approves the connection (per infra/envs/prod/atlas-staging-privatelink.tf), Atlas issues a private SRV connection string:

bash

atlas privateEndpoints aws describe 69fe75c5b02c024f32d2af50 \
  --projectId 69e31af12fd2c0aef51bbb41
# Look for endpointServiceName + connection string under interfaceEndpoints

The connection string format is:

mongodb+srv://app_read_staging:<password>@askflorence-staging-pl-0.<random>.mongodb.net/askflorence?retryWrites=true&w=majority
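
Step D expects the finished URI in a CONNECTION_STRING variable. A minimal sketch of assembling it follows, assuming the password is read from $PW_FILE in practice. build_reference_uri and needs_url_encoding are hypothetical helpers, and the host and password below are placeholders rather than real values.

```shell
# build_reference_uri USER PASSWORD HOST DB -> prints the SRV URI.
build_reference_uri() {
  printf 'mongodb+srv://%s:%s@%s/%s?retryWrites=true&w=majority\n' "$1" "$2" "$3" "$4"
}

# needs_url_encoding PASSWORD -> succeeds if the password contains characters
# outside the URI "unreserved" set, which must be percent-encoded first.
needs_url_encoding() {
  case "$1" in
    *[!A-Za-z0-9._~-]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Placeholder values: substitute "$(cat "$PW_FILE")" and the private endpoint
# host that Atlas reports for the real run.
pw='ExamplePasswordOnly123'
if needs_url_encoding "$pw"; then
  echo "percent-encode the password before building the URI" >&2
fi
CONNECTION_STRING=$(build_reference_uri app_read_staging "$pw" "pl-host.example.invalid" askflorence)
```

The variable is deliberately not echoed, so the credential stays out of terminal scrollback and shell logs.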

Step D — Store in prod Secrets Manager (project CMK encrypted)

bash

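# CONNECTION_STRING must hold the full private SRV URI from Step C,
# with the real password substituted. Set it without echoing it.
AWS_PROFILE=askflorence-prod aws secretsmanager put-secret-value \
  --secret-id prod/mongodb/reference-uri \
  --secret-string "$CONNECTION_STRING"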

The secret shell is created by Terraform (infra/envs/prod/secrets.tf) which sets kms_key_arn to the prod project CMK so the value is encrypted with our key, not the AWS-default key. Do not create the secret directly via aws secretsmanager create-secret — that uses the AWS-default KMS key and is inconsistent with the rest of our secret hygiene.

Step E — Wire the env binding

Already declared in infra/envs/prod/ecs.tf:

hcl

secrets_from_manager = {
  ...
  MONGODB_REFERENCE_URI = module.secrets.secret_arns["mongodb/reference-uri"]
}

The next ECS deploy bakes the binding into a fresh task def. Application code (getReferenceDb() in src/lib/db.ts) routes via MONGODB_REFERENCE_URI and falls back to MONGODB_URI when unset — dev + staging keep working without code changes.

Step F — Verify end-to-end

bash

# From a host that can reach prod ECS — typically a smoke test against
# askflorence.health that exercises the cross-cluster read path.
curl -sS -X POST https://askflorence.health/api/drugs/covered \
  -H 'Content-Type: application/json' \
  -d '{"rxcuis":["1364441"],"plan_id":"42261UT0060023","year":2026}'

Expected response: coverage=Covered, drug_tier=PreferredBrand. The drug_tier field is only populated by lookupStagingDrugTiers() reading from formularies_staging via the cross-cluster path — its presence is the proof.
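
A smoke test could assert on that field rather than rely on reading the output by eye. The sketch below assumes the response body is JSON with a drug_tier string field somewhere in it; the exact response shape is not specified here, so treat the pattern as a starting point. has_drug_tier is a hypothetical helper.

```shell
# has_drug_tier BODY -> succeeds when the body carries a non-empty drug_tier string.
# Assumption: the API returns JSON containing "drug_tier": "SomeTier".
has_drug_tier() {
  printf '%s' "$1" | grep -Eq '"drug_tier"[[:space:]]*:[[:space:]]*"[^"]+"'
}

# Demo with a canned body. In practice, capture the curl output into a variable
# and pass it in.
sample='{"coverage":"Covered","drug_tier":"PreferredBrand"}'
if has_drug_tier "$sample"; then
  echo "cross-cluster read path OK"
else
  echo "drug_tier missing: cross-cluster read path not exercised" >&2
fi
```

A null or absent drug_tier fails the check, which is the signal that the request did not reach formularies_staging.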

Step G — Cleanup

Ensure $PW_FILE was deleted (rm -f "$PW_FILE").
Confirm the Secrets Manager secret has a populated value without displaying it: aws secretsmanager get-secret-value --secret-id prod/mongodb/reference-uri --query 'SecretString' --output text | wc -c should report a non-zero byte count. Never paste the value into chat, a commit, or a log.

Step H — API key for nightly drift check (Phase 2)

Phase 2 of #100 / ENG-239 adds a nightly GitHub Actions workflow (.github/workflows/staging-cluster-drift.yml) that audits the live Atlas state of app_read_staging via the Atlas Admin API. To run the audit in CI, the workflow needs an Atlas Programmatic API key bound to GitHub Actions secrets.

One-time provisioning:

Create the API key in Atlas Org settings. Requires Org Owner role.
```
Atlas UI → Organization Settings → Access Manager → Applications → Create API Key
Name:        gh-actions-staging-drift-check
Org Permission: leave at default (no Org-level role)
```
On the next page, when prompted for project access:
```
Project:                     askflorence-staging (69e31af12fd2c0aef51bbb41)
Project permission:          Project Read Only
```
Do NOT grant Org-level permissions, and do NOT grant any other project. Project Read Only on staging is the only access this key needs (it reads app_read_staging's role + the role_reader_reference definition, nothing else).
Capture the public + private key pair. The private key is shown ONCE at creation. Save to 1Password under Atlas — gh-actions-staging-drift-check.
Bind to GitHub Actions secrets:
bash
```
gh secret set ATLAS_DRIFT_CHECK_PUBLIC_KEY  --body "<public-key>"
gh secret set ATLAS_DRIFT_CHECK_PRIVATE_KEY --body "<private-key>"
```
The workflow exposes these as MONGODB_ATLAS_PUBLIC_API_KEY + MONGODB_ATLAS_PRIVATE_API_KEY env vars at runtime — the atlas CLI consumes those automatically (no atlas auth login step needed in CI).

Manually trigger to verify. First run should pass (the role is in canonical state):

bash

gh workflow run staging-cluster-drift.yml
gh run list --workflow staging-cluster-drift.yml --limit 1

Rotating the key: Atlas does not auto-rotate programmatic API keys. Rotate annually, folding the rotation into one of the quarterly reviews that touch STAGING_ALLOWED_COLLECTIONS. To rotate:

Atlas UI → Org Settings → Access Manager → Applications → gh-actions-staging-drift-check → ... → Rotate

Then re-run gh secret set for both keys with the new values. The old key auto-deactivates after the rotation grace period.

Drift-check rollback (emergency): if the nightly audit ever produces false positives (e.g. Atlas API schema change), disable the schedule by editing .github/workflows/staging-cluster-drift.yml and removing the schedule: block (keeping workflow_dispatch: for manual runs while debugging). The workflow itself does not modify Atlas state — it's a read-only audit — so disabling it carries zero security blast radius beyond losing the nightly check.

Rollback

bash

# Atlas:
atlas dbusers delete app_read_staging \
  --projectId 69e31af12fd2c0aef51bbb41 --force

# AWS Secrets Manager (30-day recovery window):
AWS_PROFILE=askflorence-prod aws secretsmanager delete-secret \
  --secret-id prod/mongodb/reference-uri \
  --recovery-window-in-days 30

To fully tear down the cross-cluster path including the AWS PrivateLink endpoint + security group, see the rollback section in the Phase 11 session log.