Session log — 2026-04-22 — Phase 7 staging Atlas VPC peering

Scope

Replace the public-internet path from staging ECS to MongoDB Atlas with an AWS VPC peering connection. Required first upgrading the staging Atlas cluster from the shared M0 tier (no peering support) to dedicated M10. End state: Atlas's IP access list holds a single entry (10.40.0.0/16, the staging VPC CIDR) and Mongo traffic rides a private fabric end-to-end. No application code change. No Vercel change. No production Atlas change.
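AWS VPC peering requires that the two VPC CIDR blocks do not overlap. As a minimal sketch of that precondition, the check below uses Python's standard `ipaddress` module with the two CIDRs recorded in this log; the function name `can_peer` is illustrative only.

```python
import ipaddress

# CIDRs recorded in this session log.
STAGING_VPC = ipaddress.ip_network("10.40.0.0/16")     # staging VPC
ATLAS_VPC = ipaddress.ip_network("192.168.248.0/21")   # Atlas network container


def can_peer(a, b):
    """Return True when the two networks share no addresses."""
    return not a.overlaps(b)


if __name__ == "__main__":
    print(can_peer(STAGING_VPC, ATLAS_VPC))
```

The same check is worth repeating before Phase 8 with the prod VPC CIDR and whatever CIDR the prod Atlas container reports.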

Actor

  • Human: Taha Abbasi.
  • Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.

Tickets

  • Advances Issue #47 Phase 7.
  • Adds an M10-era staging cluster as a Phase 8 rehearsal surface — Phase 8 will apply the same peering playbook to the prod project's existing M10 HIPAA cluster, taking advantage of lessons learned here.

External systems touched

MongoDB Atlas (staging project 69e31af12fd2c0aef51bbb41)

  • Cluster upgraded M0 → M10 (AWS us-east-1, MongoDB 8.0.21, 10 GB disk). Provider transitioned TENANT → AWS. SRV hostname preserved: mongodb+srv://askflorence-staging.efsikmv.mongodb.net. The atlas clusters upgrade command spent ~3 min in UPDATING before the cluster returned to IDLE. Database users, collections, and data all survived without intervention.
  • Network container auto-provisioned on the project at M10 upgrade: Atlas VPC vpc-0c1e118736ac1fb74 in Atlas's AWS account 354811016174, CIDR 192.168.248.0/21, region US_EAST_1.
  • Peering connection created: Atlas peering 69e939017b7816840c17063c, AWS-side pcx-05d74ae6d34a31a02. Status AVAILABLE on Atlas, active on AWS.
  • Allowlist reduced from 2 entries (NAT EIP 54.164.140.5/32 + operator laptop 136.38.212.186/32) to 1 entry (10.40.0.0/16). Public reachability to the cluster closed.
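The effect of the allowlist change can be modelled as plain CIDR membership. This is only a sketch of the before and after lists recorded above, not Atlas's actual evaluation logic; `is_allowed` and the example task address `10.40.12.34` are hypothetical.

```python
import ipaddress

# Allowlist entries recorded in this log.
BEFORE = ["54.164.140.5/32", "136.38.212.186/32"]  # NAT EIP + operator laptop
AFTER = ["10.40.0.0/16"]                           # staging VPC CIDR only


def is_allowed(source_ip, allowlist):
    """Return True when source_ip falls inside any CIDR in the allowlist."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowlist)


if __name__ == "__main__":
    # 10.40.12.34 is a made-up private address inside the staging VPC.
    for source in ("54.164.140.5", "136.38.212.186", "10.40.12.34"):
        print(source, is_allowed(source, BEFORE), is_allowed(source, AFTER))
```

Under the new list, a connection is admitted only if it arrives with a private source address from the staging VPC, which is possible only over the peering connection.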

AWS (staging account 549136075525)

  • VPC peering accepter pcx-05d74ae6d34a31a02 — accepted + tagged + DNS-from-remote-VPC enabled on accepter side.
  • Private route tables rtb-0b5a5b1da1f0a99c4 + rtb-00fc1026859373d4f: added routes 192.168.248.0/21 → pcx-05d74ae6d34a31a02.
  • Network module updated — aws_route_table.private now has lifecycle { ignore_changes = [route] } so external aws_route resources don't conflict with the inline default-route list. New module outputs private_route_table_ids + public_route_table_ids.
  • ECS service force-new-deployment to rotate the running task. Old task had cached DNS + MongoClient state from the pre-peering era and couldn't reach Atlas after the allowlist tightened; new task resolved shard hostnames from within the peered VPC and got Atlas's private IPs via split-horizon DNS.
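The new route works because VPC route tables pick the most specific matching prefix. The sketch below illustrates that selection for a simplified private route table: the peering route comes from this log, while the local and default routes and the labels `local` and `nat-gateway` are assumptions about the table's other entries.

```python
import ipaddress

# Simplified private route table. Only the peering route is taken from the log;
# the "local" and "nat-gateway" entries are assumed for illustration.
ROUTES = [
    ("10.40.0.0/16", "local"),
    ("192.168.248.0/21", "pcx-05d74ae6d34a31a02"),
    ("0.0.0.0/0", "nat-gateway"),
]


def next_hop(dest_ip, routes=ROUTES):
    """Return the target of the longest-prefix route matching dest_ip, or None."""
    ip = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches)[1]


if __name__ == "__main__":
    print(next_hop("192.168.250.17"))  # an address inside the Atlas CIDR
    print(next_hop("54.164.140.5"))    # any public address
```

Addresses inside 192.168.248.0/21 leave through the peering connection, while every other non-local destination still follows the default route.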

Prod systems

  • Untouched. Prod Atlas project, prod AWS account, and Vercel were not modified, and no discovery commands were run against them.

What shipped

Terraform-managed end-to-end after a Phase 7 CLI handshake:

  1. infra/envs/staging/peering.tf (new) — aws_vpc_peering_connection_accepter.atlas_staging + aws_route.atlas_from_private_{a,b}. Peering was accepted out-of-band via the AWS CLI first (because Atlas created it and sync required a handshake), then imported into Terraform state. Current plan is clean.
  2. infra/modules/network/main.tf — aws_route_table.private got lifecycle { ignore_changes = [route] } so inline routes + external peering routes can coexist.
  3. infra/modules/network/outputs.tf — added private_route_table_ids + public_route_table_ids.

The blocker that took 30 minutes to diagnose

After removing 54.164.140.5/32 from the Atlas allowlist, /api/waitlist started returning HTTP 504 Gateway Timeout with ECS logs showing MongoServerSelectionError: Server selection timed out after 30000 ms listing all three Atlas shards by their public hostnames.

Initial hypothesis: Atlas's split-horizon DNS wasn't resolving shard hostnames to private IPs from within our peered VPC. Checked AllowDnsResolutionFromRemoteVpc on both sides (true on accepter, null on requester which is expected cross-account), checked VPC DNS settings (enableDnsHostnames + enableDnsSupport both true), checked route table entries (both private RTs correctly routed 192.168.248.0/21 via pcx). Everything on paper was right.

Actual cause: the running ECS task had started before peering went live. Its MongoClient held pooled connections to Atlas over the public path, and once those connections went stale, its cached DNS results (held either by the driver or by the Node runtime) still pointed at the shards' public IPs. Those IPs were no longer reachable because the Atlas allowlist no longer admitted the NAT EIP.
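One way to tell the two states apart is to check whether a shard hostname resolves into the Atlas peer CIDR. The following is a diagnostic sketch, not something run during this session: `classify` and `resolve_and_classify` are hypothetical helpers, and the hostname to pass in is whichever shard name appears in the MongoServerSelectionError output.

```python
import ipaddress
import socket

# Atlas network container CIDR recorded in this log.
ATLAS_PEER_CIDR = ipaddress.ip_network("192.168.248.0/21")


def classify(ip):
    """Label an address by whether it sits inside the Atlas peer CIDR."""
    return "peered" if ipaddress.ip_address(ip) in ATLAS_PEER_CIDR else "not-peered"


def resolve_and_classify(hostname, port=27017):
    """Resolve hostname with a fresh lookup and classify each IPv4 result."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # info[4] is the (address, port) pair for AF_INET results.
    return {info[4][0]: classify(info[4][0]) for info in infos}
```

Because this performs a fresh lookup, it shows what a newly started process would see from the same network. It cannot reveal what an already running task has cached, which is why the on-paper checks all passed while the old task kept failing.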

Fix was cheap: aws ecs update-service --force-new-deployment. Fresh task → fresh SRV lookup from within the peered VPC → Atlas's split-horizon DNS returned private IPs → TLS handshake succeeded via the peering connection → /api/waitlist went green without the NAT EIP in the allowlist.

Noting this explicitly so Phase 8 prod runs force-new-deployment in the same breath as tightening the allowlist, not 20 minutes later.
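For Phase 8, the rollout step could be scripted so it runs immediately after the allowlist change. The sketch below only builds and optionally runs the two AWS CLI calls; the cluster and service names are placeholders, and the allowlist change itself is left as a manual step before calling it.

```python
import subprocess


def rollout_commands(cluster, service):
    """Return the AWS CLI calls that rotate the running task and wait for it."""
    return [
        ["aws", "ecs", "update-service",
         "--cluster", cluster, "--service", service, "--force-new-deployment"],
        ["aws", "ecs", "wait", "services-stable",
         "--cluster", cluster, "--services", service],
    ]


def run_rollout(cluster, service, dry_run=True):
    """Print the commands when dry_run is True, otherwise execute them in order."""
    for cmd in rollout_commands(cluster, service):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder names; substitute the real cluster and service.
    run_rollout("example-cluster", "example-service", dry_run=True)
```

Waiting on `services-stable` before re-running the verification requests avoids testing against the old task while it is still draining.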

Verification

  • GET /api/health through CloudFront → HTTP 200.
  • POST /api/waitlist with Atlas allowlist [10.40.0.0/16] only, NAT EIP absent, operator laptop IP absent → HTTP 200 + real Mongo waitlist_submission_id. This is the definitive proof that the peering path is carrying traffic — no other path is reachable.
  • GET /api/counties?state=TX&zip=75001 → HTTP 200 with TX county data — confirms CMS proxy path (which egresses to the public internet via NAT) is unaffected.
  • atlas accessLists list → exactly 1 entry (10.40.0.0/16).
  • aws ec2 describe-vpc-peering-connections --vpc-peering-connection-ids pcx-05d74ae6d34a31a02 → status active, DNS resolution from remote VPC enabled on accepter side.
  • terraform plan clean after import + lifecycle change + apply.
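The peering check above could be automated by parsing the JSON that `aws ec2 describe-vpc-peering-connections` returns. This is a sketch under the assumption that the command is run with JSON output; `peering_ok` is a hypothetical helper and `SAMPLE` is a trimmed, illustrative payload rather than captured output.

```python
import json


def peering_ok(describe_output):
    """Return True for exactly one active connection with accepter-side DNS enabled."""
    connections = json.loads(describe_output)["VpcPeeringConnections"]
    if len(connections) != 1:
        return False
    conn = connections[0]
    active = conn["Status"]["Code"] == "active"
    options = conn.get("AccepterVpcInfo", {}).get("PeeringOptions", {})
    return active and options.get("AllowDnsResolutionFromRemoteVpc") is True


# Trimmed, illustrative payload shaped like the CLI's JSON output.
SAMPLE = json.dumps({
    "VpcPeeringConnections": [{
        "VpcPeeringConnectionId": "pcx-05d74ae6d34a31a02",
        "Status": {"Code": "active"},
        "AccepterVpcInfo": {
            "CidrBlock": "10.40.0.0/16",
            "PeeringOptions": {"AllowDnsResolutionFromRemoteVpc": True},
        },
    }]
})

if __name__ == "__main__":
    print(peering_ok(SAMPLE))
```

The requester-side option is deliberately not checked, since this log records it as null from the accepter account and treats that as expected for a cross-account connection.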

What this session does NOT do

  • Does not touch Vercel prod. askflorence.health + www unchanged, as in every other phase so far.
  • Does not touch prod Atlas. Prod cluster's network path remains what it was (Vercel ingress allowlist). Phase 8 re-peers prod to the new prod VPC.
  • Does not enable cluster backup / PITR on staging. M10 supports it but staging has no PHI and no recovery requirement. Backup flag stays off.
  • Does not rotate any secrets. SRV hostname preserved through M0→M10 upgrade; staging/mongodb/* Secrets Manager entries still correct.
  • Does not remove the Atlas IP access list entirely. 10.40.0.0/16 stays on. Atlas requires at least one allowlist entry for authenticated access; VPC peering is the network substrate, the allowlist is the envelope.

Next

  • Phase 8: prod account mirror. Build askflorence-prod (039624954211) to the same shape as staging is today — VPC, KMS, secrets, ACM, SES, ECR, ECS, ALB, CloudFront + WAF. Re-peer the existing M10 HIPAA prod Atlas cluster to the new prod VPC. Add deploy-prod.yml GitHub Actions workflow behind a protected-environment manual approval. No DNS change at Cloudflare — the apex still points at Vercel. Prod ECS reachable only via a private canary hostname for validation.

AskFlorence Internal Documentation. Not for public distribution.