
ADR 0008 — E2E testing strategy: local Playwright + PR-CI against staging, defer ephemeral PR previews ​

Status

Accepted — 2026-05-13.

Shipped under ENG-303 deferral + ENG-304 selection. Companion to ADR 0007 (terraform-driven deploy).

Context

ENG-284 shipped the foundation of "new work doesn't break old work" — preflight harness, PR-time required CI checks (typecheck, audits, build), post-deploy smoke (11 checks). That covers ~80% of the goal.

The remaining ~20% lives in E2E browser-level coverage. Today, the 9-step member + 7-step agent manual smoke flows in CLAUDE.md are run by Claude sessions via Preview MCP. At 40+ PRs/day (1-2 dev team scaling to ~3-4), running these manually per PR doesn't scale.

The Test Infrastructure & CI Reliability project set out four milestones to close the gap. This ADR answers one question: do we ship ephemeral PR preview environments (M2 / ENG-303) at the same time as Playwright (M3 / ENG-304), or ship Playwright alone and defer previews?

Decision

Defer M2 (ENG-303 ephemeral PR previews). Ship M3 (Playwright local + PR-CI against staging) and M4 (nightly drift) on cycle #1. Revisit M2 when specific triggers fire (see below).

E2E coverage shape going forward:

  1. Local Playwright via npm run preflight -- --full: devs run the suite against localhost:3004 (or staging) before pushing. A Husky pre-push hook enforces a fast subset, and the PR template requires preflight confirmation.
  2. PR-CI Playwright job: runs the same suite against https://stage.askflorence.health on every push to a PR targeting main. Required status check. Catches regressions from fork-PR contributors, who can't be required to run Husky.
  3. Nightly Playwright + post-deploy-smoke (M4 / ENG-305): the same suite on a 0 9 * * * cron (09:00 UTC daily) against staging + prod. Catches out-of-band drift (third-party API changes, cert rotations, infra config drift).
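
One way the three layers above could share a single suite is to resolve the target URL from the environment. This is a minimal sketch, not the repo's implementation: the variable names E2E_TARGET and E2E_BASE_URL and the helper names are hypothetical, and only the two URLs come from this ADR.

```typescript
// Hypothetical helper: E2E_TARGET and E2E_BASE_URL are assumed names,
// not taken from the repo. Only the two URLs come from the ADR.
type E2eTarget = "local" | "staging";

const BASE_URLS: Record<E2eTarget, string> = {
  local: "http://localhost:3004", // local preflight target
  staging: "https://stage.askflorence.health", // PR-CI and nightly target
};

type Env = Record<string, string | undefined>;

function isE2eTarget(value: string): value is E2eTarget {
  return Object.prototype.hasOwnProperty.call(BASE_URLS, value);
}

export function resolveBaseUrl(env: Env): string {
  // An explicit URL wins, so another environment can be targeted
  // without touching the specs.
  const override = env["E2E_BASE_URL"];
  if (override) return override;

  const target = env["E2E_TARGET"] ?? "local";
  if (!isE2eTarget(target)) {
    throw new Error(`Unknown E2E_TARGET: ${target}`);
  }
  return BASE_URLS[target];
}
```

In a Playwright config the result would typically be passed as use.baseURL, for example resolveBaseUrl(process.env). Failing loudly on an unknown target is a deliberate choice here: a typo should not silently run the suite against the wrong environment.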

Tool selection: Playwright. Reasoning detailed below.

Tool selection: alternatives considered

| Tool | Fit for us | Decision |
| --- | --- | --- |
| Playwright (microsoft/playwright) | Native TypeScript, free + OSS, free parallelization, deterministic auto-wait, trace viewer + video on failure, Chromium + Firefox + WebKit (incl. emulated Mobile Safari). State of JS 2024 shows it overtook Cypress in both satisfaction and usage growth. | ✅ Selected |
| Cypress (cypress.io) | Strong interactive debugger, time-travel test runner, popular in JS world. But: no WebKit/Safari support on roadmap (per Shiplight comparison), market share declining, paid parallelization. | ❌ |
| Selenium / WebdriverIO | Cross-language, older. Heavier setup, more flake without modern auto-wait, harder to maintain at our scale. | ❌ |
| Puppeteer | Chromium-only, lower-level than Playwright. Playwright is its de-facto successor for our use case. | ❌ |
| Stagehand (browserbase/stagehand) | OSS Playwright wrapper that lets you write tests in natural language via act / observe / extract / agent primitives. Reduces boilerplate by 60-70% vs raw Playwright + LLM glue. Free locally; pairs with Browserbase ($0.10-0.40/browser-minute) for cloud execution. | Consider as future enhancement; see "Tool evolution path" below. |
| Mabl / Reflect / Octomind (AI-native paid platforms) | Cloud-native, low-code, auto-healing tests, analytics dashboards. Engineering teams report 40% of QA cycles spent on flaky-test maintenance (Bug0 analysis). | ❌ Paid tier conflicts with pre-funding budget; would re-evaluate at Series A with audit pressure. |
| Autonoma (getautonoma.com) | AI-native, connects to source code, generates tests from analysis. Newer category. | ❌ Same paid-tier objection as the Mabl class. |

Why Playwright wins for our stage:

  • $0 license cost, $0 runtime cost (runs on local + GHA runners)
  • Same suite reused across local preflight, PR CI, nightly, and (future) ephemeral preview env
  • TypeScript-first matches our codebase
  • Trace viewer + video on failure means flakes diagnose quickly
  • Strong selector strategy when paired with data-testid attributes (industry best practice; BrowserStack guidance)
  • Future-compatible: if we want AI-augmented test generation later, Playwright is what those tools generate (e.g., Stagehand wraps Playwright; Octomind/Mabl emit Playwright)

Cost analysis: ephemeral previews vs PR-CI against staging

The core cost question. Numbers assume 40 PRs/day, or ~1,200 PRs/month (current pace; scaling to 2 devs may double it).

With ENG-303 (ephemeral PR previews) — DEFERRED ​

| Item | Calculation | Monthly |
| --- | --- | --- |
| Fargate task lifetime | ~840 preview PRs after path-filter × $0.012/hour × ~12h avg open | ~$120 |
| ECR image storage | 1,200 builds × ~500MB peak, ~50GB steady with lifecycle policy | ~$5 |
| GHA minutes (preview spin-up + smoke per push) | 40 PRs × 2 pushes × ~8 min × 30 days = 19,200 min; 2,000 free; 17,200 overage × $0.008/min | ~$138 |
| ALB target groups, CloudWatch logs | Negligible | ~$5 |
| Total ENG-303 incremental | | ~$270/month, ~$3,240/year |

Without ENG-303 (local Playwright + PR-CI against staging) — SELECTED ​

| Item | Calculation | Monthly |
| --- | --- | --- |
| Local Playwright | Runs on dev laptops | $0 |
| PR-CI Playwright against staging | 40 × 2 × ~3min × 30 = 7,200 min, overage 5,200 × $0.008 | ~$41 |
| Mongo writes (staging cluster, ~4,800 synthetic rows/month with cleanup) | Atlas M10 absorbs trivially | $0 |
| CMS API calls (~24,000/month) | Well under any rate limit | $0 |
| Total | | ~$41/month, ~$490/year |

Delta: ~$230/month, ~$2,750/year saved. At pre-funding stage with deliberate budget posture, this matters.
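
The arithmetic behind both totals can be reproduced in a few lines. This is a sketch using the ADR's own inputs; the function and constant names are illustrative, and it reads the $0.012 Fargate figure as an hourly rate, which is an assumption (it is the reading that reproduces ~$120).

```typescript
// Inputs are the ADR's figures; names are illustrative only.
const PRS_PER_DAY = 40;
const PUSHES_PER_PR = 2;
const DAYS_PER_MONTH = 30;
const GHA_FREE_MINUTES = 2_000;
const GHA_OVERAGE_RATE = 0.008; // USD per minute beyond the free tier

// Total GitHub Actions minutes per month for a job of the given length.
export function ghaMinutes(minutesPerPush: number): number {
  return PRS_PER_DAY * PUSHES_PER_PR * minutesPerPush * DAYS_PER_MONTH;
}

// Only minutes above the free tier are billed.
export function ghaOverageCost(minutesPerPush: number): number {
  const billable = Math.max(0, ghaMinutes(minutesPerPush) - GHA_FREE_MINUTES);
  return billable * GHA_OVERAGE_RATE;
}

// With previews: Fargate (assumed hourly rate) + ECR + GHA (~8 min) + ALB/logs.
const fargate = 840 * 0.012 * 12;
export const withPreviews = fargate + 5 + ghaOverageCost(8) + 5;

// Without previews: only the ~3 min PR-CI job is billed.
export const withoutPreviews = ghaOverageCost(3);

export const monthlyDelta = withPreviews - withoutPreviews;
```

Unrounded, this gives about $268.56 vs $41.60 per month, a difference of roughly $227, which is consistent with the rounded ~$270, ~$41, and ~$230 figures above.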

What's the actual gap that ephemeral previews close?

Honest accounting of what ENG-303 catches that local Playwright + PR-CI-against-staging doesn't:

| Failure class | Local + PR-CI vs staging | Ephemeral preview | How often it happens |
| --- | --- | --- | --- |
| UI / form / hydration / autocomplete | ✅ catches | ✅ catches | Often (the bulk of regressions) |
| Calculator math / plan filtering | ✅ catches | ✅ catches | Common |
| API behavior (routes, validation, error shapes) | ✅ catches | ✅ catches | Common |
| Secret-binding gaps (ENG-272 class) | ❌ (uses local .env.local or staging shared) | ✅ catches per-PR | Already caught by Phase 2a audit + post-deploy smoke before prod |
| EBS Scheduler missing infra (ENG-274 class) | ❌ | ✅ catches per-PR | Already caught by Phase 2b audit + post-deploy smoke before prod |
| ALB routing / health check / target group config | ❌ | ✅ catches per-PR | Rare (~1/quarter, mostly on infra PRs) |
| CloudFront cache + WAF rules | ❌ | ✅ catches per-PR | Very rare |
| IAM permission gaps on task runtime role | ❌ | ✅ catches per-PR | Caught by staging deploy + smoke before prod |
| VPC networking / Atlas PrivateLink | ❌ | ✅ catches per-PR | Extremely rare |
| Cross-host cookie / domain behavior | ❌ | ✅ catches per-PR | Rare; only portal-handoff PRs |
| Reviewer "click URL and see change live" | ❌ | ✅ | UX convenience, not a bug class |
| Per-PR test data isolation (vs staging shared) | ❌ | ✅ | Only matters at 5+ devs concurrent |

Honest conclusion: ephemeral previews catch a narrow slice of bugs that aren't already caught elsewhere. The big-ticket failure classes (write paths, secret bindings, EBS Scheduler) are caught by:

  • Phase 2 static audits (manifest ↔ Terraform)
  • Post-deploy smoke (against staging, before prod cutover)
  • Local + PR-CI Playwright against staging

The unique adds — AWS-infra-specific bugs + reviewer convenience + per-PR isolation — are rare or low-value at our stage.

Triggers to revisit ENG-303 (when ephemeral previews become worth it)

Reopen ENG-303 when ANY of these fires:

  1. Team grows past 3-4 devs — staging shared-state contention becomes real ("my test data overlaps yours"; "your migration broke my smoke run"); the per-PR isolation benefit grows
  2. SOC 2 Type II audit within ~6 months — auditors may prefer per-PR isolated test envs as evidence of controls-in-environment-similar-to-production
  3. 2-3 ALB/IAM-runtime-role regressions in a row — signal that staging deploys aren't catching infra-specific bugs early enough
  4. Funded — pre-funding budget posture relaxes; the $2,750/year is no longer a meaningful decision driver
  5. Member portal cross-host work matures — flows that span apex + app.askflorence.health need real cross-domain testing (cookies, CSP, OAuth handoff); local Playwright can't fully simulate this

When a trigger fires, the implementation guide in the ENG-303 description is the starting point. The same Playwright suite ports over without a rewrite: just retarget the baseURL env var per PR.

Tool evolution path

We're choosing Playwright today but the tool category is moving fast. The path forward:

  1. Today (ENG-304 / M3): vanilla Playwright + TypeScript + data-testid selectors. ~4 specs to start (calculator, /plans coverage, agent waitlist, agent discovery). Target ~5min wall-clock for the suite.
  2. Q3 2026 (when member portal stabilizes): add specs for member portal flows. Same Playwright suite, more .spec.ts files.
  3. Q4 2026 (if maintenance burden grows): evaluate Stagehand as a Playwright wrapper that lets us write natural-language tests for less-stable surfaces. Open-source, MIT, $0 cost locally. ~60-70% reduction in test boilerplate per published benchmarks.
  4. 2027+ (post-funding, audit pressure): re-evaluate paid AI-native platforms (Mabl, Reflect, Octomind) if maintenance cost on the Playwright suite exceeds ~1 day/month of engineering time.

We're explicitly NOT locking ourselves out of the AI-native path. Playwright is the foundation; AI tools layer on top.

Selector strategy (codified to avoid data-testid debt later)

Per BrowserStack 2026 Playwright best practices and industry guidance:

  1. Prefer role-based selectors (page.getByRole("button", { name: "Submit" })) — semantic, accessibility-aligned, resilient to CSS refactors
  2. Use data-testid for selectors where text/role is ambiguous or hostile to change — calculator inputs, dynamic plan cards, autocomplete results
  3. Avoid CSS class selectors and XPath — they break on every CSS refactor
  4. data-testid discipline: small, ongoing maintenance tax accepted. Source-file changes that touch UI gain a data-testid if Playwright needs to reference the element. PR checklist nudges this.

Brittle selectors are the most-cited cause of Playwright flake per the Mergify analysis, and they are avoidable with selector discipline.
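
The PR-checklist nudge could in principle be backed by a small automated check. The sketch below is a hypothetical lint helper, not something the repo ships: it scans spec lines for .locator(...) calls and flags CSS-class and XPath selectors. The sample lines use real Playwright locator methods (getByRole, getByTestId, locator), but the test ids and selectors in them are invented for illustration, and the class detection is a heuristic.

```typescript
// Hypothetical lint helper; the sample selectors and test ids are invented.
export const SAMPLE_SPEC_LINES: string[] = [
  'await page.getByRole("button", { name: "Submit" }).click();', // preferred
  'await page.getByTestId("plan-card").first().click();', // role/text ambiguous
  'await page.locator(".plan-card").first().click();', // CSS class: avoid
  "await page.locator('//div[@id=\"results\"]').click();", // XPath: avoid
];

export function isBrittleSelector(selector: string): boolean {
  const s = selector.trim();
  const isXPath = /^(\/\/|xpath=)/.test(s);
  // Drop [attr="value"] blocks so dots inside attribute values are ignored,
  // then look for a ".className" token. Heuristic: may over-flag text selectors.
  const withoutAttrs = s.replace(/\[[^\]]*\]/g, "");
  const hasCssClass = /\.[A-Za-z_-]/.test(withoutAttrs);
  return isXPath || hasCssClass;
}

export function findBrittleSelectors(lines: string[]): string[] {
  const found: string[] = [];
  for (const line of lines) {
    // Capture the string literal passed as the first argument to .locator(...).
    const call = /\.locator\(\s*(["'`])(.*?)\1/g;
    let match: RegExpExecArray | null;
    while ((match = call.exec(line)) !== null) {
      if (isBrittleSelector(match[2])) found.push(match[2]);
    }
  }
  return found;
}
```

Run against SAMPLE_SPEC_LINES, the helper flags only the last two lines. A regex pass like this is cheap enough for a pre-push hook, but an ESLint rule would be the sturdier option if the check needs to understand the syntax tree.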

Flake budget + retries

Industry data: per the Bug0 analysis, mid-stage SaaS teams running at a 4% flake rate see ~1,000 spurious failures per week, all of which burn engineering attention on triage.

Our target: <1% flake on the stable-surface suite within 4 weeks of M3 ship. Pattern:

  • retries: 2 in the Playwright config (built-in per-test retry; handles transient network/hydration races)
  • Playwright auto-wait — built-in
  • Explicit await expect(locator).toBeVisible() before interactions
  • No page.waitForTimeout(N) calls — only event-based waits
  • Trace + video on failure → diagnosis under 5 min per flake
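
The pattern above maps onto a handful of standard Playwright config options. The following is a sketch of what tests/playwright.config.ts might contain, under stated assumptions: the buildConfig wrapper and its parameters are hypothetical, and a plain object is returned so the snippet has no dependencies (the real file would normally wrap it in defineConfig from @playwright/test).

```typescript
// Sketch only: buildConfig and its parameters are hypothetical.
// The option names (testDir, retries, fullyParallel, forbidOnly, reporter,
// use.baseURL, use.trace, use.video) are standard Playwright config options.
export function buildConfig(baseURL: string, isCI: boolean) {
  return {
    testDir: "./smoke", // resolved relative to the config file
    retries: 2, // absorbs transient network/hydration races
    fullyParallel: true, // run tests within a file in parallel too
    forbidOnly: isCI, // fail CI if a stray test.only is committed
    reporter: isCI ? "github" : "list",
    use: {
      baseURL, // lets specs call page.goto("/plans") with relative paths
      trace: "retain-on-failure", // keep traces only for failing tests
      video: "retain-on-failure", // keep videos only for failing tests
    },
  };
}

// In the config file itself this would be exported as the default, e.g.:
//   export default defineConfig(buildConfig(process.env.E2E_BASE_URL ?? "http://localhost:3004", !!process.env.CI));
// where E2E_BASE_URL is an assumed variable name.
```

Retaining traces and video only on failure keeps CI artifacts small while still supporting the under-5-minute diagnosis target.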

What lands tomorrow (cycle #1, due 2026-05-14)

| Issue | Effort | Outcome |
| --- | --- | --- |
| ENG-304 (M3) | ~1 day | 4 stable-surface Playwright specs + PR-CI workflow + husky pre-push integration + first-week flake monitoring |
| ENG-305 (M4) | ~0.5 day (builds on M3) | Nightly cron workflow runs the same suite + post-deploy-smoke against staging + prod. Auto-Linear on failure. |
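
For the first-week flake monitoring, the <1% target needs a concrete definition of "flaky". A minimal sketch, assuming the same definition Playwright's reporters use (a test that fails at least once and then passes on retry); the TestOutcome shape and function names are hypothetical, not an existing report format.

```typescript
// Hypothetical shape: one entry per test, with its attempts in order.
interface TestOutcome {
  name: string;
  attempts: Array<"passed" | "failed">;
}

// Flaky = failed at least once, but the final attempt passed.
export function isFlaky(outcome: TestOutcome): boolean {
  if (outcome.attempts.length === 0) return false;
  const last = outcome.attempts[outcome.attempts.length - 1];
  return last === "passed" && outcome.attempts.some((a) => a === "failed");
}

// Share of test runs that were flaky; 0 for an empty input.
export function flakeRate(outcomes: TestOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(isFlaky).length / outcomes.length;
}

export const FLAKE_BUDGET = 0.01; // the ADR's <1% target

export function withinBudget(outcomes: TestOutcome[]): boolean {
  return flakeRate(outcomes) < FLAKE_BUDGET;
}
```

Tests that fail on every attempt are deliberately not counted as flaky here; those are real failures and belong in a separate metric.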

Files / surfaces touched (when implementing)

  • New: tests/playwright.config.ts (main config with retries, parallel workers, baseURL)
  • New: tests/smoke/calculator.spec.ts, plans-coverage.spec.ts, agent-waitlist.spec.ts, agent-discovery.spec.ts
  • New: .github/workflows/playwright.yml (runs against staging on PR + cron schedule)
  • Modified: scripts/preflight.ts (adds Playwright to --full mode)
  • Modified: package.json (adds @playwright/test devDep + test:e2e script)
  • Modified: UI components touched per spec (add data-testid attributes where text/role selectors are insufficient)

References

  • ADR 0007 — Terraform-driven deploy (companion architectural decision)
  • ENG-303 — deferred, with revisit triggers in description
  • ENG-304 — Playwright UI smoke (the work that lands)
  • ENG-305 — Nightly drift detection (builds on ENG-304)
  • ENG-320 — In-VPC smoke runner (blocks Atlas allowlist tightening; separate threat model from this ADR)
  • docs/development/preflight.md — Layer 1 local CI mirror; gains Playwright in --full mode
  • docs/infrastructure/post-deploy-smoke.md — Layer 3 deploy-time smoke; Playwright integrates here under M4

External research consulted

  • State of JS 2024 — Playwright overtakes Cypress in satisfaction + usage growth
  • Cypress vs Playwright analysis — Shiplight 2026
  • Stagehand documentation + pricing — Browserbase
  • Stagehand on GitHub — MIT-licensed Playwright wrapper
  • Playwright best practices — BrowserStack guide
  • Flaky tests in Playwright — Mergify deep-dive
  • Playwright MCP changes the build vs buy equation — Bug0 analysis on AI testing economics

AskFlorence Internal Documentation. Not for public distribution.
