ADR 0008 — E2E testing strategy: local Playwright + PR-CI against staging, defer ephemeral PR previews

Status

Accepted — 2026-05-13.

Shipped under ENG-303 deferral + ENG-304 selection. Companion to ADR 0007 (terraform-driven deploy).

Context

ENG-284 shipped the foundation of "new work doesn't break old work" — preflight harness, PR-time required CI checks (typecheck, audits, build), post-deploy smoke (11 checks). That covers ~80% of the goal.

The remaining ~20% lives in E2E browser-level coverage. Today, the 9-step member + 7-step agent manual smoke flows in CLAUDE.md are run by Claude sessions via Preview MCP. At 40+ PRs/day (1-2 dev team scaling to ~3-4), running these manually per PR doesn't scale.

The Test Infrastructure & CI Reliability project set out four milestones to close the gap. This ADR answers one decision question: do we ship ephemeral PR preview environments (M2 / ENG-303) at the same time as Playwright (M3 / ENG-304), or ship Playwright alone and defer previews?

Decision

Defer M2 (ENG-303 ephemeral PR previews). Ship M3 (Playwright local + PR-CI against staging) and M4 (nightly drift) on cycle #1. Revisit M2 when specific triggers fire (see below).

E2E coverage shape going forward:

Local Playwright via npm run preflight -- --full: devs run the suite against localhost:3004 (or staging) before push. A Husky pre-push hook enforces a fast subset, and the PR template requires preflight confirmation.
PR-CI Playwright job: runs the same suite against https://stage.askflorence.health on every push to a PR targeting main. It is a required status check, and it catches regressions from fork-PR contributors, who can't be required to run husky.
Nightly Playwright + post-deploy-smoke (M4 / ENG-305): the same suite on a `0 9 * * *` cron (09:00 UTC daily) against staging + prod. Catches out-of-band drift (third-party API changes, cert rotations, infra config drift).

Tool selection: Playwright. Reasoning detailed below.

Tool selection: alternatives considered

Tool	Fit for us	Decision
Playwright (microsoft/playwright)	Native TypeScript, free + OSS, free parallelization, deterministic auto-wait, trace viewer + video on failure, Chromium + Firefox + WebKit (incl. emulated Mobile Safari, not a real iOS device). State of JS 2024 shows it overtook Cypress in both satisfaction and usage growth.	✅ Selected
Cypress (cypress.io)	Strong interactive debugger, time-travel test runner, popular in JS world. But: no WebKit/Safari support on roadmap (per Shiplight comparison), market share declining, paid parallelization.	❌
Selenium / WebdriverIO	Cross-language, older. Heavier setup, more flake without modern auto-wait, harder to maintain at our scale.	❌
Puppeteer	Chromium-only, lower-level than Playwright. Playwright is its de-facto successor for our use case.	❌
Stagehand (browserbase/stagehand)	OSS Playwright wrapper that lets you write tests in natural language via `act` / `observe` / `extract` / `agent` primitives. Reduces boilerplate by 60-70% vs raw Playwright + LLM glue. Free locally; pairs with Browserbase ($0.10-0.40/browser-minute) for cloud execution.	Consider as future enhancement — see "Tool evolution path" below.
Mabl / Reflect / Octomind (AI-native paid platforms)	Cloud-native, low-code, auto-healing tests, analytics dashboards. Engineering teams report 40% of QA cycles spent on flaky-test maintenance (Bug0 analysis).	❌ — paid tier conflicts with pre-funding budget; would re-evaluate at Series A with audit pressure.
Autonoma (getautonoma.com)	AI-native, connects to source code, generates tests from analysis. Newer category.	❌ — same paid-tier objection as Mabl class.

Why Playwright wins for our stage:

$0 license cost, $0 runtime cost (runs on local + GHA runners)
Same suite reused across local preflight, PR CI, nightly, and (future) ephemeral preview env
TypeScript-first matches our codebase
Trace viewer + video on failure means flakes diagnose quickly
Strong selector strategy when paired with data-testid attributes (industry best practice; BrowserStack guidance)
Future-compatible: if we want AI-augmented test generation later, Playwright is what those tools generate (e.g., Stagehand wraps Playwright; Octomind/Mabl emit Playwright)

Cost analysis: ephemeral previews vs PR-CI against staging

The core cost question. Numbers based on 40 PRs/day = ~1,200 PRs/month (current pace; scaling to 2 devs may double).

With ENG-303 (ephemeral PR previews) — DEFERRED

Item	Calculation	Monthly
Fargate task lifetime	~840 preview PRs after path-filter × ~12h avg open × $0.012/task-hour	~$120
ECR image storage	1,200 builds × ~500MB peak, ~50GB steady with lifecycle policy	~$5
GHA minutes (preview spin-up + smoke per push)	40 × 2 pushes × ~8min × 30 = 19,200 min, free tier 2,000, overage $0.008/min	~$138
ALB target groups, CloudWatch logs	Negligible	~$5
Total ENG-303 incremental		~$270/month, ~$3,240/year

Without ENG-303 (local Playwright + PR-CI against staging) — SELECTED

Item	Calculation	Monthly
Local Playwright	Runs on dev laptops	$0
PR-CI Playwright against staging	40 × 2 × ~3min × 30 = 7,200 min, overage 5,200 × $0.008	~$41
Mongo writes (staging cluster, ~4,800 synthetic rows/month with cleanup)	Atlas M10 absorbs trivially	$0
CMS API calls (~24,000/month)	Well under any rate limit	$0
Total		~$41/month, ~$490/year

Delta: ~$230/month, ~$2,750/year saved. At pre-funding stage with deliberate budget posture, this matters.
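The arithmetic behind the two tables can be re-derived with a short script. This is a sketch that only restates the ADR's own estimates. The constant and function names are illustrative, and it assumes the $0.012 Fargate figure is an hourly rate, since that is the reading under which the row's product lands at about $120.

```typescript
// Inputs are the ADR's own estimates, not measurements.
const PRS_PER_DAY = 40;
const PUSHES_PER_PR = 2;
const DAYS_PER_MONTH = 30;
const GHA_FREE_MINUTES = 2000;
const GHA_OVERAGE_USD_PER_MIN = 0.008;

// Monthly GitHub Actions overage cost for a job that takes `minutesPerRun`.
function ghaMonthlyUsd(minutesPerRun: number): number {
  const totalMinutes = PRS_PER_DAY * PUSHES_PER_PR * minutesPerRun * DAYS_PER_MONTH;
  const billableMinutes = Math.max(0, totalMinutes - GHA_FREE_MINUTES);
  return billableMinutes * GHA_OVERAGE_USD_PER_MIN;
}

// Assumption: $0.012 is an hourly Fargate rate (840 PRs x 12 h x $0.012).
const fargateUsd = 840 * 12 * 0.012;
const ecrUsd = 5;
const albAndLogsUsd = 5;

const previewTotalUsd = fargateUsd + ecrUsd + ghaMonthlyUsd(8) + albAndLogsUsd;
const stagingTotalUsd = ghaMonthlyUsd(3);
const monthlyDeltaUsd = previewTotalUsd - stagingTotalUsd;

console.log({
  previewTotalUsd: Math.round(previewTotalUsd),
  stagingTotalUsd: Math.round(stagingTotalUsd),
  monthlyDeltaUsd: Math.round(monthlyDeltaUsd),
  yearlyDeltaUsd: Math.round(monthlyDeltaUsd * 12),
});
```

Unrounded, the monthly figures come to roughly $269, $42 and $227, within a few dollars of the tables' rounded values. The GHA line dominates both totals, so the estimate is most sensitive to pushes per PR and minutes per run.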

What's the actual gap that ephemeral previews close?

Honest accounting of what ENG-303 catches that local Playwright + PR-CI-against-staging doesn't:

Failure class	Local + PR-CI vs staging	Ephemeral preview	How often it happens
UI / form / hydration / autocomplete	✅ catches	✅ catches	Often (the bulk of regressions)
Calculator math / plan filtering	✅ catches	✅ catches	Common
API behavior (routes, validation, error shapes)	✅ catches	✅ catches	Common
Secret-binding gaps (ENG-272 class)	❌ (uses local `.env.local` or staging shared)	✅ catches per-PR	Already caught by Phase 2a audit + post-deploy smoke before prod
EBS Scheduler missing infra (ENG-274 class)	❌	✅ catches per-PR	Already caught by Phase 2b audit + post-deploy smoke before prod
ALB routing / health check / target group config	❌	✅ catches per-PR	Rare — happens ~1/quarter, mostly on infra PRs
CloudFront cache + WAF rules	❌	✅ catches per-PR	Very rare
IAM permission gaps on task runtime role	❌	✅ catches per-PR	Caught by staging deploy + smoke before prod
VPC networking / Atlas PrivateLink	❌	✅ catches per-PR	Extremely rare
Cross-host cookie / domain behavior	❌	✅ catches per-PR	Rare; only portal-handoff PRs
Reviewer "click URL and see change live"	❌	✅	UX convenience, not a bug class
Per-PR test data isolation (vs staging shared)	❌	✅	Only matters at 5+ devs concurrent

Honest conclusion: ephemeral previews catch only a narrow slice of bugs that nothing else catches. The big-ticket failure classes (write paths, secret bindings, EBS Scheduler) are already caught by:

Phase 2 static audits (manifest ↔ Terraform)
Post-deploy smoke (against staging, before prod cutover)
Local + PR-CI Playwright against staging

The unique adds — AWS-infra-specific bugs + reviewer convenience + per-PR isolation — are rare or low-value at our stage.

Triggers to revisit ENG-303 (when ephemeral previews become worth it)

Reopen ENG-303 when ANY of these fires:

Team grows past 3-4 devs — staging shared-state contention becomes real ("my test data overlaps yours"; "your migration broke my smoke run"); the per-PR isolation benefit grows
SOC 2 Type II audit within ~6 months, since auditors may prefer per-PR isolated test envs as evidence that controls operate in a production-like environment
2-3 ALB/IAM-runtime-role regressions in a row — signal that staging deploys aren't catching infra-specific bugs early enough
Funded — pre-funding budget posture relaxes; the $2,750/year is no longer a meaningful decision driver
Member portal cross-host work matures — flows that span apex + app.askflorence.health need real cross-domain testing (cookies, CSP, OAuth handoff); local Playwright can't fully simulate this

When a trigger fires, the implementation guide in the ENG-303 description is the starting point. The same Playwright suite ports over without a rewrite: only the baseURL env var is retargeted per PR.
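One way the retargeting could look is a small resolver that the Playwright config calls to pick its baseURL. This is a sketch, not the project's actual code: `E2E_BASE_URL` and `resolveBaseUrl` are hypothetical names, since the ADR only says "baseURL env var". The localhost default comes from the local-preflight description above.

```typescript
type EnvLike = Record<string, string | undefined>;

// Default target for local runs, per the ADR (localhost:3004).
const LOCAL_BASE_URL = "http://localhost:3004";

// Picks the origin the suite should run against.
// E2E_BASE_URL is a hypothetical variable name used for illustration.
function resolveBaseUrl(env: EnvLike): string {
  const override = (env["E2E_BASE_URL"] ?? "").trim();
  if (override === "") {
    return LOCAL_BASE_URL;
  }
  // Strip trailing slashes so relative paths in specs join cleanly.
  return override.replace(/\/+$/, "");
}
```

With this shape, local runs need no configuration, PR-CI would set the variable to the staging URL, and a future preview workflow would set it to the per-PR URL. The specs themselves would not change.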

Tool evolution path

We're choosing Playwright today but the tool category is moving fast. The path forward:

Today (ENG-304 / M3): vanilla Playwright + TypeScript + data-testid selectors. ~4 specs to start (calculator, /plans coverage, agent waitlist, agent discovery). Target ~5min wall-clock for the suite.
Q3 2026 (when member portal stabilizes): add specs for member portal flows. Same Playwright suite, more .spec.ts files.
Q4 2026 (if maintenance burden grows): evaluate Stagehand as a Playwright wrapper that lets us write natural-language tests for less-stable surfaces. Open-source, MIT, $0 cost locally. ~60-70% reduction in test boilerplate per published benchmarks.
2027+ (post-funding, audit pressure): re-evaluate paid AI-native platforms (Mabl, Reflect, Octomind) if maintenance cost on the Playwright suite exceeds ~1 day/month of engineering time.

We're explicitly NOT locking ourselves out of the AI-native path. Playwright is the foundation; AI tools layer on top.

Selector strategy (codified to avoid `data-testid` debt later)

Per BrowserStack 2026 Playwright best practices and industry guidance:

Prefer role-based selectors (page.getByRole("button", { name: "Submit" })) — semantic, accessibility-aligned, resilient to CSS refactors
Use data-testid for selectors where text/role is ambiguous or hostile to change — calculator inputs, dynamic plan cards, autocomplete results
Avoid CSS class selectors and XPath — they break on every CSS refactor
data-testid discipline: small, ongoing maintenance tax accepted. Source-file changes that touch UI gain a data-testid if Playwright needs to reference the element. PR checklist nudges this.

Brittle selectors are the most-cited cause of Playwright flake per Mergify analysis, and they are avoidable with selector discipline.
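The selector preference order can be illustrated with a small helper. This is a sketch under stated assumptions. `PageLike` and `LocatorLike` are minimal stand-ins for the parts of Playwright's `Page` and `Locator` used here, so the example runs without `@playwright/test` installed. The `calculator-zip` test id, the "Submit" label, and the `submitZip` helper are hypothetical.

```typescript
// Minimal structural stand-ins for the Playwright methods this sketch uses.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface PageLike {
  getByRole(role: "button" | "textbox" | "link", options?: { name?: string }): LocatorLike;
  getByTestId(testId: string): LocatorLike;
}

// Hypothetical calculator step; the test id and button label are illustrative only.
async function submitZip(page: PageLike, zip: string): Promise<void> {
  // data-testid: a calculator input whose role and label may be ambiguous or change.
  await page.getByTestId("calculator-zip").fill(zip);
  // Role-based: a named button survives CSS refactors and tracks accessibility.
  await page.getByRole("button", { name: "Submit" }).click();
}
```

In a real spec the `page` fixture from `@playwright/test` would be passed in. Note that the helper contains no CSS class or XPath selectors, which is the point of the policy.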

Flake budget + retries

Industry data: per Bug0 analysis, mid-stage SaaS teams at a 4% flake rate see ~1,000 spurious failures/week, which burns engineering attention on triage.

Our target: <1% flake on the stable-surface suite within 4 weeks of M3 ship. Pattern:

retries: 2 in the Playwright config (built-in; each failing test is retried up to twice, which absorbs transient network/hydration races)
Playwright auto-wait — built-in
Explicit await expect(locator).toBeVisible() before interactions
No page.waitForTimeout(N) calls — only event-based waits
Trace + video on failure → diagnosis under 5 min per flake
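The config-level parts of this pattern might look like the fragment below. It is a sketch, not the final tests/playwright.config.ts. The `testDir` value assumes the specs live in tests/smoke next to the config, and `retain-on-failure` is one reasonable choice among the trace and video modes Playwright offers. A plain object is used so the fragment has no imports; wrapping it in `defineConfig` from `@playwright/test` would add type checking.

```typescript
// tests/playwright.config.ts (sketch; values are assumptions, not settled choices)
const config = {
  // Assumed layout: specs in tests/smoke, config in tests/.
  testDir: "./smoke",
  // Each failing test is retried up to twice before it is reported as failed.
  retries: 2,
  // Run tests within a file in parallel as well as across files.
  fullyParallel: true,
  use: {
    // Keep diagnostics only for failing tests so artifacts stay small.
    trace: "retain-on-failure",
    video: "retain-on-failure",
  },
};

export default config;
```

Auto-wait needs no configuration. The remaining two rules (explicit visibility assertions, no fixed timeouts) are conventions inside the specs and would have to be enforced by review or lint, not by this file.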

What lands tomorrow (cycle #1, due 2026-05-14)

Issue	Effort	Outcome
ENG-304 (M3)	~1 day	4 stable-surface Playwright specs + PR-CI workflow + husky pre-push integration + first-week flake monitoring
ENG-305 (M4)	~0.5 day (builds on M3)	Nightly cron workflow runs the same suite + post-deploy-smoke against staging + prod. A Linear issue is filed automatically on failure.

Files / surfaces touched (when implementing)

New: tests/playwright.config.ts (main config: retries, parallel workers, baseURL)
New: tests/smoke/calculator.spec.ts, plans-coverage.spec.ts, agent-waitlist.spec.ts, agent-discovery.spec.ts
New: .github/workflows/playwright.yml — runs against staging on PR + cron schedule
Modified: scripts/preflight.ts — adds Playwright to --full mode
Modified: package.json — adds @playwright/test devDep + test:e2e script
Modified: UI components touched per spec — add data-testid attributes where text/role selectors are insufficient

References

ADR 0007 — Terraform-driven deploy (companion architectural decision)
ENG-303 — deferred, with revisit triggers in description
ENG-304 — Playwright UI smoke (the work that lands)
ENG-305 — Nightly drift detection (builds on ENG-304)
ENG-320 — In-VPC smoke runner (blocks Atlas allowlist tightening; separate threat model from this ADR)
docs/development/preflight.md — Layer 1 local CI mirror; gains Playwright in --full mode
docs/infrastructure/post-deploy-smoke.md — Layer 3 deploy-time smoke; Playwright integrates here under M4