ADR 0008 — E2E testing strategy: local Playwright + PR-CI against staging, defer ephemeral PR previews
Status
Accepted — 2026-05-13.
Shipped under ENG-303 deferral + ENG-304 selection. Companion to ADR 0007 (terraform-driven deploy).
Context
ENG-284 shipped the foundation of "new work doesn't break old work" — preflight harness, PR-time required CI checks (typecheck, audits, build), post-deploy smoke (11 checks). That covers ~80% of the goal.
The remaining ~20% lives in E2E browser-level coverage. Today, the 9-step member + 7-step agent manual smoke flows in CLAUDE.md are run by Claude sessions via Preview MCP. At 40+ PRs/day (1-2 dev team scaling to ~3-4), running these manually per PR doesn't scale.
The Test Infrastructure & CI Reliability project set out four milestones to close the gap. The decision question this ADR answers: do we ship ephemeral PR preview environments (M2 / ENG-303) at the same time as Playwright (M3 / ENG-304), or just ship Playwright and defer previews?
Decision
Defer M2 (ENG-303 ephemeral PR previews). Ship M3 (Playwright local + PR-CI against staging) and M4 (nightly drift) on cycle #1. Revisit M2 when specific triggers fire (see below).
E2E coverage shape going forward:
- Local Playwright via `npm run preflight -- --full`: devs run the suite against `localhost:3004` (or staging) before push. Husky pre-push hook enforces a fast subset. The PR template requires preflight confirmation.
- PR-CI Playwright job: runs the same suite against `https://stage.askflorence.health` on every PR push to main. Required status check. Catches regressions for fork-PR contributors who can't be required to run husky.
- Nightly Playwright + post-deploy-smoke (M4 / ENG-305): same suite on a `0 9 * * *` UTC cron against staging + prod. Catches out-of-band drift (third-party API changes, cert rotations, infra config drift).
Tool selection: Playwright. Reasoning detailed below.
Tool selection: alternatives considered
| Tool | Fit for us | Decision |
|---|---|---|
| Playwright (microsoft/playwright) | Native TypeScript, free + OSS, free parallelization, deterministic auto-wait, trace viewer + video on failure, Chromium + Firefox + WebKit (Safari's engine, incl. emulated Mobile Safari). State of JS 2024 shows it overtook Cypress in both satisfaction and usage growth. | ✅ Selected |
| Cypress (cypress.io) | Strong interactive debugger, time-travel test runner, popular in JS world. But: no WebKit/Safari support on roadmap (per Shiplight comparison), market share declining, paid parallelization. | ❌ |
| Selenium / WebdriverIO | Cross-language, older. Heavier setup, more flake without modern auto-wait, harder to maintain at our scale. | ❌ |
| Puppeteer | Chromium-only, lower-level than Playwright. Playwright is its de-facto successor for our use case. | ❌ |
| Stagehand (browserbase/stagehand) | OSS Playwright wrapper that lets you write tests in natural language via act / observe / extract / agent primitives. Reduces boilerplate by 60-70% vs raw Playwright + LLM glue. Free locally; pairs with Browserbase ($0.10-0.40/browser-minute) for cloud execution. | Consider as future enhancement — see "Tool evolution path" below. |
| Mabl / Reflect / Octomind (AI-native paid platforms) | Cloud-native, low-code, auto-healing tests, analytics dashboards. Engineering teams report 40% of QA cycles spent on flaky-test maintenance (Bug0 analysis). | ❌ — paid tier conflicts with pre-funding budget; would re-evaluate at Series A with audit pressure. |
| Autonoma (getautonoma.com) | AI-native, connects to source code, generates tests from analysis. Newer category. | ❌ — same paid-tier objection as Mabl class. |
Why Playwright wins for our stage:
- $0 license cost, $0 runtime cost (runs on local + GHA runners)
- Same suite reused across local preflight, PR CI, nightly, and (future) ephemeral preview env
- TypeScript-first matches our codebase
- Trace viewer + video on failure means flakes diagnose quickly
- Strong selector strategy when paired with `data-testid` attributes (industry best practice; BrowserStack guidance)
- Future-compatible: if we want AI-augmented test generation later, Playwright is what those tools generate (e.g., Stagehand wraps Playwright; Octomind/Mabl emit Playwright)
Cost analysis: ephemeral previews vs PR-CI against staging
The core cost question. Numbers based on 40 PRs/day = ~1,200 PRs/month (current pace; scaling to 2 devs may double).
With ENG-303 (ephemeral PR previews) — DEFERRED
| Item | Calculation | Monthly |
|---|---|---|
| Fargate task lifetime | ~840 preview PRs after path-filter × $0.012/task-hour × ~12h avg open | ~$120 |
| ECR image storage | 1,200 builds × ~500MB peak, ~50GB steady with lifecycle policy | ~$5 |
| GHA minutes (preview spin-up + smoke per push) | 40 × 2 pushes × ~8min × 30 = 19,200 min, free tier 2,000, overage $0.008/min | ~$138 |
| ALB target groups, CloudWatch logs | Negligible | ~$5 |
| Total ENG-303 incremental | | ~$270/month, ~$3,240/year |
Without ENG-303 (local Playwright + PR-CI against staging) — SELECTED
| Item | Calculation | Monthly |
|---|---|---|
| Local Playwright | Runs on dev laptops | $0 |
| PR-CI Playwright against staging | 40 × 2 × ~3min × 30 = 7,200 min, overage 5,200 × $0.008 | ~$41 |
| Mongo writes (staging cluster, ~4,800 synthetic rows/month with cleanup) | Atlas M10 absorbs trivially | $0 |
| CMS API calls (~24,000/month) | Well under any rate limit | $0 |
| Total | | ~$41/month, ~$490/year |
Delta: ~$230/month, ~$2,750/year saved. At pre-funding stage with deliberate budget posture, this matters.
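The GHA rows in both tables follow the same arithmetic, which can be reproduced with a short sketch. All inputs are this ADR's own estimates; the function and constant names are illustrative, and the free tier is applied to each scenario separately, as the tables do.

```typescript
// Inputs are the ADR's estimates; names are illustrative only.
const FREE_TIER_MINUTES = 2_000;
const OVERAGE_USD_PER_MINUTE = 0.008;

/** Monthly CI minutes for a job that runs on every PR push. */
function monthlyMinutes(
  prsPerDay: number,
  pushesPerPr: number,
  minutesPerRun: number,
  days = 30,
): number {
  return prsPerDay * pushesPerPr * minutesPerRun * days;
}

/** Overage cost in USD once the free tier is used up. */
function overageCostUsd(minutes: number): number {
  return Math.max(0, minutes - FREE_TIER_MINUTES) * OVERAGE_USD_PER_MINUTE;
}

const previewMinutes = monthlyMinutes(40, 2, 8); // preview spin-up + smoke
const stagingMinutes = monthlyMinutes(40, 2, 3); // PR-CI Playwright against staging

console.log(previewMinutes, overageCostUsd(previewMinutes).toFixed(2));
console.log(stagingMinutes, overageCostUsd(stagingMinutes).toFixed(2));
```

Changing `prsPerDay` shows how quickly both figures move if the PR rate doubles.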
What's the actual gap that ephemeral previews close?
Honest accounting of what ENG-303 catches that local Playwright + PR-CI-against-staging doesn't:
| Failure class | Local + PR-CI vs staging | Ephemeral preview | How often it happens |
|---|---|---|---|
| UI / form / hydration / autocomplete | ✅ catches | ✅ catches | Often (the bulk of regressions) |
| Calculator math / plan filtering | ✅ catches | ✅ catches | Common |
| API behavior (routes, validation, error shapes) | ✅ catches | ✅ catches | Common |
| Secret-binding gaps (ENG-272 class) | ❌ (uses local .env.local or staging shared) | ✅ catches per-PR | Already caught by Phase 2a audit + post-deploy smoke before prod |
| EBS Scheduler missing infra (ENG-274 class) | ❌ | ✅ catches per-PR | Already caught by Phase 2b audit + post-deploy smoke before prod |
| ALB routing / health check / target group config | ❌ | ✅ catches per-PR | Rare — happens ~1/quarter, mostly on infra PRs |
| CloudFront cache + WAF rules | ❌ | ✅ catches per-PR | Very rare |
| IAM permission gaps on task runtime role | ❌ | ✅ catches per-PR | Caught by staging deploy + smoke before prod |
| VPC networking / Atlas PrivateLink | ❌ | ✅ catches per-PR | Extremely rare |
| Cross-host cookie / domain behavior | ❌ | ✅ catches per-PR | Rare; only portal-handoff PRs |
| Reviewer "click URL and see change live" | ❌ | ✅ | UX convenience, not a bug class |
| Per-PR test data isolation (vs staging shared) | ❌ | ✅ | Only matters at 5+ devs concurrent |
Honest conclusion: ephemeral previews catch a narrow slice of bugs that aren't already caught elsewhere. The big-ticket failure classes (write paths, secret bindings, EBS Scheduler) are caught by:
- Phase 2 static audits (manifest ↔ Terraform)
- Post-deploy smoke (against staging, before prod cutover)
- Local + PR-CI Playwright against staging
The unique adds — AWS-infra-specific bugs + reviewer convenience + per-PR isolation — are rare or low-value at our stage.
Triggers to revisit ENG-303 (when ephemeral previews become worth it)
Reopen ENG-303 when ANY of these fires:
- Team grows past 3-4 devs — staging shared-state contention becomes real ("my test data overlaps yours"; "your migration broke my smoke run"); the per-PR isolation benefit grows
- SOC 2 Type II audit within ~6 months — auditors may prefer per-PR isolated test envs as evidence of controls-in-environment-similar-to-production
- 2-3 ALB/IAM-runtime-role regressions in a row — signal that staging deploys aren't catching infra-specific bugs early enough
- Funded — pre-funding budget posture relaxes; the $2,750/year is no longer a meaningful decision driver
- Member portal cross-host work matures: flows that span apex + `app.askflorence.health` need real cross-domain testing (cookies, CSP, OAuth handoff); local Playwright can't fully simulate this
When a trigger fires, the implementation guide in the ENG-303 description is the starting point. The same Playwright suite ports over without a rewrite: just retarget the baseURL env var per PR.
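Retargeting by environment could look roughly like the sketch below. The two fixed hosts come from this ADR; the `http://` scheme for localhost and the env var names `E2E_TARGET` and `E2E_PREVIEW_URL` are assumptions made for illustration.

```typescript
type E2eTarget = "local" | "staging" | "preview";

// Hosts are from this ADR; the localhost scheme is an assumption.
const FIXED_BASE_URLS: Record<Exclude<E2eTarget, "preview">, string> = {
  local: "http://localhost:3004",
  staging: "https://stage.askflorence.health",
};

/**
 * Picks the base URL the suite should run against.
 * E2E_TARGET and E2E_PREVIEW_URL are hypothetical env var names.
 */
function resolveBaseUrl(env: Record<string, string | undefined>): string {
  const target = (env.E2E_TARGET ?? "local") as E2eTarget;

  if (target === "preview") {
    // A per-PR preview has no fixed host, so the workflow must pass it in.
    const url = env.E2E_PREVIEW_URL;
    if (!url) throw new Error("E2E_TARGET=preview requires E2E_PREVIEW_URL");
    return url;
  }

  const fixed = FIXED_BASE_URLS[target];
  if (!fixed) throw new Error(`Unknown E2E_TARGET: ${target}`);
  return fixed;
}

console.log(resolveBaseUrl({ E2E_TARGET: "staging" }));
```

Because specs would only ever use relative paths, adding the preview target later would touch this one function rather than any spec file.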
Tool evolution path
We're choosing Playwright today but the tool category is moving fast. The path forward:
- Today (ENG-304 / M3): vanilla Playwright + TypeScript + `data-testid` selectors. ~4 specs to start (calculator, /plans coverage, agent waitlist, agent discovery). Target ~5min wall-clock for the suite.
- Q3 2026 (when member portal stabilizes): add specs for member portal flows. Same Playwright suite, more `.spec.ts` files.
- Q4 2026 (if maintenance burden grows): evaluate Stagehand as a Playwright wrapper that lets us write natural-language tests for less-stable surfaces. Open-source, MIT, $0 cost locally. ~60-70% reduction in test boilerplate per published benchmarks.
- 2027+ (post-funding, audit pressure): re-evaluate paid AI-native platforms (Mabl, Reflect, Octomind) if maintenance cost on the Playwright suite exceeds ~1 day/month of engineering time.
We're explicitly NOT locking ourselves out of the AI-native path. Playwright is the foundation; AI tools layer on top.
Selector strategy (codified to avoid data-testid debt later)
Per BrowserStack 2026 Playwright best practices and industry guidance:
- Prefer role-based selectors (`page.getByRole("button", { name: "Submit" })`): semantic, accessibility-aligned, resilient to CSS refactors
- Use `data-testid` for selectors where text/role is ambiguous or hostile to change: calculator inputs, dynamic plan cards, autocomplete results
- Avoid CSS class selectors and XPath: they break on every CSS refactor
- `data-testid` discipline: small, ongoing maintenance tax accepted. Source-file changes that touch UI gain a `data-testid` if Playwright needs to reference the element. PR checklist nudges this.
Brittle selectors are the most-cited cause of Playwright flake per the Mergify analysis, and they are avoidable with selector discipline.
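A spec following these rules might look like the sketch below. The route, the test ids, and the input value are hypothetical placeholders, not names taken from the codebase; only the `getByRole` and `getByTestId` locator calls are standard Playwright API.

```typescript
import { test, expect } from "@playwright/test";

// Route and test ids are hypothetical; substitute the real ones.
test("calculator returns at least one plan", async ({ page }) => {
  // Relative path, resolved against the configured baseURL.
  await page.goto("/calculator");

  // Input with no stable accessible name: fall back to data-testid.
  await page.getByTestId("calculator-income-input").fill("52000");

  // Button with a clear accessible name: prefer the role-based selector.
  await page.getByRole("button", { name: "Submit" }).click();

  // Web-first assertion; it retries until visible or the timeout expires.
  await expect(page.getByTestId("plan-card").first()).toBeVisible();
});
```

Neither locator depends on CSS classes or DOM position, so a styling refactor should leave the spec untouched.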
Flake budget + retries
Industry data: mid-stage SaaS teams hit 4% flake → ~1,000 spurious failures/week per Bug0 analysis. Engineering attention burned on triage.
Our target: <1% flake on the stable-surface suite within 4 weeks of M3 ship. Pattern:
- `retries: 2` per spec: built-in (handles transient network/hydration race)
- Playwright auto-wait: built-in
- Explicit `await expect(locator).toBeVisible()` before interactions
- No `page.waitForTimeout(N)` calls: only event-based waits
- Trace + video on failure → diagnosis under 5 min per flake
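Tracking the <1% target needs a working definition of "flake". The sketch below assumes the one Playwright's own reporting uses: a test is flaky when it fails at least once and then passes on a retry. The `TestOutcome` shape and the function names are hypothetical; the ADR does not specify how the rate will be measured.

```typescript
// Hypothetical record: one entry per test per CI run.
interface TestOutcome {
  title: string;
  attempts: boolean[]; // pass/fail per attempt, in order; true = passed
}

const FLAKE_BUDGET = 0.01; // ADR target: under 1%

/** Flaky = failed at least once, but the final attempt passed. */
function isFlaky(outcome: TestOutcome): boolean {
  const lastPassed = outcome.attempts[outcome.attempts.length - 1] === true;
  return lastPassed && outcome.attempts.includes(false);
}

/** Share of test executions that only passed thanks to a retry. */
function flakeRate(outcomes: TestOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(isFlaky).length / outcomes.length;
}

function withinBudget(outcomes: TestOutcome[]): boolean {
  return flakeRate(outcomes) < FLAKE_BUDGET;
}

const sample: TestOutcome[] = [
  { title: "calculator", attempts: [true] },
  { title: "plans coverage", attempts: [false, true] }, // flaky
  { title: "agent waitlist", attempts: [false, false, false] }, // real failure
];
console.log(flakeRate(sample), withinBudget(sample));
```

A test that fails all three attempts is counted as a real failure rather than a flake, so genuine regressions do not inflate the flake figure.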
What lands tomorrow (cycle #1, due 2026-05-14)
| Issue | Effort | Outcome |
|---|---|---|
| ENG-304 (M3) | ~1 day | 4 stable-surface Playwright specs + PR-CI workflow + husky pre-push integration + first-week flake monitoring |
| ENG-305 (M4) | ~0.5 day (builds on M3) | Nightly cron workflow runs the same suite + post-deploy-smoke against staging + prod. Auto-Linear on failure. |
Files / surfaces touched (when implementing)
- New: `tests/playwright.config.ts` (main config with retry, parallel workers, baseURL)
- New: `tests/smoke/calculator.spec.ts`, `plans-coverage.spec.ts`, `agent-waitlist.spec.ts`, `agent-discovery.spec.ts`
- New: `.github/workflows/playwright.yml` (runs against staging on PR + cron schedule)
- Modified: `scripts/preflight.ts` (adds Playwright to `--full` mode)
- Modified: `package.json` (adds `@playwright/test` devDep + `test:e2e` script)
- Modified: UI components touched per spec (add `data-testid` attributes where text/role selectors are insufficient)
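As a starting point for the config file listed above, a minimal sketch might look like this. It assumes the config sits in `tests/` next to the `smoke/` directory; `BASE_URL` is a hypothetical env var name, and the `http://` fallback for the local server is an assumption.

```typescript
import { defineConfig } from "@playwright/test";

// Sketch only. BASE_URL is a hypothetical env var name.
export default defineConfig({
  // Resolved relative to this file, i.e. tests/smoke/.
  testDir: "./smoke",

  // Run tests within a file in parallel as well as across files.
  fullyParallel: true,

  // Retry a failing test up to twice before reporting it as failed.
  retries: 2,

  use: {
    // Specs use relative paths; this decides local vs staging.
    baseURL: process.env.BASE_URL ?? "http://localhost:3004",

    // Keep artifacts only for failing tests to limit storage.
    trace: "retain-on-failure",
    video: "retain-on-failure",
  },
});
```

The worker count is left at Playwright's default here; it can be pinned with the `workers` option if CI runners turn out to be memory-constrained.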
References
- ADR 0007 — Terraform-driven deploy (companion architectural decision)
- ENG-303 — deferred, with revisit triggers in description
- ENG-304 — Playwright UI smoke (the work that lands)
- ENG-305 — Nightly drift detection (builds on ENG-304)
- ENG-320 — In-VPC smoke runner (blocks Atlas allowlist tightening; separate threat model from this ADR)
- docs/development/preflight.md: Layer 1 local CI mirror; gains Playwright in `--full` mode
- docs/infrastructure/post-deploy-smoke.md: Layer 3 deploy-time smoke; Playwright integrates here under M4
External research consulted
- State of JS 2024 — Playwright overtakes Cypress in satisfaction + usage growth
- Cypress vs Playwright analysis — Shiplight 2026
- Stagehand documentation + pricing — Browserbase
- Stagehand on GitHub — MIT-licensed Playwright wrapper
- Playwright best practices — BrowserStack guide
- Flaky tests in Playwright — Mergify deep-dive
- Playwright MCP changes the build vs buy equation — Bug0 analysis on AI testing economics