ADR 0009 — Self-hosted analytics + observability: OpenPanel + GlitchTip, full-journey first-party
Status
Superseded by ADR 0010 — 2026-05-26. The portal-CSP argument below remains factually correct and is preserved as design context; ADR 0010 explicitly carves the portal surface out of the new PostHog HIPAA Cloud path because of it. The "self-hosted from day one" moat was conceded for v1 timing reasons; revisit triggers are documented in ADR 0010.
Originally accepted — 2026-05-16 (ENG-347). Superseded the "self-hosted Umami" direction recorded in ENG-217 / GitHub #75 (sub-deliverable A — PostHog rip — shipped 2026-05-12, PRs #184/#186). Was tracked for build under ENG-347 / GitHub #342.
Context
ENG-217 removed PostHog Cloud (cost: $250+/mo for the BAA tier; no FedRAMP path for EDE Phase 3; broken-on-apex via WAF). It named "self-hosted Umami" as the replacement. Two things invalidated that specific choice before build started:
- The product is going web → mobile. The first mobile app iteration is the full onboarding → pricing → plan → waitlist funnel plus Florence AI. Umami has no first-party mobile SDK — its documented mobile path is a hidden WebView, and the real-world workaround is hand-rolled per-platform HTTP against `/api/send` with a silent-data-loss footgun (missing/wrong `User-Agent` → 200 response, events discarded, no error). Umami is a website-traffic tool, not a product-analytics tool: thin funnels/paths, anonymous-only identity.
- The questions are product-analytics questions. "What are users doing, where do they struggle, are they shopping or poking the prefilled demo, how many hit a state-based exchange we don't serve yet" require funnels, path analysis, retention, and cross-surface identity — exactly Umami's documented weak spots.
A HIPAA BAA does not solve this: a BAA makes a third-party processor legally permitted to handle PHI, but the portal + mobile surfaces block third-party scripts/processors architecturally (CSP, per the Creative AdBundance subdomain-cut model). A legal BAA does not override a CSP. So PostHog Cloud — even on the $250+/mo Boost-for-BAA tier — structurally cannot see the app/portal, the exact surfaces we most want to measure. PostHog self-hosted can (first-party) but is the heaviest stack of all options (ClickHouse + Kafka + Redis + PG, 4 vCPU / 16 GB floor), is the deployment path PostHog actively de-emphasizes, and the Cloud→self-hosted migration direction is deprecated/unreliable.
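The BAA-versus-CSP point can be illustrated with a minimal TypeScript sketch. Everything named here is an assumption: the directive values, the `portal.example.com` and `analytics-vendor.example` origins, the `/api/session/event` path, and the helper names are hypothetical, and `isAllowed` is a deliberate simplification of how a browser evaluates `connect-src`, not the real matching algorithm.

```typescript
// Hypothetical policy: the real portal CSP is not reproduced in this ADR.
const FIRST_PARTY_ORIGINS: string[] = ["https://portal.example.com"];

// Builds a Content-Security-Policy header value that only admits
// scripts and network calls to the listed first-party origins.
function buildCsp(origins: string[]): string {
  const sources = ["'self'", ...origins].join(" ");
  return [
    `default-src 'self'`,
    `script-src ${sources}`,
    `connect-src ${sources}`,
  ].join("; ");
}

// Simplified stand-in for the browser's connect-src check: a request is
// allowed only if its URL sits under an allowlisted origin. The page's own
// origin is assumed to be in the list, so 'self' is not modelled separately.
function isAllowed(url: string, origins: string[]): boolean {
  return origins.some((o) => url === o || url.startsWith(o + "/"));
}

// A signed BAA is a legal fact; it is not an input to either function above.
const thirdPartyBlocked = !isAllowed(
  "https://analytics-vendor.example/capture",
  FIRST_PARTY_ORIGINS,
);
const firstPartyAllowed = isAllowed(
  "https://portal.example.com/api/session/event",
  FIRST_PARTY_ORIGINS,
);
```

Under these assumptions the vendor endpoint is rejected and the first-party endpoint is accepted, and no contract changes that outcome. Only editing the allowlist would, which is exactly what the subdomain-cut model forbids.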
The compliance posture is not a constraint to route around — it is the moat. First-party + self-hosted is the only analytics allowed across the full journey including the PHI portal and mobile. Tools Creative AdBundance brings (Meta CAPI, Hotjar, etc.) die at "the cut"; ours does not. "Log everything from everywhere, one continuous funnel" is the goal, and we are the only party who can have it.
Decision
Self-hosted from day one, two tools, full-journey first-party:
| Layer | Tool | Scope |
|---|---|---|
| Product / behavioral analytics | OpenPanel (AGPL-3.0, self-hosted; Docker Compose; Postgres + ClickHouse + Redis; first-party SDKs Web / Swift / Kotlin / React Native) | Marketing apex → /plans → portal/enroll → mobile app, one funnel |
| Errors / crash / perf | GlitchTip (self-hosted, lightweight Django + Postgres + Redis; Sentry-SDK-compatible: web JS + iOS/Android/RN) | Same surfaces; web now, mobile when the app ships |
| Server health | Structured src/lib/logger.ts → CloudWatch Logs + CloudWatch dashboards/alarms + a synthetic canary running the ENG-275 critical-flow smoke on a schedule | All API surfaces |
| Florence AI interaction logging | Not an analytics tool. First-party encrypted transcript/tool-trace store + structured logs, designed under #61. Feeds derived florence_* events into OpenPanel only. | — |
Architecture invariants:
- Server-side event spine is the through-line. Business events fire server-side from the shared `/api/session/*` + `/api/share/*` routes (the ADR-adjacent #274 transport-agnostic session API). Web sends cookies, mobile sends Bearer — same routes, same events. This makes the whole stack tool-portable: swapping an analytics vendor is changing an ingest target, not re-instrumenting.
- No "cut" for first-party. The browser-cookie boundary at the subdomain cut stays enforced (it fences third-party tools). Journey continuity is reconstructed server-side via the identity graph `anonymous af_visitor_id → member_id (server-side join at login/enroll) → device session (mobile Bearer)`. The funnel is defined end-to-end through `enrollment_submitted`, not truncated at the cut.
- Privacy by construction. Income always bucketed, never raw. No raw doctor/drug strings (identifying / health data) — only search patterns. Opaque per-session plan IDs resolved to real `plan_id` server-side at the fire site; real IDs never in a client URL/payload. Identity = `af_visitor_id`, not PII.
- Self-hosted on our AWS = compliant via the data-control path. No analytics-vendor BAA. The infra sits under the existing AWS Organizations BAA. This is the EDE posture (operating history accrues now), not a deferral.
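A minimal TypeScript sketch of the first and third invariants, under stated assumptions: `IngestTarget`, `EventSpine`, `MemoryTarget`, `bucketIncome`, the `plan_selected` event name, and the income bucket boundaries are all hypothetical and chosen only for illustration. The sketch keeps one plan-ID map for brevity, where a real implementation would scope it per session.

```typescript
// Hypothetical shapes; none of these names come from the codebase.
type BusinessEvent = {
  name: string;
  visitorId: string; // af_visitor_id: an opaque identifier, not PII
  props: Record<string, string | number | boolean>;
};

// Any analytics backend is just an ingest target behind this interface.
interface IngestTarget {
  send(event: BusinessEvent): void;
}

// Income is bucketed before it can reach an event payload.
// The boundaries below are assumed, not taken from the taxonomy.
function bucketIncome(annualIncome: number): string {
  if (annualIncome < 25000) return "under_25k";
  if (annualIncome < 50000) return "25k_50k";
  if (annualIncome < 100000) return "50k_100k";
  return "100k_plus";
}

class EventSpine {
  private target: IngestTarget;
  // Opaque plan IDs map to real plan_ids on the server only.
  private planIdByOpaqueId = new Map<string, string>();

  constructor(target: IngestTarget) {
    this.target = target;
  }

  registerPlan(opaqueId: string, realPlanId: string): void {
    this.planIdByOpaqueId.set(opaqueId, realPlanId);
  }

  // Called from a route handler; the client only ever supplied opaquePlanId.
  planSelected(visitorId: string, opaquePlanId: string, annualIncome: number): void {
    const planId = this.planIdByOpaqueId.get(opaquePlanId);
    if (planId === undefined) throw new Error("unknown opaque plan id");
    this.target.send({
      name: "plan_selected",
      visitorId,
      props: { plan_id: planId, income_bucket: bucketIncome(annualIncome) },
    });
  }
}

// In-memory target standing in for a real ingest client.
class MemoryTarget implements IngestTarget {
  events: BusinessEvent[] = [];
  send(event: BusinessEvent): void {
    this.events.push(event);
  }
}
```

Because route handlers depend only on `IngestTarget`, replacing the analytics vendor in this sketch means supplying a different implementation of `send`. The fire sites and the event taxonomy stay untouched.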
Consequences
Accepted:
- OpenPanel adds ClickHouse as an ops component — medium weight (single container; no Kafka), far from the PostHog monster, but a new thing to operate/back up/upgrade.
- OpenPanel is a younger project than Umami (longevity risk). Mitigated by the server-side spine: events fire from our API, so a future vendor swap is an ingest-target change, not a re-instrumentation.
- Reverses the recorded "Umami" decision → doc-trail update (CLAUDE.md, Creative AdBundance brief, session-cookie + share-flows architecture docs, #274). The compliance argument is identical (both self-hosted first-party); only the tool name changes. The cross-surface-moat framing already in CLAUDE.md is correct and preserved verbatim, retargeted to the new tools.
- Cross-surface identity stitching (`visitor → member → device`) is a new required design item. The join key must exist in the event schema from day one so marketing→portal→app stitching is not a retrofit (same discipline as the reserved `florence_*` slots). Coordinated with the Phase-5 portal work and #274's identity model.
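The visitor → member → device stitching could be sketched as follows. This is a hedged illustration, not the #274 identity model: `IdentityGraph`, `EventIdentity`, `joinKey`, and the `member:` / `visitor:` / `device:` key prefixes are hypothetical names introduced here.

```typescript
// Hypothetical sketch: type and method names are illustrative only.
type EventIdentity =
  | { kind: "visitor"; id: string } // af_visitor_id from the web session
  | { kind: "device"; id: string }; // mobile device session (Bearer)

class IdentityGraph {
  private memberByVisitor = new Map<string, string>();
  private memberByDevice = new Map<string, string>();

  // Server-side join, performed at login/enroll.
  linkVisitor(visitorId: string, memberId: string): void {
    this.memberByVisitor.set(visitorId, memberId);
  }

  // Join for a mobile device session authenticated with a Bearer token.
  linkDevice(deviceSessionId: string, memberId: string): void {
    this.memberByDevice.set(deviceSessionId, memberId);
  }

  // Returns the key a funnel query groups by: the member_id once a join
  // exists, otherwise the anonymous identifier the event arrived with.
  joinKey(identity: EventIdentity): string {
    const table =
      identity.kind === "visitor" ? this.memberByVisitor : this.memberByDevice;
    const memberId = table.get(identity.id);
    return memberId !== undefined
      ? `member:${memberId}`
      : `${identity.kind}:${identity.id}`;
  }
}
```

In this sketch events keep the identifier they were fired with, and the join is applied when the funnel is read. That is why the identifier has to be in the event schema from the first event: pre-login marketing events can only be attributed to a member later if they carried `af_visitor_id` at fire time.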
Gained:
- ~$45-90/mo all-in, zero new BAAs, vs the $250+/mo PostHog-BAA path or the ~$2,500-4,000/12-mo phased-PostHog route (Cloud now → self-host at EDE), which was also ~3-5x more expensive and back-loaded the heaviest infra build to right before the Sept audit.
- One continuous marketing→portal→app funnel under one identity graph — the report Creative AdBundance's third-party stack structurally cannot produce. The moat made visible.
- Mobile is additive, not a re-platform: OpenPanel/Sentry SDKs drop onto the same server-side taxonomy when the app ships. This is the entire reason OpenPanel beat Umami.
- The ECS task-def drift hazard that recurred four times in ten days (the MONGODB_WRITE_URI incident class) is closed by ADR 0007 (Terraform owns the task def, shipped 2026-05-13 / ENG-277). Adding `OPENPANEL_*` / `GLITCHTIP_DSN` env vars is now a normal `terraform apply` change, not a drift risk — no #163 mitigation needed.
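On the application side, an env var delivered this way can act as the on/off switch. The TypeScript sketch below assumes the code treats a missing or blank `GLITCHTIP_DSN` as "reporting dormant". That behaviour, and the names `makeErrorReporter`, `ErrorReporter`, `Env`, and `Deliver`, are hypothetical; only the variable name comes from this ADR.

```typescript
// Hypothetical sketch; only the GLITCHTIP_DSN name comes from this ADR.
type Env = Record<string, string | undefined>;
type Deliver = (dsn: string, error: Error) => void;

interface ErrorReporter {
  // Returns true when the error was handed to the delivery function.
  capture(error: Error): boolean;
}

// Dormant when GLITCHTIP_DSN is unset or blank; live once it is set.
function makeErrorReporter(env: Env, deliver: Deliver): ErrorReporter {
  const dsn = (env["GLITCHTIP_DSN"] ?? "").trim();
  if (dsn === "") {
    // No DSN in the task definition yet: swallow errors without sending.
    return { capture: () => false };
  }
  return {
    capture: (error: Error) => {
      deliver(dsn, error);
      return true;
    },
  };
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the sketch testable. In a Node service the caller would hand in the process environment and a delivery function backed by whichever Sentry-compatible SDK is in use.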
Alternatives considered
- Self-hosted Umami (the prior ENG-217 direction) — rejected: no first-party mobile SDK (hand-rolled `/api/send` with silent-fail footgun), website-traffic tool not product-analytics, would force a re-platform exactly when mobile + Florence land.
- PostHog Cloud + HIPAA BAA, then self-host at EDE — rejected: BAA ≠ CSP override, so it cannot see portal/mobile even while paid; ~3-5x cost over 12 months; back-loads the heaviest self-host build to right before the EDE audit (weaker evidence — auditors weigh operating history); depends on the deprecated Cloud→self-hosted migration direction.
- PostHog self-hosted — rejected: heaviest stack of all options (ClickHouse + Kafka, 4 vCPU/16 GB floor), the path PostHog de-emphasizes, and we just removed PostHog for cost/compliance; reintroducing the monster for capabilities OpenPanel+GlitchTip already cover is a bad bootstrapped trade.
- Single all-in-one (PostHog/Mixpanel/Amplitude) — rejected: either third-party (blocked post-cut regardless of BAA) or heavy/expensive; the OpenPanel + GlitchTip split covers ~90% of the relevant capability at a fraction of ops/cash. The 10% gap (integrated session replay, feature flags) is deferred (OpenReplay only-if-needed) or already solved (env-flag dormant→flip rollout, ENG-322 precedent).
References
- ENG-347 / GitHub #342 — build plan-of-record (taxonomy, phasing, decision + cost thread)
- ENG-217 / GitHub #75 — parent (sub-deliverable A, PostHog rip, shipped)
- GitHub #274 — server-side session + clean URLs (the tool-portable through-line)
- GitHub #298 / #163 — share flows / task-def drift root cause
- GitHub #61 — Florence AI architecture (owns Layer-3 transcript logging)
- ADR 0007 — Terraform owns the ECS task def (closes the env-var drift hazard)
- `docs/briefs/creative-adbundance-analytics-brief.md` — the subdomain-cut model (fences third-party tools, not ours)