ADR 0005: Delayed-job architecture for sub-hour transactional + 24h+ marketing

Status

Accepted — 2026-05-09.

Context

Several agent-flow features need to execute deferred work at a specific later time:

  • 15-minute discovery survey reminder (ENG-242) — agent signs up via /agent-onboarding, doesn't complete /agent-discovery within 15 min, gets a single nudge with a resume link.
  • Resume email on first partial save (ENG-244) — fires inline from the partial-save handler. Already event-driven; not a delayed-job concern.
  • 24h / 72h / 7d second/third nudges — marketing-class lifecycle cadence. Same trigger condition as the 15-min reminder but on longer windows.
  • Future: agent-activation email after admin approval (Phase 5 / ENG-202 family), a one-off email sent when an admin approves a NIPR-validated agent.
  • Future: renewal alerts at policy renewal time — service-of-record reminders; ~30-day notice window.
  • Future: Florence-AI-driven personalized nudges — high-volume per-member events.

The class of problem ("schedule arbitrary work for arbitrary later time, conditional on intermediate state") will recur. We need a default architecture for it.

We are deployed on AWS ECS Fargate post-#47 (Phase 10 cutover, v0.18.0). Email goes through AWS SES v2 (Resend retired in v0.33.0). AWS BAA covers ECS, SES, S3, Secrets Manager, CloudFront, CloudWatch, EventBridge, Lambda, SQS, Step Functions, and DynamoDB. We have an AWS Activate grant covering ~$1K+ of AWS-side spend.

The original ENG-242 implementation used a Vercel Cron polling pattern, which was wrong for our deploy target because we are no longer on Vercel. It was refactored to AWS-native primitives in commit 18d9abd; this ADR captures why.

We reviewed the 2025 SaaS landscape: Inngest (post-Mergent acquisition), Trigger.dev v3, Hatchet (YC-backed, Postgres-based, MIT-licensed), Defer (sunset), Quirrel (acquired by Netlify), Vercel Queues (limited beta as of June 2025), Temporal Cloud (priced ~$100-500/mo floor with $6K startup credits), and Convex (full-stack, no managed-tier BAA). AskFlorence's stage forces these decision criteria:

  • HIPAA BAA mandatory. SES, EventBridge Scheduler, SQS, Lambda, Step Functions are all on the AWS BAA. Inngest BAA is enterprise-only (negotiated, reports peg ~$500-1,500/mo+). Trigger.dev managed cloud has no HIPAA. Convex enterprise-only. Vercel Queues BAA is on Pro tier but we are not on Vercel.
  • Cost-sensitive (pre-revenue, AWS Activate grant covers AWS). $0 floor on EventBridge Scheduler (free tier covers 14M invocations/mo) vs $500-1,500/mo+ for Inngest BAA tier.
  • Single engineer. Whatever ships needs to not require dedicated platform-team operations.
  • Deploy target is AWS ECS — the choice should ride on existing IaC patterns rather than introducing a parallel deploy surface.

Decision

Two-tier architecture by time horizon and audience:

  1. Sub-hour transactional delays (engineering-controlled, individual events): AWS EventBridge Scheduler one-shot per row, target = AWS Lambda thin-proxy invoking our app's HTTP endpoint with an internal token (INTERNAL_REMINDER_TOKEN).
  2. 24h+ marketing/lifecycle cadences (marketer-controlled, cohort-based): HubSpot lifecycle workflows triggered by HubSpot contact properties synced from the app. The marketer (Ian) builds + iterates campaigns in HubSpot UI without engineering tickets.
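
A minimal sketch of what the tier-1 thin-proxy Lambda could look like. The route path and INTERNAL_REMINDER_TOKEN come from this ADR; the base URL, the `x-internal-token` header name, and the helper names are assumptions for illustration, not the shipped code.

```typescript
// Sketch of the thin-proxy Lambda. The route path and INTERNAL_REMINDER_TOKEN
// come from this ADR; base URL, header name, and helper names are assumptions.

type ReminderEvent = { email: string };

interface ProxyRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Minimal HTTP client signature so the handler can be exercised without a network.
type HttpPost = (req: ProxyRequest) => Promise<{ ok: boolean; status: number }>;

// Pure step: translate the scheduler payload into the internal HTTP call.
function buildReminderRequest(event: ReminderEvent, baseUrl: string, token: string): ProxyRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/agents/discovery/send-reminder`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-internal-token": token, // hypothetical header name
    },
    body: JSON.stringify({ email: event.email }),
  };
}

function makeHandler(post: HttpPost, baseUrl: string, token: string) {
  return async (event: ReminderEvent): Promise<{ status: number }> => {
    const res = await post(buildReminderRequest(event, baseUrl, token));
    // Throw on non-2xx so the invocation is recorded as failed and can be retried.
    if (!res.ok) throw new Error(`send-reminder returned ${res.status}`);
    return { status: res.status };
  };
}
```

In a deployed function, `post` would be a thin wrapper around the runtime's `fetch`, and the token would be read from the environment or Secrets Manager. Keeping the request builder pure means the proxy can be unit-tested without AWS or network access.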

Concretely for the 15-minute reminder (ENG-242):

/api/waitlist agent-signup
  → scheduleDiscoveryReminder() — CreateScheduleCommand at submittedAt+15m
  → EventBridge Scheduler (one-shot, ActionAfterCompletion=DELETE)
T+15min:
  → EventBridge Scheduler invokes Lambda target with { email } payload
  → Lambda POSTs to /api/agents/discovery/send-reminder with internal token
  → route atomically claims the row, sends SES email, fires PostHog event
If user submits full survey within 15min:
  → /api/agents/discovery → cancelDiscoveryReminder() → DeleteScheduleCommand
  → reminder never fires
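
The scheduling step in the flow above can be sketched as a pure function that builds the input for `CreateScheduleCommand`. Field names follow the EventBridge Scheduler CreateSchedule API; the group name, the ARNs, and the `buildReminderSchedule` helper are hypothetical, and the real `scheduleDiscoveryReminder()` may differ.

```typescript
import { createHash } from "node:crypto";

// Subset of the CreateSchedule input fields this flow relies on.
interface ReminderScheduleInput {
  Name: string;
  GroupName: string;
  ScheduleExpression: string;
  ScheduleExpressionTimezone: "UTC";
  FlexibleTimeWindow: { Mode: "OFF" };
  ActionAfterCompletion: "DELETE";
  Target: { Arn: string; RoleArn: string; Input: string };
}

// Hypothetical config shape; the values would come from Terraform outputs.
interface ReminderConfig {
  groupName: string;
  lambdaArn: string;
  roleArn: string;
}

const REMINDER_DELAY_MS = 15 * 60 * 1000;

// Deterministic name, per this ADR: agent-reminder-<first 32 hex chars of sha256(email)>.
function scheduleName(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `agent-reminder-${digest.slice(0, 32)}`;
}

function buildReminderSchedule(email: string, submittedAt: Date, cfg: ReminderConfig): ReminderScheduleInput {
  const fireAt = new Date(submittedAt.getTime() + REMINDER_DELAY_MS);
  // One-shot expressions use at(yyyy-mm-ddThh:mm:ss), with no milliseconds or zone suffix.
  const at = fireAt.toISOString().slice(0, 19);
  return {
    Name: scheduleName(email),
    GroupName: cfg.groupName,
    ScheduleExpression: `at(${at})`,
    ScheduleExpressionTimezone: "UTC",
    FlexibleTimeWindow: { Mode: "OFF" },
    ActionAfterCompletion: "DELETE", // schedule removes itself after firing
    Target: { Arn: cfg.lambdaArn, RoleArn: cfg.roleArn, Input: JSON.stringify({ email }) },
  };
}
```

The returned object would be passed to `new CreateScheduleCommand(...)` from `@aws-sdk/client-scheduler`. Because the name depends only on the email, the cancel path can recompute it and issue `DeleteScheduleCommand` without looking anything up.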

Step Functions are reserved for multi-step durable workflows that aren't needed yet. Kafka / MSK is reserved for the case where Florence AI produces high-throughput multi-consumer event streams (Year 2-3+); even then, Kinesis Data Streams is the cheaper AWS-native first step before considering MSK.

Consequences

What we accept

  • Lambda thin-proxy is required. EventBridge Scheduler does not support raw HTTPS targets directly (Universal Target API is limited to AWS service ARNs). HTTPS targets need either an EventBridge API Destination (heavier Terraform) or a small Lambda. We picked Lambda. The Lambda code is ~30 lines, deployable via Terraform alongside the schedule group + IAM roles. It's another deployable surface, but it's small and AWS-native.
  • Per-row schedule resource overhead. EventBridge Scheduler caps schedules at 1M per region (raisable). At AskFlorence scale (forecast: <10/day Year 1 → 1K/day Year 2 → 10K/day Year 3), this is well under the cap: each one-shot schedule lives about 15 minutes, so even 10K/day means only ~100 schedules exist at any moment.
  • Schedule names must be deterministic for cancel idempotency. We use agent-reminder-${sha256(email).slice(0,32)} so cancel can find the schedule without storing its name on the waitlist row. Re-creating an already-existing schedule returns ConflictException (treated as success); deleting a non-existent schedule returns ResourceNotFoundException (treated as success).
  • Vendor risk = none added. All primitives are AWS-managed. AWS BAA already signed. No new vendor contracts, no new BAAs, no new deploy pipelines.
  • HubSpot is the marketer surface. Means we don't build 24h+ cadence in code. Means Ian owns the templates + scheduling rules + cohort segmentation. Trade-off: we depend on HubSpot's workflow engine being reliable (it is, at our scale). Engineering still owns the property sync that powers the workflows.
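
The create/cancel idempotency rule from the list above could be sketched as follows. The two error names are the ones this ADR cites, and AWS SDK v3 exposes them on `error.name`; the `runIdempotent` wrapper and its return values are hypothetical.

```typescript
type SchedulerOp = "create" | "cancel";

// Error that means "the desired end state already holds" for each operation.
const BENIGN_ERROR: Record<SchedulerOp, string> = {
  create: "ConflictException", // schedule already exists
  cancel: "ResourceNotFoundException", // schedule already fired or was never created
};

function isBenign(op: SchedulerOp, err: unknown): boolean {
  return err instanceof Error && err.name === BENIGN_ERROR[op];
}

// Wraps any SDK call, for example () => client.send(new DeleteScheduleCommand(...)).
async function runIdempotent(
  op: SchedulerOp,
  call: () => Promise<unknown>,
): Promise<"applied" | "already-done"> {
  try {
    await call();
    return "applied";
  } catch (err) {
    if (isBenign(op, err)) return "already-done";
    throw err; // throttling, access denied, and similar failures must still surface
  }
}
```

Only the one expected error per operation is swallowed, so a `ResourceNotFoundException` during create (for instance, a missing schedule group) still fails loudly.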

What we don't get

  • Inngest's superior dev experience — multi-step step.sleep(), step.waitForEvent(), observability dashboard. AWS-native equivalent is Step Functions + CloudWatch which is more verbose. Acceptable until use case count crosses 5+ AND coordination becomes painful.
  • Generic delayed-job abstraction. We're building one-off integrations per use case. With only one use case today (the 15-min reminder), a generic abstraction would be premature. Re-evaluate at use case 3+.
  • Sub-second timing precision. In practice EventBridge Scheduler fires within ~30 seconds of the target time, which is acceptable for "approximately 15 min after signup" semantics.

Alternatives considered

Inngest (Enterprise tier with BAA)

Rejected. Best dev experience in 2025 (TypeScript-native step.* API, dashboard, retry semantics built in). But BAA is enterprise-only with no public pricing — reports indicate $500-1,500/mo+. At our pre-revenue stage with the AWS Activate grant offsetting AWS costs to ~$0, the Inngest premium has no offsetting benefit yet. Re-evaluate when use case count crosses 5+ AND we have engineering time being burned on AWS-side workflow boilerplate.

Trigger.dev (managed cloud)

Rejected. No HIPAA on managed cloud. BYOC option exists but adds operational burden (run our own Trigger.dev backplane on AWS) without proportional benefit over EventBridge Scheduler.

Hatchet (Team tier with BAA)

Considered, deferred. YC-backed, MIT-licensed, Postgres-based. Team tier includes BAA at lower price than Inngest Enterprise. Self-host on our existing ECS cluster is technically feasible. Re-evaluate at use case count 5+ if Inngest pricing hasn't improved.

Vercel Queues

Rejected. Limited Beta as of June 2025 with BAA on Pro tier — but we are not on Vercel post-#47.

AWS Step Functions (with Wait state)

Reserved for multi-step workflows. Pricing $25 per million state transitions makes it expensive for single-delay use cases. Right tool when we need durable multi-step orchestration (e.g. "wait 24h, check status, branch on result, loop"). Not needed for the single 15-minute reminder.
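
As a rough worked example of the pricing argument, the sketch below uses the two figures quoted in this ADR ($25 per million state transitions, 14M free Scheduler invocations per month). The three-transitions-per-reminder figure and the 30-day month are assumptions, and the Step Functions free tier is ignored.

```typescript
const STEP_FUNCTIONS_USD_PER_MILLION_TRANSITIONS = 25; // figure quoted in this ADR
const SCHEDULER_FREE_INVOCATIONS_PER_MONTH = 14_000_000; // figure quoted in this ADR

// Monthly Step Functions cost if every reminder ran as its own small workflow.
function monthlyStepFunctionsUsd(
  remindersPerDay: number,
  transitionsPerRun: number, // assumption: depends on how the workflow is modeled
  daysPerMonth = 30,
): number {
  const transitions = remindersPerDay * daysPerMonth * transitionsPerRun;
  return (transitions * STEP_FUNCTIONS_USD_PER_MILLION_TRANSITIONS) / 1_000_000;
}

// Whether one Scheduler invocation per reminder stays inside the free tier.
function fitsSchedulerFreeTier(remindersPerDay: number, daysPerMonth = 30): boolean {
  return remindersPerDay * daysPerMonth <= SCHEDULER_FREE_INVOCATIONS_PER_MONTH;
}
```

Under these assumptions, the Year 3 forecast of 10K reminders/day at three transitions each is 900K transitions, or about $22.50/mo on Step Functions, while the same volume is 300K Scheduler invocations, well inside the free tier.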

AWS SQS (with DelaySeconds)

Considered. Max DelaySeconds is 15 min, which is exactly our case for the reminder. It would require an SQS-triggered Lambda to drain the queue. Operational complexity is equivalent to EventBridge Scheduler + Lambda; we chose EventBridge Scheduler for the per-schedule observability and the cleaner cancel semantics (DeleteScheduleCommand vs the SQS message-purge dance).

Kafka / MSK

Rejected. Wrong category — Kafka is a stream/log, not a scheduler. Would need a delayed-message scheduler built on top. Cost floor ~$3,300/yr for unused capacity. Right tool when we have hundreds of MB/sec sustained throughput AND multiple downstream consumers per event AND want event-sourcing semantics — none of which apply at AskFlorence scale yet.

MongoDB-backed delayed worker (poll DB for runAt <= now)

Considered, deferred. $0 incremental cost (uses existing M10 cluster), familiar pattern (Sidekiq / DelayedJob / pg-boss). Trade-off: we own worker reliability. Acceptable at a very early stage, but EventBridge Scheduler gives us AWS-managed reliability for the same effort. Re-evaluate if we grow into multiple use cases that benefit from a single generic worker.
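
For reference, the polling pattern named in the heading can be sketched as an atomic "claim the next due job" step. The `Job` shape, field names, and helpers below are hypothetical. `claimQuery` builds arguments in the shape the MongoDB Node driver's `findOneAndUpdate(filter, update, options)` accepts, and `claimNextDue` is an in-memory stand-in that shows the intended semantics.

```typescript
interface Job {
  id: string;
  runAt: Date;
  status: "pending" | "claimed";
  claimedBy?: string;
}

// Arguments for collection.findOneAndUpdate(filter, update, options).
// A single atomic update means two workers cannot claim the same job.
function claimQuery(now: Date, workerId: string) {
  return {
    filter: { status: "pending" as const, runAt: { $lte: now } },
    update: { $set: { status: "claimed" as const, claimedBy: workerId, claimedAt: now } },
    options: { sort: { runAt: 1 as const }, returnDocument: "after" as const },
  };
}

// In-memory stand-in with the same semantics: oldest due pending job wins.
function claimNextDue(queue: Job[], now: Date, workerId: string): Job | null {
  const due = queue
    .filter((j) => j.status === "pending" && j.runAt.getTime() <= now.getTime())
    .sort((a, b) => a.runAt.getTime() - b.runAt.getTime());
  if (due.length === 0) return null;
  const job = due[0];
  job.status = "claimed";
  job.claimedBy = workerId;
  return job;
}
```

A worker would call the claim step on a timer and process whatever it returns. Crash recovery (re-queuing jobs whose `claimedAt` is stale) is the part of "we own worker reliability" that this sketch leaves out.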

DynamoDB TTL + Streams

Right tool for >24h delays. TTL precision is "within 48 hours of TTL expiry" per AWS docs, so wrong for sub-hour use cases. Strong candidate for renewal alerts (30 days out) and weekly digests. Not the answer for the 15-min reminder.

Cron via ECS Scheduled Tasks

Rejected. Polling-based — wastes compute when no rows match (most ticks are empty). Per-row precision is 0-N min depending on cron interval. EventBridge Scheduler beats it on every axis for our case.

HubSpot for everything

Rejected for transactional. HubSpot's workflow scheduling has minimum ~5-10 minute precision (depending on the trigger model) and is not designed for sub-hour transactional sends. The 15-min reminder is too tight + too engineering-controlled to live in HubSpot.

Revisit triggers (explicit)

Reopen this ADR (file a new ADR that supersedes it) when one or more of these fire:

| Trigger | New consideration |
| --- | --- |
| Delayed-job use case count crosses 5+ AND multi-step coordination becomes painful (e.g. "wait, check, branch, loop") | Generic abstraction layer + maybe Inngest if BAA pricing has improved, or Step Functions per workflow |
| Florence AI requires high-throughput stream processing (>1M events/day, multiple downstream consumers) | Kinesis Data Streams (not Kafka): AWS-native + BAA-covered + 10-20x cheaper floor than MSK |
| Inngest publishes self-serve BAA pricing < $200/mo | Re-evaluate Inngest as the workflow tier; the DX wins are real |
| Hatchet Cloud (managed) ships with BAA tier < $500/mo | Re-evaluate Hatchet, the closest open-source competitor |
| ECS team grows to 3+ engineers AND we accumulate 10+ delayed-job use cases | Build vs buy reconsideration; might justify a small custom abstraction (Stripe Pelican / Shopify Postal pattern at smaller scale) |
| EDE Phase 3 audit demands an immutable event log beyond CloudTrail+Mongo append-only | EventBridge bus + Kinesis (or MSK at scale) |
| Vercel Queues exits Limited Beta with BAA on Pro tier AND we move part of the system back to Vercel | Reconsider if Vercel ever returns to the deploy mix |

References

  • GitHub #103 — Discovery survey reminder email (15min stall) (ENG-242)
  • GitHub #105 — Save & resume token flow + bundled flow polish (ENG-244)
  • GitHub #110 — HubSpot lifecycle workflows for 24h+ agent nudges (ENG-249)
  • GitHub #111 — This ADR (ENG-250)
  • GitHub #47 — AWS migration (Phase 10 cutover, ECS Fargate)
  • GitHub #57 — Vendor BAA coverage tracking
  • Implementation: commit 18d9abd — Vercel Cron → AWS EventBridge Scheduler one-shot per row
  • AWS docs: EventBridge Scheduler limits, HIPAA-eligible services
  • 2025 modern-SaaS landscape research: see Linear ENG-244 architecture thread for vendor-by-vendor BAA + pricing table

AskFlorence Internal Documentation. Not for public distribution.
