Deferred architecture decisions

This page is the canonical home for architecture decisions that are correctly deferred today but worth documenting, so future engineers (and auditors) can see that we have thought about them. Pattern: keep the Linear backlog focused on actively actionable work; park "revisit when condition X occurs" decisions here.

For each deferred decision: current state, known limitations, trigger conditions to revisit, proposed migration plan + effort, cross-references.

Companion to: docs/data-sources/cms-dependency-map.md (same pattern, applied to CMS dependency posture).

How to use this page:

Reviewing this page quarterly is enough to catch shifts in trigger conditions
When a trigger fires, file a Linear issue for the migration work and reference the section here
New deferred decisions land here, not as perpetual Linear backlog issues

Rate-limiter storage: per-task in-memory → Redis (ElastiCache)

Source: ENG-286 audit M8. Originally filed as ENG-334 (cancelled 2026-05-14 — moved here).

Current state

Per-task in-memory Map<ip, timestamps[]> rate limiter in src/lib/agent-db.ts:136-195. Used by waitlist + agent-discovery + (post-ENG-321) every state-changing POST + every CMS-proxy route.

typescript

// Pattern (simplified):
const buckets = new Map<string, number[]>();
function checkRateLimit(ip: string, limit: number, windowMs: number): boolean {
  const now = Date.now();
  const timestamps = (buckets.get(ip) ?? []).filter(t => now - t < windowMs);
  if (timestamps.length >= limit) return false;
  timestamps.push(now);
  buckets.set(ip, timestamps);
  return true;
}

Known limitation: fuzzy cap

ECS service runs N tasks (currently 2 in prod)
Each task holds its own in-memory Map
Effective user-facing cap is N × configured cap — a user load-balanced across tasks gets up to N× the per-task throughput
Not a breakage; with the current 2 tasks it just means a configured 30/5min is in practice up to 60/5min for a real user

ENG-321 explicitly documents acceptance of this fuzziness: for anti-scraping defense it's a speed bump, not a hard ceiling. A real scraper still hits a meaningful (even if fuzzy) cap.
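The N × cap effect can be reproduced with a small simulation. This is a minimal sketch, not the code in src/lib/agent-db.ts: makeTaskLimiter, simulateAllowed, the round-robin load balancing and the client IP are all assumptions made for illustration, and the clock is passed in explicitly so the result is deterministic.

```typescript
// Hypothetical simulation: each "task" owns an independent limiter,
// mirroring the per-task in-memory Map pattern shown above.
type Limiter = (ip: string, now: number) => boolean;

function makeTaskLimiter(limit: number, windowMs: number): Limiter {
  const buckets = new Map<string, number[]>();
  return (ip, now) => {
    // Keep only timestamps that are still inside the sliding window.
    const timestamps = (buckets.get(ip) ?? []).filter(t => now - t < windowMs);
    const allowed = timestamps.length < limit;
    if (allowed) timestamps.push(now);
    buckets.set(ip, timestamps);
    return allowed;
  };
}

// Count how many of `requests` calls from one client succeed when an
// assumed round-robin load balancer spreads them over `taskCount` tasks.
function simulateAllowed(
  taskCount: number,
  limit: number,
  windowMs: number,
  requests: number,
): number {
  const tasks = Array.from({ length: taskCount }, () => makeTaskLimiter(limit, windowMs));
  const ip = "203.0.113.7"; // documentation-range address, illustrative only
  let allowed = 0;
  for (let i = 0; i < requests; i++) {
    const now = i; // requests 1 ms apart, all well inside one window
    if (tasks[i % taskCount](ip, now)) allowed++;
  }
  return allowed;
}

const fiveMinutesMs = 5 * 60 * 1000;
console.log(simulateAllowed(1, 30, fiveMinutesMs, 100)); // 30
console.log(simulateAllowed(2, 30, fiveMinutesMs, 100)); // 60
```

Real load balancing is not perfectly even, so the observed cap for a given client falls somewhere between the configured cap and N times that value.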

Trigger conditions to revisit

Migrate to shared-state rate limiting when any of:

ECS scales to ≥4 tasks (effective cap drift ≥4x, abuse defense gets too loose)
Legitimate traffic grows to where N × per-task cap matters for legitimate UX (currently low volume; user portal milestone will change this)
Anti-scraping precision becomes a strategic requirement vs. "speed bump" deterrent
Specific abuse pattern observed that the fuzzy cap permits (e.g., scraper exploiting per-task state intentionally)

Proposed migration plan

Target: ElastiCache Redis cluster in VPC, shared across all ECS tasks for rate-limit state.

Scope:

Provision ElastiCache Redis cluster (cache.t4g.micro for start) in existing VPC subnet group
Wire security group: ECS task SG → Redis SG, port 6379
Add Redis connection string to Secrets Manager (prod/redis-rate-limit, staging/redis-rate-limit)
Refactor src/lib/agent-db.ts rate-limit logic to use Redis INCR + EXPIRE:

typescript

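// Fixed-window counter: unlike the in-memory version, the window starts at the
// first request for a key and resets when the key expires.
// `redis` is assumed to be an already-connected client exposing incr/expire.
async function checkRateLimit(ip: string, route: string, limit: number, windowSec: number): Promise<boolean> {
  const key = `rl:${route}:${ip}`;
  const count = await redis.incr(key);
  // INCR and EXPIRE are separate commands; if the process dies between them the
  // key never expires. Production code should make the pair atomic (e.g. a Lua script).
  if (count === 1) await redis.expire(key, windowSec);
  return count <= limit;
}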

Update ECS task-def to inject Redis connection string env var
Add Redis health check to startup probes
Fail-open behavior: if Redis is unreachable, allow the request (log WARN with [rate-limit-degraded] marker per ENG-330 observability pattern)
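The "allow the request when Redis is unreachable" item in the scope list could look roughly like the sketch below. It is written under assumptions: CounterStore is a hypothetical minimal interface rather than any real client's API, the warn callback stands in for whatever logger the ENG-330 pattern uses, and the two in-memory stores exist only so both paths can be exercised without a Redis instance.

```typescript
// Hypothetical minimal surface; a real Redis client would be adapted to it.
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
}

type WarnFn = (message: string) => void;

async function checkRateLimitFailOpen(
  store: CounterStore,
  warn: WarnFn,
  ip: string,
  route: string,
  limit: number,
  windowSec: number,
): Promise<boolean> {
  const key = `rl:${route}:${ip}`;
  try {
    const count = await store.incr(key);
    if (count === 1) await store.expire(key, windowSec);
    return count <= limit;
  } catch (err) {
    // Allow the request: a limiter outage should not take the route down.
    const reason = err instanceof Error ? err.message : String(err);
    warn(`[rate-limit-degraded] store unreachable, allowing request: ${reason}`);
    return true;
  }
}

// In-memory stand-in for the healthy path (TTL is ignored here).
function makeMemoryStore(): CounterStore {
  const counts = new Map<string, number>();
  return {
    async incr(key) {
      const next = (counts.get(key) ?? 0) + 1;
      counts.set(key, next);
      return next;
    },
    async expire() {
      return 1;
    },
  };
}

// Stand-in for an unreachable Redis: every command rejects.
const brokenStore: CounterStore = {
  async incr() {
    throw new Error("ECONNREFUSED");
  },
  async expire() {
    throw new Error("ECONNREFUSED");
  },
};
```

The catch branch only helps if the client reports failure promptly, so a real integration would also need a short connect and command timeout; otherwise an unreachable Redis stalls requests instead of degrading gracefully.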

Effort: ~4h (Terraform + code + verification)

Reversibility: trivial. Reverting the code change restores the in-memory map; the ElastiCache cluster can stay running for future use or be destroyed.

When Redis lands for rate limiting, two adjacent opportunities to consider (file separate Linear issues at that time):

Marketing session storage — ENG-322's marketing_sessions Mongo collection could optionally move to Redis. Faster reads (~ms vs ~10-50ms), but Mongo with encryption-at-rest is sufficient for marketing-tier (non-PHI) data. Decide at migration time based on actual perf data.
Distributed locks for delayed-job coordination — currently scheduler-coordinated; Redis-backed locks would enable finer-grained job orchestration. Phase 5 (user portal) work may need this.

Cross-references

src/lib/agent-db.ts:134 — current rate-limiter
ENG-321 — rate limits + Origin allowlist + test bypass (consumer of this rate-limiter today)
ENG-322 — session-cookie architecture (potential future co-tenant on Redis)
ENG-330 — graceful degradation + observability pattern (same [degraded] log marker pattern applies)
ENG-286 audit doc docs/audit/comprehensive-code-review-2026-05-12.md — finding M8

ECS task execution role: shared → per-task-def secret ARN scoping

Source: ENG-286 audit I16. Originally filed as ENG-333 (cancelled 2026-05-14 — moved here).

Current state

Prod ECS task execution role gets secretsmanager:GetSecretValue on every ARN in values(module.secrets.secret_arns) — broader than the per-task-def need.

hcl

# infra/envs/prod/ecs.tf:117
task_execution_secret_arns = values(module.secrets.secret_arns)

Known limitation: defense-in-depth gap

The TASK role (different from execution role) is correctly narrow
The EXECUTION role's job is to pull secrets at task startup and inject them as env vars
Execution role is invoked once per task spawn, never used by the running app
If an attacker compromised the execution role (highly unusual: it is a startup-time credential, not a runtime one), they could pull secrets the task doesn't reference
BUT: the task definition's secrets_from_manager map limits which secrets actually get injected into the running task at runtime

So this is a defense-in-depth measure: tightening the IAM grant would align it with what each task def actually needs, but the current gap doesn't change runtime behavior. The audit explicitly flagged this as Info-severity ("fine for current scale, refine when scaling").

Trigger conditions to revisit

Tighten to per-task-def secret ARN scoping when any of:

User portal milestone adds new task definitions (multiple task defs sharing one execution role = bigger blast radius if execution role compromised)
SOC 2 audit specifically flags least-privilege evidence requirements
General Terraform refactor sweeps the ECS module (fold this in for free)

Proposed migration plan

In infra/modules/ecs-service (or per-env config), build the list of secret ARNs from the actual task_definition.secrets_from_manager map rather than values(module.secrets.secret_arns). Each task def's execution role policy contains only the ARNs that task def references.

Effort: ~30min Terraform refactor + verification (IAM policy JSON diff pre/post)
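The pre/post IAM policy diff mentioned in the effort estimate could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, the interfaces cover only the subset of the IAM policy JSON shape the check reads (a Statement array with Effect, Action and Resource), and only an exact secretsmanager:GetSecretValue action is matched, so wildcard actions would need extra handling.

```typescript
// Subset of the IAM policy document shape that this check reads.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}
interface PolicyDocument {
  Statement: PolicyStatement[];
}

const asArray = (v: string | string[]): string[] => (Array.isArray(v) ? v : [v]);

// Collect every resource the policy allows GetSecretValue on.
// Assumption: exact action match only; "secretsmanager:*" is not expanded.
function grantedSecretArns(policy: PolicyDocument): Set<string> {
  const arns = new Set<string>();
  for (const stmt of policy.Statement) {
    if (stmt.Effect !== "Allow") continue;
    if (!asArray(stmt.Action).includes("secretsmanager:GetSecretValue")) continue;
    for (const arn of asArray(stmt.Resource)) arns.add(arn);
  }
  return arns;
}

interface GrantDiff {
  removed: string[]; // granted before the refactor, no longer granted
  missing: string[]; // referenced by the task def but not granted afterwards
  excess: string[]; // still granted afterwards but not referenced
}

// Compare pre- and post-refactor policies against the ARNs the task def references.
function diffSecretGrants(
  before: PolicyDocument,
  after: PolicyDocument,
  referenced: string[],
): GrantDiff {
  const pre = grantedSecretArns(before);
  const post = grantedSecretArns(after);
  return {
    removed: Array.from(pre).filter(a => !post.has(a)).sort(),
    missing: referenced.filter(a => !post.has(a)).sort(),
    excess: Array.from(post).filter(a => !referenced.includes(a)).sort(),
  };
}
```

A clean refactor would show the unreferenced ARNs under removed and leave both missing and excess empty; a non-empty missing list means the task would fail to start because a secret it references can no longer be pulled.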

Reversibility: trivial — revert the Terraform change.

Cross-references

infra/envs/prod/ecs.tf:117 — current task-execution-role secret ARN grant
ENG-286 audit doc docs/audit/comprehensive-code-review-2026-05-12.md — finding I16

Pattern: when does a decision belong here vs in Linear?

Belongs in Linear (actionable now, time-boxed, milestone-bound):

The work has an immediate acute pain it addresses
The work is part of an active milestone or sprint
The work has a definite "done" state achievable in the current cycle

Belongs here (deferred-pending-trigger):

No acute pain today
Specific trigger conditions exist that would change the calculus
Migration plan can be sketched but execution waits for the trigger
Auditor / future engineer benefits from documented thinking

When a trigger fires:

File a new Linear issue
Reference the relevant section here
The Linear issue captures the execution; this doc captures the decision and trigger

This page is reviewed quarterly to catch shifts in trigger conditions and surface any items that have become actionable.