Rollback via Terraform

Operational runbook for rolling back a bad deploy on the new Terraform-driven pipeline (ADR 0007).

Quick reference

| Scenario | Fastest mitigation | Then |
| --- | --- | --- |
| New deploy made the app worse but service is still healthy | Re-dispatch with the previous good ref | Investigate the bad change in a follow-up PR |
| New deploy made the service unhealthy (5xx, crashloop) | ECS circuit breaker auto-rolls back to previous revision; verify and confirm | Investigate why the circuit breaker triggered |
| Circuit breaker didn't catch it (rare; new task is "healthy" but app behaves wrong) | Pin to previous revision via aws ecs update-service --task-definition <family>:<old-N> | Then revert the PR + re-dispatch |
| Bad infra change (Terraform source) blocking deploys | git revert <PR-commit> on main; re-dispatch | Re-architect the change |

Path A — Redeploy a previous good commit (preferred) ​

This is the cleanest rollback. The same Terraform-driven pipeline rebuilds + applies a known-good image.

Prod

bash
# Find the last known-good commit
git log --oneline main -20

# Dispatch deploy-prod with that commit
gh workflow run deploy-prod.yml --ref main -f ref=<good-commit-sha>

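The ref input (-f ref=...) determines which commit is checked out and built; --ref main only selects the branch whose copy of the workflow file runs. Approve the GitHub environment protection prompt when it appears.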
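
Because a mistyped ref would build the wrong thing, one option is to validate the input before dispatching. The following is a minimal sketch, not part of the pipeline: the helper names is_commit_sha and rollback_prod are hypothetical, and it assumes the workflow accepts an abbreviated or full hexadecimal SHA.

```shell
# Hypothetical helpers (not part of the repo): guard the Path A prod dispatch.

# Succeeds only if the argument looks like a 7-40 character hex commit SHA.
is_commit_sha() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}

# Dispatch deploy-prod.yml with the given SHA; refuse anything not SHA-shaped.
rollback_prod() {
  if ! is_commit_sha "$1"; then
    echo "refusing to dispatch: '$1' is not a commit SHA" >&2
    return 1
  fi
  gh workflow run deploy-prod.yml --ref main -f ref="$1"
}

# Usage: rollback_prod <good-commit-sha>
```

After dispatching, gh run list --workflow deploy-prod.yml --limit 1 followed by gh run watch <run-id> is one way to follow the run from the terminal.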

Staging

bash
# Force-push the staging branch to point at the previous good commit
git push origin <good-commit-sha>:staging --force-with-lease

The deploy workflow fires on the push and applies the previous commit's state. Terraform will see whatever drift exists between the bad apply and the good source, plan to reconcile, and apply.

This is the canonical "fix forward via rollback" path. Use it whenever possible. The state stays clean because every state mutation comes from a real workflow run, not from out-of-band CLI surgery.

Path B — Pin ECS service to a previous task def revision (emergency) ​

Use only when Path A is too slow (build + apply takes ~5-10 min) AND the regression is severe.

bash
# Identify the previous good task definition revision
aws ecs describe-services \
  --cluster askflorence-prod \
  --services askflorence-prod-app \
  --query 'services[0].deployments[]' \
  --profile askflorence-prod

# Pin the service to a specific revision
aws ecs update-service \
  --cluster askflorence-prod \
  --service askflorence-prod-app \
  --task-definition askflorence-prod-app-task:<N> \
  --profile askflorence-prod

# Wait for stability
aws ecs wait services-stable \
  --cluster askflorence-prod \
  --services askflorence-prod-app \
  --profile askflorence-prod
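
describe-services only reports the deployments the service currently tracks, so the revision to pin may have to come from the task definition family's history instead. As a hedged sketch: aws ecs list-task-definitions with --family-prefix, --sort, and --max-items is an existing CLI call, while the helper names below are hypothetical. Revision N-1 is only a candidate; confirm it was actually a good deploy before pinning.

```shell
# Hypothetical helpers: derive a candidate rollback target from a task def ARN.

# "arn:aws:ecs:...:task-definition/family:42" -> "family:42"
taskdef_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print "<family>:<N-1>" for the given ARN; fails if there is no earlier revision.
previous_revision() {
  td=$(taskdef_from_arn "$1")
  family=${td%:*}
  rev=${td##*:}
  [ "$rev" -gt 1 ] || return 1
  printf '%s:%s\n' "$family" "$((rev - 1))"
}

# Listing recent revisions (needs AWS credentials, so left commented out):
# aws ecs list-task-definitions \
#   --family-prefix askflorence-prod-app-task \
#   --sort DESC --max-items 5 \
#   --profile askflorence-prod
```

The output of previous_revision can be passed to --task-definition in the update-service call above once the revision has been checked.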

WARNING: this creates Terraform-state drift. The next terraform apply (deploy or local) will see that the service's task_definition ARN doesn't match what Terraform tracks, plan to "reconcile" by registering a new revision matching the bad source, and re-apply the bad change.

You MUST follow Path B with one of:

  1. git revert <bad-PR-commit> on main + dispatch a new deploy (preferred). The next deploy applies the reverted source over the manually-pinned revision; the service ends up on the reverted code via a clean Terraform apply.
  2. Or reconcile state carefully with terraform import / terraform apply -refresh-only (the replacement for the deprecated terraform refresh) so state matches the pinned reality. This is fragile; reserve for unrecoverable scenarios.
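
To see whether a manual pin has left drift behind, one approach is to run a plan with -detailed-exitcode, which Terraform documents as exiting 0 when no changes are planned, 1 on error, and 2 when changes are pending. The sketch below only interprets that exit code; the function name is hypothetical, and the Terraform working directory is not specified by this runbook.

```shell
# Hypothetical helper: translate `terraform plan -detailed-exitcode` results.
describe_plan_exit() {
  case "$1" in
    0) echo "clean: no changes planned" ;;
    1) echo "error: plan failed" ;;
    2) echo "drift: plan has pending changes" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage, from the relevant Terraform directory (the path is an assumption):
# terraform plan -detailed-exitcode >/dev/null; describe_plan_exit "$?"
```

A "drift" result after Path B is expected until the revert has been deployed; a "clean" result afterwards suggests state and source agree again.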

Path C — Revert the infra PR + redeploy (when infra source is the problem) ​

For deploys broken by an infra change (not application code):

bash
git revert <bad-infra-PR-commit>
git push origin <branch-with-revert>
# Open + merge revert PR
# Re-dispatch deploy

The same out-of-band IAM-apply pattern used during ENG-308 + ENG-313 may be needed if the bad change touches the deploy role's permissions and iam:PutRolePolicy is denied (by design: the deploy role can't mutate itself).

Rolling back the ENG-277 deploy pipeline itself

If the new Terraform-driven pipeline needs to be reverted entirely:

  1. Revert PRs #234 (prod) + #264 (staging) — restores lifecycle.ignore_changes = [container_definitions] and the legacy deploy workflow chain.
  2. Live ECS task def is not affected by the source revert — Terraform respects ignore_changes again, so it leaves the live revision alone.
  3. Next deploy uses the restored legacy chain (aws ecs describe-task-definition + amazon-ecs-render-task-definition + amazon-ecs-deploy-task-definition).
  4. The ENG-272 bug class returns. Document the revert reason carefully because the team learned the hard way that detection-only mitigations are insufficient.

This shouldn't be necessary — prod has been stable on the new pipeline since 2026-05-13T07:08Z. But it's a real escape hatch.

Verifying rollback success

After any rollback path:

bash
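# Inspect the task def family. A bare family name resolves to its latest ACTIVE
# revision, which after a Path B pin may differ from the revision the service
# runs, so compare it with the service's taskDefinition in the next command.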
aws ecs describe-task-definition \
  --task-definition askflorence-<env>-app-task \
  --query 'taskDefinition.{revision:revision,image:containerDefinitions[0].image,envCount:length(containerDefinitions[0].environment),secretCount:length(containerDefinitions[0].secrets)}' \
  --profile askflorence-<env>

# Confirm service is stable
aws ecs describe-services \
  --cluster askflorence-<env> \
  --services askflorence-<env>-app \
  --query 'services[0].{desiredCount:desiredCount,runningCount:runningCount,taskDefinition:taskDefinition}' \
  --profile askflorence-<env>

# Apex / stage health
curl -sS https://askflorence.health/api/health     # or https://stage.askflorence.health/api/health
# Should return {"status":"ok","commit":"<expected-sha>","env":"prod"}
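
The health check can also be scripted so that a mismatched commit fails loudly. This is a minimal sketch under stated assumptions: the response shape is taken from the comment above, the JSON is assumed to be compact (no spaces around colons), and the function names are hypothetical.

```shell
# Hypothetical helpers: verify the health endpoint reports the expected commit.

# Extract the "commit" value from a compact JSON health response.
health_commit() {
  printf '%s\n' "$1" | sed -n 's/.*"commit":"\([^"]*\)".*/\1/p'
}

# Succeeds only if status is "ok" and the commit equals the expected SHA.
check_health() {
  case "$1" in
    *'"status":"ok"'*) ;;
    *) return 1 ;;
  esac
  [ "$(health_commit "$1")" = "$2" ]
}

# Usage (network call, so left commented out):
# check_health "$(curl -sS https://askflorence.health/api/health)" <expected-sha>
```

If the endpoint ever returns pretty-printed JSON, a real JSON parser would be a safer choice than the sed pattern used here.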

See also

  • ADR 0007 — the deploy pipeline design
  • docs/runbooks/deploy-via-terraform.md — normal deploy operations
  • docs/infrastructure/change-log.md — recent changes to look at when diagnosing a regression

AskFlorence Internal Documentation. Not for public distribution.