Deploying via Terraform
Operational runbook for the deploy pipeline that landed in ENG-277 / ADR 0007.
How deploys fire
| Env | Trigger | Approval |
|---|---|---|
| Prod | `gh workflow run deploy-prod.yml --ref <ref>` (manual `workflow_dispatch`) OR GitHub Actions UI | GitHub environment protection rule on `production` requires Taha to click Approve |
| Staging | `git push origin <commit>:staging` (push to the `staging` branch fires automatically) OR `gh workflow run deploy-staging.yml` | None (auto on push) |
Vercel prod from main is retired as of Phase 10 (2026-04-29). All prod traffic goes through AWS ECS via the prod deploy workflow.
What happens during a deploy run (both envs, same shape)
- Validate-secrets gate (ENG-297 / ENG-309). Reusable `validate-secrets.yml` workflow runs first, assumes the env's dedicated `ValidateSecretsRole`, and iterates every secret in the env's Secrets Manager checking for `\n`, trailing whitespace, empty values, placeholder strings. Failure blocks the build job.
- Checkout + OIDC + ECR login. Deploy job assumes `GitHubActionsDeployRole` for the env account.
- Build + push image. `docker buildx build` against `linux/amd64`, push to ECR with `:<sha>` tag (prod uses immutable tags; staging additionally pushes `:latest`).
- Ensure-indexes pre-deploy task (ENG-266 Phase 3.5). In-VPC ECS task assumes its own execution role (which has the `app_admin_schema` admin-tier Mongo secret), runs index creation, awaits completion, checks exit code. Failure aborts the deploy with old service still active.
- Setup Terraform. `hashicorp/setup-terraform@v3` pinned to `1.14.0`.
- `terraform init` in `infra/envs/<env>`. Assumes `TerraformBackendRole` in the management account (778477254880) via the deploy role's `AssumeRoleBackend` Sid.
- `terraform validate`.
- `terraform apply -auto-approve -input=false -var "app_image_uri=<built-image-uri>"`. This is the moment of truth. Terraform refreshes the env state, computes the diff (image SHA change + any source-side env/secret additions), registers a new task definition revision, updates the ECS service to point at it. Service `deployment_circuit_breaker { enable = true, rollback = true }` handles the rolling deployment with auto-rollback on health failure.
- `aws ecs wait services-stable` with timeout (prod 15min, staging 10min). Polls every 15s until `runningCount == desiredCount` on the new revision.
- Scale to N if first deploy. (Prod: 0→2. Staging: 0→1.) No-op on subsequent deploys.
- ALB smoke against `origin.<stage.>askflorence.health/api/health`: must return 200.
- Fetch smoke secrets from Secrets Manager (4 ARNs: `mongodb/app-write`, `resume-token-secret`, `internal-reminder-token`, `hubspot-access-token`).
- Post-deploy smoke via `npx tsx scripts/audit/post-deploy-smoke.ts`: 11 checks (6 read-path from ENG-272 + 5 write-path from ENG-275).
- Report summary to the workflow run summary page.
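The four checks named in the validate-secrets step can be sketched as one small shell function. This is only an illustration of the idea, not the contents of `validate-secrets.yml`: the function name `check_secret_value` and the placeholder list are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the real checks live in validate-secrets.yml and may differ.
# Prints a reason and returns 1 when the value looks malformed, returns 0 otherwise.
# The placeholder strings below are an assumption for illustration.
check_secret_value() {
  local name="$1" value="$2"
  local newline='
'
  case "$value" in
    "")
      echo "$name: empty value"; return 1 ;;
    *"$newline"*)
      echo "$name: contains a newline"; return 1 ;;
    *[[:space:]])
      echo "$name: trailing whitespace"; return 1 ;;
    CHANGEME|changeme|TODO|PLACEHOLDER)
      echo "$name: placeholder string"; return 1 ;;
  esac
  return 0
}

# Example:
#   check_secret_value resume-token-secret "abc123 "   # reports trailing whitespace
```

One caveat if adapting this: `$(...)` command substitution strips trailing newlines, so a value fetched that way would lose exactly the defect the newline check is looking for. The value has to be read in a way that preserves it.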
Deploying a specific commit (rollback or canary)
Prod
bash
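gh workflow run deploy-prod.yml --ref <commit-or-branch> -f ref=<commit-or-branch>
The `ref` input (`-f ref=...`) determines what code gets checked out + built. Default is `main`. `--ref` selects the branch or tag whose copy of the workflow file runs; workflow dispatch does not take a bare commit SHA there, so for a specific commit keep `--ref` on a branch or tag and pass the SHA through the `ref` input.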
Staging
bash
# Push the specific commit to the staging branch.
git push origin <commit>:staging --force-with-lease
Or via merge if a clean history is desired:
bash
git checkout staging
git merge --no-ff <commit> -m "deploy: promote <commit-summary> to staging"
git push origin staging
When to expect state-lock contention
- Within a single env: GitHub Actions `concurrency: deploy-<env>` group serializes deploys. Two concurrent dispatches queue rather than fight.
- Cross-env: prod and staging have independent state files. No cross-contention.
- Local `terraform plan/apply` from a developer machine WILL conflict with an in-flight deploy. Coordinate via team chat before running local ops against the same env root.
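Before running local Terraform against an env root, it can help to check whether a deploy is already running. The sketch below is one way to do that with the GitHub CLI. It assumes the workflow files are named `deploy-<env>.yml` as in this runbook; the function names are hypothetical. It complements the team-chat coordination above rather than replacing it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; function names are illustrative only.
# Prints how many deploy runs for the env are in progress or queued.
deploy_in_flight_count() {
  local env="$1" status n total=0
  for status in in_progress queued; do
    n="$(gh run list --workflow "deploy-${env}.yml" --status "$status" \
          --json databaseId --jq 'length')" || return 2
    total=$(( total + n ))
  done
  echo "$total"
}

# Returns 0 when no deploy is active, 1 when one is, 2 when the query failed.
safe_to_run_local_terraform() {
  local env="$1" count
  count="$(deploy_in_flight_count "$env")" || {
    echo "could not query GitHub Actions" >&2
    return 2
  }
  if [ "$count" -gt 0 ]; then
    echo "$count deploy run(s) active for $env; wait before local terraform" >&2
    return 1
  fi
  return 0
}

# Example:
#   safe_to_run_local_terraform staging && terraform -chdir=infra/envs/staging plan
```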
When to expect refresh slowdowns
- First deploy after a long quiet period: refresh reads every resource in the env root. Can take 60-90s on prod (more state). Normal deploys 30-50s.
- After a Terraform provider upgrade: refresh may re-read attribute shapes. One-time per provider bump.
- After unrelated infra source changes that haven't been applied yet: refresh + plan will surface them. Either apply them as part of the deploy (intentional) or rebase the PR so they're already in state (cleaner).
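To see whether unapplied source changes are waiting before a deploy picks them up, `terraform plan -detailed-exitcode` gives a scriptable answer: exit code 0 means no changes, 1 means an error, 2 means changes are present. The wrapper below is a sketch under two assumptions: the function name `pending_changes` is hypothetical, and the currently deployed image URI is passed as `app_image_uri` so that the image itself does not show up as a diff. The plan takes the state lock, so the coordination note above still applies.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `terraform plan -detailed-exitcode`.
# Prints "clean", "pending", or "error"; returns 1 only on error.
pending_changes() {
  local env_root="$1" image_uri="$2" rc=0
  terraform -chdir="$env_root" plan -detailed-exitcode -input=false \
    -var "app_image_uri=${image_uri}" >/dev/null || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "pending" ;;
    *) echo "error"; return 1 ;;
  esac
}

# Example (the image URI is a placeholder you supply):
#   pending_changes infra/envs/staging "<currently-deployed-image-uri>"
```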
Ensure-indexes interaction
The ensure-indexes task family (`<env>-app-ensure-indexes-task`) still uses the legacy CLI path (`register-task-definition` + `run-task`) and has its own `lifecycle.ignore_changes = [container_definitions]` in `infra/modules/ecs-ensure-indexes/main.tf`. This is intentional and tracked as a separate concern: ensure-indexes is a one-shot pre-deploy task with a visible non-zero-exit failure mode, not a long-running service. Its drift surface is smaller and less consequential.
If the ensure-indexes task fails (non-zero exit), the deploy aborts BEFORE Terraform apply runs, leaving the old service active. Clean rollback by default.
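The abort-before-apply behaviour amounts to waiting for the one-shot task to stop and then gating on its container exit code. The sketch below shows that shape with standard AWS CLI calls. The function names are hypothetical, the cluster name and task ARN are supplied by the caller, and the real workflow step may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the workflow's actual step may differ.
# Prints the first container's exit code for an ECS task.
task_exit_code() {
  local cluster="$1" task_arn="$2"
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text
}

# Returns 0 only if the task stopped with exit code 0.
gate_on_ensure_indexes() {
  local cluster="$1" task_arn="$2" code
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" || return 1
  code="$(task_exit_code "$cluster" "$task_arn")" || return 1
  if [ "$code" != "0" ]; then
    echo "ensure-indexes exited with '$code'; aborting before terraform apply" >&2
    return 1
  fi
  return 0
}

# Example:
#   gate_on_ensure_indexes "<cluster>" "<task-arn>" && terraform apply ...
```

Comparing against the string `0` also treats a missing exit code (the CLI prints `None` when the container never reported one) as a failure.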
Debug-on-failure paths
- Workflow failed at `Build and push image to ECR`: docker build or push errored. Check the step log for the actual error.
- Workflow failed at `Run ensure-indexes ECS task`: the task ran in-VPC and exited non-zero. Check the `/aws/ecs/<env>-app-ensure-indexes` log group (`aws logs tail` shown in the workflow's debug-on-failure step).
- Workflow failed at `Terraform init`: state backend access issue. Most common: the OIDC role's `AssumeRoleBackend` Sid is missing or the backend KMS key isn't readable. See ENG-308 for the original IAM expansion.
- Workflow failed at `Terraform apply`: can mean (a) AccessDenied on a refresh path (deploy role missing a `Get*`/`Describe*` action; add it to the `TerraformRefreshReadOnly` Sid), (b) AccessDenied on a write path (deploy role can't actually mutate the resource, usually by design; check what Terraform was trying to change), or (c) a real diff that fails to apply (e.g. ECS service circuit breaker auto-rollback). Check the apply step log.
- Workflow failed at `Wait for ECS service stability`: new task def is unhealthy; circuit breaker should have auto-rolled back. Confirm with `aws ecs describe-services`: if the service is still on the old revision, the rollback worked.
- Workflow failed at `Smoke test /api/health`: ALB target is reachable but the app is unhealthy. Check the `/aws/ecs/<env>-app` log group.
- Workflow failed at `Post-deploy smoke`: one of the 11 checks failed. The script prints which check failed and the response. Before ENG-277 the most common cause was a recently added secret that was in Terraform source but missing from the live task def's `secrets[]`. Since ENG-277 that can't happen by construction, so any failure here points to a real app-side regression.
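For the service-stability case, the rollback check can be scripted by comparing the service's current task definition against the revision that was live before the deploy. This is a sketch under assumptions: the function names are hypothetical, and the caller supplies the cluster, the service, and the previous task definition ARN.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for confirming a circuit-breaker rollback.
# Prints the task definition ARN the service currently points at.
service_task_definition() {
  local cluster="$1" service="$2"
  aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].taskDefinition' --output text
}

# Returns 0 if the service is on the expected (old) revision,
# 1 if it is on something else, 2 if the lookup failed.
rollback_confirmed() {
  local cluster="$1" service="$2" expected_old="$3" live
  live="$(service_task_definition "$cluster" "$service")" || return 2
  if [ "$live" = "$expected_old" ]; then
    echo "rolled back: service is on $live"
    return 0
  fi
  echo "NOT rolled back: service is on $live" >&2
  return 1
}

# Example:
#   rollback_confirmed "<cluster>" "<service>" "<previous-task-def-arn>"
```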
Why this works (the design intent)
`lifecycle.ignore_changes = [container_definitions]` on `aws_ecs_task_definition` is gone. Terraform now owns the task def end-to-end: image SHA via `-var`, env vars + secret bindings from Terraform source. The deploy workflow's job reduces to "build the image + tell Terraform what SHA to use + smoke-test the result." Every other artifact (revision number, `secrets[]` composition, `environment[]` composition) is reconciled by `terraform apply` against the source-of-truth Terraform code.
Result: the silent-secret-binding-drift bug class (ENG-249, ENG-271, ENG-272, ENG-279) cannot recur. Terraform source ↔ live ECS task def parity is enforced by construction, not by a nightly checker that catches drift after-the-fact.
See also
- ADR 0007: the design decision
- `docs/runbooks/rollback-via-terraform.md`: rollback procedures
- `infra/modules/ecs-service/main.tf`: the module that owns the task def
- `.github/workflows/deploy-prod.yml` + `deploy-staging.yml`: the workflow definitions