Deploying via Terraform

Operational runbook for the deploy pipeline that landed in ENG-277 / ADR 0007.

How deploys fire

| Env | Trigger | Approval |
| --- | --- | --- |
| Prod | `gh workflow run deploy-prod.yml --ref <ref>` (manual `workflow_dispatch`) OR GitHub Actions UI | GitHub environment protection rule on `production` requires Taha to click Approve |
| Staging | `git push origin <commit>:staging` (push to the `staging` branch fires automatically) OR `gh workflow run deploy-staging.yml` | None (auto on push) |

Vercel prod from main is retired as of Phase 10 (2026-04-29). All prod traffic goes through AWS ECS via the prod deploy workflow.

What happens during a deploy run (both envs, same shape)

1. Validate-secrets gate (ENG-297 / ENG-309). The reusable `validate-secrets.yml` workflow runs first, assumes the env's dedicated ValidateSecretsRole, and iterates every secret in the env's Secrets Manager, checking for `\n`, trailing whitespace, empty values, and placeholder strings. Failure blocks the build job.
2. Checkout + OIDC + ECR login. The deploy job assumes GitHubActionsDeployRole for the env account.
3. Build + push image. `docker buildx build` against `linux/amd64`, push to ECR with a `:<sha>` tag (prod uses immutable tags; staging additionally pushes `:latest`).
4. Ensure-indexes pre-deploy task (ENG-266 Phase 3.5). An in-VPC ECS task assumes its own execution role (which has the app_admin_schema admin-tier Mongo secret), runs index creation, awaits completion, and checks the exit code. Failure aborts the deploy with the old service still active.
5. Setup Terraform. `hashicorp/setup-terraform@v3` pinned to 1.14.0.
6. `terraform init` in `infra/envs/<env>`. Assumes TerraformBackendRole in the management account (778477254880) via the deploy role's AssumeRoleBackend Sid.
7. `terraform validate`.
8. `terraform apply -auto-approve -input=false -var "app_image_uri=<built-image-uri>"`. This is the moment of truth. Terraform refreshes the env state, computes the diff (image SHA change + any source-side env/secret additions), registers a new task definition revision, and updates the ECS service to point at it. The service's `deployment_circuit_breaker { enable = true, rollback = true }` handles the rolling deployment with auto-rollback on health failure.
9. `aws ecs wait services-stable` with a timeout (prod 15min, staging 10min). Polls every 15s until `runningCount == desiredCount` on the new revision.
10. Scale to N if first deploy. (Prod: 0→2. Staging: 0→1.) No-op on subsequent deploys.
11. ALB smoke against `origin.<stage.>askflorence.health/api/health`; it must return 200.
12. Fetch smoke secrets from Secrets Manager (4 ARNs: mongodb/app-write, resume-token-secret, internal-reminder-token, hubspot-access-token).
13. Post-deploy smoke via `npx tsx scripts/audit/post-deploy-smoke.ts`: 11 checks (6 read-path from ENG-272 + 5 write-path from ENG-275).
14. Report summary to the workflow run summary page.
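
The validate-secrets gate at the top of the steps above checks each secret value for a handful of defects. A minimal sketch of those per-value checks follows. The function name and the placeholder strings are assumptions for illustration (the runbook does not list the real ones), and the actual `validate-secrets.yml` logic may differ.

```bash
#!/usr/bin/env bash
# Sketch only: per-value checks in the spirit of the validate-secrets gate.
# check_secret_value and the placeholder list are hypothetical.

# Prints the first problem found with a secret value, or nothing if it looks clean.
check_secret_value() {
  local value="$1"
  local nl=$'\n'
  if [ -z "$value" ]; then
    echo "empty"
  elif [[ "$value" == *"$nl"* || "$value" == *'\n'* ]]; then
    # Either a real newline or a literal backslash-n pasted into the value.
    echo "contains-newline"
  elif [[ "$value" =~ [[:space:]]$ ]]; then
    echo "trailing-whitespace"
  else
    case "$value" in
      # Assumed placeholder strings; substitute whatever the team actually uses.
      CHANGEME|TODO|PLACEHOLDER|REPLACE_ME) echo "placeholder" ;;
    esac
  fi
}
```

A caller would run this over every secret and fail the job if any value produces output. Note that shell command substitution (`$(...)`) strips trailing newlines, so a real caller has to fetch each value in a way that preserves them before checking.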

Deploying a specific commit (rollback or canary)

Prod

bash

gh workflow run deploy-prod.yml --ref <commit-or-branch> -f ref=<commit-or-branch>

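The `ref` input (`-f ref=...`) determines what code gets checked out and built; it defaults to `main`. The separate `--ref` flag selects which copy of the workflow file GitHub runs; per the `gh workflow run` help text it expects a branch or tag name, so pass a branch or tag there and put a bare commit SHA only in the `ref` input.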

Staging

bash

# Push the specific commit to the staging branch.
git push origin <commit>:staging --force-with-lease

Or via merge if a clean history is desired:

bash

git checkout staging
git merge --no-ff <commit> -m "deploy: promote <commit-summary> to staging"
git push origin staging

When to expect state-lock contention

- Within a single env: the GitHub Actions `concurrency: deploy-<env>` group serializes deploys. Two concurrent dispatches queue rather than fight.
- Cross-env: prod and staging have independent state files. No cross-contention.
- Local `terraform plan`/`apply` from a developer machine WILL conflict with an in-flight deploy. Coordinate via team chat before running local ops against the same env root.
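
Before running local Terraform against an env root, it can help to check whether a deploy is already queued or running. The sketch below uses `gh run list` for that; the function name is hypothetical, and it assumes the workflow files are named `deploy-<env>.yml` as elsewhere in this runbook. It complements the team-chat coordination rather than replacing it.

```bash
#!/usr/bin/env bash
# Sketch only: deploy_in_flight is a hypothetical helper.
# Returns 0 if a deploy run for the env is queued or in progress,
# 1 if none is, and 2 if the gh call itself fails.
deploy_in_flight() {
  local env="$1" status count total=0
  for status in queued in_progress; do
    count="$(gh run list --workflow "deploy-${env}.yml" --status "$status" \
      --json databaseId --jq 'length')" || return 2
    total=$((total + ${count:-0}))
  done
  [ "$total" -gt 0 ]
}

# Example use (not executed here):
#   if deploy_in_flight prod; then
#     echo "prod deploy in flight; hold off on local terraform" >&2
#   fi
```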

When to expect refresh slowdowns

- First deploy after a long quiet period: refresh reads every resource in the env root. This can take 60-90s on prod (more state). Normal deploys take 30-50s.
- After a Terraform provider upgrade: refresh may re-read attribute shapes. This is a one-time cost per provider bump.
- After unrelated infra source changes that haven't been applied yet: refresh + plan will surface them. Either apply them as part of the deploy (intentional) or rebase the PR so they're already in state (cleaner).
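
One way to see whether unapplied source changes are waiting in an env root is `terraform plan -detailed-exitcode`, which exits 0 when there is no diff, 1 on error, and 2 when changes are pending. The sketch below wraps that; the function name is hypothetical, and passing the currently deployed image URI (so that only non-image changes show up) is an assumption about how you would use it. The plan takes the state lock, so the caveat about in-flight deploys applies.

```bash
#!/usr/bin/env bash
# Sketch only: pending_changes is a hypothetical helper.
# Prints "clean", "pending", or "error" for an env root; the plan output
# itself goes to stderr so stdout carries only the classification.
pending_changes() {
  local env_dir="$1" image_uri="$2" rc=0
  terraform -chdir="$env_dir" plan -detailed-exitcode -input=false \
    -var "app_image_uri=${image_uri}" >&2 || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "pending" ;;
    *) echo "error"; return 1 ;;
  esac
}

# Example use (not executed here):
#   pending_changes infra/envs/staging "<currently-deployed-image-uri>"
```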

Ensure-indexes interaction

The ensure-indexes task family (<env>-app-ensure-indexes-task) still uses the legacy CLI path (register-task-definition + run-task) and has its own lifecycle.ignore_changes = [container_definitions] in infra/modules/ecs-ensure-indexes/main.tf. This is intentional and tracked as a separate concern — ensure-indexes is a one-shot pre-deploy task with visible non-zero-exit failure mode, not a long-running service. Its drift surface is smaller and less consequential.

If the ensure-indexes task fails (non-zero exit), the deploy aborts BEFORE Terraform apply runs, leaving the old service active. Clean rollback by default.
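
The run-and-check part of that CLI path can be sketched as below. This is not the workflow's actual step: the function name is hypothetical, the `register-task-definition` call is omitted, and the Fargate launch type and the cluster, task definition, and network configuration arguments are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: run_one_shot_task is a hypothetical helper.
# Runs a one-shot ECS task, waits for it to stop, and succeeds only if the
# first container exited 0.
run_one_shot_task() {
  local cluster="$1" task_def="$2" network_config="$3" task_arn exit_code
  # --launch-type FARGATE is an assumption; the runbook does not state it.
  task_arn="$(aws ecs run-task --cluster "$cluster" --task-definition "$task_def" \
    --launch-type FARGATE --network-configuration "$network_config" \
    --query 'tasks[0].taskArn' --output text)" || return 1
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" || return 1
  exit_code="$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text)" || return 1
  # A container that never started reports no exit code ("None"); treat as failure.
  [ "$exit_code" = "0" ]
}
```

Because the function returns non-zero on any failure, a workflow step that calls it under `set -e` stops before the Terraform steps, which matches the abort-before-apply behavior described above.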

Debug-on-failure paths

- Workflow failed at Build and push image to ECR: docker build or push errored. Check the step log for the actual error.
- Workflow failed at Run ensure-indexes ECS task: the task ran in-VPC and exited non-zero. Check the `/aws/ecs/<env>-app-ensure-indexes` log group (`aws logs tail` is shown in the workflow's debug-on-failure step).
- Workflow failed at Terraform init: state backend access issue. Most common: the OIDC role's AssumeRoleBackend Sid is missing or the backend KMS key isn't readable. See ENG-308 for the original IAM expansion.
- Workflow failed at Terraform apply: can mean (a) AccessDenied on a refresh path (deploy role missing a Get* / Describe* action; add it to the TerraformRefreshReadOnly Sid), (b) AccessDenied on a write path (deploy role can't actually mutate the resource, usually by design; check what Terraform was trying to change), or (c) a real diff that fails to apply (e.g. ECS service circuit breaker auto-rollback). Check the apply step log.
- Workflow failed at Wait for ECS service stability: the new task def is unhealthy; the circuit breaker should have auto-rolled back. Confirm with `aws ecs describe-services`: if the service is still on the old revision, the rollback worked.
- Workflow failed at Smoke test /api/health: the ALB target is reachable but the app is unhealthy. Check the `/aws/ecs/<env>-app` log group.
- Workflow failed at Post-deploy smoke: one of the 11 checks failed. The script prints which check failed and the response. Before ENG-277 the most common cause was a recently added secret that was present in Terraform source but missing from the live task def's `secrets[]`. Since ENG-277 that cannot happen by construction, so any failure here points to a real app-side regression.
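
For the service-stability case, the rollback check with `aws ecs describe-services` can be sketched as follows. The function names are hypothetical, and the cluster name, service name, and expected task definition ARN are values you supply; the runbook does not state them.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical.

# Prints the task definition ARN of the service's PRIMARY deployment.
primary_task_def() {
  local cluster="$1" service="$2"
  aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].deployments[?status==`PRIMARY`] | [0].taskDefinition' \
    --output text
}

# Succeeds if the service is on the expected (previous) task definition.
rolled_back_to() {
  local cluster="$1" service="$2" expected="$3"
  [ "$(primary_task_def "$cluster" "$service")" = "$expected" ]
}

# Example use (not executed here; names are placeholders):
#   rolled_back_to "<cluster>" "<service>" "<previous-task-def-arn>" \
#     && echo "rollback confirmed"
#   aws logs tail "/aws/ecs/<env>-app" --since 15m
```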

Why this works (the design intent)

lifecycle.ignore_changes = [container_definitions] on aws_ecs_task_definition is gone. Terraform now owns the task def end-to-end: image SHA via -var, env vars + secret bindings from Terraform source. The deploy workflow's job reduces to "build the image + tell Terraform what SHA to use + smoke-test the result." Every other artifact (revision number, secrets[] composition, environment[] composition) is reconciled by terraform apply against the source-of-truth Terraform code.

Result: the silent-secret-binding-drift bug class (ENG-249, ENG-271, ENG-272, ENG-279) cannot recur. Terraform source ↔ live ECS task def parity is enforced by construction, not by a nightly checker that catches drift after-the-fact.
