ADR 0007 — Terraform owns the ECS task definition; deploys via `terraform apply -var app_image_uri=`

Status

Accepted — 2026-05-13.

Shipped under ENG-277. Prod went live on the new pipeline 2026-05-13T07:08Z (deploy run 25784042404). Staging mirror landed 2026-05-13 via PR #264.

Supersedes the prior "CI image-swap + Terraform lifecycle.ignore_changes on the task definition" pattern documented in PRs #150 (ENG-272 Layers 1-4) and #162 (ENG-272 Layer 5 drift detection).

Context

The ECS module pinned lifecycle.ignore_changes = [container_definitions] so CI could register new task-definition revisions on every deploy without Terraform fighting it. The block is all-or-nothing on the JSON-encoded container_definitions attribute: when Terraform source added a new secrets[] or environment[] entry, Terraform silently ignored the change, and the deploy workflow's describe-task-definition + render-task-definition chain only swapped the image. The new binding never landed on the running task.
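
For illustration, the superseded pattern looked roughly like the sketch below. This is a reconstruction from the description above, not the module's actual source: the variable names, the container name, and the single secret shown are hypothetical.

```hcl
# Sketch of the LEGACY pattern this ADR retires (reconstructed, not the real module).
variable "service_name" { type = string }             # hypothetical input
variable "hubspot_token_secret_arn" { type = string } # hypothetical input

resource "aws_ecs_task_definition" "this" {
  family = var.service_name

  container_definitions = jsonencode([{
    name      = "app" # hypothetical container name
    image     = "public.ecr.aws/docker/library/nginx:alpine" # placeholder; CI swapped the real image in
    essential = true
    memory    = 512 # illustrative value

    # Adding an entry here changed only the JSON string below the ignore rule,
    # so Terraform planned no change and the live task never received it.
    secrets = [
      { name = "HUBSPOT_ACCESS_TOKEN", valueFrom = var.hubspot_token_secret_arn }
    ]
  }])

  lifecycle {
    # All-or-nothing: the whole JSON-encoded attribute is ignored, not just the image.
    ignore_changes = [container_definitions]
  }
}
```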

Four recurrences of this bug class in ten days:

Date	Issue	What broke
2026-05-04	ENG-249	HubSpot CRM sync code shipped to apex but never created a contact (silent fire-and-forget — `HUBSPOT_ACCESS_TOKEN` missing from live task def)
2026-05-08	ENG-271	15-min agent-reminder never fired (silent no-op — `SCHEDULER_*` env vars missing from live task def)
2026-05-11	ENG-272	"Tyler Wood not covered" wrong on the YC application surface (`MONGODB_REFERENCE_URI` missing from live staging task def)
2026-05-12	ENG-279	`POST /api/waitlist` 500 on the YC link smoke (`MONGODB_WRITE_URI` + `MONGODB_AUDIT_WRITE_URI` missing from live staging task def revision 75)

Each recurrence cost real founder-time to diagnose. The detection layers (Layers 1-5, documented in the ENG-272 retro) catch drift between PR-time and runtime, but only AFTER the bad code has shipped and run for some window. The right answer was structural: stop the gap from existing.

Decision

Drop lifecycle.ignore_changes = [container_definitions] from infra/modules/ecs-service. Make Terraform own the whole task definition (including image). Have the deploy workflow drive Terraform with the new image SHA as a per-deploy variable.

Module shape (infra/modules/ecs-service/main.tf):

aws_ecs_task_definition.this — no lifecycle.ignore_changes block on container_definitions. Image is set via var.container_image which env callsites plumb through.
aws_ecs_service.this — lifecycle.ignore_changes shrunk from [desired_count, task_definition] to [desired_count]. (Kept desired_count so the first-deploy 0→2 scaling step doesn't fight Terraform.)
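
The module shape described above can be sketched as follows. Only the resource addresses, var.container_image, and the ignore_changes list come from this ADR; every other variable, the container name, and the numeric values are assumptions made to keep the fragment complete.

```hcl
# Minimal sketch of infra/modules/ecs-service/main.tf after the change (not the real file).
variable "service_name" { type = string }    # hypothetical input
variable "cluster_arn" { type = string }     # hypothetical input
variable "container_image" { type = string } # named in this ADR; set by the env callsite

resource "aws_ecs_task_definition" "this" {
  family = var.service_name

  container_definitions = jsonencode([{
    name        = "app" # hypothetical container name
    image       = var.container_image
    essential   = true
    memory      = 512 # illustrative value
    environment = []  # additions here now show up in the next plan
    secrets     = []  # likewise
  }])

  # No lifecycle.ignore_changes block: Terraform owns the whole definition.
}

resource "aws_ecs_service" "this" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = 0 # illustrative initial value; scaling happens outside Terraform

  lifecycle {
    # task_definition is no longer ignored, so a new revision rolls the service.
    ignore_changes = [desired_count]
  }
}
```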

Env callsite shape (infra/envs/{prod,staging}/):

ecs.tf — container_image = var.app_image_uri (instead of hardcoded placeholder).
variables.tf — declares app_image_uri with default "public.ecr.aws/docker/library/nginx:alpine" so no-arg terraform apply parses.
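
A sketch of the two env-side pieces, under the assumption that the module is called from infra/envs/<env>/ with a relative source path. The module label ecs_service and the description text are hypothetical; the variable name, its default, and the container_image wiring are taken from the lines above.

```hcl
# variables.tf (sketch)
variable "app_image_uri" {
  description = "Image URI for the app container; the deploy workflow overrides it with -var." # hypothetical wording
  type        = string
  default     = "public.ecr.aws/docker/library/nginx:alpine"
}

# ecs.tf (sketch)
module "ecs_service" { # hypothetical module label
  source = "../../modules/ecs-service"

  container_image = var.app_image_uri
  # ...remaining module inputs omitted from this sketch...
}
```

The default exists only so that a plain terraform apply has a value to work with; a real deploy always passes the freshly built image URI.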

Deploy workflow shape (.github/workflows/deploy-{prod,staging}.yml):

Build + push image to ECR (unchanged).
Ensure-indexes pre-deploy task (unchanged — still uses legacy register-task-definition + run-task CLI path because its own lifecycle.ignore_changes is a separate, smaller-blast-radius concern; future tightening tracked outside this ADR).
hashicorp/setup-terraform@v3, pinned to terraform_version: 1.14.0.
terraform init -input=false in infra/envs/<env>.
terraform validate.
terraform apply -auto-approve -input=false -var "app_image_uri=<ecr-uri-from-build-step-output>" (the workflow uses the build step's image-uri output).
timeout <secs> aws ecs wait services-stable, mirroring the legacy wait-for-service-stability semantics.

Consequences

Good

The ENG-249 / ENG-271 / ENG-272 / ENG-279 bug class is structurally retired. Terraform-source secrets[] and environment[] additions land on the running task on the next deploy by construction, not by hoping the detection layers catch the gap before users do.
Single source of truth. Terraform owns the task definition; the manifest declares users/roles; the workflow only declares "what image SHA." Everything reconciles by Terraform's plan/apply, not by string concatenation in the workflow.
Reproducible past deploys. Running terraform apply -var app_image_uri=<old-sha> against the matching commit reproduces a past deploy, provided the image is still in ECR. This beats reading old task-definition revisions out of AWS.
Removed ~600 lines of CI machinery (ENG-272 Layer 5 drift checker + workflow + manual patch helper + custom rendering in deploy.yml — replaced by a single terraform apply).
Cleaner SOC 2 / EDE evidence story. Every state change has one auditable source (the workflow run) instead of "the workflow rendered the task def with this content, and Terraform separately said something different which the live state ignored."

Bad

Deploy time grows by ~30-60s. Terraform init + plan + apply overhead vs the legacy image-swap chain. Net per-deploy delta: ~30-50s slower (acceptable; deploys aren't time-critical).
Deploy workflows gain a Terraform dependency. Each workflow now needs the setup-terraform action and access to the Terraform state backend. Both were already used by the infra workflows, so the patterns existed.
Concurrent deploys serialize via state lock. Terraform's state lock means two deploys can't apply in parallel. GH Actions concurrency: deploy-<env> already enforced serial deploys, so no real behavior change.
Deploy role needs broader read perms. terraform apply triggers a full state refresh that reads every resource. The original ENG-277 issue text claimed "the deploy role already has terraform-grade perms" — that was wrong. Resolved via ENG-308 (added TerraformRefreshReadOnly Sid with ~50 read-only Get* / Describe* / List* actions across iam, ec2, acm, elbv2, cloudfront, wafv2, scheduler, lambda, logs, kms, sesv2, secretsmanager-metadata-only, route53, ecs, ecr). Documented compliance trade-off; all actions are read-only with no escalation possible.
Secret-value read scope is broader than ideal. aws_secretsmanager_secret_version.placeholder resources force the deploy role to hold secretsmanager:GetSecretValue on every secret Terraform tracks: the provider's Read function calls it even with lifecycle.ignore_changes = [secret_string], as confirmed empirically in ENG-313. Current resting scope: arn:aws:secretsmanager:us-east-1:<env-account>:secret:*. This is an accepted trade-off, documented in the DeploySecretsRead Sid comment block. Two architectural paths to safely re-narrow the scope were evaluated and cancelled as theater under the realistic threat model; see ENG-313, ENG-314, and ENG-315 for the analysis.
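
To make the secret-read trade-off concrete, the sketch below pairs a placeholder secret version with the policy statement it forces. The secret name, the placeholder value, and the use of aws_caller_identity to stand in for <env-account> are assumptions; the resource label placeholder, the ignore rule, the Sid, the action, and the ARN pattern come from the text above.

```hcl
# Sketch only: why refresh needs GetSecretValue, and the resulting policy scope.
data "aws_caller_identity" "current" {} # assumption: stands in for <env-account>

resource "aws_secretsmanager_secret" "example" { # hypothetical secret
  name = "example/app-secret"
}

resource "aws_secretsmanager_secret_version" "placeholder" {
  secret_id     = aws_secretsmanager_secret.example.id
  secret_string = "placeholder" # real value is set out of band

  lifecycle {
    # Suppresses diffs on the value, but the provider still reads the
    # secret version during refresh (per the ENG-313 finding cited above).
    ignore_changes = [secret_string]
  }
}

data "aws_iam_policy_document" "deploy_secrets_read" {
  statement {
    sid     = "DeploySecretsRead"
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]
    resources = [
      "arn:aws:secretsmanager:us-east-1:${data.aws_caller_identity.current.account_id}:secret:*"
    ]
  }
}
```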

Compliance posture trade-off (broader deploy-role read access)

The deploy role's broader secret-read scope is documented as an accepted trade-off because:

All deploy-role access is gated by OIDC trust scoped to main branch + production environment
GitHub environment protection rule requires manual approval per dispatch
max_session_duration = 3600 (1h) limits any leaked credential's window
No IAM escalation actions (no iam:Create* / Put* / Attach* / Update*) — verified via the failed ENG-313 deploy where iam:PutRolePolicy was correctly denied
Resources are bounded to the env account (cross-account read is impossible)
CloudTrail records every GetSecretValue with role + ARN attribution
Separate ValidateSecretsRole (ENG-309) prevents the validate-secrets workflow from inheriting broad read
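
As a rough illustration of the first and third controls, a GitHub OIDC trust policy with a one-hour session cap might look like the sketch below. This is not the project's actual role: the role name and both input variables are hypothetical, and the environment-form sub claim is an assumption about how the trust is expressed (the main-branch restriction may live in the GitHub environment's deployment-branch rule or in a different claim condition).

```hcl
# Hypothetical sketch of an OIDC-gated deploy role; not the real role definition.
variable "github_repository" {
  type        = string
  description = "owner/name of the repository allowed to assume the role (hypothetical input)"
}

variable "github_oidc_provider_arn" {
  type        = string
  description = "ARN of the account's GitHub Actions OIDC provider (hypothetical input)"
}

data "aws_iam_policy_document" "deploy_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Assumption: jobs that declare a GitHub environment present this sub form.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_repository}:environment:production"]
    }
  }
}

resource "aws_iam_role" "deploy" {
  name                 = "deploy" # hypothetical role name
  assume_role_policy   = data.aws_iam_policy_document.deploy_trust.json
  max_session_duration = 3600 # 1h cap, as listed above
}
```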

Defensible under SOC 2 CC6.1 (logical access) + HIPAA §164.308(a)(4) (information access management). EDE Phase 3 auditor narrative: "deploy role has env-account secret read because Terraform manages secret-version resources; least-privilege scoped to the env account; cross-account read impossible; manual approval gate prevents unauthorized invocation."

Alternatives considered

Narrow lifecycle.ignore_changes to ignore only the image field. Rejected — container_definitions is a JSON-encoded string attribute; AWS provider doesn't expose sub-field-level lifecycle control.
Keep ignore_changes and add stricter manifest↔Terraform PR-time checks. Tried this (ENG-272 Layer 4 + Layer 5 detection). Caught some recurrences after-the-fact but didn't prevent them. Detection is necessary but not sufficient.
Move ECS service management out of Terraform entirely. Rejected — that loses the structural benefits Terraform provides (state lock, plan review, audit trail).
Replace aws_ecs_task_definition with replace_triggered_by on image_uri data source. Rejected — replace_triggered_by is brittle and still suffers the ignore_changes problem on secondary attributes.
CloudFormation StackSets instead of Terraform. Rejected — team has zero CFN tooling; full re-platform of infra layer for marginal benefit.

References

ENG-277 — the structural fix issue
ENG-272 — the bug-class root-cause analysis (Layers 1-5 detection, of which this ADR retires Layer 5)
ENG-279 — the 4th recurrence that motivated the structural fix
ENG-308 — TerraformRefreshReadOnly IAM expansion (unblocked the new pipeline)
ENG-309 — ValidateSecretsRole separation (preserves SOC 2 role-per-workload posture)
ENG-313 — compliance hardening attempt + empirical revert; documented why the secret-read scope stays at env-account
PR #234 — PR 1, prod-side ship
PR #264 — PR 2, staging-side ship
PR #234 onward chain — PRs #240, #241, #244, #245, #252, #253, #254, #255, #256, #260, #261 — the IAM-perms unblock chain
docs/runbooks/deploy-via-terraform.md — operational runbook for the new pipeline
docs/runbooks/rollback-via-terraform.md — rollback procedures
docs/infrastructure/atlas-access-matrix.md — updated to remove the Layer 5 drift-checker reference
infra/modules/ecs-service/main.tf:215 (legacy line; now removed) — the historical location of lifecycle.ignore_changes = [container_definitions]