ADR 0007 — Terraform owns the ECS task definition; deploys via terraform apply -var app_image_uri=
Status
Accepted — 2026-05-13.
Shipped under ENG-277. Prod went live on the new pipeline 2026-05-13T07:08Z (deploy run 25784042404). Staging mirror landed 2026-05-13 via PR #264.
Supersedes the prior "CI image-swap + Terraform lifecycle.ignore_changes on the task definition" pattern documented in PRs #150 (ENG-272 Layers 1-4) and #162 (ENG-272 Layer 5 drift detection).
Context
The ECS module pinned lifecycle.ignore_changes = [container_definitions] so CI could register new task-definition revisions on every deploy without Terraform fighting it. The block is all-or-nothing on the JSON-encoded container_definitions attribute — when Terraform source added a new secrets[] or environment[] entry, Terraform silently stopped tracking it, and the deploy workflow's describe-task-definition + render-task-definition chain only swapped the image. The new binding never landed on the running task.
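To make the failure mode concrete, here is a minimal sketch of what the legacy pattern could have looked like. It is not the actual module source: the family name, placeholder image, and the hubspot_token_secret_arn variable are hypothetical, and only the lifecycle block and the HUBSPOT_ACCESS_TOKEN name come from this ADR.

```hcl
# Hypothetical reconstruction of the legacy pattern, for illustration only.
variable "hubspot_token_secret_arn" { type = string } # hypothetical input

resource "aws_ecs_task_definition" "this" {
  family = "app" # hypothetical name

  container_definitions = jsonencode([{
    name      = "app"
    image     = "example/placeholder:latest" # hypothetical; CI swapped the real image in
    essential = true

    # Added in a later PR. Because the whole attribute is ignored below,
    # Terraform never plans this change, so it never reaches the live task.
    secrets = [{
      name      = "HUBSPOT_ACCESS_TOKEN"
      valueFrom = var.hubspot_token_secret_arn
    }]
  }])

  lifecycle {
    # All-or-nothing: container_definitions is one JSON-encoded string,
    # so ignoring it hides image, secrets, and environment changes alike.
    ignore_changes = [container_definitions]
  }
}
```

With this shape, a plan after adding the secrets entry reports no changes for the task definition, which matches the silent behaviour described above.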
Four recurrences of this bug class in ten days:
| Date | Issue | What broke |
|---|---|---|
| 2026-05-04 | ENG-249 | HubSpot CRM sync code shipped to apex but never created a contact (silent fire-and-forget — HUBSPOT_ACCESS_TOKEN missing from live task def) |
| 2026-05-08 | ENG-271 | 15-min agent-reminder never fired (silent no-op — SCHEDULER_* env vars missing from live task def) |
| 2026-05-11 | ENG-272 | "Tyler Wood not covered" wrong on the YC application surface (MONGODB_REFERENCE_URI missing from live staging task def) |
| 2026-05-12 | ENG-279 | POST /api/waitlist 500 on the YC link smoke (MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI missing from live staging task def revision 75) |
Each recurrence cost real founder-time to diagnose. Detection layers (Layer 1-5 documented in ENG-272 retro) catch drift between PR-time and runtime, but only AFTER the bad code has shipped and run for some window. The right answer was structural: stop the gap from existing.
Decision
Drop lifecycle.ignore_changes = [container_definitions] from infra/modules/ecs-service. Make Terraform own the whole task definition (including image). Have the deploy workflow drive Terraform with the new image SHA as a per-deploy variable.
Module shape (infra/modules/ecs-service/main.tf):
- aws_ecs_task_definition.this: no lifecycle.ignore_changes block on container_definitions. The image is set via var.container_image, which env callsites plumb through.
- aws_ecs_service.this: lifecycle.ignore_changes shrunk from [desired_count, task_definition] to [desired_count]. (Kept desired_count so the first-deploy 0→2 scaling step doesn't fight Terraform.)
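The module shape described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of main.tf: the resource addresses, var.container_image, and the ignore_changes lists come from this ADR, while the family and service names and the cluster_arn and desired_count inputs are hypothetical.

```hcl
# Sketch only: names not quoted in this ADR are hypothetical.
variable "container_image" {
  type        = string
  description = "Full image URI, supplied by the env callsite."
}

variable "cluster_arn" { type = string }   # hypothetical input
variable "desired_count" { type = number } # hypothetical input

resource "aws_ecs_task_definition" "this" {
  family = "app" # hypothetical name

  container_definitions = jsonencode([{
    name      = "app"
    image     = var.container_image
    essential = true
    # secrets / environment entries live here and are now tracked by Terraform.
  }])

  # Deliberately no lifecycle.ignore_changes on container_definitions.
}

resource "aws_ecs_service" "this" {
  name            = "app" # hypothetical name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count

  lifecycle {
    # task_definition is no longer ignored, so each apply rolls the
    # service onto the revision Terraform just registered.
    ignore_changes = [desired_count]
  }
}
```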
Env callsite shape (infra/envs/{prod,staging}/):
- ecs.tf: container_image = var.app_image_uri (instead of a hardcoded placeholder).
- variables.tf: declares app_image_uri with default "public.ecr.aws/docker/library/nginx:alpine" so a no-arg terraform apply parses.
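A minimal sketch of the env callsite, assuming the module is instantiated under a hypothetical name ecs_service and that its other inputs are omitted here. The variable name, its default, and the container_image wiring are taken from this ADR; the relative source path is inferred from the directory names above.

```hcl
# infra/envs/<env>/variables.tf
variable "app_image_uri" {
  type        = string
  description = "Image URI for this deploy; the workflow passes it via -var."
  # Default lets a no-arg terraform apply parse outside the deploy workflow.
  default = "public.ecr.aws/docker/library/nginx:alpine"
}

# infra/envs/<env>/ecs.tf
module "ecs_service" { # hypothetical module label
  source = "../../modules/ecs-service"

  container_image = var.app_image_uri
  # ...remaining module inputs omitted in this sketch...
}
```

The deploy workflow then overrides the default on each run with -var "app_image_uri=<ecr-uri-from-build-step-output>".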
Deploy workflow shape (.github/workflows/deploy-{prod,staging}.yml):
- Build + push image to ECR (unchanged).
- Ensure-indexes pre-deploy task (unchanged; still uses the legacy register-task-definition + run-task CLI path because its own lifecycle.ignore_changes is a separate, smaller-blast-radius concern; future tightening is tracked outside this ADR).
- hashicorp/setup-terraform@v3 pinned to terraform_version: 1.14.0.
- terraform init -input=false in infra/envs/<env>.
- terraform validate.
- terraform apply -auto-approve -input=false -var "app_image_uri=<ecr-uri-from-build-step-output>" (the workflow uses the build step's image-uri output).
- timeout <secs> aws ecs wait services-stable, mirroring legacy wait-for-service-stability semantics.
Consequences
Good
- The ENG-249 / ENG-271 / ENG-272 / ENG-279 bug class is structurally retired. Terraform-source secrets[] and environment[] additions land on the running task on the next deploy by construction, not by hoping the detection layers catch the gap before users do.
- Single source of truth. Terraform owns the task definition; the manifest declares users/roles; the workflow only declares "what image SHA." Everything reconciles by Terraform's plan/apply, not by string concatenation in the workflow.
- Reproducible past deploys. terraform apply -var app_image_uri=<old-sha> reproduces any past deploy. Better than reading old task def revisions out of AWS state.
- Removed ~600 lines of CI machinery (ENG-272 Layer 5 drift checker + workflow + manual patch helper + custom rendering in deploy.yml, replaced by a single terraform apply).
- Cleaner SOC 2 / EDE evidence story. Every state change has one auditable source (the workflow run) instead of "the workflow rendered the task def with this content, and Terraform separately said something different which the live state ignored."
Bad
- Deploy time grows by ~30-60s. Terraform init + plan + apply overhead vs the legacy image-swap chain. Net per-deploy delta: ~30-50s slower (acceptable; deploys aren't time-critical).
- Deploy workflows gain a Terraform dependency. The workflow now needs the setup-terraform action + Terraform state backend access. Both were already used by infra workflows, so the patterns existed.
- Concurrent deploys serialize via state lock. Terraform's state lock means two deploys can't apply in parallel. GH Actions concurrency: deploy-<env> already enforced serial deploys, so no real behavior change.
- Deploy role needs broader read perms. terraform apply triggers a full state refresh that reads every resource. The original ENG-277 issue text claimed "the deploy role already has terraform-grade perms"; that was wrong. Resolved via ENG-308 (added the TerraformRefreshReadOnly Sid with ~50 read-only Get*/Describe*/List* actions across iam, ec2, acm, elbv2, cloudfront, wafv2, scheduler, lambda, logs, kms, sesv2, secretsmanager-metadata-only, route53, ecs, ecr). Documented compliance trade-off; all actions are read-only with no escalation possible.
- Secret-value read scope is broader than ideal. aws_secretsmanager_secret_version.placeholder resources force the deploy role to have secretsmanager:GetSecretValue on every secret Terraform tracks (the provider's Read function calls it even with lifecycle.ignore_changes = [secret_string]; empirically proven in ENG-313). Current resting scope: arn:aws:secretsmanager:us-east-1:<env-account>:secret:*. Accepted trade-off, documented in the DeploySecretsRead Sid comment block. Two architectural paths to safely re-narrow the scope were cancelled as theater under the realistic threat model (see ENG-313 + ENG-314 + ENG-315 for the analysis).
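For orientation, the resting secret-read scope described above could be expressed along these lines. This is a sketch under assumptions, not the actual policy: the Sid, action, region, and ARN pattern come from this ADR, while the data source names and the use of aws_caller_identity to fill in the env account ID are hypothetical choices.

```hcl
# Sketch only: data source labels are hypothetical.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "deploy_secrets_read" {
  statement {
    # Needed because the provider reads each tracked secret version on refresh,
    # even when secret_string is under lifecycle.ignore_changes.
    sid     = "DeploySecretsRead"
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]

    # Bounded to secrets in the env account; no cross-account ARNs match.
    resources = [
      "arn:aws:secretsmanager:us-east-1:${data.aws_caller_identity.current.account_id}:secret:*",
    ]
  }
}
```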
Compliance posture trade-off (broader deploy-role read access)
The deploy role's broader secret-read scope is documented as an accepted trade-off because:
- All deploy-role access is gated by OIDC trust scoped to main branch + production environment
- GitHub environment protection rule requires manual approval per dispatch
- max_session_duration = 3600 (1h) limits any leaked credential's window
- No IAM escalation actions (no iam:Create*/Put*/Attach*/Update*), verified via the failed ENG-313 deploy where iam:PutRolePolicy was correctly denied
- Resource boundary to env account (cross-account read impossible)
- CloudTrail records every GetSecretValue with role + ARN attribution
- Separate ValidateSecretsRole (ENG-309) prevents the validate-secrets workflow from inheriting broad read
Defensible under SOC 2 CC6.1 (logical access) + HIPAA §164.308(a)(4) (information access management). EDE Phase 3 auditor narrative: "deploy role has env-account secret read because Terraform manages secret-version resources; least-privilege scoped to the env account; cross-account read impossible; manual approval gate prevents unauthorized invocation."
Alternatives considered
- Narrow lifecycle.ignore_changes to ignore only the image field. Rejected: container_definitions is a JSON-encoded string attribute; the AWS provider doesn't expose sub-field-level lifecycle control.
- Keep ignore_changes and add stricter manifest↔Terraform PR-time checks. Tried this (ENG-272 Layer 4 + Layer 5 detection). Caught some recurrences after the fact but didn't prevent them. Detection is necessary but not sufficient.
- Move ECS service management out of Terraform entirely. Rejected: that loses the structural benefits Terraform provides (state lock, plan review, audit trail).
- Replace aws_ecs_task_definition with replace_triggered_by on image_uri data source. Rejected: replace_triggered_by is brittle and still suffers the ignore_changes problem on secondary attributes.
- CloudFormation StackSets instead of Terraform. Rejected: team has zero CFN tooling; full re-platform of infra layer for marginal benefit.
References
- ENG-277: the structural fix issue
- ENG-272: the bug class root-cause analysis (Layers 1-5 detection, of which this ADR retires Layer 5)
- ENG-279: the 4th recurrence that motivated the structural fix
- ENG-308: TerraformRefreshReadOnly IAM expansion (unblocked the new pipeline)
- ENG-309: ValidateSecretsRole separation (preserves SOC 2 role-per-workload posture)
- ENG-313: compliance hardening attempt + empirical revert; documented why the secret-read scope stays at env-account
- PR #234: PR 1, prod-side ship
- PR #264: PR 2, staging-side ship
- PR #234 onward chain (PRs #240, #241, #244, #245, #252, #253, #254, #255, #256, #260, #261): the IAM-perms unblock chain
- docs/runbooks/deploy-via-terraform.md: operational runbook for the new pipeline
- docs/runbooks/rollback-via-terraform.md: rollback procedures
- docs/infrastructure/atlas-access-matrix.md: updated to remove the Layer 5 drift-checker reference
- infra/modules/ecs-service/main.tf:215 (legacy line; now removed): the historical location of lifecycle.ignore_changes = [container_definitions]