Terraform at Scale: Multi-Tenant Azure Deployment
Every multi-tenant Terraform project eventually hits the same wall. You start with a reasonable setup: a shared backend, one service principal, a handful of subscriptions. Then a new tenant gets added, then a second region, then someone needs production isolated from staging at the subscription level. The thing that made the original setup clean becomes the thing that makes it unmanageable.
The architecture I landed on after working through this problem treats each deployment as a unique cell in a four-dimensional matrix: tenant, region, environment, and concern layer. Every cell gets its own state file, its own backend, and its own OIDC credential. There's no shared state storage account. There's no credential that can cross subscription boundaries. The entire matrix is derived from a single YAML file.
The failure modes I was trying to avoid
Before getting into the design, the table below captures the naive approaches I've either tried or inherited, and what breaks in each one:
| Approach | What breaks |
|---|---|
| Shared state storage account | One credential can read all state across all tenants |
| Shared provider block | Wrong subscription gets targeted, blast radius is total |
| Per-env GitHub secrets | Doesn't scale past 20 subscriptions, no per-region isolation |
| Terraform workspaces | State is still in the same backend file, provider is still shared |
| Deep nested folder structure | Requires cd into hundreds of dirs, CI complexity explodes |
Core design axioms
These aren't guidelines. They're constraints the architecture enforces structurally:
1. One state file per concern per subscription — never shared
2. Backend storage lives INSIDE the target subscription — not centrally
3. Provider credentials are injected at CI time — never hardcoded
4. OIDC only — no stored secrets, no client secrets, no SPN passwords
5. subscriptions.yaml is the single source of truth — CI derives everything from it
6. Cross-concern output sharing via Key Vault — not terraform_remote_state
The four-dimensional matrix
Every deployment unit is a unique cell in this matrix. Each cell is one Terraform root, one state file, and one OIDC credential.
Concretely, this generates a list like:
```
tenant-a × uksouth × prod × resource-groups     → 1 stack
tenant-a × uksouth × prod × managed-identities  → 1 stack
tenant-a × uksouth × prod × networking          → 1 stack
tenant-a × uksouth × prod × workloads           → 1 stack
tenant-a × eastus × prod × resource-groups      → 1 stack
tenant-a × eastus × prod × networking           → 1 stack
tenant-b × westeurope × prod × resource-groups  → 1 stack
...
```
Each row is a completely independent deployment with its own backend, provider, and credential.
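The fan-out is mechanical enough to sketch in a few lines of Python. This is an illustration, not the real generator: the dict mirrors the shape of subscriptions.yaml, and the concern lists match the example config shown later in the post.

```python
# Illustration: tenant, region, and environment collapse into one
# subscription entry; concerns fan out per subscription.
config = {
    "subscriptions": [
        {"id": "tenant-a--uksouth--prod",
         "concerns": ["resource-groups", "managed-identities",
                      "networking", "workloads"]},
        {"id": "tenant-a--eastus--prod",
         "concerns": ["resource-groups", "networking"]},
        {"id": "tenant-b--westeurope--prod",
         "concerns": ["resource-groups", "networking", "workloads"]},
    ]
}

# One cell per (subscription, concern) pair — each cell is one Terraform
# root, one state file, one OIDC credential.
cells = [
    f"{sub['id']}--{concern}"
    for sub in config["subscriptions"]
    for concern in sub["concerns"]
]

print(len(cells))  # 9 stacks for the three example subscriptions
print(cells[0])    # tenant-a--uksouth--prod--resource-groups
```

Note that the matrix is sparse: not every subscription carries every concern, so the cell count is the sum of each subscription's concern list, not a full cartesian product.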
Repository structure
I keep the structure flat. Stack folders are named with -- delimiters so they're machine-parseable without path traversal. No deep nesting, no environment subdirectories.
```
infra/
├── _modules/                  # Shared modules — no state, no providers
│   ├── resource-group/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── managed-identity/
│   ├── vnet/
│   └── subnet/
│
├── _config/
│   └── subscriptions.yaml     # 🔑 Single source of truth
│
├── stacks/                    # One folder = one Terraform root = one state
│   ├── tenant-a--uksouth--prod--resource-groups/
│   │   ├── main.tf            # What resources
│   │   ├── providers.tf       # Empty skeleton — vars injected by CI
│   │   ├── backend.tf         # Empty block — config injected at init
│   │   ├── variables.tf
│   │   └── terraform.tfvars   # Non-secret defaults only
│   │
│   ├── tenant-a--uksouth--prod--managed-identities/
│   ├── tenant-a--uksouth--prod--networking/
│   ├── tenant-a--uksouth--prod--workloads/
│   ├── tenant-a--eastus--prod--resource-groups/
│   ├── tenant-a--eastus--prod--networking/
│   ├── tenant-b--westeurope--prod--resource-groups/
│   └── ...
│
└── .github/
    └── workflows/
        ├── _tf-deploy.yaml    # Reusable workflow (never edited)
        └── dispatch.yaml      # Matrix generator (reads subscriptions.yaml)
```
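The double-dash convention is what keeps the flat layout machine-parseable. A minimal sketch of the parsing (parse_stack_id is my name for it, not something in the repo):

```python
def parse_stack_id(stack_id: str) -> dict:
    """Split a stack folder name into its four matrix dimensions.

    Single dashes stay inside a segment (tenant-a, resource-groups), so
    '--' is an unambiguous separator as long as no dimension value
    contains a double dash itself.
    """
    tenant, region, environment, concern = stack_id.split("--")
    return {
        "tenant": tenant,
        "region": region,
        "environment": environment,
        "concern": concern,
    }

print(parse_stack_id("tenant-a--uksouth--prod--resource-groups"))
```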
State isolation strategy
The rule I don't bend on: the storage account for a subscription's state lives in that subscription. The UAMI that deploys into uksouth-prod has Storage Blob Data Contributor only on satfstateuksouthprod. It cannot reach satfstateeastusprod at all.
Provider and authentication strategy
Provider configuration per stack
Every stack has an identical, credential-free provider skeleton. Nothing is hardcoded. The ARM_* environment variables are injected by CI at runtime.
```hcl
# stacks/tenant-a--uksouth--prod--networking/providers.tf
terraform {
  required_providers {
    azurerm = { source = "hashicorp/azurerm", version = "~> 3.0" }
  }
}

# Nothing hardcoded — ARM_* env vars injected by CI
provider "azurerm" {
  features {}
  use_oidc = true
}
```
The backend file is intentionally empty. The backend config is injected with -backend-config flags at terraform init time.
```hcl
# stacks/tenant-a--uksouth--prod--networking/backend.tf
terraform {
  backend "azurerm" {}
}
```
OIDC credential flow
The token lifecycle is scoped to the job. When the job ends, the token expires. There are no stored secrets anywhere in this flow.
One UAMI per subscription
Each GitHub environment maps to exactly one UAMI via a federated credential. The UAMI's role assignments are scoped to a single subscription.
Source of truth: subscriptions.yaml
This is the only file I touch to add a tenant, region, or environment. CI reads it and generates the entire deployment matrix. Adding a new region means adding one entry here. Done.
```yaml
# infra/_config/subscriptions.yaml
subscriptions:
  - id: tenant-a--uksouth--prod
    tenant_id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    subscription_id: "11111111-2222-3333-4444-555555555555"
    region: uksouth
    environment: prod
    client_id: "uami-client-id-uksouth-prod"
    backend_sa: "satfstateuksouthprod"
    backend_rg: "rg-tfstate-uksouthprod"
    concerns:
      - resource-groups
      - managed-identities
      - networking
      - workloads

  - id: tenant-a--eastus--prod
    tenant_id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    subscription_id: "22222222-3333-4444-5555-666666666666"
    region: eastus
    environment: prod
    client_id: "uami-client-id-eastus-prod"
    backend_sa: "satfstateeastusprod"
    backend_rg: "rg-tfstate-eastusprod"
    concerns:
      - resource-groups
      - networking

  - id: tenant-b--westeurope--prod
    tenant_id: "ffffffff-0000-1111-2222-333333333333"
    subscription_id: "33333333-4444-5555-6666-777777777777"
    region: westeurope
    environment: prod
    client_id: "uami-client-id-we-prod"
    backend_sa: "satfstateweprod"
    backend_rg: "rg-tfstate-weprod"
    concerns:
      - resource-groups
      - networking
      - workloads
```
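Since CI trusts this file completely, it pays to validate it before generating the matrix. A sketch of the checks I'd run; the storage-account rule is Azure's (3 to 24 characters, lowercase letters and digits only), and the validate helper is illustrative, not part of the pipeline:

```python
import re

# Azure storage account names: 3-24 chars, lowercase letters and digits.
SA_NAME = re.compile(r"^[a-z0-9]{3,24}$")

def validate(cfg: dict) -> list[str]:
    """Return a list of human-readable config errors (empty = valid)."""
    errors = []
    seen_ids = set()
    for sub in cfg["subscriptions"]:
        # Duplicate ids would silently produce colliding stack folders.
        if sub["id"] in seen_ids:
            errors.append(f"duplicate subscription id: {sub['id']}")
        seen_ids.add(sub["id"])
        if not SA_NAME.match(sub["backend_sa"]):
            errors.append(
                f"{sub['id']}: invalid storage account name {sub['backend_sa']}"
            )
    return errors

cfg = {"subscriptions": [
    {"id": "tenant-a--uksouth--prod", "backend_sa": "satfstateuksouthprod"},
    {"id": "tenant-a--uksouth--prod", "backend_sa": "SA_Bad_Name"},
]}
print(validate(cfg))  # two errors: a duplicate id and a bad SA name
```

Failing the dispatch job on a non-empty error list keeps a typo in the YAML from fanning out into twenty broken deployments.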
GitHub Actions pipeline
Overall pipeline architecture
I cap parallel jobs at 20. That's the point where ARM API throttling becomes a problem across multiple subscriptions simultaneously. I've hit it.
Matrix generation (dispatch.yaml)
The Python script runs inside the CI job and produces a JSON matrix that GitHub Actions fans out across parallel jobs. The matrix items are derived entirely from subscriptions.yaml — there's no other configuration to maintain.
```yaml
# .github/workflows/dispatch.yaml
name: Terraform Dispatch

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  generate-matrix:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.matrix.outputs.result }}
    steps:
      - uses: actions/checkout@v4
      - name: Generate deployment matrix
        id: matrix
        run: |
          python3 - <<'EOF'
          import json
          import os

          import yaml  # PyYAML is preinstalled on GitHub-hosted runners

          with open("infra/_config/subscriptions.yaml") as f:
              cfg = yaml.safe_load(f)

          items = []
          for sub in cfg["subscriptions"]:
              for concern in sub["concerns"]:
                  items.append({
                      "stack_id": f"{sub['id']}--{concern}",
                      "working_dir": f"infra/stacks/{sub['id']}--{concern}",
                      "subscription_id": sub["subscription_id"],
                      "tenant_id": sub["tenant_id"],
                      "client_id": sub["client_id"],
                      "backend_sa": sub["backend_sa"],
                      "backend_rg": sub["backend_rg"],
                      "concern": concern,
                      # GitHub environment name: one per subscription, the
                      # target of that subscription's UAMI federated credential
                      "environment": sub["id"],
                  })

          # Step outputs must be written to $GITHUB_OUTPUT;
          # printing to stdout sets nothing.
          with open(os.environ["GITHUB_OUTPUT"], "a") as out:
              out.write(f"result={json.dumps({'include': items})}\n")
          EOF

  deploy:
    needs: generate-matrix
    strategy:
      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
      fail-fast: false
      max-parallel: 20
    uses: ./.github/workflows/_tf-deploy.yaml
    with:
      working_dir: ${{ matrix.working_dir }}
      backend_sa: ${{ matrix.backend_sa }}
      backend_rg: ${{ matrix.backend_rg }}
      concern: ${{ matrix.concern }}
      environment: ${{ matrix.environment }}
    secrets:
      arm_client_id: ${{ matrix.client_id }}
      arm_tenant_id: ${{ matrix.tenant_id }}
      arm_subscription_id: ${{ matrix.subscription_id }}
```
fail-fast: false is non-negotiable here. Without it, a failure in one subscription cancels all other in-flight jobs. The isolation is only useful if a failure in one cell doesn't stop everything else.
Reusable deploy workflow (_tf-deploy.yaml)
This is the workflow that never gets edited directly. It receives inputs from the matrix and handles the actual init, plan, and apply. The -backend-config flags at init time are how the empty backend.tf skeleton gets wired to the correct storage account.
```yaml
# .github/workflows/_tf-deploy.yaml
name: Terraform Deploy (Reusable)

on:
  workflow_call:
    inputs:
      working_dir: { type: string, required: true }
      backend_sa: { type: string, required: true }
      backend_rg: { type: string, required: true }
      concern: { type: string, required: true }
      environment: { type: string, required: true }
    secrets:
      arm_client_id: { required: true }
      arm_tenant_id: { required: true }
      arm_subscription_id: { required: true }

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Per-subscription GitHub environment: holds the prod approval gates
    # and is the target of the UAMI's federated credential
    environment: ${{ inputs.environment }}
    permissions:
      id-token: write # Required for OIDC
      contents: read
    env:
      ARM_CLIENT_ID: ${{ secrets.arm_client_id }}
      ARM_TENANT_ID: ${{ secrets.arm_tenant_id }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.arm_subscription_id }}
      ARM_USE_OIDC: "true"
    defaults:
      run:
        working-directory: ${{ inputs.working_dir }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: |
          terraform init \
            -backend-config="storage_account_name=${{ inputs.backend_sa }}" \
            -backend-config="resource_group_name=${{ inputs.backend_rg }}" \
            -backend-config="container_name=tfstate" \
            -backend-config="key=${{ inputs.concern }}.tfstate" \
            -backend-config="use_oidc=true"
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        # No -detailed-exitcode here: it exits 2 whenever changes exist,
        # which CI would treat as a step failure
        run: terraform plan -input=false -out=tfplan
      - name: Terraform Apply
        run: terraform apply -input=false tfplan
```
Cross-concern dependency resolution
I don't use terraform_remote_state. Reading another stack's state requires credentials for that stack's backend storage account. That means cross-subscription auth, which breaks the isolation model entirely.
Instead, the networking stack writes its outputs into Azure Key Vault inside the same subscription. The workloads stack reads from that same Key Vault using the same UAMI it already has. No new credentials, no cross-subscription reads.
The pattern in code is straightforward. The networking stack writes outputs to Key Vault secrets. The workloads stack reads those secrets as data sources.
```hcl
# networking/outputs-to-kv.tf
resource "azurerm_key_vault_secret" "subnet_app_id" {
  name         = "subnet-app-id"
  value        = azurerm_subnet.app.id
  key_vault_id = data.azurerm_key_vault.this.id
}
```

```hcl
# workloads/inputs-from-kv.tf
data "azurerm_key_vault_secret" "subnet_app_id" {
  name         = "subnet-app-id"
  key_vault_id = data.azurerm_key_vault.this.id
}

resource "azurerm_linux_virtual_machine_scale_set" "app" {
  # ...
  network_interface {
    ip_configuration {
      subnet_id = data.azurerm_key_vault_secret.subnet_app_id.value
    }
  }
}
```
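One detail worth calling out: Key Vault secret names may contain only letters, digits, and dashes, while Terraform identifiers conventionally use underscores. A small hypothetical helper makes the mapping explicit:

```python
import re

def to_kv_secret_name(output_name: str) -> str:
    """Map a Terraform output name to a valid Key Vault secret name.

    Key Vault secret names allow only letters, digits, and dashes
    (1-127 characters), so underscores are converted to dashes.
    """
    name = output_name.replace("_", "-")
    if not re.fullmatch(r"[0-9a-zA-Z-]{1,127}", name):
        raise ValueError(f"cannot map {output_name!r} to a Key Vault secret name")
    return name

print(to_kv_secret_name("subnet_app_id"))  # subnet-app-id
```

Applying the same mapping on both the writing and the reading side keeps the secret names predictable without maintaining a lookup table.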
Concern deploy order
Within a subscription, concerns have hard dependencies. These are enforced with needs: in the GitHub Actions DAG. Across subscriptions, everything deploys in parallel.
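As an illustration, the ordering can be expressed as a topological sort. The edges below are my reading of the layering (resource groups first, workloads last); the authoritative ordering lives in the workflow's needs: clauses:

```python
from graphlib import TopologicalSorter

# Assumed within-subscription dependency edges (node -> predecessors).
deps = {
    "resource-groups": set(),
    "managed-identities": {"resource-groups"},
    "networking": {"resource-groups"},
    "workloads": {"managed-identities", "networking"},
}

# static_order() yields every concern after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Managed identities and networking have no edge between them, so they can deploy in parallel; only resource-groups and workloads are pinned to the ends.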
Full end-to-end flow
Blast radius analysis
The blast radius of any incident (bad apply, credential leak, state corruption) is bounded to exactly one cell in the matrix.
| What breaks | Impact |
|---|---|
| Bad apply in uksouth-prod--workloads | Only workloads in uksouth-prod affected |
| Leaked UAMI credential for eastus-prod | Can only access the eastus-prod subscription |
| State corruption in networking.tfstate | Only the networking layer in that one subscription |
| CI pipeline failure | fail-fast: false means other stacks continue |
| Adding wrong resource to wrong stack | terraform plan diff is small and easy to catch |
At scale, the metrics look like this:
| Dimension | Value |
|---|---|
| State files | N subscriptions × M concerns |
| OIDC credentials | 1 per subscription (not per concern) |
| GitHub Environments | 1 per subscription (federated credential target, prod approval gates) |
| Stored GitHub secrets | 0 (client, tenant, and subscription IDs are non-secret and live in subscriptions.yaml) |
| Files to edit to add a region | 1 (subscriptions.yaml) |
| Files to edit to add a concern | 1 (subscriptions.yaml) + create the stack folder |
Checklist for a new subscription or region
□ Create Azure subscription
□ Create resource group for Terraform state: rg-tfstate-{region}{env}
□ Create storage account in that RG: satfstate{region}{env}
□ Create blob container: tfstate
□ Create User-Assigned Managed Identity: tf-deployer-{region}-{env}
□ Assign UAMI: Contributor on subscription
□ Assign UAMI: Storage Blob Data Contributor on the state storage account
□ Create federated credential on UAMI pointing to GitHub org/repo/environment
□ Create GitHub Environment: {tenant}--{region}--{env} (with reviewers for prod)
□ Confirm no GitHub secrets are needed — client, tenant, and subscription IDs flow from subscriptions.yaml through the matrix
□ Add entry to infra/_config/subscriptions.yaml
□ Create stack folders: infra/stacks/{id}--{concern}/
□ Push — CI auto-discovers and deploys
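The Azure-side steps are scriptable. A hedged sketch that renders the core az commands from the checklist's naming conventions (the org/repo value is a placeholder, and the two role assignments are omitted because they need the identity's principalId back from the create step):

```python
def bootstrap_commands(tenant: str, region: str, env: str, org_repo: str) -> list[str]:
    """Render bootstrap az commands for one new subscription (illustrative)."""
    rg = f"rg-tfstate-{region}{env}"
    sa = f"satfstate{region}{env}"
    uami = f"tf-deployer-{region}-{env}"
    gh_env = f"{tenant}--{region}--{env}"  # GitHub environment name
    return [
        f"az group create --name {rg} --location {region}",
        f"az storage account create --name {sa} --resource-group {rg} "
        f"--location {region} --sku Standard_LRS",
        f"az storage container create --name tfstate --account-name {sa}",
        f"az identity create --name {uami} --resource-group {rg}",
        # Federated credential subject must match the GitHub environment
        f"az identity federated-credential create --name github --identity-name {uami} "
        f"--resource-group {rg} --issuer https://token.actions.githubusercontent.com "
        f"--subject repo:{org_repo}:environment:{gh_env} "
        f"--audiences api://AzureADTokenExchange",
    ]

for cmd in bootstrap_commands("tenant-c", "uksouth", "prod", "my-org/infra"):
    print(cmd)
```

Generating the commands instead of typing them keeps the storage account, resource group, UAMI, and GitHub environment names mutually consistent, which is what the federated credential subject depends on.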