
Terraform at Scale: Multi-Tenant Azure Deployment

Every multi-tenant Terraform project eventually hits the same wall. You start with a reasonable setup: a shared backend, one service principal, a handful of subscriptions. Then a new tenant gets added, then a second region, then someone needs production isolated from staging at the subscription level. The thing that made the original setup clean becomes the thing that makes it unmanageable.

The architecture I landed on after working through this problem treats each deployment as a unique cell in a four-dimensional matrix: tenant, region, environment, and concern layer. Every cell gets its own state file, its own backend, and its own OIDC credential. There's no shared state storage account. There's no credential that can cross subscription boundaries. The entire matrix is derived from a single YAML file.

The failure modes I was trying to avoid

Before getting into the design, the table below captures the naive approaches I've either tried or inherited, and what breaks in each one:

Approach                       What breaks
Shared state storage account   One credential can read all state across all tenants
Shared provider block          Wrong subscription gets targeted; blast radius is total
Per-env GitHub secrets         Doesn't scale past 20 subscriptions; no per-region isolation
Terraform workspaces           State is still in the same backend, and the provider is still shared
Deep nested folder structure   Requires cd-ing into hundreds of dirs; CI complexity explodes

Core design axioms

These aren't guidelines. They're constraints the architecture enforces structurally:

1. One state file per concern per subscription — never shared
2. Backend storage lives INSIDE the target subscription — not centrally
3. Provider credentials are injected at CI time — never hardcoded
4. OIDC only — no stored secrets, no client secrets, no SPN passwords
5. subscriptions.yaml is the single source of truth — CI derives everything from it
6. Cross-concern output sharing via Key Vault — not terraform_remote_state

The four-dimensional matrix

Every deployment unit is a unique cell in this matrix. Each cell is one Terraform root, one state file, and one OIDC credential.

Concretely, this generates a list like:

tenant-a × uksouth    × prod × resource-groups     → 1 stack
tenant-a × uksouth    × prod × managed-identities  → 1 stack
tenant-a × uksouth    × prod × networking          → 1 stack
tenant-a × uksouth    × prod × workloads           → 1 stack
tenant-a × eastus     × prod × resource-groups     → 1 stack
tenant-a × eastus     × prod × networking          → 1 stack
tenant-b × westeurope × prod × resource-groups     → 1 stack
...

Each row is a completely independent deployment with its own backend, provider, and credential.

Repository structure

I keep the structure flat. Stack folders are named with -- delimiters so they're machine-parseable without path traversal. No deep nesting, no environment subdirectories.

infra/
├── _modules/                      # Shared modules — no state, no providers
│   ├── resource-group/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── managed-identity/
│   ├── vnet/
│   └── subnet/
│
├── _config/
│   └── subscriptions.yaml         # 🔑 Single source of truth
│
├── stacks/                        # One folder = one Terraform root = one state
│   ├── tenant-a--uksouth--prod--resource-groups/
│   │   ├── main.tf                # What resources
│   │   ├── providers.tf           # Empty skeleton — vars injected by CI
│   │   ├── backend.tf             # Empty block — config injected at init
│   │   ├── variables.tf
│   │   └── terraform.tfvars       # Non-secret defaults only
│   │
│   ├── tenant-a--uksouth--prod--managed-identities/
│   ├── tenant-a--uksouth--prod--networking/
│   ├── tenant-a--uksouth--prod--workloads/
│   ├── tenant-a--eastus--prod--resource-groups/
│   ├── tenant-a--eastus--prod--networking/
│   ├── tenant-b--westeurope--prod--resource-groups/
│   └── ...
│
└── .github/
    └── workflows/
        ├── _tf-deploy.yaml        # Reusable workflow (never edited)
        └── dispatch.yaml          # Matrix generator (reads subscriptions.yaml)
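Because every stack folder encodes its matrix coordinates in the name, any script can recover the cell without walking a directory tree. A minimal sketch of what parsing looks like (the helper and its field names are mine, not part of the repo):

```python
def parse_stack_id(folder: str) -> dict:
    """Split a '--'-delimited stack folder name into its matrix coordinates."""
    tenant, region, environment, concern = folder.split("--")
    return {
        "tenant": tenant,
        "region": region,
        "environment": environment,
        "concern": concern,
    }

print(parse_stack_id("tenant-a--uksouth--prod--networking"))
# → {'tenant': 'tenant-a', 'region': 'uksouth', 'environment': 'prod', 'concern': 'networking'}
```

This is why the single `--` delimiter matters: tenant names can contain single hyphens without ambiguity.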

State isolation strategy

The rule I don't bend on: the storage account for a subscription's state lives in that subscription. The UAMI that deploys into uksouth-prod has Storage Blob Data Contributor only on satfstateuksouthprod. It cannot reach satfstateeastusprod at all.
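The backend names follow a deterministic convention derived from region and environment, so nothing needs to be looked up at deploy time. A sketch of the convention as I read it from the names above (the helper is hypothetical, and note the config later abbreviates longer regions, e.g. westeurope → we, to stay under Azure's 24-character storage account limit — this sketch only handles the unabbreviated case):

```python
def backend_names(region: str, env: str) -> tuple[str, str]:
    # Hypothetical helper mirroring the naming seen in subscriptions.yaml.
    # Azure storage account names: lowercase alphanumerics only, max 24 chars.
    sa = f"satfstate{region}{env}"
    rg = f"rg-tfstate-{region}{env}"
    assert len(sa) <= 24 and sa.isalnum() and sa.islower()
    return sa, rg

print(backend_names("uksouth", "prod"))
# → ('satfstateuksouthprod', 'rg-tfstate-uksouthprod')
```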

Provider and authentication strategy

Provider configuration per stack

Every stack has an identical, credential-free provider skeleton. Nothing is hardcoded. The ARM_* environment variables are injected by CI at runtime.

# stacks/tenant-a--uksouth--prod--networking/providers.tf
terraform {
  required_providers {
    azurerm = { source = "hashicorp/azurerm", version = "~> 3.0" }
  }
}

# Nothing hardcoded — ARM_* env vars injected by CI
provider "azurerm" {
  features {}
  use_oidc = true
}

The backend file is intentionally empty. The backend config is injected with -backend-config flags at terraform init time.

# stacks/tenant-a--uksouth--prod--networking/backend.tf
terraform {
  backend "azurerm" {}
}

OIDC credential flow

The token lifecycle is scoped to the job. When the job ends, the token expires. There are no stored secrets anywhere in this flow.

One UAMI per subscription

Each GitHub environment maps to exactly one UAMI via a federated credential. The UAMI's role assignments are scoped to a single subscription.
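The binding between a GitHub environment and a UAMI is the federated credential's subject claim. The `repo:{org}/{repo}:environment:{name}` shape below is GitHub's documented OIDC subject format for environment-scoped jobs; the helper itself is just an illustration:

```python
def github_environment_subject(org: str, repo: str, environment: str) -> str:
    """Subject claim GitHub's OIDC token presents for a job bound to an environment."""
    return f"repo:{org}/{repo}:environment:{environment}"

# One federated credential per environment, e.g.:
print(github_environment_subject("my-org", "infra", "tenant-a--uksouth--prod"))
# → repo:my-org/infra:environment:tenant-a--uksouth--prod
```

Azure matches this subject exactly when exchanging the GitHub token for an ARM token, which is what keeps a credential from being usable outside its environment.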

Source of truth: subscriptions.yaml

This is the only file I touch to add a tenant, region, or environment. CI reads it and generates the entire deployment matrix. Adding a new region means adding one entry here. Done.

# infra/_config/subscriptions.yaml
subscriptions:
  - id: tenant-a--uksouth--prod
    tenant_id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    subscription_id: "11111111-2222-3333-4444-555555555555"
    region: uksouth
    environment: prod
    client_id: "uami-client-id-uksouth-prod"
    backend_sa: "satfstateuksouthprod"
    backend_rg: "rg-tfstate-uksouthprod"
    concerns:
      - resource-groups
      - managed-identities
      - networking
      - workloads

  - id: tenant-a--eastus--prod
    tenant_id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    subscription_id: "22222222-3333-4444-5555-666666666666"
    region: eastus
    environment: prod
    client_id: "uami-client-id-eastus-prod"
    backend_sa: "satfstateeastusprod"
    backend_rg: "rg-tfstate-eastusprod"
    concerns:
      - resource-groups
      - networking

  - id: tenant-b--westeurope--prod
    tenant_id: "ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj"
    subscription_id: "33333333-4444-5555-6666-777777777777"
    region: westeurope
    environment: prod
    client_id: "uami-client-id-we-prod"
    backend_sa: "satfstateweprod"
    backend_rg: "rg-tfstate-weprod"
    concerns:
      - resource-groups
      - networking
      - workloads

GitHub Actions pipeline

Overall pipeline architecture

I cap parallel jobs at 20. That's the point where ARM API throttling becomes a problem across multiple subscriptions simultaneously. I've hit it.

Matrix generation (dispatch.yaml)

The Python script runs inside the CI job and produces a JSON matrix that GitHub Actions fans out across parallel jobs. The matrix items are derived entirely from subscriptions.yaml — there's no other configuration to maintain.

# .github/workflows/dispatch.yaml
name: Terraform Dispatch

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  generate-matrix:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.matrix.outputs.result }}
    steps:
      - uses: actions/checkout@v4

      - name: Generate deployment matrix
        id: matrix
        run: |
          # Redirect the script's stdout into GITHUB_OUTPUT so the
          # "result=..." line actually becomes a step output.
          python3 - <<'EOF' >> "$GITHUB_OUTPUT"
          import yaml, json

          with open("infra/_config/subscriptions.yaml") as f:
              cfg = yaml.safe_load(f)

          items = []
          for sub in cfg["subscriptions"]:
              for concern in sub["concerns"]:
                  items.append({
                      "stack_id": f"{sub['id']}--{concern}",
                      "working_dir": f"infra/stacks/{sub['id']}--{concern}",
                      "subscription_id": sub["subscription_id"],
                      "tenant_id": sub["tenant_id"],
                      "client_id": sub["client_id"],
                      "backend_sa": sub["backend_sa"],
                      "backend_rg": sub["backend_rg"],
                      "concern": concern,
                      "environment": sub["environment"],
                  })

          print(f"result={json.dumps({'include': items})}")
          EOF

  deploy:
    needs: generate-matrix
    strategy:
      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
      fail-fast: false
      max-parallel: 20
    uses: ./.github/workflows/_tf-deploy.yaml
    with:
      working_dir: ${{ matrix.working_dir }}
      backend_sa: ${{ matrix.backend_sa }}
      backend_rg: ${{ matrix.backend_rg }}
      concern: ${{ matrix.concern }}
      environment: ${{ matrix.environment }}
    secrets:
      arm_client_id: ${{ matrix.client_id }}
      arm_tenant_id: ${{ matrix.tenant_id }}
      arm_subscription_id: ${{ matrix.subscription_id }}

fail-fast: false is non-negotiable here. Without it, a failure in one subscription cancels all other in-flight jobs. The isolation is only useful if a failure in one cell doesn't stop everything else.

Reusable deploy workflow (_tf-deploy.yaml)

This is the workflow that never gets edited directly. It receives inputs from the matrix and handles the actual init, plan, and apply. The -backend-config flags at init time are how the empty backend.tf skeleton gets wired to the correct storage account.

# .github/workflows/_tf-deploy.yaml
name: Terraform Deploy (Reusable)

on:
  workflow_call:
    inputs:
      working_dir: { type: string, required: true }
      backend_sa: { type: string, required: true }
      backend_rg: { type: string, required: true }
      concern: { type: string, required: true }
      environment: { type: string, required: true }
    secrets:
      arm_client_id: { required: true }
      arm_tenant_id: { required: true }
      arm_subscription_id: { required: true }

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.concern }}   # GitHub env for prod approval gates
    permissions:
      id-token: write   # Required for OIDC
      contents: read
    env:
      ARM_CLIENT_ID: ${{ secrets.arm_client_id }}
      ARM_TENANT_ID: ${{ secrets.arm_tenant_id }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.arm_subscription_id }}
      ARM_USE_OIDC: "true"
    defaults:
      run:
        working-directory: ${{ inputs.working_dir }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: |
          terraform init \
            -backend-config="storage_account_name=${{ inputs.backend_sa }}" \
            -backend-config="resource_group_name=${{ inputs.backend_rg }}" \
            -backend-config="container_name=tfstate" \
            -backend-config="key=${{ inputs.concern }}.tfstate" \
            -backend-config="use_oidc=true"

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -input=false -out=tfplan

      - name: Terraform Apply
        run: terraform apply -input=false tfplan

Cross-concern dependency resolution

I don't use terraform_remote_state. Reading another stack's state requires credentials for that stack's backend storage account. That means cross-subscription auth, which breaks the isolation model entirely.

Instead, the networking stack writes its outputs into Azure Key Vault inside the same subscription. The workloads stack reads from that same Key Vault using the same UAMI it already has. No new credentials, no cross-subscription reads.

The pattern in code is straightforward. The networking stack writes outputs to Key Vault secrets. The workloads stack reads those secrets as data sources.

# networking/outputs-to-kv.tf
resource "azurerm_key_vault_secret" "subnet_app_id" {
  name         = "subnet-app-id"
  value        = azurerm_subnet.app.id
  key_vault_id = data.azurerm_key_vault.this.id
}

# workloads/inputs-from-kv.tf
data "azurerm_key_vault_secret" "subnet_app_id" {
  name         = "subnet-app-id"
  key_vault_id = data.azurerm_key_vault.this.id
}

resource "azurerm_linux_virtual_machine_scale_set" "app" {
  # ...
  network_interface {
    ip_configuration {
      subnet_id = data.azurerm_key_vault_secret.subnet_app_id.value
    }
  }
}

Concern deploy order

Within a subscription, concerns have hard dependencies. These are enforced with needs: in the GitHub Actions DAG. Across subscriptions, everything deploys in parallel.
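A sketch of what that per-subscription DAG would look like expressed as jobs (the job names are illustrative, and the `uses:`/input wiring is elided):

```yaml
jobs:
  resource-groups:
    uses: ./.github/workflows/_tf-deploy.yaml
    # ...
  managed-identities:
    needs: resource-groups
    # ...
  networking:
    needs: resource-groups
    # ...
  workloads:
    needs: [managed-identities, networking]   # waits on both
    # ...
```

Because `needs:` only links jobs within one subscription's slice of the matrix, two subscriptions never wait on each other.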

Full end-to-end flow

Blast radius analysis

The blast radius of any incident (bad apply, credential leak, state corruption) is bounded to exactly one cell in the matrix.

What breaks                               Impact
Bad apply in uksouth-prod--workloads      Only workloads in uksouth-prod affected
Leaked UAMI credential for eastus-prod    Can only access the eastus-prod subscription
State corruption in networking.tfstate    Only the networking layer in that one subscription
CI pipeline failure                       fail-fast: false means other stacks continue
Adding a resource to the wrong stack      terraform plan diff is small and easy to catch

At scale, the metrics look like this:

Dimension                        Value
State files                      N subscriptions × M concerns
OIDC credentials                 1 per subscription (not per concern)
GitHub Environments              1 per concern (for prod approval gates)
GitHub Secrets                   3 per subscription (client_id, tenant_id, sub_id)
Files to edit to add a region    1 (subscriptions.yaml)
Files to edit to add a concern   1 (subscriptions.yaml) + create the stack folder
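For the three subscriptions in the sample subscriptions.yaml, the arithmetic works out as follows. This is just a sanity check of the N × M claim, not part of the pipeline:

```python
# Concern counts taken from the sample subscriptions.yaml above.
concerns_per_sub = {
    "tenant-a--uksouth--prod": 4,     # resource-groups, managed-identities, networking, workloads
    "tenant-a--eastus--prod": 2,      # resource-groups, networking
    "tenant-b--westeurope--prod": 3,  # resource-groups, networking, workloads
}

state_files = sum(concerns_per_sub.values())  # one state file per concern per subscription
oidc_credentials = len(concerns_per_sub)      # one UAMI per subscription
github_secrets = 3 * len(concerns_per_sub)    # client_id, tenant_id, sub_id each

print(state_files, oidc_credentials, github_secrets)  # → 9 3 9
```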

Checklist for a new subscription or region

□ Create Azure subscription
□ Create resource group for Terraform state: rg-tfstate-{region}{env}
□ Create storage account in that RG: satfstate{region}{env}
□ Create blob container: tfstate
□ Create User-Assigned Managed Identity: tf-deployer-{region}-{env}
□ Assign UAMI: Contributor on subscription
□ Assign UAMI: Storage Blob Data Contributor on the state storage account
□ Create federated credential on UAMI pointing to GitHub org/repo/environment
□ Create GitHub Environment: {tenant}--{region}--{env} (with reviewers for prod)
□ Add environment secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
□ Add entry to infra/_config/subscriptions.yaml
□ Create stack folders: infra/stacks/{id}--{concern}/
□ Push — CI auto-discovers and deploys