# Terraform Shared State at Scale
At some point, every enterprise landing zone project has to answer this question: one state file per landing zone, or shared state files grouped by concern? I built a progressive load test to find out where shared state breaks. I scaled from 180 to 360 Azure landing zones across 15 shared Terraform stacks, and the answer turned out to be more concrete than "it depends."
## The test harness
I created three Terraform modules representing a minimal landing zone:
- `resource-groups` — one `azurerm_resource_group` per landing zone, tagged for tracking
- `entra-groups` — three `azuread_group` resources per landing zone (contributors, readers, admins)
- `rbac` — three `azurerm_role_assignment` resources per landing zone, binding Entra groups to resource groups with Contributor, Reader, and Owner roles
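As a concrete sketch, the rbac module's fan-out might look like this (variable names and structure here are assumptions for illustration, not the actual module code):

```hcl
# Hypothetical sketch of the rbac module; variable names are assumptions.
variable "landing_zone" {
  type = string
}

variable "resource_group_id" {
  type = string
}

variable "group_object_ids" {
  # Map of group key (contributors/readers/admins) to Entra object ID,
  # produced by the entra-groups stack.
  type = map(string)
}

locals {
  # One role assignment per group, as described above.
  role_map = {
    contributors = "Contributor"
    readers      = "Reader"
    admins       = "Owner"
  }
}

resource "azurerm_role_assignment" "this" {
  for_each             = local.role_map
  scope                = var.resource_group_id
  role_definition_name = each.value
  principal_id         = var.group_object_ids[each.key]
}
```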
The landing zones span two regions (Australia East, New Zealand North) and three environments (prod, dev, staging), giving six region-environment buckets.
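The six buckets fall out naturally from a `setproduct` over the two lists; a minimal sketch (variable names are assumptions):

```hcl
variable "regions" {
  type    = list(string)
  default = ["australiaeast", "newzealandnorth"]
}

variable "environments" {
  type    = list(string)
  default = ["prod", "dev", "staging"]
}

locals {
  # setproduct yields every region-environment pair: 2 x 3 = 6 buckets.
  buckets = [
    for pair in setproduct(var.regions, var.environments) :
    "${pair[0]}-${pair[1]}"
  ]
}
```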
## The architecture: shared state by concern
Instead of giving each landing zone its own Terraform state, I grouped them into 15 shared stacks:
```
stacks/
  resource-groups/        ← 6 stacks (region × environment)
    australiaeast/prod/   → N LZs via for_each
    australiaeast/dev/
    ...
  entra-groups/           ← 3 stacks (environment only, Entra is global)
    prod/                 → 2N LZs via for_each
    dev/
    staging/
  rbac/                   ← 6 stacks (region × environment)
    australiaeast/prod/   → N LZs via for_each
    ...
```
The stack count stays constant at 15 regardless of how many landing zones I provision. Each stack uses `for_each` over a generated list of landing zone names, and the RBAC stacks read upstream outputs via `terraform_remote_state`.
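A sketch of the wiring inside one RBAC stack, assuming an azurerm backend and an `rg_ids` output on the upstream stack (both the backend details and the output name are assumptions, not the actual configuration):

```hcl
# Read the matching resource-groups stack's state (backend details assumed).
data "terraform_remote_state" "resource_groups" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "resource-groups/australiaeast/prod.tfstate"
  }
}

variable "landing_zones" {
  type = list(string)
}

# Fan out over the generated landing zone names.
module "rbac" {
  source   = "../../modules/rbac"
  for_each = toset(var.landing_zones)

  landing_zone      = each.value
  resource_group_id = data.terraform_remote_state.resource_groups.outputs.rg_ids[each.value]
}
```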
I orchestrated everything with a Node.js harness that runs stacks in dependency order — resource-groups and entra-groups in parallel first, then RBAC — and captures per-stack timing, warnings, and errors.
## 180 landing zones: clean success
At 30 landing zones per region-environment combination, everything worked perfectly:
| Metric | Value |
|---|---|
| Total landing zones | 180 |
| Total stacks | 15 |
| Wall clock time | 819 seconds (~14 minutes) |
| Cumulative Terraform time | 4,325 seconds |
| Parallelism gain | 5.3x |
| Failures | 0 |
The per-concern breakdown told an interesting story:
| Concern | Avg Duration | Min | Max |
|---|---|---|---|
| resource-groups | 148s | 100s | 244s |
| entra-groups | 488s | 419s | 557s |
| rbac | 329s | 278s | 366s |
At 180 landing zones the slowest concern was Entra groups, not RBAC. Creating security groups through the Microsoft Graph API gains little from Terraform's parallelism; the calls behave almost sequentially. But everything completed well within Azure's timeout windows.
The 5.3x parallelism gain confirmed the core thesis: by splitting concerns into independent stacks, I could run all six resource-group stacks and all three entra-group stacks simultaneously, then all six RBAC stacks simultaneously, instead of processing 180 landing zones one at a time.
## 360 landing zones: RBAC hit a wall
Doubling the landing zone count to 360 is where things got real. Resource-groups and entra-groups scaled fine. But every single RBAC stack failed.
| Phase | Stacks | Result |
|---|---|---|
| resource-groups | 6/6 | All succeeded |
| entra-groups | 3/3 | All succeeded |
| rbac | 0/6 | All failed |
Each RBAC stack ran for approximately 8.8 hours before dying with errors like:

```
failed waiting for Role Assignment to finish replicating: timeout
context deadline exceeded
```
Each RBAC stack was trying to create 180 role assignments (60 landing zones times 3 roles). Terraform's default parallelism of 10 meant it was firing 10 concurrent azurerm_role_assignment creates against the same Azure subscription. With six RBAC stacks running simultaneously, that's up to 60 concurrent role assignment API calls hitting Azure's ARM RBAC replication service.
Azure's role assignment API has an internal replication step. After creating an assignment, Azure waits for it to propagate across its distributed authorization store. At 60 concurrent operations, the replication queue backed up, individual assignments took longer than the 30-minute replication timeout, and Terraform gave up.
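One resource-level mitigation: most azurerm resources, including `azurerm_role_assignment`, accept a `timeouts` block, so the create timeout can be raised above the 30-minute default. This buys headroom rather than reducing API pressure; a sketch, with the referenced variables assumed:

```hcl
resource "azurerm_role_assignment" "example" {
  scope                = var.resource_group_id # assumed variable
  role_definition_name = "Reader"
  principal_id         = var.reader_group_id   # assumed variable

  # The provider's default create timeout is 30 minutes, which the
  # backed-up replication queue exceeded at 360 landing zones.
  timeouts {
    create = "60m"
  }
}
```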
## What this tells us about shared state
The failure at 360 landing zones wasn't a fundamental flaw in the shared state architecture. It was a specific Azure API bottleneck with a specific resource type.
### Where shared state wins
**Operational simplicity.** Fifteen stacks means 15 `terraform init` calls, 15 state files to back up, 15 pipelines to maintain. At 100 landing zones per bucket, a per-LZ architecture would need 1,800 stacks. That's 1,800 inits, 1,800 state files, and a CI/CD system that can dynamically manage 1,800 pipeline instances.

**Fast drift detection.** Running `terraform plan` across 15 stacks gives me a complete picture of every landing zone in minutes. With per-LZ state, I'd need to plan 1,800 stacks. Even with parallelism, that's a significant CI runner cost.

**Efficient provisioning.** At 180 landing zones, the entire estate applied in 14 minutes wall clock. Terraform's `for_each` within a single state handles the fan-out inside each stack efficiently, and the orchestrator handles the fan-out across stacks.

**Constant overhead.** Adding landing zone 181 doesn't add a new stack. It adds one entry to an existing `for_each` set. No new init, no new pipeline, no new state file.
### Where shared state hurts
**Blast radius.** A bad variable or module change in the resource-groups stack for australiaeast/prod affects every landing zone in that bucket. With per-LZ state, a bad change breaks only one landing zone.

**API rate limiting exposure.** This is what killed the 360-LZ run. When a single Terraform apply creates N resources of the same type against the same API endpoint, you're far more likely to hit rate limits than if those resources were spread across N independent applies with natural timing gaps between them.

**State file growth.** At 100 landing zones per stack, the RBAC state file contains 300 role assignments plus their data sources. It's manageable, but plans get slower as the state grows.

**Lock contention in CI/CD.** If two PRs both modify landing zones in the same bucket, the second has to wait for the first to release the state lock. Per-LZ state has zero cross-LZ contention.
## When I'd pick per-LZ state instead
Per-LZ state is the better fit when:

- I'm in a regulated environment where blast radius isolation is a compliance requirement
- landing zones are modified independently and frequently (shared state forces replanning the whole bucket)
- a mature platform team runs CI/CD that can manage thousands of dynamic stacks through Terragrunt or Spacelift
- individual landing zone owners need to run their own plans without affecting others
## Fixing the RBAC bottleneck
The shared state architecture doesn't need to be abandoned. It needs refinement at the RBAC layer. Three approaches, in order of invasiveness:
1. **Reduce Terraform parallelism for RBAC stacks.** Adding `-parallelism=5` or even `-parallelism=3` to RBAC applies would cut concurrent ARM API calls from 60 to 18-30. This is a one-line change in the orchestrator.
2. **Stagger RBAC stack execution.** Instead of running all 6 RBAC stacks in parallel, run 2-3 at a time. This halves the concurrent API pressure without changing any Terraform code.
3. **Shard RBAC more finely.** Split each region-environment RBAC stack into batches of 20-30 landing zones. This increases the stack count from 15 to around 30 but keeps each stack's API footprint well within Azure's comfort zone.
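The third option could lean on Terraform's `chunklist` function: split the landing-zone list into fixed-size batches and give each batch its own stack. A sketch, where the `batch_index` variable is an assumption about how the orchestrator would select a shard:

```hcl
variable "landing_zones" {
  type = list(string)
}

variable "batch_index" {
  # Which shard this stack owns; set by the orchestrator (assumed).
  type = number
}

locals {
  # chunklist splits the list into batches of at most 30 names,
  # so a 60-LZ bucket becomes two shards.
  batches = chunklist(var.landing_zones, 30)
}

module "rbac" {
  source   = "../../modules/rbac"
  for_each = toset(local.batches[var.batch_index])

  landing_zone = each.value
}
```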
## The numbers that matter
| Metric | 180 LZs | 360 LZs |
|---|---|---|
| Stacks | 15 | 15 |
| resource-groups | succeeded (148s avg) | succeeded |
| entra-groups | succeeded (488s avg) | succeeded |
| rbac | succeeded (329s avg) | failed (~8.8 hrs) |
| Bottleneck | entra-groups (Graph API) | rbac (ARM replication) |
| Resources per RBAC stack | 90 | 180 |
| Concurrent ARM calls (peak) | ~30 | ~60 |
The inflection point is somewhere between 90 and 180 role assignments per stack, with 6 stacks running concurrently. Azure's ARM RBAC replication can handle 30 concurrent role assignment operations across a subscription but not 60.
## My take
Shared state by concern is a sound architecture for Azure landing zones at scale. It dramatically reduces operational complexity compared to per-LZ state, and it works well up to at least 180 landing zones with the simple three-concern model I tested.
The ceiling isn't the architecture. It's Azure's API rate limits on specific resource types, particularly RBAC role assignments. With straightforward throttling or finer sharding of the RBAC concern, I believe this architecture can comfortably handle 600+ landing zones across the 15-stack structure.
If I were building a landing zone vending machine today, I'd start with shared state by concern and only move to per-LZ state if blast radius isolation became a hard requirement. The operational cost of managing thousands of independent state files is real, and most teams underestimate it.