
Terraform Shared State at Scale

At some point, every enterprise landing zone project has to answer this question: one state file per landing zone, or shared state files grouped by concern? I built a progressive load test to find out where shared state breaks. I scaled from 180 to 360 Azure landing zones across 15 shared Terraform stacks, and the answer turned out to be more nuanced than "it depends."

The test harness

I created three Terraform modules representing a minimal landing zone:

  • resource-groups — one azurerm_resource_group per landing zone, tagged for tracking
  • entra-groups — three azuread_group resources per landing zone (contributors, readers, admins)
  • rbac — three azurerm_role_assignment resources per landing zone, binding Entra groups to resource groups with Contributor, Reader, and Owner roles
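
The rbac module is where the trouble starts later, so it's worth sketching. A minimal version, assuming the module receives a resource group ID and the three Entra group object IDs per landing zone (variable names and shapes are illustrative, not the exact test code):

```hcl
# Illustrative rbac module: one Contributor, Reader, and Owner role
# assignment per landing zone. The variable shape is an assumption.
variable "landing_zones" {
  type = map(object({
    resource_group_id = string
    group_object_ids  = map(string) # keys: contributors, readers, admins
  }))
}

locals {
  role_for_group = {
    contributors = "Contributor"
    readers      = "Reader"
    admins       = "Owner"
  }

  # Flatten to one entry per (landing zone, role) pair.
  assignments = merge([
    for lz_name, lz in var.landing_zones : {
      for group, role in local.role_for_group :
      "${lz_name}-${group}" => {
        scope        = lz.resource_group_id
        role         = role
        principal_id = lz.group_object_ids[group]
      }
    }
  ]...)
}

resource "azurerm_role_assignment" "this" {
  for_each             = local.assignments
  scope                = each.value.scope
  role_definition_name = each.value.role
  principal_id         = each.value.principal_id
}
```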

The landing zones span two regions (Australia East, New Zealand North) and three environments (prod, dev, staging), giving six region-environment buckets.

The architecture: shared state by concern

Instead of giving each landing zone its own Terraform state, I grouped them into 15 shared stacks:

stacks/
  resource-groups/        ← 6 stacks (region × environment)
    australiaeast/prod/   → N LZs via for_each
    australiaeast/dev/
    ...
  entra-groups/           ← 3 stacks (environment only, Entra is global)
    prod/                 → 2N LZs via for_each
    dev/
    staging/
  rbac/                   ← 6 stacks (region × environment)
    australiaeast/prod/   → N LZs via for_each
    ...

The stack count stays constant at 15 regardless of how many landing zones I provision. Each stack uses for_each over a generated list of landing zone names, and the RBAC stacks read upstream outputs via terraform_remote_state.
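
The wiring between stacks looks roughly like this — a sketch of one RBAC stack root, assuming an azurerm backend and upstream output names like resource_group_ids and group_object_ids (all of which are illustrative):

```hcl
# Read upstream outputs from the resource-groups and entra-groups stacks.
# Backend config values and output names are assumptions for illustration.
data "terraform_remote_state" "resource_groups" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "resource-groups/australiaeast/prod.tfstate"
  }
}

data "terraform_remote_state" "entra_groups" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "entra-groups/prod.tfstate"
  }
}

# One for_each over the generated landing zone names; the stack count
# stays at 15 no matter how long this list grows.
variable "landing_zone_names" {
  type = set(string)
}

module "rbac" {
  source            = "../../modules/rbac"
  for_each          = var.landing_zone_names
  landing_zone_name = each.key
  resource_group_id = data.terraform_remote_state.resource_groups.outputs.resource_group_ids[each.key]
  group_object_ids  = data.terraform_remote_state.entra_groups.outputs.group_object_ids[each.key]
}
```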

I orchestrated everything with a Node.js harness that runs stacks in dependency order — resource-groups and entra-groups in parallel first, then RBAC — and captures per-stack timing, warnings, and errors.

180 landing zones: clean success

At 30 landing zones per region-environment combination, everything worked perfectly:

| Metric | Value |
|---|---|
| Total landing zones | 180 |
| Total stacks | 15 |
| Wall clock time | 819 seconds (~14 minutes) |
| Cumulative Terraform time | 4,325 seconds |
| Parallelism gain | 5.3x |
| Failures | 0 |

The per-concern breakdown told an interesting story:

| Concern | Avg Duration | Min | Max |
|---|---|---|---|
| resource-groups | 148s | 100s | 244s |
| entra-groups | 488s | 419s | 557s |
| rbac | 329s | 278s | 366s |

At this scale, Entra groups, not RBAC, were the slowest concern. The Microsoft Graph API for creating security groups feels inherently sequential even with Terraform's parallelism. Still, everything completed well within Azure's timeout windows.

The 5.3x parallelism gain confirmed the core thesis: by splitting concerns into independent stacks, I could run all six resource-group stacks and all three entra-group stacks simultaneously, then all six RBAC stacks simultaneously, instead of processing 180 landing zones one at a time.

360 landing zones: RBAC hit a wall

Doubling the landing zone count to 360 is where things got real. Resource-groups and entra-groups scaled fine. But every single RBAC stack failed.

| Phase | Stacks | Result |
|---|---|---|
| resource-groups | 6/6 | All succeeded |
| entra-groups | 3/3 | All succeeded |
| rbac | 0/6 | All failed |

Each RBAC stack ran for approximately 8.8 hours before dying with errors like:

    failed waiting for Role Assignment to finish replicating: timeout
    context deadline exceeded

Each RBAC stack was trying to create 180 role assignments (60 landing zones times 3 roles). Terraform's default parallelism of 10 meant it was firing 10 concurrent azurerm_role_assignment creates against the same Azure subscription. With six RBAC stacks running simultaneously, that's up to 60 concurrent role assignment API calls hitting Azure's ARM RBAC replication service.

Azure's role assignment API has an internal replication step. After creating an assignment, Azure waits for it to propagate across its distributed authorization store. At 60 concurrent operations, the replication queue backed up, individual assignments took longer than the 30-minute replication timeout, and Terraform gave up.

What this tells us about shared state

The failure at 360 landing zones wasn't a fundamental flaw in the shared state architecture. It was a specific Azure API bottleneck with a specific resource type.

Where shared state wins

Operational simplicity. Fifteen stacks means 15 terraform init calls, 15 state files to back up, 15 pipelines to maintain. At 100 landing zones per bucket, a per-LZ architecture would need 1,800 stacks. That's 1,800 inits, 1,800 state files, and a CI/CD system that can dynamically manage 1,800 pipeline instances.

Fast drift detection. Running terraform plan across 15 stacks gives me a complete picture of every landing zone in minutes. With per-LZ state, I'd need to plan 1,800 stacks. Even with parallelism, that's a significant CI runner cost.

Efficient provisioning. At 180 landing zones, the entire estate applied in 14 minutes wall clock. Terraform's for_each within a single state handles the fan-out efficiently, and the orchestrator handles the fan-out across stacks.

Constant overhead. Adding landing zone 181 doesn't add a new stack. It adds one entry to an existing for_each set. No new init, no new pipeline, no new state file.
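
Concretely, onboarding landing zone 181 is a one-line diff in the generated input. A sketch, with an illustrative variable name and naming scheme:

```hcl
# Generated list consumed by the for_each in each stack.
# Names follow an assumed lz-<region>-<env>-<nnn> convention.
variable "landing_zone_names" {
  type = set(string)
  default = [
    "lz-aue-prod-001",
    "lz-aue-prod-002",
    # ... existing entries ...
    "lz-aue-prod-181", # the new landing zone: one line, no new stack,
                       # no new state file, no new pipeline
  ]
}
```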

Where shared state hurts

Blast radius. A bad variable or module change in the resource-groups stack for australiaeast/prod affects every landing zone in that bucket. With per-LZ state, a bad change only breaks one landing zone.

API rate limiting exposure. This is what killed the 360-LZ run. When a single Terraform apply creates N resources of the same type against the same API endpoint, you're much more likely to hit rate limits than if those same resources were spread across N independent applies with natural timing gaps between them.

State file growth. At 100 landing zones per stack, the RBAC state file contains 300 role assignments plus their data sources. It's manageable but plans get slower as the state grows.

Lock contention in CI/CD. If two PRs both modify landing zones in the same bucket, the second one has to wait for the first to release the state lock. Per-LZ state has zero cross-LZ contention.

When I'd pick per-LZ state instead

Per-LZ state is better when I'm in a regulated environment where blast radius isolation is a compliance requirement, when landing zones are modified independently and frequently (shared state forces replanning the whole bucket), when I have a mature platform team with CI/CD that can manage thousands of dynamic stacks through Terragrunt or Spacelift, or when individual landing zone owners need to run their own plans without affecting others.

Fixing the RBAC bottleneck

The shared state architecture doesn't need to be abandoned. It needs refinement at the RBAC layer. Three approaches, in order of invasiveness:

  1. Reduce Terraform parallelism for RBAC stacks. Adding -parallelism=5 or even -parallelism=3 to RBAC applies would cut concurrent ARM API calls across the six stacks from 60 down to 18-30. This is a one-line change in the orchestrator.

  2. Stagger RBAC stack execution. Instead of running all 6 RBAC stacks in parallel, run 2-3 at a time. This cuts the concurrent API pressure by half or more without changing any Terraform code.

  3. Shard RBAC more finely. Split each region-environment RBAC stack into batches of 20-30 landing zones. This increases stack count from 15 to around 30 but keeps each stack's API footprint well within Azure's comfort zone.
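
Option 3 falls out of the existing for_each structure almost for free. A sketch using chunklist to carve the generated name list into batches of 30, with shard_index as an illustrative per-stack variable:

```hcl
# Finer RBAC sharding: each stack instance owns one 30-landing-zone
# batch of its region-environment bucket. The batch size and variable
# names are illustrative assumptions.
variable "landing_zone_names" {
  type = list(string)
}

variable "shard_index" {
  type = number
}

locals {
  shards     = chunklist(var.landing_zone_names, 30)
  this_shard = local.shards[var.shard_index]
}

module "rbac" {
  source            = "../../modules/rbac"
  for_each          = toset(local.this_shard)
  landing_zone_name = each.key
}
```

At the 360-LZ scale, each 60-landing-zone bucket becomes two shards of 90 role assignments apiece — back under the load that succeeded cleanly at 180 landing zones.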

The numbers that matter

| Metric | 180 LZs | 360 LZs |
|---|---|---|
| Stacks | 15 | 15 |
| resource-groups | 14 min | succeeded |
| entra-groups | 14 min | succeeded |
| rbac | 14 min | failed (8.8 hrs) |
| Bottleneck | entra-groups (Graph API) | rbac (ARM replication) |
| Resources per RBAC stack | 90 | 180 |
| Concurrent ARM calls (peak) | ~30 | ~60 |

The inflection point is somewhere between 90 and 180 role assignments per stack, with 6 stacks running concurrently. Azure's ARM RBAC replication can handle 30 concurrent role assignment operations across a subscription but not 60.

My take

Shared state by concern is a sound architecture for Azure landing zones at scale. It dramatically reduces operational complexity compared to per-LZ state, and it works well up to at least 180 landing zones with the simple three-concern model I tested.

The ceiling isn't the architecture. It's Azure's API rate limits on specific resource types, particularly RBAC role assignments. With straightforward throttling or finer sharding of the RBAC concern, I believe this architecture can comfortably handle 600+ landing zones across the 15-stack structure.

If I were building a landing zone vending machine today, I'd start with shared state by concern and only move to per-LZ state if blast radius isolation became a hard requirement. The operational cost of managing thousands of independent state files is real, and most teams underestimate it.