Policy as Code

Most Azure Policy programs start with manual portal assignments. Someone adds a policy here, an initiative there, and over time you end up with hundreds of policies deployed across tenants with no source of truth for where they came from, no test coverage, and no way to safely promote changes. I've inherited that state more than once.

Policy as code solves this by treating Azure Policy like any other infrastructure: defined in files, reviewed through pull requests, tested in CI, and deployed through a controlled pipeline.

The problem with testing policies against what exists

The mistake I see most often in policy testing is tests that are auto-generated from the current Azure state. The test says "check that policy X is assigned to management group Y", which is exactly what the deployment just did. That is circular: the test confirms what the pipeline created rather than checking that the right thing was created.

The tests I want are the ones that fail if I haven't deployed correctly — not the ones that pass because I just read back what I wrote.

What I do instead:

Write tests before deployment that describe the intent of the policy (what it should allow, what it should deny)
Validate policy definition files in CI (JSON syntax, required fields, approved effects) before any assignment
Run compliance scans in a non-production management group after deployment to verify effect before promoting to production

Preventive vs detective controls

Every policy effect has a different role:

Effect	Type	When I use it
`deny`	Preventive	Hard requirements: resource types that must never be created, regions that aren't approved
`audit`	Detective	Hygiene standards: tags, naming, config settings I want visibility on but don't want to block
`modify`	Remediation	Auto-fix: adding missing tags, setting default properties on creation
`deployIfNotExists`	Remediation	Auto-provision: enabling diagnostics, deploying agents
`auditIfNotExists`	Detective	Detect missing child resources without blocking the parent

I default to audit when rolling out any new policy, then promote to deny after a remediation cycle has cleared the estate.
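One way to make that promotion a small, reviewable change is to parameterise the effect rather than hard-code it. The following is a minimal Terraform sketch under stated assumptions, not a definitive implementation: the variable names (`management_group_id`, `storage_policy_effect`) and the resource names are hypothetical, and the azurerm provider is assumed to be configured elsewhere.

```hcl
# Hypothetical input: the management group that holds both definition and assignment.
variable "management_group_id" {
  description = "Resource ID of the target management group."
  type        = string
}

# Hypothetical input: start with Audit, promote to Deny after remediation.
variable "storage_policy_effect" {
  description = "Effect for the current rollout stage."
  type        = string
  default     = "Audit"

  validation {
    condition     = contains(["Audit", "Deny", "Disabled"], var.storage_policy_effect)
    error_message = "The effect must be Audit, Deny, or Disabled."
  }
}

resource "azurerm_policy_definition" "public_storage" {
  name                = "restrict-public-storage-accounts"
  policy_type         = "Custom"
  mode                = "All"
  display_name        = "Restrict public storage account access"
  management_group_id = var.management_group_id

  # The effect is a parameter, so the definition itself never changes during rollout.
  parameters = jsonencode({
    effect = {
      type          = "String"
      allowedValues = ["Audit", "Deny", "Disabled"]
      defaultValue  = "Audit"
    }
  })

  policy_rule = jsonencode({
    if = {
      allOf = [
        { field = "type", equals = "Microsoft.Storage/storageAccounts" },
        { field = "Microsoft.Storage/storageAccounts/allowBlobPublicAccess", equals = "true" }
      ]
    }
    then = {
      effect = "[parameters('effect')]"
    }
  })
}

resource "azurerm_management_group_policy_assignment" "public_storage" {
  name                 = "restrict-public-storage"
  management_group_id  = var.management_group_id
  policy_definition_id = azurerm_policy_definition.public_storage.id

  # Only this value changes when the policy is promoted from Audit to Deny.
  parameters = jsonencode({
    effect = { value = var.storage_policy_effect }
  })
}
```

With this shape, promoting to deny is a change to `storage_policy_effect` rather than an edit to the definition. A parameterised effect appears in the definition as `[parameters('effect')]`, so any check on literal effect values would need to inspect `allowedValues` instead.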

Policy framework structure

When I'm building an enterprise policy program, I organise policies into three logical tiers:

Baseline controls — Applied at the root management group. These are the non-negotiable controls: approved regions, required tags, no public storage accounts in production. The blast radius if these are wrong is the whole tenant, so I test them thoroughly before assignment and I don't make exceptions below the root.

Workload controls — Applied at the landing zone management group level. These are workload-class-specific policies: Corp workloads route through the firewall, Online workloads can't peer to the hub. These can vary by organisation and I tune them per engagement.

Exception controls — Applied at the subscription level as policy exemptions with expiry dates. I treat exemptions the same way I treat technical debt: they're tracked, they have owners, and they have review dates.
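As an illustration, an exemption with an owner and an expiry date can be declared in Terraform so that it goes through the same review as any other change. This is a sketch that assumes the azurerm provider's `azurerm_subscription_policy_exemption` resource; the variable names, the exemption name, and the `owner` metadata key are hypothetical conventions rather than Azure requirements.

```hcl
# Hypothetical inputs; none of these names come from the original setup.
variable "subscription_resource_id" {
  description = "Subscription resource ID, in the form /subscriptions/<guid>."
  type        = string
}

variable "policy_assignment_id" {
  description = "ID of the policy assignment being exempted."
  type        = string
}

variable "exemption_owner" {
  description = "Team or person accountable for removing the exemption."
  type        = string
}

variable "exemption_expires_on" {
  description = "Expiry as an RFC 3339 timestamp."
  type        = string
}

resource "azurerm_subscription_policy_exemption" "legacy_storage" {
  name                 = "exempt-legacy-storage"
  subscription_id      = var.subscription_resource_id
  policy_assignment_id = var.policy_assignment_id
  exemption_category   = "Waiver" # "Mitigated" when a compensating control exists
  display_name         = "Temporary exemption for legacy storage accounts"
  description          = "Tracked exception; remove once the workload is remediated."
  expires_on           = var.exemption_expires_on

  # Assumed convention: record the owner so reviews know who to ask.
  metadata = jsonencode({
    owner = var.exemption_owner
  })
}
```

Keeping the expiry as a required variable means an exemption cannot be declared without someone choosing a date in the pull request.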

Terraform implementation

# Policy definition
resource "azurerm_policy_definition" "deny_public_storage" {
  name         = "deny-public-storage-accounts"
  policy_type  = "Custom"
  mode         = "All"
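  display_name = "Deny public storage account access"

  # Create the definition at the management group: a definition must live at or
  # above the scope where it is assigned, and the assignment below targets corp.
  management_group_id = azurerm_management_group.corp.id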

  metadata = jsonencode({
    category = "Storage"
    version  = "1.0.0"
  })

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "type"
          equals = "Microsoft.Storage/storageAccounts"
        },
        {
          field  = "Microsoft.Storage/storageAccounts/allowBlobPublicAccess"
          equals = "true"
        }
      ]
    }
    then = {
      effect = "deny"
    }
  })
}

# Assignment at management group
resource "azurerm_management_group_policy_assignment" "deny_public_storage" {
  name                 = "deny-public-storage"
  management_group_id  = azurerm_management_group.corp.id
  policy_definition_id = azurerm_policy_definition.deny_public_storage.id
  display_name         = "Deny public storage account access"

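  # enforce = true applies the policy effect (deny here). enforce = false sets
  # enforcement mode to DoNotEnforce: compliance is still evaluated, but nothing
  # is blocked (use this during rollout).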
  enforce = true
}

Policy syntax validation in CI

Before I let any policy definition reach a real environment, I validate its syntax. I use PowerShell with the Pester framework for this:

# validate-policy.ps1
Describe "Policy Definition Validation" {
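    # Pester 5 resolves -ForEach data during Discovery, so the file list must be
    # built in BeforeDiscovery; in BeforeAll it would be empty when tests are generated.
    BeforeDiscovery {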
        $policyFiles = Get-ChildItem -Path "./policies" -Filter "*.json" -Recurse
    }

    It "Policy file '<File>' has valid JSON" -ForEach @(
        foreach ($file in $policyFiles) { @{ File = $file.FullName } }
    ) {
        { Get-Content -Path $File -Raw | ConvertFrom-Json } | Should -Not -Throw
    }

    It "Policy file '<File>' contains required fields" -ForEach @(
        foreach ($file in $policyFiles) { @{ File = $file.FullName } }
    ) {
        $policy = Get-Content -Path $File -Raw | ConvertFrom-Json
        $policy.properties.displayName | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.if | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.then | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.then.effect | Should -Not -BeNullOrEmpty
    }

    It "Policy file '<File>' uses an approved effect" -ForEach @(
        foreach ($file in $policyFiles) { @{ File = $file.FullName } }
    ) {
        $policy = Get-Content -Path $File -Raw | ConvertFrom-Json
        $approvedEffects = @("audit", "deny", "modify", "deployIfNotExists",
                             "auditIfNotExists", "disabled", "append")
        $policy.properties.policyRule.then.effect | Should -BeIn $approvedEffects
    }
}

Multi-tenant deployment SOP

When I'm managing policies across multiple tenants, I follow a fixed promotion sequence. Taking shortcuts here is how breaking changes reach production policy.

Stages:

Development tenant — Define and validate the policy definition. Run syntax tests. Deploy to a non-production management group. Trigger manual compliance scan and review results.
Staging tenant — Assign with enforce: false first. Let the compliance scan run (up to 30 minutes for initial evaluation). Review audit results. If the expected resources appear as non-compliant, flip to enforce: true and verify no unintended denies occur.
Production tenant — Deploy from the exact same immutable artefact (Git tag, not branch). Assign with enforce: false, review compliance, then enforce. Never make direct changes to production policy assignments — every change goes through the pipeline.
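The enforce-false-then-true step in each stage can be driven by a single input so that the configuration stays identical across tenants. A minimal sketch, assuming hypothetical variable names (`management_group_id`, `policy_definition_id`, `enforce`) and a per-stage variable file:

```hcl
# Hypothetical inputs supplied per tenant, for example through a tfvars file.
variable "management_group_id" {
  description = "Resource ID of the management group to assign at."
  type        = string
}

variable "policy_definition_id" {
  description = "ID of the policy definition to assign."
  type        = string
}

variable "enforce" {
  description = "false while reviewing compliance results; true once they are accepted."
  type        = bool
  default     = false
}

resource "azurerm_management_group_policy_assignment" "deny_public_storage" {
  name                 = "deny-public-storage"
  management_group_id  = var.management_group_id
  policy_definition_id = var.policy_definition_id
  display_name         = "Deny public storage account access"

  # Defaults to false, so a new stage always starts in review mode.
  enforce = var.enforce
}
```

A stage would then start with `enforce = false` in its own variable file (say, a hypothetical `staging.tfvars` passed with `terraform apply -var-file=staging.tfvars`) and flip the value in a follow-up pull request once the compliance results look right.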

Release process:

All policy changes go through a pull request with at least one review from someone outside the team that authored the change
Policy versions are tracked in Git tags: policy/v1.2.0
Rollback is git revert plus pipeline re-run — not manual portal changes
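If the policies are packaged as a Terraform module, pinning the module source to a tag is one way to make every tenant deploy the same artefact. The sketch below is illustrative only: the repository URL, module path, and module inputs are hypothetical, and only the tag format `policy/v1.2.0` comes from the process above.

```hcl
variable "management_group_id" {
  description = "Resource ID of the management group for this tenant."
  type        = string
}

variable "enforce" {
  description = "Whether assignments in this tenant enforce their effects."
  type        = bool
  default     = false
}

module "baseline_policies" {
  # Hypothetical repository and path. The ref pins a tag, never a branch, so
  # every tenant resolves the same commit.
  source = "git::https://example.com/platform/azure-policy.git//modules/baseline?ref=policy/v1.2.0"

  # Hypothetical module inputs.
  management_group_id = var.management_group_id
  enforce             = var.enforce
}
```

Under this layout a rollback still goes through Git: reverting the commit that bumped the `ref` and re-running the pipeline restores the previous tag.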

Policy ownership

The question of who writes and maintains policies is as important as the technical implementation. The model I use:

Platform team owns baseline controls at root and platform management groups
Security team reviews and approves any deny-effect policy before it goes to production
Application teams can request exemptions — they don't create them unilaterally
Quarterly review of all exemptions by security and platform teams
