Policy as Code
The way most Azure policy programs start is with manual portal assignments. Someone adds a policy here, an initiative there, and over time you end up with hundreds of policies deployed across tenants with no source of truth for where they came from, no test coverage, and no way to safely promote changes. I've inherited that state more than once.
Policy as code solves this by treating Azure Policy the same way any other infrastructure gets treated — defined in files, reviewed through pull requests, tested in CI, and deployed through a controlled pipeline.
The problem with testing policies against what exists
The specific mistake I see most often in policy testing is that the tests are auto-generated from the current Azure state. The test says "check that policy X is assigned to management group Y" — which is exactly the thing the deployment just did. It's a circular dependency: the test validates what the pipeline created rather than validating that the right thing was created.
The tests I want are the ones that fail if I haven't deployed correctly — not the ones that pass because I just read back what I wrote.
What I do instead:
- Write tests before deployment that describe the intent of the policy (what it should allow, what it should deny)
- Validate policy definitions in CI before any assignment, using static checks on the definition JSON
- Run compliance scans in a non-production management group after deployment to verify effect before promoting to production
Preventive vs detective controls
Every policy effect has a different role:
| Effect | Type | When I use it |
|---|---|---|
| `deny` | Preventive | Hard requirements: resource types that must never be created, regions that aren't approved |
| `audit` | Detective | Hygiene standards: tags, naming, config settings I want visibility on but don't want to block |
| `modify` | Remediation | Auto-fix: adding missing tags, setting default properties on creation |
| `deployIfNotExists` | Remediation | Auto-provision: enabling diagnostics, deploying agents |
| `auditIfNotExists` | Detective | Detect missing child resources without blocking the parent |
I default to audit when rolling out any new policy, then promote to deny after a remediation cycle has cleared the estate.
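A minimal sketch of that audit-first rollout in Terraform, assuming the effect is exposed as a policy parameter so promotion to deny is a one-variable change. The variable names, the approved-regions rule, and the region list are illustrative assumptions, and the provider is assumed to be configured elsewhere.

```hcl
# Hypothetical input: resource ID of the management group to define and assign at
variable "management_group_id" {
  type        = string
  description = "Resource ID of the target management group"
}

# Hypothetical input: starts as Audit, switched to Deny after remediation
variable "region_policy_effect" {
  type        = string
  default     = "Audit"
  description = "Effect applied by the approved-regions policy"

  validation {
    condition     = contains(["Audit", "Deny", "Disabled"], var.region_policy_effect)
    error_message = "region_policy_effect must be Audit, Deny, or Disabled."
  }
}

resource "azurerm_policy_definition" "approved_regions" {
  name                = "approved-regions"
  policy_type         = "Custom"
  mode                = "Indexed"
  display_name        = "Resources must be created in approved regions"
  management_group_id = var.management_group_id

  # The effect is a parameter rather than a literal, so the same definition
  # serves both the audit phase and the deny phase
  parameters = jsonencode({
    effect = {
      type          = "String"
      defaultValue  = "Audit"
      allowedValues = ["Audit", "Deny", "Disabled"]
    }
    allowedLocations = {
      type = "Array"
    }
  })

  policy_rule = jsonencode({
    if = {
      field = "location"
      notIn = "[parameters('allowedLocations')]"
    }
    then = {
      effect = "[parameters('effect')]"
    }
  })
}

resource "azurerm_management_group_policy_assignment" "approved_regions" {
  name                 = "approved-regions"
  management_group_id  = var.management_group_id
  policy_definition_id = azurerm_policy_definition.approved_regions.id

  parameters = jsonencode({
    effect           = { value = var.region_policy_effect }
    allowedLocations = { value = ["uksouth", "ukwest"] } # illustrative regions
  })
}
```

Promoting to deny then becomes a reviewed change to `region_policy_effect` rather than an edit to the definition itself.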
Policy framework structure
When I'm building an enterprise policy program, I organise policies into three logical tiers:
Baseline controls — Applied at the root management group. These are the non-negotiable controls: approved regions, required tags, no public storage accounts in production. The blast radius if these are wrong is the whole tenant, so I test them thoroughly before assignment and I don't make exceptions below the root.
Workload controls — Applied at the landing zone management group level. These are workload-class-specific policies: Corp workloads route through the firewall, Online workloads can't peer to the hub. These can vary by organisation and I tune them per engagement.
Exception controls — Applied at the subscription level as policy exemptions with expiry dates. I treat exemptions the same way I treat technical debt: they're tracked, they have owners, and they have review dates.
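As a sketch of what a tracked, time-boxed exemption could look like in Terraform: the resource below uses `azurerm_subscription_policy_exemption` with an expiry date. The names, dates, and metadata keys (`owner`, `reviewDate`) are hypothetical placeholders, not a prescribed schema.

```hcl
# Hypothetical input, e.g. /subscriptions/00000000-0000-0000-0000-000000000000
variable "subscription_id" {
  type        = string
  description = "Resource ID of the subscription receiving the exemption"
}

# Hypothetical input: the assignment this exemption applies to
variable "policy_assignment_id" {
  type        = string
  description = "Resource ID of the policy assignment being exempted"
}

resource "azurerm_subscription_policy_exemption" "legacy_storage" {
  name                 = "legacy-storage-migration"
  subscription_id      = var.subscription_id
  policy_assignment_id = var.policy_assignment_id

  # "Waiver" accepts the non-compliance temporarily; "Mitigated" means the
  # intent is met by other means
  exemption_category = "Waiver"

  display_name = "Legacy storage migration waiver"
  description  = "Temporary waiver while legacy storage accounts are migrated"

  # Illustrative expiry date in UTC; the exemption stops applying after this
  expires_on = "2026-06-30T00:00:00Z"

  # Illustrative tracking fields so each exemption has an owner and a review date
  metadata = jsonencode({
    owner      = "app-team-example"
    reviewDate = "2026-03-31"
  })
}
```

Keeping exemptions in the same repository as the policies means an expired or unowned exemption shows up in code review rather than in the portal.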
Terraform implementation
# Policy definition
resource "azurerm_policy_definition" "deny_public_storage" {
  name         = "deny-public-storage-accounts"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Deny public storage account access"

  # Define at the management group so the definition can be assigned there;
  # a subscription-scoped definition cannot be assigned to a management group
  management_group_id = azurerm_management_group.corp.id

  metadata = jsonencode({
    category = "Storage"
    version  = "1.0.0"
  })

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "type"
          equals = "Microsoft.Storage/storageAccounts"
        },
        {
          field  = "Microsoft.Storage/storageAccounts/allowBlobPublicAccess"
          equals = "true"
        }
      ]
    }
    then = {
      effect = "deny"
    }
  })
}

# Assignment at management group
resource "azurerm_management_group_policy_assignment" "deny_public_storage" {
  name                 = "deny-public-storage"
  management_group_id  = azurerm_management_group.corp.id
  policy_definition_id = azurerm_policy_definition.deny_public_storage.id
  display_name         = "Deny public storage account access"

  # enforce = true applies the deny effect; false evaluates compliance
  # without blocking requests (use this during rollout)
  enforce = true
}
Policy syntax validation in CI
Before I let any policy definition reach a real environment, I validate its syntax. I use PowerShell with the Pester framework for this:
# validate-policy.ps1
# Pester v5 builds -ForEach data during discovery, so the file list is
# collected in BeforeDiscovery (BeforeAll runs too late to feed -ForEach)
BeforeDiscovery {
    $policyCases = @(
        Get-ChildItem -Path "./policies" -Filter "*.json" -Recurse |
            ForEach-Object { @{ File = $_.FullName } }
    )
}

Describe "Policy Definition Validation" {
    It "Policy file '<File>' has valid JSON" -ForEach $policyCases {
        # -Raw reads the file as one string so the whole document is parsed at once
        { Get-Content -Path $File -Raw | ConvertFrom-Json } | Should -Not -Throw
    }

    It "Policy file '<File>' contains required fields" -ForEach $policyCases {
        $policy = Get-Content -Path $File -Raw | ConvertFrom-Json
        $policy.properties.displayName | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.if | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.then | Should -Not -BeNullOrEmpty
        $policy.properties.policyRule.then.effect | Should -Not -BeNullOrEmpty
    }

    It "Policy file '<File>' uses an approved effect" -ForEach $policyCases {
        $policy = Get-Content -Path $File -Raw | ConvertFrom-Json
        $approvedEffects = @("audit", "deny", "modify", "deployIfNotExists",
                             "auditIfNotExists", "disabled", "append")
        # Checks literal effects only; a parameterised effect such as
        # "[parameters('effect')]" would need resolving before this assertion
        $policy.properties.policyRule.then.effect | Should -BeIn $approvedEffects
    }
}
Multi-tenant deployment SOP
When I'm managing policies across multiple tenants, I follow a fixed promotion sequence. Taking shortcuts here is how breaking changes reach production policy.
Stages:
- Development tenant: Define and validate the policy definition. Run syntax tests. Deploy to a non-production management group. Trigger a manual compliance scan and review results.
- Staging tenant: Assign with `enforce: false` first. Let the compliance scan run (up to 30 minutes for initial evaluation). Review audit results. If the expected resources appear as non-compliant, flip to `enforce: true` and verify no unintended denies occur.
- Production tenant: Deploy from the exact same immutable artefact (Git tag, not branch). Assign with `enforce: false`, review compliance, then enforce. Never make direct changes to production policy assignments; every change goes through the pipeline.
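The assign-unenforced-then-enforce step in the stages above could be driven by a single input, sketched here in Terraform. The variable names are assumptions for illustration, and the definition is assumed to exist already at or above the target management group.

```hcl
# Hypothetical input: false while reviewing compliance, true once results look right
variable "enforce" {
  type        = bool
  default     = false
  description = "Whether the assignment enforces its effect or only evaluates compliance"
}

# Hypothetical input: resource ID of the management group for this stage
variable "management_group_id" {
  type        = string
  description = "Resource ID of the target management group"
}

# Hypothetical input: resource ID of an existing policy definition
variable "policy_definition_id" {
  type        = string
  description = "Resource ID of the policy definition to assign"
}

resource "azurerm_management_group_policy_assignment" "staged" {
  name                 = "staged-rollout"
  management_group_id  = var.management_group_id
  policy_definition_id = var.policy_definition_id

  # Defaults to unenforced so a new assignment never starts out blocking requests
  enforce = var.enforce
}
```

Each stage would then run the same code twice: first with `-var="enforce=false"`, and again with `-var="enforce=true"` after the compliance results have been reviewed.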
Release process:
- All policy changes go through a pull request with at least one review from someone outside the team that authored the change
- Policy versions are tracked in Git tags: `policy/v1.2.0`
- Rollback is `git revert` plus pipeline re-run, not manual portal changes
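One way to make "deploy from a tag, not a branch" concrete is to consume the policy code as a Terraform module pinned to the release tag. This is only a sketch: the repository URL and the module's inputs are hypothetical, and the tag matches the `policy/v1.2.0` naming above.

```hcl
# Hypothetical inputs passed through to the module
variable "management_group_id" {
  type        = string
  description = "Resource ID of the target management group"
}

variable "enforce" {
  type        = bool
  default     = false
  description = "Whether assignments in the module enforce their effects"
}

module "policy_baseline" {
  # Hypothetical repository URL; ref points at a Git tag, never a branch name,
  # so every tenant deploys the same reviewed revision
  source = "git::https://example.com/platform/azure-policy.git?ref=policy/v1.2.0"

  # Hypothetical module inputs
  management_group_id = var.management_group_id
  enforce             = var.enforce
}
```

With this layout, promoting a release is a pull request that changes the `ref` to the next tag, and rolling back is a revert of that change followed by a pipeline re-run.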
Policy ownership
The question of who writes and maintains policies is as important as the technical implementation. The model I use:
- Platform team owns baseline controls at root and platform management groups
- Security team reviews and approves any `deny`-effect policy before it goes to production
- Application teams can request exemptions; they don't create them unilaterally
- Quarterly review of all exemptions by security and platform teams