Consuming ADLS Gen2 Storage from Databricks

When I first set up a private Databricks workspace with default_storage_firewall_enabled = true and no public IP on the cluster nodes, reading from ADLS Gen2 failed with UnknownRemoteException on every spark.read call. The storage account resolved to its public endpoint, which was blocked. Getting this working required three things in the right order: ADLS Gen2 with hierarchical namespace (HNS) enabled, a private endpoint using the dfs subresource, and a role assignment granting the access connector's user-assigned managed identity (UAMI) access to the storage account.

Why ADLS Gen2 and not regular Blob Storage

Databricks uses the ABFS (Azure Blob File System) driver with the abfss:// scheme to talk to storage. For Databricks I treat hierarchical namespace (is_hns_enabled = true) as a requirement for abfss://: HNS is what gives the account real directories, atomic renames, and POSIX-style ACLs. The driver is chosen by the URI scheme, so nothing falls back automatically. The alternative is to address the account with wasbs://, which uses the older, slower Azure Blob driver that Databricks has deprecated. I've always used ADLS Gen2 for Databricks workloads; there's no good reason to use flat Blob Storage here.
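Every path later in this post follows the same abfss://<container>@<account>.dfs.core.windows.net/<path> shape. As a minimal sketch, the two helpers below build and take apart such a URI with the standard library only. The function names and the stexample account name are my own illustrations, not part of any Databricks API.

```python
from urllib.parse import urlsplit

DFS_SUFFIX = "dfs.core.windows.net"


def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside a container (hypothetical helper)."""
    return f"abfss://{container}@{storage_account}.{DFS_SUFFIX}/{path.lstrip('/')}"


def split_abfss_uri(uri: str) -> tuple[str, str, str]:
    """Return (container, storage_account, path), or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"expected abfss:// scheme, got {parts.scheme!r}")
    host = parts.hostname or ""
    # The container sits in the "user" position of the URI, before the @
    if not parts.username or not host.endswith("." + DFS_SUFFIX):
        raise ValueError(f"not a <container>@<account>.{DFS_SUFFIX} URI: {uri!r}")
    account = host[: -len("." + DFS_SUFFIX)]
    return parts.username, account, parts.path.lstrip("/")


if __name__ == "__main__":
    uri = abfss_uri("logs", "stexample", "sample/app_logs.csv")
    print(uri)
    print(split_abfss_uri(uri))
```

Rejecting anything that isn't abfss:// on the dfs endpoint catches a wasbs:// or blob.core.windows.net path before it ever reaches Spark.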

Terraform: provisioning the storage account

The key settings for a Databricks-compatible storage account:

resource "azurerm_storage_account" "data" {
  name = local.storage_account_name
  resource_group_name = azurerm_resource_group.data.name
  location = azurerm_resource_group.data.location
  account_tier = "Standard"
  account_replication_type = "LRS"
  account_kind = "StorageV2"
  is_hns_enabled = true # Required for abfss:// driver
  min_tls_version = "TLS1_2"
  https_traffic_only_enabled = true
  allow_nested_items_to_be_public = false
  public_network_access_enabled = false # All traffic via private endpoint

  tags = local.tags
}

resource "azurerm_storage_container" "logs" {
  name = "logs"
  storage_account_id = azurerm_storage_account.data.id
  container_access_type = "private"
}

public_network_access_enabled = false means no traffic reaches the storage data plane from outside the VNet. It also means you can't create azurerm_storage_blob resources from a Terraform runner that isn't inside the VNet, because the provider's HTTP client can't reach the blob endpoint. I learned this the hard way when my pipeline tried to seed a sample CSV and got a 403 on every attempt. The file upload has to happen from inside the network (from a Databricks notebook or the jump VM via Bastion).

Role assignment — no keys or SAS tokens

The UAMI on the access connector needs Storage Blob Data Contributor on the storage account. This covers read, write, and delete on blob data. The role assignment is scoped to the storage account, not the container, which makes it easier to manage across multiple containers if you add more later.

resource "azurerm_role_assignment" "uami_storage_blob_contributor" {
  scope = azurerm_storage_account.data.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id = azurerm_user_assigned_identity.databricks.principal_id
}

No spark.conf settings are needed on the cluster or in the notebook. With Unity Catalog enabled, a storage credential and an external location (both covered below) tell Databricks to use the access connector's UAMI for matching abfss:// paths. I spent longer than I should have trying to configure fs.azure.account.auth.type before realising that access just works once the identity is wired correctly.

Unity Catalog: storage credential

The Azure role assignment gets the UAMI the permissions it needs at the Azure layer. But Databricks won't use that identity for data access unless you also register it inside Unity Catalog as a storage credential. The storage credential is the Databricks-side record that says "this access connector identity is authorised to be used for storage access from this metastore."

Via Terraform

resource "databricks_storage_credential" "adls" {
  name = "sc-techanalytics-adls"

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.databricks.id
    managed_identity_id = azurerm_user_assigned_identity.databricks.id
  }

  comment = "UAMI credential for ADLS Gen2 access"
}

access_connector_id is the resource ID of the access connector. managed_identity_id is the resource ID of the UAMI. Both must be supplied when using a user-assigned identity — if you omit managed_identity_id, Databricks assumes the system-assigned identity on the connector, which may not have the storage role assignment.

Via the UI

  1. Open the workspace and go to Catalog in the left sidebar
  2. Click External Data > Credentials
  3. Click Add credential
  4. Set Credential type to Azure Managed Identity
  5. Give it a name (e.g. sc-techanalytics-adls)
  6. Paste the Access connector ID — this is the full Azure resource ID of the access connector, in the format /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<name>
  7. Paste the Managed identity ID — this is the full Azure resource ID of the UAMI, in the format /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<name>
  8. Click Create

Both resource IDs are available as Terraform outputs if you output them, or from the Azure portal under the respective resource's Properties blade.
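The two resource IDs are easy to mix up or paste with a placeholder still in them. The sketch below builds both IDs in the formats listed in the steps above and rejects anything that still contains angle brackets. The function names and example values are hypothetical, and the regular expression is a simple shape check of my own, not an official Azure validator.

```python
import re


def access_connector_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Resource ID of a Databricks access connector."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Databricks/accessConnectors/{name}")


def user_assigned_identity_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Resource ID of a user-assigned managed identity."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{name}")


# Segments may not contain "/" or the "<" ">" of an unreplaced placeholder
_RESOURCE_ID = re.compile(
    r"^/subscriptions/[^/<>]+/resourceGroups/[^/<>]+"
    r"/providers/(?P<type>Microsoft\.[A-Za-z]+/[A-Za-z]+)/[^/<>]+$",
    re.IGNORECASE,
)


def resource_type(resource_id: str) -> str:
    """Return the provider/type part of an ID, or raise ValueError if malformed."""
    match = _RESOURCE_ID.match(resource_id)
    if match is None:
        raise ValueError(f"not a complete resource ID: {resource_id!r}")
    return match.group("type")


if __name__ == "__main__":
    sub = "00000000-0000-0000-0000-000000000000"  # example value only
    print(resource_type(access_connector_id(sub, "rg-example", "dbac-example")))
    print(resource_type(user_assigned_identity_id(sub, "rg-example", "uami-example")))
```

Checking the returned type before pasting makes it obvious when the connector ID and the identity ID have been swapped.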

Via a notebook

The Unity Catalog SQL CREATE STORAGE CREDENTIAL statement works from any notebook cell attached to a cluster with Unity Catalog enabled. Run this once — you only need one storage credential per access connector identity regardless of how many external locations you create later.

CREATE STORAGE CREDENTIAL IF NOT EXISTS `sc-techanalytics-adls`
WITH AZURE_MANAGED_IDENTITY (
  CONNECTOR = '/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Databricks/accessConnectors/<connector-name>',
  MANAGED_IDENTITY_ID = '/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-name>'
)
COMMENT 'UAMI credential for ADLS Gen2 access';

To run this as Python in the same notebook using spark.sql():

access_connector_id = "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Databricks/accessConnectors/<connector-name>"
managed_identity_id = "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-name>"
credential_name = "sc-techanalytics-adls"

spark.sql(f"""
CREATE STORAGE CREDENTIAL IF NOT EXISTS `{credential_name}`
WITH AZURE_MANAGED_IDENTITY (
  CONNECTOR = '{access_connector_id}',
  MANAGED_IDENTITY_ID = '{managed_identity_id}'
)
COMMENT 'UAMI credential for ADLS Gen2 access'
""")

# Confirm it was created
display(spark.sql("SHOW STORAGE CREDENTIALS"))

You need to be a metastore admin or have the CREATE STORAGE CREDENTIAL privilege to run this. If the workspace is set up with a single admin account for initial setup, run it from there first.

Unity Catalog: external location

A storage credential on its own doesn't grant access to any specific storage path. You need an external location to map a storage path to a credential. External locations define the root paths that notebooks and jobs are allowed to read from or write to via Unity Catalog.

Via Terraform

resource "databricks_external_location" "logs" {
  name = "el-techanalytics-logs"
  url = "abfss://logs@${azurerm_storage_account.data.name}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.adls.name

  comment = "External location for log ingestion container"

  depends_on = [databricks_storage_credential.adls]
}

The url is the abfss:// path to the container root. You can scope it to a specific subfolder if you want to limit what the external location exposes, but I usually set it at the container root and let table grants handle the finer-grained access control.
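To make the scoping concrete, here is a small sketch of the prefix rule: a path is covered when it sits at or below the external location's URL. This is a local illustration of the idea, written by me, and not how Unity Catalog implements the check. The function name and the stexample account are hypothetical.

```python
def is_under_location(path: str, location_url: str) -> bool:
    """True if path equals the location root or lies beneath it."""
    root = location_url.rstrip("/")
    # Compare against root + "/" so that ".../raw" does not match ".../raw-archive"
    return path.rstrip("/") == root or path.startswith(root + "/")


if __name__ == "__main__":
    container_root = "abfss://logs@stexample.dfs.core.windows.net/"
    subfolder = "abfss://logs@stexample.dfs.core.windows.net/raw"
    target = "abfss://logs@stexample.dfs.core.windows.net/sample/app_logs.csv"

    print(is_under_location(target, container_root))  # covered by the container root
    print(is_under_location(target, subfolder))       # outside the narrower location
```

A location set at the container root covers every path in the container, while one scoped to a subfolder leaves sibling folders out.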

Via the UI

  1. Go to Catalog > External Data > External Locations
  2. Click Add an external location
  3. Select Manual (rather than an auto-discovered location)
  4. Give it a name (e.g. el-techanalytics-logs)
  5. Set the URL to the abfss:// container root: abfss://logs@<storage-account-name>.dfs.core.windows.net/
  6. Select the storage credential you created in the previous step from the Storage credential dropdown
  7. Click Create
  8. On the next screen click Test connection — this is the fastest way to confirm the credential, role assignment, and private endpoint are all working correctly before you try reading from a notebook

Via a notebook

storage_account = "<storage-account-name>"  # from Terraform output: data_storage_account_name
container = "logs"
credential_name = "sc-techanalytics-adls"
location_name = "el-techanalytics-logs"

spark.sql(f"""
CREATE EXTERNAL LOCATION IF NOT EXISTS `{location_name}`
URL 'abfss://{container}@{storage_account}.dfs.core.windows.net/'
WITH (CREDENTIAL `{credential_name}`)
COMMENT 'External location for log ingestion container'
""")

# Validate the credential, then show the registered external locations
# (the dbutils.fs.ls check below is what confirms end-to-end access)
spark.sql(f"VALIDATE STORAGE CREDENTIAL `{credential_name}`")
display(spark.sql("SHOW EXTERNAL LOCATIONS"))

Once the external location exists, notebooks can reference any path under it directly and Unity Catalog handles the credential injection transparently. There's nothing extra to configure per-notebook or per-cluster.

To do a quick sanity check from a notebook after setup:

dbutils.fs.ls(f"abfss://{container}@{storage_account}.dfs.core.windows.net/")

If the credential and external location are wired correctly, you'll get a directory listing. If you get a 403 or an auth error, the most likely cause is that the storage credential was created against the system-assigned identity rather than the UAMI, or the role assignment hasn't propagated yet (it can take a few minutes).
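Because a fresh role assignment can take a few minutes to propagate, it can help to retry the sanity check rather than run it once. The helper below is a generic retry-with-backoff sketch. The name retry, the attempt count, and the delays are my own choices. It retries on any exception, which is deliberately blunt: it can't tell a propagation delay from a genuinely wrong credential.

```python
import time


def retry(action, attempts=5, first_delay=15.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception once the attempts are used up.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay *= 2


# In a notebook, dbutils, container and storage_account already exist:
# files = retry(lambda: dbutils.fs.ls(
#     f"abfss://{container}@{storage_account}.dfs.core.windows.net/"))
```

If the call still fails after the last attempt, propagation is unlikely to be the cause, and the system-assigned versus UAMI mix-up is the next thing to check.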

Private endpoint for the DFS subresource

Cluster nodes run with no_public_ip = true, and the storage account rejects public traffic. Without a private endpoint, the ABFS driver resolves <storage-account>.dfs.core.windows.net to the public IP and can't connect. The private endpoint must target the dfs subresource, not blob, because the dfs subresource carries all abfss:// traffic from the ABFS driver.

resource "azurerm_private_dns_zone" "storage_dfs" {
  name = "privatelink.dfs.core.windows.net"
  resource_group_name = azurerm_resource_group.hub.name
  tags = local.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "storage_dfs_hub" {
  name = local.dns_link_storage_hub_name
  resource_group_name = azurerm_resource_group.hub.name
  private_dns_zone_name = azurerm_private_dns_zone.storage_dfs.name
  virtual_network_id = azurerm_virtual_network.hub.id
  registration_enabled = false
  tags = local.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "storage_dfs_spoke" {
  name = local.dns_link_storage_spoke_name
  resource_group_name = azurerm_resource_group.hub.name
  private_dns_zone_name = azurerm_private_dns_zone.storage_dfs.name
  virtual_network_id = azurerm_virtual_network.spoke.id
  registration_enabled = false
  tags = local.tags
}

resource "azurerm_private_endpoint" "storage_dfs" {
  name = local.pep_storage_dfs_name
  location = azurerm_resource_group.hub.location
  resource_group_name = azurerm_resource_group.hub.name
  subnet_id = azurerm_subnet.private_endpoint.id

  private_service_connection {
    name = local.psc_storage_dfs_name
    private_connection_resource_id = azurerm_storage_account.data.id
    is_manual_connection = false
    subresource_names = ["dfs"]
  }

  private_dns_zone_group {
    name = "pdnszg-stg-dfs"
    private_dns_zone_ids = [azurerm_private_dns_zone.storage_dfs.id]
  }

  tags = local.tags
}

The DNS zone needs to be linked to both the hub and spoke VNets. Hub is where the private endpoint lives; spoke is where the Databricks cluster nodes run. If the spoke link is missing, the cluster nodes can't resolve the storage FQDN through the private zone and fall back to the public IP.
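A quick way to tell whether a node is resolving through the private zone is to look at the address it gets back. The sketch below uses only the Python standard library, so it can run in a notebook cell on the cluster or on the jump VM. The function names are hypothetical, and the check assumes the private endpoint sits in ordinary private address space.

```python
import ipaddress
import socket


def is_private_ip(address: str) -> bool:
    """True for private ranges (note: also true for loopback and link-local)."""
    return ipaddress.ip_address(address).is_private


def resolve_ipv4(fqdn: str) -> list[str]:
    """Resolve a hostname to its distinct IPv4 addresses."""
    infos = socket.getaddrinfo(fqdn, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def check_private_resolution(fqdn: str) -> dict[str, bool]:
    """Map each resolved address to whether it is a private IP."""
    return {ip: is_private_ip(ip) for ip in resolve_ipv4(fqdn)}


# From a notebook, with storage_account already set:
# print(check_private_resolution(f"{storage_account}.dfs.core.windows.net"))
```

A False next to any address means that node is still being handed the public endpoint, which points at a missing VNet link or the wrong subresource on the private endpoint.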

Reading data from a notebook

Once the infrastructure is in place, reading is straightforward. The storage account name comes from a Terraform output:

storage_account = "<value from Terraform output: data_storage_account_name>"

df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"abfss://logs@{storage_account}.dfs.core.windows.net/sample/app_logs.csv"))

df.show()

For more control over schema or parsing:

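from pyspark.sql.functions import col, to_timestamp

# Declare the schema up front instead of inferring it; every column starts as a string
df = (spark.read
    .option("header", "true")
    .schema("timestamp STRING, level STRING, service STRING, message STRING")
    .csv(f"abfss://logs@{storage_account}.dfs.core.windows.net/sample/app_logs.csv"))

# Parse the ISO 8601 strings into a proper timestamp column
df = df.withColumn("timestamp", to_timestamp(col("timestamp")))
df.filter(col("level") == "ERROR").show(truncate=False)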

Seeding test data from a notebook

Since the Terraform runner can't reach the blob plane from outside the VNet, the easiest way to seed a test file is from within a Databricks notebook using dbutils.fs.put:

storage_account = "<value from Terraform output: data_storage_account_name>"

dbutils.fs.put(
    f"abfss://logs@{storage_account}.dfs.core.windows.net/sample/app_logs.csv",
    """timestamp,level,service,message
2026-05-30T00:01:12Z,INFO,auth,User login successful
2026-05-30T00:01:45Z,ERROR,payment,Timeout connecting to payment gateway
2026-05-30T00:02:03Z,WARN,api,Rate limit approaching threshold
2026-05-30T00:02:31Z,ERROR,auth,Token validation failed""",
    overwrite=True
)

This works because the cluster nodes resolve the FQDN through the private DNS zone and reach the storage account via the private endpoint. The Terraform runner outside the VNet can't do this — but anything running inside the spoke can.
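Hand-writing the CSV in a triple-quoted string is fine for four rows, but it breaks as soon as a message contains a comma. As a sketch, the helper below builds the same content with the standard csv module, which quotes such fields. build_csv is a hypothetical name, and the rows are the sample rows from above.

```python
import csv
import io

SAMPLE_ROWS = [
    ("2026-05-30T00:01:12Z", "INFO", "auth", "User login successful"),
    ("2026-05-30T00:01:45Z", "ERROR", "payment", "Timeout connecting to payment gateway"),
    ("2026-05-30T00:02:03Z", "WARN", "api", "Rate limit approaching threshold"),
    ("2026-05-30T00:02:31Z", "ERROR", "auth", "Token validation failed"),
]


def build_csv(rows, header=("timestamp", "level", "service", "message")) -> str:
    """Render rows as CSV text, quoting fields that contain commas or quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().rstrip("\n")


# In a notebook, pass the result as the contents argument:
# dbutils.fs.put(
#     f"abfss://logs@{storage_account}.dfs.core.windows.net/sample/app_logs.csv",
#     build_csv(SAMPLE_ROWS),
#     overwrite=True,
# )
```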

Things I've gotten wrong

Using the blob subresource on the private endpoint instead of dfs. The blob subresource covers wasbs:// traffic. The dfs subresource covers abfss:// traffic. Using blob means the ABFS driver still can't resolve correctly, and the error message doesn't make the distinction obvious. I only figured this out by checking the private endpoint DNS record that was created and comparing it against what nslookup returned from inside the cluster.

Forgetting managed_identity_id on the storage credential. If you create the databricks_storage_credential with only access_connector_id and skip managed_identity_id, Databricks assumes the system-assigned identity. The system-assigned identity doesn't have the storage role, so every access attempt gets a 403 even though the infrastructure looks correct. The error message doesn't mention managed identity selection — it just says permission denied.

Creating the external location before the storage credential is fully provisioned. Ordering matters here. Referencing databricks_storage_credential.adls.name already gives Terraform an implicit dependency, and the explicit depends_on on databricks_external_location makes the intent obvious. If the credential name is passed as a plain string instead, Terraform may try to create the external location while the storage credential is still being registered, and the external location creation fails with a "credential not found" error. This can also happen if you're running the azurerm and databricks providers in the same apply without explicit sequencing.

Forgetting to link the DNS zone to the spoke VNet. The hub link alone isn't enough. Cluster nodes are in the spoke, and if the spoke isn't linked to the privatelink.dfs.core.windows.net zone, they won't resolve the private IP. The private endpoint exists and is healthy — it just can't be found by the nodes that need it.

Trying to upload files from the Terraform pipeline. With public_network_access_enabled = false, azurerm_storage_blob gets a 403 on the PUT from a runner outside the VNet. In my case the resource still showed as created in state even though the file didn't exist: checking the storage account from the portal or from inside the VNet revealed nothing there. Use dbutils.fs.put or copy from the jump VM via Bastion instead.