One of our Databricks instances constantly has "BOOT STRAP FAILURE"

Steven Morris 0 Reputation points
2025-04-16T16:43:49.3166667+00:00

We have multiple Databricks instances across multiple subscriptions. One instance in one subscription constantly has provisioning failures, timing out while waiting for VMs to either provision or attach storage. This is in US West 2; we're not sure whether we're just in a bad zone in that subscription compared with the others.

We started troubleshooting by increasing scale, scaling out and scaling up, and trying spot instances vs. on-demand. We still consistently get intermittent failures after around 400 seconds of cluster provisioning, across various jobs with both lighter and heavier workloads.

Example Errors:

Cluster 'redacted' was terminated. Reason: STORAGE_DOWNLOAD_FAILURE_SLOW (CLIENT_ERROR). Parameters: databricks_error_message:Downloading worker artifacts onto the instance timed out. Please check your network settings, especially any firewall appliances, they may throttle your download speed. Instance bootstrap failed command: Command_UpdateWorker Failure message (may be truncated): 2025/04/14 13:02:07 INFO worker_artifacts.py:95: Clearing the JAR destination: /home/ubuntu/databricks/jars/node-daemon

Azure Databricks

1 answer

  1. Chandra Boorla 12,100 Reputation points Microsoft External Staff
    2025-04-16T18:24:48.76+00:00

    @Steven Morris

    Thanks for sharing the details - and sorry to hear you are running into these consistent BOOTSTRAP_FAILURE issues in US West 2. Based on the error message and your thorough troubleshooting so far, it seems likely the root cause is tied to network-level performance or availability in the specific zone or subnet your cluster is deploying into.

    The key error:

    STORAGE_DOWNLOAD_FAILURE_SLOW "Downloading worker artifacts onto the instance timed out. Please check your network settings, especially any firewall appliances, they may throttle your download speed."

    This usually indicates that the VM is able to provision but fails while downloading critical runtime components - often due to throttled or unreliable network paths (sometimes caused by custom firewalls, DNS delays, or degraded zone performance).
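
To confirm that the intermittent failures all share the same termination code, the messages can be tallied with a small parser. The following is a minimal sketch, assuming the termination messages have been collected as plain strings in the same "Reason: CODE (TYPE)" shape as the example error in the question; the function names are illustrative and are not part of any Databricks API.

```python
import re
from collections import Counter

# Matches the "Reason: CODE (TYPE)" portion of a termination message.
REASON_RE = re.compile(r"Reason:\s*(?P<code>[A-Z_]+)\s*\((?P<type>[A-Z_]+)\)")


def parse_termination(message):
    """Return (code, type) from a termination message, or None if absent."""
    match = REASON_RE.search(message)
    if match is None:
        return None
    return match.group("code"), match.group("type")


def tally_reasons(messages):
    """Count how often each (code, type) pair appears across messages."""
    parsed = (parse_termination(m) for m in messages)
    return Counter(p for p in parsed if p is not None)


# Shortened copy of the error quoted in the question.
sample = (
    "Cluster 'redacted' was terminated. Reason: "
    "STORAGE_DOWNLOAD_FAILURE_SLOW (CLIENT_ERROR). Parameters: ..."
)
print(parse_termination(sample))
```

If every failure reports STORAGE_DOWNLOAD_FAILURE_SLOW, that points toward the artifact download path rather than VM allocation; a mix of codes would suggest more than one cause.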

    Here are some possible troubleshooting steps that might help you:

    Check Azure Service Health - In the Azure Portal, navigate to Service Health for your subscription/region and confirm whether any Availability Zones in US West 2 are degraded. Databricks clusters may land in a specific zone by default unless configured otherwise.

    Run the Databricks Network Connectivity Test - You can run this from within the cluster’s “Advanced” tab (or using network diagnostic notebooks) to verify outbound access to Databricks artifact hosts is healthy.
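
As a rough first check from a notebook on a running cluster, outbound reachability can be timed with the Python standard library. This is only a sketch under stated assumptions: the endpoint list below is a hypothetical placeholder (substitute the hosts your workspace actually requires), and a TCP connect measures reachability and connection latency, not the download throughput that the error message complains about.

```python
import socket
import time


def check_endpoint(host, port=443, timeout=5.0):
    """Time a TCP connect to host:port; return (ok, seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, DNS failures, refused connects
        return False, time.monotonic() - start, str(exc)


def report(endpoints, slow_after=1.0):
    """Label each (host, port) pair as OK, SLOW, or FAILED."""
    results = {}
    for host, port in endpoints:
        ok, seconds, _ = check_endpoint(host, port)
        if not ok:
            results[(host, port)] = "FAILED"
        else:
            results[(host, port)] = "SLOW" if seconds > slow_after else "OK"
    return results


# HYPOTHETICAL endpoint list: replace with the hosts your workspace needs.
ENDPOINTS = [("example.blob.core.windows.net", 443)]
# print(report(ENDPOINTS))  # uncomment to run against your own endpoints
```

Running the same check from the failing workspace and from a healthy one in another subscription gives a simple side-by-side comparison of connect times.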

    Review Any Network Configurations - If you have any NSGs, firewalls, or forced proxy/tunneling in place, especially ones performing deep packet inspection, they could slow down or block bootstrap traffic. These commonly affect VM startup.
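
One way to start that review is to list the outbound Deny rules in priority order. The sketch below assumes the NSG rules have been exported as JSON (for example from `az network nsg rule list`) and that each rule carries `name`, `direction`, `access`, `priority`, and `destinationAddressPrefix` fields; the sample rules are made up for illustration.

```python
import json


def outbound_denies(rules):
    """Return outbound Deny rules ordered by priority (lowest number first)."""
    hits = [
        r for r in rules
        if r.get("direction") == "Outbound" and r.get("access") == "Deny"
    ]
    return sorted(hits, key=lambda r: r.get("priority", 0))


# Made-up sample data in the assumed shape, not real rules.
sample_rules = json.loads("""
[
  {"name": "allow-storage", "direction": "Outbound", "access": "Allow",
   "priority": 100, "destinationAddressPrefix": "Storage"},
  {"name": "deny-internet", "direction": "Outbound", "access": "Deny",
   "priority": 200, "destinationAddressPrefix": "Internet"},
  {"name": "deny-inbound", "direction": "Inbound", "access": "Deny",
   "priority": 300, "destinationAddressPrefix": "*"}
]
""")

for rule in outbound_denies(sample_rules):
    print(rule["priority"], rule["name"], rule["destinationAddressPrefix"])
```

This only surfaces explicit Deny rules. Throttling by a firewall appliance or forced tunneling will not show up here and has to be checked on the appliance or route table itself.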

    Try a Different Availability Zone or Region (if possible) - If this issue is isolated to one Databricks workspace in one subscription, it may be worth trying a test workspace in another AZ or region to validate whether this is a local zone/subnet issue.

    Recommended

    One additional action worth exploring is increasing the egress (outbound bandwidth) limit on your storage account, which is responsible for delivering Databricks runtime components during cluster bootstrap. Since the error points to timeouts while downloading worker artifacts, and this storage account plays a key role in that process, raising the egress throughput cap may help avoid timeouts, especially under heavier provisioning or in zones with limited VM bandwidth.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.
