Azure Synapse Link for Dataverse Spark issues

Broomfield, Darrien 0 Reputation points
2025-04-25T23:35:21.93+00:00

I have set up an Azure Synapse Link for Dataverse, and the Lake Database was auto-created in the Synapse workspace. I can also run a SQL script against any of the Dataverse tables that I included in the Link by right-clicking them and using the "Select TOP 100 rows" option to autofill the connection details. But when I try to pull the same data into a dataframe within a notebook, I get this error:

org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: null path
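
For reference, the call that fails is along these lines (the database and table names here are illustrative):

    # Querying the auto-created lake database from the notebook (illustrative names)
    df = spark.sql("SELECT * FROM my_lake_db.account LIMIT 100")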

Does anyone know why the lake database that the Synapse Link created only works for SQL scripts and not for Spark SQL in notebooks?

Azure Synapse Analytics

1 answer

  1. Sina Salam 19,616 Reputation points
    2025-04-30T20:03:56.06+00:00

    Hello Broomfield, Darrien,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Regarding the Azure Synapse Link for Dataverse Spark issue, after my careful review, there are two things you need to know:

    1. Delta requires explicit activation and permissions: the Synapse workspace managed identity must have the Storage Blob Data Owner role on the storage account (see the links in step 1 below).
    2. The exported CSV files have no header row, so model.json must be parsed to map the column names, and partitioned data requires wildcard path patterns (see steps 3 and 4 below).

    Therefore, follow the steps below to resolve the issue:

    1. Assign the Synapse workspace managed identity the Storage Blob Data Owner role on the storage account (https://learn.microsoft.com/en-us/azure/storage/common/storage-auth-aad-rbac-portal). Also, make sure the Synapse identity has System Administrator in Dataverse (https://learn.microsoft.com/en-us/power-platform/admin/manage-service-principals). Then reconfigure the Synapse Link with Delta enabled.
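
       As a quick sanity check after the role assignment, you can try listing the storage container from a notebook; if the identity lacks access, this fails with an authorization error rather than the null-path error (the placeholders match those in the steps below):

         # Sanity check: can the notebook identity list the Dataverse container?
         from notebookutils import mssparkutils

         files = mssparkutils.fs.ls("abfss://<container>@<storage>.dfs.core.windows.net/")
         for f in files:
             print(f.name)
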
    2. Read Parquet/Delta Data (If Delta Enabled)
         # For Delta tables:
         df = spark.read.format("delta").load("abfss://<container>@<storage>.dfs.core.windows.net/Tables/<table>/delta")
         df.show()
         # For Parquet:
         df = spark.read.parquet("abfss://<container>@<storage>.dfs.core.windows.net/Tables/<table>/parquet")
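
       Once the link is reconfigured with Delta enabled, the auto-created lake database tables should also resolve from Spark SQL in a notebook, which is what failed originally (database and table names are placeholders):

         # Query the lake database table directly once Delta is enabled
         df = spark.sql("SELECT * FROM <lake_db>.<table> LIMIT 100")
         df.show()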
      
    3. Handle CSV files and schema extraction in two steps:
      • Parse model.json for Schema:
                import json

                # Read model.json ("spark" is already defined in a Synapse notebook session)
                model_json_path = "abfss://<container>@<storage>.dfs.core.windows.net/model.json"
                model_content = spark.sparkContext.wholeTextFiles(model_json_path).collect()[0][1]
                model = json.loads(model_content)

                # Extract the column names for a specific table (e.g., "account")
                table_schema = [attr["name"] for table in model["entities"] if table["name"] == "account" for attr in table["attributes"]]
        
      • Read the CSV files and apply the extracted schema:
                # Read all CSV files (including partitions and snapshots)
                path = "abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/**/*.csv"
                df = spark.read.option("header", "false").csv(path)
                df = df.toDF(*table_schema)
                df.show()
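
       If you need typed columns instead of all strings, the attribute entries in model.json also carry a dataType field that can be mapped to Spark types. A minimal sketch, reusing the "model" and "path" variables from the snippets above (the type mapping here is illustrative, not exhaustive):

                from pyspark.sql.types import (StructType, StructField, StringType,
                                               LongType, DoubleType, BooleanType, TimestampType)

                # Illustrative CDM-to-Spark type mapping; extend it for your tables
                type_map = {
                    "string": StringType(), "int64": LongType(), "double": DoubleType(),
                    "boolean": BooleanType(), "dateTime": TimestampType(), "guid": StringType(),
                }
                entity = next(t for t in model["entities"] if t["name"] == "account")
                schema = StructType([
                    StructField(a["name"], type_map.get(a.get("dataType", "string"), StringType()), True)
                    for a in entity["attributes"]
                ])
                df = spark.read.option("header", "false").schema(schema).csv(path)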
        
    4. Lastly, you will need to handle partitioned data and snapshots to make sure you include all the data:
      • Use wildcards to read partitioned folders (e.g., Tables/account/**/*.csv).
      • Snapshots are stored under Tables/account/snapshot; exclude them if they are not needed, for example with a single-level wildcard (an alternative path-based filter is sketched below):
             path = "abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/*/*.csv"  # Excludes "snapshot" 
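
      If the single-level wildcard does not match your folder layout, another option is to read everything and then drop snapshot rows by their source file path, using Spark's built-in input_file_name() function:

             from pyspark.sql.functions import input_file_name, col

             # Read all CSVs, tag each row with its source file, then drop snapshot rows
             df = spark.read.option("header", "false").csv("abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/**/*.csv")
             df = (df.withColumn("_source_file", input_file_name())
                     .filter(~col("_source_file").contains("/snapshot/"))
                     .drop("_source_file"))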
        

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.


    Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.

