Azure Synapse Link for Dataverse Spark issues

Broomfield, Darrien 0 Reputation points
2025-04-25T23:35:21.93+00:00

I have set up an Azure Synapse Link for Dataverse, and the Lake Database was auto-created in the Synapse workspace. I can also run a SQL script against any of the Dataverse tables that I included in the Link by right-clicking them and using the "Select TOP 100 rows" option to autofill the connection details. But when I try to pull the same data into a dataframe within a notebook, I get this error:

org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: null path
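
For reference, the call that fails is along these lines (the database and table names here are illustrative):

    # Querying the auto-created lake database from the notebook (illustrative names)
    df = spark.sql("SELECT * FROM my_lake_db.account LIMIT 100")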

Does anyone know why the lake database that the Synapse Link created only works for SQL scripts and not for Spark SQL in notebooks?

Azure Synapse Analytics

1 answer

  1. Sina Salam 19,616 Reputation points
    2025-04-30T20:03:56.06+00:00

    Hello Broomfield, Darrien,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Regarding the Azure Synapse Link for Dataverse Spark issue, after my careful review, there are two things you need to know:

    1. Delta requires explicit activation and permissions: the Synapse workspace managed identity must have the Storage Blob Data Owner role on the storage account (see the links in step 1 below).
    2. The exported CSV files have no header row, so model.json must be parsed to map the column names, and partitioned data requires wildcard path patterns (see steps 3 and 4 below).

    Therefore, follow the steps below to resolve the issue:

    1. Assign the Synapse workspace managed identity the Storage Blob Data Owner role on the storage account (https://learn.microsoft.com/en-us/azure/storage/common/storage-auth-aad-rbac-portal). Also, make sure the Synapse identity has System Administrator in Dataverse (https://learn.microsoft.com/en-us/power-platform/admin/manage-service-principals). Then reconfigure the Synapse Link with Delta enabled.
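
       As a quick sanity check after the role assignment, you can try listing the storage container from a notebook; if the identity lacks access, this fails with an authorization error rather than the null-path error (the placeholders match those in the steps below):

         # Sanity check: can the notebook identity list the Dataverse container?
         from notebookutils import mssparkutils

         files = mssparkutils.fs.ls("abfss://<container>@<storage>.dfs.core.windows.net/")
         for f in files:
             print(f.name)
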
    2. Read Parquet/Delta Data (If Delta Enabled)
         # For Delta tables:
         df = spark.read.format("delta").load("abfss://<container>@<storage>.dfs.core.windows.net/Tables/<table>/delta")
         df.show()
         # For Parquet:
         df = spark.read.parquet("abfss://<container>@<storage>.dfs.core.windows.net/Tables/<table>/parquet")
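
       Once the link is reconfigured with Delta enabled, the auto-created lake database tables should also resolve from Spark SQL in a notebook, which is what failed originally (database and table names are placeholders):

         # Query the lake database table directly once Delta is enabled
         df = spark.sql("SELECT * FROM <lake_db>.<table> LIMIT 100")
         df.show()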
      
    3. Handle CSV files and schema extraction in two steps:
      • Parse model.json for Schema:
                import json

                # Read model.json ("spark" is already defined in a Synapse notebook session)
                model_json_path = "abfss://<container>@<storage>.dfs.core.windows.net/model.json"
                model_content = spark.sparkContext.wholeTextFiles(model_json_path).collect()[0][1]
                model = json.loads(model_content)

                # Extract the column names for a specific table (e.g., "account")
                table_schema = [attr["name"] for table in model["entities"] if table["name"] == "account" for attr in table["attributes"]]
        
      • Read the CSV files and apply the extracted schema:
                # Read all CSV files (including partitions and snapshots)
                path = "abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/**/*.csv"
                df = spark.read.option("header", "false").csv(path)
                df = df.toDF(*table_schema)
                df.show()
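
       If you need typed columns instead of all strings, the attribute entries in model.json also carry a dataType field that can be mapped to Spark types. A minimal sketch, reusing the "model" and "path" variables from the snippets above (the type mapping here is illustrative, not exhaustive):

                from pyspark.sql.types import (StructType, StructField, StringType,
                                               LongType, DoubleType, BooleanType, TimestampType)

                # Illustrative CDM-to-Spark type mapping; extend it for your tables
                type_map = {
                    "string": StringType(), "int64": LongType(), "double": DoubleType(),
                    "boolean": BooleanType(), "dateTime": TimestampType(), "guid": StringType(),
                }
                entity = next(t for t in model["entities"] if t["name"] == "account")
                schema = StructType([
                    StructField(a["name"], type_map.get(a.get("dataType", "string"), StringType()), True)
                    for a in entity["attributes"]
                ])
                df = spark.read.option("header", "false").schema(schema).csv(path)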
        
    4. Lastly, you will need to handle partitioned data and snapshots to make sure you include all the data:
      • Use wildcards to read partitioned folders (e.g., Tables/account/**/*.csv).
      • Snapshots are stored under Tables/account/snapshot; exclude them if they are not needed, for example with a single-level wildcard (an alternative path-based filter is sketched below):
             path = "abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/*/*.csv"  # Excludes "snapshot" 
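
      If the single-level wildcard does not match your folder layout, another option is to read everything and then drop snapshot rows by their source file path, using Spark's built-in input_file_name() function:

             from pyspark.sql.functions import input_file_name, col

             # Read all CSVs, tag each row with its source file, then drop snapshot rows
             df = spark.read.option("header", "false").csv("abfss://<container>@<storage>.dfs.core.windows.net/Tables/account/**/*.csv")
             df = (df.withColumn("_source_file", input_file_name())
                     .filter(~col("_source_file").contains("/snapshot/"))
                     .drop("_source_file"))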
        

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.


    Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.

