This tutorial ingests data into Fabric lakehouses in Delta Lake format. We define some important terms here:
Lakehouse - A lakehouse is a collection of files, folders, and/or tables that represent a database over a data lake. The Spark engine and SQL engine use lakehouse resources for big data processing. When you use open-source Delta-formatted tables, that processing includes enhanced ACID transaction capabilities.
Delta Lake - Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and streaming data processing to Apache Spark. As a data table format, Delta Lake extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata management.
Azure Open Datasets are curated public datasets that add scenario-specific features to machine learning solutions, which leads to more accurate models. Open Datasets are cloud resources that reside on Microsoft Azure Storage. Apache Spark, the REST API, Data Factory, and other tools can access Open Datasets.
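To make the lakehouse and Delta Lake terms concrete, here's a minimal sketch of writing and reading a Delta table from a Fabric notebook. It assumes a Spark session named spark and an attached default lakehouse; the table name and sample values are illustrative and aren't part of this tutorial's dataset pipeline.
# Minimal sketch (illustrative): persist a small DataFrame as a Delta table in
# the attached lakehouse, then read it back. Assumes a Spark session `spark`.
df_demo = spark.createDataFrame(
    [(15634602, "Hargrave", 619), (15647311, "Hill", 608)],
    ["CustomerId", "Surname", "CreditScore"],
)

# saveAsTable writes Delta-formatted Parquet files plus a file-based
# transaction log, which is what enables the ACID guarantees described above.
df_demo.write.format("delta").mode("overwrite").saveAsTable("demo_delta_table")

spark.read.table("demo_delta_table").show()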
In this tutorial, you use Apache Spark to:
- Read data from Azure Open Datasets containers.
- Write data into a Fabric lakehouse delta table.
Prerequisites
- Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
- Sign in to Microsoft Fabric.
- Use the experience switcher on the bottom left side of your home page to switch to Fabric.
- Add a lakehouse to this notebook. In this tutorial, you first download data from a public blob. Then, the data is stored in that lakehouse resource.
Follow along in a notebook
The 1-ingest-data.ipynb notebook accompanies this tutorial.
To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials to import the notebook to your workspace.
If you'd rather copy and paste the code from this page, you can create a new notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Bank churn data
The dataset contains churn status information for 10,000 customers. It also includes attributes that could influence churn - for example:
- Credit score
- Geographical location (Germany, France, Spain)
- Gender (male, female)
- Age
- Tenure (the number of years the customer was a client at that bank)
- Account balance
- Estimated salary
- Number of products that a customer purchased through the bank
- Credit card status (whether or not a customer has a credit card)
- Active member status (whether or not the customer has an active bank customer status)
The dataset also includes these columns:
- row number
- customer ID
- customer surname
These columns should have no influence on the decision of a customer to leave the bank.
The closure of a customer's bank account defines the churn of that customer. The Exited column in the dataset refers to the customer's departure. Little context about these attributes is available, so you must proceed without background information about the dataset. The goal is to understand how these attributes contribute to the Exited status.
Sample dataset rows:
CustomerID | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited
---|---|---|---|---|---|---|---|---|---|---|---|---
15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1
15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0
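As a quick illustration of that goal, once the data is loaded into a Spark DataFrame later in this tutorial, you could compare churn rates across one of these attributes. The DataFrame name df and the aggregation below are an illustrative sketch, not part of the original notebook:
# Sketch: compare churn rates by geography once the data is loaded into a
# Spark DataFrame `df` (the ingestion code appears later in this tutorial).
from pyspark.sql import functions as F

df.groupBy("Geography").agg(
    F.avg("Exited").alias("churn_rate"),   # Share of customers who left
    F.count("*").alias("customers"),       # Customers per geography
).show()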
Download dataset and upload to lakehouse
Tip
When you define the following parameters, you can easily use this notebook with different datasets:
IS_CUSTOM_DATA = False  # If True, the dataset has to be uploaded manually
DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn"  # Folder with data files
DATA_FILE = "churn.csv"  # Data file name
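For example, to run this notebook against a churn file you uploaded to the attached lakehouse yourself, you might set the parameters as shown below; the file name here is hypothetical:
# Illustrative only: point the notebook at a dataset you uploaded yourself.
IS_CUSTOM_DATA = True              # Skip the public download step
DATA_ROOT = "/lakehouse/default"   # Mount point of the attached lakehouse
DATA_FOLDER = "Files/churn"        # Folder that holds your data files
DATA_FILE = "my_churn_export.csv"  # Hypothetical custom file name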
The following code snippet downloads a publicly available version of the dataset, and then stores that resource in a Fabric lakehouse:
Important
Make sure you add a lakehouse to the notebook before you run it. Failure to do so results in an error.
import os, requests

if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse if they don't already exist
    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/bankcustomerchurn"
    file_list = [DATA_FILE]
    download_path = f"{DATA_ROOT}/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
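The snippet above only lands the raw CSV file under the Files section of the lakehouse. As a sketch of the second goal stated earlier - writing the data into a Fabric lakehouse delta table - you could load the CSV with Spark and save it as a Delta table. The table name churn_raw is illustrative, and later parts of the tutorial series may use different names or options:
# Sketch (assumed, not part of the original snippet): read the downloaded CSV
# with Spark and persist it as a Delta table in the attached lakehouse.
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(f"{DATA_FOLDER}/raw/{DATA_FILE}")  # Relative path resolves to the default lakehouse
)

# The table name "churn_raw" is illustrative.
df.write.format("delta").mode("overwrite").saveAsTable("churn_raw")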
Related content
You use the data you just ingested in: