This tutorial ingests data into Fabric lakehouses in Delta Lake format. We define some important terms here:
Lakehouse - A lakehouse is a collection of files, folders, and/or tables that represent a database over a data lake. The Spark engine and SQL engine use lakehouse resources for big data processing. When you use open-source Delta-formatted tables, that processing includes enhanced ACID transaction capabilities.
Delta Lake - Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and streaming data processing to Apache Spark. As a data table format, Delta Lake extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata management.
Azure Open Datasets are curated public datasets that add scenario-specific features to machine learning solutions, which leads to more accurate models. Open Datasets are cloud resources that reside on Microsoft Azure Storage. Apache Spark, the REST API, Data Factory, and other tools can access Open Datasets.
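To make the lakehouse and Delta Lake terms concrete, here's a minimal sketch of writing and reading a Delta table from a Fabric notebook. It assumes a Spark session named spark and an attached default lakehouse; the table name and sample values are illustrative and aren't part of this tutorial's dataset pipeline.
# Minimal sketch (illustrative): persist a small DataFrame as a Delta table in
# the attached lakehouse, then read it back. Assumes a Spark session `spark`.
df_demo = spark.createDataFrame(
    [(15634602, "Hargrave", 619), (15647311, "Hill", 608)],
    ["CustomerId", "Surname", "CreditScore"],
)

# saveAsTable writes Delta-formatted Parquet files plus a file-based
# transaction log, which is what enables the ACID guarantees described above.
df_demo.write.format("delta").mode("overwrite").saveAsTable("demo_delta_table")

spark.read.table("demo_delta_table").show()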
In this tutorial, you use Apache Spark to:
- Read data from Azure Open Datasets containers.
- Write data into a Fabric lakehouse delta table.
Prerequisites
- Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
- Sign in to Microsoft Fabric.
- Use the experience switcher on the bottom left side of your home page to switch to Fabric.
- Add a lakehouse to this notebook. In this tutorial, you first download data from a public blob. Then, the data is stored in that lakehouse resource.
Follow along in a notebook
The 1-ingest-data.ipynb notebook accompanies this tutorial.
To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials to import the notebook to your workspace.
If you'd rather copy and paste the code from this page, you can create a new notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Bank churn data
The dataset contains churn status information for 10,000 customers. It also includes attributes that could influence churn - for example:
- Credit score
- Geographical location (Germany, France, Spain)
- Gender (male, female)
- Age
- Tenure (the number of years the customer was a client at that bank)
- Account balance
- Estimated salary
- Number of products that a customer purchased through the bank
- Credit card status (whether or not a customer has a credit card)
- Active member status (whether or not the customer has an active bank customer status)
The dataset also includes these columns:
- row number
- customer ID
- customer surname
These columns should have no influence on the decision of a customer to leave the bank.
The closure of a customer's bank account defines the churn of that customer. The Exited column in the dataset refers to the customer's departure. Little context about these attributes is available, so you must proceed without background information about the dataset. The goal is to understand how these attributes contribute to the Exited status.
Sample dataset rows:
CustomerID | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited
---|---|---|---|---|---|---|---|---|---|---|---|---
15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1
15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0
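As a quick illustration of that goal, once the data is loaded into a Spark DataFrame later in this tutorial, you could compare churn rates across one of these attributes. The DataFrame name df and the aggregation below are an illustrative sketch, not part of the original notebook:
# Sketch: compare churn rates by geography once the data is loaded into a
# Spark DataFrame `df` (the ingestion code appears later in this tutorial).
from pyspark.sql import functions as F

df.groupBy("Geography").agg(
    F.avg("Exited").alias("churn_rate"),   # Share of customers who left
    F.count("*").alias("customers"),       # Customers per geography
).show()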
Download dataset and upload to lakehouse
Tip
When you define the following parameters, you can easily use this notebook with different datasets:
IS_CUSTOM_DATA = False  # If True, the dataset has to be uploaded manually
DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn"  # Folder with data files
DATA_FILE = "churn.csv"  # Data file name
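For example, to run this notebook against a churn file you uploaded to the attached lakehouse yourself, you might set the parameters as shown below; the file name here is hypothetical:
# Illustrative only: point the notebook at a dataset you uploaded yourself.
IS_CUSTOM_DATA = True              # Skip the public download step
DATA_ROOT = "/lakehouse/default"   # Mount point of the attached lakehouse
DATA_FOLDER = "Files/churn"        # Folder that holds your data files
DATA_FILE = "my_churn_export.csv"  # Hypothetical custom file name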
The following code snippet downloads a publicly available version of the dataset, and then stores that resource in a Fabric lakehouse:
Important
Make sure you add a lakehouse to the notebook before you run it. Failure to do so results in an error.
import os, requests

if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse if they don't already exist
    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/bankcustomerchurn"
    file_list = [DATA_FILE]
    download_path = f"{DATA_ROOT}/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
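The snippet above only lands the raw CSV file under the Files section of the lakehouse. As a sketch of the second goal stated earlier - writing the data into a Fabric lakehouse delta table - you could load the CSV with Spark and save it as a Delta table. The table name churn_raw is illustrative, and later parts of the tutorial series may use different names or options:
# Sketch (assumed, not part of the original snippet): read the downloaded CSV
# with Spark and persist it as a Delta table in the attached lakehouse.
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(f"{DATA_FOLDER}/raw/{DATA_FILE}")  # Relative path resolves to the default lakehouse
)

# The table name "churn_raw" is illustrative.
df.write.format("delta").mode("overwrite").saveAsTable("churn_raw")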
Related content
You use the data you just ingested in: