APPLIES TO:
Python SDK azureml v1
Important
This article provides information on using the Azure Machine Learning SDK v1. The SDK v1 is deprecated as of March 31, 2025, and support for it ends on June 30, 2026. You can install and use the SDK v1 until that date.
We recommend that you transition to the SDK v2 before June 30, 2026. For more information on the SDK v2, see What is the Azure Machine Learning Python SDK v2 and the SDK v2 reference.
Azure Machine Learning's automated machine learning (AutoML) capability helps you discover high-performing models without reimplementing every possible approach. Combined with Azure Machine Learning pipelines, you can create deployable workflows that quickly discover the algorithm that works best for your data.
This article explains how to efficiently join a data preparation step to an automated machine learning step by using Python. AutoML can quickly discover the algorithm that works best for your data, while putting you on the road to MLOps and model lifecycle operationalization with pipelines.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.
An Azure Machine Learning workspace. See Create resources you need to get started.
Familiarity with Azure's automated machine learning and machine learning pipeline facilities and the SDK.
Review AutoML's central classes
Automated machine learning in a pipeline is represented by an AutoMLStep object. The AutoMLStep class is a subclass of PipelineStep. A graph of PipelineStep objects defines a Pipeline.
There are several subclasses of PipelineStep. In addition to the AutoMLStep, this article shows a PythonScriptStep for data preparation and another for registering the model.
The preferred way to initially move data into a machine learning pipeline is with Dataset objects. To move data between steps and save data output from runs, the preferred way is with OutputFileDatasetConfig and OutputTabularDatasetConfig objects. To be used with AutoMLStep, the PipelineData object must be transformed into a PipelineOutputTabularDataset object. For more information, see Moving data into and between ML pipeline steps.
The AutoMLStep is configured via an AutoMLConfig object. AutoMLConfig is a flexible class, as discussed in Configure your experiment settings.
A Pipeline runs in an Experiment. The pipeline Run has, for each step, a child StepRun. The outputs of the automated machine learning StepRun are the training metrics and the highest-performing model.
To make things concrete, this article creates a simple pipeline for a classification task. The task is predicting Titanic survival, but we don't discuss the data or task except in passing.
Get started
Retrieve initial dataset
Often, a machine learning workflow starts with preexisting baseline data. This is a good scenario for a registered dataset. Datasets are visible across the workspace, support versioning, and can be interactively explored. There are many ways to create and populate a dataset, as discussed in Create Azure Machine Learning datasets. Since we're using the Python SDK to create our pipeline, use the SDK to download baseline data and register it with the name titanic_ds.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

if 'titanic_ds' not in ws.datasets:
    # Create a TabularDataset from Titanic training data
    web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
                 'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
    titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

    titanic_ds.register(workspace=ws,
                        name='titanic_ds',
                        description='Titanic baseline data',
                        create_new_version=True)

titanic_ds = Dataset.get_by_name(ws, 'titanic_ds')
The code first logs in to the Azure Machine Learning workspace defined in config.json. To learn how to create configuration files, see Create a workspace configuration file. If there isn't already a dataset named 'titanic_ds' registered, the code creates one: it downloads CSV data from the web, instantiates a TabularDataset, and registers the dataset with the workspace. Finally, the function Dataset.get_by_name() assigns the Dataset to titanic_ds.
Configure your storage and compute target
Additional resources that the pipeline needs are storage and, generally, Azure Machine Learning compute resources.
from azureml.core import Datastore
from azureml.core.compute import AmlCompute, ComputeTarget

datastore = ws.get_default_datastore()

compute_name = 'cpu-cluster'
if compute_name not in ws.compute_targets:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                                min_nodes=0,
                                                                max_nodes=1)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # Show the result
    print(compute_target.get_status().serialize())

compute_target = ws.compute_targets[compute_name]
The intermediate data between the data preparation and the AutoML step can be stored in the workspace's default datastore, so you don't need to do more than call get_default_datastore() on the Workspace object.
After that, the code checks whether the Azure Machine Learning compute target 'cpu-cluster' already exists. If it doesn't, the code provisions a small CPU-based compute target. If you plan to use AutoML's deep learning features (for instance, text featurization with DNN support), you should choose a compute with strong GPU support, as described in GPU optimized virtual machine sizes.

The code blocks until the target is provisioned and then prints some details of the just-created compute target. Finally, the named compute target is retrieved from the workspace and assigned to compute_target.
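If you do need GPU compute, the provisioning pattern is the same. The following is a minimal sketch rather than part of this article's pipeline; the 'gpu-cluster' name and the STANDARD_NC6 VM size are assumptions, so substitute a GPU SKU available in your region.

# Hypothetical GPU cluster for AutoML's deep learning features.
# 'gpu-cluster' and STANDARD_NC6 are assumptions; choose a GPU SKU
# available in your region and quota.
gpu_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                   min_nodes=0,
                                                   max_nodes=1)
gpu_target = ComputeTarget.create(ws, 'gpu-cluster', gpu_config)
gpu_target.wait_for_completion(show_output=True, timeout_in_minutes=20)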
Configure the training run
The runtime context is set by creating and configuring a RunConfiguration object. Here we set the compute target.
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

aml_run_config = RunConfiguration()
# Use just-specified compute target ("cpu-cluster")
aml_run_config.target = compute_target

# Specify CondaDependencies obj, add necessary packages
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas', 'scikit-learn'],
    pip_packages=['azureml-sdk[automl]', 'pyarrow'])
Prepare data for AutoML
Write the data preparation code
The baseline Titanic dataset consists of mixed numerical and text data, with some values missing. To prepare it for automated machine learning, the data preparation pipeline step:

- Fills missing data with either random data or a category corresponding to Unknown
- Transforms categorical data to integers
- Drops columns that you don't intend to use
- Writes the transformed data to the OutputFileDatasetConfig output path
%%writefile dataprep.py
from azureml.core import Run

import os
import argparse
import pandas as pd
import numpy as np

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)  # Make the random Age imputation reproducible

def prepare_age(df):
    # Fill in missing Age values from the distribution of present Age values
    mean = df["Age"].mean()
    std = df["Age"].std()
    is_null = df["Age"].isnull().sum()
    # Compute enough (== number of nulls) random ages between mean-std and mean+std
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    # Fill NaN values in the Age column with the random values generated
    age_slice = df["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    df["Age"] = age_slice
    df["Age"] = df["Age"].astype(int)

    # Quantize age into 5 classes
    df['Age_Group'] = pd.qcut(df['Age'], 5, labels=False)
    df.drop(['Age'], axis=1, inplace=True)
    return df

def prepare_fare(df):
    df['Fare'].fillna(0, inplace=True)
    df['Fare_Group'] = pd.qcut(df['Fare'], 5, labels=False)
    df.drop(['Fare'], axis=1, inplace=True)
    return df

def prepare_genders(df):
    genders = {"male": 0, "female": 1, "unknown": 2}
    df['Sex'] = df['Sex'].map(genders)
    df['Sex'].fillna(2, inplace=True)
    df['Sex'] = df['Sex'].astype(int)
    return df

def prepare_embarked(df):
    df['Embarked'].replace('', 'U', inplace=True)
    df['Embarked'].fillna('U', inplace=True)
    ports = {"S": 0, "C": 1, "Q": 2, "U": 3}
    df['Embarked'] = df['Embarked'].map(ports)
    return df

parser = argparse.ArgumentParser()
parser.add_argument('--output_path', dest='output_path', required=True)
args = parser.parse_args()

titanic_ds = Run.get_context().input_datasets['titanic_ds']
df = titanic_ds.to_pandas_dataframe().drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

df = prepare_embarked(prepare_genders(prepare_fare(prepare_age(df))))

df.to_csv(os.path.join(args.output_path, "prepped_data.csv"))
print(f"Wrote prepped data to {args.output_path}/prepped_data.csv")
The preceding code snippet is a complete, but minimal, example of data preparation for the Titanic data. The snippet starts with a Jupyter magic command to output the code to a file. If you aren't using a Jupyter notebook, remove that line and create the file manually.
The various prepare_ functions in the snippet modify the relevant column in the input dataset. These functions work on the data once it has been changed into a Pandas DataFrame object. In each case, missing data is either filled with representative random data or categorical data indicating Unknown. Text-based categorical data is mapped to integers. No-longer-needed columns are overwritten or dropped.
After the code defines the data preparation functions, the code parses the input argument, which is the path to which we want to write our data. (This value is determined by the OutputFileDatasetConfig object discussed in the next step.) The code retrieves the registered 'titanic_ds' Dataset, converts it to a Pandas DataFrame, and calls the various data preparation functions.
Since the output_path is a directory, the call to to_csv() specifies the filename prepped_data.csv.
Write the data preparation pipeline step (PythonScriptStep)
The data preparation code described previously must be associated with a PythonScriptStep object to be used with a pipeline. The path to which the CSV output is written is generated by an OutputFileDatasetConfig object. The resources prepared earlier, such as the ComputeTarget, the RunConfig, and the 'titanic_ds' Dataset, are used to complete the specification.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

prepped_data_path = OutputFileDatasetConfig(name="output_path")

dataprep_step = PythonScriptStep(
    name="dataprep",
    script_name="dataprep.py",
    compute_target=compute_target,
    runconfig=aml_run_config,
    arguments=["--output_path", prepped_data_path],
    inputs=[titanic_ds.as_named_input('titanic_ds')],
    allow_reuse=True)
The prepped_data_path object is of type OutputFileDatasetConfig, which points to a directory. Notice that it's specified in the arguments parameter. If you review the previous step, you see that within the data preparation code, the value of the argument '--output_path' is the directory path at which the CSV file is written.
Train with AutoMLStep
Configuring an automated machine learning pipeline step is done with the AutoMLConfig class. To learn more about this flexible class, see AutoMLConfig Class. Data input and output are the only aspects of configuration that require special attention in a machine learning pipeline; input and output for AutoMLConfig in pipelines are discussed later in this article. Beyond data, an advantage of machine learning pipelines is the ability to use different compute targets for different steps. You might choose to use a more powerful ComputeTarget only for the automated machine learning process. Doing so is as straightforward as assigning a more powerful RunConfiguration to the AutoMLConfig object's run_configuration parameter.
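For instance, here's a sketch of what that reassignment might look like. The 'gpu-cluster' target is hypothetical; everything else mirrors the aml_run_config created earlier.

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Hypothetical: a second run configuration that targets a more powerful
# (for example, GPU-based) compute for the AutoML step only.
automl_run_config = RunConfiguration()
automl_run_config.target = ws.compute_targets['gpu-cluster']  # assumed to exist
automl_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['azureml-sdk[automl]'])

# Pass automl_run_config to AutoMLConfig(run_configuration=...) instead of
# aml_run_config; the data preparation step keeps the CPU configuration.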
Send data to AutoMLStep
In a machine learning pipeline, the input data must be a Dataset object. The highest-performing way is to provide the input data in the form of OutputTabularDatasetConfig objects. You create an object of that type with the read_delimited_files() method on an OutputFileDatasetConfig, such as the prepped_data_path object.
# type(prepped_data) == OutputTabularDatasetConfig
prepped_data = prepped_data_path.read_delimited_files()
Another option is to use Dataset objects registered in the workspace:
prepped_data = Dataset.get_by_name(ws, 'Data_prepared')
Comparing the two techniques:
Technique | Benefits and drawbacks
---|---
OutputTabularDatasetConfig | Higher performance
 | Natural route from OutputFileDatasetConfig
 | Data isn't persisted after pipeline run
Registered Dataset | Lower performance
 | Can be generated in many ways
 | Data persists and is visible throughout workspace
 | Notebook showing registered Dataset technique
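If you want the persistence benefits of a registered Dataset while still producing the data inside the pipeline, you can register the step's output when the run completes. A minimal sketch; the 'Data_prepared' name is an assumption that matches the get_by_name() example above.

# Sketch: register the prepared output as a dataset when the run completes,
# so later pipelines (or people) can retrieve it by name.
prepped_data = prepped_data_path.read_delimited_files().register_on_complete(
    name='Data_prepared',
    description='Prepared Titanic data')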
Specify AutoML outputs
The outputs of the AutoMLStep are the final metric scores of the highest-performing model and that model itself. To use these outputs in further pipeline steps, prepare PipelineData objects to receive them.
from azureml.pipeline.core import TrainingOutput, PipelineData

metrics_data = PipelineData(name='metrics_data',
                            datastore=datastore,
                            pipeline_output_name='metrics_output',
                            training_output=TrainingOutput(type='Metrics'))

model_data = PipelineData(name='best_model_data',
                          datastore=datastore,
                          pipeline_output_name='model_output',
                          training_output=TrainingOutput(type='Model'))
This snippet creates the two PipelineData objects for the metrics and model output. Each is named, assigned to the default datastore retrieved earlier, and associated with the particular type of TrainingOutput from the AutoMLStep. Because we assign pipeline_output_name on these PipelineData objects, their values are available not just from the individual pipeline step, but from the pipeline as a whole, as discussed later in the section Examine pipeline results.
Configure and create the AutoML pipeline step
Once the inputs and outputs are defined, it's time to create the AutoMLConfig and AutoMLStep. The details of the configuration depend on your task, as described in Set up AutoML training with Python. For the Titanic survival classification task, the following snippet demonstrates a simple configuration.
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep

# Change iterations to a reasonable number (50) to get better accuracy
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "experiment_timeout_hours" : 0.25,
    "primary_metric" : 'AUC_weighted'
}

automl_config = AutoMLConfig(task = 'classification',
                             path = '.',
                             debug_log = 'automated_ml_errors.log',
                             compute_target = compute_target,
                             run_configuration = aml_run_config,
                             featurization = 'auto',
                             training_data = prepped_data,
                             label_column_name = 'Survived',
                             **automl_settings)

train_step = AutoMLStep(name='AutoML_Classification',
                        automl_config=automl_config,
                        passthru_automl_config=False,
                        outputs=[metrics_data, model_data],
                        enable_default_model_output=False,
                        enable_default_metrics_output=False,
                        allow_reuse=True)
The snippet shows an idiom commonly used with AutoMLConfig. Arguments that are more fluid (hyperparameter-ish) are specified in a separate dictionary, while the values less likely to change are specified directly in the AutoMLConfig constructor. In this case, the automl_settings specify a brief run: the run stops after only two iterations or 15 minutes, whichever comes first.

The automl_settings dictionary is passed to the AutoMLConfig constructor as kwargs. The other parameters aren't complex:
- task is set to classification for this example. Other valid values are regression and forecasting.
- path and debug_log describe the path to the project and a local file to which debug information is written.
- compute_target is the previously defined compute_target that, in this example, is an inexpensive CPU-based machine. If you're using AutoML's deep learning facilities, you would want to change the compute target to be GPU-based.
- featurization is set to auto. More details can be found in the Data featurization section of the AutoML configuration document.
- label_column_name indicates which column you're interested in predicting.
- training_data is set to the OutputTabularDatasetConfig objects made from the outputs of the data preparation step.
The AutoMLStep itself takes the AutoMLConfig and has, as outputs, the PipelineData objects created to hold the metrics and model data.
Important
You must set enable_default_model_output and enable_default_metrics_output to True only if you are using AutoMLStepRun.
In this example, the AutoML process performs cross-validations on the training_data. You can control the number of cross-validations with the n_cross_validations argument. If you've already split your training data as part of your data preparation steps, you can set validation_data to its own Dataset.
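As an illustration, here's a sketch of the cross-validation variant. It reuses prepped_data and automl_settings from the snippets above, and the value 5 is just an example.

# Sketch: request five cross-validation folds explicitly instead of
# accepting the default behavior.
automl_config = AutoMLConfig(task='classification',
                             compute_target=compute_target,
                             run_configuration=aml_run_config,
                             featurization='auto',
                             training_data=prepped_data,
                             label_column_name='Survived',
                             n_cross_validations=5,
                             **automl_settings)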
You might occasionally see the use of X for data features and y for data labels. This technique is deprecated, and you should use training_data for input.
Register the model generated by AutoML
The last step in a simple machine learning pipeline is registering the created model. Adding the model to the workspace's model registry makes it available in the Azure portal and allows it to be versioned. To register the model, write another PythonScriptStep that takes the model_data output of the AutoMLStep.
Write the code to register the model
A model is registered in a Workspace. You're probably familiar with using Workspace.from_config() to sign in to your workspace on your local machine, but there's another way to get the workspace from within a running machine learning pipeline: Run.get_context() retrieves the active Run, and that run object provides access to many important objects, including the Workspace used here.
%%writefile register_model.py
from azureml.core.model import Model
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

print(f"model_name : {args.model_name}")
print(f"model_path: {args.model_path}")

# An _OfflineRun means the script is running outside a submitted pipeline,
# so fall back to the local workspace configuration.
run = Run.get_context()
ws = Workspace.from_config() if type(run) == _OfflineRun else run.experiment.workspace

model = Model.register(workspace=ws,
                       model_path=args.model_path,
                       model_name=args.model_name)

print(f"Registered version {model.version} of model {model.name}")
Write the PythonScriptStep code
Warning
If you're using the Azure Machine Learning SDK v1, and your workspace is configured for network isolation, you might receive an error when running this step. For more information, see HyperdriveStep and AutoMLStep fail with network isolation.
The model-registering PythonScriptStep uses a PipelineParameter for one of its arguments. Pipeline parameters are arguments to pipelines that can be easily set at run-submission time. Once declared, they're passed as normal arguments.
from azureml.pipeline.core.graph import PipelineParameter

# The model name with which to register the trained model in the workspace.
model_name = PipelineParameter("model_name", default_value="TitanicSurvivalInitial")

register_step = PythonScriptStep(script_name="register_model.py",
                                 name="register_model",
                                 allow_reuse=False,
                                 arguments=["--model_name", model_name, "--model_path", model_data],
                                 inputs=[model_data],
                                 compute_target=compute_target,
                                 runconfig=aml_run_config)
Create and run your AutoML pipeline
Creating and running a pipeline that contains an AutoMLStep is no different than a normal pipeline.
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
pipeline = Pipeline(ws, [dataprep_step, train_step, register_step])
experiment = Experiment(workspace=ws, name='titanic_automl')
run = experiment.submit(pipeline, show_output=True)
run.wait_for_completion()
This code combines the data preparation, automated machine learning, and model-registering steps into a Pipeline object. It then creates an Experiment object. The Experiment constructor retrieves the named experiment if it exists, or creates it if necessary. It submits the Pipeline to the Experiment, creating a Run object that asynchronously runs the pipeline. The wait_for_completion() function blocks until the run completes.
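Because register_step declared a PipelineParameter, you can override its default at submission time. A sketch; the alternate model name here is just an example.

# Sketch: override the model_name pipeline parameter for this submission.
run = experiment.submit(pipeline,
                        show_output=True,
                        pipeline_parameters={"model_name": "TitanicSurvivalExperimental"})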
Examine pipeline results
Once the run completes, you can retrieve the PipelineData objects that were assigned a pipeline_output_name. You can download the results and load them for further processing.
metrics_output_port = run.get_pipeline_output('metrics_output')
model_output_port = run.get_pipeline_output('model_output')
metrics_output_port.download('.', show_progress=True)
model_output_port.download('.', show_progress=True)
Downloaded files are written to the subdirectory azureml/{run.id}/. The metrics file is JSON-formatted and can be converted into a Pandas dataframe for examination.
For local processing, you might need to install relevant packages, such as Pandas, Pickle, the Azure Machine Learning SDK, and so forth. For this example, it's likely that the best model found by automated machine learning depends on XGBoost.
!pip install xgboost==0.90

import pandas as pd
import json

metrics_filename = metrics_output_port._path_on_datastore
# metrics_filename = path to downloaded file

with open(metrics_filename) as f:
    metrics_output_result = f.read()

deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
This code snippet shows the metrics file being loaded from its location on the Azure datastore. You can also load it from the downloaded file, as shown in the comment. After you deserialize it and convert it to a Pandas DataFrame, you can see detailed metrics for each of the iterations of the automated machine learning step.
The model file can be deserialized into a Model object that you can use for inferencing, further metrics analysis, and so forth.
import pickle

model_filename = model_output_port._path_on_datastore
# model_filename = path to downloaded file

with open(model_filename, "rb") as f:
    best_model = pickle.load(f)

# ... inferencing code not shown ...
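As a hint at what that elided inferencing code might look like, here's a sketch. It assumes the best model follows the scikit-learn predict() convention (AutoML models generally do) and that prepped_data.csv is the file written by the data preparation step, downloaded locally.

# Hypothetical inferencing sketch; the column name and file location are
# assumptions based on the data preparation step above.
import pandas as pd

sample = pd.read_csv('prepped_data.csv', index_col=0).drop('Survived', axis=1)
predictions = best_model.predict(sample)
print(predictions[:10])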
For more information on loading and working with existing models, see Deploy machine learning models to Azure.
Download the results of an AutoML run
If you've been following along with the article, you have an instantiated Run object. But you can also retrieve completed Run objects from the Workspace by way of an Experiment object.

The workspace contains a complete record of all your experiments and runs. You can either use the portal to find and download the outputs of experiments or use code. To access the records from a historic run, use Azure Machine Learning to find the ID of the run in which you're interested. With that ID, you can choose the specific run by way of the Workspace and Experiment.
# Retrieved from Azure Machine Learning web UI
run_id = 'aaaaaaaa-bbbb-cccc-dddd-0123456789AB'
experiment = ws.experiments['titanic_automl']
run = next(run for run in experiment.get_runs() if run.id == run_id)
You need to change the run_id string in the preceding code to the specific ID of your historical run. The snippet assumes that you've assigned ws to the relevant Workspace with the normal from_config(). The experiment of interest is directly retrieved, and then the code finds the Run of interest by matching the run.id value.
Once you have a Run object, you can download the metrics and model.
automl_run = next(r for r in run.get_children() if r.name == 'AutoML_Classification')
outputs = automl_run.get_outputs()
metrics = outputs['default_metrics_AutoML_Classification']
model = outputs['default_model_AutoML_Classification']
metrics.get_port_data_reference().download('.')
model.get_port_data_reference().download('.')
Each Run object contains StepRun objects that contain information about the individual pipeline step run. The run is searched for the StepRun object corresponding to the AutoMLStep. The metrics and model are retrieved using their default names, which are available even if you don't pass PipelineData objects to the outputs parameter of the AutoMLStep.
Finally, the actual metrics and model are downloaded to your local machine, as was discussed in the Examine pipeline results section.