parallel Package

Classes

ParallelJob

Parallel job.

RunFunction

Run Function.

Functions

parallel_run_function

Create a Parallel object which can be used inside dsl.pipeline as a function and can also be created as a standalone parallel job.

For an example of using parallel_run_function, see the notebook https://aka.ms/parallel-example-notebook

Note

To use parallel_run_function:

1. Create a <xref:azure.ai.ml.entities._builders.Parallel> object to specify how the parallel run is performed, with parameters to control the batch size, the number of nodes per compute target, and a reference to your custom Python script.

2. Build a pipeline with the parallel object as a function, defining the inputs and outputs for the step.

3. Submit the pipeline to run.


    from azure.ai.ml import Input, Output
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Environment
    from azure.ai.ml.parallel import parallel_run_function, RunFunction

   parallel_run = parallel_run_function(
       name="batch_score_with_tabular_input",
       display_name="Batch Score with Tabular Dataset",
       description="parallel component for batch score",
       inputs=dict(
           job_data_path=Input(
               type=AssetTypes.MLTABLE,
               description="The data to be split and scored in parallel",
           ),
           score_model=Input(
               type=AssetTypes.URI_FOLDER, description="The model for batch score."
           ),
       ),
       outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
       input_data="${{inputs.job_data_path}}",
       max_concurrency_per_instance=2,  # Optional, default is 1
       mini_batch_size="100",  # optional
       mini_batch_error_threshold=5,  # Optional, allowed failed count on mini batch items, default is -1
       logging_level="DEBUG",  # Optional, default is INFO
       error_threshold=5,  # Optional, allowed failed count totally, default is -1
       retry_settings=dict(max_retries=2, timeout=60),  # Optional
       task=RunFunction(
           code="./src",
           entry_script="tabular_batch_inference.py",
           environment=Environment(
               image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
               conda_file="./src/environment_parallel.yml",
           ),
           program_arguments="--model ${{inputs.score_model}}",
           append_row_to="${{outputs.job_output_path}}",  # Optional, if not set, summary_only
       ),
   )
parallel_run_function(*, name: str | None = None, description: str | None = None, tags: Dict | None = None, properties: Dict | None = None, display_name: str | None = None, experiment_name: str | None = None, compute: str | None = None, retry_settings: BatchRetrySettings | None = None, environment_variables: Dict | None = None, logging_level: str | None = None, max_concurrency_per_instance: int | None = None, error_threshold: int | None = None, mini_batch_error_threshold: int | None = None, task: RunFunction | None = None, mini_batch_size: str | None = None, partition_keys: List | None = None, input_data: str | None = None, inputs: Dict | None = None, outputs: Dict | None = None, instance_count: int | None = None, instance_type: str | None = None, docker_args: str | None = None, shm_size: str | None = None, identity: ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, is_deterministic: bool = True, **kwargs: Any) -> Parallel

Keyword-Only Parameters

Name Description
name
str

Name of the parallel job or component created.

Default value: None
description
str

A friendly description of the parallel.

Default value: None
tags

Tags to be attached to this parallel.

Default value: None
properties

The asset property dictionary.

Default value: None
display_name
str

A friendly name.

Default value: None
experiment_name
str

Name of the experiment the job will be created under. If None is provided, the default will be set to the current directory name. Will be ignored as a pipeline step.

Default value: None
compute
str

The name of the compute where the parallel job is executed (will not be used if the parallel is used as a component/function).

Default value: None
retry_settings

Retry settings for failed parallel component runs.

Default value: None
environment_variables

A dictionary of environment variable names and values. These environment variables are set on the process where the user script is being executed.

Default value: None
logging_level
str

A string of the logging level name, which is defined in 'logging'. Possible values are 'WARNING', 'INFO', and 'DEBUG'. (optional, default value is 'INFO'.) This value could be set through PipelineParameter.

Default value: None
max_concurrency_per_instance
int

The maximum parallelism that each compute instance has.

Default value: None
error_threshold
int

The number of record failures for Tabular Dataset and file failures for File Dataset that should be ignored during processing. If the error count goes above this value, the job will be aborted. The error threshold is for the entire input rather than for the individual mini-batches sent to the run() method. The range is [-1, int.max]; -1 indicates that all failures are ignored during processing.

Default value: None
mini_batch_error_threshold
int

The number of mini-batch processing failures that should be ignored.

Default value: None
task

The parallel task.

Default value: None
mini_batch_size
str

For FileDataset input, this field is the number of files a user script can process in one run() call. For TabularDataset input, this field is the approximate size of data the user script can process in one run() call. Example values are 1024, 1024KB, 10MB, and 1GB. (optional, default value is 10 files for FileDataset and 1MB for TabularDataset.) This value could be set through PipelineParameter.

Default value: None
partition_keys

The keys used to partition the dataset into mini-batches. If specified, the data with the same key will be partitioned into the same mini-batch. If both partition_keys and mini_batch_size are specified, the partition keys will take effect. The input(s) must be partitioned dataset(s), and the partition_keys must be a subset of the keys of every input dataset for this to work.

Default value: None
input_data
str

The input data.

Default value: None
inputs

A dict of inputs used by this parallel.

Default value: None
outputs

The outputs of this parallel

Default value: None
instance_count
int

Optional number of instances or nodes used by the compute target. Defaults to 1

Default value: None
instance_type
str

Optional type of VM used, as supported by the compute target.

Default value: None
docker_args
str

Extra arguments to pass to the Docker run command. This would override any parameters that have already been set by the system, or in this section. This parameter is only supported for Azure ML compute types.

Default value: None
shm_size
str

Size of the docker container's shared memory block. This should be in the format of (number)(unit), where the number has to be greater than 0 and the unit can be one of b (bytes), k (kilobytes), m (megabytes), or g (gigabytes).

Default value: None
identity

The identity that the parallel run step (PRS) job will use while running on compute.

Default value: None
is_deterministic

Specifies whether the parallel will return the same output given the same input. If a parallel (component) is deterministic, when it is used as a node/step in a pipeline, it will reuse results from a previously submitted job in the current workspace that has the same inputs and settings. In this case, the step will not use any compute resource. Defaults to True; specify is_deterministic=False if you would like to avoid such reuse behavior.

Default value: True

Returns

Type Description

The parallel node