---
title: Read and write data
titleSuffix: Azure Machine Learning
description: Learn how to read and write data for consumption in Azure Machine Learning training jobs.
services: machine-learning
ms.service: machine-learning
ms.subservice: mldata
ms.topic: how-to
ms.author: yogipandey
author: ynpandey
ms.reviewer: ssalgadodev
ms.date: 05/26/2022
ms.custom: devx-track-python, devplatv2, sdkv2, cliv2, event-tier1-build-2022
---
[!INCLUDE sdk v2] [!INCLUDE cli v2]
Learn how to read and write data for your training jobs with the Azure Machine Learning Python SDK v2 (preview) and the Azure Machine Learning CLI extension v2.

To follow along, you need:

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
- An Azure Machine Learning workspace.

The Python examples in this article use an `MLClient` handle to connect to that workspace:
from azure.ai.ml import MLClient
from azure.identity import InteractiveBrowserCredential
#enter details of your AML workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AML_WORKSPACE_NAME>'
#get a handle to the workspace
ml_client = MLClient(InteractiveBrowserCredential(), subscription_id, resource_group, workspace)
You can use data from your current working directory in a training job with the `Input` class. The `Input` class allows you to define data inputs from a specific file (`uri_file`) or a folder location (`uri_folder`). In the `Input` object, you specify the `path` where your data is located; the path can be a local path or a cloud path. Azure Machine Learning supports `https://`, `abfss://`, `wasbs://`, and `azureml://` URIs.
> [!IMPORTANT]
> If the path is local, but your compute is defined to be in the cloud, Azure Machine Learning automatically uploads the data to cloud storage for you.
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_job_inputs = {
"input_data": Input(
path='./sample_data', # change to be your local directory
type=AssetTypes.URI_FOLDER
)
}
job = command(
code="./src", # local path where the code is stored
command='python train.py --input_folder ${{inputs.input_data}}',
inputs=my_job_inputs,
environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
compute="cpu-cluster"
)
#submit the command job
returned_job = ml_client.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint
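The `train.py` script referenced above isn't part of this article. As a hedged illustration only, a minimal script that matches the `--input_folder` argument used in the command might look like the following sketch (the CSV file names inside the folder are hypothetical):

```python
# train.py (illustrative sketch, not part of the Azure Machine Learning SDK)
import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_folder", type=str, help="path to the mounted or downloaded input folder")
args = parser.parse_args()

# the uri_folder input is exposed to the job as a local path
print("files in input folder:", os.listdir(args.input_folder))

# read every CSV in the folder into a single DataFrame (assumes the sample data is CSV)
frames = [
    pd.read_csv(os.path.join(args.input_folder, name))
    for name in os.listdir(args.input_folder)
    if name.endswith(".csv")
]
df = pd.concat(frames)
print(df.head())
```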
The following example shows how to read `uri_file` type data from a local file via CLI v2.
az ml job create -f <file-name>.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: ./example-data/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
You can read data from existing storage on Azure. You can register that existing Azure storage as an Azure Machine Learning datastore. Azure Machine Learning datastores securely keep the connection information to your data storage on Azure, so you don't have to code it in your scripts. You can access your data and create datastores with the following authentication options; a sketch of registering a datastore follows the list.
- Credential-based data authentication, like a service principal or shared access signature (SAS) token. These credentials can be accessed by users who have Reader access to the workspace.
- Identity-based data authentication to connect to storage services with your Azure Active Directory ID or other managed identity.
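For example, a credential-based datastore that points to a Blob container can be registered with the SDK v2. The snippet below is a sketch only: the account, container, and key values are placeholders, and it assumes the `AzureBlobDatastore` and `AccountKeyConfiguration` classes exposed by `azure.ai.ml.entities`.

```python
from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore

# placeholder values - replace with your own storage account details
blob_datastore = AzureBlobDatastore(
    name="my_blob_datastore",
    description="Credential-based datastore pointing to a blob container.",
    account_name="<account_name>",
    container_name="<container_name>",
    credentials=AccountKeyConfiguration(account_key="<account_key>"),
)

# register the datastore in the workspace (uses the ml_client created earlier)
ml_client.create_or_update(blob_datastore)
```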
The following code shows how to read `uri_folder` type data from Azure Data Lake Storage Gen2 or Azure Blob Storage via SDK v2.
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_job_inputs = {
"input_data": Input(
path='abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>', # Blob: 'https://<account_name>.blob.core.windows.net/<container_name>/path'
type=AssetTypes.URI_FOLDER
)
}
job = command(
code="./src", # local path where the code is stored
command='python train.py --input_folder ${{inputs.input_data}}',
inputs=my_job_inputs,
environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
compute="cpu-cluster"
)
#submit the command job
returned_job = ml_client.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint
The following code shows how to read `uri_file` type data from an Azure Machine Learning datastore via CLI v2.
az ml job create -f <file-name>.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo "--iris-csv: ${{inputs.iris_csv}}"
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
You can read data from, and write data to, cloud-based storage from within a job.

The `Input` defaults the mode (how the input is exposed during job runtime) to `InputOutputModes.RO_MOUNT` (read-only mount). Put another way, Azure Machine Learning mounts the file or folder to the compute and sets it to read-only. By design, you can't write to inputs, only to outputs; data written to an output is automatically uploaded to cloud storage. You can also set the mode explicitly on both inputs and outputs, as shown in the example that follows the table.
Matrix of possible types and modes for job inputs and outputs:

| Type | Input/Output | `upload` | `download` | `ro_mount` | `rw_mount` | `direct` | `eval_download` | `eval_mount` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `uri_folder` | Input | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| `uri_file` | Input | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| `mltable` | Input | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| `uri_folder` | Output | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| `uri_file` | Output | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| `mltable` | Output | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
As you can see from the table, `eval_download` and `eval_mount` are unique to `mltable`. An MLTable artifact can yield files that aren't necessarily located in the `mltable` storage, or it can subset or shuffle the data that resides in the storage. That view is only visible if the MLTable file is actually evaluated by the engine; the `eval_download` and `eval_mount` modes provide that evaluated view of the files.
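The mode can be set explicitly when you declare an input or output. The following sketch assumes the mode names in the table are available as constants on `azure.ai.ml.constants.InputOutputModes`; passing the equivalent literal strings (for example `'download'` or `'upload'`) is an alternative.

```python
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# download the input onto the compute instead of the default read-only mount
my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.URI_FOLDER,
        path="abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>",
        mode=InputOutputModes.DOWNLOAD,
    )
}

# upload the output to storage when the job completes
my_job_outputs = {
    "output_folder": Output(
        type=AssetTypes.URI_FOLDER,
        path="abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>",
        mode=InputOutputModes.UPLOAD,
    )
}
```

These dictionaries are passed to `command(...)` through its `inputs` and `outputs` parameters, exactly as in the example that follows.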
The following example shows how to read data from cloud storage and write job output back to cloud storage via SDK v2:

from azure.ai.ml import Input, Output, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_job_inputs = {
"input_data": Input(
path='abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>',
type=AssetTypes.URI_FOLDER
)
}
my_job_outputs = {
"output_folder": JobOutput(
path='abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>',
type=AssetTypes.URI_FOLDER
)
}
job = command(
code="./src", #local path where the code is stored
command='python pre-process.py --input_folder ${{inputs.input_data}} --output_folder ${{outputs.output_folder}}',
inputs=my_job_inputs,
outputs=my_job_outputs,
environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
compute="cpu-cluster"
)
#submit the command job
returned_job = ml_client.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint
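The `pre-process.py` script referenced above isn't shown in this article. As an illustrative sketch (the file names are hypothetical), the output path passed on the command line behaves like a local folder the script can write to, and anything written there is persisted to the cloud location declared for the output:

```python
# pre-process.py (illustrative sketch, not part of the Azure Machine Learning SDK)
import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_folder", type=str)
parser.add_argument("--output_folder", type=str)
args = parser.parse_args()

# read a hypothetical raw file from the (read-only) input folder
df = pd.read_csv(os.path.join(args.input_folder, "raw.csv"))

# minimal pre-processing step for the sake of the example
df = df.dropna()

# write the result into the output folder; Azure Machine Learning uploads it
# to the cloud path declared for the 'output_folder' output
os.makedirs(args.output_folder, exist_ok=True)
df.to_csv(os.path.join(args.output_folder, "prepped.csv"), index=False)
```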
The following job specification shows how to write output data via CLI v2; submit it with `az ml job create -f <file-name>.yml`:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src/prep
command: >-
python prep.py
--raw_data ${{inputs.raw_data}}
--prep_data ${{outputs.prep_data}}
inputs:
raw_data:
type: uri_folder
path: ./data
outputs:
prep_data:
mode: upload
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
You can register data as an asset in your workspace. The benefits of registering data are:
- Easy to share with other members of the team (no need to remember file locations)
- Versioning of the metadata (location, description, etc.)
- Lineage tracking
The following example demonstrates versioning of sample data, and shows how to register a local file as a data asset. The data is uploaded to cloud storage and registered as an asset.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_data = Data(
path="./sample_data/titanic.csv",
type=AssetTypes.URI_FILE,
description="Titanic Data",
name="titanic",
version='1'
)
ml_client.data.create_or_update(my_data)
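After registration, the version metadata is easy to retrieve. The following sketch assumes the `ml_client.data.get` and `ml_client.data.list` operations; it fetches a specific version and then iterates over every registered version of the `titanic` asset registered above:

```python
# fetch one specific version of the registered data asset
titanic_v1 = ml_client.data.get(name="titanic", version="1")
print(titanic_v1.id)
print(titanic_v1.path)

# iterate over every registered version of the 'titanic' asset
for data_version in ml_client.data.list(name="titanic"):
    print(data_version.name, data_version.version)
```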
To register data that is in a cloud location, you can specify the path with any of the supported protocols for the storage type. The following example shows what the path looks like for data from Azure Data Lake Storage Gen2.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_path = 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>' # adls gen2
my_data = Data(
path=my_path,
type=AssetTypes.URI_FOLDER,
description="description here",
name="a_name",
version='1'
)
ml_client.data.create_or_update(my_data)
Once your data is registered as an asset in the workspace, you can consume that data asset in jobs.

The following example demonstrates how to consume version 1 of the registered data asset `titanic`.
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
registered_data_asset = ml_client.data.get(name='titanic', version='1')
my_job_inputs = {
"input_data": Input(
type=AssetTypes.URI_FILE,
path=registered_data_asset.id
)
}
job = command(
code="./src",
command='python read_data_asset.py --input_folder ${{inputs.input_data}}',
inputs=my_job_inputs,
environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
compute="cpu-cluster"
)
#submit the command job
returned_job = ml_client.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint
If you're working with Azure Machine Learning pipelines, you can read data into and move data between pipeline components with the Azure Machine Learning CLI v2 extension or the Python SDK v2 (preview).
The following YAML file demonstrates how to use the output data from one component as the input for another component of the pipeline using the Azure Machine Learning CLI v2 extension:
[!INCLUDE cli v2]
:::code language="yaml" source="~/azureml-examples-main/cli/jobs/pipelines-with-components/basics/3b_pipeline_with_data/pipeline.yml":::
The following example defines a pipeline containing three nodes and moves data between each node.

- `prepare_data_node` that loads the image and labels from the Fashion MNIST data set into `mnist_train.csv` and `mnist_test.csv`.
- `train_node` that trains a CNN model with Keras using the training data, `mnist_train.csv`.
- `score_node` that scores the model using the test data, `mnist_test.csv`.
[!notebook-python[] (~/azureml-examples-main/sdk/jobs/pipelines/2e_image_classification_keras_minist_convnet/image_classification_keras_minist_convnet.ipynb?name=build-pipeline)]
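If the referenced notebook isn't at hand, the general pattern for passing data between pipeline components with the SDK v2 looks roughly like the sketch below. The component YAML paths, the component input/output names, and the compute name are hypothetical placeholders, not taken from the notebook above.

```python
from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

# load two components from their YAML definitions (paths are hypothetical)
prepare_data = load_component("./prep/prepare_data.yml")
train_model = load_component("./train/train_model.yml")

@pipeline(default_compute="cpu-cluster")
def data_passing_pipeline(raw_data: Input):
    # first node prepares the raw data
    prep_node = prepare_data(input_data=raw_data)
    # the output of the first node feeds the input of the second node
    train_node = train_model(training_data=prep_node.outputs.output_data)
    return {"trained_model": train_node.outputs.model_output}

# build and submit the pipeline job (uses the ml_client created earlier)
pipeline_job = data_passing_pipeline(
    raw_data=Input(type="uri_folder", path="./example-data")
)
returned_job = ml_client.jobs.create_or_update(pipeline_job)
```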