title | titleSuffix | description | ms.service | ms.subservice | ms.topic | ms.author | author | ms.reviewer | ms.date | ms.custom |
---|---|---|---|---|---|---|---|---|---|---|
Identity-based data access to storage services on Azure | Machine Learning | Learn how to use identity-based data access to connect to storage services on Azure with Azure Machine Learning datastores and the Machine Learning Python SDK. | machine-learning | enterprise-readiness | how-to | yogipandey | ynpandey | nibaccam | 01/25/2022 | contperf-fy21q1, devx-track-python, data4ml, event-tier1-build-2022 |
In this article, you learn how to connect to storage services on Azure by using identity-based data access and Azure Machine Learning datastores via the Azure Machine Learning SDK for Python.
Typically, datastores use credential-based authentication to confirm you have permission to access the storage service. They keep connection information, like your subscription ID and token authorization, in the key vault that's associated with the workspace. When you create a datastore that uses identity-based data access, your Azure account (Azure Active Directory token) is used to confirm you have permission to access the storage service. In the identity-based data access scenario, no authentication credentials are saved. Only the storage account information is stored in the datastore.
To create datastores with identity-based data access via the Azure Machine Learning studio UI, see Connect to data with the Azure Machine Learning studio.
To create datastores that use credential-based authentication, like access keys or service principals, see Connect to storage services on Azure.
There are two scenarios in which you can apply identity-based data access in Azure Machine Learning. These scenarios are a good fit for identity-based access when you're working with confidential data and need more granular data access management:
- Accessing storage services
- Training machine learning models with private data
Warning
Identity-based data access is not supported for automated ML experiments.
You can connect to storage services via identity-based data access with Azure Machine Learning datastores or Azure Machine Learning datasets.
Your authentication credentials are usually kept in a datastore, which is used to ensure you have permission to access the storage service. When these credentials are registered via datastores, any user with the workspace Reader role can retrieve them. That scale of access can be a security concern for some organizations. Learn more about the workspace Reader role.
When you use identity-based data access, Azure Machine Learning prompts you for your Azure Active Directory token for data access authentication instead of keeping your credentials in the datastore. That approach allows for data access management at the storage level and keeps credentials confidential.
The same behavior applies when you:
- Create a dataset directly from storage URLs.
- Work with data interactively via a Jupyter Notebook on your local computer or compute instance.
Note
Credentials stored via credential-based authentication include subscription IDs, shared access signature (SAS) tokens, storage access keys, and service principal information, like client IDs and tenant IDs.
Certain machine learning scenarios involve training models with private data. In such cases, data scientists need to run training workflows without being exposed to the confidential input data. In this scenario, a managed identity of the training compute is used for data access authentication. This approach allows storage admins to grant Storage Blob Data Reader access to the managed identity that the training compute uses to run the training job. The individual data scientists don't need to be granted access. For more information, see Set up managed identity on a compute cluster.
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
- An Azure storage account with a supported storage type. These storage types are supported: Azure Blob Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
- An Azure Machine Learning workspace. Either create an Azure Machine Learning workspace or use an existing one via the Python SDK.
When you register a storage service on Azure as a datastore, you automatically create and register that datastore to a specific workspace. See Storage access permissions for guidance on required permission types. Alternatively, you can manually create the storage you want to connect to without any special permissions; you need only its name.
See Work with virtual networks for details on how to connect to data storage behind virtual networks.
In the following code, notice the absence of authentication parameters like sas_token, account_key, subscription_id, and the service principal client_id. This omission indicates that Azure Machine Learning will use identity-based data access for authentication. Datastores are typically created interactively in a notebook or via the studio, so your Azure Active Directory token is used for data access authentication.
Note
Datastore names should consist only of lowercase letters, numbers, and underscores.
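The naming rule in the note above can be checked locally before you register a datastore. The following is a minimal sketch: the helper name is_valid_datastore_name is hypothetical and not part of the Azure Machine Learning SDK, and it encodes only the character rule stated here. The service might enforce further constraints, such as a length limit, that this sketch doesn't cover.

```python
import re

# Hypothetical helper, not part of the Azure Machine Learning SDK.
# Encodes only the rule stated above: lowercase letters, digits, underscores.
_DATASTORE_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_datastore_name(name: str) -> bool:
    """Return True if name uses only lowercase letters, digits, and underscores."""
    return _DATASTORE_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_datastore_name("credentialless_blob"))   # allowed characters only
print(is_valid_datastore_name("Credentialless-Blob"))   # uppercase and hyphen
```

Running the check before registration gives a quick local signal instead of waiting for the service to reject the name.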
To register an Azure blob container as a datastore, use register_azure_blob_container().
The following code creates the credentialless_blob datastore, registers it to the ws workspace, and assigns it to the blob_datastore variable. This datastore accesses the my_container_name blob container on the my_account_name storage account.
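from azureml.core import Workspace, Datastore

# Load the workspace from a local config.json file.
ws = Workspace.from_config()

# Create blob datastore without credentials.
blob_datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                         datastore_name='credentialless_blob',
                                                         container_name='my_container_name',
                                                         account_name='my_account_name')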
Use register_azure_data_lake() to register a datastore that connects to Azure Data Lake Storage Gen1.
The following code creates the credentialless_adls1 datastore, registers it to the ws workspace, and assigns it to the adls_dstore variable. This datastore accesses the adls_storage Azure Data Lake Storage account.
# Create Azure Data Lake Storage Gen1 datastore without credentials.
adls_dstore = Datastore.register_azure_data_lake(workspace=ws,
                                                 datastore_name='credentialless_adls1',
                                                 store_name='adls_storage')
Use register_azure_data_lake_gen2() to register a datastore that connects to Azure Data Lake Storage Gen2.
The following code creates the credentialless_adls2 datastore, registers it to the ws workspace, and assigns it to the adls2_dstore variable. This datastore accesses the file system tabular in the myadls2 storage account.
# Create Azure Data Lake Storage Gen2 datastore without credentials.
adls2_dstore = Datastore.register_azure_data_lake_gen2(workspace=ws,
                                                       datastore_name='credentialless_adls2',
                                                       filesystem='tabular',
                                                       account_name='myadls2')
To help ensure that you securely connect to your storage service on Azure, Azure Machine Learning requires that you have permission to access the corresponding data storage.
Warning
Cross-tenant access to storage accounts is not supported. If your scenario needs cross-tenant access, contact the AzureML Data Support team at amldatasupport@microsoft.com for assistance with a custom code solution.
Identity-based data access supports connections to only the following storage services.
- Azure Blob Storage
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
To access these storage services, you must have at least Storage Blob Data Reader access to the storage account. Only storage account owners can change your access level via the Azure portal.
If you prefer not to use your user identity (Azure Active Directory), you can instead grant a workspace managed-system identity (MSI) permission to create the datastore. To do so, you must have Owner permissions to the storage account and add the grant_workspace_access=True parameter to your datastore register method.
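As a sketch of where grant_workspace_access fits, the following builds the keyword arguments for register_azure_blob_container() without any credential parameters and optionally adds the workspace MSI flag. The helper names build_blob_registration_args and register_blob_datastore, and the placeholder values, are assumptions for illustration. The actual registration call requires the azureml-core package, a workspace object, and Owner permissions on the storage account.

```python
from typing import Any, Dict


def build_blob_registration_args(datastore_name: str,
                                 container_name: str,
                                 account_name: str,
                                 use_workspace_msi: bool = False) -> Dict[str, Any]:
    """Build credential-free keyword arguments for blob datastore registration."""
    args: Dict[str, Any] = {
        "datastore_name": datastore_name,
        "container_name": container_name,
        "account_name": account_name,
    }
    if use_workspace_msi:
        # Ask for the workspace managed identity to be granted access.
        args["grant_workspace_access"] = True
    return args


def register_blob_datastore(ws, **registration_args):
    """Register the datastore. Requires azureml-core and a live workspace."""
    # Imported lazily so the sketch can be loaded without the SDK installed.
    from azureml.core import Datastore
    return Datastore.register_azure_blob_container(workspace=ws, **registration_args)


args = build_blob_registration_args("credentialless_blob",
                                    "my_container_name",
                                    "my_account_name",
                                    use_workspace_msi=True)
print(args)
```

With a workspace object ws available, register_blob_datastore(ws, **args) would perform the registration. Keeping argument construction separate makes it easy to confirm that no credential parameters are passed.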
If you're training a model on a remote compute target and want to access the data for training, the compute identity must be granted at least the Storage Blob Data Reader role from the storage service. Learn how to set up managed identity on a compute cluster.
By default, Azure Machine Learning can't communicate with a storage account that's behind a firewall or in a virtual network.
You can configure storage accounts to allow access only from within specific virtual networks. This configuration requires extra steps to ensure data isn't leaked outside of the network. This behavior is the same for credential-based data access. For more information, see How to configure virtual network scenarios.
If your storage account has virtual network settings, those settings dictate which identity type and permissions are needed. For example, for data preview and data profile, the virtual network settings determine what type of identity is used to authenticate data access.
- In scenarios where only certain IPs and subnets are allowed to access the storage, Azure Machine Learning uses the workspace MSI to accomplish data previews and profiles.
- If your storage is ADLS Gen 2 or Blob and has virtual network settings, you can use either user identity or workspace MSI, depending on the datastore settings defined during creation.
- If the virtual network setting is “Allow Azure services on the trusted services list to access this storage account”, then the workspace MSI is used.
We recommend that you use Azure Machine Learning datasets when you interact with your data in storage with Azure Machine Learning.
Important
Datasets using identity-based data access are not supported for automated ML experiments.
Datasets package your data into a lazily evaluated consumable object for machine learning tasks like training. Also, with datasets you can download or mount files of any format from Azure storage services like Azure Blob Storage and Azure Data Lake Storage to a compute target.
To create a dataset, you can reference paths from datastores that also use identity-based data access.
- If your underlying storage account type is Blob or ADLS Gen 2, your user identity needs the Storage Blob Data Reader role.
- If your underlying storage is ADLS Gen 1, permissions can be set via the storage's Access Control List (ACL).
In the following example, blob_datastore already exists and uses identity-based data access.
from azureml.core import Dataset
blob_dataset = Dataset.Tabular.from_delimited_files(path=(blob_datastore, 'test.csv'))
Another option is to skip datastore creation and create datasets directly from storage URLs. This functionality currently supports only Azure blobs and Azure Data Lake Storage Gen1 and Gen2. For creation based on storage URL, only the user identity is needed to authenticate.
blob_dset = Dataset.File.from_files('https://myblob.blob.core.windows.net/may/keras-mnist-fashion/')
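When you create datasets directly from storage URLs, it can help to assemble the URL from its parts. The following is a minimal sketch that reproduces the blob URL shape used in the example above. The helper name build_blob_url is hypothetical, and the blob.core.windows.net endpoint suffix is an assumption that matches the example; other Azure clouds use different suffixes.

```python
def build_blob_url(account_name: str, container_name: str, relative_path: str = "") -> str:
    """Build an Azure Blob Storage URL for a container path (hypothetical helper)."""
    base = f"https://{account_name}.blob.core.windows.net/{container_name}"
    # Drop any leading slash so the joined URL has no empty path segment.
    return f"{base}/{relative_path.lstrip('/')}"


url = build_blob_url("myblob", "may", "keras-mnist-fashion/")
print(url)
```

The resulting string can then be passed to Dataset.File.from_files(), as in the example above.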
When you submit a training job that consumes a dataset created with identity-based data access, the managed identity of the training compute is used for data access authentication. Your Azure Active Directory token isn't used. For this scenario, ensure that the managed identity of the compute is granted at least the Storage Blob Data Reader role from the storage service. For more information, see Set up managed identity on compute clusters.
When training on Azure Machine Learning compute clusters, you can authenticate to storage with your Azure Active Directory token.
This authentication mode allows you to:
- Set up fine-grained permissions, where different workspace users can have access to different storage accounts or folders within storage accounts.
- Audit storage access because the storage logs show which identities were used to access data.
Warning
This functionality has the following limitations:
- The feature is supported only for experiments submitted via the Azure Machine Learning CLI.
- Only CommandJobs, and PipelineJobs with CommandSteps and AutoMLSteps, are supported.
- User identity and compute managed identity can't be used for authentication within the same job.
The following steps outline how to set up identity-based data access for training jobs on compute clusters.
- Grant the user identity access to storage resources. For example, grant Storage Blob Data Reader access to the specific storage account you want to use, or grant ACL-based permission to specific folders or files in Azure Data Lake Storage Gen2.
- Create an Azure Machine Learning datastore without cached credentials for the storage account. If a datastore has cached credentials, such as a storage account key, those credentials are used instead of the user identity.
- Submit a training job with the identity property set to type: user_identity, as shown in the following job specification. During the training job, authentication to storage happens via the identity of the user who submits the job.
Note
If the identity property is left unspecified and the datastore doesn't have cached credentials, the compute managed identity becomes the fallback option.
command: |
  echo "--census-csv: ${{inputs.census_csv}}"
  python hello-census.py --census-csv ${{inputs.census_csv}}
code: src
inputs:
  census_csv:
    type: uri_file
    path: azureml://datastores/mydata/paths/census.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
identity:
  type: user_identity
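The authentication precedence described in the steps and note above can be summarized in a few lines of code. This is only an illustrative sketch of the rules as stated in this article, not how the service is implemented. The function name resolve_storage_auth and the returned labels are hypothetical, and identity types other than user_identity are outside what this article describes, so the sketch rejects them.

```python
from typing import Optional


def resolve_storage_auth(datastore_has_cached_credentials: bool,
                         identity_type: Optional[str] = None) -> str:
    """Return which identity authenticates to storage, per the rules above."""
    if datastore_has_cached_credentials:
        # Cached credentials (for example, an account key) take precedence.
        return "datastore_credentials"
    if identity_type == "user_identity":
        # The job spec sets identity.type to user_identity.
        return "user_identity"
    if identity_type is None:
        # No identity property and no cached credentials: fallback option.
        return "compute_managed_identity"
    raise ValueError(f"Identity type not covered by this sketch: {identity_type!r}")


print(resolve_storage_auth(False, "user_identity"))
print(resolve_storage_auth(False))
print(resolve_storage_auth(True, "user_identity"))
```

The third call shows why step two matters: if the datastore still holds cached credentials, setting user_identity in the job specification has no effect on how storage is accessed.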