---
title: Create compute clusters
titleSuffix: Azure Machine Learning
description: Learn how to create compute clusters in your Azure Machine Learning workspace. Use the compute cluster as a compute target for training or inference.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
ms.custom: devx-track-azurecli, cliv2, sdkv1, event-tier1-build-2022
ms.author: sgilley
author: sdgilley
ms.reviewer: sgilley
ms.date: 05/02/2022
---
[!div class="op_single_selector" title1="Select the Azure Machine Learning CLI version you are using:"]
Learn how to create and manage a compute cluster in your Azure Machine Learning workspace.
You can use an Azure Machine Learning compute cluster to distribute a training or batch inference process across a cluster of CPU or GPU compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.
In this article, learn how to:
- Create a compute cluster
- Lower your compute cluster cost
- Set up a managed identity for the cluster
- An Azure Machine Learning workspace. For more information, see Create an Azure Machine Learning workspace.

- The Azure CLI extension for Machine Learning service (v2), the Azure Machine Learning Python SDK, or the Azure Machine Learning Visual Studio Code extension.

- If using the Python SDK, set up your development environment with a workspace. Once your environment is set up, attach to the workspace in your Python script:
[!INCLUDE sdk v1]
```python
from azureml.core import Workspace

ws = Workspace.from_config()
```
Azure Machine Learning compute cluster is a managed-compute infrastructure that allows you to easily create a single or multi-node compute. The compute cluster is a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and can be put in an Azure Virtual Network. Compute cluster also supports deployment with no public IP (preview) in a virtual network.

Compute clusters can run jobs securely in a virtual network environment, without requiring enterprises to open up SSH ports. The job executes in a containerized environment and packages your model dependencies in a Docker container.
- Some of the scenarios listed in this document are marked as preview. Preview functionality is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

- Compute clusters can be created in a different region than your workspace. This functionality is in preview, and is only available for compute clusters, not compute instances. This preview isn't available if you're using a private endpoint-enabled workspace.

  > [!WARNING]
  > When using a compute cluster in a different region than your workspace or datastores, you may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.

- We currently support only creation (not updating) of clusters through ARM templates. To update a compute cluster, we recommend using the SDK, the Azure CLI, or the studio for now.

- Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.

- Azure allows you to place locks on resources, so that they can't be deleted or are read-only. Don't apply resource locks to the resource group that contains your workspace. Doing so prevents scaling operations for Azure Machine Learning compute clusters. For more information on locking resources, see Lock resources to prevent unexpected changes.
> [!TIP]
> Clusters can generally scale up to 100 nodes, as long as you have enough quota for the number of cores required. By default, clusters are set up with inter-node communication enabled between the nodes of the cluster, to support MPI jobs, for example. However, you can scale your clusters to thousands of nodes by raising a support ticket and requesting to allow-list your subscription, workspace, or a specific cluster for disabling inter-node communication.
Time estimate: Approximately 5 minutes.
Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs, automatically scaling nodes up or down based on the number of runs submitted, and the max_nodes set on your cluster. The min_nodes setting controls the minimum nodes available.
The dedicated cores per region per VM family quota and the total regional quota, which apply to compute cluster creation, are unified and shared with the Azure Machine Learning training compute instance quota.
[!INCLUDE min-nodes-note]
The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.
To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties.
- vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
- max_nodes: The max number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.
[!INCLUDE sdk v1]
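For example, a minimal SDK v1 sketch along these lines creates (or reuses) a cluster; the cluster name and VM size shown here are illustrative, and `ws` is the workspace object attached earlier:

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster (illustrative)
cpu_cluster_name = "cpucluster"

# Verify that the cluster doesn't already exist before creating it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Specify vm_size and max_nodes; smart defaults cover the rest
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)
```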
You can also configure several advanced properties when you create Azure Machine Learning Compute. The properties allow you to create a persistent cluster of fixed size, or within an existing Azure Virtual Network in your subscription. See the AmlCompute class for details.
> [!WARNING]
> When setting the location parameter, if it's a different region than your workspace or datastores, you may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.
[!INCLUDE cli v2]
```azurecli
az ml compute create -f create-cluster.yml
```
Where the file create-cluster.yml is:
:::code language="yaml" source="~/azureml-examples-main/cli/resources/compute/cluster-location.yml":::
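If you don't have the azureml-examples repository cloned, a create-cluster.yml for this scenario likely looks similar to the following sketch; the field names are assumed from the CLI v2 amlcompute schema, and the name, size, and location values are placeholders:

```yaml
# Sketch of a cluster definition with an explicit location (values are illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: location-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
location: westus
```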
> [!WARNING]
> When using a compute cluster in a different region than your workspace or datastores, you may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.
For information on creating a compute cluster in the studio, see Create compute targets in Azure Machine Learning studio.
You may also choose to use low-priority VMs to run some or all of your workloads. These VMs do not have guaranteed availability and may be preempted while in use. You will have to restart a preempted job.
Use any of these ways to specify a low-priority VM:
[!INCLUDE sdk v1]
```python
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                       vm_priority='lowpriority',
                                                       max_nodes=4)
```
[!INCLUDE cli v2]
Set the vm-priority:
```azurecli
az ml compute create -f create-cluster.yml
```
Where the file create-cluster.yml is:
:::code language="yaml" source="~/azureml-examples-main/cli/resources/compute/cluster-low-priority.yml":::
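If you don't have the examples repository available, a low-priority variant of create-cluster.yml probably follows this shape; the tier field is how the CLI v2 amlcompute schema appears to express VM priority, and the name and size values are placeholders:

```yaml
# Sketch of a low-priority cluster definition (values are illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: low-pri-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
tier: low_priority
```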
In the studio, choose Low Priority when you create a VM.
[!INCLUDE aml-clone-in-azure-notebook]
[!INCLUDE sdk v1]
- Configure managed identity in your provisioning configuration:

  - System-assigned managed identity created in a workspace named `ws`:

    ```python
    # configure cluster with a system-assigned managed identity
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=5,
                                                           identity_type="SystemAssigned",
                                                           )
    cpu_cluster_name = "cpu-cluster"
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    ```

  - User-assigned managed identity created in a workspace named `ws`:

    ```python
    # configure cluster with a user-assigned managed identity
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=5,
                                                           identity_type="UserAssigned",
                                                           identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
    cpu_cluster_name = "cpu-cluster"
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    ```

- Add managed identity to an existing compute cluster named `cpu_cluster`:

  - System-assigned managed identity:

    ```python
    # add a system-assigned managed identity
    cpu_cluster.add_identity(identity_type="SystemAssigned")
    ```

  - User-assigned managed identity:

    ```python
    # add a user-assigned managed identity
    cpu_cluster.add_identity(identity_type="UserAssigned",
                             identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
    ```
[!INCLUDE cli v2]
Use this command:
```azurecli
az ml compute create -f create-cluster.yml
```
Where the contents of create-cluster.yml are as follows:
- User-assigned managed identity:

  :::code language="yaml" source="~/azureml-examples-main/cli/resources/compute/cluster-user-identity.yml":::

- System-assigned managed identity:

  :::code language="yaml" source="~/azureml-examples-main/cli/resources/compute/cluster-system-identity.yml":::
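If the examples repository isn't at hand, the identity portion of these YAML files likely follows this shape; the field names are assumed from the CLI v2 amlcompute schema, and the name, size, and resource IDs are placeholders:

```yaml
# Sketch of a cluster definition with a managed identity (values are illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: identity-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
# For a system-assigned identity, use `type: system_assigned`
# and omit user_assigned_identities.
identity:
  type: user_assigned
  user_assigned_identities:
    - resource_id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity_name>
```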
To update an existing cluster:
- User-assigned managed identity:

  :::code language="azurecli" source="~/azureml-examples-main/cli/deploy-mlcompute-update-to-user-identity.sh":::

- System-assigned managed identity:

  :::code language="azurecli" source="~/azureml-examples-main/cli/deploy-mlcompute-update-to-system-identity.sh":::
See Set up managed identity in studio.
[!INCLUDE aml-clone-in-azure-notebook]
Some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service, or create a new workspace through the portal or the SDK, to unblock yourself immediately.
If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for the node state, this may be caused by Azure resource locks.
[!INCLUDE resource locks]
Use your compute cluster to: