---
title: Failover & disaster recovery
titleSuffix: Azure Machine Learning
description: Learn how to plan for disaster recovery and maintain business continuity for Azure Machine Learning.
services: machine-learning
ms.service: machine-learning
ms.subservice: enterprise-readiness
ms.custom: event-tier1-build-2022
ms.topic: how-to
ms.author: jhirono
author: jhirono
ms.reviewer: larryfr
ms.date: 10/21/2021
---
To maximize your uptime, plan ahead to maintain business continuity and prepare for disaster recovery with Azure Machine Learning.
Microsoft strives to ensure that Azure services are always available. However, unplanned service outages may occur. We recommend having a disaster recovery plan in place for handling regional service outages. In this article, you'll learn how to:
- Plan for a multi-regional deployment of Azure Machine Learning and associated resources.
- Design for high availability of your solution.
- Initiate a failover to another region.
Note
Azure Machine Learning itself does not provide automatic failover or disaster recovery.
If you've accidentally deleted your workspace or its corresponding components, this article also describes the currently supported recovery options.
Azure Machine Learning depends on multiple Azure services and has several layers. Some of these services are provisioned in your (customer) subscription. You're responsible for the high-availability configuration of these services. Other services are created in a Microsoft subscription and managed by Microsoft.
Azure services include:
- Azure Machine Learning infrastructure: A Microsoft-managed environment for the Azure Machine Learning workspace.
- Associated resources: Resources provisioned in your subscription during Azure Machine Learning workspace creation. These resources include Azure Storage, Azure Key Vault, Azure Container Registry, and Application Insights. You're responsible for configuring high-availability settings for these resources.
  - Default storage holds data such as models, training logs, and datasets.
  - Key Vault holds credentials for Azure Storage, Container Registry, and data stores.
  - Container Registry holds the Docker images used for training and inference environments.
  - Application Insights is used to monitor Azure Machine Learning.
- Compute resources: Resources you create after workspace deployment. For example, you might create a compute instance or compute cluster to train a Machine Learning model.
- Compute instance and compute cluster: Microsoft-managed model development environments.
- Other resources: Microsoft computing resources that you can attach to Azure Machine Learning, such as Azure Kubernetes Service (AKS), Azure Databricks, Azure Container Instances, and Azure HDInsight. You're responsible for configuring high-availability settings for these resources.
- Other data stores: Azure Machine Learning can mount other data stores such as Azure Storage, Azure Data Lake Storage, and Azure SQL Database for training data. These data stores are provisioned within your subscription. You're responsible for configuring their high-availability settings.
The following table shows which Azure services are managed by Microsoft and which are managed by you. It also indicates the services that are highly available by default.
Service | Managed by | High availability by default |
---|---|---|
Azure Machine Learning infrastructure | Microsoft | |
Associated resources | | |
Azure Storage | You | |
Key Vault | You | ✓ |
Container Registry | You | |
Application Insights | You | NA |
Compute resources | | |
Compute instance | Microsoft | |
Compute cluster | Microsoft | |
Other compute resources such as AKS, Azure Databricks, Container Instances, HDInsight | You | |
Other data stores such as Azure Storage, SQL Database, Azure Database for PostgreSQL, Azure Database for MySQL, Azure Databricks File System | You | |
The rest of this article describes the actions you need to take to make each of these services highly available.
A multi-regional deployment relies on creation of Azure Machine Learning and other resources (infrastructure) in two Azure regions. If a regional outage occurs, you can switch to the other region. When planning on where to deploy your resources, consider:
- Regional availability: Use regions that are close to your users. To check regional availability for Azure Machine Learning, see Azure products by region.
- Azure paired regions: Paired regions coordinate platform updates and prioritize recovery efforts where needed. For more information, see Azure paired regions.
- Service availability: Decide whether the resources used by your solution should be hot/hot, hot/warm, or hot/cold.
  - Hot/hot: Both regions are active at the same time, with one region ready to begin use immediately.
  - Hot/warm: Primary region active, secondary region has critical resources (for example, deployed models) ready to start. Non-critical resources would need to be manually deployed in the secondary region.
  - Hot/cold: Primary region active, secondary region has Azure Machine Learning and other resources deployed, along with needed data. Resources such as models, model deployments, or pipelines would need to be manually deployed.
Tip
Depending on your business requirements, you may decide to treat different Azure Machine Learning resources differently. For example, you may want to use hot/hot for deployed models (inference), and hot/cold for experiments (training).
Azure Machine Learning builds on top of other services. Some services can be configured to replicate to other regions. Others you must manually create in multiple regions. The following table provides a list of services, who is responsible for replication, and an overview of the configuration:
Azure service | Geo-replicated by | Configuration |
---|---|---|
Machine Learning workspace | You | Create a workspace in the selected regions. |
Machine Learning compute | You | Create the compute resources in the selected regions. For compute resources that can dynamically scale, make sure that both regions provide sufficient compute quota for your needs. |
Key Vault | Microsoft | Use the same Key Vault instance with the Azure Machine Learning workspace and resources in both regions. Key Vault automatically fails over to a secondary region. For more information, see Azure Key Vault availability and redundancy. |
Container Registry | Microsoft | Configure the Container Registry instance to geo-replicate registries to the paired region for Azure Machine Learning. Use the same instance for both workspace instances. For more information, see Geo-replication in Azure Container Registry. |
Storage Account | You | Azure Machine Learning does not support default storage-account failover using geo-redundant storage (GRS), geo-zone-redundant storage (GZRS), read-access geo-redundant storage (RA-GRS), or read-access geo-zone-redundant storage (RA-GZRS). Create a separate storage account for the default storage of each workspace. Create separate storage accounts or services for other data storage. For more information, see Azure Storage redundancy. |
Application Insights | You | Create Application Insights for the workspace in both regions. To adjust the data-retention period and details, see Data collection, retention, and storage in Application Insights. |
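Container Registry geo-replication, for example, can be set up programmatically. The following is a minimal sketch using the azure-mgmt-containerregistry management SDK, assuming the registry uses the Premium SKU (required for geo-replication); the subscription ID, resource group, registry name, and region are placeholders.

```python
# Minimal sketch: add a geo-replication for the Container Registry used by the workspace,
# so images are also served from the secondary (paired) region. Geo-replication requires
# the Premium SKU. Subscription ID, resource group, registry name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerregistry import ContainerRegistryManagementClient

credential = DefaultAzureCredential()
acr_client = ContainerRegistryManagementClient(credential, "<subscription-id>")

poller = acr_client.replications.begin_create(
    resource_group_name="<resource-group>",
    registry_name="<registry-name>",
    replication_name="westus2",           # replication name is conventionally the region name
    replication={"location": "westus2"},  # the secondary region for your workspace
)
poller.result()  # wait until the replica is provisioned
```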
To enable fast recovery and restart in the secondary region, we recommend the following development practices:
- Use Azure Resource Manager templates. Templates are 'infrastructure as code' and let you quickly deploy services in both regions; a deployment sketch follows this list.
- To avoid drift between the two regions, update your continuous integration and deployment pipelines to deploy to both regions.
- When automating deployments, include the configuration of workspace-attached compute resources such as Azure Kubernetes Service.
- Create role assignments for users in both regions.
- Create network resources such as Azure Virtual Networks and private endpoints for both regions. Make sure that users have access to both network environments, for example through VPN and DNS configurations for both virtual networks.
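The following is a minimal sketch of such a two-region, template-driven deployment using the azure-mgmt-resource SDK. The template file name, resource group names, regions, and the location parameter are illustrative assumptions.

```python
# Minimal sketch: deploy the same infrastructure-as-code template to a primary and a
# secondary region so the two environments don't drift. The template file name, resource
# group names, regions, and the 'location' template parameter are illustrative assumptions.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

with open("azuredeploy.json") as f:  # hypothetical ARM template for the workspace and resources
    template = json.load(f)

for region, resource_group in [("eastus", "rg-aml-primary"), ("westus2", "rg-aml-secondary")]:
    client.resource_groups.create_or_update(resource_group, {"location": region})
    deployment = client.deployments.begin_create_or_update(
        resource_group,
        f"aml-{region}",
        {
            "properties": {
                "mode": "Incremental",
                "template": template,
                "parameters": {"location": {"value": region}},
            }
        },
    )
    deployment.result()  # block until the regional deployment completes
```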
Depending on your needs, you may have more compute or data services that are used by Azure Machine Learning. For example, you may use Azure Kubernetes Services or Azure SQL Database. Use the following information to learn how to configure these services for high availability.
Compute resources
- Azure Kubernetes Service: See Best practices for business continuity and disaster recovery in Azure Kubernetes Service (AKS) and Create an Azure Kubernetes Service (AKS) cluster that uses availability zones. If the AKS cluster was created by using the Azure Machine Learning Studio, SDK, or CLI, cross-region high availability is not supported.
- Azure Databricks: See Regional disaster recovery for Azure Databricks clusters.
- Container Instances: An orchestrator is responsible for failover. See Azure Container Instances and container orchestrators.
- HDInsight: See High availability services supported by Azure HDInsight.
Data services
- Azure Blob container / Azure Files / Data Lake Storage Gen2: See Azure Storage redundancy.
- Data Lake Storage Gen1: See High availability and disaster recovery guidance for Data Lake Storage Gen1.
- SQL Database: See High availability for Azure SQL Database and SQL Managed Instance.
- Azure Database for PostgreSQL: See High availability concepts in Azure Database for PostgreSQL - Single Server.
- Azure Database for MySQL: See Understand business continuity in Azure Database for MySQL.
- Azure Databricks File System: See Regional disaster recovery for Azure Databricks clusters.
Tip
If you provide your own customer-managed key to deploy an Azure Machine Learning workspace, Azure Cosmos DB is also provisioned within your subscription. In that case, you're responsible for configuring its high-availability settings. See High availability with Azure Cosmos DB.
Determine the level of business continuity that you are aiming for. The level may differ between the components of your solution. For example, you may want to have a hot/hot configuration for production pipelines or model deployments, and hot/cold for experimentation.
By keeping your data storage isolated from the default storage the workspace uses for logs, you can:
- Attach the same storage instances as datastores to the primary and secondary workspaces.
- Make use of geo-replication for data storage accounts and maximize your uptime.
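For example, you can attach the same (geo-replicated) blob container as a datastore in both workspaces. The following is a minimal sketch with the Azure Machine Learning Python SDK (v1); the config file paths, storage account, container, and datastore names are placeholders.

```python
# Minimal sketch: attach the same geo-replicated blob container as a datastore in both
# the primary and secondary workspaces. Config file paths, account, container, and
# datastore names are placeholders; authentication details depend on your setup.
from azureml.core import Datastore, Workspace

for config_path in [".azureml/config-primary.json", ".azureml/config-secondary.json"]:
    ws = Workspace.from_config(path=config_path)
    Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name="training_data",
        container_name="training-data",
        account_name="<storage-account-name>",
        account_key="<storage-account-key>",  # or use a SAS token / identity-based access
    )
```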
Runs in Azure Machine Learning are defined by a run specification. This specification includes dependencies on input artifacts that are managed on a workspace-instance level, including environments, datasets, and compute. For multi-region run submission and deployments, we recommend the following practices:
- Manage your code base locally, backed by a Git repository.
- Export important notebooks from Azure Machine Learning studio.
- Export pipelines authored in studio as code.
  > [!NOTE]
  > Pipelines created in the studio designer cannot currently be exported as code.
- Manage configurations as code; a minimal run-submission sketch follows this list.
  - Avoid hardcoded references to the workspace. Instead, configure a reference to the workspace instance using a config file and use Workspace.from_config() to initialize the workspace. To automate the process, use the Azure CLI extension for machine learning command az ml folder attach.
  - Use run submission helpers such as ScriptRunConfig and Pipeline.
  - Use Environment.save_to_directory() to save your environment definitions.
  - Use a Dockerfile if you use custom Docker images.
  - Use the Dataset class to define the collection of data paths used by your solution.
  - Use the InferenceConfig class to deploy models as inference endpoints.
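The following is a minimal run-submission sketch that combines these practices with the Python SDK (v1). The config file path, environment file, compute target, and experiment name are illustrative.

```python
# Minimal sketch: submit a training run without hardcoding the workspace.
# The config file path, environment file, compute target, and experiment name are illustrative.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

# The config file holds subscription_id, resource_group, and workspace_name,
# so switching workspaces only means switching config files.
ws = Workspace.from_config(path=".azureml/config.json")

# Environment kept as code (conda specification checked into the repository).
env = Environment.from_conda_specification(name="train-env", file_path="environment.yml")

src = ScriptRunConfig(
    source_directory="./src",      # training code managed in Git
    script="train.py",
    compute_target="cpu-cluster",  # must exist in the target workspace
    environment=env,
)

run = Experiment(workspace=ws, name="my-experiment").submit(src)
run.wait_for_completion(show_output=True)
```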
When your primary workspace becomes unavailable, you can switch over to the secondary workspace to continue experimentation and development. Azure Machine Learning does not automatically submit runs to the secondary workspace if there is an outage. Update your code configuration to point to the new workspace resource. We recommend avoiding hardcoded workspace references; instead, use a workspace config file to minimize the manual steps needed when changing workspaces. Make sure to also update any automation, such as continuous integration and deployment pipelines, to point to the new workspace.
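For example, keeping one config file per region reduces the failover to a single switch. The file names and the environment variable in this sketch are hypothetical.

```python
# Minimal sketch: pick the workspace config at submission time so failover does not
# require code changes. The config file names and the AML_TARGET variable are hypothetical.
import os

from azureml.core import Workspace

target = os.environ.get("AML_TARGET", "primary")  # set to "secondary" during an outage
ws = Workspace.from_config(path=f".azureml/config-{target}.json")
print(f"Submitting to workspace '{ws.name}' in region '{ws.location}'")
```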
Azure Machine Learning cannot sync or recover artifacts or metadata between workspace instances. Depending on your application deployment strategy, you might have to move artifacts or re-create experimentation inputs, such as dataset objects, in the failover workspace in order to continue run submission. If you've configured your primary and secondary workspaces to share associated resources with geo-replication enabled, some objects might be directly available to the failover workspace; for example, if both workspaces share the same Docker images, configured datastores, and Azure Key Vault resources. The following diagram shows a configuration where two workspaces share the same images (1), datastores (2), and Key Vault (3).
Note
Any jobs that are running when a service outage occurs will not automatically transition to the secondary workspace. It is also unlikely that the jobs will resume and finish successfully in the primary workspace once the outage is resolved. Instead, these jobs must be resubmitted, either in the secondary workspace or in the primary (once the outage is resolved).
Depending on your recovery approach, you may need to copy artifacts such as dataset and model objects between the workspaces to continue your work. Currently, the portability of artifacts between workspaces is limited. We recommend managing artifacts as code where possible so that they can be recreated in the failover instance.
The following artifacts can be exported and imported between workspaces by using the Azure CLI extension for machine learning:
Artifact | Export | Import |
---|---|---|
Models | az ml model download --model-id {ID} --target-dir {PATH} | az ml model register --name {NAME} --path {PATH} |
Environments | az ml environment download -n {NAME} -d {PATH} | az ml environment register -d {PATH} |
Azure ML pipelines (code-generated) | az ml pipeline get --path {PATH} | az ml pipeline create --name {NAME} -y {PATH} |
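If you work from the Python SDK (v1) rather than the CLI, the same model move can be sketched as follows; the model name, target directory, and config file paths are placeholders.

```python
# Minimal sketch: copy a registered model from the primary workspace to the secondary
# workspace. The model name, target directory, and config file paths are placeholders.
from azureml.core import Workspace
from azureml.core.model import Model

primary = Workspace.from_config(path=".azureml/config-primary.json")
secondary = Workspace.from_config(path=".azureml/config-secondary.json")

# Download the latest version of the model from the primary workspace...
model = Model(workspace=primary, name="my-model")
model.download(target_dir="./exported-model", exist_ok=True)

# ...and register the downloaded files in the secondary workspace.
Model.register(
    workspace=secondary,
    model_path="./exported-model",
    model_name="my-model",
)
```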
Tip
- Registered datasets cannot be downloaded or moved. This includes datasets generated by Azure ML, such as intermediate pipeline datasets. However, datasets that refer to a shared file location that both workspaces can access, or whose underlying data storage is replicated, can be registered in both workspaces. Use the az ml dataset register command to register a dataset.
- Run outputs are stored in the default storage account associated with a workspace. While run outputs might become inaccessible from the studio UI in the case of a service outage, you can directly access the data through the storage account. For more information on working with data stored in blobs, see Create, download, and list blobs with Azure CLI.
If you accidentally deleted your workspace, it is currently not possible to recover it. However, you can retrieve your existing notebooks from the corresponding storage account by following these steps:
1. In the Azure portal, navigate to the storage account that was linked to the deleted Azure Machine Learning workspace.
2. In the Data storage section on the left, select File shares.
3. Your notebooks are located on the file share whose name contains your workspace ID.
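If you prefer to script the retrieval, the following sketch uses the azure-storage-file-share package to download the share's contents; the connection string, share name, and local folder are placeholders.

```python
# Minimal sketch: recursively download notebooks from the file share of the storage
# account that backed the deleted workspace. The connection string, share name
# (which contains the former workspace ID), and local folder are placeholders.
import os

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    share_name="code-<workspace-id>",
)

def download_directory(directory_path: str = "") -> None:
    """Download every file under directory_path, preserving the folder structure."""
    dir_client = share.get_directory_client(directory_path)
    for item in dir_client.list_directories_and_files():
        item_path = f"{directory_path}/{item['name']}" if directory_path else item["name"]
        if item["is_directory"]:
            download_directory(item_path)
        else:
            local_path = os.path.join("recovered-notebooks", item_path)
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            with open(local_path, "wb") as f:
                f.write(share.get_file_client(item_path).download_file().readall())

download_directory()
```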
To deploy Azure Machine Learning and its associated resources with your high-availability settings, use an Azure Resource Manager template.