---
title: Train deep learning Keras models
titleSuffix: Azure Machine Learning
description: Learn how to train and register a Keras deep neural network classification model running on TensorFlow using Azure Machine Learning.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.author: larryfr
author: blackmist
ms.reviewer: peterlu
ms.date: 09/28/2020
ms.topic: how-to
ms.custom: sdkv1, event-tier1-build-2022
---
[!INCLUDE sdk v1]
In this article, learn how to run your Keras training scripts with Azure Machine Learning.
The example code in this article shows you how to train and register a Keras classification model built on the TensorFlow backend with Azure Machine Learning. It uses the popular MNIST dataset to classify handwritten digits with a deep neural network (DNN) built using the Keras Python library running on top of TensorFlow.
Keras is a high-level neural network API capable of running on top of other popular DNN frameworks to simplify development. With Azure Machine Learning, you can rapidly scale out training jobs using elastic cloud compute resources. You can also track your training runs, version models, deploy models, and much more.
Whether you're developing a Keras model from the ground up or you're bringing an existing model into the cloud, Azure Machine Learning can help you build production-ready models.
Note
If you are using the Keras API tf.keras built into TensorFlow and not the standalone Keras package, refer instead to Train TensorFlow models.
Run this code on either of these environments:
- Azure Machine Learning compute instance - no downloads or installation necessary
  - Complete the Quickstart: Get started with Azure Machine Learning to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
  - In the samples folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > keras > train-hyperparameter-tune-deploy-with-keras.
- Your own Jupyter Notebook server
  - Install the Azure Machine Learning SDK (>= 1.15.0).
  - Create a workspace configuration file (a minimal sketch follows this list).
  - Download the sample script files keras_mnist.py and utils.py.
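If you use your own notebook server, one way to create the workspace configuration file is to fetch the workspace once by name and write the configuration locally so that later code can call Workspace.from_config(). The following is a minimal sketch; the subscription ID, resource group, and workspace name are placeholders you need to replace with your own values.

```python
from azureml.core import Workspace

# Placeholders: replace with your own subscription ID, resource group, and workspace name.
ws = Workspace.get(name='<workspace-name>',
                   subscription_id='<subscription-id>',
                   resource_group='<resource-group>')

# Writes the configuration to .azureml/config.json so that
# Workspace.from_config() can locate the workspace later.
ws.write_config()
```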
You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.
This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating the FileDataset for the input training data, creating the compute target, and defining the training environment.
First, import the necessary Python libraries.
```python
import os
import azureml
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import Workspace, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
```
The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.
```python
ws = Workspace.from_config()
```
A FileDataset object references one or multiple files in your workspace datastore or public URLs. The files can be of any format, and the class provides you with the ability to download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you applied any transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. See the how-to guide on the Dataset package for more information.
```python
from azureml.core.dataset import Dataset

web_paths = [
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
]
dataset = Dataset.File.from_files(path=web_paths)
```
You can use the register() method to register the dataset to your workspace so that it can be shared with others, reused across various experiments, and referred to by name in your training script.
```python
dataset = dataset.register(workspace=ws,
                           name='mnist-dataset',
                           description='training and test dataset',
                           create_new_version=True)
```
Create a compute target for your training job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.
cluster_name = "gpu-cluster"
try:
compute_target = ComputeTarget(workspace=ws, name=cluster_name)
print('Found existing compute target')
except ComputeTargetException:
print('Creating a new compute target...')
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
max_nodes=4)
compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
[!INCLUDE low-pri-note]
For more information on compute targets, see the what is a compute target article.
Define the Azure ML Environment that encapsulates your training script's dependencies.
First, define your conda dependencies in a YAML file; in this example the file is named conda_dependencies.yml.
```yaml
channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.0.0
  - keras<=2.3.1
  - matplotlib
```
Create an Azure ML environment from this conda environment specification. The environment will be packaged into a Docker container at runtime.
By default, if no base image is specified, Azure ML will use a CPU image, azureml.core.environment.DEFAULT_CPU_IMAGE, as the base image. Since this example runs training on a GPU cluster, you will need to specify a GPU base image that has the necessary GPU drivers and dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use; see the Azure/AzureML-Containers GitHub repo for more information.
```python
keras_env = Environment.from_conda_specification(name='keras-env', file_path='conda_dependencies.yml')

# Specify a GPU base image
keras_env.docker.enabled = True
keras_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04'
```
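As an alternative to building the environment from a conda specification, you could start from one of the curated environments that Azure ML publishes. The snippet below is only a sketch: the curated environment name shown is an example, and the names available vary by SDK version and region, so list them first to find one that fits your training job.

```python
from azureml.core import Environment

# List the environments registered in the workspace, including curated ones.
envs = Environment.list(workspace=ws)
print([name for name in envs if name.startswith('AzureML')])

# Example name only; pick a GPU-enabled TensorFlow environment from the list above.
keras_env = Environment.get(workspace=ws, name='AzureML-TensorFlow-2.0-GPU')
```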
For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.
First, get the data from the workspace datastore using the Dataset class.
```python
dataset = Dataset.get_by_name(ws, 'mnist-dataset')

# list the files referenced by mnist-dataset
dataset.to_path()
```
Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.
Any arguments to your training script will be passed via command line if specified in the arguments parameter. The DatasetConsumptionConfig for our FileDataset is passed as the value of the --data-folder argument to the training script. Azure ML will resolve this DatasetConsumptionConfig to the mount point of the backing datastore, which can then be accessed from the training script. Here, script_folder refers to the local directory that contains the sample scripts keras_mnist.py and utils.py downloaded in the prerequisites.
```python
from azureml.core import ScriptRunConfig

args = ['--data-folder', dataset.as_mount(),
        '--batch-size', 50,
        '--first-layer-neurons', 300,
        '--second-layer-neurons', 100,
        '--learning-rate', 0.001]

src = ScriptRunConfig(source_directory=script_folder,
                      script='keras_mnist.py',
                      arguments=args,
                      compute_target=compute_target,
                      environment=keras_env)
```
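For reference, the script side of this contract looks roughly like the sketch below: the training script parses the same argument names with argparse and treats --data-folder as a local path, because Azure ML has already mounted the FileDataset there. The actual keras_mnist.py sample contains more logic than shown here.

```python
# Minimal sketch of how a training script can consume the arguments passed above.
import argparse
import glob
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder',
                    help='mount point of the FileDataset on the compute target')
parser.add_argument('--batch-size', type=int, dest='batch_size', default=50)
parser.add_argument('--first-layer-neurons', type=int, dest='n_hidden_1', default=300)
parser.add_argument('--second-layer-neurons', type=int, dest='n_hidden_2', default=100)
parser.add_argument('--learning-rate', type=float, dest='learning_rate', default=0.001)
args = parser.parse_args()

# The mounted dataset behaves like a local folder, so the MNIST files
# can be located with ordinary file operations.
mnist_files = glob.glob(os.path.join(args.data_folder, '**', '*.gz'), recursive=True)
print('Found training data files:', mnist_files)
```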
For more information on configuring jobs with ScriptRunConfig, see Configure and submit training runs.
Warning
If you were previously using the TensorFlow estimator to configure your Keras training jobs, please note that Estimators have been deprecated as of the 1.19.0 SDK release. With Azure ML SDK >= 1.15.0, ScriptRunConfig is the recommended way to configure training jobs, including those using deep learning frameworks. For common migration questions, see the Estimator to ScriptRunConfig migration guide.
The Run object provides the interface to the run history while the job is running and after it has completed.
```python
run = Experiment(workspace=ws, name='Tutorial-Keras-Minst').submit(src)
run.wait_for_completion(show_output=True)
```
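If you're working in a Jupyter notebook, you can also monitor the run interactively with the RunDetails widget from the azureml-widgets package; this is optional, and the same information is available in Azure Machine Learning studio.

```python
# Optional: show a live view of the run's status, metrics, and logs in the notebook.
from azureml.widgets import RunDetails

RunDetails(run).show()
```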
As the run is executed, it goes through the following stages:
- Preparing: A docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.
- Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.
- Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.
- Post-Processing: The ./outputs folder of the run is copied over to the run history.
Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.
```python
model = run.register_model(model_name='keras-mnist', model_path='outputs/model')
```
Tip
The deployment how-to contains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.
You can also download a local copy of the model. This can be useful for doing additional model validation work locally. In the training script, keras_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy from the run history.
```python
# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)
```
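As an optional sanity check, you can load the downloaded model locally. The sketch below assumes the training script saved the model architecture as model.json and the weights as model.h5 under outputs/model; adjust the file names to match what your script actually writes.

```python
from keras.models import model_from_json

# Assumed file names: model.json (architecture) and model.h5 (weights).
with open('./model/model.json', 'r') as f:
    loaded_model = model_from_json(f.read())
loaded_model.load_weights('./model/model.h5')
loaded_model.summary()
```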
In this article, you trained and registered a Keras model on Azure Machine Learning. To learn how to deploy a model, continue on to our model deployment article.