title: Troubleshoot environment images
titleSuffix: Azure Machine Learning
description: Learn how to troubleshoot issues with environment image builds and package installations.
services: machine-learning
ms.service: machine-learning
ms.subservice: mlops
author: saachigopal
ms.author: sagopal
ms.date: 03/01/2022
ms.topic: troubleshooting
ms.custom: devx-track-python, event-tier1-build-2022

Troubleshoot environment image builds

Learn how to troubleshoot issues with Docker environment image builds and package installations.

Prerequisites

Docker image build failures

For most image build failures, you'll find the root cause in the image build log. Find the image build log from the Azure Machine Learning portal (20_image_build_log.txt) or from your Azure Container Registry task run logs.
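As a rough aid for reading a long build log, the following minimal sketch scans log text for the error messages quoted in this article. The category labels and the function name classify_build_log are illustrative assumptions, not part of any Azure Machine Learning tooling.

```python
import re

# Error signatures taken from the messages quoted in this article.
# The category labels are illustrative, not official names.
SIGNATURES = {
    "timeout": re.compile(r"ReadTimeoutError|ECONNRESET|Read timed out"),
    "conda package not found": re.compile(r"ResolvePackageNotFound"),
    "pip package not found": re.compile(
        r"No matching distribution found|Could not find a version"
    ),
    "pip uninstall failure": re.compile(r"Cannot uninstall"),
    "hash mismatch": re.compile(r"DO NOT MATCH THE HASHES"),
    "missing distutils": re.compile(r"No module named 'distutils"),
}


def classify_build_log(text):
    """Return (line_number, category, line) for each line matching a known signature."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for category, pattern in SIGNATURES.items():
            if pattern.search(line):
                findings.append((number, category, line.strip()))
                break  # report at most one category per line
    return findings


# Small inline sample; in practice, pass the contents of 20_image_build_log.txt.
sample_log = "Step 5/9 : RUN conda env create\nResolvePackageNotFound: \n  - not-existing-conda-package"
for number, category, line in classify_build_log(sample_log):
    print(f"{number}: [{category}] {line}")
```

The categories correspond to the sections that follow, so a match points you to the part of this article to read next.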

It's usually easier to reproduce errors locally. Check the kind of error and try one of the following steps:

  • Install a conda dependency locally: conda install suspicious-dependency==X.Y.Z.
  • Install a pip dependency locally: pip install suspicious-dependency==X.Y.Z.
  • Try to materialize the entire environment: conda env create -f conda-specification.yml.

Important

Make sure that the platform and interpreter on your local machine match the ones on the remote compute cluster.
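One way to compare the two sides is to run the same short script locally and on the remote compute, then compare the output. The following is a minimal sketch that uses only the Python standard library; the function name describe_interpreter is an assumption made for illustration.

```python
import platform
import sys


def describe_interpreter():
    """Collect the details that most often differ between local and remote builds."""
    return {
        "os": platform.system(),  # for example Linux, Windows, or Darwin
        "machine": platform.machine(),  # for example x86_64 or arm64
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

A difference in any of these values can explain why a dependency installs locally but fails in the image build.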

Timeout

The following network issues can cause timeout errors:

  • Low internet bandwidth
  • Server issues
  • Large dependencies that can't be downloaded with the given conda or pip timeout settings

Messages similar to the following examples will indicate the issue:

('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
ReadTimeoutError("HTTPSConnectionPool(host='****', port=443): Read timed out. (read timeout=15)",)

If you see one of these errors, try one of the following possible solutions:

  • Try a different source, such as mirrors, Azure Blob Storage, or other Python feeds, for the dependency.
  • Update conda or pip. If you're using a custom Dockerfile, update the timeout settings.
  • Some pip versions have known issues. Consider adding a specific version of pip to the environment dependencies.
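When you reproduce a timeout locally, it can help to raise pip's socket timeout and retry count. The sketch below builds such a command with pip's --timeout and --retries options. The values 120 and 10 are arbitrary assumptions, and pip_install_command is a hypothetical helper name.

```python
import shlex
import sys


def pip_install_command(requirement, timeout_seconds=120, retries=10):
    """Build a pip command with a longer socket timeout and more retries."""
    return [
        sys.executable, "-m", "pip", "install",
        "--timeout", str(timeout_seconds),  # pip's default is 15 seconds
        "--retries", str(retries),
        requirement,
    ]


# Placeholder requirement from this article; substitute the failing dependency.
command = pip_install_command("suspicious-dependency==X.Y.Z")
print(shlex.join(command))
```

To run the command rather than print it, pass the list to subprocess.run(command, check=True). In a custom Dockerfile, the same two options can be added to the pip install line.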

Package not found

The following errors are most common for image build failures:

  • Conda package couldn't be found:

    ResolvePackageNotFound: 
    - not-existing-conda-package
    
  • Specified pip package or version couldn't be found:

    ERROR: Could not find a version that satisfies the requirement invalid-pip-package (from versions: none)
    ERROR: No matching distribution found for invalid-pip-package
    
  • Bad nested pip dependency:

    ERROR: No matching distribution found for bad-package==0.0 (from good-package==1.0)
    

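Check that the package exists on the specified sources. The pip search command no longer works against PyPI, because PyPI has disabled the search API that it relies on. To list the versions of a pip dependency that are available, use pip index versions (an experimental command in pip 21.2 and later):

  • pip index versions azureml-core

For conda dependencies, use conda search:

  • conda search conda-forge::numpy

For more options, try:

  • pip index -h
  • conda search -h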

Installer notes

Make sure that the required distribution exists for the specified platform and Python interpreter version.

For pip dependencies, go to https://pypi.org/project/[PROJECT NAME]/[VERSION]/#files to see if the required version is available. Go to https://pypi.org/project/azureml-core/1.11.0/#files to see an example.

For conda dependencies, check the package on the channel repository. For channels maintained by Anaconda, Inc., check the Anaconda Packages page.
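For the pip case, the same check can be scripted against the PyPI JSON API, which serves the file list for a release at https://pypi.org/pypi/[PROJECT NAME]/[VERSION]/json. The following is a minimal sketch. The helper names are assumptions, and the tag matching is deliberately simple: it looks only at the Python tag in each wheel file name, not at the ABI or platform tags.

```python
import json
import urllib.request


def fetch_release_files(project, version):
    """Download the file list for one release from the PyPI JSON API (needs network access)."""
    url = f"https://pypi.org/pypi/{project}/{version}/json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["urls"]


def wheels_for_python(files, python_tag):
    """Return file names of wheels whose Python tag matches, for example 'cp38' or 'py3'."""
    matches = []
    for entry in files:
        name = entry["filename"]
        if not name.endswith(".whl"):
            continue  # skip source distributions
        # Wheel names follow name-version[-build]-python-abi-platform.whl,
        # and the Python tag may be compressed, as in py2.py3.
        tags = name[:-4].split("-")[-3].split(".")
        if python_tag in tags:
            matches.append(name)
    return matches
```

For example, wheels_for_python(fetch_release_files("azureml-core", "1.11.0"), "py3") lists the wheels of that release that declare the py3 tag. An empty list suggests that pip would have to fall back to a source distribution, if one exists.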

Pip package update

During an installation or an update of a pip package, the resolver might need to update an already-installed package to satisfy the new requirements. Uninstallation can fail for various reasons related to the pip version or the way the dependency was installed. The most common scenario is that a dependency installed by conda can't be uninstalled by pip. For this scenario, consider uninstalling the dependency by using conda remove mypackage. The build log shows output similar to the following:

  Attempting uninstall: mypackage
    Found existing installation: mypackage X.Y.Z
ERROR: Cannot uninstall 'mypackage'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Installer issues

Certain installer versions have issues in the package resolvers that can lead to a build failure.

If you're using a custom base image or Dockerfile, we recommend using conda version 4.5.4 or later.

A pip package is required to install pip dependencies. If a version isn't specified in the environment, the latest version will be used. We recommend using a known version of pip to avoid transient issues or breaking changes that the latest version of the tool might cause.

Consider pinning the pip version in your environment if you see the following message:

Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.

Pip subprocess error:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, update the hashes as well. Otherwise, examine the package contents carefully; someone may have tampered with them.

Pip installation can get stuck in an infinite loop if there are unresolvable conflicts in the dependencies. If you're working locally, downgrade the pip version to < 20.3. In a conda environment created from a YAML file, you'll see this issue only if conda-forge is the highest-priority channel. To mitigate the issue, explicitly specify pip < 20.3 as a conda dependency in the conda specification file. You can also exclude only the affected version with pip!=20.3, or pin to another version such as pip=20.2.4.
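To check whether a conda specification already lists pip explicitly, you can inspect it after parsing. The sketch below assumes the YAML file has already been loaded into a Python dictionary (for example with a YAML library, which isn't shown here). The function name pip_pin_status and its return values are illustrative assumptions.

```python
import re


def pip_pin_status(spec):
    """Classify how a parsed conda specification handles pip.

    Returns "missing" (pip section without a pip conda dependency),
    "unpinned" (pip listed without a version), "pinned", or "not needed".
    """
    pip_entry = None
    has_pip_section = False
    for dependency in spec.get("dependencies", []):
        if isinstance(dependency, dict):
            # A nested {"pip": [...]} entry holds the pip-installed packages.
            has_pip_section = has_pip_section or "pip" in dependency
            continue
        package = dependency.split("::")[-1]  # drop an optional channel prefix
        name = re.match(r"\s*[A-Za-z0-9_.\-]+", package)
        if name and name.group().strip() == "pip":
            pip_entry = package.strip()
    if pip_entry is None:
        return "missing" if has_pip_section else "not needed"
    return "unpinned" if pip_entry == "pip" else "pinned"


# Hypothetical specification that pins pip below 20.3, as recommended above.
example_spec = {
    "name": "example-env",
    "channels": ["conda-forge"],
    "dependencies": ["python=3.8", "pip<20.3", {"pip": ["azureml-core"]}],
}
print(pip_pin_status(example_spec))
```

A result of "missing" corresponds to the conda warning quoted earlier in this section, and "unpinned" means the latest pip version will be used.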

ModuleNotFoundError: No module named 'distutils.dir_util'

When setting up your environment, sometimes you'll run into the issue ModuleNotFoundError: No module named 'distutils.dir_util'. To fix it, run the following command:

apt-get install -y --no-install-recommends python3 python3-distutils && \
ln -sf /usr/bin/python3 /usr/bin/python

When working with a Dockerfile, run it as part of a RUN command.

RUN apt-get update && \
  apt-get install -y --no-install-recommends python3 python3-distutils && \
  ln -sf /usr/bin/python3 /usr/bin/python

Running this command installs the correct module dependencies to configure your environment.

Build failure when using Spark packages

Configure the environment to not precache the packages.

env.spark.precache_packages = False

Service-side failures

See the following scenarios to troubleshoot possible service-side failures.

You're unable to pull an image from a container registry, or the address couldn't be resolved for a container registry

Possible issues:

  • The path name to the container registry might not be resolving correctly. Check that image names use double slashes and that the direction of slashes is correct for the host (Linux versus Windows).
  • If a container registry behind a virtual network is using a private endpoint in an unsupported region, configure the container registry by using the service endpoint (public access) from the portal and retry.
  • After you put the container registry behind a virtual network, run the Azure Resource Manager template so the workspace can communicate with the container registry instance.

You get a 401 error from a workspace container registry

Resynchronize storage keys by using ws.sync_keys().

The environment keeps throwing a "Waiting for other conda operations to finish…" error

When an image build is ongoing, conda is locked by the SDK client. If the process crashed or was canceled incorrectly by the user, conda stays in the locked state. To resolve this issue, manually delete the lock file.

Your custom Docker image isn't in the registry

Check that the correct tag is used and that user_managed_dependencies is set to True. Setting Environment.python.user_managed_dependencies = True disables conda and uses the packages that you installed in the image.

You get one of the following common virtual network issues

  • Check that the storage account, compute cluster, and container registry are all in the same subnet of the virtual network.
  • When your container registry is behind a virtual network, it can't directly be used to build images. You'll need to use the compute cluster to build images.
  • Storage might need to be placed behind a virtual network if you:
    • Use inferencing or private wheel.
    • See 403 (not authorized) service errors.
    • Can't get image details from Azure Container Registry.

The image build fails when you're trying to access network protected storage

  • Azure Container Registry tasks don't work behind a virtual network. If your container registry is behind a virtual network, you need to use the compute cluster to build an image.
  • Storage should be behind a virtual network in order to pull dependencies from it.

You can't run experiments when storage has network security enabled

If you're using default Docker images and enabling user-managed dependencies, use the MicrosoftContainerRegistry and AzureFrontDoor.FirstParty service tags to allowlist Azure Container Registry and its dependencies.

For more information, see Enabling virtual networks.

Error response from daemon: get "https://viennaglobal.azurecr.io": context deadline exceeded

This error happens when you have configured the workspace to build images using a compute cluster, and the compute cluster is configured for no public IP address. Using a compute cluster to build images is required if your Azure Container Registry is behind a virtual network. For more information, see Enable Azure Container Registry.

To resolve this error, use the following steps:

  1. Create a new compute cluster that has a public IP address.
  2. Configure the workspace to build images using the compute cluster created in step 1.

Next steps