---
title: Troubleshooting batch endpoints
titleSuffix: Azure Machine Learning
description: Tips to help you succeed with batch endpoints.
services: machine-learning
ms.service: machine-learning
ms.subservice: mlops
ms.topic: troubleshooting
ms.custom: troubleshooting, devplatv2, cliv2, event-tier1-build-2022
ms.reviewer: laobri
ms.author: larryfr
author: blackmist
ms.date: 03/31/2022
---

# Troubleshooting batch endpoints

[!INCLUDE cli v2]

Learn how to troubleshoot, solve, or work around common errors you may encounter when using batch endpoints for batch scoring.

The following table contains common problems and solutions you may see during batch endpoint development and consumption.

| Problem | Possible solution |
| --- | --- |
| Code configuration or environment is missing. | Ensure you provide the scoring script and an environment definition if you're using a non-MLflow model. No-code deployment is supported for MLflow models only. For more information, see Track ML models with MLflow and Azure Machine Learning. |
| Unsupported input data. | Batch endpoints accept input data in three forms: 1) registered data, 2) data in the cloud, and 3) data stored locally. Ensure you're using the right format. For more information, see Use batch endpoints for batch scoring. |
| Output already exists. | If you configure your own output location, ensure you provide a new output for each endpoint invocation. |

## Understanding logs of a batch scoring job

### Get logs

After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job runs asynchronously. There are two options for getting the logs of a batch scoring job.

#### Option 1: Stream logs to local console

You can run the following command to stream system-generated logs to your console. Only logs in the `azureml-logs` folder are streamed.

```azurecli
az ml job stream --name <job_name>
```

#### Option 2: View logs in studio

To get the link to the run in studio, run:

```azurecli
az ml job show --name <job_name> --query interaction_endpoints.Studio.endpoint -o tsv
```

1. Open the job in studio using the value returned by the above command.
2. Choose **batchscoring**.
3. Open the **Outputs + logs** tab.
4. Choose the log(s) you wish to review.

### Understand log structure

There are two top-level log folders, `azureml-logs` and `logs`.

The file `~/azureml-logs/70_driver_log.txt` contains information from the controller that launches the scoring script.

Because of the distributed nature of batch scoring jobs, there are logs from several different sources. However, two combined files are created that provide high-level information:

- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. As the mini-batches end, the log records the results of the job. If the job failed, it shows the error message and where to start troubleshooting.

- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. This log provides information on task creation, progress monitoring, and the run result.
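These overview files are plain text, so after downloading the job's logs you can skim the last few lines of `job_progress_overview.txt` for a quick status check. A minimal sketch — the `tail` helper and the stand-in file below are illustrative, not part of the Azure ML tooling:

```python
import tempfile
from pathlib import Path

def tail(path, n=10):
    """Return the last n lines of a text file as a list of strings."""
    return Path(path).read_text().splitlines()[-n:]

# Demo on a stand-in file; in practice, point this at the
# job_progress_overview.txt downloaded from your batch scoring job.
sample = Path(tempfile.gettempdir()) / "job_progress_overview.txt"
sample.write_text("\n".join(f"mini-batch {i} done" for i in range(100)))

print(tail(sample, 2))
```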

For a concise understanding of errors in your script, there is:

- `~/logs/user/error.txt`: This file summarizes the errors in your script.

For more information on errors in your script, there is:

- `~/logs/user/error/`: This folder contains full stack traces of exceptions thrown while loading and running the entry script.

When you need a full understanding of how each node executed the scoring script, look at the individual process logs for each node. The process logs are in the `sys/node` folder, grouped by worker nodes:

- `~/logs/sys/node/<ip_address>/<process_name>.txt`: This file provides detailed info about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:

  - The IP address and the PID of the worker process.
  - The total number of items, the number of successfully processed items, and the number of failed items.
  - The start time, duration, process time, and run method time.

You can also view the results of periodic checks of the resource usage for each node. The log files and setup files are in this folder:

- `~/logs/perf`: Set `--resource_monitor_interval` to change the checking interval in seconds. The default interval is 600 seconds (10 minutes). To stop the monitoring, set the value to 0. Each `<ip_address>` folder includes:

  - `os/`: Information about all running processes in the node. One check runs an operating system command and saves the result to a file. On Linux, the command is `ps`.
    - `%Y%m%d%H`: The subfolder name is the check time, to the hour.
      - `processes_%M`: The file name ends with the minute of the check time.
  - `node_disk_usage.csv`: Detailed disk usage of the node.
  - `node_resource_usage.csv`: Resource usage overview of the node.
  - `processes_resource_usage.csv`: Resource usage overview of each process.
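These `.csv` files can be inspected with any standard CSV tooling. A minimal sketch using only the Python standard library — the column names below are placeholders (the actual headers depend on your Azure ML version, so check the downloaded file):

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for a downloaded ~/logs/perf/<ip_address>/node_resource_usage.csv.
# The column names here are illustrative only.
sample = Path(tempfile.gettempdir()) / "node_resource_usage.csv"
sample.write_text("time,cpu_percent,memory_mb\n10:00,55,2048\n10:10,61,2112\n")

with sample.open(newline="") as f:
    rows = list(csv.DictReader(f))

print("columns:", list(rows[0].keys()))
print("checks recorded:", len(rows))
```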

### How to log in scoring script

You can use Python logging in your scoring script. Logs are stored in `logs/user/stdout/<node_id>/processNNN.stdout.txt`.

```python
import argparse
import logging

# Parse the logging level passed to the scoring script.
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)

# Initialize the Python logger at the requested level.
logger = logging.getLogger(__name__)
logger.setLevel(args.logging_level.upper())
logger.info("Info log statement")
logger.debug("Debug log statement")
```
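Because the logger's level comes from `--logging_level`, records below that level never reach the stdout log. That filtering can be seen in a self-contained snippet using only the standard `logging` module (the logger name and in-memory buffer are illustrative, not part of the scoring-script contract):

```python
import io
import logging

logger = logging.getLogger("scoring_demo")
logger.setLevel("INFO")  # as if --logging_level INFO were passed

# Capture output in memory instead of the batch job's stdout file.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))

logger.info("Info log statement")
logger.debug("Debug log statement")  # filtered out at INFO level

print(buffer.getvalue())
```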