---
title: Tutorial - Run Python scripts through Data Factory
description: Learn how to run Python scripts as part of a pipeline through Azure Data Factory using Azure Batch.
ms.devlang: python
ms.topic: tutorial
ms.date: 03/12/2021
ms.custom: mvc, devx-track-python
---

# Tutorial: Run Python scripts through Azure Data Factory using Azure Batch

In this tutorial, you learn how to:

> [!div class="checklist"]
>
> * Authenticate with Batch and Storage accounts
> * Develop and run a script in Python
> * Create a pool of compute nodes to run an application
> * Schedule your Python workloads
> * Monitor your analytics pipeline
> * Access your logfiles

The example below runs a Python script that receives CSV input from a blob storage container, performs a data manipulation process, and writes the output to a separate blob storage container.

If you don’t have an Azure subscription, create a free account before you begin.

## Prerequisites

* An Azure Batch account and a linked Azure Storage account.
* Batch Explorer.
* Azure Storage Explorer.
* A local Python installation with the azure-storage-blob and pandas packages, for testing the script before you upload it.
* The iris.csv dataset used by the example script.

## Sign in to Azure

Sign in to the Azure portal at https://portal.azure.com.

[!INCLUDE batch-common-credentials]

## Create a Batch pool using Batch Explorer

In this section, you'll use Batch Explorer to create the Batch pool that your Azure Data Factory pipeline will use. If you prefer to script this step, see the sketch after the list below.

  1. Sign in to Batch Explorer using your Azure credentials.
  2. Select your Batch account.
  3. Create a pool by selecting Pools on the left side bar, then the Add button above the search form.
    1. Choose an ID and display name. We'll use custom-activity-pool for this example.
    2. Set the scale type to Fixed size, and set the dedicated node count to 2.
    3. Under Data science, select Dsvm Windows as the operating system.
    4. Choose Standard_f2s_v2 as the virtual machine size.
    5. Enable the start task and add the command `cmd /c "pip install azure-storage-blob pandas"`. The user identity can remain as the default Pool user.
    6. Select OK.
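
If you'd rather script this step than click through Batch Explorer, the following is a minimal sketch using the azure-batch Python SDK (`pip install azure-batch`). It assumes shared-key authentication; the account URL, key, and DSVM image details are placeholders that you should verify against what Batch Explorer shows for your subscription.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder credentials and endpoint for your Batch account
credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, "https://<batch-account-name>.<region>.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="custom-activity-pool",
    vm_size="standard_f2s_v2",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-dsvm",   # placeholder DSVM image values -- confirm the exact
            offer="dsvm-win-2019",        # publisher/offer/SKU in Batch Explorer before running
            sku="winserver-2019",
            version="latest",
        ),
        node_agent_sku_id="batch.node.windows amd64",
    ),
    # Same start task as configured in the UI; runs under the default pool user
    start_task=batchmodels.StartTask(
        command_line='cmd /c "pip install azure-storage-blob pandas"',
        wait_for_success=True,
    ),
)

batch_client.pool.add(pool)
```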

## Create blob containers

Here you'll create blob containers that will store your input and output files for the Batch job. If you prefer to script this step, see the sketch after these steps.

  1. Sign in to Storage Explorer using your Azure credentials.
  2. Using the storage account linked to your Batch account, create two blob containers (one for input files, one for output files) by following the steps at Create a blob container.
    • In this example, we'll call our input container input, and our output container output.
  3. Upload iris.csv to your input container input using Storage Explorer by following the steps at Managing blobs in a blob container.
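
The same setup can be scripted with the azure-storage-blob package that this tutorial already uses. This is a minimal sketch, assuming iris.csv sits in your working directory and that you substitute your own storage account connection string.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder: copy the connection string from your storage account in the Azure portal
connection_string = "<storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(connection_string)

# Create the input and output containers used in this tutorial
service.create_container("input")
service.create_container("output")

# Upload iris.csv from the current working directory to the input container
with open("iris.csv", "rb") as data:
    service.get_blob_client(container="input", blob="iris.csv").upload_blob(data, overwrite=True)
```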

## Develop a script in Python

The following Python script loads the iris.csv dataset from your input container, performs a data manipulation process, and saves the results back to the output container.

```python
# Load libraries
from azure.storage.blob import BlobClient
import pandas as pd

# Define parameters
connectionString = "<storage-account-connection-string>"
containerName = "output"
outputBlobName = "iris_setosa.csv"

# Establish connection with the blob storage account
blob = BlobClient.from_connection_string(conn_str=connectionString, container_name=containerName, blob_name=outputBlobName)

# Load iris dataset from the task node
df = pd.read_csv("iris.csv")

# Take a subset of the records
df = df[df['Species'] == "setosa"]

# Save the subset of the iris dataframe locally in task node
df.to_csv(outputBlobName, index = False)

# Upload the result file to the output blob container
with open(outputBlobName, "rb") as data:
    blob.upload_blob(data)
```

Save the script as main.py and upload it to the Azure Storage input container. Be sure to test and validate its functionality locally before uploading it:

```bash
python main.py
```
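
After a local run, you can confirm that iris_setosa.csv landed in the output container with a quick listing like the following (a sketch, assuming the same connection string used in main.py):

```python
from azure.storage.blob import ContainerClient

# Placeholder connection string; reuse the value from main.py
container = ContainerClient.from_connection_string("<storage-account-connection-string>", "output")

# iris_setosa.csv should appear here after a successful run
for blob in container.list_blobs():
    print(blob.name)
```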

## Set up an Azure Data Factory pipeline

In this section, you'll create and validate a pipeline using your Python script. A scripted equivalent is sketched after these steps.

  1. Create a data factory by following the steps in the "Create a data factory" section of the Azure Data Factory portal quickstart.

  2. In the Factory Resources box, select the + (plus) button and then select Pipeline

  3. In the General tab, set the name of the pipeline as "Run Python"

  4. In the Activities box, expand Batch Service. Drag the custom activity from the Activities toolbox to the pipeline designer surface. Fill out the following tabs for the custom activity:

    1. In the General tab, specify testPipeline for Name.

    2. In the Azure Batch tab, add the Batch Account that was created in the previous steps, then select Test connection to ensure that it is successful.

    3. In the Settings tab:

      1. Set the Command as `python main.py`.
      2. For the Resource Linked Service, add the storage account that was created in the previous steps. Test the connection to ensure it is successful.
      3. In the Folder Path, select the name of the Azure Blob Storage container that contains the Python script and the associated inputs. This will download the selected files from the container to the pool node instances before the execution of the Python script.


  5. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the >> (right arrow) button.

  6. Click Debug to test the pipeline and ensure it works accurately.

  7. Click Publish to publish the pipeline.

  8. Click Trigger to run the Python script as part of a batch process.

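
If you prefer to define the equivalent pipeline in code rather than in the designer, the sketch below uses the azure-mgmt-datafactory and azure-identity packages (assuming recent versions that accept azure-identity credentials). The subscription, resource group, factory name, and linked service names are hypothetical placeholders; substitute the Batch and Storage linked services configured in your own data factory.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CustomActivity,
    LinkedServiceReference,
    PipelineResource,
)

# Placeholder subscription ID, resource group, and data factory name
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Hypothetical linked service names; use the names shown in your data factory
batch_service = LinkedServiceReference(type="LinkedServiceReference", reference_name="AzureBatchLinkedService")
storage_service = LinkedServiceReference(type="LinkedServiceReference", reference_name="AzureBlobStorageLinkedService")

# Custom activity that mirrors the settings entered in the UI above
activity = CustomActivity(
    name="testPipeline",
    command="python main.py",
    linked_service_name=batch_service,
    resource_linked_service=storage_service,
    folder_path="input",  # container holding main.py and iris.csv
)

pipeline = PipelineResource(activities=[activity])
adf_client.pipelines.create_or_update("<resource-group>", "<data-factory-name>", "Run Python", pipeline)
```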

## Monitor the log files

If your script produces warnings or errors, check stdout.txt and stderr.txt for more information about what was logged. A scripted way to retrieve these files follows the steps below.

  1. Select Jobs from the left-hand side of Batch Explorer.
  2. Choose the job created by your data factory. Assuming you named your pool custom-activity-pool, select adfv2-custom-activity-pool.
  3. Click on the task that had a failure exit code.
  4. View stdout.txt and stderr.txt to investigate and diagnose your problem.
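
The same log files can also be retrieved programmatically with the azure-batch SDK. The following is a sketch with placeholder account values and job ID; it prints stderr.txt for any task in the job that finished with a nonzero exit code.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials and endpoint for your Batch account
credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, "https://<batch-account-name>.<region>.batch.azure.com")

job_id = "<adf-job-id>"  # for example, the job shown under adfv2-custom-activity-pool in Batch Explorer

for task in batch_client.task.list(job_id):
    exit_code = task.execution_info.exit_code if task.execution_info else None
    if exit_code not in (0, None):
        print(f"--- {task.id} (exit code {exit_code}) ---")
        # Stream stderr.txt from the failed task and print it
        stream = batch_client.file.get_from_task(job_id, task.id, "stderr.txt")
        print(b"".join(stream).decode())
```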

## Clean up resources

Although you're not charged for jobs and tasks themselves, you are charged for compute nodes. Thus, we recommend that you allocate pools only as needed. When you delete the pool, all task output on the nodes is deleted. However, the input and output files remain in the storage account. When no longer needed, you can also delete the Batch account and the storage account.
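
If you want to clean up from a script, deleting the pool with the azure-batch SDK is a one-liner once a client exists (a sketch, with placeholder account values):

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials and endpoint for your Batch account
credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, "https://<batch-account-name>.<region>.batch.azure.com")

# Deleting the pool removes its compute nodes and any task output stored on them;
# files in the input and output blob containers are not affected
batch_client.pool.delete("custom-activity-pool")
```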

## Next steps

In this tutorial, you learned how to:

> [!div class="checklist"]
>
> * Authenticate with Batch and Storage accounts
> * Develop and run a script in Python
> * Create a pool of compute nodes to run an application
> * Schedule your Python workloads
> * Monitor your analytics pipeline
> * Access your logfiles

To learn more about Azure Data Factory, see:

> [!div class="nextstepaction"]
> Azure Data Factory overview