| title | description | ms.service | ms.subservice | ms.topic | author | ms.author | ms.date |
|---|---|---|---|---|---|---|---|
| Transform data by using Spark in Azure Data Factory | This tutorial provides step-by-step instructions for transforming data by using a Spark activity in Azure Data Factory. | data-factory | tutorials | tutorial | nabhishek | abnarain | 06/07/2021 |
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
[!div class="checklist"]
- Create a data factory.
- Create a pipeline that uses a Spark activity.
- Trigger a pipeline run.
- Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
- Azure storage account. You create a Python script and an input file, and you upload them to Azure Storage. The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the same storage account as its primary storage.
Note
HDInsight supports only general-purpose storage accounts in the standard tier. Make sure that the account is not a premium or blob-only storage account.
- Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
- Create a Python file named WordCount_Spark.py with the following content:

```python
import sys
from operator import add

from pyspark.sql import SparkSession


def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    # Read the input file as an RDD of lines (one string per line).
    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])

    # Split each line on spaces, emit (word, 1) pairs, and sum the counts per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    # Write the (word, count) pairs to the output folder in the same container.
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()


if __name__ == "__main__":
    main()
```
- Replace <storageaccountname> with the name of your Azure storage account. Then, save the file.
- In Azure Blob storage, create a container named adftutorial if it doesn't already exist.
- Create a folder named spark.
- Create a subfolder named script under the spark folder.
- Upload the WordCount_Spark.py file to the script subfolder.
- Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in this text.
- Create a subfolder named inputfiles in the spark folder.
- Upload the minecraftstory.txt file to the inputfiles subfolder.
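The script and the upload steps above depend on one folder layout: the adftutorial container, a spark root folder, and the script and inputfiles subfolders (the script itself writes to outputfiles/wordcount). The following Python sketch, which is not part of the original tutorial, shows how those blob paths map to the `wasbs://` URIs that the script uses. The helper name `wasbs_uri` and the account name `mystorageaccount` are hypothetical placeholders.

```python
def wasbs_uri(account: str, container: str, blob_path: str) -> str:
    """Build a wasbs:// URI in the same form that WordCount_Spark.py uses."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{blob_path.lstrip('/')}"


ACCOUNT = "mystorageaccount"  # hypothetical; use your own storage account name
CONTAINER = "adftutorial"     # container name from this tutorial

# Blob path of the uploaded script, relative to the container.
script_blob_path = "spark/script/WordCount_Spark.py"

# URIs that the script reads from and writes to.
input_uri = wasbs_uri(ACCOUNT, CONTAINER, "spark/inputfiles/minecraftstory.txt")
output_uri = wasbs_uri(ACCOUNT, CONTAINER, "spark/outputfiles/wordcount")

print(input_uri)
print(output_uri)
```

If the URIs printed here differ from the ones in your copy of WordCount_Spark.py (apart from the account name), the Spark job is likely to fail when it looks for the input file.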
- Open the Microsoft Edge or Google Chrome web browser. Currently, the Data Factory UI is supported only in these two browsers.
- Select New on the left menu, select Data + Analytics, and then select Data Factory.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/new-azure-data-factory-menu.png" alt-text="Data Factory selection in the &quot;New&quot; pane":::
- In the New data factory pane, enter ADFTutorialDataFactory under Name.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/new-azure-data-factory.png" alt-text="&quot;New data factory&quot; pane":::
The name of the Azure data factory must be globally unique. If you see the following error, change the name of the data factory (for example, use <yourname>ADFTutorialDataFactory). For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/name-not-available-error.png" alt-text="Error when a name is not available":::
- For Subscription, select the Azure subscription in which you want to create the data factory.
- For Resource Group, take one of the following steps:
  - Select Use existing, and select an existing resource group from the drop-down list.
  - Select Create new, and enter the name of a resource group.
Some of the steps in this tutorial assume that you use the name ADFTutorialResourceGroup for the resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
- For Version, select V2.
- For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, see Products available by region: select the regions that interest you, and then expand Analytics to locate Data Factory. The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data Factory uses can be in other regions.
- Select Create.
- After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start the Data Factory UI application on a separate tab.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/data-factory-home-page.png" alt-text="Home page for the data factory, with the &quot;Author & Monitor&quot; tile":::
You author two linked services in this section:
- An Azure Storage linked service that links an Azure storage account to the data factory. This storage is used by the on-demand HDInsight cluster. It also contains the Spark script to be run.
- An on-demand HDInsight linked service. Azure Data Factory automatically creates an HDInsight cluster and runs the Spark program. It then deletes the HDInsight cluster after the cluster is idle for a preconfigured time.
- On the home page, switch to the Manage tab in the left panel.
:::image type="content" source="media/doc-common-process/get-started-page-manage-button.png" alt-text="Screenshot that shows the Manage tab.":::
- Select Connections at the bottom of the window, and then select + New.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/new-connection.png" alt-text="Buttons for creating a new connection":::
- In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/select-azure-storage.png" alt-text="Selecting the &quot;Azure Blob Storage&quot; tile":::
- For Storage account name, select the name from the list, and then select Save.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/new-azure-storage-linked-service.png" alt-text="Box for specifying the storage account name":::
- Select the + New button again to create another linked service.
- In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/select-azure-hdinsight.png" alt-text="Selecting the &quot;Azure HDInsight&quot; tile":::
- In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureBlobStorage1, the linked service that you created earlier. If you used a different name, select that name instead.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an HDInsight cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource group in which the cluster is created. For more information, see Create an Azure Active Directory application and service principal. The Service principal id is equivalent to the Application ID, and a Service principal key is equivalent to the value for a Client secret.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data factory. The Spark cluster is created in this resource group.
h. Expand OS type.
i. For Cluster user name, enter a name.
j. For Cluster password, enter a password for that user.
k. Select Finish.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/azure-hdinsight-linked-service-settings.png" alt-text="HDInsight linked service settings":::
Note
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand HDInsight linked service, the HDInsight cluster is created in the same location as the Azure Storage account that's used as its primary storage. Ensure that you have enough core quota in that region for the cluster to be created successfully. For more information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
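The portal settings in the preceding steps correspond roughly to a JSON linked service definition. The following Python sketch builds such a definition as a dictionary so that you can see how the fields relate. It assumes the property names of the `HDInsightOnDemand` linked service type; the `clusterSize` and `timeToLive` values and everything in angle brackets are placeholders chosen for illustration, not values from this tutorial.

```python
import json

# Sketch of an on-demand HDInsight linked service definition.
# Values in angle brackets are hypothetical placeholders.
linked_service = {
    "name": "AzureHDInsightLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "spark",
            "clusterSize": 2,          # assumed number of worker nodes
            "timeToLive": "00:15:00",  # assumed idle time before the cluster is deleted
            "hostSubscriptionId": "<subscriptionId>",
            "servicePrincipalId": "<applicationId>",
            "servicePrincipalKey": {"type": "SecureString", "value": "<clientSecret>"},
            "tenant": "<tenantId>",
            "clusterResourceGroup": "ADFTutorialResourceGroup",
            "clusterUserName": "<clusterUserName>",
            "clusterPassword": {"type": "SecureString", "value": "<clusterPassword>"},
            # Storage linked service that the cluster uses as primary storage.
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference",
            },
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

Treat this as a reading aid for the portal form rather than a definition to deploy as is; avoid storing real secrets in source files.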
- Select the + (plus) button, and then select Pipeline on the menu.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/new-pipeline-menu.png" alt-text="Buttons for creating a new pipeline":::
- In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the pipeline designer surface.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/drag-drop-spark-activity.png" alt-text="Dragging the Spark activity":::
- In the properties window for the Spark activity at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
:::image type="content" source="./media/tutorial-transform-data-spark-portal/select-hdinsight-linked-service.png" alt-text="Specifying the HDInsight linked service":::
- Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureBlobStorage1.
b. Select Browse Storage.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/specify-spark-script.png" alt-text="Specifying the Spark script on the &quot;Script/Jar&quot; tab":::
c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
- To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow) button to close the validation window.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/validate-button.png" alt-text="&quot;Validate&quot; button":::
- Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data Factory service.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/publish-button.png" alt-text="&quot;Publish All&quot; button":::
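The pipeline that you build in the designer can also be described as JSON. The following Python sketch shows roughly what a pipeline with a single Spark activity looks like, assuming the `HDInsightSpark` activity type and its `rootPath`, `entryFilePath`, and `sparkJobLinkedService` properties. The pipeline name `SparkWordCountPipeline` and the activity name `Spark1` are hypothetical; the paths and linked service names come from the earlier steps.

```python
import json

# Sketch of a pipeline definition that contains one Spark activity.
pipeline = {
    "name": "SparkWordCountPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "Spark1",  # hypothetical activity name
                "type": "HDInsightSpark",
                # Compute: the on-demand HDInsight linked service (HDI Cluster tab).
                "linkedServiceName": {
                    "referenceName": "AzureHDInsightLinkedService",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {
                    # Container and root folder that hold the Spark files.
                    "rootPath": "adftutorial/spark",
                    # Script path relative to rootPath (Script/Jar tab).
                    "entryFilePath": "script/WordCount_Spark.py",
                    # Storage linked service that holds the script.
                    "sparkJobLinkedService": {
                        "referenceName": "AzureBlobStorage1",
                        "type": "LinkedServiceReference",
                    },
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Joining `rootPath` and `entryFilePath` gives adftutorial/spark/script/WordCount_Spark.py, which is the file that you selected with Browse Storage.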
Select Add Trigger on the toolbar, and then select Trigger Now.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/trigger-now-menu.png" alt-text="&quot;Trigger&quot; and &quot;Trigger Now&quot; buttons":::
- Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 20 minutes to create a Spark cluster.
- Select Refresh periodically to check the status of the pipeline run.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/monitor-tab.png" alt-text="Tab for monitoring pipeline runs, with &quot;Refresh&quot; button":::
- To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/pipeline-run-succeeded.png" alt-text="Pipeline run status":::
You can switch back to the pipeline runs view by selecting the All Pipeline Runs link at the top.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/activity-runs.png" alt-text="&quot;Activity Runs&quot; view":::
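Selecting Refresh periodically is a manual form of polling. If you later automate monitoring, the same idea can be sketched in Python as a loop that checks the run status until it reaches a terminal state. In this sketch, `get_status` is a hypothetical stand-in for whatever call returns the status of your pipeline run, and the status strings are assumed to match the values shown on the Monitor tab.

```python
import time
from typing import Callable

# Assumed terminal status values for a pipeline run.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}


def wait_for_run(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the run reaches a terminal status or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still in status {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source instead of a real service call.
fake_statuses = iter(["Queued", "InProgress", "InProgress", "Succeeded"])
final_status = wait_for_run(
    lambda: next(fake_statuses),
    poll_seconds=0.0,
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final_status)
```

Because creating the Spark cluster takes approximately 20 minutes, a polling interval of 30 seconds or more and a generous timeout are reasonable starting points.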
Verify that the output file is created in the spark/outputfiles/wordcount folder of the adftutorial container.
:::image type="content" source="./media/tutorial-transform-data-spark-portal/verity-output.png" alt-text="Location of the output file":::
The file should have each word from the input text file and the number of times the word appeared in the file. For example:
(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)
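To check what the output should contain before you run the pipeline, you can reproduce the counting logic locally without Spark. The following sketch is a plain Python approximation of the script's behavior, not the script itself: it splits each line on single spaces and counts the resulting words. The function name `word_count` and the sample text are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def word_count(text: str) -> List[Tuple[str, int]]:
    """Count words the way the Spark script does: split each line on a single space."""
    counts: Counter = Counter()
    for line in text.splitlines():
        counts.update(line.split(" "))
    # Sort for a stable order; Spark does not guarantee any particular order.
    return sorted(counts.items())


sample_text = "This is a test file"
pairs = word_count(sample_text)
for pair in pairs:
    print(pair)
```

Under Python 3, the pairs print without the `u` prefix that appears in the sample output; that prefix is how Python 2 displays Unicode strings.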
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service. You learned how to:
[!div class="checklist"]
- Create a data factory.
- Create a pipeline that uses a Spark activity.
- Trigger a pipeline run.
- Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual network, advance to the next tutorial:
[!div class="nextstepaction"] Tutorial: Transform data using Hive in Azure Virtual Network.