title | description | ms.service | ms.subservice | ms.topic | ms.reviewer | ms.date | author | ms.author | ms.custom |
---|---|---|---|---|---|---|---|---|---|
Quickstart: Read data from ADLS Gen2 to Pandas dataframe |
Read data from an Azure Data Lake Storage Gen2 account into a Pandas dataframe using Python in Synapse Studio in Azure Synapse Analytics. |
synapse-analytics |
machine-learning |
quickstart |
sngun, garye, negust |
03/23/2021 |
WilliamDAssafMSFT |
wiassaf |
mode-other |
In this quickstart, you'll learn how to easily use Python to read data from an Azure Data Lake Storage (ADLS) Gen2 into a Pandas dataframe in Azure Synapse Analytics.
From a Synapse Studio notebook, you'll:
- connect to a container in Data Lake Storage Gen2 that is linked to your Azure Synapse Analytics workspace
- read the data from a PySpark Notebook using
spark.read.load
- convert the data to a Pandas dataframe using
.toPandas()
- Azure subscription - Create one for free.
- Synapse Analytics workspace with Data Lake Storage Gen2 configured as the default storage - You need to be the Storage Blob Data Contributor of the Data Lake Storage Gen2 filesystem that you work with. For details on how to create a workspace, see Creating a Synapse workspace.
- Apache Spark pool in your workspace - See Create a serverless Apache Spark pool.
Sign in to the Azure portal.
-
In the Azure portal, create a container in the same Data Lake Storage Gen2 used by Synapse Studio. You can skip this step if you want to use the default linked storage account in your Azure Synapse Analytics workspace.
-
In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.
-
Download the sample file RetailSales.csv and upload it to the container.
-
Select the uploaded file, click Properties, and copy the ABFSS Path value.
-
In the left pane, click Develop.
-
Click + and select "Notebook" to create a new notebook.
-
In Attach to, select your Apache Spark Pool. If you don't have one, click Create Apache Spark pool.
-
In the notebook code cell, paste the following Python code, inserting the ABFSS path you copied earlier:
%%pyspark data_path = spark.read.load('<ABFSS Path to RetailSales.csv>', format='csv', header=True) data_path.show(10) print('Converting to Pandas.') pdf = data_path.toPandas() print(pdf)
-
Run the cell.
After a few minutes the text displayed should look similar to the following.
Command executed in 25s 324ms by gary on 03-23-2021 17:40:23.481 -07:00
Job execution Succeeded Spark 2 executors 8 cores
+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
|storeId|productCode|quantity|logQuantity|advertising|price|weekStarting| id|
+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
| 2| surface.go| 105|9.264828557| 1| 159| 6/15/2017|d6bd47a7-2ad6-4f0...|
| 2| surface.go| 80|8.987196821| 0| 269| 7/27/2017|64cc74c2-c7da-4e1...|
| 2| surface.go| 68|8.831711918| 1| 209| 8/3/2017|9a2d164b-5e44-44d...|
| 2| surface.go| 28|7.965545573| 0| 209| 8/10/2017|b8cd9987-1d5a-4f4...|
| 2| surface.go| 16|7.377758908| 0| 209| 8/24/2017|ac0ec099-e102-4bf...|
| 2| surface.go| 253| 10.1402973| 1| 189| 8/31/2017|3d22c002-b04c-409...|
| 2| surface.go| 107|9.282847063| 0| 189| 9/7/2017|b6e19699-d684-449...|
| 2| surface.go| 66|8.803273983| 0| 189| 9/14/2017|e89a5838-fb8f-413...|
| 2| surface.go| 65|8.793612072| 0| 179| 9/21/2017|c3278682-16c0-483...|
| 2| surface.go| 17|7.454719949| 0| 269| 10/12/2017|f40190c1-b2ed-46f...|
+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
only showing top 10 rows
Converting to Pandas.
storeId ... id
0 2 ... d6bd47a7-2ad6-4f0a-b8de-ed1386cae5ea
1 2 ... 64cc74c2-c7da-4e12-af64-c95bdf429934
2 2 ... 9a2d164b-5e44-44d7-9837-cf9ae6566c99
3 2 ... b8cd9987-1d5a-4f4f-9346-719d73b1f7f0
4 2 ... ac0ec099-e102-4bfc-9775-983b151dcd03
... ... ... ...
28942 137 ... 6af00133-7015-415d-831b-ddf05bb5828c
28943 137 ... 1e0d3a21-ab43-49c4-89e2-49d202821807
28944 137 ... 5cc7e50a-6aa4-419b-a933-905a667aa2df
28945 137 ... 650ca506-7a4f-46f8-b2e1-e52ceffadf16
28946 137 ... 9bb216f6-04ec-4b61-9e68-34772b814c44
[28947 rows x 8 columns]