title	description	ms.service	ms.subservice	ms.topic	ms.reviewer	ms.date	author	ms.author	ms.custom
Quickstart: Read data from ADLS Gen2 to Pandas dataframe	Read data from an Azure Data Lake Storage Gen2 account into a Pandas dataframe using Python in Synapse Studio in Azure Synapse Analytics.	synapse-analytics	machine-learning	quickstart	sngun, garye, negust	03/23/2021	WilliamDAssafMSFT	wiassaf	mode-other

Quickstart: Read data from ADLS Gen2 to Pandas dataframe in Azure Synapse Analytics

In this quickstart, you'll learn how to easily use Python to read data from an Azure Data Lake Storage (ADLS) Gen2 into a Pandas dataframe in Azure Synapse Analytics.

From a Synapse Studio notebook, you'll:

connect to a container in Data Lake Storage Gen2 that is linked to your Azure Synapse Analytics workspace
read the data from a PySpark Notebook using spark.read.load
convert the data to a Pandas dataframe using .toPandas()

Prerequisites

Azure subscription - Create one for free.
Synapse Analytics workspace with Data Lake Storage Gen2 configured as the default storage - You need to be the Storage Blob Data Contributor of the Data Lake Storage Gen2 filesystem that you work with. For details on how to create a workspace, see Creating a Synapse workspace.
Apache Spark pool in your workspace - See Create a serverless Apache Spark pool.

Sign in to the Azure portal

Sign in to the Azure portal.

Upload sample data to ADLS Gen2

In the Azure portal, create a container in the same Data Lake Storage Gen2 used by Synapse Studio. You can skip this step if you want to use the default linked storage account in your Azure Synapse Analytics workspace.
In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.
Download the sample file RetailSales.csv and upload it to the container.
Select the uploaded file, click Properties, and copy the ABFSS Path value.

Read data from ADLS Gen2 into a Pandas dataframe

In the left pane, click Develop.
Click + and select "Notebook" to create a new notebook.
In Attach to, select your Apache Spark Pool. If you don't have one, click Create Apache Spark pool.

In the notebook code cell, paste the following Python code, inserting the ABFSS path you copied earlier:

%%pyspark
data_path = spark.read.load('<ABFSS Path to RetailSales.csv>', format='csv', header=True)
data_path.show(10)

print('Converting to Pandas.')

pdf = data_path.toPandas()
print(pdf)

Run the cell.

After a few minutes the text displayed should look similar to the following.

Command executed in 25s 324ms by gary on 03-23-2021 17:40:23.481 -07:00

Job execution Succeeded Spark 2 executors 8 cores

+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
|storeId|productCode|quantity|logQuantity|advertising|price|weekStarting|                  id|
+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
|      2| surface.go|     105|9.264828557|          1|  159|   6/15/2017|d6bd47a7-2ad6-4f0...|
|      2| surface.go|      80|8.987196821|          0|  269|   7/27/2017|64cc74c2-c7da-4e1...|
|      2| surface.go|      68|8.831711918|          1|  209|    8/3/2017|9a2d164b-5e44-44d...|
|      2| surface.go|      28|7.965545573|          0|  209|   8/10/2017|b8cd9987-1d5a-4f4...|
|      2| surface.go|      16|7.377758908|          0|  209|   8/24/2017|ac0ec099-e102-4bf...|
|      2| surface.go|     253| 10.1402973|          1|  189|   8/31/2017|3d22c002-b04c-409...|
|      2| surface.go|     107|9.282847063|          0|  189|    9/7/2017|b6e19699-d684-449...|
|      2| surface.go|      66|8.803273983|          0|  189|   9/14/2017|e89a5838-fb8f-413...|
|      2| surface.go|      65|8.793612072|          0|  179|   9/21/2017|c3278682-16c0-483...|
|      2| surface.go|      17|7.454719949|          0|  269|  10/12/2017|f40190c1-b2ed-46f...|
+-------+-----------+--------+-----------+-----------+-----+------------+--------------------+
only showing top 10 rows

Converting to Pandas.
      storeId  ...                                    id
0           2  ...  d6bd47a7-2ad6-4f0a-b8de-ed1386cae5ea
1           2  ...  64cc74c2-c7da-4e12-af64-c95bdf429934
2           2  ...  9a2d164b-5e44-44d7-9837-cf9ae6566c99
3           2  ...  b8cd9987-1d5a-4f4f-9346-719d73b1f7f0
4           2  ...  ac0ec099-e102-4bfc-9775-983b151dcd03
...       ...  ...                                   ...
28942     137  ...  6af00133-7015-415d-831b-ddf05bb5828c
28943     137  ...  1e0d3a21-ab43-49c4-89e2-49d202821807
28944     137  ...  5cc7e50a-6aa4-419b-a933-905a667aa2df
28945     137  ...  650ca506-7a4f-46f8-b2e1-e52ceffadf16
28946     137  ...  9bb216f6-04ec-4b61-9e68-34772b814c44

[28947 rows x 8 columns]

Next steps

What is Azure Synapse Analytics?
Get Started with Azure Synapse Analytics
Create a serverless Apache Spark pool

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Files

quickstart-read-from-gen2-to-pandas-dataframe.md

quickstart-read-from-gen2-to-pandas-dataframe.md

Quickstart: Read data from ADLS Gen2 to Pandas dataframe in Azure Synapse Analytics

Prerequisites

Sign in to the Azure portal

Upload sample data to ADLS Gen2

Read data from ADLS Gen2 into a Pandas dataframe

Next steps

Files

quickstart-read-from-gen2-to-pandas-dataframe.md

Latest commit

History

quickstart-read-from-gen2-to-pandas-dataframe.md

File metadata and controls

Quickstart: Read data from ADLS Gen2 to Pandas dataframe in Azure Synapse Analytics

Prerequisites

Sign in to the Azure portal

Upload sample data to ADLS Gen2

Read data from ADLS Gen2 into a Pandas dataframe

Next steps