---
title: "Quickstart: Get started analyzing with Spark"
description: In this tutorial, you'll learn to analyze data with Apache Spark.
author: saveenr
ms.author: saveenr
ms.reviewer: sngun
ms.service: synapse-analytics
ms.subservice: spark
ms.topic: tutorial
ms.date: 03/24/2021
---
In this tutorial, you'll learn the basic steps to load and analyze data with Apache Spark for Azure Synapse.
## Create a serverless Apache Spark pool

- In Synapse Studio, on the left-side pane, select Manage > Apache Spark pools.
- Select New.
- For Apache Spark pool name, enter Spark1.
- For Node size, enter Small.
- For Number of nodes, set the minimum to 3 and the maximum to 3.
- Select Review + create > Create. Your Apache Spark pool will be ready in a few seconds.
## Understand serverless Spark pools

A serverless Spark pool is a way of indicating how a user wants to work with Spark. When you start using a pool, a Spark session is created if needed. The pool controls how many Spark resources will be used by that session and how long the session will last before it automatically pauses. You pay for Spark resources used during that session, not for the pool itself. In this way, a Spark pool lets you work with Spark without having to worry about managing clusters. This is similar to how a serverless SQL pool works.
## Analyze NYC Taxi data with a Spark pool

> [!NOTE]
> Make sure you have placed the sample data in the primary storage account.
- In Synapse Studio, go to the Develop hub.
- Create a new notebook.
- Create a new code cell and paste the following code in that cell:

  ```py
  %%pyspark
  df = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTripSmall.parquet',
      format='parquet')
  display(df.limit(10))
  ```
- Modify the load URI so that it references the sample file in your storage account, according to the abfss URI scheme.
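As a sketch of the abfss URI scheme, the URI follows the pattern `abfss://<container>@<account>.dfs.core.windows.net/<path>`. The helper below is a hypothetical convenience function (not part of any Spark or Azure API) that assembles such a URI from its parts:

```python
# Build an abfss URI for a file in Azure Data Lake Storage Gen2.
# abfss_uri is a hypothetical helper for illustration; substitute your
# own storage account, container, and file path.
def abfss_uri(container: str, account: str, path: str) -> str:
    """Return a URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

uri = abfss_uri("users", "contosolake", "NYCTripSmall.parquet")
print(uri)  # abfss://users@contosolake.dfs.core.windows.net/NYCTripSmall.parquet
```

Here `users` is the container and `contosolake` is the storage account, matching the sample URI used in the cell above.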
- In the notebook, in the Attach to menu, choose the Spark1 serverless Spark pool that we created earlier.
- Select Run on the cell. Synapse will start a new Spark session to run this cell if needed. If a new Spark session is needed, it will initially take a couple of minutes to be created.
- If you just want to see the schema of the dataframe, run a cell with the following code:

  ```py
  %%pyspark
  df.printSchema()
  ```
## Load the NYC Taxi data into the nyctaxi Spark database

Data is available via the dataframe named df. Load it into a Spark database named nyctaxi.
- Add a new code cell to the notebook, and then enter the following code:

  ```py
  %%pyspark
  spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
  df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
  ```
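The semantics of `CREATE DATABASE IF NOT EXISTS` and of `saveAsTable` write modes can be sketched in plain Python, using a dict as a stand-in catalog. This is illustrative only; real Spark tracks databases and tables in its metastore, and `"overwrite"`, `"append"`, and `"errorifexists"` are the relevant Spark save modes:

```python
# Toy catalog: {database_name: {table_name: rows}}. Illustrative only.
catalog = {}

def create_database_if_not_exists(name):
    # No-op when the database already exists, like CREATE DATABASE IF NOT EXISTS.
    catalog.setdefault(name, {})

def save_as_table(db, table, rows, mode="errorifexists"):
    tables = catalog[db]
    if table in tables:
        if mode == "overwrite":
            tables[table] = list(rows)   # replace existing contents
        elif mode == "append":
            tables[table].extend(rows)   # add to existing contents
        else:
            raise ValueError(f"table {db}.{table} already exists")
    else:
        tables[table] = list(rows)

create_database_if_not_exists("nyctaxi")
save_as_table("nyctaxi", "trip", [1, 2], mode="overwrite")
save_as_table("nyctaxi", "trip", [3], mode="overwrite")  # replaces, not appends
print(catalog)  # {'nyctaxi': {'trip': [3]}}
```

Because the cell above uses `mode("overwrite")`, re-running it replaces the nyctaxi.trip table rather than duplicating its rows.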
## Analyze the NYC Taxi data using Spark and notebooks

- Create a new code cell and enter the following code.

  ```py
  %%pyspark
  df = spark.sql("SELECT * FROM nyctaxi.trip")
  display(df)
  ```
- Run the cell to show the NYC Taxi data we loaded into the nyctaxi Spark database.
- Create a new code cell and enter the following code. We will analyze this data and save the results into a table called nyctaxi.passengercountstats.

  ```py
  %%pyspark
  df = spark.sql("""
      SELECT PassengerCount,
          SUM(TripDistanceMiles) as SumTripDistance,
          AVG(TripDistanceMiles) as AvgTripDistance
      FROM nyctaxi.trip
      WHERE TripDistanceMiles > 0 AND PassengerCount > 0
      GROUP BY PassengerCount
      ORDER BY PassengerCount
  """)
  display(df)
  df.write.saveAsTable("nyctaxi.passengercountstats")
  ```
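To make the SQL above concrete, here is the same filter-group-aggregate logic sketched in plain Python over a handful of hypothetical trip rows (illustrative values only, not taken from NYCTripSmall.parquet):

```python
from collections import defaultdict

# Hypothetical rows of (PassengerCount, TripDistanceMiles).
trips = [(1, 2.0), (1, 4.0), (2, 3.0), (2, 0.0), (0, 5.0)]

# WHERE TripDistanceMiles > 0 AND PassengerCount > 0
valid = [(p, d) for p, d in trips if d > 0 and p > 0]

# GROUP BY PassengerCount
groups = defaultdict(list)
for passengers, distance in valid:
    groups[passengers].append(distance)

# SUM and AVG per group, ORDER BY PassengerCount
stats = {
    p: {"SumTripDistance": sum(ds), "AvgTripDistance": sum(ds) / len(ds)}
    for p, ds in sorted(groups.items())
}
print(stats)
# {1: {'SumTripDistance': 6.0, 'AvgTripDistance': 3.0}, 2: {'SumTripDistance': 3.0, 'AvgTripDistance': 3.0}}
```

The WHERE clause drops the zero-distance and zero-passenger rows before grouping, which is why those rows never contribute to the sums or averages.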
- In the cell results, select Chart to see the data visualized.
## Next steps

> [!div class="nextstepaction"]
> Analyze data with dedicated SQL pool