---
title: Optimizing source performance in mapping data flow
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn about optimizing source performance in mapping data flows in Azure Data Factory and Azure Synapse Analytics pipelines.
author: kromerm
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
ms.date: 09/29/2021
---

# Optimizing sources

For every source except Azure SQL Database, we recommend that you keep **Use current partitioning** as the selected value. When reading from all other source systems, data flows automatically partition the data evenly based on its size. A new partition is created for about every 128 MB of data. As your data size increases, the number of partitions increases.

Any custom partitioning happens after Spark reads in the data and negatively impacts data flow performance. Because the data is evenly partitioned on read, custom partitioning isn't recommended.
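As a rough illustration of the 128 MB rule above, the expected partition count scales with source size. The helper below is an assumption for illustration only, not the service's actual partitioning algorithm:

```python
import math

def estimated_partitions(data_size_mb: float, partition_size_mb: float = 128) -> int:
    """Rough estimate of read partitions, assuming one partition is created
    for about every 128 MB of source data (per the guidance above)."""
    return max(1, math.ceil(data_size_mb / partition_size_mb))

# A 1 GB source splits into about 8 partitions; small sources still get one.
print(estimated_partitions(1024))  # 8
print(estimated_partitions(50))    # 1
```

The exact count is determined by Spark at read time and can vary, but the proportional relationship between data size and partition count holds.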

> [!NOTE]
> Read speeds can be limited by the throughput of your source system.

## Azure SQL Database sources

Azure SQL Database has a unique partitioning option called **Source** partitioning. Enabling source partitioning can improve your read times from Azure SQL Database by opening parallel connections on the source system. Specify the number of partitions and how to partition your data, and use a partition column with high cardinality. You can also enter a query that matches the partitioning scheme of your source table.
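A high-cardinality partition column matters because rows are distributed across partitions by the column's values; a low-cardinality column leaves most partitions empty and wastes the parallel connections. The hash-based assignment below is a hypothetical stand-in, not the actual partitioning scheme:

```python
from collections import Counter

def assign_partition(key, num_partitions: int) -> int:
    """Illustrative hash-based partition assignment (not ADF's actual scheme)."""
    return hash(key) % num_partitions

# Hypothetical rows: customer_id is high-cardinality, region is not.
rows = [{"customer_id": i, "region": "east" if i % 10 else "west"} for i in range(1000)]

# High-cardinality column: rows spread roughly evenly across 4 partitions.
by_id = Counter(assign_partition(r["customer_id"], 4) for r in rows)

# Low-cardinality column: only 2 distinct values exist, so at most 2 of the
# 4 partitions receive any data, and the extra connections sit idle.
by_region = Counter(assign_partition(r["region"], 4) for r in rows)

print(len(by_id))      # 4 non-empty partitions
print(len(by_region))  # at most 2 non-empty partitions
```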

> [!TIP]
> For source partitioning, the I/O of the SQL Server is the bottleneck. Adding too many partitions may saturate your source database. Generally, four or five partitions are ideal when using this option.

:::image type="content" source="media/data-flow/sourcepart3.png" alt-text="Source partitioning":::

### Isolation level

The isolation level of the read on an Azure SQL source system has an impact on performance. Choosing **Read uncommitted** provides the fastest performance and prevents any database locks. To learn more about SQL isolation levels, see Understanding isolation levels.

### Read using query

You can read from Azure SQL Database using a table or a SQL query. If you execute a SQL query, the query must complete before transformation can start. SQL queries can be useful for pushing down operations that might execute faster on the database and reduce the amount of data read from SQL Server, such as SELECT, WHERE, and JOIN statements. When pushing down operations, you lose the ability to track lineage and performance of the transformations before the data comes into the data flow.
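The pushdown trade-off can be sketched with a generic example. SQLite is used here purely as a local stand-in for SQL Server, and the table and column names are invented; the point is that a WHERE clause executed on the database returns far fewer rows than reading everything and filtering afterwards:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "east" if i % 2 else "west", i * 10.0) for i in range(1000)],
)

# Without pushdown: read every row, then filter inside the engine.
all_rows = conn.execute("SELECT * FROM orders").fetchall()
filtered_locally = [r for r in all_rows if r[1] == "east" and r[2] > 5000]

# With pushdown: the WHERE clause runs on the database, so far fewer rows
# cross the wire into the data flow.
pushed_down = conn.execute(
    "SELECT id, amount FROM orders WHERE region = 'east' AND amount > 5000"
).fetchall()

print(len(all_rows), len(pushed_down))  # 1000 vs. 250
```

The results are identical either way; the difference is where the filtering work happens and how much data is transferred.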

## Azure Synapse Analytics sources

When using Azure Synapse Analytics, a setting called **Enable staging** exists in the source options. This allows the service to read from Synapse using staging, which greatly improves read performance by using the Synapse COPY statement for the most performant bulk loading capability. Enabling staging requires you to specify an Azure Blob Storage or Azure Data Lake Storage Gen2 staging location in the data flow activity settings.

:::image type="content" source="media/data-flow/enable-staging.png" alt-text="Enable staging":::

## File-based sources

While data flows support a variety of file types, the Spark-native Parquet format is recommended for optimal read and write times.

If you're running the same data flow on a set of files, we recommend reading from a folder, using wildcard paths, or reading from a list of files. A single data flow activity run can process all of your files in one batch. More information on how to configure these settings can be found in the Source transformation section of the Azure Blob Storage connector documentation.
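The wildcard idea can be sketched locally: a single pattern matches the whole batch of files in one pass, while non-matching files are skipped. The folder layout and file names below are invented for illustration:

```python
import glob
import os
import tempfile

# Create a folder of daily files to stand in for a storage container.
folder = tempfile.mkdtemp()
for day in ("01", "02", "03"):
    with open(os.path.join(folder, f"sales_2021-09-{day}.csv"), "w") as f:
        f.write("id,amount\n")
open(os.path.join(folder, "readme.txt"), "w").close()

# One wildcard path selects every matching file, the same idea as a wildcard
# path on a data flow source picking up all files in a single activity run.
matched = sorted(glob.glob(os.path.join(folder, "sales_*.csv")))
print(len(matched))  # 3 -- the readme is excluded
```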

Where possible, avoid using the ForEach activity to run data flows over a set of files. Each iteration of the ForEach spins up its own Spark cluster, which is often unnecessary and can be expensive.

## Next steps

See other Data Flow articles related to performance: