---
title: Delta format in Azure Data Factory
description: Transform and move data from a delta lake using the delta format
author: kromerm
ms.service: data-factory
ms.subservice: data-flows
ms.topic: conceptual
ms.date: 01/26/2022
ms.author: makromer
---
[!INCLUDE [appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)]
This article highlights how to copy data to and from a delta lake stored in Azure Data Lake Storage Gen2 or Azure Blob Storage using the delta format. This connector is available as an inline dataset in mapping data flows as both a source and a sink.
[!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/RE4ALTs]
The table below lists the properties supported by a delta source. You can edit these properties in the Source options tab.
| Name | Description | Required | Allowed values | Data flow script property |
| ---- | ----------- | -------- | -------------- | -------------------------- |
| Format | Format must be `delta` | yes | `delta` | format |
| File system | The container/file system of the delta lake | yes | String | fileSystem |
| Folder path | The directory of the delta lake | yes | String | folderPath |
| Compression type | The compression type of the delta table | no | `bzip2`<br>`gzip`<br>`deflate`<br>`ZipDeflate`<br>`snappy`<br>`lz4` | compressionType |
| Compression level | Choose whether the compression completes as quickly as possible or if the resulting file should be optimally compressed. | required if `compressionType` is specified | `Optimal` or `Fastest` | compressionLevel |
| Time travel | Choose whether to query an older snapshot of a delta table | no | Query by timestamp: Timestamp<br>Query by version: Integer | timestampAsOf<br>versionAsOf |
| Allow no files found | If true, an error isn't thrown if no files are found | no | `true` or `false` | ignoreNoFilesFound |
Delta is only available as an inline dataset and, by default, doesn't have an associated schema. To get column metadata, click the Import schema button in the Projection tab. This allows you to reference the column names and data types specified by the delta table. To import the schema, a data flow debug session must be active and the folder path must point to an existing delta table.

The associated data flow script is:
```
source(output(movieId as integer,
        title as string,
        releaseDate as date,
        rated as boolean,
        screenedOn as timestamp,
        ticketPrice as decimal(10,2)
        ),
    store: 'local',
    format: 'delta',
    versionAsOf: 0,
    allowSchemaDrift: false,
    folderPath: $tempPath + '/delta'
    ) ~> movies
```
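To query an older snapshot by timestamp instead of by version, swap `versionAsOf` for `timestampAsOf`. The following is a minimal sketch: the stream name `moviesSnapshot` is illustrative, and the timestamp literal shown is an assumption about the expected format, not a value from this article.

```
source(output(movieId as integer,
        title as string
        ),
    store: 'local',
    format: 'delta',
    timestampAsOf: '2022-01-01',
    allowSchemaDrift: false,
    folderPath: $tempPath + '/delta'
    ) ~> moviesSnapshot
```

A source reads a single snapshot, so set either `timestampAsOf` or `versionAsOf`, not both.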
The table below lists the properties supported by a delta sink. You can edit these properties in the Settings tab.
| Name | Description | Required | Allowed values | Data flow script property |
| ---- | ----------- | -------- | -------------- | -------------------------- |
| Format | Format must be `delta` | yes | `delta` | format |
| File system | The container/file system of the delta lake | yes | String | fileSystem |
| Folder path | The directory of the delta lake | yes | String | folderPath |
| Compression type | The compression type of the delta table | no | `bzip2`<br>`gzip`<br>`deflate`<br>`ZipDeflate`<br>`snappy`<br>`lz4` | compressionType |
| Compression level | Choose whether the compression completes as quickly as possible or if the resulting file should be optimally compressed. | required if `compressionType` is specified | `Optimal` or `Fastest` | compressionLevel |
| Vacuum | Specify the retention threshold in hours for older versions of the table. A value of 0 or less defaults to 30 days. | yes | Integer | vacuum |
| Update method | Specify which update operations are allowed on the delta lake. For methods that aren't insert, a preceding alter row transformation is required to mark rows. | yes | `true` or `false` | deletable<br>insertable<br>updateable<br>upsertable |
| Optimized Write | Achieve higher throughput for write operations by optimizing the internal shuffle in Spark executors. As a result, you may notice fewer partitions and files of a larger size. | no | `true` or `false` | optimizedWrite: true |
| Auto Compact | After any write operation completes, Spark automatically executes the `OPTIMIZE` command to reorganize the data, resulting in more partitions if necessary, for better read performance in the future. | no | `true` or `false` | autoCompact: true |
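For example, enabling both write optimizations on a simple insert-only sink could look like the following sketch; the stream names and the vacuum value are illustrative placeholders:

```
movies sink(
    input(movieId as integer,
        title as string
        ),
    insertable: true,
    updateable: false,
    deletable: false,
    upsertable: false,
    optimizedWrite: true,
    autoCompact: true,
    store: 'local',
    format: 'delta',
    vacuum: 0,
    folderPath: $tempPath + '/delta'
    ) ~> moviesDelta
```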
A fuller example that also configures the update methods, keys, and vacuum has the following data flow script:
```
moviesAltered sink(
    input(movieId as integer,
        title as string
        ),
    mapColumn(
        movieId,
        title
        ),
    insertable: true,
    updateable: true,
    deletable: true,
    upsertable: false,
    keys: ['movieId'],
    store: 'local',
    format: 'delta',
    vacuum: 180,
    folderPath: $tempPath + '/delta'
    ) ~> movieDB
```
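Because this sink allows updates and deletes, its incoming stream (`moviesAltered` above) must be produced by an alter row transformation that marks each row. A minimal sketch, assuming a hypothetical string column named `action` in the upstream data that indicates the intended operation:

```
movies alterRow(insertIf(action == 'insert'),
    updateIf(action == 'update'),
    deleteIf(action == 'delete')
    ) ~> moviesAltered
```

Rows marked by each condition are then applied according to the sink's insertable, updateable, deletable, and upsertable settings.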
When writing to a delta sink, there is a known limitation: the number of rows written won't be returned in the monitoring output.
- Create a source transformation in mapping data flow.
- Create a sink transformation in mapping data flow.
- Create an alter row transformation to mark rows as insert, update, upsert, or delete.