articles/machine-learning/how-to-cicd-data-ingestion.md (13 additions, 13 deletions)
@@ -1,5 +1,5 @@
 ---
-title: DevOps for a Data Ingestion pipeline
+title: DevOps for a data ingestion pipeline
 titleSuffix: Azure Machine Learning
 description: Learn how to apply DevOps practices to a data ingestion pipeline implementation used to prepare data for a model training.
 services: machine-learning
@@ -16,9 +16,9 @@ ms.date: 01/30/2020
 
 ---
 
-# DevOps for a Data Ingestion pipeline
+# DevOps for a data ingestion pipeline
 
-In most scenarios, a Data Ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for the Machine Learning model training.
+In most scenarios, a data ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for Machine Learning model training.
 
 ## The solution
 
@@ -34,9 +34,9 @@ As with any software solution, there is a team (for example, Data Engineers) wor
-They collaborate and share the same Azure resources such as Azure Data Factory, Azure Databricks, Azure Storage account and such. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with the code quality tests, unit tests and produces artifacts such as tested code and ARM templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
+They collaborate and share the same Azure resources, such as Azure Data Factory, Azure Databricks, and an Azure Storage account. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with code quality tests and unit tests, and produces artifacts such as tested code and Azure Resource Manager templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
 
-## Source Control Management
+## Source control management
 
 The team members work in slightly different ways to collaborate on the Python notebook source code and the Azure Data Factory source code. However, in both cases the code is stored in a source control repository (for example, Azure DevOps, GitHub, GitLab) and the collaboration is normally based on some branching model (for example, [GitFlow](https://datasift.github.io/gitflow/IntroducingGitFlow.html)).
 
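For orientation on the CI/CD flow described in the paragraph above, a multi-stage Azure Pipeline with one CI stage and one deployment stage per environment could be sketched as follows. This is only a minimal sketch, not the pipeline from the article; the trigger branch, the stage and job names, and the `ubuntu-latest` pool image are assumptions.

```yaml
# Minimal multi-stage Azure Pipelines sketch (stage/job names, trigger branch,
# and pool image are assumptions, not taken from the article).
trigger:
- develop                 # collaboration branch; adjust to the branching model in use

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: CI
  displayName: 'Build and test the data ingestion code'
  jobs:
  - job: Build
    steps:
    - script: echo "run code quality tests and unit tests, then publish artifacts"

- stage: Deploy_to_QA
  displayName: 'Deploy artifacts to the QA environment'
  dependsOn: CI
  jobs:
  - deployment: Deploy
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy notebooks and the Data Factory template here"
```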
@@ -49,7 +49,7 @@ It's highly recommended to store the code in `.py` files rather than in `.ipynb`
 
 The source code of Azure Data Factory pipelines is a collection of json files generated by a workspace. Normally the data engineers work with a visual designer in the Azure Data Factory workspace rather than with the source code files directly. Configure the workspace with a source control repository as it is described in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/source-control#author-with-azure-repos-git-integration). With this configuration in place, the data engineers are able to collaborate on the source code following a preferred branching workflow.
 
-## Continuous Integration (CI)
+## Continuous integration (CI)
 
 The ultimate goal of the Continuous Integration process is to gather the joint team work from the source code and prepare it for the deployment to the downstream environments. As with the source code management this process is different for the Python notebooks and Azure Data Factory pipelines.
 
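For the Python notebook side of the CI described in this section, the build typically lints the code, runs the unit tests, and then publishes the tested code as a pipeline artifact. The steps below are a hedged sketch of that flow, assuming `flake8` and `pytest` as the tools and `code/dataingestion` as the source folder; the actual tools, paths, and artifact name used by the article may differ.

```yaml
# CI steps for the Python notebook code - a sketch only. flake8/pytest, the
# folder layout, and the artifact name 'di-notebooks' are assumptions.
steps:
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'

- script: |
    pip install flake8 pytest
    flake8 code/dataingestion                # code quality checks
    pytest code/dataingestion/tests          # unit tests
  displayName: 'Lint and run unit tests'

- task: CopyFiles@2
  displayName: 'Copy tested source code'
  inputs:
    SourceFolder: 'code/dataingestion'
    Contents: '**'
    TargetFolder: '$(Build.ArtifactStagingDirectory)/di-notebooks'

- task: PublishBuildArtifacts@1
  displayName: 'Publish the artifact'
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'di-notebooks'
```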
@@ -94,10 +94,10 @@ If the linting and unit testing is successful, the pipeline will copy the source
 
 ### Azure Data Factory CI
 
-CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of ARM templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
-The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate ARM templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as of linting and unit testing), generates ARM templates (think of it as of building) and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as of publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in details in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
+The CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of Azure Resource Manager templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
+The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate Azure Resource Manager templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as linting and unit testing), generates Azure Resource Manager templates (think of it as building), and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in detail in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
 
-It's important to make sure that the generated ARM templates are environment agnostic. This means that all values that may differ from between environments are parametrized. The Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
+It's important to make sure that the generated Azure Resource Manager templates are environment agnostic. This means that all values that may differ between environments are parametrized. Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
 
 ```json
 {
@@ -147,7 +147,7 @@ The pipeline activities may refer to the pipeline variables while actually using
-The Azure Data Factory workspace ***doesn't*** expose pipeline variables as ARM templates parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as ARM template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the result json file in the root of the source folder:
+The Azure Data Factory workspace ***doesn't*** expose pipeline variables as Azure Resource Manager template parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as Azure Resource Manager template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the resulting json file in the root of the source folder:
 
 ```json
 "Microsoft.DataFactory/factories/pipelines": {
@@ -179,9 +179,9 @@ Doing so will force the Azure Data Factory workspace to add the variables to the
 }
 ```
 
-The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the ARM template is deployed.
+The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the Azure Resource Manager template is deployed.
 
-## Continuous Delivery (CD)
+## Continuous delivery (CD)
 
 The Continuous Delivery process takes the artifacts and deploys them to the first target environment. It makes sure that the solution works by running tests. If successful, it continues to the next environment. The CD Azure Pipeline consists of multiple stages representing the environments. Each stage contains [deployments](https://docs.microsoft.com/azure/devops/pipelines/process/deployment-jobs?view=azure-devops) and [jobs](https://docs.microsoft.com/azure/devops/pipelines/process/phases?view=azure-devops&tabs=yaml) that perform the following steps:
 * Deploy a Python Notebook to Azure Databricks workspace
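One CD stage built from the deployment jobs and steps mentioned above could be sketched roughly as follows. The stage name ***Deploy_to_QA*** and the variable group ***devops-ds-qa-vg*** appear in the article; the job name, artifact name, target workspace folder, the individual variable names, and the use of the legacy `databricks-cli` for the notebook import are assumptions.

```yaml
# Sketch of one CD stage. 'Deploy_to_QA' and the 'devops-ds-qa-vg' variable
# group come from the article; everything else (artifact name, folder paths,
# variable names, databricks-cli usage) is an assumption.
stages:
- stage: Deploy_to_QA
  variables:
  - group: devops-ds-qa-vg
  jobs:
  - deployment: deploy_notebook
    pool:
      vmImage: 'ubuntu-latest'
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
          - download: current
            artifact: di-notebooks
          - script: |
              pip install databricks-cli
              databricks workspace import_dir --overwrite \
                "$(Pipeline.Workspace)/di-notebooks" \
                "/dataingestion"
            displayName: 'Deploy the notebook to Azure Databricks'
            env:
              DATABRICKS_HOST: $(DATABRICKS_HOST)     # assumed entries in the variable group
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
```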
@@ -234,7 +234,7 @@ The ***Deploy_to_QA*** stage contains a reference to ***devops-ds-qa-vg*** varia
 
 ### Deploy an Azure Data Factory pipeline
 
-A deployable artifact for Azure Data Factory is an ARM template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as it is demonstrated in the following snippet:
+A deployable artifact for Azure Data Factory is an Azure Resource Manager template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as demonstrated in the following snippet:
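The snippet the paragraph refers to is not included in this diff. As a rough, non-authoritative sketch of how the ***Azure Resource Group Deployment*** task is typically used for this purpose, consider the following; the service connection, resource group, template paths, and the override parameter and variable names are all assumptions.

```yaml
# Sketch of deploying the generated Azure Resource Manager template with the
# Azure Resource Group Deployment task. Service connection, resource group,
# template paths, and the override parameter name are assumptions.
- task: AzureResourceGroupDeployment@2
  displayName: 'Deploy the Data Factory ARM template'
  inputs:
    azureSubscription: $(AZURE_RM_CONNECTION)
    action: 'Create Or Update Resource Group'
    resourceGroupName: $(RESOURCE_GROUP)
    location: $(LOCATION)
    templateLocation: 'Linked artifact'
    csmFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateForFactory.json'
    csmParametersFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateParametersForFactory.json'
    overrideParameters: '-data-ingestion-pipeline_properties_variables_data_file_name_defaultValue "$(DATA_FILE_NAME)"'
    deploymentMode: 'Incremental'
```

The `overrideParameters` input is where the target environment values mentioned earlier replace the default values baked into the generated template.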