
Commit c7611b7

Eugene Fedorenko authored and committed Jan 31, 2020
Review comments
1 parent 907aeed commit c7611b7

File tree: 7 files changed (+16 / -13 lines)

articles/machine-learning/how-to-cicd-data-ingestion.md

Lines changed: 13 additions & 13 deletions
@@ -1,5 +1,5 @@
 ---
-title: DevOps for a Data Ingestion pipeline
+title: DevOps for a data ingestion pipeline
 titleSuffix: Azure Machine Learning
 description: Learn how to apply DevOps practices to a data ingestion pipeline implementation used to prepare data for a model training.
 services: machine-learning
@@ -16,9 +16,9 @@ ms.date: 01/30/2020
 
 ---
 
-# DevOps for a Data Ingestion pipeline
+# DevOps for a data ingestion pipeline
 
-In most scenarios, a Data Ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for the Machine Learning model training.
+In most scenarios, a data ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for the Machine Learning model training.
 
 ## The solution
 
@@ -34,9 +34,9 @@ As with any software solution, there is a team (for example, Data Engineers) wor
 
 ![cicd-data-ingestion](media/how-to-cicd-data-ingestion/cicd-data-ingestion.png)
 
-They collaborate and share the same Azure resources such as Azure Data Factory, Azure Databricks, Azure Storage account and such. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with the code quality tests, unit tests and produces artifacts such as tested code and ARM templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
+They collaborate and share the same Azure resources such as Azure Data Factory, Azure Databricks, Azure Storage account and such. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with the code quality tests, unit tests and produces artifacts such as tested code and Azure Resource Manager templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
 
-## Source Control Management
+## Source control management
 
 The team members work in slightly different ways to collaborate on the Python notebook source code and the Azure Data Factory source code. However, in both cases the code is stored in a source control repository (for example, Azure DevOps, GitHub, GitLab) and the collaboration is normally based on some branching model (for example, [GitFlow](https://datasift.github.io/gitflow/IntroducingGitFlow.html)).
 
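For orientation, the CI flow described in the paragraph above maps naturally onto an Azure Pipelines YAML definition. The following is a minimal sketch, not the pipeline from the article: the trigger branch, folder layout, and artifact name are illustrative assumptions.

```yaml
# Minimal CI sketch (illustrative): lint and unit-test the Python source,
# then publish the tested code as a pipeline artifact for the CD stages.
trigger:
- develop

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'

- script: |
    pip install flake8 pytest
    flake8 code/dataingestion
    pytest code/dataingestion/tests
  displayName: 'Lint and run unit tests'

# Publish the source folder as an artifact consumed by the CD pipeline.
- publish: code/dataingestion
  artifact: di-notebooks
```
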
@@ -49,7 +49,7 @@ It's highly recommended to store the code in `.py` files rather than in `.ipynb`
 
 The source code of Azure Data Factory pipelines is a collection of json files generated by a workspace. Normally the data engineers work with a visual designer in the Azure Data Factory workspace rather than with the source code files directly. Configure the workspace with a source control repository as it is described in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/source-control#author-with-azure-repos-git-integration). With this configuration in place, the data engineers are able to collaborate on the source code following a preferred branching workflow.
 
-## Continuous Integration (CI)
+## Continuous integration (CI)
 
 The ultimate goal of the Continuous Integration process is to gather the joint team work from the source code and prepare it for the deployment to the downstream environments. As with the source code management this process is different for the Python notebooks and Azure Data Factory pipelines.
 
@@ -94,10 +94,10 @@ If the linting and unit testing is successful, the pipeline will copy the source
 
 ### Azure Data Factory CI
 
-CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of ARM templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
-The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate ARM templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as of linting and unit testing), generates ARM templates (think of it as of building) and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as of publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in details in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
+CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of Azure Resource Manager templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
+The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate Azure Resource Manager templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as of linting and unit testing), generates Azure Resource Manager templates (think of it as of building) and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as of publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in details in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
 
-It's important to make sure that the generated ARM templates are environment agnostic. This means that all values that may differ from between environments are parametrized. The Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
+It's important to make sure that the generated Azure Resource Manager templates are environment agnostic. This means that all values that may differ from between environments are parametrized. The Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
 
 ```json
 {
@@ -147,7 +147,7 @@ The pipeline activities may refer to the pipeline variables while actually using
 
 ![adf-notebook-parameters](media/how-to-cicd-data-ingestion/adf-notebook-parameters.png)
 
-The Azure Data Factory workspace ***doesn't*** expose pipeline variables as ARM templates parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as ARM template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the result json file in the root of the source folder:
+The Azure Data Factory workspace ***doesn't*** expose pipeline variables as Azure Resource Manager templates parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as Azure Resource Manager template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the result json file in the root of the source folder:
 
 ```json
 "Microsoft.DataFactory/factories/pipelines": {
@@ -179,9 +179,9 @@ Doing so will force the Azure Data Factory workspace to add the variables to the
 }
 ```
 
-The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the ARM template is deployed.
+The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the Azure Resource Manager template is deployed.
 
-## Continuous Delivery (CD)
+## Continuous delivery (CD)
 
 The Continuous Delivery process takes the artifacts and deploys them to the first target environment. It makes sure that the solution works by running tests. If successful, it continues to the next environment. The CD Azure Pipeline consists of multiple stages representing the environments. Each stage contains [deployments](https://docs.microsoft.com/azure/devops/pipelines/process/deployment-jobs?view=azure-devops) and [jobs](https://docs.microsoft.com/azure/devops/pipelines/process/phases?view=azure-devops&tabs=yaml) that perform the following steps:
 * Deploy a Python Notebook to Azure Databricks workspace
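To make that stage/deployment structure concrete, a skeleton of one CD stage might look like the sketch below. The stage name and variable group mirror the ***Deploy_to_QA*** and ***devops-ds-qa-vg*** references later in the diff; the artifact name and the deployment steps themselves are placeholders, not the article's actual tasks.

```yaml
# Skeleton of a single CD stage (illustrative). Each target environment gets its
# own stage; the variable group carries environment-specific values.
stages:
- stage: Deploy_to_QA
  variables:
  - group: devops-ds-qa-vg
  jobs:
  - deployment: Deploy_Notebook
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
          - download: current
            artifact: di-notebooks
          - script: echo "Deploy the notebook, run the pipeline, and run integration tests here"
            displayName: 'Placeholder deployment and test steps'
```
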
@@ -234,7 +234,7 @@ The ***Deploy_to_QA*** stage contains a reference to ***devops-ds-qa-vg*** varia
 
 ### Deploy an Azure Data Factory pipeline
 
-A deployable artifact for Azure Data Factory is an ARM template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as it is demonstrated in the following snippet:
+A deployable artifact for Azure Data Factory is an Azure Resource Manager template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as it is demonstrated in the following snippet:
 
 ```yaml
 - deployment: "Deploy_to_ADF"
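Only the first line of that YAML snippet is visible in this hunk. As a hedged sketch of the shape such a job takes, a deployment using the ***Azure Resource Group Deployment*** task generally looks like the following; the service connection, resource group, template paths, and override parameters are illustrative assumptions rather than the article's documented values.

```yaml
# Illustrative sketch of deploying the generated Resource Manager template.
# Names and paths are placeholders; real values come from the adf_publish
# artifacts and the stage's variable group.
- deployment: "Deploy_to_ADF"
  environment: qa
  strategy:
    runOnce:
      deploy:
        steps:
        - task: AzureResourceGroupDeployment@2
          displayName: 'Deploy Azure Data Factory Azure Resource Manager template'
          inputs:
            azureSubscription: 'devops-ds-service-connection'  # placeholder service connection
            resourceGroupName: 'devops-ds-qa-rg'               # placeholder resource group
            location: 'West US 2'
            csmFile: '$(Pipeline.Workspace)/adf-templates/ARMTemplateForFactory.json'
            csmParametersFile: '$(Pipeline.Workspace)/adf-templates/ARMTemplateParametersForFactory.json'
            overrideParameters: '-dataFactoryName $(DataFactoryName)'
            deploymentMode: 'Incremental'
```
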

articles/machine-learning/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -198,6 +198,9 @@
 - name: Create datasets with labels
   displayName: data, labels, torchvision
   href: how-to-use-labeled-dataset.md
+- name: DevOps for data ingestion
+  displayName: data, ingestion, devops
+  href: how-to-cicd-data-ingestion.md
 - name: Train models
   items:
   - name: Use the designer
