articles/machine-learning/how-to-cicd-data-ingestion.md (13 additions, 13 deletions)
@@ -1,5 +1,5 @@
 ---
-title: DevOps for a Data Ingestion pipeline
+title: DevOps for a data ingestion pipeline
 titleSuffix: Azure Machine Learning
 description: Learn how to apply DevOps practices to a data ingestion pipeline implementation used to prepare data for a model training.
 services: machine-learning
@@ -16,9 +16,9 @@ ms.date: 01/30/2020
 
 ---
 
-# DevOps for a Data Ingestion pipeline
+# DevOps for a data ingestion pipeline
 
-In most scenarios, a Data Ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for the Machine Learning model training.
+In most scenarios, a data ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline. The pipeline prepares the data for Machine Learning model training.
 
 ## The solution
 
@@ -34,9 +34,9 @@ As with any software solution, there is a team (for example, Data Engineers) wor
-They collaborate and share the same Azure resources such as Azure Data Factory, Azure Databricks, Azure Storage account and such. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with the code quality tests, unit tests and produces artifacts such as tested code and ARM templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
+They collaborate and share the same Azure resources, such as Azure Data Factory, Azure Databricks, and an Azure Storage account. The collection of these resources is a Development environment. The data engineers contribute to the same source code base. The Continuous Integration process assembles the code, checks it with code quality tests and unit tests, and produces artifacts such as tested code and Azure Resource Manager templates. The Continuous Delivery process deploys the artifacts to the downstream environments. This article demonstrates how to automate the CI and CD processes with [Azure Pipelines](https://azure.microsoft.com/services/devops/pipelines/).
 
-## Source Control Management
+## Source control management
 
 The team members work in slightly different ways to collaborate on the Python notebook source code and the Azure Data Factory source code. However, in both cases the code is stored in a source control repository (for example, Azure DevOps, GitHub, GitLab) and the collaboration is normally based on some branching model (for example, [GitFlow](https://datasift.github.io/gitflow/IntroducingGitFlow.html)).
 
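For orientation on the CI/CD flow described in the paragraph above, a multi-stage Azure Pipeline with one CI stage and one deployment stage per environment could be sketched as follows. This is only a minimal sketch, not the pipeline from the article; the trigger branch, the stage and job names, and the `ubuntu-latest` pool image are assumptions.

```yaml
# Minimal multi-stage Azure Pipelines sketch (stage/job names, trigger branch,
# and pool image are assumptions, not taken from the article).
trigger:
- develop                 # collaboration branch; adjust to the branching model in use

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: CI
  displayName: 'Build and test the data ingestion code'
  jobs:
  - job: Build
    steps:
    - script: echo "run code quality tests and unit tests, then publish artifacts"

- stage: Deploy_to_QA
  displayName: 'Deploy artifacts to the QA environment'
  dependsOn: CI
  jobs:
  - deployment: Deploy
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy notebooks and the Data Factory template here"
```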
@@ -49,7 +49,7 @@ It's highly recommended to store the code in `.py` files rather than in `.ipynb`
 
 The source code of Azure Data Factory pipelines is a collection of json files generated by a workspace. Normally the data engineers work with a visual designer in the Azure Data Factory workspace rather than with the source code files directly. Configure the workspace with a source control repository as it is described in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/source-control#author-with-azure-repos-git-integration). With this configuration in place, the data engineers are able to collaborate on the source code following a preferred branching workflow.
 
-## Continuous Integration (CI)
+## Continuous integration (CI)
 
 The ultimate goal of the Continuous Integration process is to gather the joint team work from the source code and prepare it for the deployment to the downstream environments. As with the source code management this process is different for the Python notebooks and Azure Data Factory pipelines.
 
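For the Python notebook side of the CI described in this section, the build typically lints the code, runs the unit tests, and then publishes the tested code as a pipeline artifact. The steps below are a hedged sketch of that flow, assuming `flake8` and `pytest` as the tools and `code/dataingestion` as the source folder; the actual tools, paths, and artifact name used by the article may differ.

```yaml
# CI steps for the Python notebook code - a sketch only. flake8/pytest, the
# folder layout, and the artifact name 'di-notebooks' are assumptions.
steps:
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'

- script: |
    pip install flake8 pytest
    flake8 code/dataingestion                # code quality checks
    pytest code/dataingestion/tests          # unit tests
  displayName: 'Lint and run unit tests'

- task: CopyFiles@2
  displayName: 'Copy tested source code'
  inputs:
    SourceFolder: 'code/dataingestion'
    Contents: '**'
    TargetFolder: '$(Build.ArtifactStagingDirectory)/di-notebooks'

- task: PublishBuildArtifacts@1
  displayName: 'Publish the artifact'
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'di-notebooks'
```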
@@ -94,10 +94,10 @@ If the linting and unit testing is successful, the pipeline will copy the source
 
 ### Azure Data Factory CI
 
-CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of ARM templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
-The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate ARM templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as of linting and unit testing), generates ARM templates (think of it as of building) and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as of publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in details in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
+The CI process for an Azure Data Factory pipeline is a bottleneck in the whole CI/CD story for a data ingestion pipeline. There's no ***Continuous*** Integration. A deployable artifact for Azure Data Factory is a collection of Azure Resource Manager templates. The only way to produce those templates is to click the ***publish*** button in the Azure Data Factory workspace. There's no automation here.
+The data engineers merge the source code from their feature branches into the collaboration branch, for example, ***master*** or ***develop***. Then, someone with the granted permissions clicks the ***publish*** button to generate Azure Resource Manager templates from the source code in the collaboration branch. When the button is clicked, the workspace validates the pipelines (think of it as linting and unit testing), generates Azure Resource Manager templates (think of it as building), and saves the generated templates to a technical branch ***adf_publish*** in the same code repository (think of it as publishing artifacts). This branch is created automatically by the Azure Data Factory workspace. This process is described in detail in the [Azure Data Factory documentation](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment).
 
-It's important to make sure that the generated ARM templates are environment agnostic. This means that all values that may differ from between environments are parametrized. The Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
+It's important to make sure that the generated Azure Resource Manager templates are environment agnostic. This means that all values that may differ between environments are parametrized. Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:
 
 ```json
 {
@@ -147,7 +147,7 @@ The pipeline activities may refer to the pipeline variables while actually using
-The Azure Data Factory workspace ***doesn't*** expose pipeline variables as ARM templates parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as ARM template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the result json file in the root of the source folder:
+The Azure Data Factory workspace ***doesn't*** expose pipeline variables as Azure Resource Manager template parameters by default. The workspace uses the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) dictating what pipeline properties should be exposed as Azure Resource Manager template parameters. In order to add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the [Default Parameterization Template](https://docs.microsoft.com/azure/data-factory/continuous-integration-deployment#default-parameterization-template) with the following snippet and place the resulting json file in the root of the source folder:
 
 ```json
 "Microsoft.DataFactory/factories/pipelines": {
@@ -179,9 +179,9 @@ Doing so will force the Azure Data Factory workspace to add the variables to the
 }
 ```
 
-The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the ARM template is deployed.
+The values in the json file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the Azure Resource Manager template is deployed.
 
-## Continuous Delivery (CD)
+## Continuous delivery (CD)
 
 The Continuous Delivery process takes the artifacts and deploys them to the first target environment. It makes sure that the solution works by running tests. If successful, it continues to the next environment. The CD Azure Pipeline consists of multiple stages representing the environments. Each stage contains [deployments](https://docs.microsoft.com/azure/devops/pipelines/process/deployment-jobs?view=azure-devops) and [jobs](https://docs.microsoft.com/azure/devops/pipelines/process/phases?view=azure-devops&tabs=yaml) that perform the following steps:
 * Deploy a Python Notebook to Azure Databricks workspace
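One CD stage built from the deployment jobs and steps mentioned above could be sketched roughly as follows. The stage name ***Deploy_to_QA*** and the variable group ***devops-ds-qa-vg*** appear in the article; the job name, artifact name, target workspace folder, the individual variable names, and the use of the legacy `databricks-cli` for the notebook import are assumptions.

```yaml
# Sketch of one CD stage. 'Deploy_to_QA' and the 'devops-ds-qa-vg' variable
# group come from the article; everything else (artifact name, folder paths,
# variable names, databricks-cli usage) is an assumption.
stages:
- stage: Deploy_to_QA
  variables:
  - group: devops-ds-qa-vg
  jobs:
  - deployment: deploy_notebook
    pool:
      vmImage: 'ubuntu-latest'
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
          - download: current
            artifact: di-notebooks
          - script: |
              pip install databricks-cli
              databricks workspace import_dir --overwrite \
                "$(Pipeline.Workspace)/di-notebooks" \
                "/dataingestion"
            displayName: 'Deploy the notebook to Azure Databricks'
            env:
              DATABRICKS_HOST: $(DATABRICKS_HOST)     # assumed entries in the variable group
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
```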
@@ -234,7 +234,7 @@ The ***Deploy_to_QA*** stage contains a reference to ***devops-ds-qa-vg*** varia
 
 ### Deploy an Azure Data Factory pipeline
 
-A deployable artifact for Azure Data Factory is an ARM template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as it is demonstrated in the following snippet:
+A deployable artifact for Azure Data Factory is an Azure Resource Manager template. Therefore, it's going to be deployed with the ***Azure Resource Group Deployment*** task as demonstrated in the following snippet:
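The snippet the paragraph refers to is not included in this diff. As a rough, non-authoritative sketch of how the ***Azure Resource Group Deployment*** task is typically used for this purpose, consider the following; the service connection, resource group, template paths, and the override parameter and variable names are all assumptions.

```yaml
# Sketch of deploying the generated Azure Resource Manager template with the
# Azure Resource Group Deployment task. Service connection, resource group,
# template paths, and the override parameter name are assumptions.
- task: AzureResourceGroupDeployment@2
  displayName: 'Deploy the Data Factory ARM template'
  inputs:
    azureSubscription: $(AZURE_RM_CONNECTION)
    action: 'Create Or Update Resource Group'
    resourceGroupName: $(RESOURCE_GROUP)
    location: $(LOCATION)
    templateLocation: 'Linked artifact'
    csmFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateForFactory.json'
    csmParametersFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateParametersForFactory.json'
    overrideParameters: '-data-ingestion-pipeline_properties_variables_data_file_name_defaultValue "$(DATA_FILE_NAME)"'
    deploymentMode: 'Incremental'
```

The `overrideParameters` input is where the target environment values mentioned earlier replace the default values baked into the generated template.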