Kickstarting Data Science Projects in Azure DevOps (Part 2)

Henkel Data & Analytics
Henkel Data & Analytics Blog
6 min read · Apr 22, 2024


By Roberto Alonso.

Before starting, the good news: The first article Kickstarting Data Science Projects in Azure DevOps (part 1) has gained attention and enthusiasm in the community, so we decided not only to write this planned second part, but also to add a third part, in which we will cover the contents of the template repository itself. In the template, we leverage hatch, ruff, and mkdocs, among others. So, stay tuned for the third part and enjoy this second part of the series.

Now that we have an Azure DevOps Board with Work Items (see the first part), it is natural for data scientists to jump right into the actual experimentation and later the documentation. This article, the second part of this series, is all about repository and wiki automation.

Automation of repository creation on Azure DevOps

Our project structure is based on the practices we have observed among our data scientists. We understand that not all data science use cases are the same, and our common structure covers only a portion of them. However, we argue that it is a good starting point; if the structure does not fit entirely, it can be changed.

In general, our automation has two purposes:

  1. Speed up the initial creation of repositories and wikis
  2. Streamline the way to work with Python and Azure Machine Learning

Overall, we have a common way of structuring our projects as follows:

  • .pipelines – YAML definitions of the Azure DevOps pipelines (e.g., deployment of models)
  • aml – Code related to Azure Machine Learning pipelines
  • notebooks – Jupyter notebooks
  • src – Source code of the Python package
  • wiki – Non-technical documentation about the project
  • .gitignore – Ignore common files that we have identified
  • .pre-commit-config.yaml – Pre-commit hooks configuration
  • mkdocs.yml – Configuration of Material for MkDocs
  • pyproject.toml – Configuration of the Python project (e.g., ruff, pytest, hatch, etc.)
  • README.md – High-level description of the project for other data scientists
  • setup.sh – Bash script that initializes Python considering our corporate proxy

Implementation

For the automation we leverage the Azure DevOps REST API and Cookiecutter. Cookiecutter is a command line utility that creates projects from templates based on Jinja2. Concretely, we have created a Cookiecutter template with the structure we have shown above.
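Cookiecutter can also be driven from Python instead of the command line, which is handy for quickly testing the template locally. The snippet below is a minimal sketch, assuming a local checkout of the template in ./cookiecutter_template and illustrative context values; the actual variable names are defined in the template's cookiecutter.json.

# Minimal sketch: render the Data Science template with Cookiecutter's Python API.
# The template path and the context keys below are illustrative assumptions.
from cookiecutter.main import cookiecutter

cookiecutter(
    "./cookiecutter_template",  # local path (a Git URL also works)
    no_input=True,              # do not prompt interactively
    output_dir=".",             # where the generated project lands
    extra_context={
        "project_name": "demand-forecasting",
        "project_manager": "Jane Doe",
        "lead_ds": "John Doe",
    },
)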

The complete process looks like this:

  1. Create an empty Git repository and initialize the main branch of the repository with a README.md file.
  2. Execute Cookiecutter with the Data Science template structure and commit the created files into the repository.

Create an empty repository & initialize it

To create an empty repository, we use the Repositories API in the following way:

url = f"https://dev.azure.com/{organization}/{project_id}/_apis/git/repositories"
querystring = {"api-version": "6.0"}
payload = {"name": repo_name}

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, token), params=querystring, headers=headers)
repository_id = response["id"]

In this case, we save the repository ID so we can reference it later for the initialization of the repo.

Then, to initialize the repository with an empty README.md file, we use the Pushes API. The reasoning behind the initialization is that it allows us to easily check out the repo from the pipeline and update it after the Cookiecutter process finishes, but this step is optional.

url = f"https://dev.azure.com/{organization}/{project_id}/_apis/git/repositories/{repository_id}/pushes"
querystring = {"api-version": "6.0"}

payload = {
"commits": [
{
"comment": "ops: initial commit",
"changes": [
{
"changeType": 1,
"item": {"path": "/README.md"},
"newContentTemplate": {"name": "README.md", "type": "readme"},
}
],
}
],
"refUpdates": [
{
"name": "refs/heads/main",
"oldObjectId": "0000000000000000000000000000000000000000",
}
],
}

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, token), params=querystring, headers=headers)
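The newContentTemplate entry lets Azure DevOps generate the README from its built-in template. If you prefer to control the initial file content yourself, the Pushes API also accepts explicit content via newContent. The following variant is only a sketch of that option (not what our automation uses), reusing the variables from the snippet above.

# Variant (not used in our automation): initialize the repository with explicit
# README content instead of the built-in "readme" template.
payload = {
    "refUpdates": [
        {
            "name": "refs/heads/main",
            "oldObjectId": "0000000000000000000000000000000000000000",
        }
    ],
    "commits": [
        {
            "comment": "ops: initial commit",
            "changes": [
                {
                    "changeType": "add",
                    "item": {"path": "/README.md"},
                    "newContent": {
                        "content": "# New Data Science Project\n",
                        "contentType": "rawtext",
                    },
                }
            ],
        }
    ],
}
response = requests.request(
    "POST", url, json=payload, auth=HTTPBasicAuth(user_name, token), params=querystring, headers=headers
)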

Execution of the Cookiecutter template & commit changes

This is a straightforward execution of Cookiecutter with our own Data Science Cookiecutter template. After Cookiecutter creates the whole structure, we simply commit and push the changes to the newly created repository.

We execute Cookiecutter from an Azure DevOps pipeline. An example YAML definition is given below:

trigger: none  # the pipeline is run manually (self-service)

parameters:
  - name: project_manager
    type: string
    displayName: project_manager
  - name: repo_name
    type: string
    displayName: repo_name

stages:
  - stage: Create_and_push_to_repo
    displayName: "Create project structure and push to created repo"
    jobs:
      - job: create_n_push
        pool:
          vmImage: "ubuntu-20.04"
        steps:
          - checkout: "git://gitOrganization/${{ parameters.repo_name }}"
            persistCredentials: true
          - checkout: self
          - task: UsePythonVersion@0
            displayName: Fix Python version to 3.10
            inputs:
              versionSpec: "3.10"
          - bash: pip install --user cookiecutter
            displayName: Install cookiecutter
          - bash: cookiecutter . --no-input project_manager='${{parameters.project_manager}}'
            workingDirectory: "$(System.DefaultWorkingDirectory)/cookiecutter_template"
            displayName: Creates cookiecutter project
          - bash: yes | cp -rf cookiecutter_template/${{ parameters.repo_name }}/* ${{ parameters.repo_name }}
            workingDirectory: "$(System.DefaultWorkingDirectory)"
          - bash: git config user.email "cookieCutterProcess@company.com" && git config user.name "Cookiecutter AZDO Process"
            workingDirectory: ${{ parameters.repo_name }}
          - bash: git checkout main
            workingDirectory: ${{ parameters.repo_name }}
            displayName: Switch to main
          - bash: git add --all
            workingDirectory: ${{ parameters.repo_name }}
            displayName: Adding files
          - bash: git commit -m "ops: initialize DS project"
            workingDirectory: ${{ parameters.repo_name }}
            displayName: Commit repo
          - bash: git push
            workingDirectory: ${{ parameters.repo_name }}
            displayName: Pushing files to repo

Remark: Your Azure DevOps Build Service must have permissions to write or change repositories in the target Azure DevOps project.

Note that in this example we only pass parameters such as the project manager and the repository name. In practice, we include all the information about the project that is used to populate the README.md file, the wiki, and the configuration of the Python project.

Automation of wiki creation on Azure DevOps

During the project execution, and to report the outcome, data scientists typically have to document their insights. We concluded that for technical documentation, for example to share information with other developers or data scientists, Jupyter notebooks and Python docstrings (rendered with Material for MkDocs) are enough.

Still, for stakeholders or project/product managers, other information related to the project is necessary. For this documentation, we leverage the Azure DevOps Wiki.

During the creation of the repository, our Cookiecutter template asks for details about the project like:

  • Project manager
  • Data science lead
  • Business unit
  • etc.

This information is usually known before the project starts, so it makes sense to fill it in right away.

As part of the template, we have created a one-page wiki (Markdown format) that contains all this information. So, when the project is created via Cookiecutter the information is there, but not yet published in the Azure DevOps Wiki. An example wiki with Cookiecutter placeholders is shown below:

**{{ cookiecutter.project_name }}**

[[_TOC_]]
## Team
### Point of contact business unit
* {{ cookiecutter.responsible_person_business }}
### Project manager(s)
* {{ cookiecutter.project_manager }}
### Lead Data scientist(s)
* {{ cookiecutter.lead_ds }}
---

## Project Info
### Project summary and scope
* {{ cookiecutter.project_description }}
### Goal of the project from stakeholders’ perspective
* {{ cookiecutter.project_goal }}
### Project duration
* {{ cookiecutter.project_duration }}
---

## Resources
### Links to codebase (repositories and notebooks)
<text here>

### Links to associated resources (Azure DevOps board, Sharepoint,…)
<text here>

### If helpful, high-level architecture (intended for business unit (BU) stakeholders)
<text here>

### Relevant links (e.g., papers)
<text here>

### Relevant technical information (e.g., security rules implemented)
<text here>

---

Implementation

In Azure DevOps there are two types of wikis: the Team Project Wiki and the Code Wiki. The project wiki is part of the Azure DevOps project itself, and there is only one per project. To create more than one wiki, we must use Code Wikis, which publish Markdown files directly from a repository.

In our case, we usually have either a PoC Azure DevOps project where each repository is a PoC, or we have dedicated Azure DevOps projects for each data science solution that has reached at least the MVP phase. In both cases we have multiple repositories that need a dedicated wiki. Thus, in our automation we are using Code Wikis.

As mentioned before, the wiki is already part of the cookiecutter template. What is left is to publish the Code wiki. For that we use the Wikis API in the following way:

url = f"https://dev.azure.com/{organization} /{project_id}/_apis/wiki/wikis"
querystring = {"api-version": "6.0"}
payload = {
"version": {"version": "main"},
"type": "codeWiki",
"name": wiki_name,
"projectId": uui_project_id,
"repositoryId": repository_id,
"mappedPath": "/wiki",
}
response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, token), params=querystring, headers=headers)

Relevant information for the payload is:

  • name – Name that will show up in the Wiki overview
  • projectId – UUID of the project in Azure DevOps
  • repositoryId – ID of the repository that we obtained in the previous step

After this step you should be able to see the new Code Wiki in Azure DevOps.
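If you want the automation to verify the result, the same Wikis API can list the wikis of a project. The snippet below is a minimal sanity check, reusing the variables from the previous snippets.

# Optional sanity check: list the project's wikis and confirm the new Code Wiki exists.
url = f"https://dev.azure.com/{organization}/{project_id}/_apis/wiki/wikis"
response = requests.request(
    "GET", url, auth=HTTPBasicAuth(user_name, token), params={"api-version": "6.0"}
)
wiki_names = [wiki["name"] for wiki in response.json()["value"]]
assert wiki_name in wiki_names, f"Wiki '{wiki_name}' was not created"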

Conclusion

In this article, we have shown how we streamlined repository and wiki creation in Azure DevOps for Henkel data science projects. Data scientists can use the pipeline in a self-service manner by providing some inputs about their project, and after about two minutes all the Azure DevOps resources are created for them.

One limitation of this approach is that the template fits only a portion of all data science use cases in the company. However, even for special projects, our automation provides a good starting point.

In the third part of this series, we are going to take an in-depth look at the Cookiecutter template we created. This template uses MkDocs, ruff, pytest, and hatch, among others. There, we will explain how we leverage all of these tools to successfully kickstart our data science projects.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers.
Learn more at henkel.com/digitalization.
