Kickstarting Data Science Projects in Azure DevOps (Part 1)

Henkel Data & Analytics Blog
Jan 30, 2024


By Roberto Alonso.

Data Science projects are more than models. They start with an action plan for how to solve a business question, the creation of repositories, and an agreement on how to document the project, among other tasks. Data Scientists come from many different backgrounds, so without a concrete strategy we can expect to end up with a diversity of practices.

In this two-part article, we would like to highlight how we kickstart Data Science projects on Azure DevOps. In this context, a Data Science project consists of Azure DevOps resources: an Azure DevOps Board with work items, a Repository with a predefined structure, and a Wiki. In our approach, a Data Scientist answers a simple set of project-related questions and our pipeline creates the Data Science project for them. With this approach we also streamline the way of working for our Data Science use cases.

In this first part of the article, we focus on Azure DevOps Boards and show how we leverage the Azure DevOps REST API to automatically create a pre-filled Azure DevOps Board with work items.

Overview

To motivate you to read this series of articles, we would like to give a first overview of the complete automation. Each Data Science project is composed of three Azure DevOps resources: an Azure DevOps Board, a Repository, and an Azure DevOps Wiki. For simplicity, let’s assume all three resources are located inside an Azure DevOps project called “Data Science PoCs”; however, the approach works across different Azure DevOps projects as well.

To achieve the automation, we have written an Azure DevOps pipeline that orchestrates the creation of all resources.

Azure DevOps pipeline that orchestrates the creation of all resources

The pipeline uses Python and the Azure DevOps REST API to create the Azure DevOps Board (this first article). In a second stage, the pipeline creates the Repository with Cookiecutter; for that we have a Cookiecutter Data Science template that fits our projects. In the last stage, we publish a project Wiki template using Python and the Azure DevOps REST API. The creation of the Repository and the Wiki will be fully covered in a follow-up article.

This pipeline is used by Data Scientists in a self-service manner. It is parametrized so that they can provide relevant project details, which are later used during the project creation. For example, the name of the repository and the name of the product manager are used in our Cookiecutter template and in the Wiki.

The creation of such resources, while trivial in the UI, involves many intermediate steps that are abstracted away from the frontend user. Microsoft provides good documentation for the Azure DevOps REST API; however, some of the steps to create Azure DevOps resources are not evident without context. As an example, to create an Area Path you need to use the classificationnodes API; if you have used Azure DevOps, you know that Classification Nodes are a concept rarely mentioned in the UI or in trainings.

Azure DevOps Board Motivation

Many Data Scientists come from different backgrounds. Thus, it is expected that when joining an organization, they come with their own best practices or questions like:

“How should I title this ticket in the Kanban board?” “Should I put acceptance criteria?” “Can you show me an example of a ticket?”.

Without a concrete strategy, we could end up with diverse Kanban Board structures.

Further, for many Data Science projects we were repeating the same tasks over and over, for example "Connect to DB" or "Exploratory Analysis". Clearly, writing such work items can be automated. However, the challenge at the beginning went beyond pure technical aspects: it really involved sitting down with the Data Scientists to understand how they structure their work, so that this structure is reflected in a Kanban Board.

Prerequisites

Before jumping into the technical details let us mention some permissions you need to make this approach work.

  1. If you run it as a user, you need permissions to create Teams, Repos, and Wikis inside the Azure DevOps project. Typically, this means you need to be part of the Project Collection Administrators group (organization-level permission) or at least the Project Administrators group (project-level permission). More granular permissions can be granted by your Organization Owner.
  2. If you would like to run this in an Azure DevOps pipeline, your Build Service in Azure DevOps must have the permissions to create and modify Teams, Repos, and Wikis as well. These permissions can be granted either by the Organization Owner or the Administrator of the Azure DevOps project.

Automation of Azure DevOps Board creation

Many of us in the Data Science field have noticed that we do “Data collection”, “EDA”, “Setup connection to the DB”, “Implementation of ML pipeline” and so on. It is a set of common tasks that fits a large portion of our Data Science use cases.

Together with the Data Scientists, we defined these common tasks. However, instead of aiming at 100% project coverage, which is hard to achieve given the uniqueness of each project, we decided to cover as many common tasks as possible. As a team, we came up with over 28 common tasks and 10 parent tasks.
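The concrete catalogue is internal to our team, but to give an idea of its shape, such a set of parent tasks and their common sub-tasks can be kept in a plain data structure that the automation later iterates over. The grouping below is purely illustrative; only the task names quoted elsewhere in this article are real examples.

# Illustrative only -- not our actual catalogue: parent tasks mapped to common sub-tasks
COMMON_TASKS = {
    "Project setup": [
        "Azure workspace set up",
        "Setup connection to the DB",
    ],
    "Data understanding": [
        "Data collection",
        "Exploratory Analysis",
        "Design data pipeline",
    ],
    "Modeling": [
        "Implementation of ML pipeline",
    ],
}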

Implementation

Now that we have identified common tasks across projects, we need to implement the automation behind it. To that end, we have developed a Python script that leverages the Azure DevOps REST API.
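All snippets below share a common setup. The following is a minimal sketch of it; the variable names (organization, project_id, user_name, personal_access_token_pat) match the snippets, but the concrete values and the way the PAT is injected (here: an environment variable) are illustrative assumptions only.

import os

import requests
from requests.auth import HTTPBasicAuth

# Azure DevOps organization and project hosting the Data Science PoCs (illustrative values)
organization = "my-organization"
project_id = "Data Science PoCs"

# Identity with the permissions listed in the prerequisites; the PAT is assumed
# to be injected by the pipeline as an environment variable
user_name = "automation-user"
personal_access_token_pat = os.environ["AZURE_DEVOPS_PAT"]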

Overall, the process to create a Kanban board looks as follows:

  1. Create an Azure DevOps team inside the Azure DevOps project “Data Science PoCs”.
  2. Create an Area Path and an Iteration Path.
  3. Associate the team to the Area and Iteration Path.
  4. Create Azure DevOps work items.

Create an Azure DevOps team

One interesting insight related to this process is that an Azure DevOps Team has an Azure DevOps Board associated with it. This means that if you create a Team, the Board is created automatically. Therefore, the first step is to create a Team. This has another benefit: we can assign people to specific projects such that they only have visibility on those Boards they are currently working on.

To create the Team we use the Teams API call as follows:

# Create the team (the Board comes with it automatically)
url = f"https://dev.azure.com/{organization}/_apis/projects/{project_id}/teams"
querystring = {"api-version": "6.0"}
payload = {"name": team_name}

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring)
team_id = response.json()["id"]  # keep the team id; it is needed later to assign the Iteration

Please note that we keep the id of the team that comes back in the response. This variable team_id is used to assign the Area and Iteration to the team.

Create an Area

One way to understand an Area Path is to think of it as a logical grouping of work items (Features, Product Backlog Items, Bugs, etc.) in Azure DevOps; i.e., it is the set of work items that will show up in the Board. While the Area Path is typically associated with a Team in the UI, it is possible to create it with the REST API without associating it.

To create the Area Path we use the Classification Nodes API call as follows:

url = f"https://dev.azure.com/{organization}/{project_id}/_apis/wit/classificationnodes/Areas"
querystring = {"api-version": "6.0"}
payload = {"name": area_name}

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring)

Associate Area and Iteration to the team

A work item belongs to one Area Path and one Iteration Path; without this step, work items won't show up in the new Azure DevOps Board.

To assign an Area to the team, we have used the Teamfieldvalues API call as follows:

url = f"https://dev.azure.com/{organization}/{project_id}/{team_name}/_apis/work/teamsettings/teamfieldvalues"
querystring = {"api-version": "6.0"}
payload = {
"defaultValue": f"{project_id}\\{area_name}",
"values": [{"value": f"{project_id}\\{area_name}", "includeChildren": True}],
}
response = requests.request("PATCH", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring)
return response

Notice that in this case we have used a PATCH call. This is because the team already exists, and we are simply updating one of its properties, namely the Area.

For the assignment of the Iteration, we found out that the only way to assign it is to pass the actual Team ID and use the UpdateIterationsData API call as follows:

url = f"https://dev.azure.com/{organization} /{project_id}/{team_id}/_admin/_Iterations/UpdateIterationsData"
querystring = {"useApiUrl": True, "teamId": team_id, "__v": 5}
payload = {
"saveData": '{"rootIterationId":"' + root_iter_id + '","selectedIterations":[]}'
}
response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring)

Remark: The rootIterationId is the Parent Iteration that is associated with the Azure DevOps Project. It is essential to make the approach work. We couldn't find proper documentation from Microsoft on how to obtain it, besides this discussion on GitHub. To get it, we looked at the network traces from Azure DevOps while assigning this Parent Iteration to a dummy team. In this sense, each Azure DevOps project must be 'manually' onboarded in our automation.
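A possible alternative, which we have not verified against the undocumented UpdateIterationsData endpoint, is that the Classification Nodes API returns an identifier GUID for the project's root Iterations node; this looks like a candidate for rootIterationId. The sketch below only illustrates that assumption.

# Assumption: the "identifier" GUID of the root Iterations node may serve as rootIterationId
url = f"https://dev.azure.com/{organization}/{project_id}/_apis/wit/classificationnodes/Iterations"
querystring = {"api-version": "6.0"}

response = requests.request("GET", url, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring)
root_iter_id = response.json()["identifier"]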

Once these steps are completed, the new team has a Board with an Area and an Iteration associated to it. What remains is to add the work items.

Create Azure DevOps work items

Each work item must contain a Title, Description, Acceptance Criteria, the Area and the Iteration Path. Some work items might contain additional information like the parent-child relationship.

Since we are going to set up the parent-child relationship, and in our case Features are the parent work items, the Features must be created first, before the Product Backlog Items that will sit under them.

To create Features we use the Work Items API call as follows:

workitem_info = []  # information about the work item (an example is shown below)
url = f"https://dev.azure.com/{organization}/{project_id}/_apis/wit/workitems/$Feature"
querystring = {"api-version": "6.0"}
headers = {"Content-Type": "application/json-patch+json"}

# Add the Area Path so that the work item shows up on the new Board
area_path = {
    "op": "add",
    "path": "/fields/System.AreaPath",
    "from": None,
    "value": f"\\{project_id}\\{area}",
}
workitem_info.append(area_path)
payload = workitem_info

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring, headers=headers)
workitem_id = response.json()["id"]
workitem_title = response.json()["fields"]["System.Title"]  # saved to link child items later

Notice that before making the POST request we add all the information (workitem_info) that the work item contains. Typical information includes the Area Path (to make sure it shows up in the Board), the title, the description, etc. We include an example payload with all the properties we use:

workitem_info = [
    {
        "op": "add",
        "path": "/fields/System.Title",
        "from": None,
        "value": "Azure workspace set up",
    },
    {
        "op": "add",
        "path": "/fields/System.Description",
        "from": None,
        "value": "<div><b>What needs to be done?</b> </div><div>New Azure workspace is required for the project. </div><div><br> </div><div> </div>",
    },
    {
        "op": "add",
        "path": "/fields/Microsoft.VSTS.Common.AcceptanceCriteria",
        "from": None,
        "value": "<ul><li>Azure AD groups set up </li><li>Workspace and resources provisioned </li><li>Users got access to the workspace </li> </ul>",
    },
    {
        "op": "add",
        "path": "/fields/System.AreaPath",
        "from": None,
        "value": f"\\{project_id}\\{area}",
    },
]

Remark: The Description and Acceptance Criteria fields accept HTML. This is a nice functionality that we use to properly format the work items; for example, we included a bullet list in the Acceptance Criteria above.

The process to add Product Backlog Items is similar to Feature work items:

workitem_info = []  # information about the work item (an example is shown below)
url = f"https://dev.azure.com/{organization}/{project_id}/_apis/wit/workitems/$Product%20Backlog%20Item"
querystring = {"api-version": "6.0"}
headers = {"Content-Type": "application/json-patch+json"}

# Again, add the Area Path so that the work item shows up on the new Board
area_path = {
    "op": "add",
    "path": "/fields/System.AreaPath",
    "from": None,
    "value": f"\\{project_id}\\{area}",
}
workitem_info.append(area_path)
payload = workitem_info

response = requests.request("POST", url, json=payload, auth=HTTPBasicAuth(user_name, personal_access_token_pat), params=querystring, headers=headers)

As part of the information of the work item, in the workitem_info list we include information about the parent work item. An example payload for Product Backlog Items is shown below:

workitem_info = [
    {
        "op": "add",
        "path": "/fields/System.Title",
        "from": None,
        "value": "Design data pipeline",
    },
    {
        "op": "add",
        "path": "/fields/System.Description",
        "from": None,
        "value": "<div> Based on the results of data understanding and data sourcing discussion design the data pipeline </div>",
    },
    {
        "op": "add",
        "path": "/fields/Microsoft.VSTS.Common.AcceptanceCriteria",
        "from": None,
        "value": "<div>Documentation of data pipeline design (e.g. architecture) </div>",
    },
    {"op": "add", "path": "/fields/System.Tags", "value": "#DataEng"},
    {
        "op": "add",
        "path": "/relations/-",
        "value": {
            "rel": "System.LinkTypes.Hierarchy-Reverse",
            "url": f"https://dev.azure.com/{organization}/{project_id}/_apis/wit/workitems/{parent_workitem_id}",
        },
    },
]

Let's briefly talk about the last property of the Product Backlog Item, /relations/-. Its value "rel": "System.LinkTypes.Hierarchy-Reverse" indicates that there will be a parent-child relationship. This is not the only relationship type that we can set; more can be seen here. The url is a reference to the parent card in the form of an Azure DevOps URL. Before a work item is created, we don't know which ID it will be assigned. This is the reason why we saved the Title and the ID of the Feature cards (see code above). With the title, we simply join the PBI with the Feature and assign the correct parent ID.
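To make that join concrete, here is a small self-contained sketch of the idea; the dictionary contents, the IDs, and the parent titles below are illustrative only and are not taken from our internal script.

# IDs returned by the Feature creation calls, keyed by the saved titles (illustrative values)
feature_ids = {
    "Azure workspace set up": 1201,
    "Data pipeline": 1202,
}

# Each Product Backlog Item template only remembers the title of its parent Feature
pbi_templates = [
    {"title": "Design data pipeline", "parent_title": "Data pipeline"},
    {"title": "Connect to DB", "parent_title": "Azure workspace set up"},
]

for pbi in pbi_templates:
    parent_workitem_id = feature_ids[pbi["parent_title"]]
    # parent_workitem_id then goes into the "/relations/-" entry of the PBI payload shown above
    print(pbi["title"], "->", parent_workitem_id)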

Lastly, the whole process of creating the Board (Team, Area, Iteration, and work items) happens in the Azure DevOps pipeline we have written; the pipeline simply executes the script.

Conclusion

In this article, we have shown how we managed to streamline the creation of an Azure DevOps Board. The Board is automatically filled with common Data Science tasks to save time. While these work items cover only a portion of the use cases, we argue that they are a good starting point for many Data Science projects.

To achieve the automation, we used the Azure DevOps REST API. To use the REST API effectively, it is necessary to understand some Azure DevOps concepts. We hope that with this article some of these concepts become clearer for the reader, and that they can follow a similar approach towards the automation of their own Data Science projects.

In a follow-up article, we will explain how we took a step forward and automated the creation of a Git Repository and a project Wiki page.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers. Learn more at henkel.com/digitalization.
