Building a collaboration platform for a data science team

Daniel Taube
Data Science at Microsoft
9 min read · Mar 28, 2023
Photo by Jason Goodman on Unsplash.

In the past, data scientists were often considered a “one-man show,” but this is no longer the case. Modern data science projects are commonly created, driven, and maintained by increasingly large groups of data scientists. In recent years, Machine Learning Operations (MLOps) has answered the data science community’s need for a structured approach and documented best practices for the increasingly complex solutions these projects are intended to address.

In this article I introduce an MLOps data science collaboration package. I use a common Machine Learning issue, overfitting, as a practical example to highlight various features of this package. The goal of this walkthrough is to expose you to some common issues that data science teams struggle with and provide you with a structured approach for collaboration and communication within a data science team to overcome these issues.

You can find the package here.

Illustrating data science team collaboration with a Machine Learning project as an example

There are several factors that data science teams must consider when training and testing Machine Learning (ML) models, including the data used for training, the evaluation and testing method, the model type, the features of the model (if applicable), the target to be predicted, and the chosen hyperparameters. Companies that create AI products incorporating ML may use platforms such as Azure Machine Learning to run experiments, save metrics, store data, and create pipelines.

In the sections below, I explore an approach to addressing ML model overfitting. Overfitting is a common issue that many ML data science teams contend with. Overfitting occurs when a model becomes too complex and learns the training data too well, to the point that it fits the noise and random fluctuations in the data rather than the underlying patterns.

I have had success using the approach outlined below to facilitate collaboration among data scientists with different backgrounds and levels of experience, across a variety of projects with some of Microsoft’s larger customers. I believe that teams following this approach can have meaningful, data-driven discussions, leading to better decision making and better results.

Just as importantly — and applicable to many different types of data science projects other than Machine Learning — I introduce a Python-based package that can be used as a simple but powerful framework for collaboration among data scientists. This can serve as a foundation for building a data science collaboration platform.

The data science collaboration package

The data science collaboration package structure is based on the following:

.
└── .ds_team/
    ├── data_class.py
    ├── features_class.py
    ├── model_class.py
    ├── evaluator_class.py
    ├── experiment_class.py
    ├── error_analysis.py
    └── experiment.py
Each file in the package contains a class responsible for a specific step in the data science process. Let’s take a closer look at each of these steps and how they fit together.

The data science flow

Data

The first step of building a data science solution is to source and understand the data. After conducting exploratory data analysis (EDA), the data science team should agree on the dataset to work with and label its version.

Why is this step important? Because if two data scientists start from different datasets, it will be difficult to determine which model is better in the end. Even simple actions such as dropping or filling in missing values can have a significant impact on the final result. A related pitfall to watch for at this stage is data leakage.

According to Kaggle.com, “data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but then the model will perform poorly in production.”

In this example, the data is stored as a Pandas DataFrame and can be loaded from various sources such as a CSV file in storage or a SQL table. However, for simplicity, we will use the Iris dataset provided by the scikit-learn package.

Here is the code for the Data class in the data_class.py file:
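A minimal sketch of such a class might look like the following (the structure and method names are illustrative, assuming the Iris dataset is loaded lazily into a pandas DataFrame and tagged with a version label):

# data_class.py — illustrative sketch
import pandas as pd
from sklearn.datasets import load_iris


class Data:
    """Loads the dataset lazily and records its version."""

    def __init__(self, version: str = "iris-v1"):
        self.version = version
        self._df = None  # loaded only on first access

    def get_data(self) -> pd.DataFrame:
        # Load the Iris dataset on first request and cache it.
        if self._df is None:
            self._df = load_iris(as_frame=True).frame  # features plus a 'target' column
        return self._df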

This way, the Data class can be initialized without loading the data and the data will be loaded only when needed.

Feature engineering

Features are the input variables that the model uses to make predictions. Why is this step important?

First, the quality and relevance of the features can greatly affect the performance of the model. Second, this is where the data scientist can show creativity by exploring the problem and bringing in relevant domain knowledge. The feature engineering step involves selecting, transforming, and creating new features that will improve the model’s performance.

If no additional features need to be calculated, this class can be removed or simply return the original data frame.

Here is the code for the Features class in the features_class.py file:
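As an illustrative sketch (the derived column and method name are assumptions, not taken from the original package), a Features class that adds one engineered feature and otherwise passes the data through might look like this:

# features_class.py — illustrative sketch
import pandas as pd


class Features:
    """Builds model features from the raw data frame."""

    def build_features(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df.copy()
        # Example engineered feature on the Iris data:
        # the ratio of petal length to petal width.
        features["petal ratio"] = (
            features["petal length (cm)"] / features["petal width (cm)"]
        )
        return features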

Model

Once the features are prepared, the next step is model selection. Why is this step important?

The Model class is responsible for fitting the dataset and making predictions. This is the core of data science work, spanning everything from rule-based models to state-of-the-art AI models, and it is the place to explore the different approaches and model varieties the data science world has to offer.

In this example, we implement a class that can train and predict using a Random Forest model from scikit-learn.

Here is the code for the Model class in the model_class.py file:
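A minimal sketch, assuming a thin wrapper around scikit-learn’s RandomForestClassifier (the hyperparameter passthrough is an illustrative choice):

# model_class.py — illustrative sketch
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class Model:
    """Wraps a Random Forest classifier for fitting and predicting."""

    def __init__(self, **hyperparams):
        # Hyperparameters are passed straight through to the estimator.
        self.estimator = RandomForestClassifier(**hyperparams)

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "Model":
        self.estimator.fit(X, y)
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        return pd.Series(self.estimator.predict(X), index=X.index)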

Evaluator

The evaluation step involves measuring the performance of the model on unseen data, using performance metrics such as accuracy, precision, recall, F1-score, or ROC-AUC. The Evaluator class in the evaluator_class.py file shows how to use the classification_report from sklearn to compute the performance metrics.

Why is this step important? Model evaluation is important to assess the efficacy of a model during initial research phases, and it also plays a role in model monitoring. To understand whether your model is working well with new data, you can leverage a number of evaluation metrics and implement them in this class.

For example, which metric should be prioritized: precision or recall? Precision measures the proportion of correct positive predictions among all positive predictions made by the model, whereas recall measures the proportion of correct positive predictions among all actual positive observations. This decision affects the final data science product.

Here is the code for the Evaluator class in the evaluator_class.py file:
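An illustrative sketch built around scikit-learn’s classification_report (returning the report as a dictionary is an assumption that makes the metrics easy to log and compare):

# evaluator_class.py — illustrative sketch
from sklearn.metrics import classification_report


class Evaluator:
    """Computes performance metrics for a set of predictions."""

    def evaluate(self, y_true, y_pred) -> dict:
        # Per-class precision, recall, F1-score and support,
        # returned as a dictionary for easy logging and comparison.
        return classification_report(y_true, y_pred, output_dict=True)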

Experiment

The experiment class is responsible for defining the “rules” under which the data science experiment is run.

Why is this step important? The experiment class serves as the central hub for coordinating and organizing the data science experiment. It brings together all the different components established in the previous classes and allows you to control the flow of the experiment for your data science team. Depending on the skill level of your team members, you can adjust the complexity of the class accordingly.

For example, if the team is mostly composed of junior data scientists, you may want to keep the class simple and straightforward. On the other hand, if your team is more experienced and comfortable with more advanced approaches, you have the flexibility to add more arguments and explore a wider range of possibilities in your experiment. This class provides you with the power to customize the experiment flow to best suit the needs and abilities of your team.

These rules include the train/test split and any filtering on the target variable, along with the evaluation and testing method, the model type, the features of the model (if applicable), the target to be predicted, and the chosen hyperparameters. In this file we use a simple train/test split, but it can easily be changed to k-fold cross-validation or any other cross-validation scheme you find suitable for your challenge. Cross-validation can also be used to evaluate the model on different subsets of the data to obtain a more robust estimate of performance.

The Experiment class coordinates the other classes in the package to execute the experiment: it retrieves the data, generates features, trains and predicts with the model, and evaluates the results.

Here is the code for the Experiment class in the experiment_class.py file:
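A sketch of how such a class could tie the pieces together, assuming the Data, Features, Model, and Evaluator sketches above and a simple train/test split (the argument names and the shape of the returned results are illustrative):

# experiment_class.py — illustrative sketch
from sklearn.model_selection import train_test_split


class Experiment:
    """Coordinates the Data, Features, Model and Evaluator classes."""

    def __init__(self, data, features, model, evaluator,
                 target="target", test_size=0.3, random_state=42):
        self.data = data
        self.features = features
        self.model = model
        self.evaluator = evaluator
        self.target = target
        self.test_size = test_size
        self.random_state = random_state

    def run(self):
        # 1. Retrieve the agreed-upon data and build the features.
        df = self.features.build_features(self.data.get_data())

        # 2. Simple train/test split; this is the place to swap in
        #    k-fold or any other cross-validation scheme.
        X = df.drop(columns=[self.target])
        y = df[self.target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state
        )

        # 3. Train, predict, and evaluate.
        self.model.fit(X_train, y_train)
        y_pred = self.model.predict(X_test)
        metrics = self.evaluator.evaluate(y_test, y_pred)

        # 4. Return a results frame suitable for error analysis.
        results = X_test.copy()
        results["y_true"] = y_test
        results["y_pred"] = y_pred
        return results, metrics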

Implementation

Now that all the necessary classes have been established, we can use them to run our data science experiment. By instantiating each class and passing the instances as arguments to the Experiment class, we can take our data science team to the next level of collaboration and efficiency. One way to strengthen this further is to incorporate MLOps tools such as MLflow, which allow data scientists to experiment with and test various approaches while ensuring consistency in the data and experiment methods.

Here is the code for implementing the data science collaboration package:
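A sketch of what experiment.py could look like with the classes above (the hyperparameters and the optional MLflow logging shown in the comments are illustrative):

# experiment.py — illustrative sketch
from data_class import Data
from features_class import Features
from model_class import Model
from evaluator_class import Evaluator
from experiment_class import Experiment

experiment = Experiment(
    data=Data(version="iris-v1"),
    features=Features(),
    model=Model(n_estimators=100, max_depth=3),
    evaluator=Evaluator(),
)
results, metrics = experiment.run()
print(metrics["weighted avg"])

# Optionally, track the run with an MLOps tool such as MLflow:
# import mlflow
# with mlflow.start_run():
#     mlflow.log_param("data_version", experiment.data.version)
#     mlflow.log_metric("f1_weighted", metrics["weighted avg"]["f1-score"])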

The results

Error Analysis

The Error Analysis class is a useful tool for understanding the results of a data science experiment. It provides a simple way to visualize the performance of a model through plots like the confusion matrix.

Why is this step important? This error analysis step is a best practice and can greatly enhance the level of understanding and insights gained from the experiment. By using the error analysis class, the team can have a more comprehensive understanding of results obtained from the Experiment class. Instead of each team member presenting individual results, the error analysis class allows for a more collective and holistic analysis of the experiment, provides a broader perspective on the performance and areas for improvement, and provides consistency when evaluating experiment results.

In this example, the error analysis class has a single function that plots the confusion matrix for a given set of results. It could easily be expanded to include additional functions for analyzing the performance of a model, from grouping by a key and printing different confusion matrices, to plotting distributions of the results. It’s all up to what you believe is important.
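As an illustrative sketch, using scikit-learn’s ConfusionMatrixDisplay and assuming the results frame produced by the Experiment sketch above:

# error_analysis.py — illustrative sketch
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay


class ErrorAnalysis:
    """Visualizes experiment results for the whole team."""

    def plot_confusion_matrix(self, results: pd.DataFrame) -> None:
        # Expects a results frame with 'y_true' and 'y_pred' columns.
        ConfusionMatrixDisplay.from_predictions(
            results["y_true"], results["y_pred"]
        )
        plt.show()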

After the experiment is over you can implement this class as simply as follows:
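For example, assuming the ErrorAnalysis sketch and the results frame from the Experiment sketch above:

# Illustrative usage
from error_analysis import ErrorAnalysis

ErrorAnalysis().plot_confusion_matrix(results)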

Using the error analysis class can be as simple as calling the “plot_confusion_matrix” function and passing in the results data frame.

The Error Analysis class not only serves as an important tool for understanding the performance of the experiment within the data science team, but it also provides a consistent and valuable way to communicate and share the results and improvements with the business side of the product. This class can be an effective tool to bridge the gap between the technical aspects of data science work and business objectives, allowing for clear and effective communication of the progress and areas for improvement.

Conclusion

In the past decade, we’ve seen significant progress in data science across various industries. Many companies have begun to recruit data scientists and build larger data science teams. This shift marks a significant opportunity for data scientists, but also presents a challenge. How can data scientists with different backgrounds and experiences collaborate effectively and make meaningful decisions based on data?

In this article, I’ve proposed a solution that has worked well for me. I introduced a Python package that serves as a framework for data science collaboration. It consists of several classes, each responsible for a specific step in the data science process, including data handling, feature creation, model training and evaluation, and experimentation. It also includes an error analysis class, which features various functions such as a confusion matrix plot, to help facilitate discussions and identify areas for improvement. By using this package, data scientists can easily compare different approaches and make informed decisions based on the results.

This data science collaboration package provides a consistent framework for running experiments, and the benefit of this approach is that it allows the team to answer questions like:

  • What data version did you use?
  • Did you try to calculate new features?
  • What were the models/hyperparameters?
  • Did you use the same metrics to evaluate?
  • What was your experiment method?
  • What cross-validation approach did you use?

By having a clear record of the experiment setup, it is easier to understand and reproduce the results of different models. In addition, the flexibility of the package allows the team to implement a variety of models and experiment setups.

Daniel Taube is on LinkedIn.
