How to speed up Data Science in your company

Matheus Rodrigues Rugolo
Comunidade XP
Dec 19, 2022 · 4 min read


The first step (and sometimes the last one) in consolidating your data science team is to guarantee that it can deliver value as quickly as possible. Nowadays we have tools such as Feature Stores and AutoML to facilitate this.

WHY SHOULD I HAVE A FEATURE STORE?

When structuring your workflows, you should know how data scientists spend their time at work.

Survey on how data scientists spend their time (source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=64cbc71f6f63)

Most of the time, they will be working with data sources to obtain the desired features, and this is where a feature store shines, saving your team resources.

Here is a basic diagram of where it fits in your workflow and what it helps you with.

Diagram of a basic feature store. (source: https://neptune.ai/blog/feature-stores-components-of-a-data-science-factory-guide)
  • Feature registry
  • Monitoring of features over time
  • Managed transformations
  • A single source for storage
  • A single source for serving
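
To make this concrete, here is a minimal sketch of what registering a feature could look like with Feast, a popular open-source feature store (the article does not name a specific tool; the entity, column names, and file path below are hypothetical, and other platforms expose similar concepts):

```python
# Minimal sketch of a shared feature definition using Feast.
# All names below (customer_id, avg_purchase_value, ...) are hypothetical.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

# The business entity our features are keyed on.
customer = Entity(name="customer_id", join_keys=["customer_id"])

# Where the raw feature values live (the offline store source).
purchases_source = FileSource(
    path="data/customer_purchases.parquet",
    timestamp_field="event_timestamp",
)

# The registered feature view: one definition the whole team shares.
customer_purchases = FeatureView(
    name="customer_purchases",
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name="avg_purchase_value", dtype=Float32),
        Field(name="purchase_count_30d", dtype=Float32),
    ],
    source=purchases_source,
)
```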

Most data platforms offer ways to implement one efficiently and with little effort.

Make sure your team is not building features individually for a single use case; if they do, those features should be published through the feature store so the whole team can benefit.
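
For instance, once a feature view like the one sketched above is registered, any teammate can build a point-in-time correct training set from it without re-implementing the transformations (again assuming Feast and the hypothetical names above):

```python
# Sketch: pulling shared features for training from the feature store.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at the shared feature repo

# The entities and event timestamps we want training rows for.
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2022-12-01", "2022-12-01"]),
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_purchases:avg_purchase_value",
        "customer_purchases:purchase_count_30d",
    ],
).to_df()
```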

We will certainly hear a lot more about this in the future. Though it is a relatively new concept, there is no doubt it is one of the key components of a data science infrastructure, and it will continue to evolve rapidly.

REASONS TO AUTOMATE YOUR ML PIPELINE

Now that we have established the feature store, how do we get value from it as quickly as possible?

SageMaker, Google AutoML, PyCaret, etc. You have certainly heard about the many AutoML solutions available for a modest price, right?

While they can certainly add value to your products, here are some reasons to consider building a custom AutoML solution for your Data Science team.

1 — Reusable Components

However you choose to create your standard AutoML pipeline, you will certainly find many ways to extend pieces of it during the process. It can also be a good way to teach your Data Scientists software engineering principles that spare them a lot of time on repetitive tasks.
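
As an illustration, a reusable component can be as small as a scikit-learn compatible transformer that anyone on the team can drop into their own pipeline (a sketch, not a prescription; the transformation itself is just an example):

```python
# Sketch of a reusable pipeline component: a scikit-learn transformer
# the whole team imports instead of rewriting in every project.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogScaler(BaseEstimator, TransformerMixin):
    """Applies log1p to skewed numeric features; usable in any Pipeline."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))
```

Because it follows the scikit-learn interface, it composes directly, e.g. `Pipeline([("log", LogScaler()), ("model", LogisticRegression())])`.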

2 — Improve code quality

Maintaining a single source for your components will facilitate peer review within your team. This will not only save you time but also help you avoid silent bugs in your ML pipelines.

Let’s say you have a hundred different models, each with its own implementation of a very similar feature prep: how much effort would it take to maintain them all if your data contracts change?
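
One way out is to keep a single, shared preprocessing definition that every model pipeline imports, so a contract change means one edit instead of a hundred. A sketch, with hypothetical column names:

```python
# Sketch: a single source of truth for feature prep across all models.
# Column names ("age", "income", "segment") are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor() -> ColumnTransformer:
    """If the data contract changes, only this function changes."""
    return ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ])
```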

3 — Model Validation Baseline

(Embedded video: evaluation metrics for classification models)

While this is certainly a hot topic of debate for any data science team, having to come up with a baseline will greatly increase the quality of your models.

The objective here is not only to decide on a threshold for a metric, but also to make sure the pipeline respects the characteristics of the training set, such as stable timestamps, entities, weights, etc.

Cherish this opportunity to customize dataset splits, cross-validation techniques, evaluation plots, and so on. Being creative and improving your process over time is key.
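
To sketch what such a baseline gate could look like for a classification task with time-ordered data (the metric and the margin below are arbitrary choices you would tune to your own process):

```python
# Sketch: a candidate model must beat a trivial baseline under a
# time-aware split before it is allowed into production.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def beats_baseline(model, X, y, margin=0.02):
    """Accept the model only if it beats the majority-class baseline."""
    cv = TimeSeriesSplit(n_splits=5)  # respects the temporal order
    baseline = cross_val_score(
        DummyClassifier(strategy="most_frequent"), X, y,
        cv=cv, scoring="f1_macro",
    ).mean()
    candidate = cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    return candidate >= baseline + margin
```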

4 — Improve Model Monitoring

By this point you have certainly improved your MLOps environment, and the value generated by your ML products will now be taken for granted.

It is time to make sure your models keep performing consistently over time. A good starting point would be detecting data drift in its different manifestations.
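
As a simple sketch of one such check, a two-sample Kolmogorov-Smirnov test can compare each numeric feature's training distribution against live data (the significance level here is an arbitrary choice):

```python
# Sketch: flag per-feature covariate drift with a two-sample KS test.
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(train_df: pd.DataFrame, live_df: pd.DataFrame,
                     alpha: float = 0.01) -> list:
    """Return numeric columns whose live distribution differs from training."""
    drifted = []
    for col in train_df.select_dtypes("number").columns:
        _, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if p_value < alpha:  # reject the 'same distribution' hypothesis
            drifted.append(col)
    return drifted
```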

Observability of your computing resources and pipeline executions, with tools like Apache Airflow, is also a good way of keeping everything running as smoothly as possible.

Integrating notifications from such systems into your team’s communication tools, like Slack or Skype, ensures failures are noticed as soon as they happen.
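
Putting those two points together, here is a minimal sketch of an Airflow DAG whose failure callback posts to a Slack incoming webhook (the webhook URL and the training logic are placeholders):

```python
# Sketch: an Airflow task that notifies Slack when it fails.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_slack(context):
    """Failure callback: post the failed task id to the team channel."""
    task_id = context["task_instance"].task_id
    requests.post(SLACK_WEBHOOK_URL,
                  json={"text": f"ML pipeline task failed: {task_id}"})

def retrain_model():
    ...  # your training logic goes here

with DAG(
    dag_id="ml_retraining",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="retrain_model",
        python_callable=retrain_model,
        on_failure_callback=notify_slack,
    )
```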

Together, a Feature Store and an AutoML solution can make the development of machine learning models more efficient and effective. Check out the next articles to learn how we built a robust pipeline for automatic machine learning experiments.
