The feature store is a concept that is attracting the attention of data scientists: a place to store and re-use features across different projects. Feature generation for ML algorithms is often time- and resource-consuming, so a store of pre-built features that you can simply access to load your tensors sounds very attractive. However, there are a few challenges.
- Feature storage is not just a storage problem. Features go through complex transformations before being used for training, so data scientists need to understand the assumptions behind a feature and the nature of the transformations that have been applied to the data. The full provenance of each feature must also be available for data scientists to analyze.
- ML features are not a static source of data. One training project may choose to fill missing values with zeros, while another may prefer an average based on other dimensions. We therefore end up with multiple versions of the same feature, and these versions proliferate and are difficult to track and index.
- Updating data pipelines to tweak features is a complex change-management process. Features need tweaking as data scientists learn about the problem and develop new insights into feature requirements, but a heavyweight feature-generation pipeline built on Hadoop or Spark is often managed by a different team, not the data scientists. Any change to that pipeline requires coordination across multiple teams.
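The versioning problem above can be made concrete with a small sketch. The column names and imputation strategies here are hypothetical, but they illustrate how two projects can legitimately derive different versions of the "same" feature:

```python
import pandas as pd

# A hypothetical "age" feature with missing values, shared by two projects.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b"],
    "age": [20.0, None, 40.0, None],
})

# Project 1: impute missing values with zero.
age_v1 = df["age"].fillna(0)

# Project 2: impute with the mean of the same segment.
age_v2 = df["age"].fillna(df.groupby("segment")["age"].transform("mean"))

# The "same" feature now exists in two divergent versions.
print(age_v1.tolist())  # [20.0, 0.0, 40.0, 0.0]
print(age_v2.tolist())  # [20.0, 20.0, 40.0, 40.0]
```

Without provenance, a consumer of either version has no way to know which imputation was applied.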
The ML training process usually consists of multiple stages of feature transformation before launching a training algorithm. This is true even if the team is using a feature store, because feature tweaking is an essential part of ML training.
MLFlow and features
MLFlow has become very popular for tracking the ML training process. ML experiments can be captured with MLFlow along with the code and data artifacts, providing full provenance for the models.
MLFlow can also be used to help orchestrate a multi-stage training experiment, where the output of one stage is captured and fed into the next stage. This needs some help from the underlying compute platform, but enables significant optimizations in terms of time and compute costs.
The InfinStor platform enables this orchestration out of the box; however, in this article I want to highlight how MLFlow can help with feature organization as well. The following approach has worked well for some of our customers at InfinStor.
- Break down the training code into stages of data processing. This is often how data scientists implement ML code, but due to the lack of a suitable platform, all of the code gets executed as a single Python executable. The InfinStor platform allows the stages to be executed independently as separate MLFlow runs.
- A separate MLFlow run for feature tweaking captures not only the feature-transformation code as an MLFlow artifact but also the output it generates. Both are recorded in your experiment under a unique run-id, which can be given an easy-to-remember name.
- This MLFlow run becomes the input for subsequent stages of processing.
- This run can be used as input for many different experiments, or for multiple runs of the same experiment. Essentially, you have a feature appropriately indexed by MLFlow, along with its complete provenance.
- The InfinStor platform makes this easy by letting you specify an MLFlow run as the input for any ML process.
This approach addresses most of the issues highlighted earlier.
- MLFlow doesn’t just provide an index into a storage system; it also captures the entire context in which the feature was generated, including the transformations, sources, and configurations involved.
- This approach captures a real feature that is used by a data scientist in their training. In other words, this approach lets data scientists decide and manage what exists as a feature.
- Any update to the feature is naturally captured in the next MLFlow run, and each run captures a snapshot of the feature. Essentially, features are generated as part of the same ML training pipeline instead of requiring a separate infrastructure.
Cloud storage is highly scalable and cost-effective; however, feature storage also needs to track provenance for each version of a feature, and should be manageable by data scientists themselves. Data scientists are already using MLFlow to track their experiments and models, so it makes sense to track features with MLFlow as well.
Please visit www.infinstor.com if you would like to experiment with some of these ideas.