Data Lifecycle of Machine Learning Project

Saurabh Mishra · Published in Analytics Vidhya · 12 min read · Feb 12, 2022
Photo by Tolga Ulkan on Unsplash

In one of my earlier articles, I described why a machine learning (ML) project is hard and how a model can be managed with MLFlow. That post covered model management in some detail; here I will walk through the life cycle of the data in ML applications.

The ML paradigm differs from traditional software development. In traditional software, we write rules in code (in our favorite programming languages) that act on data to produce expected results, whereas in machine learning the behavior is governed by data: data and answers are fed in as input, and the algorithm learns the rules that produce the output. Predicted outcomes depend entirely on the supplied data. If the data is wrong in any way, the algorithm cannot perform as expected: garbage in, garbage out.

In machine learning, data is crucial. Andrew Ng recognized this and launched a campaign to raise awareness of a data-centric approach to improving model performance, arguing that improving the data often has a greater impact than further hyper-parameter tuning. He emphasizes improving a model's performance and accuracy by being data-centric rather than model-centric.

In the data-centric approach, data is a first-class citizen. But having data alone is not enough unless it is curated and managed well. This is where the data life cycle comes in: it addresses data quality and data management along the data's path from generation through consumption to prediction.

The data life cycle focuses on the various challenges of capturing, storing, processing, analyzing, and curating the data used by all data applications, including ML. It also covers industry-oriented tools and technologies that support services across the end-to-end data flow by mitigating data risk, improving data quality and consistency, and reducing manual processes.

The image below conceptualizes the data flow and its different stages, along with a feedback loop to verify whether the supplied data performs as expected. In the following sections we will touch on a quick definition and function of each stage.

Typical data flow in an ML/AI project

Data Generation

In this digital world, we generate a digital footprint every moment. A Forbes article estimates that 2.5 quintillion bytes of data are generated every day, and this will only accelerate with the further growth of the Internet of Things, social media, and other digital innovations.

Data Collection

A huge amount of data is being generated. During data collection, we focus on data that is complete and statistically significant. We also decide whether to capture all data or only a time-sliced subset to train our model, which often depends on business choices and other practical considerations.

Before we start collecting data, we should ask ourselves some relevant questions that can help us figure out the right tools, technologies, and platform:

Are we interested in all data or a subset of the data?

What patterns are we going to follow?

Can we tolerate losing some data points?

Is this data sufficiently accurate and reliable?

How can stakeholders get access to this data?

What features can be made available by combining multiple sources of data?

Will this data be available in real-time?

How much will this process cost in terms of time and resources?

How will data be updated once the model is deployed?

Will the use of the model itself reduce the representativeness of the data?

Is there personally identifiable information (PII) that must be hidden or anonymized?

Are there features, such as gender, that legally cannot be used in this business context?

Data Processing

Data processing is always slower than generation and consumption, as data requires validation and transformation before it can be consumed for any meaningful insight. Building numerical representations of the different features and feeding them into the model for training is a resource-intensive and iterative process. To make data processing and the algorithm fast and efficient, we first need to understand which kind of approach we need:

  • All cores on a single machine
  • Using a GPU
  • Distributed computing, e.g. Spark and MLlib

Each approach has its own pros and cons. It is also worth noting that not everything can be parallelized. For example, data preprocessing and transformation can be done in parallel, but whether the training step itself can be parallelized depends on the algorithm: some support it, some don't.

However, for parallel processing these tools can be useful (the list below is not exhaustive, just some popular ones).
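
Each of these frameworks has its own setup, but the core idea of parallelizing independent preprocessing work is simple. Below is a minimal sketch using joblib (just one popular option alongside multiprocessing, Dask, or Spark); the clean_record function and the sample records are hypothetical placeholders for illustration.

```python
# Parallelize a per-record preprocessing step across all CPU cores with joblib.
from joblib import Parallel, delayed

def clean_record(record: dict) -> dict:
    # Hypothetical transformation: trim/lowercase strings and drop missing values.
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items() if v is not None}

records = [{"name": " Alice ", "age": 31}, {"name": "BOB", "age": None}]

# n_jobs=-1 uses all available cores; each record is cleaned independently.
cleaned = Parallel(n_jobs=-1)(delayed(clean_record)(r) for r in records)
print(cleaned)
```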

Data Storage

A wide variety of data exists, and depending on the business use case, different varieties and volumes of data are required to train the model. This data can be structured (tabular data) or unstructured (images, audio, video), in anything from small to huge volumes, and a relational database is often not a good fit because of its cost at scale. On managed clouds, storage services exist that offer low cost and data replication to avoid data loss, e.g. SaaS storage such as Azure Blob Storage, ADLS (Azure Data Lake Storage), and S3 (AWS), and IaaS storage such as HDFS (Hadoop Distributed File System).

Data Management

As mentioned earlier, machine learning learns and predicts from the supplied data and its features. If the data and features change, so does the model's output, and we can't afford that behavior if our model is in production. So maintaining versions of the data and identifying the right sources, data quality, and overall governance is important, and all of this comes under the umbrella of data management. Data management supports reproducibility for auditing, debugging, and tracing, and helps when implementing new changes by looking at past behavior. This is quite a big topic, and if you want to learn more, this research paper could be useful.
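
To make the idea of data versioning concrete, here is a toy sketch (my own illustration, not a specific tool's API) that registers each dataset snapshot with a content hash and timestamp, so a model run can later be traced back to the exact data it was trained on. In practice a dedicated tool such as DVC or Delta Lake would handle this.

```python
# Record a content hash plus metadata for each dataset snapshot in a small JSON registry.
import datetime
import hashlib
import json
import pathlib

def register_dataset(path: str, registry: str = "data_registry.json") -> str:
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    entry = {
        "file": path,
        "sha256": digest,
        "registered_at": datetime.datetime.utcnow().isoformat(),
    }
    registry_path = pathlib.Path(registry)
    history = json.loads(registry_path.read_text()) if registry_path.exists() else []
    history.append(entry)
    registry_path.write_text(json.dumps(history, indent=2))
    return digest

# version = register_dataset("training_data.csv")  # hypothetical file
```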

Data Analysis

Data analysis is the heart of data science. We often hear that "data scientists spend 80% of their time on data preparation," and that statement pretty much sums up the process of data analysis. Without it, we cannot confirm which features matter and what impact they have on the model. For that reason, data scientists spend most of their time getting, cleaning, aggregating, reshaping, and exploring data through exploratory data analysis and data visualization.

Proper data analysis can tell us everything that is good and bad about the data. Statistical techniques are used to recognize hidden patterns, discover new features, and assess data completeness. This step is also crucial for establishing confidence in the model. Data visualization is part of data analysis as well; it is used both during the analysis and as a product in its own right.
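
As a rough illustration, a first exploratory pass with pandas might look like the sketch below; the file name and the "label" column are hypothetical and only stand in for your own dataset.

```python
# Minimal exploratory data analysis: shape, missingness, summary stats, class balance.
import pandas as pd

df = pd.read_csv("training_data.csv")                   # hypothetical input file

print(df.shape)                                         # rows and columns
print(df.isna().mean().sort_values(ascending=False))    # fraction of missing values per column
print(df.describe(include="all"))                       # summary statistics
print(df["label"].value_counts(normalize=True))         # class balance (if a 'label' column exists)
```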

Model Training

Finally, we have reached the stage where we train our model with the data we have gathered and curated so far.

At this point, it is also important to evaluate the performance of the model. In general, we use different ensemble techniques and try to find the best combination of parameters and hyper-parameters. But in a data-centric approach, it is just as important to validate the input data, the features, and, most importantly, the correctness of the labels. The precision and accuracy of the model depend on one question: do we have quality (and correctly labeled) data for our model? If not, we need to address that before we proceed further. This process should start right after data collection, and subject/domain experts can help solve it. The points below can be taken as general guidelines for improving data quality (proposed by Andrew Ng in one of his courses).

For Small data-

  • Clean labels are critical.
  • Manual checks can be made through the dataset to find and fix labels (see the sketch after these lists).
  • A labeling rule can be set that all data labelers agree on.

For Big Data —

  • Emphasize the data process.
  • Big data can also have a long tail of rare events, which effectively becomes a small-data challenge within big data.

For Unstructured data —

  • May or may not have a huge collection of unlabeled examples.
  • Humans can label more data.
  • Data augmentation/synthetic data generation is more likely to be useful.

For Structured data —

  • It is more difficult to obtain additional data. ML models typically require at least 100 examples per class to learn to classify that class, so with a large number of classes (e.g. 500) you already need at least 50,000 examples. Data collection is especially difficult when some classes are rare, and with many classes it is likely that some of them are.
  • Human labeling may be difficult for some use cases where we need expert domain knowledge and skills to determine actual ground truth.
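
One simple, hedged way to act on the "clean labels" advice is to compare each label against the out-of-fold prediction of a baseline model and send the disagreements for manual review. The sketch below uses scikit-learn with synthetic data purely for illustration; it is not the method Andrew Ng proposes, just one practical approximation.

```python
# Flag potentially mislabeled examples by comparing labels to out-of-fold predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

preds = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
suspect = np.flatnonzero(preds != y)      # indices where the model and the label disagree
print(f"{len(suspect)} examples to review for possible label noise:", suspect[:10])
```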

Interpretation

At this level, we need an interpretation of the features and the model. It is a crucial step, and to make it reliable we need to understand bias, the behavior of the data, and AI ethics. Explainability of the data along with the model is very important; if it is not done right, it can badly affect the decision-making process. Through understandable and healthy dialogue between developers and users, common ground can be established to make data and models explainable. For instance, these questions are helpful for both parties.

For a decision/recommendation, a user might ask

Why did you do that?

Why not something else?

When do you succeed or fail?

How can I trust you?

How do I correct an error?

And if such a dialogue (explainability of the model and reasoning for any of the above questions) is already established, these questions will no longer trouble the users.

To understand more about interpretability please refer to this blog.
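
As one concrete (and hedged) example of model interpretation, permutation feature importance measures how much the score drops when a feature is shuffled. The sketch below uses scikit-learn with synthetic data; in a real project you would run it on your own model and hold-out set.

```python
# Permutation feature importance: shuffle one feature at a time and measure the score drop.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)   # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")
```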

Data Orchestration

We have covered all the stages separately, but to function as a whole they need to communicate with each other. These stages need an orchestration tool that can glue them together and execute them. Fortunately, many such tools exist, and depending on the platform we choose for our application we can decide which fits best. A few of them are referenced here.
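
To make the idea concrete, here is a minimal sketch of how the stages could be wired together in Apache Airflow, one popular orchestrator among many (Prefect, Dagster, Kubeflow Pipelines, and others work just as well). The ingest/validate/train callables are hypothetical placeholders for your own pipeline steps.

```python
# A tiny Airflow DAG chaining ingestion, validation, and training once a day.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("collect raw data")

def validate():
    print("run data quality checks")

def train():
    print("train and evaluate the model")

with DAG("ml_data_pipeline", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="validate", python_callable=validate)
    t3 = PythonOperator(task_id="train", python_callable=train)
    t1 >> t2 >> t3
```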

Monitoring and Feedback loop

This is the step that verifies and continuously checks the relevance of a model. Model monitoring tools give us visibility into what is happening in production and enable us to take appropriate action to improve the feedback loop. Centralized logging and monitoring help us track performance and take corrective measures for the model by measuring data drift, concept drift, and other parameters/hyper-parameters. Ease of integration, monitoring functionality, alerting, and cost are useful criteria for choosing the right tool. A few examples are below.

Challenges in Data Life Cycle

The data flow from generation to model training, and the related metadata generated in between, requires good management. If it is not managed properly, it becomes difficult to reproduce the model and build trust in it. Below are typical challenges that need attention on the data journey.

Absence of Good Quality Data

In the era of big data, where petabytes and zettabytes of data are being generated, having good-quality data is still a challenge. Andrew Ng recognized this fact and introduced the data-centric approach, proposing that we focus on labeling (the input-output mapping) and on the quality of the data rather than on its sheer volume. Data quality specialists, domain and subject-matter experts, and data stewards are the people who can guide organizations to produce quality data for their analytics and decision-making teams.

Data labeling is another challenge that falls under the data quality domain. Depending on whether the data is small or big and structured or unstructured, human labelers or synthetic data augmentation/generation tools can be useful.

Data Drift

Data drift is the phenomenon where predictions start to behave in an unexpected way because the underlying data and features have changed (there is also concept drift, where the statistical properties of the target change). There are two primary reasons for data drift:

  • Sample selection bias, where the training sample is not representative of the population. For instance, a model built to assess the effectiveness of a hiring program will be biased if past decisions favored a certain group or ethnicity.
  • Non-stationary environment, where training data collected from the source population does not represent the target population. This often happens for time-dependent tasks — such as forecasting use cases — with strong seasonality effects, where learning a model over a given month won’t generalize to another month.

To avoid this situation, we need a good monitoring system to track the data supplied to the model. Regular checks on the data with univariate statistical tests are a common way to detect it.
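
A minimal version of such a univariate check is a two-sample Kolmogorov-Smirnov test comparing a feature's training distribution against recent production values. The arrays below are synthetic stand-ins; the threshold and test choice should be adapted to your data.

```python
# Univariate drift check with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # what the model was trained on
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5000)    # shifted live data

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```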

Data Security and Privacy

Security and privacy comprise confidentiality, integrity, availability, authentication, encryption, data masking, and access control. All of these are hard to control when data flows through several steps where we generally use a mix of tools and services. To handle this better, multiple data privacy laws have been introduced to make sure data is used in a legitimate way. Still, compliance largely relies on the organizations that use personal data, and to create a better world we need to follow these laws and data ethics.
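
As a small, hedged illustration of data masking, PII columns can be pseudonymized with salted hashes before data is shared more widely, so records remain joinable while raw identifiers stay in the secure zone. The column names and salt handling below are illustrative only; production systems would use a proper secrets manager and a vetted anonymization strategy.

```python
# Pseudonymize a PII column with a salted hash before sharing the dataset.
import hashlib

import pandas as pd

SALT = "replace-with-a-secret-salt"   # in practice, load this from a secrets manager

def pseudonymize(value: str) -> str:
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

df = pd.DataFrame({"email": ["alice@example.com", "bob@example.com"], "score": [0.7, 0.4]})
df["email"] = df["email"].map(pseudonymize)
print(df)
```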

Data Sharing

Data sharing is important in terms of accessibility. All components and modules of the system rely on data-sharing technologies. One purpose of data life cycle management is to reuse data for multiple purposes, so it should be shared among the modules of the application. Data of every kind (structured, semi-structured, and unstructured) passes through different locations, systems, and operating environments and is accessed from different platforms; good data sharing helps provide meaningful data at the right time.

Metadata Management

Metadata management is a key process for an organization that wants to become truly data-driven. Effective metadata enables data to be discovered by users, systems, or AI/ML applications; without it, a manual and time-consuming process is required to inspect whatever data is available and decide whether it is relevant. Metadata also enables us to act before the model starts degrading, because we can spot issues by monitoring the metadata of the data and the workflow. Beyond that, it helps with auditing, provenance of facts, comparing the performance of different artifacts, and reproducibility. Managing metadata end to end is a tough job, but done right it enhances an organization's ability to perform analytics over its own data.
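
One lightweight way to start capturing such metadata is to log the dataset version, feature list, and evaluation metrics next to each training run. The sketch below uses MLflow (already mentioned earlier for model management) with hypothetical run names and values; any experiment-tracking or metadata store could play the same role.

```python
# Log dataset and run metadata alongside the model with MLflow.
import mlflow

with mlflow.start_run(run_name="churn_model_v3"):           # hypothetical run name
    mlflow.log_param("dataset_sha256", "9f2c...")            # e.g. the hash from a data registry
    mlflow.log_param("features", "age,tenure,plan_type")     # hypothetical feature list
    mlflow.log_metric("val_auc", 0.87)                       # hypothetical evaluation metric
    mlflow.set_tag("data_owner", "analytics-team")
```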

Data Lineage

Data lineage tracks the data's journey from generation to deployment, including the feedback loop, so that the pipeline can be upgraded and leveraged for optimal performance. Pretty much every organization struggles with this: some maintain it partially and some are still on their way. There are a few tools, e.g. Talend Data Catalog, IBM DataStage, Datameer, and Spark Delta (a good open-source option, also available as the proprietary Databricks Delta), that can help, but implementation is not always straightforward, because across the data life cycle we use different stacks to solve the problem and connecting each dot is a daunting task.

Conclusion

Data is the heart of an ML/AI application, and how data is handled at each stage largely defines the success of any data application. In practice, several implementation challenges exist, and deciding on the right services and tools to create a robust data flow is hard, but if the processes are designed and architected properly, most of these challenges can be mitigated. The Global Data Management Community (DAMA) is a good place to look for data management best practices, but we should take every step with a grain of salt, because data operations for ML/AI are a bit different and rapidly evolving.

Thank you for taking the time to read it 💕

References

https://www.sciencedirect.com/science/article/pii/S1877050920315465

https://www.youtube.com/watch?v=06-AZXmwHjo

https://www.dama.org/cpages/books-referenced-in-dama-dmbok

Book by Chip Huyen — Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications
