An Overview of the Data Science Lifecycle

Honzik J
13 min read · Mar 5, 2023


A nine-step approach to the data science lifecycle

If you’re new to the world of data science, you may be overwhelmed by the sheer amount of information available. Whilst the data science lifecycle can seem complex, with a solid overview, you can navigate it like a pro.

In this article, we’ll take you on a journey through the data science lifecycle, exploring everything from business understanding to model deployment. I will illustrate the data science lifecycle as I have personally experienced it, so there may be variations depending on where you work, the complexity of your data, or regulatory constraints. Regardless, there should be a few takeaways for the beginner. So grab a cup of coffee, sit back, and let’s dive in!

1. Business understanding

The first stage of the data science lifecycle is business understanding, which involves grasping the business problem and defining the project's objectives. This stage is crucial to the project’s success, as it sets the foundation for the entire data science process.

This requires engaging with stakeholders to understand their needs and requirements. It is important to fully understand the business problem before jumping straight into technical approaches. It is also possible that stakeholders have unrealistic expectations or may not fully understand the technical limitations. Therefore, it is essential to have a good grasp of the business problem to avoid misunderstandings and ensure that the project is aligned with the business goals.

It is also necessary to consider the required level of complexity. For example, machine learning models are typically more complex and may require more data, time, and resources to develop and maintain. Therefore, it is essential to consider whether a machine learning model’s complexity and cost are justified by its potential benefits. We should adhere to the principle of parsimony, choosing the simplest model that answers the business problem.

Another important consideration is the level of explainability required for the solution. The business question, regulatory framework or ethical considerations may require you to explain how the solution was arrived at, and the factors that influenced the outcome. For example, a financial institution may need to explain how a credit scoring model arrived at a certain decision. In some cases, explainability may be more important than the model's predictive power.

Once we have defined the business question, it is time to think about how to translate this into a technical problem. For example, we might start formulating this as a machine learning problem. We would start thinking about what data points we will need, whether it will be a supervised or unsupervised learning method and whether it is regression, classification or something else.

As data scientists, it is our responsibility to understand both the business domain and the technical aspects. The success of a data science project depends on how well we can grasp both of these. If the problem is not formulated correctly, it may lead to incorrect or irrelevant results. It is also a good idea to communicate with stakeholders consistently throughout the entire lifecycle. As we progress, we are bound to hit roadblocks, discover new questions that need answering, or find we need to pivot our approach. By keeping our stakeholders looped in, we ensure that our solution stays aligned with the business objective.

2. Data collection

Once we have understood the business problem and started formulating the technical solution, we can start collecting the data we need.

The process of collecting data can be challenging, especially when working with a large number of data sources. Acquiring data from different systems and sources can lead to data inconsistency and duplication issues, which can negatively impact the quality of the analysis. We must consider the reliability of the data sources and their compatibility with the analysis tools. If the data we are feeding in is of poor quality, our outputs will likely also be of poor quality. If one of the data sources breaks, it could lead to downtime.

As a part of data collection, information management and data governance also play a crucial role. It is essential to ensure that the definitions and metrics used in the analysis are consistent with the business objectives and are correctly defined. This is a point where we can check in with our stakeholders to ensure that we are on the same page. By implementing proper data governance and information management practices, we can ensure that the data is trustworthy and can be used for making informed business decisions.

3. Exploratory data analysis (EDA)

Once we’ve collected some data, we can start doing an exploratory data analysis, or EDA. This is part art, part science, and there is no set way of approaching an EDA. But the overall objective is to understand the characteristics of our data and how we might handle it. Note that the stage at which we do this is flexible: in some cases, it helps to do it before we start transforming or pre-processing our data, and in other cases, we will want to do it after. In most cases, we will constantly be learning things about our data as we progress through the lifecycle.

The general goals of EDA are:

  • Assessing data quality
  • Identifying key variables
  • Visualising the data
  • Uncovering patterns in the data
  • Formulating hypotheses

A good starting point for an EDA (and a time-saver) is to use a data profiling tool, which handles many of these tasks for you, such as counting missing values, identifying categorical and numerical variables, and performing a general data quality assessment. For Python users, Pandas Profiling (now maintained as ydata-profiling) is a quick and efficient way to profile your data.
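
To make this concrete, here is a minimal sketch of profiling a dataset. The `churn.csv` file name is a hypothetical placeholder, and the import reflects the package's current name:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # formerly pandas-profiling

# Load the dataset (churn.csv is a hypothetical file name)
df = pd.read_csv("churn.csv")

# Generate an HTML report covering missing values, variable types,
# distributions, correlations, and duplicate rows in one pass
profile = ProfileReport(df, title="Churn Data Profile")
profile.to_file("churn_profile.html")
```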

4. Data pre-processing

Once we have a good handle on our data, we can transform the raw data into a suitable format for analysis and modelling. Data pre-processing aims to prepare the data in a way that maximises the accuracy and effectiveness of the subsequent modelling stages. Some core tasks are data cleaning, feature selection, feature engineering, and creating a training and testing split (for supervised learning).

Data cleaning

Data cleaning tasks may include removing outliers, dealing with missing values (imputation), and removing duplicates. Data cleaning is important because it can help improve the accuracy and efficiency of subsequent modelling stages: poor quality in, poor quality out. How we handle things like missing values depends on the problem itself, and there are multiple strategies to choose from. This is a topic in and of itself.
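
As an illustration, here is a minimal cleaning sketch in pandas, assuming a hypothetical churn dataset; real projects will need strategies tailored to the data:

```python
import pandas as pd

df = pd.read_csv("churn.csv")  # hypothetical input file

# Remove exact duplicate rows
df = df.drop_duplicates()

# Impute missing numeric values with the column median,
# and missing categorical values with the most frequent value
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# A crude outlier treatment: clip numeric columns to the 1st-99th percentiles
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].clip(df[col].quantile(0.01), df[col].quantile(0.99))
```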

Transforming features

We will also need to consider how we handle specific features. Many models only take numerical inputs, meaning we must encode categorical variables as binary indicator variables (known as one-hot encoding). Scaling our numerical features can also help boost performance in certain types of models.
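
For example, here is one way to do both with pandas and scikit-learn; the column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode categorical columns into binary indicator variables
df = pd.get_dummies(df, columns=["contract_type", "payment_method"])

# Scale numeric features to zero mean and unit variance
numeric_cols = ["monthly_charges", "total_charges"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```

Note that in practice the scaler should be fitted on the training data only (see the note on the test set below), so that no information from the test set leaks into pre-processing.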

Feature selection

Feature selection involves identifying the most important features in the data and removing irrelevant or redundant features. This can help simplify the model and improve its accuracy and efficiency. One approach to feature selection is to use techniques such as correlation analysis or tree-based models to identify the most important features. Removing highly correlated features is another step to consider in data pre-processing, as correlated features can destabilise some models and make it more difficult to interpret the relative importance of each feature.
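
One common sketch of this idea, shown below, drops one feature from any pair whose absolute correlation exceeds a threshold; the 0.9 cutoff is an arbitrary choice for illustration:

```python
import numpy as np

# Absolute pairwise correlations between numeric features
corr = df.select_dtypes(include="number").corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag one feature from every pair correlated above 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```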

Feature engineering

Feature engineering involves creating new, more meaningful features from the existing data that may be more useful in predicting the target variable. This may involve creating new variables based on existing ones or transforming existing variables. For example, suppose we’re trying to predict customer churn. In that case, we may engineer features such as the customer’s tenure, their recent purchase history, or their overall level of engagement with the company. This is where our domain knowledge comes into play, as it allows a more informed approach to feature engineering rather than blind experimentation.
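
Continuing the churn example, here is a sketch of a few engineered features; the raw column names (`signup_date`, `last_purchase_date`, and so on) are hypothetical:

```python
import pandas as pd

today = pd.Timestamp("2023-03-05")

# Tenure: how long the customer has been with us, in days
df["tenure_days"] = (today - pd.to_datetime(df["signup_date"])).dt.days

# Recency: days since the customer's last purchase
df["days_since_purchase"] = (today - pd.to_datetime(df["last_purchase_date"])).dt.days

# Engagement: purchases per login over the last 90 days
df["purchases_per_login"] = df["n_purchases_90d"] / df["n_logins_90d"].clip(lower=1)
```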

Prepare train and test set

The final step involves splitting the data into training and testing sets. The training set is used to build the model, while the testing set is used to evaluate its performance on new data (as our goal is to create a model that generalises well to unseen data). It’s essential to ensure that the training and testing sets are representative of the overall population and are randomly sampled from the data. It is important to note that the test set should not be used for any part of the modelling process, including pre-processing and feature selection. Doing so causes data leakage, where the model appears to perform well on the test set but poorly on new, unseen data.
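
With scikit-learn this step is a one-liner; `churned` is a hypothetical target column:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["churned"])  # features
y = df["churned"]                 # target

# Hold out 20% for evaluation; stratify preserves the class balance in
# both splits, and random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```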

5. Model building

Model building is the stage of the data science lifecycle where we train and evaluate predictive models using our pre-processed and scaled data. This is the stage where we choose the appropriate algorithm for our problem and tweak the hyperparameters to achieve the best performance.

Select an algorithm

The first step in model building is to select an appropriate algorithm. This is often based on the problem type and the characteristics of the data. First, we will need to know whether our problem is supervised or unsupervised and then whether it is regression or classification (supervised) or something else like clustering (unsupervised). There are many different algorithms to choose from, each with its strengths and weaknesses. Some popular algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks. The choice of algorithm depends on the nature of the data, the complexity of the problem, and the performance metrics we’re interested in.

Train your model

Once we have selected an algorithm, we can begin training it on the data. This involves using the training data (prepared earlier) to estimate the model parameters. The goal of training is to find the set of parameters that minimises the loss function, which is a measure of how well the model fits the data. Depending on the problem type and the modelling approach, many different loss functions can be used. Once training is done, we have our model.
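
In scikit-learn, for example, training reduces to a single `fit` call; the sketch below uses logistic regression on the split prepared earlier:

```python
from sklearn.linear_model import LogisticRegression

# Fitting estimates the model parameters by minimising the loss function
# (log loss, in the case of logistic regression) on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```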

6. Model validation

When we have a trained model, we can use it to make predictions on a test set. Remember, this is the unseen data we have kept from the model. We know the answers, but the model doesn’t. This lets us test the model and see how well it guesses the answers. This is how we evaluate the model’s performance on new, unseen data and make sure it’s not overfitting to the training set.

Assessing model performance

The evaluation of the model is typically done using various validation metrics, each with its own merits and applications. Some examples are accuracy, precision, recall, R² (the coefficient of determination), and Mean Squared Error (MSE). Keep in mind that there may be tradeoffs between different metrics, and it’s up to the business stakeholders to decide which metric(s) are most important for the problem at hand.
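
For a classifier, computing these metrics might look like the following sketch, reusing the `model` and the test split from earlier:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)

# Each metric captures a different aspect of classification performance
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```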

Overfitting

Overfitting is when the model memorises the details of the training set too well and doesn’t generalise well on new data. Think of this as memorising the exact answers of a multiple-choice test without reading or understanding the actual questions. You can do very well on that exact test, but you'll perform poorly as soon as you’re faced with even a slightly different test. We want to ensure the model ‘understands’ the pattern in the data.

Reaching satisfactory performance

Whether you’ve reached a sufficient level of performance will depend on the specific business problem you’re trying to solve. For example, if you’re trying to predict customer churn for a subscription service, you might aim to achieve a certain level of accuracy in your predictions. Alternatively, if you’re trying to optimise a marketing campaign, you might be more interested in maximising the number of conversions or minimising the cost per conversion.

Improving model performance

If your current model isn’t performing well, don’t stress. Model building is an iterative process, and you almost certainly won’t get it on the first try. This is where you will experiment and find out what works best (prepare to train many models). You can try the following things:

  1. Revisit your data: Sometimes, the problem is with the data you’re working with. You can go back to feature selection (adding or removing features) or feature engineering (creating more meaningful features), or getting more data could be the solution. This might involve collecting additional data or finding new sources of data to work with. You can make substantial improvements in this area.
  2. Try a different algorithm: For example, if you’re using linear regression, you might try a decision tree or a neural network instead. Different algorithms can be better suited for different types of data and problems, so experimenting can be an effective way to improve your results. If you’re using an overly complex algorithm, it may be overfitting, and you can try a simpler one.
  3. Change the hyperparameters: Lastly, you can tune the model’s hyperparameters. In machine learning, hyperparameters are like the settings that help the model learn and make predictions. Imagine you want to build a robot that can clean your room. You have all the parts you need, but you must decide how fast it should move, how long it should pause between movements, and how sensitive it should be to finding obstacles. These are like the robot’s hyperparameters: they don’t change as the robot works, but you have to choose the right values for them to make the robot work best. Changing these settings can make your model perform its task better, and you can tune them systematically to find the optimal values (see the sketch after this list).
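
As a sketch of systematic tuning, here is a small cross-validated grid search with scikit-learn; the grid values are illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to try
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
}

# Each combination is trained and scored with 5-fold cross-validation
# on the training data only; the held-out test set stays untouched
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```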

7. Communication of insights

If you are not deploying the model and are simply sharing insights with stakeholders, your focus will be on communicating your findings in an accessible way that ties it back to the defined business problem.

Data visualisation

Data visualisations and storytelling can be powerful tools to convey complex ideas clearly and engagingly. You may create graphs and charts highlighting key findings or trends. Another option is using interactive dashboards to allow stakeholders to explore the data themselves. Different visualisations may be more effective for different types of data, so you may need to experiment with different approaches to see what works best to convey your message.
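
Even a simple chart can carry a finding; the sketch below, with hypothetical column names, compares churn rates across contract types using matplotlib:

```python
import matplotlib.pyplot as plt

# Average churn rate per contract type (hypothetical columns)
churn_by_contract = df.groupby("contract_type")["churned"].mean()

churn_by_contract.plot(kind="bar")
plt.ylabel("Churn rate")
plt.title("Churn rate by contract type")
plt.tight_layout()
plt.show()
```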

Data storytelling

In addition to visualisations, data storytelling can be an effective way to communicate insights. This involves framing your analysis as a narrative with a clear beginning, middle, and end. You may choose to use real-world examples or anecdotes to make the insights more relatable and use data to support your points.

Consider your target audience

As with any form of communication, keeping your audience in mind and tailoring your message accordingly is essential. You may need to adjust the level of detail, use language that is accessible to non-technical stakeholders, and provide context for your findings.

Ultimately, sharing insights with stakeholders aims to help them make better decisions based on data. By presenting your findings clearly and engagingly, you can help them understand the implications of the data and make informed choices.

8. Model deployment

Once you have a model you are happy with, you may wish to deploy it in a real-world setting. This means you will need to integrate the model into your existing systems or create new ones to support it. This process can involve collaboration with other teams, such as IT, engineering, and operations.

Testing for robustness and reliability

Before deploying the model, it is essential to thoroughly test it to ensure it is robust and reliable. This may involve testing the model in a staging environment, where it is subjected to various scenarios to ensure it performs well under different conditions. It is important to keep in mind that the model's success is not solely determined by its performance metrics but also by how well it meets the needs of the business. It is essential to continually assess the impact of the model on the business and to ensure that it is providing real value.

Model monitoring

Additionally, it is crucial to monitor the model’s performance once it is deployed to ensure that it continues to perform well over time, and to keep iterating on it. Data drift (a shift in the distribution of the input data) and model drift (a shift in the relationship between the inputs and the target) can both occur after a machine learning model goes into production. Therefore, it is important to regularly monitor the model’s performance and re-evaluate the model when necessary. This can involve retraining the model on new data, adjusting the model’s hyperparameters, or changing the model architecture to better suit the current business problem or data distribution.
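
A very simple drift check, sketched below, compares the live distribution of a single feature against its training distribution using a two-sample Kolmogorov-Smirnov test; `live_scores.csv` and the column name are hypothetical:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical export of recent production inputs
live_data = pd.read_csv("live_scores.csv")

# A small p-value suggests the feature's distribution has shifted
stat, p_value = ks_2samp(X_train["monthly_charges"], live_data["monthly_charges"])
if p_value < 0.05:
    print("Possible data drift in monthly_charges - consider retraining")
```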

9. Ethical considerations

As data scientists, it is important to be aware of the ethical considerations involved in model building. While I’ve put this step at the end of the lifecycle, we should be considering the ethics of what we implement throughout the entire lifecycle, starting with business understanding. The use of data can have a significant impact on individuals, communities, and society as a whole. Therefore, it is crucial to ensure that data and models are used ethically and responsibly.

Key considerations

Below are some key ethical considerations that data scientists should keep in mind when building models:

  1. Bias: Models can perpetuate and amplify biases that are present in the data used to build them. This can result in discrimination against certain groups of people.
  2. Privacy: Data can contain sensitive personal information, and the use of this data needs to be done in a way that protects individual privacy rights.
  3. Transparency: Models should be transparent and explainable so that users can understand how they work and make decisions. This is particularly important for models that significantly impact people’s lives, such as those used for credit scoring or hiring decisions.
  4. Accuracy: Models should be accurate and reliable, with clear measures of performance and validation. It is essential to test the model thoroughly and ensure that it is fit for purpose.
  5. Accountability: Data scientists have a responsibility to ensure that their models are used responsibly and ethically. It is important to be aware of the potential impact of the model and to take steps to mitigate any possible negative effects.

By keeping these points in mind, data scientists can ensure that their models are used responsibly and ethically and contribute to positive outcomes for individuals and society as a whole.

Conclusion

Data science is revolutionising how we approach problem-solving and decision-making across various industries. The data science lifecycle provides a structured and systematic approach to solving business problems using data-driven insights. Following the key steps outlined in this article can help you harness the power of data to stay ahead of the curve. As the field continues to grow and evolve, the future of data science looks bright, and those who embrace its potential are well-positioned for success in the coming years!
