Chapter 1 — Getting started Full stack ML course (Part 2: Unveiling the Foundations: Problem Definition, Data Acquisition, and EDA)

Shivam Kaushik
9 min read · Mar 7, 2024

--

In the previous blog, I covered the course introduction and its fundamental aspects: why we use ML solutions in the first place, and the beginning of the ML development lifecycle. In this blog, we'll go deeper into that lifecycle and address topics crucial for any ML project. Specifically, we'll cover how to define an ML problem and why a clear problem statement matters, then move to the data acquisition step, and conclude with exploratory data analysis.

3. The ML Dev lifecycle

An image showing model lifecycle

The typical ML lifecycle encompasses several key stages:

1. Problem identification

2. Data acquisition

3. Exploratory Data Analysis

4. Model creation

5. Model assessment

6. Model deployment

7. Monitoring

While these steps are essential, they don’t cover every aspect of the process. It’s worth noting that a significant portion of time is usually dedicated to the first three steps, with the least amount of time spent on model development. Additionally, it’s important to remember that these steps represent just one iteration of the development cycle.

In the following section, I’ll delve into each of these steps in more detail, providing a comprehensive overview of what they entail. Stay tuned for more insights in the upcoming discussion.

3.1 Problem definition

Alright, this segment is important. In my experience, it is often neglected. In the past, I would repeatedly review research papers while placing excessive emphasis on methodologies rather than on each paper's overarching objective. The crux of the matter lies in problem definition: grasping it cuts down the time required to understand a new problem you take on, which matters because you won't be addressing the same issue throughout your career.

In the upcoming section, I'll outline the process of defining a problem. What does that entail exactly? Simply put, defining a problem means articulating its objective in technical terms: what is the method trying to accomplish? You can state this in plain English or, better yet, in mathematical notation, since the latter allows a more precise and succinct definition of the problem.

Example 1: A classification problem

Imagine we have to identify whether an image shows a hot dog or not. Defining the problem technically would look like this:

“The problem involves developing a binary classification model to accurately label images as ‘hot dog’ or ‘not hot dog’. The objective is to maximize the likelihood of correctly assigning labels to input images.”

Mathematically: let x be an input image. The goal is to determine its label y. We aim to learn a function f(x) that maps input images to their correct labels, where f(x) is optimized to maximize the probability of the correct label:

ŷ = f(x) = argmax_y P(y | x), with y ∈ {hot dog, not hot dog}
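To make the formulation concrete, here is a minimal Python sketch (my own illustration, not code from any particular course or paper): a toy function `f` stands in for a trained classifier and returns class probabilities, and the prediction is simply the argmax over those probabilities.

```python
import numpy as np

# Toy stand-in for a trained binary classifier f(x). In practice f would be a
# CNN trained on labelled images; the scoring rule here is made up purely to
# illustrate the argmax-over-probabilities formulation.
CLASSES = ["hot dog", "not hot dog"]

def f(x: np.ndarray) -> np.ndarray:
    """Return P(y | x) over the two classes for a single image array."""
    logits = np.array([x.mean(), 1.0 - x.mean()])   # placeholder scores
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                          # softmax turns scores into probabilities

image = np.random.rand(224, 224, 3)                 # dummy "image"
probs = f(image)
prediction = CLASSES[int(np.argmax(probs))]         # y_hat = argmax_y P(y | x)
print(prediction, probs)
```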

Example 2: A foreground detection task

Consider a task where we need to segment images to identify different objects within the scene. Defining the problem technically would appear as follows:

The problem entails developing a segmentation model to partition images into distinct regions corresponding to different objects or classes. The objective is to accurately delineate object boundaries within the image.

Mathematically: let x represent an input image. The objective is to assign each pixel of the image to a class y (foreground or background). We aim to learn a function f(x) that maps the input image and each of its pixels to foreground or background, where f(x) is optimized to maximize the accuracy of the pixel-wise class assignments:

ŷᵢ = argmax_y P(y | x, i), for each pixel i, with y ∈ {foreground, background}
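The same argmax idea applies per pixel. In the hedged sketch below, the function `f` is just a placeholder for a real segmentation network (for example a U-Net); only the shape of the computation is meant to be illustrative.

```python
import numpy as np

# Toy stand-in for a segmentation model f(x): for every pixel it outputs
# P(background | x, pixel) and P(foreground | x, pixel). A real model would be
# something like a U-Net; the scoring rule below is purely illustrative.
def f(x: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) array of per-pixel class probabilities."""
    fg = x.mean(axis=-1, keepdims=True)              # placeholder foreground score in [0, 1]
    return np.concatenate([1.0 - fg, fg], axis=-1)   # [P(background), P(foreground)] per pixel

image = np.random.rand(64, 64, 3)
probs = f(image)
mask = probs.argmax(axis=-1)   # per-pixel y_hat = argmax_y P(y | x, pixel)
print(mask.shape)              # (64, 64) binary mask: 0 = background, 1 = foreground
```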

Example 3: A next word prediction task

Suppose we have a task where we want to predict the next word in a sentence based on the context provided. Defining the problem technically would appear as follows:

The problem involves developing a language model capable of predicting the most likely next word given a sequence of input words. The objective is to accurately anticipate the word that follows the input context.

Mathematically it would mean:

Let w₁, w₂, …, wₙ represent a sequence of n words in a given text.

The goal is to predict the next word wₙ₊₁ given the preceding n words.

Mathematically, the prediction can be formulated as:

wₙ₊₁ = argmax_w P(w | w₁, w₂, …, wₙ)

where P(w | w₁, w₂, …, wₙ) is the probability of the next word w given the preceding sequence of words w₁, w₂, …, wₙ.
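As a rough illustration of that argmax, here is a tiny bigram model in Python. It is deliberately a toy: a real next-word predictor conditions on the whole prefix w₁ … wₙ (for example with a transformer), while this sketch only conditions on the previous word.

```python
from collections import Counter, defaultdict

# A tiny bigram language model: estimate P(w | previous word) from counts and
# predict the next word with argmax. This only keeps the last word of the
# context, unlike a real model that uses the full prefix.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_word: str) -> str:
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}   # P(w | prev_word)
    return max(probs, key=probs.get)                    # argmax_w P(w | ...)

print(predict_next("the"))   # "cat", the most frequent continuation of "the"
```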

The point of defining the problem is to have a crystal-clear understanding of the problem at hand. Unless you can define the problem correctly, you won't be able to apply the correct methods to it.

3.2 Data acquisition

Data acquisition is a crucial step in the machine learning pipeline: the quality and relevance of the acquired data greatly influence the performance and accuracy of the resulting models. When acquiring data, it's essential to consider various factors to ensure the integrity and representativeness of the dataset.

One significant concern is the presence of biases in the data, which can lead to skewed or inaccurate model predictions. Biases can arise from various sources, including sampling methods, data collection processes, and societal or cultural influences. To mitigate biases, it's important to use diverse and representative datasets, actively identify and address any inherent biases, and ensure transparency and accountability in the data collection process.

Additionally, good practices such as data anonymization, ensuring data privacy and security, and obtaining necessary permissions or consents from data subjects are essential for ethical data acquisition. Moreover, incorporating domain knowledge and expertise during the data acquisition phase can help guide the selection of relevant features and ensure the dataset's suitability for the intended machine learning task. By adhering to these practices, practitioners can acquire high-quality data that forms the foundation for robust and reliable machine learning models.
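As one small, practical example of these checks, the hedged sketch below looks at the label distribution of a freshly collected dataset, both overall and per collection source. The column names (`label`, `source`) are hypothetical stand-ins for whatever your dataset actually records.

```python
import pandas as pd

# Hypothetical freshly collected dataset; "label" and "source" are stand-in
# column names, not from any real project.
df = pd.DataFrame({
    "label":  ["hot dog", "not hot dog", "not hot dog", "not hot dog", "hot dog"],
    "source": ["web_scrape", "web_scrape", "user_upload", "web_scrape", "user_upload"],
})

# Overall class balance: a heavily skewed split is an early warning sign.
print(df["label"].value_counts(normalize=True))

# Class balance per collection source: large differences hint at sampling bias.
print(pd.crosstab(df["source"], df["label"], normalize="index"))
```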

Okay, so here is my experience with this point. I would like to show you what happens when you collect data without caring about its quality. You must have seen this phrase:

“Garbage in, Garbage out”

Well, that is what actually happens. Neural networks are just function approximators; they learn from the data they are fed. So you shouldn't be surprised if they give predictions that are not in line with your intuition. Now is the time for an example.

While developing a foreground detection model, commonly referred to as “Salient Object Detection,” I conducted training and made inferences on the provided image. The resulting prediction, depicted by the red mask, appeared satisfactory at first glance. However, intuition suggested that the branch should have been identified as part of the foreground. This scenario underscores the significance of data in model training. Upon reviewing the dataset, I encountered images where the branch was not categorized as part of the foreground. Given this observation, the prediction made earlier aligns with the dataset’s characteristics.

Image samples from the dataset
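Reviewing such cases by eye works, but a quick programmatic pass can also surface suspicious annotations. The sketch below is one hypothetical way to do it: it measures the foreground coverage of each ground-truth mask and flags unusually sparse ones for manual review (the masks here are synthetic; in practice you would load your own).

```python
import numpy as np

# Synthetic binary masks stand in for the dataset's ground-truth annotations;
# in practice you would load them from disk (e.g. with PIL or OpenCV).
rng = np.random.default_rng(0)
masks = [rng.integers(0, 2, size=(128, 128)) for _ in range(4)]

# Fraction of pixels marked as foreground in each mask.
coverages = np.array([mask.mean() for mask in masks])

# Flag masks whose coverage sits far below the rest: candidates for the kind
# of missing-object annotation described above.
threshold = coverages.mean() - 2 * coverages.std()
suspicious = np.where(coverages < threshold)[0]

print("foreground coverage per mask:", np.round(coverages, 3))
print("indices worth eyeballing:", suspicious)
```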

3.3 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is like peering through a window into the soul of your dataset. It’s the crucial first step in any data science journey, where you roll up your sleeves and dive deep into the data. EDA is all about getting acquainted with your data, understanding its quirks, patterns, and potential pitfalls before jumping into modeling. It’s like exploring a new city — you wander around, take note of landmarks, peek into hidden alleys, and get a feel for the vibe. Without EDA, you’re basically flying blind, risking misinterpretation or overlooking important insights. To tackle EDA like a pro, you should

  1. Jot down a few inquiries you’re curious about uncovering. These can span a wide range of topics.
  2. Formulate hypotheses regarding the data and put them to the test during the exploratory data analysis (EDA) phase.
  3. Begin by graphically representing your data, identifying any anomalies, examining distributions, and unraveling connections among variables (a minimal sketch of this step follows the list).
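As that minimal sketch of point 3 (with an entirely made-up DataFrame standing in for real data), a first EDA pass in pandas might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# An entirely synthetic dataset; replace it with pd.read_csv("your_data.csv").
df = pd.DataFrame({
    "age":    [25, 31, 47, 52, 29, 41, 38, 60],
    "income": [32_000, 45_000, 80_000, 75_000, 40_000, 62_000, 58_000, 90_000],
    "bought": [0, 0, 1, 1, 0, 1, 0, 1],
})

print(df.describe())    # ranges and obvious anomalies
print(df.isna().sum())  # missing values per column
print(df.corr())        # connections among variables

df["income"].hist(bins=5)   # distribution of a single variable
plt.xlabel("income")
plt.ylabel("count")
plt.show()
```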

Think of it as laying the groundwork before constructing a building: you need a solid foundation to build something meaningful. Okay, I know this is the point where I have to give you some examples. I promise these will be really interesting (at least I can say that for myself). Before I start, I should share this link: https://www.kaggle.com/code/headsortails/hidden-gems-a-collection-of-underrated-notebooks . It is called Kaggle Hidden Gems. Kaggle is a platform for data science enthusiasts and learners, and you should check it out; it links to many interesting notebooks. Back to some examples of EDA.

3.3.1 Example 1: Analysis of the Chai Time Data Science Podcast (Notebook Link)

In this insightful notebook, the author dives deep into the YouTube analytics of the Chai Time Data Science (CTDS) podcast, shedding light on what makes its episodes successful and how to expand its reach. I highly recommend checking out his channel. The author explores why YouTube is a key player in CTDS's strategy and uncovers valuable insights to enhance its performance.

Top 5 Questions Explored:

  1. What factors influence viewer engagement with video thumbnails?
  2. How does the thumbnail format impact click-through rates?
  3. What’s the effect of releasing mini-CTDS episodes on the same day?
  4. Are custom thumbnails with CTDS branding more effective?
  5. What drives viewers to watch a video beyond the initial click?

Hypotheses Examined:

  1. Thumbnail design, including topic and duration, influences viewer behavior.
  2. Spacing out mini-CTDS releases could boost overall engagement.
  3. Custom thumbnails with CTDS branding may attract more viewers.
  4. Viewer interest correlates with the percentage of the video watched.
  5. Relevance and content quality determine viewer retention, especially evident in mini-CTDS episodes.

Conclusion:

The analysis provides valuable insights into optimizing CTDS’s YouTube presence. By understanding viewer behavior and preferences, the podcast can tailor its content and marketing strategies to attract a wider audience. With these insights, CTDS can continue to foster a vibrant community of data science enthusiasts and thought leaders. Let’s harness the power of data to amplify the impact of CTDS on YouTube!

3.3.2 Example 2: Analysis of the Kaggle Survey (Notebook link)

This notebook is incredibly intriguing as it delves into several key questions regarding data science professionals. It explores topics such as the necessary skills for these professionals and the preferred programming languages they utilize. Among the fascinating insights provided, one can observe:

  • The diverse range of professions within the data science field, extending beyond just data scientists. The accompanying graph offers a visual representation of these roles.
  • A comparison of the skills required for business analysts versus ML engineers, highlighting significant differences between the two roles. The figure below was made by me; the author shows a cluster plot that conveys more or less the same message (a rough code sketch of such a comparison follows this list).
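For the curious, here is that rough sketch of how such a role-versus-skill comparison could be produced. The column names and responses are hypothetical; the actual Kaggle survey CSV stores multiple-choice answers that first need to be reshaped into a tidy role/skill table.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical tidy table of survey responses: one row per (respondent, skill).
# The real Kaggle survey needs preprocessing to reach this shape.
responses = pd.DataFrame({
    "role":  ["Business Analyst", "Business Analyst", "Business Analyst",
              "ML Engineer", "ML Engineer", "ML Engineer"],
    "skill": ["Excel", "SQL", "Python",
              "Python", "SQL", "Deep Learning"],
})

# Fraction of responses per role that mention each skill.
skill_share = (
    responses.groupby("role")["skill"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(skill_share)

skill_share.plot(kind="bar")   # side-by-side bars, one group per role
plt.ylabel("share of responses")
plt.show()
```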

The primary purpose of exploratory data analysis (EDA) is to glean insights from the data, to find answers to your inquiries, and to uncover any latent biases that must not be disregarded.

Conclusion

In conclusion, this blog installment has provided a comprehensive exploration of key components in the machine learning (ML) development lifecycle. We’ve discussed the critical importance of accurately defining problem statements, ensuring high-quality data acquisition, and conducting thorough exploratory data analysis (EDA) to lay the groundwork for successful ML projects. By understanding these foundational principles and practices, practitioners can navigate the complexities of ML development with greater clarity and efficacy. As we continue our journey, let’s apply these insights to real-world scenarios and explore further stages of the ML lifecycle. Stay tuned for more practical examples and in-depth discussions to deepen your understanding of machine learning concepts.

The link for the next chapter is provided here:
https://medium.com/@shivam.kaushik73/chapter-1-getting-started-full-stack-ml-course-part-3-model-creation-model-assessment-model-03828b7614d8
