Mastering Industry-Grade Data Pipelines: Web Scraping with Python for E-commerce Data: Flipkart Edition 🛒💻 (Part 1: Building the Foundation — Utilities and Logging)

Data Engineering Blogpost
11 min read · Jun 5, 2024


  • Struggling to find real-time data for your data science projects? 🤔
  • Ever spent hours searching for a real-time dataset to train your machine learning model? 🕵️
  • Dreamt of training machine learning models on data that’s as fresh as yesterday’s news? 📰

Yeah, me too. Kaggle datasets get old!

Remember those nights spent scouring the web for a decent real-time dataset to train your machine learning model? I’ve been there. Seeing countless friends struggle with the same issue, I decided to create this project. 🚀

Welcome to our two-part blog series, Mastering Industry-Grade Data Pipelines: Web Scraping with Python for E-commerce Data. During my own data science journey, I watched many of my peers struggle to find real-time raw datasets for building their machine learning models, and that shared challenge inspired me to create a project that offers a practical solution.

Let’s dive in and make real-time data accessible for everyone! 🌐

In this blog, we will focus on building a robust, industry-grade data pipeline using Python. We will guide you through the essential components of an industry-grade data pipeline, including:

  • 🔧 Modular coding practices (say goodbye to spaghetti code! 🍝👋)
  • 🧩 Object-oriented programming principles (let’s make your code classy and reusable ♻️)
  • 📝 Effective logging strategies

By the end of this blog, you’ll not only know how to design an efficient scraper but also how to create a seamless data pipeline that automates and optimizes data processing.

Project Overview

In this project, we will embark on a journey to scrape raw data from Flipkart, focusing specifically on laptop data. Our objective is to:

  1. 🛠 Extract raw data from Flipkart
  2. 🧹 Preprocess and clean it, ensuring its quality and usability (clean data, happy data! 😉✨)
  3. 🗄 Load the refined data into a MySQL database for efficient storage and management

This comprehensive approach will equip you with the skills to handle real-world data pipeline challenges, enabling you to transform raw e-commerce data into valuable insights.

Project Architecture

This project constructs a data pipeline using Python to gather laptop data from Flipkart and store it in a MySQL database. Here’s an overview of the process:

  1. 📦 Extracting Data from Flipkart: We use the requests module to send requests to Flipkart's website and retrieve the raw HTML content of product pages.
  2. 🔄 Data Transformation: Using BeautifulSoup, we extract key elements like titles, processor names, RAM types, operating systems, and prices. We then clean 🧹 and process 🪄 this data with Python (fixing typos, standardizing units) so it is consistent and usable (making sure our data isn’t a hot mess 🥴🔥).
  3. 🗄 Loading Data into MySQL: After transforming the data, we use pymysql to connect to our MySQL database and store the transformed data.

By following these steps, you will learn to create an efficient scraper and build a seamless data pipeline that automates and optimizes data processing for real-time e-commerce data.
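To make the three-step flow concrete, here is a minimal sketch of the whole pipeline in one place. The URL, the CSS class used to locate titles, the table name, and the database credentials are all placeholders for illustration, not the project’s real values:

import requests
from bs4 import BeautifulSoup
import pymysql

# 1. Extract: fetch the raw HTML of a Flipkart search page (URL is illustrative)
url = "https://www.flipkart.com/search?q=laptops"
headers = {"User-Agent": "Mozilla/5.0"}  # a browser-like header reduces blocked requests
html = requests.get(url, headers=headers).text

# 2. Transform: pull out product titles with BeautifulSoup
# "product-title" is a placeholder CSS class; inspect the live page for the real selector
soup = BeautifulSoup(html, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="product-title")]

# 3. Load: store the titles in MySQL via pymysql (credentials and table name are placeholders)
connection = pymysql.connect(host="localhost", user="root", password="your_password",
                             database="ecommerce_laptop_product_scraper")
with connection.cursor() as cursor:
    for title in titles:
        cursor.execute("INSERT INTO laptop_data (Title) VALUES (%s)", (title,))
connection.commit()
connection.close()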

Prerequisites

Before we embark on our data pipeline adventure, let’s make sure you have the essential tools in your toolbox:

  1. 🐬MySQL Installation: Ensure MySQL is installed on your machine. Follow the instructions in this video: MySQL Installation Guide.
  2. 🐍 Python Installation: Ensure Python is installed on your machine. Download and install it from the official Python website.

📁 Crafting an Effective Folder and File Structure

A well-defined folder structure is crucial for maintaining modular, scalable, and readable code. Systematically organizing your files ensures logical separation of project components, enhancing collaboration, debugging, and maintenance. Remember, a tidy project is a happy project! 🧹✨

Now, let’s have a look at the folder structure and a brief explanation of each file and folder:

To see how to create this folder structure, watch the video 🎥 below:

Creating a folder structure

📚Resources Section

  • Comprehensive YouTube Video🎬: This YouTube playlist contains all the videos related to this blog. We’ll walk you through each step of the process visually, making the concepts easier to grasp.
  • GitHub Repository 📄: This repository includes all the code referenced in this blog, ensuring you have access to all the scripts and tools you need for your project.

🚀Steps Covered

In this section, we outline the key steps covered in this blog to guide you through setting up and executing the project. Each step is designed to help you systematically build and manage your project from the ground up. Let’s dive in! 🌟

Part 1: Building the Foundation — Utilities and Logging

In this first part, we’ll focus on essential steps to set up a robust data pipeline, including building utility functions and implementing a logging system. We will see all the utility and logger functions with examples.

Part 2: Diving Deep into Data Scraping — Full Load, Incremental Updates, and Orchestration

In this second part, we will develop scripts to automate the data scraping process. We’ll cover both full and incremental data loads, ensuring our pipeline can handle real-time data updates efficiently.

In This Part 1

We will focus on establishing the basics: setting up the project environment, creating the database structure, and building essential utility and logging modules. These steps are crucial for creating a scalable and maintainable data pipeline. By the end of this part, you will have a well-structured foundation ready for more advanced data scraping and processing tasks.

Now, let’s have a detailed look at each and every step. Buckle up, it’s going to be an exciting ride! 🎢✨

Step 1: 🔧Setting up the project environment and installing libraries

Before we dive into web scraping, we need to set up our project environment and install the necessary libraries.

For a visual guide of this step, refer to the video tutorial above. 🎬

1.1 Setting up the Virtual Environment

First, let’s create a virtual environment to keep our project dependencies isolated. This makes it easier to manage and avoids conflicts with other projects.

  1. Open your terminal/command prompt within VS Code.
  2. Navigate to your project directory using terminal commands.
  3. Create the virtual environment:
python -m venv End_To_End_ML_Project

4. Activate the virtual environment:

# Windows
.\End_To_End_ML_Project\Scripts\activate
# Linux or macOS
source End_To_End_ML_Project/bin/activate

1.2 Installing Required Libraries

With the virtual environment set up and activated, let’s install the necessary libraries:

pip install requests
pip install beautifulsoup4
pip install PyMySQL

Note: the logging module is part of Python’s standard library, so there is nothing extra to install for it.

By following these steps, you have successfully set up a virtual environment and installed the required libraries for our project. Now we are ready to move on to creating the database. Let’s keep going! 🚀

Step 2: 📊 Creating the Database and Table Structure

In this step, we will create a database in MySQL Workbench and define the table structure for storing our scraped data.

2.1 Creating the Database

To create the database, follow these steps:

For a visual guide to create a database, refer to the video tutorial above. 🎬

  1. Open MySQL Workbench.
  2. Write the following command in a new SQL tab:
CREATE DATABASE ecommerce_laptop_product_scraper;

3. Execute the command to create the database.

4. Save the file as SQL_queries.sql in your project directory.

2.2 Viewing the Table Structure

Below is an image of the table structure we will use for storing our scraped laptop product data.

We will create the table in the Python code in further steps. This ensures that our table structure is defined programmatically, making it easier to manage and modify as needed.

By following these steps, you have successfully created the database in MySQL Workbench. Next, we will use Python code to create the table and start interacting with our database. Stay tuned! 🚀

Step 3: 🛠️Building the utils.py module for utility functions

In any software project, utility functions play a crucial role by providing reusable pieces of code that handle common tasks. These functions enhance code modularity, maintainability, and readability. In this section, we will create a utils.py module to hold the code that is reused multiple times throughout our project.

This utils.py module will encapsulate four essential functions: establishing database connections, creating tables, inserting data, and retrieving data. We’ll start with a brief overview of each function, followed by an in-depth explanation of all four and how to execute and use them effectively.

First, you have to create a file named utils.py as demonstrated in the video tutorial below. 🎬📹

Step 3.1: Creating a utils.py File 📄

Overview of Functions

Detailed Explanation of Each Function

  1. Function: Connect_To_Mysql_Database

The function Connect_To_Mysql_Database is a static method that establishes a connection to a specified MySQL database. 🗄️

  • It uses the pymysql.connect function to connect to the database using the provided database name, localhost as the host, root as the user, and the specified password.
  • It creates a cursor object for executing SQL queries.
  • It returns both the connection and cursor objects for further database operations.

Think of this as your VIP pass to the database club! 🎟️
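Based on that description, a minimal sketch of the function could look like the following. The class name, parameter names, and exact signature are assumptions for illustration; the actual implementation is in the GitHub repository:

import pymysql

class DatabaseUtils:  # the class name is an assumption; the blog only names the methods
    @staticmethod
    def Connect_To_Mysql_Database(database_name, password):
        # Connect to the local MySQL server as root, using the given database name and password
        connection = pymysql.connect(host="localhost", user="root",
                                     password=password, database=database_name)
        # Create a cursor object for executing SQL queries
        cursor = connection.cursor()
        # Return both so callers can run queries now and commit/close later
        return connection, cursor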

2. Function: Create_Table

The Create_Table function creates a new table in the specified database if it doesn't already exist. 🏗️

  • Defines the table schema with columns such as Title, Processor, RAM, etc.
  • Executes the SQL command and commits the changes.

Watch the table creation in action in the video below! 📺✨

2) Executing🚀 the Create_Table function
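Continuing the DatabaseUtils sketch above, Create_Table might look roughly like this. The column names and types are assumptions pieced together from the fields mentioned in this blog (Title, Processor, RAM, operating system, price):

    @staticmethod
    def Create_Table(connection, cursor, table_name):
        # Define the table schema, creating the table only if it doesn't already exist
        create_table_query = f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                id INT AUTO_INCREMENT PRIMARY KEY,
                Title VARCHAR(500),
                Processor VARCHAR(255),
                RAM VARCHAR(255),
                Operating_System VARCHAR(255),
                Price VARCHAR(50)
            )
        """
        # Execute the SQL command and commit the change
        cursor.execute(create_table_query)
        connection.commit()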

3. Function: Save_Laptop_Data

The Save_Laptop_Data function inserts laptop data into a specified table in the database. 💾

  • It retrieves the data values from the Data dictionary, converts them to strings, and stores them in a tuple.
  • Constructs an SQL INSERT statement with placeholders for data values.
  • Executes the SQL command and commits the changes.

Check out the data insertion process in the video below! 🚀🎥

3) Executing🚀 the Save_Laptop_Data function
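A rough sketch of Save_Laptop_Data, following the steps above. The column list mirrors the assumed schema from the Create_Table sketch, and the Data dictionary is expected to hold one value per column:

    @staticmethod
    def Save_Laptop_Data(connection, cursor, table_name, Data):
        # Convert the dictionary values to strings and pack them into a tuple
        values = tuple(str(value) for value in Data.values())
        # Build an INSERT statement with %s placeholders so pymysql handles escaping
        insert_query = f"""
            INSERT INTO {table_name} (Title, Processor, RAM, Operating_System, Price)
            VALUES (%s, %s, %s, %s, %s)
        """
        # Execute the SQL command and commit the change
        cursor.execute(insert_query, values)
        connection.commit()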

4. Function: Get_Title

The Get_Title function retrieves all the titles from a specified table in the database. 📚

  • It constructs and executes an SQL SELECT statement to get the Title column data.
  • It fetches all the data returned by the SQL command.
  • It flattens the list of tuples into a single list of titles.
  • It returns the list of titles.

See how to fetch title data using the Get_Title function in the video below! 🔍📹

4) Executing🚀 the Get_Title function
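And a sketch of Get_Title that mirrors the steps listed above (the table and column names are assumptions consistent with the earlier sketches):

    @staticmethod
    def Get_Title(cursor, table_name):
        # Select only the Title column from the table
        cursor.execute(f"SELECT Title FROM {table_name}")
        # fetchall() returns a list of one-element tuples, e.g. [("HP Pavilion",), ...]
        rows = cursor.fetchall()
        # Flatten the tuples into a plain list of title strings
        return [row[0] for row in rows]

A flat list of stored titles is useful later on, for example when an incremental load needs to check which products are already in the database before inserting new ones.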

And there you have it! With these utility functions, you’re now equipped to handle database operations like a pro. Whether it’s connecting to your database, creating tables, saving data, or fetching it, our utils.py module has got you covered.

Check out the code for this step in the GitHub repository.

Step 4: 📝Building the logger.py module for error handling and logging messages

In this step, we will focus on creating a robust logging system to help with error handling and logging important messages. We’ll create a logger.py module that sets up a logging system to write log messages to a file. This file will be stored in a designated directory with a unique name based on the current timestamp. This ensures that each run of the program generates a new log file, making it easier to track changes and debug issues.

First, you have to create a file named __init__.py as demonstrated in the video tutorial below. 🎬📹

Creating a __init__.py File📄

Code Explanation:

The code sets up a logging system in Python for your project using the logging module. 🐍

  • 📁 It creates a directory named ‘Logs’ if it doesn’t exist.
  • 🕒 Generates a log file with a timestamped name in that directory.
  • 📝 The logger is configured to write messages in a specific format to the log file.

This setup allows for error handling and logging of important messages, aiding in debugging and tracking changes in the program. 🛠️🔍
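Putting those three bullets together, the logging setup in __init__.py could be sketched like this. The timestamp pattern and the log format string are assumptions; refer to the GitHub repository for the actual configuration:

import logging
import os
from datetime import datetime

# Create the 'Logs' directory if it doesn't already exist
LOG_DIR = "Logs"
os.makedirs(LOG_DIR, exist_ok=True)

# Build a unique, timestamped log file name for this run (the naming pattern is an assumption)
LOG_FILE = f"{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}.log"
LOG_FILE_PATH = os.path.join(LOG_DIR, LOG_FILE)

# Configure the root logger to write messages to that file in a readable format
logging.basicConfig(
    filename=LOG_FILE_PATH,
    format="[ %(asctime)s ] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)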

Sample Log File

Below is an example of how the logs might appear in the generated log file, giving a more detailed understanding of how the code works:
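With a format string like the one assumed above, entries in the log file would look roughly like this (timestamps, line numbers, and messages are purely illustrative):

[ 2024-06-05 10:15:32,481 ] 7 root - INFO - Connected to database ecommerce_laptop_product_scraper
[ 2024-06-05 10:15:33,102 ] 21 root - INFO - Table created (if it did not already exist)
[ 2024-06-05 10:15:35,940 ] 42 root - ERROR - Failed to parse the price for one product, skipping it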

To see the logger in action, you can view a short video🔍📹:

Executing logger.py Module

With the logger.py module set up, you now have a powerful tool for error handling and logging important messages. Each run generates a new log file, providing a clear way to track your application's behavior over time.

Logging is like a black box for your application — it records everything, making debugging easier. Implement this in your project to simplify error tracking and resolution. Happy logging! 📜🛠️

Check out the code for this step in the GitHub repository and try_log.py.
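As a rough idea, a try_log.py test script might look like this. The import path depends on where your __init__.py lives, which isn’t spelled out here, so treat the module name as an assumption:

# try_log.py -- a tiny script to confirm the logging setup writes to the Logs folder
import logging

import logger  # assumed module/package name: importing it runs the logging configuration above

logging.info("Logger initialised, starting a test run")
try:
    1 / 0  # deliberately raise an error to see it logged
except ZeroDivisionError as error:
    logging.error("Something went wrong: %s", error)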

Conclusion

In this first part, we set up the project environment, created the database structure, and built essential utility and logging modules. These foundational steps are crucial for creating a scalable and maintainable data pipeline.

Stay tuned for Part 2, where we’ll dive into automating the data scraping process and handling real-time data updates.

Follow me on Medium to stay updated on future posts and feel free to give claps if you found this post helpful!

Thank you for reading! If you have any questions or feedback, leave a comment below. Happy coding! 🧑‍💻🎉
