Mastering Industry-Grade Data Pipelines: Web Scraping with Python for E-commerce Data: Flipkart Edition 🛒💻 (Part 1: Building the Foundation — Utilities and Logging)
- Struggling to find real-time data for your data science projects? 🤔
- Ever spent hours searching for a real-time dataset to train your machine learning model? 🕵️
- Dreamt of training machine learning models on data that’s as fresh as yesterday’s news? 📰
Yeah, me too. Kaggle datasets get old!
Remember those nights spent scouring the web for a decent real-time dataset to train your machine learning model? I’ve been there. Seeing countless friends struggle with the same issue, I decided to create this project. 🚀
Welcome to our two-part blog series, Mastering Industry-Grade Data Pipelines: Web Scraping with Python for E-commerce Data. While learning data science, I saw many of my peers struggle to find real-time raw datasets for building their machine learning models, and that common challenge inspired me to create a project that provides a practical solution.
Let’s dive in and make real-time data accessible for everyone! 🌐
In this blog, we will focus on building a robust, industry-grade data pipeline using Python and guide you through its essential components, including:
- 🔧 Modular coding practices (say goodbye to spaghetti code! 🍝👋)
- 🧩 Object-oriented programming principles (let’s make your code classy and reusable ♻️)
- 📝 Effective logging strategies
By the end of this blog, you’ll not only know how to design an efficient scraper but also how to create a seamless data pipeline that automates and optimizes data processing.
Project Overview
In this project, we will embark on a journey to scrape raw data from Flipkart, focusing specifically on laptop data. Our objective is to:
- 🛠 Extract raw data from Flipkart
- 🧹 Preprocess and clean it, ensuring its quality and usability (clean data, happy data! 😉✨)
- 🗄 Load the refined data into a MySQL database for efficient storage and management
This comprehensive approach will equip you with the skills to handle real-world data pipeline challenges, enabling you to transform raw e-commerce data into valuable insights.
Project Architecture
This project constructs a data pipeline using Python to gather laptop data from Flipkart and store it in a MySQL database. Here’s an overview of the process:
- 📦 Extracting Data from Flipkart: We use the `requests` module to send requests to Flipkart's website and retrieve the raw HTML content of product pages.
- 🔄 Data Transformation: Using `BeautifulSoup`, we extract key elements like titles, processor names, RAM types, operating systems, and prices. We then clean 🧹 and process 🪄 this data with Python (fixing typos, standardizing units) to make sure our data isn't a hot mess 🥴🔥.
- 🗄 Loading Data into MySQL: After transforming the data, we use `pymysql` to connect to and interact with our MySQL database, where we store the transformed data.
By following these steps, you will learn to create an efficient scraper and build a seamless data pipeline that automates and optimizes data processing for real-time e-commerce data.
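To make that flow concrete before we build it out properly, here is a minimal sketch of the extract and parse steps. The search URL, request headers, and CSS class name are illustrative placeholders rather than Flipkart's actual markup; the real scraping scripts are developed in Part 2.

```python
# Minimal sketch of the extract + parse flow (illustrative only).
# The URL and the CSS class name are placeholders, not Flipkart's real markup.
import requests
from bs4 import BeautifulSoup

url = "https://www.flipkart.com/search?q=laptops"   # assumed search URL
headers = {"User-Agent": "Mozilla/5.0"}             # a browser-like header reduces the chance of being blocked

response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull product titles; inspect the page to find the actual class name to target.
titles = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="product-title")]
print(titles[:5])
```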
Prerequisites
Before we embark on our data pipeline adventure, let’s make sure you have the essential tools in your toolbox:
- 🐬MySQL Installation: Ensure MySQL is installed on your machine. Follow the instructions in this video: MySQL Installation Guide.
- 🐍 Python Installation: Ensure Python is installed on your machine. Download and install it from the official Python website.
📁 Crafting an Effective Folder and File Structure
A well-defined folder structure is crucial for maintaining modular, scalable, and readable code. Systematically organizing your files ensures logical separation of project components, enhancing collaboration, debugging, and maintenance. Remember, a tidy project is a happy project! 🧹✨
Now, let’s have a look at the folder structure and a brief explanation of each file and folder:
Now, let’s have a look at how to create the folder structure with the help of the video🎥 below:
📚Resources Section
- Comprehensive YouTube Video 🎬: This YouTube playlist contains all the videos related to this blog. We'll walk you through each step of the process visually, making it easier to grasp the concepts.
- GitHub Text File 📄: This repository includes all the code referenced in this blog, ensuring you have access to all the scripts and tools you need for your project.
🚀Steps Covered
In this section, we outline the key steps covered in this blog to guide you through setting up and executing the project. Each step is designed to help you systematically build and manage your project from the ground up. Let's dive in! 🌟
Part 1: Building the Foundation — Utilities and Logging
In this first part, we’ll focus on essential steps to set up a robust data pipeline, including building utility functions and implementing a logging system. We will see all the utility and logger functions with examples.
- Step 1: 🔧 Setting up the project environment and installing libraries: Learn how to initialize your project and install the necessary libraries.
- Step 2: 📊 Creating the database and table structure: Understand how to design and create the database and tables for storing your data.
- Step 3: 🛠️ Building the `utils.py` module for utility functions: Implement functions to facilitate interaction with your database and streamline table creation.
- Step 4: 📝 Building the `logger.py` module for error handling and logging messages: Create a logging module to handle errors and log important messages.
Part 2: Diving Deep into Data Scraping — Full Load, Incremental Updates, and Orchestration
In this second part, we will develop scripts to automate the data scraping process. We’ll cover both full and incremental data loads, ensuring our pipeline can handle real-time data updates efficiently.
- Step 5: 🌐 Developing the `fullLoad.py` script for initial data scraping: Write a script for scraping the initial dataset and loading it into your database.
- Step 6: 🔄 Creating the `incrementalLoad.py` script for capturing new data updates: Implement a script to capture and load incremental data updates.
- Step 7: Implementing the `main.py` script to orchestrate the scraping process: Develop the main script to coordinate the entire scraping process and ensure seamless data updates.
- Step 8: 🧪 Testing the full load and incremental load code: Test both scripts to ensure they work as expected.
In This Part 1
We will focus on establishing the basics: setting up the project environment, creating the database structure, and building essential utility and logging modules. These steps are crucial for creating a scalable and maintainable data pipeline. By the end of this part, you will have a well-structured foundation ready for more advanced data scraping and processing tasks.
Now, let’s have a detailed look at each and every step. Buckle up, it’s going to be an exciting ride! 🎢✨
Step 1: 🔧 Setting up the project environment and installing libraries
Before we dive into web scraping, we need to set up our project environment and install the necessary libraries.
For a visual guide of this step, refer to the video tutorial above. 🎬
1.1 Setting up the Virtual Environment
First, let’s create a virtual environment to keep our project dependencies isolated. This makes it easier to manage and avoids conflicts with other projects.
1. Open your terminal/command prompt within VS Code.
2. Navigate to your project directory using terminal commands.
3. Create the virtual environment:
python -m venv End_To_End_ML_Project
4. Activate the virtual environment:
# Windows
.\End_To_End_ML_Project\Scripts\activate
# Linux or macOS
source End_To_End_ML_Project/bin/activate
1.2 Installing Required Libraries
With the virtual environment set up and activated, let’s install the necessary libraries:
pip install requests
pip install beautifulsoup4
pip install PyMySQL
Note: the `logging` module is part of Python's standard library, so it doesn't need to be installed with pip.
By following these steps, you have successfully set up a virtual environment and installed the required libraries for our project. Now we are ready to move on to creating the database. Let’s keep going! 🚀
Step 2: 📊 Creating the Database and Table Structure
In this step, we will create a database in MySQL Workbench and define the table structure for storing our scraped data.
2.1 Creating the Database
To create the database, follow these steps:
For a visual guide to create a database, refer to the video tutorial above. 🎬
1. Open MySQL Workbench.
2. Write the following command in a new SQL tab:
CREATE DATABASE ecommerce_laptop_product_scraper;
3. Execute the command to create the database.
4. Save the file as `SQL_queries.sql` in your project directory.
2.2 Viewing the Table Structure
Below is an image of the table structure we will use for storing our scraped laptop product data.
We will create the table in the Python code in further steps. This ensures that our table structure is defined programmatically, making it easier to manage and modify as needed.
By following these steps, you have successfully created the database in MySQL Workbench. Next, we will use Python code to create the table and start interacting with our database. Stay tuned! 🚀
Step 3: 🛠️ Building the `utils.py` module for utility functions
In any software project, utility functions play a crucial role by providing reusable pieces of code that handle common tasks. These functions enhance code modularity, maintainability, and readability. In this section, we will create a `utils.py` module containing the code that is used multiple times throughout our project.
This `utils.py` module will encapsulate four essential functions: establishing database connections, creating tables, inserting data, and retrieving data. Let's explore each function in the `utils.py` module, starting with a brief overview of each, followed by an in-depth explanation of all four functions and how to execute and use them effectively.
First, you have to create a file named `utils.py` as demonstrated in the video tutorial below. 🎬📹
Overview of Functions
Detailed Explanation of Each Function
1. Function: `Connect_To_Mysql_Database`
The function `Connect_To_Mysql_Database` is a static method that establishes a connection to a specified MySQL database. 🗄️
- It uses the `pymysql.connect` function to connect to the database using the provided database name, `localhost` as the host, `root` as the user, and the specified password.
- It creates a cursor object for executing SQL queries.
- It returns both the connection and cursor objects for further database operations.
Think of this as your VIP pass to the database club! 🎟️
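To make the description concrete, here is a minimal sketch of how such a function could look. The class name (`Utils`) and parameter names are assumptions for illustration; the full version lives in the GitHub repository.

```python
import pymysql

class Utils:
    @staticmethod
    def Connect_To_Mysql_Database(database_name, password):
        # Connect to the local MySQL server as root with the supplied password.
        connection = pymysql.connect(
            host="localhost",
            user="root",
            password=password,
            database=database_name,
        )
        # Cursor object used to execute SQL queries.
        cursor = connection.cursor()
        return connection, cursor

# Example usage (credentials are placeholders):
# connection, cursor = Utils.Connect_To_Mysql_Database("ecommerce_laptop_product_scraper", "your_password")
```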
2. Function: `Create_Table`
The `Create_Table` function creates a new table in the specified database if it doesn't already exist. 🏗️
- Defines the table schema with columns such as Title, Processor, RAM, etc.
- Executes the SQL command and commits the changes.
Watch the table creation in action in the video below! 📺✨
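For reference, here is one possible sketch of `Create_Table`, continuing the hypothetical `Utils` class from the previous sketch. The column names and types are assumptions based on the fields mentioned in this post (title, processor, RAM, operating system, price); adjust them to match your actual schema.

```python
class Utils:  # continuing the Utils sketch from the previous step
    @staticmethod
    def Create_Table(connection, cursor, table_name):
        # Create the table only if it does not already exist.
        # Column names and types below are assumptions for illustration.
        create_query = f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                id INT AUTO_INCREMENT PRIMARY KEY,
                Title VARCHAR(255),
                Processor VARCHAR(100),
                RAM VARCHAR(100),
                Operating_System VARCHAR(100),
                Price VARCHAR(50)
            );
        """
        cursor.execute(create_query)   # run the DDL statement
        connection.commit()            # persist the change
```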
3. Function: `Save_Laptop_Data`
The `Save_Laptop_Data` function inserts laptop data into a specified table in the database. 💾
- It retrieves the data values from the `Data` dictionary, converts them to strings, and stores them in a tuple.
- It constructs an SQL `INSERT` statement with placeholders for data values.
- It executes the SQL command and commits the changes.
Check out the data insertion process in the video below! 🚀🎥
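The sketch below shows one way this could be written, again continuing the hypothetical `Utils` class. It assumes the keys of the `Data` dictionary appear in the same order as the table columns from the earlier sketch.

```python
class Utils:  # continuing the Utils sketch from the previous steps
    @staticmethod
    def Save_Laptop_Data(connection, cursor, table_name, Data):
        # Convert every value in the Data dictionary to a string and pack it into a tuple.
        # Assumes the dictionary keys follow the same order as the columns listed below.
        values = tuple(str(value) for value in Data.values())
        insert_query = f"""
            INSERT INTO {table_name} (Title, Processor, RAM, Operating_System, Price)
            VALUES (%s, %s, %s, %s, %s);
        """
        # %s placeholders let pymysql handle escaping of the inserted values.
        cursor.execute(insert_query, values)
        connection.commit()
```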
4. Function: `Get_Title`
The `Get_Title` function retrieves all the titles from a specified table in the database. 📚
- It constructs and executes an SQL `SELECT` statement to get the `Title` column data.
- It fetches all the data returned by the SQL command.
- It flattens the list of tuples into a single list of titles.
- It returns the list of titles.
See how to fetch title data using the `Get_Title` function in the video below! 🔍📹
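Here is a minimal sketch of `Get_Title` along the same lines; the table name is passed in as a parameter, and the `Title` column is assumed to exist as defined in the earlier sketch.

```python
class Utils:  # continuing the Utils sketch from the previous steps
    @staticmethod
    def Get_Title(cursor, table_name):
        # Fetch every Title value currently stored in the table.
        cursor.execute(f"SELECT Title FROM {table_name};")
        rows = cursor.fetchall()            # list of one-element tuples, e.g. [("HP Pavilion 15",), ...]
        titles = [row[0] for row in rows]   # flatten into a plain list of title strings
        return titles
```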
And there you have it! With these utility functions, you're now equipped to handle database operations like a pro. Whether it's connecting to your database, creating tables, saving data, or fetching it, our `utils.py` module has got you covered.
Check out the code for this step in the GitHub repository.
Step 4: 📝 Building the `logger.py` module for error handling and logging messages
In this step, we will focus on creating a robust logging system to help with error handling and logging important messages. We'll create a `logger.py` module that sets up a logging system to write log messages to a file. This file will be stored in a designated directory with a unique name based on the current timestamp. This ensures that each run of the program generates a new log file, making it easier to track changes and debug issues.
First, you have to create a file named `__init__.py` as demonstrated in the video tutorial below. 🎬📹
Code Explanation:
The code sets up a logging system in Python for your project using the `logging` module. 🐍
- 📁 It creates a directory named ‘Logs’ if it doesn’t exist.
- 🕒 Generates a log file with a timestamped name in that directory.
- 📝 The logger is configured to write messages in a specific format to the log file.
This setup allows for error handling and logging of important messages, aiding in debugging and tracking changes in the program. 🛠️🔍
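For reference, here is a minimal sketch of the logger setup described above. The exact log format string and timestamp pattern are assumptions; the version in the repository may differ in detail.

```python
import logging
import os
from datetime import datetime

# Create the 'Logs' directory if it doesn't already exist.
LOG_DIR = "Logs"
os.makedirs(LOG_DIR, exist_ok=True)

# One log file per run, named with the current timestamp.
LOG_FILE = os.path.join(LOG_DIR, f"{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}.log")

# Configure the root logger to write formatted messages to the timestamped file.
logging.basicConfig(
    filename=LOG_FILE,
    format="[%(asctime)s] %(levelname)s - %(message)s",
    level=logging.INFO,
)

logger = logging.getLogger(__name__)
```

Any module in the project can then import this logger and call `logger.info(...)` or `logger.error(...)` wherever something worth recording happens.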
Sample Log File
Below is an example of how the logs might appear in the generated log file, giving a more detailed understanding of how the code works:
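For instance, with a format like the one sketched above, the entries would look roughly like this (timestamps and messages here are purely illustrative):

```
[2024-07-15 10:32:01,245] INFO - Connected to database ecommerce_laptop_product_scraper
[2024-07-15 10:32:02,118] INFO - Table created successfully (or already exists)
[2024-07-15 10:32:05,902] ERROR - Failed to parse the price for one of the products
```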
To demonstrate the working of the logger, you can view a short video🔍📹:
With the `logger.py` module set up, you now have a powerful tool for error handling and logging important messages. Each run generates a new log file, providing a clear way to track your application's behavior over time.
Logging is like a black box for your application — it records everything, making debugging easier. Implement this in your project to simplify error tracking and resolution. Happy logging! 📜🛠️
Check out the code for this step in the GitHub repository and try_log.py.
Conclusion
In this first part, we set up the project environment, created the database structure, and built essential utility and logging modules. These foundational steps are crucial for creating a scalable and maintainable data pipeline.
Stay tuned for Part 2, where we’ll dive into automating the data scraping process and handling real-time data updates.
Follow me on Medium to stay updated on future posts and feel free to give claps if you found this post helpful!
Thank you for reading! If you have any questions or feedback, leave a comment below. Happy coding! 🧑💻🎉