AI Web Scraping Revolution: Your Guide to Scraping 30 E-Commerce Sites in 30 Minutes

Discover the Future of Web Scraping: Minimize Code, Maximize Scale, and Surpass ChatGPT’s Speed

Neha Setia Nagpal
6 min read · Mar 29, 2024

This article introduces web scraping developers to an innovative and efficient system for constructing large-scale projects with minimal code. It offers a faster, more cost-effective, and easily scalable alternative to other AI tools and LLMs like ChatGPT.

There are multiple AI tools and plenty of articles promising to automate web scraping with AI and LLMs like ChatGPT. But can you fully automate large-scale web scraping? Here’s the blunt truth:

Currently, you cannot completely automate large-scale web scraping projects with AI unless you are okay with:
1. sacrificing data quality, or
2. handing full control of your web data extraction projects to black-box AI solutions, which sometimes hallucinate; and when they do, there is nothing you can do about it.

You should do neither! And the good news is: you don’t have to with Zyte’s AI Tech Stack for Web Scraping!

In this article, I will demonstrate a new way of approaching {any}-scale web scraping projects with the Spider Template System, built on Zyte’s AI Tech Stack, without losing control, compromising data quality, or taking on the overhead of maintaining multiple tools. This system is

  • self-healing and resilient to website layout changes and anti-ban upgrades;
  • able to scale effortlessly to hundreds or thousands of websites.

The objective of this blog is to demonstrate Zyte’s AI Scraping Tool by scraping product data from various e-commerce websites, with a brief introduction to the Scrapy Spider Template System. If you want to understand this new system in depth, I recommend reading A Brief History of the Evolution of the Web Scraping Tech Stack at Zyte.

In my view, the best way to proceed with this blog is to jump straight to the second heading, Setting up a product data extraction project from 30 e-commerce websites using Zyte’s AI tech stack, and try it :)

This blog is also a supporting article for the hands-on workshop How to Build 30 Scrapy Spiders to Scrape Product Data in Under 30 Minutes. If you prefer learning through video, you can watch the recording here.

Table of contents

  1. Introduction to Zyte’s AI tech stack and Scrapy spider template system for web scraping.
  2. Setting up a product data extraction project from 30 e-commerce websites using Zyte’s AI tech stack.
  3. Surprise!
  4. Conclusion.
  5. Next Steps- Tailoring the Solution: Customization Possibilities and Implementation Guide For Spider Templates.

Introduction to Zyte’s AI tech stack and Scrapy spider template system for web scraping.

Zyte’s AI tech stack combines Scrapy, Zyte API with AI capabilities, and Scrapy Cloud into a predefined Scrapy Spider Template (currently available only for product data). This template simplifies and automates the traditional complexities of a web scraping project.

Smart Spider Templates equip you to scrape websites rapidly and build self-healing spiders that adapt to changes, without writing crawling logic or XPath/CSS selectors. The templates place no limits on what you can do with Scrapy and allow for extensive customization.

  1. Scrapy Spider Templates
    zyte-spider-templates: This library contains Scrapy spider templates, or automatic crawlers. They can be used out of the box with Zyte features such as Zyte API, or modified to run standalone. The AI-powered spider templates automate and streamline the web scraping process, enabling quick and easy setup by automatically recognizing navigation patterns and handling pagination. Whatever you can achieve with Scrapy, you can accomplish within a template: altering crawling patterns, filtering results, adding arguments, processing data, and exporting it.

    zyte-spider-templates-project: This is a starter template for a Scrapy project, with built-in integration with Zyte technologies (scrapy-zyte-api, zyte-spider-templates). We will use this template in the project.
  2. Self-Healing AI capabilities of the Spider Template System:
    Crawling
    Zyte API’s AI capabilities eliminate the burden of writing and maintaining custom crawling code for each website, enabling scalability. The AI is trained to recognize both common and uncommon navigation patterns, relieving developers from the task of figuring out site navigation, URL structures, and pagination. It works out of the box and can be easily customized using Python Scrapy code if desired. Zyte’s Automated ban management is another benefit of AI crawling. The system automatically avoids bans by utilizing the minimum infrastructure required.
    Extraction
    AI also solves the challenge of parsing and extracting data by learning to read pages like a human in real-time. It autonomously finds the required data without explicit instructions, making the process fast and immune to site changes. By providing an e-commerce page URL, the AI delivers product details in a legally compliant schema, which can be further extended with manual rules if desired.
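To make the extraction output concrete: each product the AI extracts arrives as a structured record you can consume directly from Python. The field names below (`name`, `price`, `currency`) are assumptions for illustration only, not the exact schema Zyte returns:

```python
import json

# Hypothetical example of the kind of product record AI extraction emits;
# the real Zyte product schema may use different field names.
SAMPLE_RECORD = (
    '{"url": "https://example.com/p/1", "name": "Desk Lamp", '
    '"price": "24.99", "currency": "USD"}'
)


def parse_product(jsonl_line):
    """Parse one JSON Lines record into a Python dict."""
    return json.loads(jsonl_line)


product = parse_product(SAMPLE_RECORD)
print(product["name"], product["price"])
```

Because every site yields the same schema, downstream code like this works unchanged no matter which of the 30 websites a record came from.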

So, how does the template system work? Let’s understand through this small project.

Prerequisites

  1. Sign up for Zyte’s AI Scraping Tool, here.
  2. Take note of your Zyte API key.
  3. An understanding of Scrapy.

Setting up a product data extraction project from 30 e-commerce websites using Zyte’s AI Tech Stack.

You can also download the entire code here. {insert a github}

Here is a step-by-step guide to running the e-commerce spider template:

  1. Open the terminal and clone zyte-spider-templates-project:
git clone https://github.com/zytedata/zyte-spider-templates-project

2. Rename the zyte_spider_templates_project folder to a valid Python module name (e.g. use _ instead of - or spaces).

3. cd <your-project-name>

4. Update these files with your project name:

  • scrapy.cfg
  • settings.py
    BOT_NAME, SPIDER_MODULES, NEWSPIDER_MODULE, SCRAPY_POET_DISCOVER

5. Add your `ZYTE_API_KEY` to settings.py.
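Steps 4 and 5 amount to edits along these lines; the module name my_project and the exact values are illustrative, so adapt them to your renamed folder:

```python
# settings.py (illustrative excerpt; replace my_project with your module name)
BOT_NAME = "my_project"
SPIDER_MODULES = ["my_project.spiders"]
NEWSPIDER_MODULE = "my_project.spiders"
SCRAPY_POET_DISCOVER = ["my_project.pages"]

# From step 5; keep real keys out of version control (e.g. read from an env var)
ZYTE_API_KEY = "YOUR_API_KEY"
```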

6. Remove or replace the LICENSE and README.rst files.

7. Delete .git, and start a fresh Git repository:

rm -rf .git
git init
git add -A
git status
git commit -m "Initial commit"

8. Create a Python virtual environment and install requirements.txt into it:

python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt

9. You can now run the e-commerce spider with a URL of your choice:

scrapy crawl ecommerce -a url="<enter the URL>"

10. Time to set up 30 e-commerce spiders.

Create two files, domains.txt and run30sites.py, and paste the following code into them.
The code in run30sites.py reads URLs from domains.txt (containing 30 different e-commerce domain URLs) and runs a Scrapy e-commerce template spider for each domain.

domains.txt

wayfair.ca
kleinanzeigen.de
lowes.com
tesco.com
shein.com
allegro.pl
adidas.com
otto.de
idealo.de
homedepot.com
xxxlutz.at
kroger.com
ubereats.com
direct.leaseplan.co.uk
target.com
farfetch.com
shein.co.uk
leroymerlin.fr
tesco.ie
xxxlutz.ch
wayfair.com
118118.com
kick.com
justeat.com
sainsburys.co.uk
dba.dk
shopee.sg
rappi.com.br
myvue.com
doordash.com
cars.com
finde-offen.de
tori.fi
grubhub.com

run30sites.py

import os
import subprocess


def main():
    processes = []
    # Create the output directory if it does not exist yet
    os.makedirs("output", exist_ok=True)
    with open("domains.txt") as input_file:
        for domain in input_file:
            domain = domain.strip()
            if not domain:
                continue
            # Launch one Scrapy crawl per domain, writing JSON Lines output
            processes.append(
                subprocess.Popen(
                    [
                        "scrapy",
                        "crawl",
                        "ecommerce",
                        "-a",
                        "max_requests=50",
                        "-a",
                        f"url=https://{domain}/",
                        "-o",
                        f"output/{domain}.jsonl",
                    ]
                )
            )
    # Wait for every spider process to finish
    for process in processes:
        process.wait()


if __name__ == "__main__":
    main()
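Once all the spiders finish, each domain’s products sit in their own output/<domain>.jsonl file. A small helper like this (illustrative, not part of the template project) can combine them into a single dataset:

```python
import glob
import json
import os


def merge_outputs(output_dir="output", merged_path="all_products.jsonl"):
    """Concatenate every per-domain JSON Lines file into one file,
    tagging each record with the domain it came from. Returns the
    number of records written."""
    count = 0
    with open(merged_path, "w") as merged:
        for path in sorted(glob.glob(os.path.join(output_dir, "*.jsonl"))):
            # File name (minus .jsonl) doubles as the source domain
            domain = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    record["source_domain"] = domain
                    merged.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

Because every spider writes the same schema, the merged file is immediately usable for analysis or loading into a database.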

How to scale this project to hundreds or thousands of e-commerce websites.
With the provided code, you can set up an impressive number of spiders, whether it’s 30, 100, or even 1,000, in just 10 minutes: simply add the URLs to `domains.txt`. The speed, quality, and efficiency of this process can substantially boost your productivity in web scraping tasks.

This approach lets developers save valuable time and swiftly deploy spiders for extracting product data from many sources. The streamlined spider creation process and faster development cycles make it a genuinely enjoyable tool to work with.

Conclusion

Zyte’s mission is to connect data enthusiasts with quality, reliable web data at any scale, without coding hassles, bans, or breaking spiders.

With this Spider Template System supercharged by Zyte’s AI Stack, Zyte is one step closer to fulfilling that mission. Of course, we won’t stop inventing and optimising, so stay tuned :P

Get ready to revolutionize your web scraping endeavours with Zyte’s AI Scraping Tool. Let us know what you think about this new approach to web scraping projects, or reach out on our Discord if you need coding help: join and connect with thousands of industry leaders on the Extract Data Community!

Happy Scraping and Love to all :)
Neha, Developer Advocate, Zyte

Read the next blog: Tailoring the Solution: Customization Possibilities and Implementation Guide For Spider Templates.
