A Brief History of the Evolution of the Web Scraping Tools at Zyte.

Neha Setia Nagpal
4 min read · Mar 29, 2024


Before you dive in: this is the first part of the blog series Understanding and Implementing Scrapy Spider Templates Powered by Zyte’s AI Scraping Tool.
These blogs are also supporting articles for the hands-on workshop How to Build 30 Scrapy Spiders to Scrape Product Data in Under 30 Minutes. If you prefer learning through video, you can watch the recording here.

“If you truly want to understand the present or yourself, you must begin in the past. You see, history is not simply the study of the past. It is an explanation of the present.” (Paul Hunham, from the movie The Holdovers)

Let me tell you the story of a developer named Lucas. Lucas has roughly ten years of experience in the web scraping industry, and his journey spans three broad phases: the manual phase, the automation phase, and the optimisation phase.

Here is how each phase unfolded:

  1. The Manual Phase using Scrapy
    In this phase, Lucas diligently wrote every line of code for his web scraping projects. This included crafting the crawling logic, implementing rotating proxy solutions, and driving browser automation tools like Splash, Puppeteer, and Selenium. Lucas also had to invest time in researching and manually upgrading his stack to keep up with increasingly sophisticated anti-bot software.
    This manual approach posed numerous challenges. Writing the spiders alone consumed several days of work, not to mention the additional time required to write the crawling code and selectors and to work around bans. The constant need to adjust both selectors and ban handling whenever a website’s layout or anti-bot measures changed added to the complexity. The arduous process of creating site-specific code and integrating multiple tools not only made scaling difficult but also increased costs and maintenance overhead.
  2. The Automation Phase using Scrapy and Zyte API
    Lucas experienced a significant reduction in workload thanks to the integration of Zyte API into his web scraping system. With Zyte API automatically handling bans and proxy management, and supporting browser automation through its actions capability, Lucas was able to focus primarily on writing the crawling and extraction logic in his Scrapy spiders. This eliminated the need to configure each spider against bans or to manage rotating proxies, resulting in streamlined operations.
    Integrating Zyte API also eliminated the need for multiple tools, reducing maintenance overhead and improving monitoring capabilities. Despite these advancements, however, challenges remained.
    Writing spiders manually still posed a hurdle, so scaling the system proved difficult. While the system could self-heal against anti-bot upgrades, it lacked the adaptability to seamlessly handle website layout changes.
  3. The Automation Phase using Scrapy and Zyte API with AI-powered extraction
    With the integration of patented machine learning into Zyte API, Lucas could now enjoy a range of benefits. The system offers supervised machine learning that surpasses large language models in speed, accuracy, and cost-effectiveness. Its high accuracy is further enhanced by the ‘quick fixes’ process, where the support team applies specialized hints to optimize the model on a per-site basis.
    Moreover, because the models are economical enough to run on every page at runtime, the system self-heals when website layouts change, which means no more writing selectors manually and effectively addresses the maintenance challenges.
    Additionally, the system provides compliance confidence out of the box by employing standard schemas that automatically exclude sensitive data such as personally identifiable information (PII) and copyrighted content. These features ensure that Lucas can extract data quickly, accurately, and in a compliant manner. The sketch after this list contrasts the manual approach with this AI-powered one.
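
To make that contrast concrete, here is a minimal sketch of both styles of spider. It assumes a project already configured with the scrapy-zyte-api plugin and a ZYTE_API_KEY setting; the spider names, URLs, and selectors are hypothetical.

import scrapy


class ManualProductSpider(scrapy.Spider):
    """Phase 1: every selector is hand-written and breaks when the layout changes."""
    name = "manual_products"
    start_urls = ["https://example.com/products"]  # hypothetical site

    def parse(self, response):
        for card in response.css("div.product-card"):  # hand-crafted selectors
            yield {
                "name": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Pagination, proxies, and ban handling also had to be built by hand.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


class AIProductSpider(scrapy.Spider):
    """Phase 3: Zyte API handles bans and proxies, and the 'product' flag asks
    its AI-powered extraction for a structured record, so no selectors at all."""
    name = "ai_products"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products/123",  # hypothetical product page
            meta={"zyte_api_automap": {"product": True}},
        )

    def parse(self, response):
        # The structured product arrives in the raw Zyte API response.
        yield response.raw_api_response["product"]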

The Optimisation Phase using the Scrapy Spider Template System, composed of the Zyte AI Stack: Scrapy + Zyte API (with AI-powered extraction capabilities)

During the Optimization Phase, Lucas leverages the Scrapy Spider Template System powered by the Zyte AI Stack, which combines Scrapy and Zyte API with AI-powered extraction capabilities. This integrated template system revolutionizes the web scraping process for Lucas.

With the template system, Lucas no longer needs to write spiders or selectors, or to manually configure them for bans, site upgrades, or proxy management. Instead, Lucas runs the template with the command
"scrapy crawl ecommerce -a url='<enter your URL>'"
to effortlessly extract product data. The template eliminates the need for manual coding and streamlines the entire data extraction process.
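
For example, assuming the zyte-spider-templates package that provides this ecommerce spider is installed in the project and a Zyte API key is configured, a run that also saves the extracted products to a JSON Lines file (the filename is just an illustration) could look like
"scrapy crawl ecommerce -a url='<enter your URL>' -O products.jsonl"
where -a passes the url argument to the spider and -O is Scrapy's standard flag for writing an output feed, overwriting any existing file.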

Next Steps

  1. Sign up for AI for Web Scraping. Try it and let us know what you think on the Extract Data Discord.
  2. Read the next blog in the series: AI web scraping: scraping product data from 30 e-commerce websites in under 30 minutes.
  3. Join the Extract Data Community. We’ve established a vibrant Discord community for web scraping enthusiasts like yourself, dedicated to sharing insights, learning new technologies, and advancing in web scraping. I am looking forward to seeing you there :)


Neha Setia Nagpal

Developer Advocate @Zyte, generally journaling on web scraping, machine learning, NLP, developer advocacy, systems thinking, and on being a young mom :)