LLM Web Scraping with ScrapeGraphAI: A Breakthrough in Data Extraction

May 1, 2024

Unleash the Power of Large Language Models for Efficient Data Collection

Introduction: The Evolution of Web Scraping

In the dynamic realm of data-driven industries, extracting valuable insights from online sources is paramount. From market analysis to academic research, the demand for specific data fuels the need for robust web scraping tools. Traditionally, Python libraries like BeautifulSoup and Scrapy have been the go-to solutions, but they require programming expertise and selectors written by hand against each site's HTML structure.

# BeautifulSoup Example
from bs4 import BeautifulSoup
import requests

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)

# Scrapy Example
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        print(title)

Introducing ScrapeGraphAI: Simplifying Data Extraction

Enter ScrapeGraphAI, a groundbreaking Python library reshaping the landscape of web scraping. This innovative tool harnesses the power of Large Language Models (LLMs) and direct graph logic to streamline data collection. Unlike its predecessors, ScrapeGraphAI lets users describe the data they need in plain language, abstracting away the complexities of web scraping.

%%capture
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install scrapegraphai
!playwright install

# if you plan on using text_to_speech and GPT4-Vision models be sure to use the
# correct APIKEY
OPENAI_API_KEY = "YOUR API KEY"
GOOGLE_API_KEY = "YOUR API KEY"
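Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads a key from an environment variable instead. The name `load_api_key` is hypothetical and not part of ScrapeGraphAI, and the sketch assumes the variable was exported before the notebook was started.

```python
import os


def load_api_key(var_name: str, environ=os.environ) -> str:
    """Return the API key stored in the environment variable `var_name`.

    `load_api_key` is a hypothetical helper for this article, not a
    ScrapeGraphAI function. `environ` is injectable so it can be tested.
    """
    key = environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the scraper."
        )
    return key


# Usage (assumes the variable is set in the environment):
# OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
```

The rest of the examples work unchanged, since they only need `OPENAI_API_KEY` to hold a string.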

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}


smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()

import json

# Pretty-print the extracted result as indented JSON
print(json.dumps(result, indent=2))

SpeechGraph

SpeechGraph is a class representing one of the default scraping pipelines. It generates the answer together with an audio file, and it is similar to SmartScraperGraph but with the addition of the TextToSpeechNode node.

from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "website_summary.mp3",
}

# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Create a summary of the website",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
answer = result.get("answer", "No answer found")

import json

# Pretty-print the text answer as indented JSON
print(json.dumps(answer, indent=2))

from IPython.display import Audio, display

# Play the generated audio summary inside the notebook
wn = Audio(filename="website_summary.mp3", autoplay=True)
display(wn)

GraphBuilder (Experimental)

GraphBuilder is an experimental class that creates a scraping pipeline from scratch based on the user prompt. It produces a JSON description of the graph, containing the nodes and edges that define it, and lets you visualize the result with Graphviz. It knows which node types the library provides by default and connects them to help you reach your goal.
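To make the idea of a graph described by nodes and edges concrete, here is a small standalone sketch that turns such a description into Graphviz DOT text. The schema (`nodes` with `name` and `type`, `edges` with `from` and `to`), the sample node names, and the function `graph_json_to_dot` are all assumptions made for illustration. They are not the JSON format that GraphBuilder actually emits.

```python
# Hypothetical graph description; the real GraphBuilder output may differ.
SAMPLE_GRAPH = {
    "nodes": [
        {"name": "fetch", "type": "node"},
        {"name": "parse", "type": "node"},
        {"name": "generate_answer", "type": "node"},
        {"name": "text_to_speech", "type": "node"},
    ],
    "edges": [
        {"from": "fetch", "to": "parse"},
        {"from": "parse", "to": "generate_answer"},
        {"from": "generate_answer", "to": "text_to_speech"},
    ],
}


def graph_json_to_dot(graph: dict, name: str = "pipeline") -> str:
    """Render a nodes-and-edges dictionary as Graphviz DOT source text."""
    lines = [f"digraph {name} {{"]
    for node in graph["nodes"]:
        node_id, node_type = node["name"], node["type"]
        lines.append(f'    "{node_id}" [label="{node_id} ({node_type})"];')
    for edge in graph["edges"]:
        src, dst = edge["from"], edge["to"]
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


print(graph_json_to_dot(SAMPLE_GRAPH))
```

The printed text can be pasted into any Graphviz viewer to draw the pipeline as a directed graph.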

from scrapegraphai.builders import GraphBuilder

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

# Example usage of GraphBuilder
graph_builder = GraphBuilder(
    user_prompt="Extract the news and generate a text summary with a voiceover.",
    config=graph_config
)

graph_json = graph_builder.build_graph()

# Convert the resulting JSON to Graphviz format
graphviz_graph = graph_builder.convert_json_to_graphviz(graph_json)

# Save the graph to a file and open it in the default viewer
graphviz_graph.render('ScrapeGraphAI_generated_graph', view=True)

graph_json
graphviz_graph

How ScrapeGraphAI Works: A Closer Look

ScrapeGraphAI operates by interpreting user queries and intelligently navigating web content to fetch desired information. Leveraging LLMs, it autonomously constructs scraping pipelines, minimizing user intervention. This approach not only enhances efficiency but also reduces the barrier to entry, enabling users to focus on data analysis rather than technical intricacies.
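The pipeline idea can be illustrated with a toy directed graph in which each node reads a shared state, adds to it, and hands it to the next node. This is a simplified sketch under stated assumptions, not ScrapeGraphAI's implementation: the node functions, the canned HTML page, and the stand-in for the LLM call are all hypothetical and run entirely offline.

```python
import re
from typing import Callable, Dict, Optional

State = Dict[str, str]


def fetch(state: State) -> State:
    # Stand-in for an HTTP request: a canned page keeps the sketch offline.
    html = "<html><body><h1>Demo</h1><p>Hello, pipeline.</p></body></html>"
    return {**state, "html": html}


def parse(state: State) -> State:
    # Strip tags and collapse whitespace to get plain text.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def generate_answer(state: State) -> State:
    # Stand-in for the LLM call: it only combines the prompt and the text.
    return {**state, "answer": f"{state['prompt']} -> {state['text']}"}


# Nodes are functions; edges map each node to its successor.
NODES: Dict[str, Callable[[State], State]] = {
    "fetch": fetch,
    "parse": parse,
    "generate_answer": generate_answer,
}
EDGES: Dict[str, str] = {"fetch": "parse", "parse": "generate_answer"}


def run_pipeline(prompt: str, entry: str = "fetch") -> State:
    """Walk the graph from `entry`, passing the state through each node."""
    state: State = {"prompt": prompt}
    current: Optional[str] = entry
    while current is not None:
        state = NODES[current](state)
        current = EDGES.get(current)  # None once the last node has run
    return state


print(run_pipeline("Summarize the page")["answer"])
```

Swapping a node, or adding one such as a text-to-speech step, only requires changing the `NODES` and `EDGES` tables, which is the appeal of describing a scraper as a graph.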

Unlocking Efficiency with ScrapeGraphAI

With its ability to automate complex scraping tasks from a plain-language prompt, ScrapeGraphAI is a game-changer for professionals across industries. Whether monitoring competitors or conducting academic research, this tool empowers users to harness web data efficiently, though LLM-generated output should still be checked against the source page. As the digital landscape continues to evolve, ScrapeGraphAI emerges as an indispensable ally in driving data-driven decision-making forward.

Conclusion: Embrace the Future of Data Extraction

In a data-centric world, the importance of efficient data extraction cannot be overstated. ScrapeGraphAI represents a paradigm shift in web scraping, offering a user-friendly approach powered by cutting-edge technology. As businesses and researchers strive to stay ahead in a competitive environment, embracing tools like ScrapeGraphAI is essential for unlocking actionable insights and driving informed decisions.