Crawl4AI: Automating Web Crawling and Data Extraction for AI Agents

Richardson Gunde
Jun 29, 2024


Introduction
Crawl4AI is an open-source tool that streamlines web crawling and data extraction for AI agents. It automates tasks that were once time-consuming and laborious, empowering developers to build intelligent agents that gather and analyze information effectively.

Features of Crawl4AI
Crawl4AI offers a robust set of features that streamline web crawling and data extraction:

Open Source and Free: Crawl4AI is free to use, so developers can adopt it without any financial barriers.
AI-Powered: Crawl4AI leverages LLMs to automatically identify and parse page elements, saving time and effort.
Structured Output: Crawl4AI converts extracted data into structured formats such as JSON and Markdown for easy analysis.
Versatile Functionality: Crawl4AI supports scrolling, multi-URL crawling, media tag extraction, metadata extraction, and screenshot capture (see the sketch after this list).
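
To get a feel for those outputs, here is a minimal sketch. It assumes the crawl result exposes markdown, metadata, media, and screenshot attributes and that run() accepts a screenshot flag; attribute names have changed across Crawl4AI versions, so verify against your installed release.

import base64
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# screenshot=True and the attributes below are illustrative; check your installed version
result = crawler.run(url="https://example.com", screenshot=True)

print(result.markdown[:500])  # page content converted to Markdown
print(result.metadata)        # page metadata (title, description, ...), if exposed
print(result.media)           # extracted media tags (images, videos), if exposed

if getattr(result, "screenshot", None):  # base64-encoded screenshot, if captured
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))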

Step-by-Step Guide to Using Crawl
Step 1: Installation and Setup

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
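
To confirm the installation, a quick import check is enough (the __version__ attribute is assumed here; an error-free import is the main signal):

import crawl4ai
print(getattr(crawl4ai, "__version__", "crawl4ai imported successfully"))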

Step 2: Data Extraction

Create a Python script that initializes the web crawler and extracts data from a URL:

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content
print(result.markdown)
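
The same crawler instance can be reused across pages. As a minimal sketch (the URL list and file naming here are illustrative), several pricing pages mentioned later in this post can be crawled in a loop and their Markdown saved to disk:

from pathlib import Path
from crawl4ai import WebCrawler

urls = [
    "https://openai.com/api/pricing/",
    "https://www.anthropic.com/pricing",
    "https://cohere.com/pricing",
]

crawler = WebCrawler()
crawler.warmup()

for i, url in enumerate(urls):
    result = crawler.run(url=url)
    # Write each page's Markdown to a numbered file (naming is illustrative only)
    path = Path(f"page_{i}.md")
    path.write_text(result.markdown, encoding="utf-8")
    print(f"Saved {url} -> {path}")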

Step 3: Data Structuring using LLM

Use an LLM (Large Language Model) to define the extraction strategy and convert the extracted data into a structured format:

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
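
result.extracted_content is typically returned as a JSON string; assuming that format, it can be loaded back into the Pydantic model for further processing:

import json

# Assumes extracted_content is a JSON array of objects matching the OpenAIModelFee schema
items = json.loads(result.extracted_content)
models = [OpenAIModelFee(**item) for item in items]
for m in models:
    print(f"{m.model_name}: input {m.input_fee}, output {m.output_fee}")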

Step 4: Integration with AI Agents

Integrate Crawl4AI with PraisonAI agents (using the CrewAI framework) for efficient data processing:

pip install praisonai

Create a tool file (tools.py) that wraps the Crawl4AI crawler as a PraisonAI tool.

# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()

        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    # Test the ModelFeeTool standalone before wiring it into the agents
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)

Configure the AI agents to use the Crawl4AI tool for web scraping and data extraction. The agents, tasks, and tools are defined in a YAML configuration file:

framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    backstory: An expert in web scraping with a deep understanding of extracting structured
      data from online sources. https://openai.com/api/pricing/ https://www.anthropic.com/pricing https://cohere.com/pricing
    goal: Gather model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the provided list of websites.
        expected_output: Raw HTML or JSON containing model pricing data.
    tools:
    - 'ModelFeeTool'
  data_cleaner:
    backstory: Specialist in data cleaning, ensuring that all collected data is accurate
      and properly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data to remove any duplicates and inconsistencies,
          and convert it into a structured format.
        expected_output: Cleaned and organized JSON or CSV file with model pricing
          data.
    tools:
    - ''
  data_analyzer:
    backstory: Data analysis expert focused on deriving actionable insights from structured
      data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data to extract trends, patterns, and insights
          on model pricing.
        expected_output: Detailed report summarizing model pricing trends and insights.
    tools:
    - ''
dependencies: []
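
With tools.py and this YAML configuration (commonly saved as agents.yaml) in the same directory, the crew can be launched from the command line; the PraisonAI CLI accepts the agent file as an argument:

praisonai agents.yaml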

AI Agent Example
The example PraisonAI agents perform web scraping, data cleaning, and data analysis on the data extracted by Crawl4AI. The agents work together to extract pricing information from multiple websites and produce a detailed report summarizing the findings.

Conclusion
Crawl4AI is a powerful tool that enables AI agents to perform web crawling and data extraction with greater efficiency and accuracy. Its open-source nature, AI-powered capabilities, and versatility make it an invaluable asset for developers building intelligent, data-driven agents.

Share your thoughts: Leave a comment below on how you plan to use Crawl4AI in your projects.

Share this post: Help spread the word about Crawl4AI by sharing it on social media or with others who may find it useful.
