Scraping Dynamic and Static Data Using Spiders: Where Scraping Starts

MuhammadAnfaal
8 min read · Feb 11, 2024


Do you want to learn how to make a spider that can scrape data from static or dynamic sites? You can, it just takes patience and consistent practice, and you will come to love spiders even if real ones scare you.

Types of websites:

There are two types of websites from which you will mostly extract data.

  1. Dynamic
  2. Static

Static sites are comparatively easy to extract data from because they maintain their structure and mostly consist of plain HTML and CSS. Your first task is to observe the website's structure; then you can proceed further.
If you are wondering how to see what a website's HTML and CSS look like, it is simple:

  1. Go to the website, right-click anywhere on the page, and choose the Inspect option.
  2. A side panel will open showing all the tags used on that page.

Now you can see its HTML and CSS quite easily!

Now the real deal starts: building the scraper.

Scraper for a Static Website:

Requirements:

  1. You need Python and PyCharm installed on your PC. You can download PyCharm from this link: https://www.jetbrains.com/pycharm/download/?section=windows
  2. Make sure Python is installed on your PC and that you add it to the PATH during installation; otherwise you can run into issues.
  3. Now that you have Python and PyCharm, it is time to set up your PyCharm project for scraping.
  4. For static websites like Politifact you can just use Scrapy and BeautifulSoup, which come as Python libraries. You can install them with the commands:
pip install Scrapy
pip install beautifulsoup4

5. After setting up the complete project in PyCharm, you will see several folders and files in your sidebar (a typical layout is sketched after these steps).

6. Now create a Python file, with your desired name and a .py extension, inside the spiders folder.

Now the demo.py file has been created, and we are going to dive into the coding part.

7. Before diving into the more complex coding part, I recommend you visit https://scrapy.org/ for a basic understanding of how a scraper works; it also has a demo of a simple project that will be beneficial for you as a beginner.

8. Assuming you have visited the link and now have a basic understanding of what a scraper is and how it works, let's take a slightly more complex static website and make a scraper for it.

9. The website I am going to scrape is Politifact.com.
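Before we get to the code, here is roughly what the project from step 5 looks like on disk. This is only a sketch of the default layout that scrapy startproject generates; the project name politifact_scrapy is a placeholder, so substitute your own.

scrapy startproject politifact_scrapy
cd politifact_scrapy

politifact_scrapy/
    scrapy.cfg
    politifact_scrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            demo.py   <- the spider file you create in step 6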

From here on I will just follow the steps mentioned above. Keep in mind that I have already created the project and set up my PyCharm accordingly, so I am going straight to the code. One thing I personally like to note: when you go to scrape data, you have to reference elements somehow, usually through CSS selectors or other methods such as XPath. I find XPath easy to use, but it is your choice; either way is correct. In this example I use CSS selectors (a short comparison of the two is shown below).
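To make the difference concrete, here is the same element selected both ways inside a Scrapy callback. The CSS line is what the spider below actually uses; the XPath line is only my illustration of an equivalent query, so treat it as an assumption about the page's markup rather than part of the project.

# CSS selector (used in the spider below)
title_css = response.css("div.m-statement__quote::text").get()

# Roughly equivalent XPath (illustration only)
title_xpath = response.xpath('//div[contains(@class, "m-statement__quote")]/text()').get()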

First of all, I analyzed the website's structure. Its HTML is a bit complex, but the full code is given below; I recommend you try it yourself first and only then look at my code.

import scrapy
from bs4 import BeautifulSoup

I have used Scrapy and BeautifulSoup, since, as mentioned, these are all we need for static websites. I import them at the top and then use them in the spider.
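Before the full spider, here is a tiny standalone illustration of how BeautifulSoup pulls elements out of HTML by tag and class. The sample markup is made up for this demo; it just mimics the m-teaser structure the spider targets on the real listing pages.

from bs4 import BeautifulSoup

# Made-up sample markup mimicking Politifact's article teasers
sample_html = """
<div class="m-teaser">
  <h3 class="m-teaser__title"><a href="/article/example/">Example headline</a></h3>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for teaser in soup.find_all("div", class_="m-teaser"):
    link = teaser.find("h3", class_="m-teaser__title").find("a")
    print(link.text, "->", link["href"])

With that in mind, here is the full spider: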

class PolitifactSpider(scrapy.Spider):
    name = 'politifact'
    # Crawl the first ten pages of the article list
    start_urls = ['https://www.politifact.com/article/list/?page={}'.format(page) for page in range(1, 11)]

    def parse(self, response):
        # Parse the listing page with BeautifulSoup and collect the article links
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all("div", class_="m-teaser")
        article_links = ["https://www.politifact.com" + article.find("h3", class_="m-teaser__title").find("a")["href"] for article in articles]

        # Follow each article link and parse it with parse_article
        for link in article_links:
            yield scrapy.Request(url=link, callback=self.parse_article)

    def parse_article(self, response):
        article = {}
        # Extract relevant details using Scrapy selectors
        article["title"] = response.css("div.m-statement__quote::text").get().strip()

        # Extract and join the content into a string
        article["content"] = ' '.join(response.css("article.m-textblock *::text").getall()).strip()

        # Extract and join the images into a list
        article["images"] = response.css("article.m-textblock img::attr(src)").getall()

        # Extract author details
        authors = []
        for author in response.css("div.m-author"):
            a = {}
            if author.css("img"):
                a["avatar"] = author.css("img::attr(src)").get()
            a["name"] = author.css("div.m-author__content.copy-xs.u-color--chateau a::text").get().strip()
            a["profile"] = "https://www.politifact.com" + author.css("div.m-author__content.copy-xs.u-color--chateau a::attr(href)").get()
            authors.append(a)
        article["authors"] = authors

        yield article

Here is the code you were waiting to read. Now, to run it you need a command in the terminal, but wait, you are missing something important! Guess what: you first have to move into the folder that contains your spider, which you can do with

cd name_of_folder
cd next_file
cd spiders
cd your_spider

After moving to the desired folder, your terminal prompt shows the full path you have navigated through. You can see what a long path my commands had to follow to reach the spider; yes, it is a long journey. Now run the command

scrapy crawl your_spider_name

After running this command you will get your data, but here is a tip: if you want to store the data in JSON or CSV form, add -o file_name.json (or .csv) to the command, for example as shown below.
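With the spider above (whose name attribute is 'politifact'), the full command could look like this; the output file name here is just an example.

scrapy crawl politifact -o politifact_articles.json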

In JSON form the output looks quite webby, what do you think?

Dynamic websites:

These types of websites are not as easy to extract data from. For these we use Selenium, which is different from Scrapy and is used for dynamic websites, especially infinitely scrollable sites like Twitter and Mastodon.
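If Selenium is not installed yet, both it and selenium-wire (which the scripts below import from) can be installed with pip, just like before:

pip install selenium
pip install selenium-wire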

You can find some reading about Selenium at https://selenium-python.readthedocs.io/

Before starting, please visit this link, and feel free to take help from any LLM tool like GPT or Hugging Face.

There are two websites I am going to scrape using Selenium; both are dynamic. The first is https://www.altnews.in/ and the second is https://mastodon.social/explore

I will provide the code, but please try it yourself first; it will enhance your learning. There are multiple ways to do the same thing, and if you just copy and paste, your learning will be limited. So let's move on and scrape AltNews first.

AltNews Scraper:

from seleniumwire import webdriver
from selenium.webdriver.common.by import By
import asyncio
import json

json_file_path = r'E:\Scrapping\Assignment_no_1\politifact_website\politifact_spider\politifact_scrapy\politifact_scrapy\spiders\AltNewsData.json'


async def extract_data(driver):
    # Scroll to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Give some time for the new content to load
    await asyncio.sleep(5)  # Adjust this delay if needed

    # Extract images
    image_elements = driver.find_elements(By.XPATH, '//div[@class= "thumb-w"]//img')
    image_urls = [img.get_attribute('src') for img in image_elements]

    # Extract text links
    text_link_elements = driver.find_elements(By.XPATH, '//h4[@class="entry-title pbs_e-t xs__h6 sm__h5"]//a')
    text_link_data = [element.get_attribute('href') for element in text_link_elements]

    # Extract author links
    author_link_elements = driver.find_elements(By.XPATH, '//span[@class="author vcard"]//a')
    author_text_data = [element.get_attribute('href') for element in author_link_elements]

    # Extract news types (note: this currently reuses the author XPath)
    news_type_elements = driver.find_elements(By.XPATH, '//span[@class="author vcard"]//a')
    news_type_data = [element.get_attribute('href') for element in news_type_elements]

    # # Extract videos
    # video_elements = driver.find_elements(By.XPATH, '//div[@class="video-player"]//video')
    # video_urls = [video.get_attribute('src') for video in video_elements]

    # Print or process the extracted data
    print("Image URLs:", image_urls)
    print("Text Link Data:", text_link_data)
    print("Author Link Data:", author_text_data)
    print("News Type Data:", news_type_data)
    # print("Video URLs:", video_urls)

    # Create a dictionary to store the extracted data
    extracted_data = {
        "Image URLs": image_urls,
        "Text Link Data": text_link_data,
        "Author Link Data": author_text_data,
        "News Type Data": news_type_data,
        # "Video URLs": video_urls
    }

    return extracted_data


async def main():
    # Create a new instance of the browser
    driver = webdriver.Chrome()

    # Navigate to the AltNews page
    url = 'https://www.altnews.in/'
    driver.get(url)

    # Set the maximum number of scrolls
    max_scrolls = 20

    # List to store results from each iteration
    results = []

    # Outer loop for scrolls
    for scroll_count in range(max_scrolls):
        # Scroll to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Run data extraction in parallel with an extra wait for new content
        extracted_data, _ = await asyncio.gather(extract_data(driver), asyncio.sleep(7))  # Adjust this delay if needed
        results.append(extracted_data)

    # Close the browser
    driver.quit()

    # Save results to a JSON file
    with open(json_file_path, 'w') as json_file:
        json.dump(results, json_file, indent=3)


# Run the event loop
asyncio.run(main())

A brief explanation: extracting data from a dynamic website is not that simple. One option is to separately download ChromeDriver from https://chromedriver.chromium.org/downloads and point your setup (for example, your settings.py options) at it. One important thing with this method is that you must choose a driver version that matches your Chrome version, otherwise it will not work. If you cannot find the exact version, you can choose the closest one and it may work.

You can check your Chrome version by clicking the vertical dots in the top-right corner of Chrome, going to Help, and selecting About Google Chrome; your version is shown there.

In my case I used selenium-wire, so I did not have to download ChromeDriver manually; it is also a Python library and it handled this for me. The rest is in the code above.
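If you prefer the manual route instead, the usual pattern in Selenium 4 is to point the browser at the downloaded driver through a Service object. This is just a sketch with a placeholder path; the scrapers above did not need it in my setup.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path: point this at the chromedriver you downloaded
# for your Chrome version.
service = Service(executable_path=r"C:\path\to\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get("https://www.altnews.in/")
driver.quit()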

Output:

AltNews Scraped Data

Mastodon Scraper:

Similar to the AltNews scraper, Mastodon's scraper code is:

from seleniumwire import webdriver
from selenium.webdriver.common.by import By
import asyncio
import json

json_file_path = r'E:\Scrapping\Assignment_no_1\politifact_website\politifact_spider\politifact_scrapy\politifact_scrapy\spiders\mastadon_data.json'


async def extract_data(driver):
    # Scroll to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Give some time for the new content to load
    await asyncio.sleep(5)  # Adjust this delay if needed

    # Extract images
    image_elements = driver.find_elements(By.XPATH,
        '//div[@class="media-gallery__item standalone media-gallery__item--tall media-gallery__item--wide"]/a[@class="media-gallery__item-thumbnail"]//img')
    image_urls = [img.get_attribute('src') for img in image_elements]

    # Extract text
    text_elements = driver.find_elements(By.XPATH,
        '//div[@class="status__content__text status__content__text--visible translate"]/p')
    text_data = [element.text for element in text_elements]

    # Extract videos
    video_elements = driver.find_elements(By.XPATH, '//div[@class="video-player"]//video')
    video_urls = [video.get_attribute('src') for video in video_elements]

    # Print or process the extracted data
    print("Image URLs:", image_urls)
    print("Text Data:", text_data)
    print("Video URLs:", video_urls)

    # Create a dictionary to store the extracted data
    extracted_data = {
        "Image URLs": image_urls,
        "Text Data": text_data,
        "Video URLs": video_urls
    }

    return extracted_data


async def main():
    # Create a new instance of the browser
    driver = webdriver.Chrome()

    # Navigate to the Mastodon explore page
    url = 'https://mastodon.social/explore'
    driver.get(url)

    # Set the maximum number of scrolls
    max_scrolls = 20

    # List to store results from each iteration
    results = []

    # Outer loop for scrolls
    for scroll_count in range(max_scrolls):
        # Scroll to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Run data extraction in parallel with an extra wait for new content
        extracted_data, _ = await asyncio.gather(extract_data(driver), asyncio.sleep(7))  # Adjust this delay if needed
        results.append(extracted_data)

    # Close the browser
    driver.quit()

    # Save results to a JSON file
    with open(json_file_path, 'w') as json_file:
        json.dump(results, json_file, indent=3)


# Run the event loop
asyncio.run(main())

Output:

One last but not least thing: for these Selenium scrapers you do not need to run the command

scrapy crawl spider_name

Instead, since we are using Selenium, you run the script directly with Python, for example:
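(The file name here is just a placeholder for whatever you saved your Selenium script as.)

python your_selenium_script.py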
