Scraping & Crawling Data on the Interweb

We’ll get to visualizations another day.

Definitions of key terms:

  • Crawler. Fetches web pages: starting from a given address, it downloads whatever the pages it visits link to (see the sketch after this list).
  • Scraper. Extracts data from the downloaded pages (data formatted for display) so it can be stored in a database and manipulated as desired.
  • robots.txt. A file, usually in a site’s root directory, that specifies whether and how crawlers should treat the site; it may list URLs a crawler shouldn’t visit.
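
To make the crawler/scraper split concrete, here is a minimal Scrapy spider sketch. The start URL, CSS selectors, and field names are assumptions made up for illustration, not taken from any real site, so swap them for selectors that match the pages you actually want to scrape.

    import scrapy


    class CatalogueSpider(scrapy.Spider):
        """Hypothetical spider: the start URL and selectors below are made up."""
        name = "catalogue"
        start_urls = ["https://example.com/catalogue/"]

        def parse(self, response):
            # Scraper part: pull structured data out of the downloaded page.
            for product in response.css("article.product"):
                yield {
                    "title": product.css("h3 a::attr(title)").get(),
                    "price": product.css("p.price::text").get(),
                }
            # Crawler part: follow the "next page" link and parse that page too.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as catalogue_spider.py, this can be run with scrapy runspider catalogue_spider.py -o items.json, which writes everything parse yields into a JSON file.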

I used Scrapy for this exercise.

You’ll need to install Python if you don’t have it yet.
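
Scrapy itself is a Python package, so once Python is in place it installs with pip (pip install scrapy). A quick sanity check that the install worked is to import it and print its version:

    import scrapy

    # If this runs without an ImportError, Scrapy is installed and importable.
    print(scrapy.__version__)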

I followed the documentation below. It is pretty dope and flows smoothly.

Takes a while to get through but so worth the time and effort.

Step away from your desk if you must but then get right back at it.

Seeking a second opinion if something doesn’t make sense right away also helps a lot.

Happy scraping! Just be sure to review every site’s terms of use and respect its robots.txt file. Also stick to ethical scraping practices: don’t flood a site with a large number of requests in a short span of time.
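
Scrapy has settings that cover both points. The values below are only a sketch of the kind of polite-crawling configuration you might start from, not recommendations from the Scrapy docs; tune them to the site you’re working with.

    # settings.py (sketch; the specific numbers are illustrative)
    ROBOTSTXT_OBEY = True               # skip URLs the site's robots.txt disallows
    DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep simultaneous requests per site low
    AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down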


Some cool stuff happening around:

“You only know who is swimming naked when the tide goes out.” (Warren Buffett)