Scraping & Crawling Data on the Interweb
We’ll get to visualizations on a later day
Definitions of terms:
- Crawler. Fetches web pages: starting from a seed address, it downloads whatever the pages it visits link to.
- Scraper. Extracts data from downloaded pages (i.e. data formatted for display) so it can be stored in a database and manipulated as desired.
- robots.txt. A file usually found at a site’s root. It specifies if and how crawlers should treat that site, and may list URLs a crawler shouldn’t visit.
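To make the robots.txt idea concrete, here is a small sketch (not from the post) of how a polite crawler checks a site’s rules before fetching a page, using only Python’s standard library. The robots.txt body and URLs are made up for illustration; a real crawler would fetch https://example.com/robots.txt instead of parsing a hardcoded string.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything is allowed except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The crawler consults the parsed rules before downloading each URL.
print(parser.can_fetch("*", "https://example.com/public/page.html"))   # allowed
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # disallowed
```

Scrapy does this check for you when `ROBOTSTXT_OBEY` is enabled in its settings, but it’s worth seeing what happens under the hood.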
I used Scrapy for this exercise.
You’ll need to install Python if you don’t have it yet.
I followed the documentation below. It is pretty dope and flows smoothly.
Takes a while to get through but so worth the time and effort.
Step away from your desk if you must but then get right back at it.
Seeking a second opinion if something doesn’t make sense right away also helps a lot.
Some cool stuff happening around the world:
- Ghana launches its first satellite into space
- Cytonn Investments (talk about organizations/companies that’ll outlive their founders)
- Kenya performs heart valve replacement without anesthetic
“You only know who is swimming naked when the tide goes out.” _Warren Buffett.