Popular Scraping Libraries
One way to obtain data is “scraping”, which can be done with modern tools such as Selenium, Jsoup, and others. A notable feature of these tools is that they often have to simulate a user’s activity on web pages, which requires designing an additional algorithm that specifies which actions to perform and how deep to follow them. In return, this approach gives access to almost unlimited amounts of data for further analysis. Data preprocessing is one of the most important and time-consuming tasks in machine learning, and it is divided into several sub-stages, as presented on the slide.
Selenium. Supports multiple languages. www.seleniumhq.org
Beautiful Soup. Python. www.crummy.com/software/BeautifulSoup
Scrapy. Python. www.scrapy.org
Jsoup. Java. www.jsoup.org
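As a rough illustration of the parsing step these libraries perform, the sketch below pulls the links out of an HTML snippet. It deliberately uses only Python's standard-library html.parser rather than any of the libraries listed above, so that it runs without extra installs; Beautiful Soup and Jsoup offer more convenient APIs for the same task. The names LinkExtractor, extract_links, and SAMPLE_HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the list of (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/a">First</a>
  <p>No link here</p>
  <a href="/b">Second <b>item</b></a>
</body></html>
"""

for href, text in extract_links(SAMPLE_HTML):
    print(href, text)
```

In a real scraper the HTML string would come from an HTTP response or from a browser driven by Selenium, not from a constant.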
Important considerations:
- Different web content shows up depending on the web browser used.
Scraper may need different “web drivers” (e.g., in Selenium) or browser “user agents”.
- Data may show up only after certain user interactions (e.g., button clicks).
Scraper may need to simulate the actions.
Selenium supports more actions:
www.discoversdk.com/blog/web-scraping-with-selenium
Beautiful Soup supports some.
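The point about browser “user agents” can be sketched as follows: the same URL is requested under two different User-Agent headers, which a server may answer with different content. This is a minimal standard-library sketch, not Selenium's mechanism; the user-agent strings, the USER_AGENTS table, and build_request are placeholders invented for this example, and no network call is made until the request is actually opened.

```python
from urllib.request import Request

# Hypothetical user-agent strings; real browsers send longer, versioned ones.
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (X11; Linux x86_64) ExampleDesktopBrowser/1.0",
    "mobile": "Mozilla/5.0 (Linux; Android 13) ExampleMobileBrowser/1.0",
}


def build_request(url, profile):
    """Prepare a request for `url` that identifies itself as `profile`."""
    return Request(url, headers={"User-Agent": USER_AGENTS[profile]})


# One prepared request per profile, all for the same page.
requests_by_profile = {
    name: build_request("https://example.com/", name) for name in USER_AGENTS
}

for name, req in requests_by_profile.items():
    # urllib stores header names capitalized as "User-agent".
    print(name, "->", req.get_header("User-agent"))
```

Passing one of these objects to urllib.request.urlopen would perform the fetch. Simulating clicks and other interactions needs a browser-driving tool such as Selenium and is not covered by this sketch.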