Popular Scraping Libraries

Rinat S, PhD
Product AI
Published Jul 8, 2021 · 1 min read

One way of obtaining data is “scraping”, which can be done with tools such as Selenium, Jsoup, and others. A notable feature of these tools is the need to simulate a user’s activity on web pages, which requires designing an additional algorithm that defines which actions to perform and how deep to follow them. In return, this approach can yield very large amounts of data for further analysis. Data preprocessing is one of the most important and time-consuming tasks in machine learning, and it is divided into several sub-stages, as presented on the slide.

Selenium. Supports multiple languages. www.seleniumhq.org
Beautiful Soup. Python. www.crummy.com/software/BeautifulSoup
Scrapy. Python. www.scrapy.org
Jsoup. Java. www.jsoup.org
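
To make the idea concrete, here is a minimal sketch of the core scraping step: pulling structured values (link targets) out of raw HTML. It deliberately uses only Python's standard-library `html.parser` so that it runs without extra installs; the libraries listed above offer richer selectors for the same job. The names `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page content, used here in place of a real download
SAMPLE_HTML = """
<html><body>
  <a href="/page/1">First</a>
  <p>No link here</p>
  <a href="/page/2">Second</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
```

In a real scraper the HTML string would come from a downloaded page, and the collected links would typically feed the next round of requests.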

Important considerations:

  • Different web content shows up depending on the web browser used.
    The scraper may need different “web drivers” (e.g., in Selenium) or browser “user agents”.
  • Data may show up only after certain user interactions (e.g., button clicks).
    The scraper may need to simulate these actions.
    Selenium supports more actions:
    www.discoversdk.com/blog/web-scraping-with-selenium
    Beautiful Soup is an HTML parser and does not drive a browser, so it cannot perform such actions on its own.
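
The first consideration, presenting a different browser identity to the server, can be sketched with the standard library alone. The example below only builds the request and does not send it. The user agent strings and the names `USER_AGENTS` and `build_request` are assumptions made for illustration; real browsers send longer, versioned values.

```python
from urllib.request import Request

# Hypothetical user agent strings; real browsers send longer, versioned values
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (X11; Linux x86_64) ExampleDesktop/1.0",
    "mobile": "Mozilla/5.0 (Linux; Android 10) ExampleMobile/1.0",
}


def build_request(url, profile="desktop"):
    """Create a Request for `url` that identifies itself as the chosen profile."""
    headers = {"User-Agent": USER_AGENTS[profile]}
    return Request(url, headers=headers)


request = build_request("https://example.com/", profile="mobile")
# urllib stores header names capitalised as "User-agent"
print(request.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual download. Simulating clicks is a different matter: it needs a tool that drives a real browser, such as Selenium.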

Rinat S, PhD
Product AI

Doctor of Technical Sciences, Associate Professor, Professor of the Department of Engineering Cybernetics