How to scrape code from Medium using Python
Medium is filled with code snippets. Wouldn’t it be convenient to be able to extract and save them? If you answered yes, keep reading…
Libraries
The tools I will use are Selenium, Beautiful Soup and requests.
Selenium is a framework for controlling a browser programmatically. It was originally built for testing web interfaces, but it has become popular for scraping websites that load content dynamically with JavaScript.
Beautiful Soup is a library for parsing HTML and XML to extract information of interest. Everything that can be done with Beautiful Soup can also be done with Selenium, but I use Beautiful Soup for the parsing here because I find it easier to work with.
Requests is a simple library for performing HTTP requests.
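To make "parsing HTML to extract information of interest" concrete, here is a minimal Beautiful Soup sketch. The HTML string, the tag names and the URL are made up for illustration and have nothing to do with Medium's actual markup.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only to illustrate the API.
html = """
<article>
  <h1>Scraping 101</h1>
  <a href="https://example.com/docs">Docs</a>
</article>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1").get_text()  # text inside the first <h1>
link = soup.find("a")["href"]       # value of the href attribute of the first <a>
```

Here `find` returns the first matching element, while `find_all` would return every match.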
Steps
The scraping process will include the following steps:
- Load the site using Selenium to run the JavaScript and show all dynamically generated content (that is, the code snippets).
- Use Beautiful Soup to find the snippets.
- Extract the code snippets and save them in a dictionary.
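The three steps above can be sketched roughly as follows. This is only an outline under a few assumptions: that Chrome and a matching driver are available to Selenium, that the snippets sit in `<pre>` elements, and that keys like `snippet_1` are an acceptable way to label them. The function names are hypothetical helpers, not part of any library.

```python
from bs4 import BeautifulSoup


def load_rendered_html(url):
    """Step 1: open the page in a real browser so its JavaScript runs."""
    # Imported here so the parsing helpers below work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        return driver.page_source  # HTML after the browser has loaded the page
    finally:
        driver.quit()  # always close the browser, even if loading fails


def find_snippets(html):
    """Step 2: find the snippet elements (assumed here to be <pre> tags)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("pre")


def snippets_to_dict(elements):
    """Step 3: store each snippet's text in a dictionary under a generated key."""
    return {f"snippet_{i}": el.get_text() for i, el in enumerate(elements, start=1)}


def scrape_snippets(url):
    """Run all three steps for a single page."""
    return snippets_to_dict(find_snippets(load_rendered_html(url)))
```

Keeping the browser step separate from the parsing step means the parsing can be tried out on a saved HTML file without launching a browser at all.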