Advanced web scraping in Python
I have moved this article here; please take a look.
I recently created and open-sourced PyMedium, an unofficial Medium API that gives developers an easy way to access Medium.
One of PyMedium's APIs parses post content, and at first I tried to do this with plain web scraping. As usual when scraping, I started with “Inspect Element” in Chrome to find the tag pattern of the post content (on a Medium post page, right-click the title element and select Inspect Element):
The post content clearly lives inside
<div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"..., so I wrote some simple Python code to get the tag:
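A minimal sketch of what such a scraper might look like, assuming only the standard library (`urllib.request` and `html.parser`). The names `find_post_content`, `fetch_html`, and `TARGET_CLASS` are illustrative and are not taken from PyMedium; the post URL is passed on the command line.

```python
import sys
import urllib.request
from html.parser import HTMLParser

# Class name taken from the <div> seen in "Inspect Element".
TARGET_CLASS = "postArticle-content"


class PostContentFinder(HTMLParser):
    """Records the attributes of the first <div> that carries TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.found_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.found_attrs is not None:
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or have no value.
        classes = (attr_map.get("class") or "").split()
        if TARGET_CLASS in classes:
            self.found_attrs = attr_map


def find_post_content(html):
    """Return the attributes of the post content <div>, or None if absent."""
    finder = PostContentFinder()
    finder.feed(html)
    return finder.found_attrs


def fetch_html(url):
    """Download the raw HTML of a page; no scripts are executed."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python medium_scraper.py <post URL>
    print(find_post_content(fetch_html(sys.argv[1])))
```

Because `fetch_html` only downloads the document as the server sends it, `find_post_content` can only see tags that are present before any script runs.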
Then I ran it to print the tag:
(env)$ python medium_scraper.py
The output is
None. What happened? 🤔🤔🤔
I went back to Chrome to double-check that the tag I was searching for was correct. Then I tried another way to find the tag pattern: “View source” in Chrome (right-click on any page element and select View source):
I found a small difference between the results of Inspect Element and View source. There is no tag like <div class="postArticle-content ..."> in the page source.
Telling the two cases apart is not too difficult. The easy way is to let Chrome show the difference, using the two ways of finding the post content tag pattern that we used above:
- “Inspect Element” (right-click on the title element and select Inspect Element): shows the live DOM after the browser has executed the page's scripts.
- “View source” (right-click on any page element and select View source): shows the actual source code of the web page, without executing any scripts. This is what a simple web scraper gets.
OK, back to our program. In this situation, here is the problem:
Selenium or a similar web driver can help, because it loads the page in a real browser and runs its scripts. Here I use Selenium, a popular choice, which you have to download and install first. The following code uses Selenium to get the Medium post content tags:
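A minimal sketch of this step, assuming Selenium 4 with Chrome. The selector `div.postArticle-content` comes from the class seen in Inspect Element; the function name `get_post_content_html`, the explicit wait, and the 10-second timeout are my own illustrative choices rather than the exact PyMedium code.

```python
import sys

# CSS selector built from the class seen in "Inspect Element".
CONTENT_SELECTOR = "div.postArticle-content"


def get_post_content_html(url, timeout=10):
    """Open the URL in Chrome and return the inner HTML of the post content."""
    # Imported here so this file can be loaded even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until the scripts have added the content element to the DOM.
        content_element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, CONTENT_SELECTOR))
        )
        return content_element.get_attribute("innerHTML")
    finally:
        # Always close the browser, even if the element never appears.
        driver.quit()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python medium_scraper.py <post URL>
    print(get_post_content_html(sys.argv[1]))
```

The explicit wait is there because the content element is created by scripts, so it may not exist at the moment `driver.get` returns.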
A useful technique for getting the HTML out of Selenium once you have located an element is
content_element.get_attribute("innerHTML"). When you run the code, it opens Chrome, loads the URL you specify, and returns the post content tags for parsing.
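As a rough sketch of what the later parsing could look like, the inner HTML string can be fed to any HTML parser. This example assumes the post body is made of `<p>` elements, which is an assumption about Medium's markup and not something verified here; `ParagraphTextExtractor` and `extract_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collects the text of each <p> element, keeping text of nested inline tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(inner_html):
    """Return the non-empty paragraph texts found in an HTML fragment."""
    extractor = ParagraphTextExtractor()
    extractor.feed(inner_html)
    extractor.close()
    return extractor.paragraphs
```

For instance, `extract_paragraphs(content_element.get_attribute("innerHTML"))` would return a list of paragraph strings from the rendered post.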
OK, it’s done! Now I can continue with the rest of the parsing to get what I want from the Medium post. This is my repository:
Feel free to star my repository and like this post. ❤️