Advanced web scraping in Python

I recently created and open-sourced an unofficial Medium API called PyMedium, which gives developers an easy way to access Medium.

One of the APIs in PyMedium parses post content, and in the beginning I simply tried to do this with a basic web scraping technique. Following the normal web scraping process, I started with “Inspect Element” in Chrome to find the tag pattern of the post content (right-click on the title element and select Inspect Element on a Medium post page):

Obviously, the post content lives under `<div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"...`, so I wrote some simple Python code to get the tag:
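The original snippet isn't reproduced here, but a minimal sketch of that first attempt, using requests and BeautifulSoup (the URL below is a hypothetical placeholder, and the helper names are mine), would look something like this:

```python
import requests
from bs4 import BeautifulSoup

def find_post_content(html):
    """Search raw HTML for the tag pattern seen in Inspect Element."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list contains this class
    return soup.find("div", class_="postArticle-content")

def fetch_post_content(url):
    """Download the page source and search it for the post content tag."""
    return find_post_content(requests.get(url).text)

# Hypothetical post URL:
# print(fetch_post_content("https://medium.com/@author/some-post-99a3d86df228"))
```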

Then I executed it to print the tag first:

(env)$ python
None

The output is None. What happened? 🤔🤔🤔

I went back to Chrome to double-check that the tag I had searched for was correct. Then I tried another way to find the tag pattern: “View source” in Chrome (right-click on any page element and select View source):

Right-click on any page element and select View source
The source code of the web page

And I found that there is a small difference between the results of Inspect Element and View source: the source contains no tag like `<div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="99a3d86df228"...`. So I concluded that the post content is generated by JavaScript.

You may wonder: how can you tell the difference? How do you know whether a web page is generated dynamically by JavaScript or is just a simple static page?

It’s not too difficult. The easy way is to let Chrome tell the difference for us, using the same two ways of finding the post content tag pattern that we used above:

  1. “View source” (right-click on any page element and select View source): shows the actual source code of the web page, without executing any scripts. This is what a simple web scraper gets.
  2. “Inspect element” (right-click on the title element and select Inspect Element): shows the HTML after all of the page’s code, including JavaScript, has been executed, so it includes dynamically generated content. A simple web scraper can’t see this content; it needs an extra technique to do the job.
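You can automate the same check: fetch the raw source (what “View source” shows) and test whether the tag pattern occurs in it. A small sketch (the helper name is my own, and the URL in the comment is a placeholder):

```python
import requests

def in_raw_source(html, pattern):
    """True if `pattern` occurs in the unrendered page source."""
    return pattern in html

# If the tag shows up in Chrome's Inspect Element but this prints False,
# the tag is generated by JavaScript:
# html = requests.get("https://medium.com/@author/some-post").text  # hypothetical URL
# print(in_raw_source(html, 'class="postArticle-content'))
```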

If a tag appears in Inspect Element but you can’t find it in the source code, it is generated by JavaScript, and you need a particular technique to get it. If you can find the tag in the source code, simple web scraping is enough.

OK, back to our program. In this situation, the problem is:

how can we get the tags generated by JavaScript?

All our program has to do is simulate a browser that executes the whole source code, including all the JavaScript, and then grab the tags from the generated page.

Selenium and other web drivers can help. Here I use the most popular one, Selenium, as the web driver; you have to download and install it first. The following code uses Selenium to get the Medium post content tags:

A useful technique for getting the HTML out of Selenium, after you find an element by some specification, is content_element.get_attribute("innerHTML"). When you execute the code, it opens Chrome, loads the URL you specify, and gets the post content tags to parse.

Result of executing the Selenium code

OK, it’s done! Now I can continue the rest of the parsing flow to get what I want from a Medium post. Here is my repository:

Feel free to star my repository and like this post. ❤️

Happy coding!!!