Web Crawler — Getting Medium Post

Micromike
4 min read · Aug 12, 2020

Web Crawler, what’s that

Ever wondered how to get a user’s Medium posts, but weren’t quite sure how to do it?

This post walks through how I ended up fetching Medium posts from Medium’s GraphQL API.

Why I Started to Investigate the Issue

One day, my girlfriend wanted to render her Medium post list on her WordPress site.

However, the WordPress plugin she was using did not work as she expected. It showed not only the user’s posts but also their comment activity.

The goal is simple: find a way to get a user’s Medium posts by username.

Okay, challenge accepted. Let the hacking begin.

TL;DR

Here is the TL;DR version.

If you want to get posts from Medium, the following JavaScript get_medium_post function will do the magic for you.

Official Medium API

The whole investigation began with the official Medium developer site.

Nowadays, most web services provide some kind of API so that developers can integrate with them.

Medium also provides some APIs, but the bad news is that none of them lists a user’s posts.

Since the official documentation offered nothing useful, I started to wonder how the WordPress plugin gets the Medium posts.

RSS Feed

The WordPress plugin that we used is the Display Medium Posts plugin.

After tracing a bit of the code, I found the snippet where the magic happens.

Basically, api.rss2json.com is a web service that converts an RSS feed into JSON.

Yes, you heard the keyword: RSS feed.

It turns out that even though Medium does not provide an API to list posts, it does provide an RSS feed so that users can subscribe to other users’ activity.

The RSS feed URL format is medium.com/feed/$username.
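To make the two URLs concrete, here is a minimal sketch of how the feed address and the rss2json request could be assembled. The helper names mediumFeedUrl and rss2jsonUrl are made up for illustration and are not taken from the plugin. The rss2json endpoint path and its rss_url parameter are assumptions based on its public v1 API, and the username is used exactly as passed in (Medium usernames are usually written with a leading @).

```javascript
// Hypothetical helpers for illustration, not the plugin's actual code.

// Build the RSS feed URL for a Medium user, following the
// medium.com/feed/$username format described above.
function mediumFeedUrl(username) {
  return `https://medium.com/feed/${username}`;
}

// Build the rss2json request URL that converts a feed into JSON.
// Assumption: the public v1 endpoint with an `rss_url` query parameter.
function rss2jsonUrl(feedUrl) {
  const url = new URL("https://api.rss2json.com/v1/api.json");
  url.searchParams.set("rss_url", feedUrl); // percent-encodes the feed URL
  return url.toString();
}

console.log(rss2jsonUrl(mediumFeedUrl("@someone")));
```

The resulting URL can then be requested with fetch, and the JSON response should carry the feed entries in an items array.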

Now that we know how the plugin gets the Medium posts, let’s investigate why it shows things other than the user’s posts.

Investigation

After consulting my best pal, Google, I was led to a blog post about this behavior.

As the blog suggests, this may be expected behavior for Medium.

The Medium RSS feed contains not only the user’s posts but also their other activity.

Luckily, the blog also proposes a simple workaround for this issue.

The idea is very simple: check whether the categories field is empty. Items with an empty categories field are treated as non-posts.
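A minimal sketch of that check, assuming each feed item is an object with a categories array (as in the JSON produced by rss2json). The function name filterPosts and the sample items are made up for illustration.

```javascript
// Keep only feed items that carry at least one category (tag).
// Assumption: each item has a `categories` array; items without one are dropped.
function filterPosts(items) {
  return items.filter(
    (item) => Array.isArray(item.categories) && item.categories.length > 0
  );
}

// Made-up sample data: one real post and one comment activity.
const sample = [
  { title: "A real post", categories: ["web", "graphql"] },
  { title: "A comment on someone else's post", categories: [] },
];

console.log(filterPosts(sample).map((item) => item.title)); // only "A real post" remains
```

One caveat of this heuristic: a genuine post published without any tags would be filtered out as well.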

Mitigation

Applying the above check in the WordPress plugin should filter out the non-post items.

Into the GraphQL World

After playing around with the RSS feed, I discovered another behavior.

The RSS feed only shows the latest 10 activities and does not provide any pagination options.

This means that when a user is popular and has a lot of activity, their posts may not show up at all, because newer activity crowds them out of the feed (starvation).
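The starvation effect can be illustrated with made-up data: if the ten most recent feed entries are all comments, the category filter has nothing left to keep. The activity history below is invented purely for the illustration.

```javascript
const FEED_LIMIT = 10; // the feed only exposes the latest 10 activities

// Simulated activity history, newest first: 12 comments, then 3 older posts.
const activities = [
  ...Array.from({ length: 12 }, (_, i) => ({
    title: `comment ${i + 1}`,
    categories: [],
  })),
  ...Array.from({ length: 3 }, (_, i) => ({
    title: `post ${i + 1}`,
    categories: ["tech"],
  })),
];

// What the feed would return, and what survives the category filter.
const feedItems = activities.slice(0, FEED_LIMIT);
const visiblePosts = feedItems.filter((item) => item.categories.length > 0);

console.log(visiblePosts.length); // 0: every post has been pushed out of the feed
```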

Now I knew that the RSS feed was a dead end. We needed to find another way to list the posts.

So I opened the Chrome developer console, drank a cup of coffee, crossed my fingers and prayed for the best.

After monitoring the results for a while, voilà, we’ve got a winner.

Medium uses GraphQL. Good for you.

I was using incognito mode, which implies that the GraphQL endpoint does not require any authentication. So we can simply resend the GraphQL query and parse the response data.
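Resending a GraphQL query boils down to an HTTP POST with a JSON body holding the query and its variables. The sketch below assumes a runtime that provides fetch (a modern browser, Node 18+, or Cloudflare Workers). The function names are hypothetical, and both the endpoint URL and the query text are left as parameters, to be copied from the request seen in the developer console.

```javascript
// Build the fetch options for a standard GraphQL-over-HTTP POST request.
function buildGraphqlRequest(query, variables = {}) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, variables }),
  };
}

// Send the query and return the `data` part of the response.
// `fetchImpl` is injectable so the function can be tried without a network.
async function postGraphql(endpoint, query, variables, fetchImpl = fetch) {
  const response = await fetchImpl(
    endpoint,
    buildGraphqlRequest(query, variables)
  );
  if (!response.ok) {
    throw new Error(`GraphQL request failed with status ${response.status}`);
  }
  const payload = await response.json();
  if (payload.errors) {
    throw new Error(`GraphQL errors: ${JSON.stringify(payload.errors)}`);
  }
  return payload.data;
}
```

A call would look like postGraphql(endpoint, queryText, { username: "someone" }), where the variable names depend on whatever the captured query declares.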

Dig Deeper in GraphQL

Actually, at this point the core functionality was done, but there were still a few things I wanted to tweak to make my little script more reliable and maintainable.

Minimize the Query

The original GraphQL query is pretty big (about 14KB), which makes it

  • hard to read
  • hard to maintain

So I minimized the GraphQL query down to a small one that returns only the following data

  • Link
  • Author
  • Thumbnail
  • Title
  • Publish Date
  • Pagination
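As a rough sketch of how the pagination data might be consumed, the loop below keeps requesting pages until no cursor is returned. Everything here is hypothetical: fetchPage stands in for a function that sends the minimized query, and the { posts, next } shape is an assumed response layout, not Medium’s actual schema.

```javascript
// Collect posts across pages. `fetchPage(cursor)` is a hypothetical async
// function resolving to { posts: [...], next: cursorOrNull }.
// `maxPages` is a safety cap so a misbehaving cursor cannot loop forever.
async function collectAllPosts(fetchPage, maxPages = 20) {
  const all = [];
  let cursor = null;
  for (let page = 0; page < maxPages; page++) {
    const { posts, next } = await fetchPage(cursor);
    all.push(...posts);
    if (!next) break; // no further pages
    cursor = next;
  }
  return all;
}

// Stubbed usage: two fake pages instead of real network calls.
const fakePages = {
  start: { posts: [{ title: "Post A" }, { title: "Post B" }], next: "page2" },
  page2: { posts: [{ title: "Post C" }], next: null },
};

collectAllPosts(async (cursor) => fakePages[cursor ?? "start"]).then((posts) =>
  console.log(posts.map((p) => p.title))
);
```

Unlike the RSS feed, a cursor-based loop like this is not limited to the latest 10 items, which is what makes the GraphQL route worthwhile.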

The following is the code snippet for the minimized GraphQL query.

Final Words

The journey ended when I finally finished the GraphQL query and deployed the script to my custom Cloudflare Worker.

I’ve got to say, I didn’t expect this small issue to take me so long to investigate.

But the whole process was priceless and full of joy.

I enjoyed digging through the Medium GraphQL API, and I also realized how powerful GraphQL is.

I hope this post helps you. Any feedback is welcome.

Micromike

Senior product developer in Synology. Programmer, #infosec enthusiasm, #linux, #python #Rust