Inside the TOP 1000 tags on Medium.com — Part 1

During my three-month stay in the US, one of my pet projects was to download all of the posts on medium.com. It took around one week to write the script, and another week to download all of the posts.

The total database size is 8GB. I ended up with 6M posts, carrying a total of 9.2M tags. Counting only the unique tags, we end up with 620K tags.

I ended up filtering and extracting just the stats for the tags that are used in at least 1000 different posts.

There are 1016 unique tags that are used in at least 1000 posts.
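
The filtering step described above can be sketched in a few lines of Python. This is not the original script: the function name `tags_used_in_at_least`, the `"tags"` field and the sample posts are assumptions made up for illustration, and the real database layout may look different.

```python
from collections import Counter

def tags_used_in_at_least(posts, min_posts=1000):
    """Return {tag: number_of_posts} for tags found in at least min_posts posts."""
    counts = Counter()
    for post in posts:
        # set() so that a tag repeated inside one post is counted only once
        counts.update(set(post["tags"]))
    return {tag: n for tag, n in counts.items() if n >= min_posts}

# Tiny hypothetical sample, only to show the shape of the input
sample_posts = [
    {"tags": ["startup", "seo"]},
    {"tags": ["startup", "poetry"]},
    {"tags": ["startup"]},
]

print(tags_used_in_at_least(sample_posts, min_posts=2))  # {'startup': 3}
```

With the real data, `min_posts` would stay at its default of 1000, which is the threshold used in this article.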

You can download the CSV file with the TOP 1000 values from this gist.

All the visualizations are posted on Tableau Public, at this link.

Let's plot the Average Image Count / Average Reading Time.

Using the data, we plot on the Y axis the average image count of posts with a specific tag, and on the X axis the average reading time of posts with that tag.
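
As a rough illustration of how the two per-tag averages could be computed before plotting, here is a minimal sketch. The field names `tags`, `image_count` and `reading_time`, the function name `average_per_tag` and the sample posts are all hypothetical; the article does not show how the numbers were actually aggregated.

```python
from collections import defaultdict

def average_per_tag(posts):
    """Return {tag: (average_image_count, average_reading_time)}."""
    # For every tag keep: [sum of image counts, sum of reading times, post count]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for post in posts:
        for tag in set(post["tags"]):
            entry = totals[tag]
            entry[0] += post["image_count"]
            entry[1] += post["reading_time"]
            entry[2] += 1
    return {tag: (images / n, minutes / n)
            for tag, (images, minutes, n) in totals.items()}

# Hypothetical posts; reading_time is in minutes
sample_posts = [
    {"tags": ["travel"], "image_count": 4, "reading_time": 3.0},
    {"tags": ["travel", "poetry"], "image_count": 2, "reading_time": 5.0},
    {"tags": ["poetry"], "image_count": 0, "reading_time": 1.0},
]

averages = average_per_tag(sample_posts)
print(averages["travel"])  # (3.0, 4.0)
print(averages["poetry"])  # (1.0, 3.0)
```

Each tag then becomes one point on the chart, with the second value on the X axis and the first value on the Y axis.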

Top 10 posts per:

  • Average Recommends
  • Number of Posts
  • Image Count
  • Reading Time


One thing that we can do with this data is to plot on the Y axis the count of posts with a specific tag, and on the X axis the count of distinct users who wrote posts with that tag.

Above the Line / Below the Line explanation.

To better understand what it means for a value to be above or below the line, look at this graph. It shows 4 data points (4 tags); every tag has around 3100 posts, but a different number of distinct users.

One general rule that seems to hold in most cases is that the tags above the line are more specific to a domain, while the tags below the line are more general terms.
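
To make "above" and "below" concrete, here is a minimal sketch. It assumes the black line is an ordinary least-squares trend line of post count (Y) against distinct-user count (X), which is an assumption and not something stated in this article. The tag names and counts are hypothetical placeholders, not values from the data set.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def side_of_line(users, posts, slope, intercept):
    """Compare a tag's post count with the count the line predicts."""
    expected_posts = slope * users + intercept
    if posts > expected_posts:
        return "above"
    if posts < expected_posts:
        return "below"
    return "on"

# Hypothetical tags: (distinct users, posts)
hypothetical_tags = {
    "tag_a": (1000, 2000),
    "tag_b": (2000, 4000),
    "tag_c": (3000, 6000),
    "tag_d": (2000, 5000),  # more posts than expected for 2000 users
    "tag_e": (2000, 3000),  # fewer posts than expected for 2000 users
}

users = [u for u, _ in hypothetical_tags.values()]
posts = [p for _, p in hypothetical_tags.values()]
slope, intercept = fit_line(users, posts)

sides = {tag: side_of_line(u, p, slope, intercept)
         for tag, (u, p) in hypothetical_tags.items()}
print(sides)
```

In this toy example the fitted line is 2 posts per user, so `tag_d` lands above the line and `tag_e` below it, while the other three sit exactly on it.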

Above the Line

For the tags above the black line that crosses the chart, a smaller group of people is writing all of the articles for that specific tag:

For example, for the IFTTT tag, there are 55,151 posts, and only 1,331 distinct users.

If we divide the number of posts by the number of distinct users, we get the average number of posts written by an individual.

For the IFTTT tag we have 55,151 posts / 1,331 distinct users = 41 posts per user.

Let's take another example:
The SEO tag. Here we have 47,271 posts, written by 6,471 users. That means that, on average, each user wrote 7 posts with that tag.

  • For the Poetry tag, the ratio is more balanced, with each user writing 2.6 posts.
  • For the Startup tag, the most used tag on Medium, the ratio is 2.2 posts/user.
  • For the Politics tag, the ratio is 2.1 posts/user.

This makes sense if we consider that in these categories there are people who are more specialized and passionate about a particular field, and they tend to write more about that topic.

Also, these tags come from domains that require in-depth experience with the topic.

For the IFTTT and SEO tags, I'm guessing somebody uses the tag to do some SEO spam.

Below the Line

For the tags below the black line that crosses the chart, a larger and more diverse group of people is contributing articles for that specific tag.

You don't see the same monopoly of a small group of users writing about a specific tag.

Also, these tags seem to be more generic (Life Teens, Election, New Years Resolutions, Internet and other tags on which everybody has an opinion).

Writing about them does not require specialized skills or knowledge.

For example, for the medium tag, there are 16,313 posts, written by 10,267 distinct users.

If we divide the number of posts by the number of distinct users, we get the average number of posts written by an individual.

For the medium tag we have 16,313 posts / 10,267 distinct users = 1.6 posts per user.

Let's take another example.

The death tag. Here we have 8,445 posts, written by 6,363 users.

That means that, on average, each user wrote 1.3 posts with that tag.

  • For the "Social Media" tag, the ratio is 1.9 posts/user.
  • For the Relationships tag, the ratio is 1.7 posts/user.
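
The posts-per-user figures for the four tags with full counts (IFTTT, SEO, medium, death) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the script used for the article; the function name `posts_per_user` is made up for illustration, while the counts are the ones quoted in the text.

```python
def posts_per_user(post_count, distinct_users):
    """Average number of posts written by one user for a given tag."""
    return post_count / distinct_users

# (posts, distinct users) as quoted in the article
tag_stats = {
    "IFTTT": (55_151, 1_331),
    "SEO": (47_271, 6_471),
    "medium": (16_313, 10_267),
    "death": (8_445, 6_363),
}

for tag, (posts, users) in tag_stats.items():
    print(f"{tag}: {posts_per_user(posts, users):.1f} posts per user")

# IFTTT: 41.4 posts per user
# SEO: 7.3 posts per user
# medium: 1.6 posts per user
# death: 1.3 posts per user
```

The two tags above the line come out far higher than the two below it, which is the contrast the chart shows visually.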

Let's zoom in to get a better understanding of the data.

In total, we have 1000 tags, and the vast majority of them are in the bottom left corner.

In this image you can see the 3 zoom levels that we will dive into, and also the number of tags that we have in each of the three views.

Zoom Level 1

In the first zoom level, we are left with 932 tags, with a total post count of 3M (see summary in the chart).

We can see the same trend emerging: the majority of tags above the line are more specific to an industry (blockchain, investing, film) or to a subgroup of people (Vietnam, Japan, Espanol).

The tags below the line are more generic, consisting of human states (Fear, Depression), generic terms (work, future, thanksgiving), etc.

One thing that becomes apparent starting now is that the posts below the line, meaning the more generic ones, are getting on average more recommendations than the ones above the line (the bigger the circle for each tag, the more recommendations its posts got).

Zoom Level 2

At zoom level 2, we are left with 658 tags, with a total post count of 1.4M (see summary in the chart).

We can see the same trend as in zoom level 1: the posts that are more specific (above the line) have fewer recommendations than the more generic ones (below the line).

Zoom Level 3

At zoom level 3, we are left with 312 tags, with a total post count of 437K (see summary in the chart).

At this zoom level it's clear that the posts that are more general, the ones below the line, are getting more recommendations than the ones above the line.

To test this theory, we selected 156 points below the line and calculated the average and median of their recommendations.

We did the same with 119 points above the line (more specific topics). The results are:

  • Above the Line: average recommendations = 4.04, median recommendations = 3.23
  • Below the Line: average recommendations = 6.58, median recommendations = 4.73
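
For readers who want to run the same kind of comparison, here is a minimal sketch using Python's standard `statistics` module. The per-tag recommendation values behind the numbers above are not listed in this article, so the two lists below are hypothetical placeholders, and the helper name `summarize` is made up for illustration.

```python
from statistics import mean, median

def summarize(values):
    """Average and median of a list of per-tag recommendation values."""
    return {"average": mean(values), "median": median(values)}

# Hypothetical per-tag average recommendations, NOT the real data
above_the_line = [1.0, 2.0, 3.0, 10.0]
below_the_line = [2.0, 4.0, 6.0, 20.0]

print("Above the Line:", summarize(above_the_line))
# Above the Line: {'average': 4.0, 'median': 2.5}
print("Below the Line:", summarize(below_the_line))
# Below the Line: {'average': 8.0, 'median': 5.0}
```

Reporting the median next to the average is useful here because a few tags with very many recommendations pull the average up, while the median is not affected by them.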


This is just the tip of the iceberg of what we can do with, and learn from, this data set. I am searching for ideas of what to do next with it.

If you want to contribute and join me in the quest of playing with the data set, send an email to baditaflorin@gmail.com.

I am also looking for funds to:

  • Buy Tableau Software ($900)
  • Host the database on a remote server (for now it's located on localhost) ($50/month)

I live in Romania, where the average salary is $250/month.

This is a pet project that I'm doing in my spare time. You can help with a small donation here. Any help is appreciated.

About Me

Over the last 3 years I collaborated with Rise Project, where I did data analysis and pattern recognition to uncover patterns of corruption in unstructured data sets.

In September 2016 I moved to San Francisco for 3 months, to start a new life.

Now I'm back in Romania, searching for a remote, part-time or full-time job where I can apply my data science expertise.

Currently :

You can find me online on Medium (Florin Badita), AngelList, Twitter, LinkedIn, OpenStreetMap, GitHub, Quora and Facebook.

Sometimes I write on my blog: http://florinbadita.com/