Published in MLearning.ai

Medium Articles Data Visualization and Analysis using Python

Data Analysis Projects for beginners as well as intermediates

Photo by Myriam Jessier on Unsplash

Hello, Learners..!!

I think Medium is part of daily life now for all the geeks. We live in the information age, and data is a vital part of it. Those who own the data own everything.

Let’s do some analysis on a Medium dataset. I got the dataset from Kaggle. It has information about Medium articles on data science, machine learning, neural networks, and so on. We will dig deeper into it later. So, let’s start this project with a cup of coffee.

Code with Analysis

The dataset and GitHub link are at the end of this article.

  • Import the following Libraries
# for mathematical computation
import numpy as np
import pandas as pd
import scipy.stats as stats
# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import plotly
import plotly.express as px
%matplotlib inline
  • Let’s load the data and take a sneak peek at it. Download the dataset and set the path as shown. Then display the first 5 rows of the dataset.
df = pd.read_csv("/content/articles.csv", encoding='latin-1')
df.head()

Now run the cell. You will see something like this on the screen.

medium data analysis
  • Get some more information about the data
#data info
df.info()
#Check missing values
df.isnull().sum()

df.info() lists every column with its data type and non-null count, so you get a quick overview of the dataset.
df.isnull().sum() then counts the null values in each column. We got lucky: there are no null values in our dataset.

  • Relation between reading times and claps.
fig1 = px.line(df, x='reading_time', y='claps')
fig1.show()

After you run the cell, you will see something like this on the screen.

medium data visualization

Umm... This looks a little weird, most likely because the claps column still contains uncleaned text values (some of them end in K). We will take care of that later. Even so, articles with a reading time between 5 and 9 minutes seem to get more claps. My reading is that readers don’t like very short articles, which carry too little information on a topic, and they don’t like very long ones either. I suspect the same holds for other genres of articles too: most readers want articles that take 5 to 10 minutes to read.

  • Add 2 columns: the length of the body text and the length of the title. Then look at the new dataset.
df['len_text'] = df['text'].str.len()
df['len_title'] = df['title'].str.len()
df.head()

No rocket science here. We take the character length of the text and of the title and store each in a new column. Our new dataset will look something like this now.

medium data analysis
  • Body text observation
sns.distplot(df['len_text'], color='b')
plt.show()
Medium data analysis using python

For most of the articles in the dataset, we have 5000 to 15000 characters.

If you run the above cell, you might see a deprecation warning. Recent seaborn versions replace distplot with displot (or histplot), so you can switch to one of those and it will work fine.

  • Title Length Observation
sns.displot(df['len_title'], color="b")
plt.show()
medium data analysis

The above curve has two local maxima, one at about 60 characters and the other at about 100 characters. That is quite fascinating, as it suggests that a specific group of writers prefers longer titles.

  • Claps Observation

First, clean the data. The claps column is not cleaned yet: some values are stored as text with a “K” suffix for thousands. After that, we will look at the distribution. We have two options here: either store the cleaned numbers (without the K) in a separate column, or overwrite the claps column in place. We are going with the second option.

df['claps'] = df['claps'].apply(lambda s: int(float(s[:-1]) * 1000) if s[-1] == 'K' else int(s))
sns.distplot(df['claps'], color="b")
plt.show()
medium data analysis

The graph is strongly right-skewed, which shows that a small share of articles, and therefore a small group of authors, gets a very high number of claps.
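The K-suffix cleanup above can also be written as a small named function, which is easier to test than a lambda. This is only a sketch: the name parse_claps and the sample values are made up for illustration, and it assumes every value is either a plain integer string or a number followed by K. It uses round() instead of int() so that a float product landing just below a whole number is not truncated.

```python
def parse_claps(raw: str) -> int:
    """Convert a clap label such as '489' or '1.2K' to an integer."""
    raw = raw.strip()
    if raw.endswith("K"):
        # 'K' means thousands; round() avoids truncating values that
        # floating-point arithmetic leaves just below a whole number.
        return round(float(raw[:-1]) * 1000)
    return int(raw)


# Made-up examples, not rows from the dataset.
examples = ["489", "1.2K", "53K"]
print([parse_claps(value) for value in examples])
```

If it suits your data, you could then apply it with df['claps'].apply(parse_claps) in place of the lambda.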

  • Scatter plot of reading time and claps
fig5 = px.scatter(df, y='claps', x='reading_time')
fig5.show()
medium data analysis

We have already looked at the relation between reading time and claps. Now that claps are numeric, the scatter plot shows each article as its own point, so this graph is more precise and looks decent.
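If you want a number to go with the scatter plot, one option is to group articles into reading-time buckets and compare the median claps per bucket. The sketch below uses a small made-up DataFrame rather than the real dataset, and the bucket edges are an arbitrary choice of mine, so treat it as a starting point only.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned dataset:
# reading_time in minutes, claps already converted to integers.
sample = pd.DataFrame({
    "reading_time": [2, 3, 6, 7, 8, 12, 15, 20],
    "claps": [10, 30, 500, 800, 650, 300, 120, 40],
})

# Bucket edges are an assumption chosen for illustration.
bins = [0, 5, 10, 15, 30]
labels = ["0-5", "5-10", "10-15", "15-30"]
sample["bucket"] = pd.cut(sample["reading_time"], bins=bins, labels=labels)

# Median claps per bucket; the median is less sensitive to a few viral articles.
median_claps = sample.groupby("bucket", observed=False)["claps"].median()
print(median_claps)
```

On the real data you would replace sample with df after the claps column has been cleaned.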

  • Relation between the length of title and claps
fig6 = px.scatter(df, y='claps', x='len_title')
fig6.show()
medium data visualization

Well, that’s impressive. In this plot, claps vary with the length of the title. If you are planning to write something on Medium, make sure to write a decent title for your article.

  • Correlations between the columns

Let’s see the correlations between the columns, and check if we can find anything interesting.

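%matplotlib inline
f, ax = plt.subplots(figsize=(14, 10))
# correlate numeric columns only; text columns such as title raise an error in recent pandas
sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt=".1f", ax=ax)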
plt.show()
medium data visualization

You can see that the length of the text is highly correlated with the reading time. That makes sense: the longer the text, the longer the reading time. Reading time does not depend on the length of the title. Claps are more related to reading time and text length than to title length. Well, that’s a lot of information from one plot.
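Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. To make that number less of a black box, here is a minimal pure-Python sketch of the formula. The function name pearson_r and the sample numbers are made up for illustration and do not come from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up numbers: text length in characters and reading time in minutes.
len_text = [3000, 6000, 9000, 12000, 15000]
reading_time = [3, 5, 8, 10, 13]
print(round(pearson_r(len_text, reading_time), 3))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relation.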

  • Top articles
df[df['claps'] >= df['claps'].quantile(0.95)][['author', 'title', 'claps']]
medium data analysis

This is the list of top articles on Medium in our dataset: the ones whose claps are at or above the 95th percentile.
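To see what the quantile filter does, here is a minimal sketch on made-up data. The DataFrame toy and its values are hypothetical. Note that pandas interpolates linearly between neighbouring values by default, so the threshold does not have to be a clap count that actually occurs in the data.

```python
import pandas as pd

# Made-up clap counts for 20 hypothetical articles: 100, 200, ..., 2000.
toy = pd.DataFrame({
    "title": [f"article {i}" for i in range(1, 21)],
    "claps": list(range(100, 2100, 100)),
})

# 95th percentile of claps (linear interpolation is the pandas default).
threshold = toy["claps"].quantile(0.95)

# Keep only the rows at or above that threshold.
top = toy[toy["claps"] >= threshold][["title", "claps"]]
print(threshold)
print(top)
```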

  • What highest clap getters do differently
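# average the numeric columns per author, then keep the 30 authors with the highest mean claps
df_author = df.groupby(['author']).mean(numeric_only=True).reset_index()
df_top30 = df_author.sort_values(ascending=False, by='claps')[:30]
imageSize = (10, 10)
fig, ax = plt.subplots(figsize=imageSize)
sns.barplot(x='reading_time', y='claps', data=df_top30, ax=ax)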
plt.show()
Medium Data Analysis

We are doing this analysis on the top 30 authors, ranked by their average claps per article. What do they do differently?

As we can see, most of the authors who get the highest number of claps write articles with a reading time of 10 to 15 minutes.
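The per-author ranking can be checked on a tiny example. This is a sketch with a made-up DataFrame (toy, its authors and its numbers are all hypothetical). It averages only the numeric columns, since text columns such as the title cannot be averaged.

```python
import pandas as pd

# Made-up articles by three hypothetical authors.
toy = pd.DataFrame({
    "author": ["A", "A", "B", "C", "C"],
    "title": ["t1", "t2", "t3", "t4", "t5"],
    "reading_time": [10, 14, 5, 8, 12],
    "claps": [1000, 3000, 200, 500, 700],
})

per_author = (
    toy.groupby("author")
       .mean(numeric_only=True)   # average reading_time and claps, skip title
       .reset_index()
       .sort_values(by="claps", ascending=False)
)

# Keep the two authors with the highest average claps.
top2 = per_author.head(2)
print(top2)
```

Keep in mind that this ranks authors by their average claps per article, not by their total claps.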

If you want to see the same analysis in terms of body-text characters, run the cell with the code below.

a4_dims = (10, 10)
fig, ax = plt.subplots(figsize=a4_dims)
sns.distplot(df_top30['len_text'])
plt.show()

After you run the cell, you will notice in the graph that most of these authors’ articles have between 5000 and 15000 characters of text. Converted to words, 5000 characters is roughly 700 to 1200 words.
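The characters-to-words conversion is simple arithmetic: divide the character count by an average number of characters per word, counting the space after each word. The sketch below is a rough estimate only. The function name estimate_words and the characters-per-word values are my assumptions, not something measured from the dataset.

```python
def estimate_words(n_chars: int, chars_per_word: float) -> int:
    """Rough word count: characters divided by average characters per word
    (letters plus the trailing space)."""
    return round(n_chars / chars_per_word)


# chars_per_word values are assumptions, not measured from the dataset.
for chars_per_word in (5.0, 6.0, 7.0):
    low = estimate_words(5000, chars_per_word)
    high = estimate_words(15000, chars_per_word)
    print(f"{chars_per_word} chars/word: {low} to {high} words")
```

With these assumed averages, 5000 characters comes out at about 700 to 1000 words, which is in line with the estimate above.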

Well, that’s it. Congrats, you analyzed the Medium dataset. You can dig deeper on your own, because you can do a lot with data, and the information you get is valuable.

The full GitHub code and dataset access are here.

Thank you for reading. If this article is informative then make sure to clap and share it with your community and follow for more.
