Published in

MLearning.ai

# Medium Articles Data Visualization and Analysis using Python

## Data Analysis Projects for beginners as well as intermediates

Hello, Learners..!!

I think Medium is part of daily life now for all the geeks. As we all know, data is vital in the information age, and that is the age we are living in. Those who own the data own everything.

Let’s do some analysis on a Medium dataset. I got the dataset from Kaggle. It contains information about Medium articles on data science, machine learning, neural networks, etc. We will dig deeper into it later. So, let’s start this project with a cup of coffee.

# Code with Analysis

• Import the following libraries
```python
# for mathematical computation
import numpy as np
import pandas as pd
import scipy.stats as stats

# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import plotly
import plotly.express as px

# Jupyter/IPython magic: render matplotlib figures inside the notebook
%matplotlib inline
```
• Let’s load the data and take a sneak peek at it. Download the dataset and set the path as shown. After that, display the first 5 rows of the dataset.
```python
df = pd.read_csv("/content/articles.csv", encoding='latin-1')
df.head()
```

Now run the cell, and you will see something like this on screen.

```python
# data info
df.info()

# check missing values
df.isnull().sum()
```

Check the number of null values in each column. We got lucky: there are no null values in our dataset.
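If the dataset did have gaps, the same check could guide a cleanup step. Here is a minimal sketch on a made-up three-row frame. The `toy` data and the decision to drop rows without a title are assumptions for illustration only, not something the Kaggle dataset requires.

```python
import pandas as pd

# Hypothetical toy frame standing in for articles.csv; one title is missing.
toy = pd.DataFrame({
    "title": ["Intro to ML", None, "Neural nets"],
    "claps": ["1.2K", "35", "410"],
})

# Count missing values per column, the same call used on df above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible cleanup: drop the rows that lack a title.
cleaned = toy.dropna(subset=["title"]).reset_index(drop=True)
print(len(cleaned))
```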

• Relation between reading time and claps
```python
fig1 = px.line(df, x='reading_time', y='claps')
fig1.show()
```

After you run the cell, you will see something like this on the screen.

Umm... this looks a little bit weird, but we will take care of it later. Still, we can see that articles with a reading time between 5 and 9 minutes get more claps. Readers don’t seem to like short articles, since those carry too little information about a topic, and they don’t like very long articles either. I think this holds for other genres of articles too: most readers want articles that take 5 to 10 minutes to read.
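One likely reason the line chart looks odd is that a line plot joins the points in the order the rows appear, so several articles with the same reading time make the line jump up and down. A possible way around this, sketched below on made-up numbers, is to aggregate to one value per reading time first. The `toy` frame is hypothetical, and the sketch assumes the claps are already numeric, which is not yet true of the raw column at this stage.

```python
import pandas as pd

# Hypothetical toy data; claps are assumed to be numeric already.
toy = pd.DataFrame({
    "reading_time": [7, 3, 7, 12, 3, 5],
    "claps": [900, 120, 1500, 300, 80, 700],
})

# One median clap value per reading time, ordered by reading time,
# so a line through the points runs left to right exactly once.
median_claps = (
    toy.groupby("reading_time")["claps"]
    .median()
    .reset_index()
    .sort_values("reading_time")
)
print(median_claps)
```

The aggregated frame could then be passed to `px.line(median_claps, x='reading_time', y='claps')` in place of the raw data.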

• Add 2 columns: the length of the body text and the length of the title. Then look at the new dataset.
```python
df['len_text'] = df['text'].str.len()
df['len_title'] = df['title'].str.len()
df.head()
```

No rocket science here. We take the length of the text and of the title and add each as a new column. Our new dataset will look something like this now.

• Body text observation
```python
sns.distplot(df['len_text'], color='b')
plt.show()
```

For most of the articles in the dataset, we have 5000 to 15000 characters.

If you run the above cell, you might see a warning, because `distplot` is deprecated in recent seaborn versions. You can switch to `displot` (or `histplot`) instead, and it will work fine.

• Title Length Observation
```python
sns.displot(df['len_title'], color="b")
plt.show()
```

The above curve has two local maxima, one at about 60 characters and the other at about 100 characters. This is quite fascinating, as it suggests that a specific group of writers prefers longer titles.

• Claps Observation

First, clean the data. The claps column is not clean yet: large counts are stored as text with a “K” suffix. After cleaning, we will look at the distribution. We have two options here. Either we store the cleaned values (without the K) in a separate column, or we overwrite the claps column in place. We are going with the second option.

```python
# Convert values such as "1.2K" to 1200; plain numbers are cast to int
df['claps'] = df['claps'].apply(lambda s: int(float(s[:-1]) * 1000) if s[-1] == 'K' else int(s))
sns.distplot(df['claps'], color="b")
plt.show()
```

As we can clearly see, the distribution is heavily right-skewed, which shows that a small group of articles and authors gets a very high number of claps.
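The clap-cleaning step above can also be written as a small named function, which is easier to test than a lambda. This is only a sketch: `parse_claps` is a hypothetical helper, and its handling of stray whitespace, lowercase "k", and thousands separators is an assumption about what the raw column might contain, not something the dataset is known to need.

```python
def parse_claps(raw):
    """Convert a clap label such as '1.2K' or '35' to an integer."""
    text = str(raw).strip().replace(",", "")
    if text.upper().endswith("K"):
        # '1.2K' means 1.2 thousand claps; round() guards against
        # floating-point results that land just below a whole number.
        return int(round(float(text[:-1]) * 1000))
    return int(float(text))


examples = ["1.2K", "35", " 2K ", "1,024"]
parsed = [parse_claps(label) for label in examples]
print(parsed)
```

On the real data, this could be applied with `df['claps'].apply(parse_claps)`.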

• Scatter plot of reading time and claps
```python
fig5 = px.scatter(df, y='claps', x='reading_time')
fig5.show()
```

We have already looked at the relation between reading time and claps, but this graph is more precise and easier to read.

• Relation between the length of title and claps
```python
fig6 = px.scatter(df, y='claps', x='len_title')
fig6.show()
```

Well, that’s impressive. Claps appear to vary with the length of the title. If you are planning to write something on Medium, make sure to give your article a decent title.
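A scatter plot can be hard to read by eye, so one way to check this impression is to bin the title lengths and compare the median claps per bin. The sketch below uses made-up numbers; the `toy` frame and the bin edges are assumptions chosen for illustration, and the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data: title length in characters and numeric claps.
toy = pd.DataFrame({
    "len_title": [20, 35, 55, 60, 65, 95, 110, 130],
    "claps": [50, 80, 900, 1200, 700, 400, 300, 100],
})

# Assumed bin edges, 40 characters wide.
bins = [0, 40, 80, 120, 160]
toy["title_bin"] = pd.cut(toy["len_title"], bins=bins)

# Median claps in each title-length bin.
median_by_bin = toy.groupby("title_bin", observed=True)["claps"].median()
print(median_by_bin)
```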

• Correlations between the columns

Let’s see the correlations between the columns, and check if we can find anything interesting.

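```python
%matplotlib inline
f, ax = plt.subplots(figsize=(14, 10))
# numeric_only=True (pandas 1.5+) skips text columns such as title and author,
# which pandas 2.0 and later no longer drop automatically
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".1f", ax=ax)
plt.show()
```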

You can observe that the length of the text is highly correlated with the reading time. That makes sense: the longer the text, the longer the reading time. Reading time does not depend on the length of the title. Claps are more strongly related to reading time and text length than to title length. Well, that’s a lot of information from a single plot.
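By default, `corr()` computes the Pearson correlation, which measures linear relationships and can be pulled around by a few very large clap counts. A rank-based (Spearman) correlation is a possible cross-check. The sketch below computes it as Pearson on ranks, using a made-up `toy` frame in which claps grow tenfold with each extra minute; the data is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data: claps rise steadily but not linearly.
toy = pd.DataFrame({
    "reading_time": [1, 2, 3, 4, 5],
    "claps": [1, 10, 100, 1000, 10000],
})

# Pearson correlation on the raw values.
pearson = toy.corr()

# Spearman correlation, computed as Pearson correlation of the ranks.
spearman = toy.rank().corr()

print(pearson.loc["reading_time", "claps"])
print(spearman.loc["reading_time", "claps"])
```

The rank-based value is a perfect 1 here because claps always increase with reading time, while the Pearson value is lower because the increase is not a straight line.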

• Top articles
`df[df['claps'] >= df['claps'].quantile(0.95)][['author', 'title', 'claps']]`

This is the list of top articles in our dataset: those in the top 5% by number of claps.
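The quantile filter keeps a share of the articles, so the number of rows depends on the dataset size. If a fixed number of top articles is wanted instead, `nlargest` is one alternative. Both are sketched below on a made-up five-row frame; the `toy` data and the cut-offs (80th percentile, top 2) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with numeric claps.
toy = pd.DataFrame({
    "author": ["A", "B", "C", "D", "E"],
    "title": ["t1", "t2", "t3", "t4", "t5"],
    "claps": [10, 500, 40, 9000, 120],
})

# Share-based selection: rows at or above the 80th percentile of claps.
threshold = toy["claps"].quantile(0.8)
top_by_quantile = toy[toy["claps"] >= threshold]

# Count-based selection: the 2 rows with the most claps.
top_by_count = toy.nlargest(2, "claps")[["author", "title", "claps"]]

print(top_by_quantile)
print(top_by_count)
```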

• What highest clap getters do differently
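```python
# Average the numeric columns per author; numeric_only=True skips text columns
df_author = df.groupby(['author']).mean(numeric_only=True).reset_index()
df_top30 = df_author.sort_values(ascending=False, by='claps')[:30]

imageSize = (10, 10)
fig, ax = plt.subplots(figsize=imageSize)
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x='reading_time', y='claps', data=df_top30, ax=ax)
plt.show()
```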

We are doing this analysis on the top 30 authors. What do they do differently?

As we can see, most of the authors who get the highest number of claps write articles with a reading time of 10 to 15 minutes.
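The per-author averaging can also be written by naming the columns to average, which avoids relying on how pandas treats text columns. A minimal sketch on made-up data follows; the `toy` frame and the choice of the top 2 authors are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: two articles by Ann, one each by Bob and Cy.
toy = pd.DataFrame({
    "author": ["Ann", "Ann", "Bob", "Cy"],
    "title": ["a1", "a2", "b1", "c1"],
    "claps": [100, 300, 50, 1000],
    "reading_time": [5, 7, 3, 12],
})

# Average only the named numeric columns for each author.
author_means = (
    toy.groupby("author")[["claps", "reading_time"]]
    .mean()
    .reset_index()
)

# Authors with the highest average claps first.
top_authors = author_means.sort_values("claps", ascending=False).head(2)
print(top_authors)
```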

If you want to see the same analysis in terms of the number of characters in the body text, run the cell below.

```python
a4_dims = (10, 10)
fig, ax = plt.subplots(figsize=a4_dims)
sns.distplot(df_top30['len_text'])
plt.show()
```

After you run the cell, you will notice in the graph that most of these authors’ articles are between 5000 and 15000 characters long. If you convert 5000 characters into words, that is somewhere between 700 and 1200 words.
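Rather than estimating words from characters, the word count can be computed directly by splitting the text on whitespace. A minimal sketch, assuming a `text` column like the one in the dataset; the `toy` frame and the `n_words` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the article body text.
toy = pd.DataFrame({
    "text": ["Data science is fun", "Neural networks learn from data."],
})

# Character count, as used for len_text above.
toy["len_text"] = toy["text"].str.len()

# Word count: split on whitespace and count the pieces.
toy["n_words"] = toy["text"].str.split().str.len()
print(toy)
```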

Well, that’s it. Congrats, you analyzed the Medium dataset. You can dig deeper on your own, because you can do a lot with data, and the information you get is valuable.

Full GitHub code and dataset access are here.