Analyzing CNET’s Headlines
Exploring the news published on CNET using Python and Pandas
I wrote a crawler to scrape news headlines from CNET's sitemap and decided to perform some exploratory analysis on them. In this post, I will walk you through my findings, a few anomalies, and some interesting insights. You can find the code here if you want to jump into it directly.
Crawler
The crawler is written using Scrapy, an open-source web crawling framework written in Python. There is an option to dump the data into a CSV or a JSON file with a minor change to the command. You can find the code and the command in my GitHub repo. Time to dive into the analysis.
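The full spider lives in the repo, but a minimal sketch of a sitemap-driven Scrapy spider might look like the following. The class name, sitemap URL, and CSS selectors here are illustrative assumptions, not the exact code from the repo.

import scrapy
from scrapy.spiders import SitemapSpider

class CnetSpider(SitemapSpider):
    # Hypothetical name and sitemap URL, for illustration only.
    name = 'cnet'
    sitemap_urls = ['https://www.cnet.com/sitemaps/article.xml']

    def parse(self, response):
        # Yield one record per article with the three fields used below.
        yield {
            'title': response.css('h1::text').get(),
            'link': response.url,
            'date': response.css('time::attr(datetime)').get(),
        }

Running scrapy crawl cnet -o cnet_data.csv dumps the scraped items to CSV; changing the extension to .json writes JSON instead, which is the "minor change" mentioned above.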
Loading and Cleaning Data
The first step will be loading the data and then cleaning it to make it ready for analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('cnet_data.csv')
df.head()
df.head() simply outputs the first 5 rows of a dataframe. Here we have 3 columns: title, link and date. A wise thing to do at this point is to convert the values in the date column to datetime objects, which makes accessing the year, month, day and weekday much easier.
df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
Note: df.date is the same as df['date']. As long as the column name is a valid Python identifier, the column can be accessed using dot notation (.); otherwise you need to use df['col name'].
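To see what the conversion buys us, the .dt accessor now exposes the date parts we will group by below:

# With datetime dtype, date parts are one accessor away:
df['date'].dt.year     # year as an integer Series
df['date'].dt.month    # 1-12
df['date'].dt.weekday  # 0 = Monday ... 6 = Sunday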
One of the most common problems with a dataset is null values. Let's remove rows with empty titles from our dataframe.
df = df[pd.notnull(df['title'])]
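An equivalent, arguably more idiomatic way is pandas' own dropna, restricted to the title column:

# Equivalent to the boolean mask above: drop rows whose title is NaN.
df = df.dropna(subset=['title'])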
Now we are good to start our analysis.
Analysis
Let's analyze the number of articles published per year.
ax = df.groupby(df.date.dt.year)['title'].count().plot(kind='bar', figsize=(12, 6))
ax.set(xlabel='Year', ylabel='Number of Articles', title="Articles Published Every Year")
plt.show()
The code above groups the records by year and plots the counts; everything here is fairly self-explanatory. We can set the kind parameter to other values, such as line for a line plot or barh for a horizontal bar chart. Check out the documentation for more details.
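For instance, a horizontal bar chart of the same yearly counts needs only a different kind value (a quick sketch reusing the grouping above):

# Same aggregation, different presentation: horizontal bars.
ax = df.groupby(df.date.dt.year)['title'].count().plot(kind='barh', figsize=(8, 10))
ax.set(xlabel='Number of Articles', ylabel='Year')
plt.show()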
We can plot a similar graph for months and weekdays.
ax = df.groupby(df.date.dt.month)['title'].count().plot(kind='bar')
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
ax.set_xticklabels(months)
ax.set(xlabel='Month', ylabel='Number of Articles', title="Articles Published Every Month")
plt.show()

ax = df.groupby(df.date.dt.weekday)['title'].count().plot(kind='bar')
days_of_week = ['MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT', 'SUN']
ax.set_xticklabels(days_of_week)
ax.set(xlabel='Day of Week', ylabel='Number of Articles', title="Articles Published Every Weekday")
plt.show()
- The highest number of articles was published in 2009 and the lowest in 1995.
- September sees the highest number of articles published.
- All other months have roughly the same number of articles.
- Wednesday is the busiest day of the week.
- As expected, very few articles are published on weekends.
df['date'].value_counts().head(10)
Interestingly, over 15K articles were published on 2009-09-02 and over 3.5K on the day before. That is probably why 2009 and September won the race by a healthy margin in the previous graphs.
Let’s ignore the top 5 results and plot a graph to show the distribution of articles published by date.
ax = df['date'].value_counts()[5:].plot(color='red', figsize=(12,6))
ax.set(xlabel='Date', ylabel='Number of Articles')
plt.show()
I am still curious about what happened on September 2nd, 2009 that led to such a massive number of articles being published. Let's look at the dominant keywords that appeared in the headlines that day.
from wordcloud import WordCloud
stopwords = set(open('stopwords.txt').read().split(','))
wc = WordCloud(stopwords=stopwords)
wordcloud = wc.generate(' '.join(df[df.date=='2009-09-02']['title'].apply(str)))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Apple seems to have dominated the news that day: keywords like Mac, Apple and iTunes stand out in the word cloud.
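We can sanity-check that impression by counting a few of those keywords in that day's titles (a quick sketch; the keyword list is my own choice):

# Count case-insensitive keyword hits in the 2009-09-02 headlines.
that_day = df[df.date == '2009-09-02']['title'].apply(str)
for kw in ['apple', 'mac', 'itunes']:
    print(kw, that_day.str.contains(kw, case=False).sum())

Zooming back out, here is the same word cloud over the entire dataset: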
wordcloud = wc.generate(' '.join(df['title'].apply(str)))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Looking at the entire dataframe, the headlines are dominated by Google, Apple, Microsoft and Facebook, which is expected.
We can also count the occurrence of some popular keywords in the headlines.
keywords = ['microsoft', 'apple', 'facebook', 'google', 'amazon',
            'twitter', 'ibm', 'iphone', 'android', 'window', 'ios']

for kw in keywords:
    count = df[df['title'].str.contains(kw, case=False)].title.count()
    print('{} found {} times'.format(kw, count))

microsoft found 12918 times
apple found 17762 times
facebook found 6342 times
google found 13409 times
amazon found 4162 times
twitter found 3340 times
ibm found 3178 times
iphone found 11543 times
android found 5801 times
window found 6063 times
ios found 3199 times
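These counts are easier to compare visually; here is a small sketch that collects them into a Series and plots a bar chart:

# Collect the per-keyword counts and plot them for easier comparison.
counts = pd.Series(
    {kw: df['title'].str.contains(kw, case=False).sum() for kw in keywords}
).sort_values(ascending=False)
ax = counts.plot(kind='bar', figsize=(12, 6))
ax.set(xlabel='Keyword', ylabel='Number of Articles')
plt.show()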
This was a basic analysis of CNET headlines. You can find the notebook and the data in my GitHub repo here.
Feel free to play around, see if you can find something interesting in the data and let me know if I missed something.
Loved this article? Find me on Twitter where I share similar content.