Text Analysis of Indian Media

Abhijeet Pokhriyal
Published in Analytics Vidhya
5 min read · Aug 17, 2020

Quantifying distance between various media outlets

Using text analytics to understand the topics Indian media focuses on and to quantify the similarity among different media houses.

Final results preview

Our final result is a dashboard that visualizes a network graph of the media outlets for a selected term. Below are the results for the term “china”.

Results for “China”

Data

The data we have is based on YouTube videos. We have pulled video metadata from select news channels and are going to use the title and the description provided for each video in our analysis. Below is a preview of the available data.

  • Each row in the data represents one video

There are 5 columns:

  1. Channel Id: the Indian media channel id, which uniquely identifies the outlet
  2. Playlist title: title of the playlist to which the video belongs
  3. Date: date when the episode was published
  4. Title: title of the video
  5. Description: description of the video

Data Preprocessing

  • Combining Title and Description columns into a “text” column
  • Vectorizing the documents
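The first step can be sketched with pandas. This is only an illustration: the sample rows and the column names (`channel_id`, `title`, `description`) are assumptions, not the actual dataset.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed from the column list above.
videos = pd.DataFrame({
    "channel_id": ["ch_a", "ch_b"],
    "title": ["China border talks", "Markets today"],
    "description": ["Latest updates from the border", None],
})

# Replace missing values with empty strings so the concatenation never yields NaN,
# then join title and description with a single space.
videos["text"] = (
    videos["title"].fillna("") + " " + videos["description"].fillna("")
).str.strip()
```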

To vectorize the documents we use a custom tokenizer from the NLTK library, RegexpTokenizer, to filter out numeric terms and terms that have fewer than 2 characters.

We pass the custom tokenizer to CountVectorizer through the tokenizer keyword.

CountVectorizer converts a collection of text documents to a matrix of token counts and outputs a matrix as below.

Algorithm

Now that we have a numeric representation for each video, we proceed with the distance calculations.

  1. First we group the videos by channel.
  2. Then for each group we transpose the dataframe and calculate pairwise distances.

     By transposing, we have words/terms as rows and each video as a column. Therefore, when we use pairwise_distance, it calculates distances between words/terms and not between videos.

  3. After we have the term x term distances, we simply combine the results for the different groups.

Code to achieve the above
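A minimal sketch of steps 1 to 3, assuming scikit-learn's `pairwise_distances` with a Euclidean metric (the metric is an assumption) and a tiny made-up count matrix:

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical document-term counts: one row per video.
term_counts = pd.DataFrame(
    {"china": [2, 0, 1, 1], "border": [1, 0, 0, 1], "market": [0, 3, 1, 0]}
)
# Channel of each video, aligned with the rows above.
channels = pd.Series(["ch_a", "ch_a", "ch_b", "ch_b"], name="channel_id")

frames = []
for channel, group in term_counts.groupby(channels):
    # Transpose so that terms are rows and videos are columns;
    # distances are then computed between terms, not between videos.
    dist = pairwise_distances(group.T.to_numpy(), metric="euclidean")
    frame = pd.DataFrame(dist, index=group.columns, columns=group.columns)
    frame.insert(0, "term", frame.index)
    frame.insert(0, "channel_id", channel)
    frames.append(frame.reset_index(drop=True))

# Stack the per-channel "term x term" matrices into one dataframe.
term_term = pd.concat(frames, ignore_index=True)
```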

4. Now that we have the “term x term” matrix, we use it to calculate the distance between channels.

5. For each term, we isolate its vector from the “term x term” matrix by sorting the column corresponding to the term and picking the closest n terms.

As in the example below, W1 and W2 are the only terms that show up in the top 4, therefore we use them in the next step.
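Picking the closest n terms can be sketched as below. The helper name `closest_terms`, the column layout, and the distance values are all hypothetical; the sketch only shows the sort-and-pick idea per channel.

```python
import pandas as pd

# Hypothetical stacked "term x term" distances; only the "china" column is shown.
term_term = pd.DataFrame({
    "channel_id": ["ch_a"] * 4 + ["ch_b"] * 4,
    "term": ["china", "border", "market", "trade"] * 2,
    "china": [0.0, 1.0, 3.6, 2.0, 0.0, 1.0, 1.4, 0.5],
})


def closest_terms(term_term, term, n):
    """Return, per channel, the n terms with the smallest distance to `term`."""
    # Drop the term itself, since its distance to itself is always zero.
    vector = term_term.loc[term_term["term"] != term, ["channel_id", "term", term]]
    vector = vector.rename(columns={term: "distance"})
    return (
        vector.sort_values(["channel_id", "distance"])
        .groupby("channel_id")
        .head(n)
        .reset_index(drop=True)
    )


top_terms = closest_terms(term_term, "china", n=2)
```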

6. Now we pivot the vector on channels and calculate pairwise (Euclidean) distances between the channels.

7. The final step is to melt the distances so that we have only 3 columns. Columns 1 and 2 represent the channel pair and the 3rd column holds the distance between the two.
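Steps 6 and 7 might look roughly like this. The input vector and the column names `channel_1`, `channel_2`, and `distance` are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical term vector: distance of each selected term to the query term, per channel.
vector = pd.DataFrame({
    "channel_id": ["ch_a", "ch_a", "ch_b", "ch_b", "ch_c", "ch_c"],
    "term": ["border", "trade"] * 3,
    "distance": [1.0, 2.0, 1.0, 0.5, 4.0, 6.0],
})

# Pivot so that channels are rows and terms are columns.
pivoted = vector.pivot(index="channel_id", columns="term", values="distance")

# Euclidean distance between every pair of channels.
dist = pd.DataFrame(
    pairwise_distances(pivoted.to_numpy(), metric="euclidean"),
    index=pivoted.index.rename("channel_1"),
    columns=pivoted.index.rename("channel_2"),
)

# Melt the square matrix into three columns: channel pair plus distance.
channel_pairs = dist.reset_index().melt(
    id_vars="channel_1", var_name="channel_2", value_name="distance"
)
```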

Visualization

Now we have the data in required format for our visualizations.

We have performed two calculations:

  1. Term x Term distance
  2. Channel x Channel distance

Therefore we can do two types of visualizations.

The term x term distance can be used for creating the network graph, whereas the channel x channel distance can be used for generating a heatmap.

Since the heatmap is pretty straightforward, we will focus on the network visualization.

Network Visualization

We already have the distance from the selected term to all other terms, so now we use the NetworkX library to position the terms and channels in a network, using the distance values as edge distances.

Given a term’s distance vector, for each row in the vector we create a weighted edge where the weight is the distance value.

Then we use the Fruchterman-Reingold force-directed algorithm to place the nodes in positions that reflect their distances from one another.
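A small sketch of the graph construction and layout, using NetworkX's `spring_layout` (its Fruchterman-Reingold implementation). The edge list, the column names, and the fixed `seed` are assumptions for illustration.

```python
import networkx as nx
import pandas as pd

# Hypothetical distance vector: one row per (channel, term) pair.
edges = pd.DataFrame({
    "source": ["ch_a", "ch_a", "ch_b", "ch_b"],
    "target": ["border", "trade", "border", "trade"],
    "distance": [1.0, 2.0, 1.0, 0.5],
})

# One weighted edge per row, with the distance stored as the weight.
graph = nx.Graph()
for row in edges.itertuples(index=False):
    graph.add_edge(row.source, row.target, weight=row.distance)

# Fruchterman-Reingold layout; the seed only makes the result reproducible.
pos = nx.spring_layout(graph, weight="weight", seed=42)

# Save node positions in a dataframe for plotting.
pdf = pd.DataFrame(
    [(node, x, y) for node, (x, y) in pos.items()],
    columns=["node", "x", "y"],
)
```

One thing to keep in mind: `spring_layout` treats a larger edge weight as a stronger attraction, so when the weights are raw distances it may be worth inverting them first. The sketch keeps the distance as the weight to mirror the description above.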

Now that we have our network positions saved in the pdf dataframe, we just have to plot them.

For plotting we use the Grammar of Graphics API provided by plotnine, which is similar to ggplot2 in R.

For each edge we add a line segment. For each word we add a text geom to the plot, and for each channel we add a label. The rest of the code customizes the colors and theme and removes clutter.

Below is the final output for the term “China”.

If you liked this article or have suggestions and feedback, feel free to reach out on LinkedIn.


School of Data Science @ University of North Carolina — Charlotte