Text Analysis of Indian Media
Quantifying distance between various media outlets
We use text analytics to understand which topics Indian media focuses on and to quantify the similarity among different media houses.
Final results preview
Our final result is a dashboard that visualizes a network graph of the media outlets based on a selected term. Below are the results for the term “china”.
Data
The data we have is based on YouTube videos. We have pulled video metadata from select news channels and are going to use the title and the description provided for each video in our analysis. Below is a preview of the available data.
- Each row in the data represents one video
- There are 5 columns:
- Channel Id: the Indian media channel Id, which uniquely identifies the outlet
- Playlist Title: title of the playlist to which the video belongs
- Date: date when the episode was published
- Title: title of the video
- Description: description of the video
Data Preprocessing
- Combining Title and Description columns into a “text” column
- Vectorizing the documents
To vectorize the documents, we use a custom tokenizer from the NLTK library, RegexpTokenizer, to filter out numeric terms and terms that have fewer than 2 characters.
We pass the custom tokenizer to CountVectorizer through the tokenizer keyword.
CountVectorizer converts a collection of text documents to a matrix of token counts, with one row per document and one column per term.
Algorithm
Now that we have a numeric representation for each video, we proceed with the distance calculations.
1. First, we group the videos by channel.
2. Then, for each group, we transpose the dataframe and calculate pairwise distances. After transposing, words/terms are rows and each video is a column, so pairwise_distance calculates distances between terms rather than between videos.
3. Once we have the term x term distances for each group, we simply combine the results for the different groups.
4. Now that we have the “term x term” matrix, we use it to calculate the distance between channels.
5. For each term, we isolate its vector from the “term x term” matrix by sorting the column corresponding to the term and picking the closest n terms. In the example below, W1 and W2 are the only terms that show up in the top 4, so we use them in the next step.
6. Next, we pivot the vector on channels and calculate pairwise (Euclidean) distances between the channels.
7. The final step is to melt the distances so that we have only 3 columns: columns 1 and 2 hold the channel pair, and column 3 holds the distance between the two.
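The seven steps above can be sketched as below. This is one possible reading rather than the exact implementation. It assumes scikit-learn's pairwise_distances, a Euclidean metric for the term x term step, and that "closest n" means the terms appearing in the n smallest rows of the combined vector. The toy counts and the names term_distances and channel_distances are hypothetical.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy video x term count matrix with one channel label per video (assumed layout).
doc_term = pd.DataFrame({
    "china":  [2, 1, 1, 2, 0, 1],
    "border": [2, 1, 0, 0, 1, 0],
    "trade":  [0, 0, 1, 2, 0, 0],
    "budget": [0, 1, 0, 0, 2, 2],
})
channels = pd.Series(["ch_a", "ch_a", "ch_b", "ch_b", "ch_c", "ch_c"], name="channel")


def term_distances(group: pd.DataFrame) -> pd.DataFrame:
    """Step 2: transpose so terms are rows, then compute term x term distances."""
    dist = pairwise_distances(group.T.to_numpy(), metric="euclidean")
    square = pd.DataFrame(dist, index=group.columns, columns=group.columns)
    return (square.rename_axis("term").reset_index()
                  .melt(id_vars="term", var_name="other", value_name="distance"))


# Steps 1 and 3: group by channel, then stack the per-channel results.
term_term = pd.concat(
    [term_distances(group).assign(channel=channel)
     for channel, group in doc_term.groupby(channels)],
    ignore_index=True,
)


def channel_distances(term_term: pd.DataFrame, term: str, n: int = 4) -> pd.DataFrame:
    """Steps 4 to 7 for one selected term."""
    vec = term_term[(term_term["term"] == term) & (term_term["other"] != term)]
    # Step 5: sort by distance and keep the terms seen in the n closest rows.
    top_terms = vec.sort_values("distance").head(n)["other"].unique()
    vec = vec[vec["other"].isin(top_terms)]
    # Step 6: one row per channel, one column per selected term.
    pivot = vec.pivot(index="channel", columns="other", values="distance")
    dist = pairwise_distances(pivot.to_numpy(), metric="euclidean")
    square = pd.DataFrame(dist, index=pivot.index, columns=list(pivot.index))
    # Step 7: melt to three columns (channel pair plus distance).
    return (square.rename_axis("channel_a").reset_index()
                  .melt(id_vars="channel_a", var_name="channel_b", value_name="distance"))


channel_dist = channel_distances(term_term, "china", n=4)
print(channel_dist)
```

With this toy data only "border" and "trade" show up among the 4 closest rows for "china", so the channels are compared on those two terms, much like W1 and W2 in the example.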
Visualization
Now we have the data in the required format for our visualizations.
We have performed two calculations:
- Term X Term distance
- Channel X Channel distance
Therefore, we can build two types of visualizations.
The term x term distance can be used for creating the network graph, whereas the channel x channel distance can be used for generating a heatmap.
Since the heatmap is pretty straightforward, we will focus on the network visualization.
Network Visualization
We already have the distances between the selected term and all other terms, so we now use the NetworkX library to position the terms and channels as a network, using the distance values as edge distances.
Given a term’s distance vector, for each row in the vector we create a weighted edge where the weight is the distance value.
Then we use the Fruchterman-Reingold force-directed algorithm to place the nodes in positions that reflect their distances from one another.
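A minimal sketch of the edge creation and layout might look like this. The distance values and node names are made up for illustration, and nx.spring_layout is NetworkX's implementation of the Fruchterman-Reingold algorithm.

```python
import networkx as nx
import pandas as pd

# Hypothetical distance vector for the term "china": one row per neighbour.
vector = pd.DataFrame({
    "source": ["china"] * 4,
    "target": ["border", "trade", "ch_a", "ch_b"],
    "distance": [0.5, 1.0, 1.5, 2.0],
})

# One weighted edge per row, with the distance stored as the edge weight.
graph = nx.Graph()
for row in vector.itertuples(index=False):
    graph.add_edge(row.source, row.target, weight=row.distance)

# Fruchterman-Reingold layout; the seed only makes the result repeatable.
pos = nx.spring_layout(graph, weight="weight", seed=42)

# Save the positions in a dataframe (named pdf, as in the article).
pdf = pd.DataFrame(
    [(node, xy[0], xy[1]) for node, xy in pos.items()],
    columns=["node", "x", "y"],
)
print(pdf)
```

One thing to keep in mind: spring_layout treats a larger weight as a stronger attraction between two nodes, so depending on the layout you want, it may be worth passing an inverted distance as the weight instead.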
Now that we have our network positions saved in the pdf dataframe, we just have to plot them.
For plotting we use plotnine, which provides a Grammar of Graphics API similar to ggplot2 in R.
For each edge we add a line segment, for each word we add a text geom, and for each channel we add a label. The rest of the code customizes the colors and theme and removes clutter.
Below is the final output for the term “China”
If you liked this article or have suggestions and feedback, feel free to reach out on LinkedIn.