# Text Analysis of Indian Media

Quantifying distance between various media outlets

We use text analytics to understand which topics Indian media focus on and to quantify the similarity between different media houses.

# Final results preview

Our final result will be a dashboard that visualizes a network graph of the media outlets based on the selected term. Below are the results for the term “china”.

# Data

The data we have is based on YouTube videos. We have pulled video metadata from select news channels and are going to use the title and the description provided for each video in our analysis. Below is a preview of the available data.

- Each row in the data represents one video

There are 5 columns:

- **Channel Id**: the Indian media channel ID, which uniquely identifies the outlet
- **Playlist title**: the title of the playlist to which the video belongs
- **Date**: the date when the episode was published
- **Title**: the title of the video
- **Description**: the description of the video

# Data Preprocessing

- Combining the *Title* and *Description* columns into a “text” column
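As a minimal sketch of this step, assuming the data is loaded into a pandas dataframe whose column names match the list above (the sample rows and the `videos` name are made up for illustration):

```python
import pandas as pd

# Hypothetical sample rows; only the columns needed for this step are shown.
videos = pd.DataFrame(
    {
        "Channel Id": ["channel_a", "channel_b"],
        "Title": ["India China border talks", "Budget 2021 highlights"],
        "Description": ["Latest updates on the talks", None],
    }
)

# Treat missing values as empty strings so the concatenation never yields NaN,
# then strip the stray space left behind when one side is empty.
videos["text"] = (
    videos["Title"].fillna("") + " " + videos["Description"].fillna("")
).str.strip()
```

Filling missing descriptions first is a design choice here: without it, a video with no description would end up with a missing “text” value.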

- Vectorizing the documents

To vectorize the documents, we use a custom tokenizer from the NLTK library, RegexpTokenizer, to filter out numeric terms and terms that have fewer than 2 characters.

We pass the custom tokenizer to CountVectorizer through the `tokenizer` keyword.

CountVectorizer converts a collection of text documents to a matrix of token counts and outputs a matrix like the one below.

# Algorithm

Now that we have a numeric representation for each video, we proceed with the distance calculations.

1. First, we group the videos by channel.

2. Then, for each group, we transpose the dataframe and calculate pairwise distances.

By transposing, we have words/terms as rows and each video as a column. Therefore, when we use pairwise_distance, it calculates distances between words/terms and not between videos.

3. After we have the term x term distances, we simply combine the results for the different groups.

4. Now that we have the “term x term” matrix, we use it to calculate the distance between channels.

5. For each term, we isolate its vector from the “term x term” matrix by sorting the column corresponding to the term and picking the closest n terms.

In the example below, W1 and W2 are the only terms that show up in the top 4, so we use them in the next step.

6. Now we pivot the vector on channels and calculate pairwise (Euclidean) distances between the channels.

7. The final step is to melt the distances so that we have only 3 columns: columns 1 and 2 hold the channel pair, and column 3 holds the distance between the two.
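The seven steps can be sketched end to end on toy data. This is one possible reading, not the exact code behind the dashboard: the counts, the channel labels, the Euclidean metric for the term distances, and the rule of keeping every term that lands in the top n for at least one channel are all assumptions.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical document-term counts: one row per video, one column per term.
doc_term = pd.DataFrame(
    {
        "china": [1, 1, 0, 1, 0, 1, 1],
        "border": [1, 0, 0, 0, 0, 1, 1],
        "trade": [0, 1, 0, 1, 1, 0, 0],
        "cricket": [0, 0, 1, 0, 1, 0, 1],
    }
)
channels = pd.Series(["A", "A", "A", "B", "B", "C", "C"], name="channel")

# Steps 1-3: per channel, transpose to terms x videos, compute term x term
# distances, and stack the per-channel results in long form.
frames = []
for name, group in doc_term.groupby(channels):
    terms_by_videos = group.T
    dist = pd.DataFrame(
        pairwise_distances(terms_by_videos.to_numpy(), metric="euclidean"),
        index=terms_by_videos.index,
        columns=terms_by_videos.index,
    )
    long = dist.rename_axis("term").reset_index().melt(
        id_vars="term", var_name="other", value_name="distance"
    )
    long["channel"] = name
    frames.append(long)
term_dist = pd.concat(frames, ignore_index=True)

# Steps 4-5: isolate the selected term's vector and find its n closest terms
# within each channel (assumption: keep a term if any channel ranks it in the top n).
term, n = "china", 2
vec = term_dist[(term_dist["term"] == term) & (term_dist["other"] != term)]
closest = vec.sort_values("distance").groupby("channel").head(n)
keep = closest["other"].unique()

# Step 6: pivot to channels x terms and compute channel x channel distances.
pivot = vec[vec["other"].isin(keep)].pivot(
    index="channel", columns="other", values="distance"
)
chan = pd.DataFrame(
    pairwise_distances(pivot.to_numpy(), metric="euclidean"),
    index=pivot.index,
    columns=pivot.index,
)

# Step 7: melt into three columns: channel pair plus distance.
channel_dist = chan.rename_axis("channel_1").reset_index().melt(
    id_vars="channel_1", var_name="channel_2", value_name="distance"
)
```

Here `term_dist` holds the combined term x term distances from step 3, and `channel_dist` is the three-column table from step 7.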

# Visualization

Now we have the data in the required format for our visualizations.

We have performed two calculations:

- Term X Term distance
- Channel X Channel distance

Therefore, we can do two types of visualizations.

The term x term distance can be used for creating the network graph, whereas the channel x channel distance can be used for generating a heatmap.

Since the heatmap is pretty straightforward, we will focus on the network visualization.

**Network Visualization**

We already have each term's distances to all other terms, so we now use the NetworkX library to position the terms and channels in a network, using the distance values as edge weights.

Given a term’s distance vector, for each row in the vector we create a weighted edge where the weight is the distance value.

Then we use the Fruchterman-Reingold force-directed algorithm to place the nodes in positions that reflect their distances from one another.
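A rough sketch of this step with NetworkX follows. `spring_layout` is NetworkX's implementation of the Fruchterman-Reingold algorithm. The distance vector is made up, and connecting each channel to its closest terms is an assumption about how the edges are formed.

```python
import networkx as nx
import pandas as pd

# Hypothetical distance vector for one selected term, in long form.
vec = pd.DataFrame(
    {
        "channel": ["A", "A", "B", "B"],
        "other": ["border", "trade", "border", "cricket"],
        "distance": [1.0, 1.4, 0.5, 2.0],
    }
)

# One weighted edge per row: channel -> term, weight = distance value.
G = nx.Graph()
for row in vec.itertuples(index=False):
    G.add_edge(row.channel, row.other, weight=row.distance)

# Fruchterman-Reingold layout; the seed makes the positions reproducible.
pos = nx.spring_layout(G, weight="weight", seed=42)

# Store the positions as a dataframe with one row per node.
pdf = (
    pd.DataFrame.from_dict(pos, orient="index", columns=["x", "y"])
    .rename_axis("node")
    .reset_index()
)
```

One thing to keep in mind: NetworkX documents that a larger edge weight means a stronger attractive force in `spring_layout`. If nearby terms should end up closer together, a transformed weight such as `1 / (1 + distance)` may be worth trying.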

Now that we have our network positions saved in the `pdf` dataframe, we just have to plot them.

For plotting, we use plotnine, which provides a Grammar of Graphics API similar to ggplot2 in R.

For each edge we add a line segment. For each word we add a text geom to the plot, and for each channel we add a label. The rest of the code customizes the colors and theme and removes clutter.

Below is the final output for the term “China”.

If you liked this article or have suggestions and feedback, feel free to reach out on LinkedIn.