Data Science Case Study (Finding Insights of Jabardasth)
Hello guys, this case study is about finding insights into the Jabardasth TV show. I am really thankful to YouTube and the ETV channel. (This study is based on the Jabardasth TV show’s data, which I collected from YouTube.)
“Jabardasth is a very popular comedy show among Telugu TV shows. Currently, Jabardasth has more than 13 lakh (1.3 million) subscribers on YouTube.”
Collect the Data
The first and most important task in any project is data collection. I collected the data from the YouTube APIs using the ‘httr’ package in the R programming language. Data is never ready-made and ready to use; we need to extract features from the raw data.
(* Data collected for Jabardasth shows from 2015-07-30 to 2017-06-19, around 1942 entries)
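To give a feel for what the collection step involves, here is a minimal sketch in Python (the study itself used R with ‘httr’, so this is an illustration, not the code I ran). It builds a request URL for the YouTube Data API v3 videos endpoint and flattens a response into one row per video. The API key, the video id and the sample response body are placeholders made up for this example.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def build_stats_url(video_ids, api_key):
    """Build a request URL asking for the snippet and statistics parts."""
    params = {
        "part": "snippet,statistics",
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)

def parse_stats(payload):
    """Flatten the JSON response text into one dict per video."""
    rows = []
    for item in json.loads(payload).get("items", []):
        stats = item.get("statistics", {})
        snippet = item.get("snippet", {})
        rows.append({
            "video_id": item["id"],
            "title": snippet.get("title", ""),
            "published_at": snippet.get("publishedAt", ""),
            # Counts arrive as strings; a count may be missing, so default to 0.
            "views": int(stats.get("viewCount", 0)),
            "likes": int(stats.get("likeCount", 0)),
            "dislikes": int(stats.get("dislikeCount", 0)),
            "comments": int(stats.get("commentCount", 0)),
        })
    return rows

# Hypothetical response body, made up for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "items": [{
        "id": "VIDEO_ID_1",
        "snippet": {"title": "Sample skit title",
                    "publishedAt": "2016-01-07T16:00:00Z"},
        "statistics": {"viewCount": "1500000", "likeCount": "9000",
                       "dislikeCount": "400", "commentCount": "350"},
    }]
})

rows = parse_stats(SAMPLE_RESPONSE)
print(build_stats_url(["VIDEO_ID_1"], "YOUR_API_KEY"))
print(rows[0])
```

The sketch only builds the URL and parses a response; sending the request is left out so that it runs without a network connection or a real key.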
The most crucial part of a data scientist’s job is data pre-processing. (This is my opinion. I found it very interesting and useful for understanding the data before plotting any graphs or applying any ML algorithms. I spent around 70 to 80 percent of the time for this case study on data pre-processing.)
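As an example of the kind of feature extraction this involves, the sketch below derives the show (Jabardasth or Extra-Jabardasth) and the video type (Intro, Promo or Skit) from a raw video title. The titles and the keyword rules are assumptions for illustration; the real titles may need different rules.

```python
import re

def extract_features(title):
    """Derive the show and the video type from a raw video title."""
    # Treat "Extra-Jabardasth" and "Extra Jabardasth" the same way.
    lowered = title.lower().replace("-", " ")
    if "extra jabardasth" in lowered:
        show = "Extra-Jabardasth"
    else:
        show = "Jabardasth"
    if re.search(r"\bpromo\b", lowered):
        video_type = "Promo"
    elif re.search(r"\bintro\b", lowered):
        video_type = "Intro"
    else:
        # Assumption: anything that is not a promo or an intro is a skit.
        video_type = "Skit"
    return {"show": show, "video_type": video_type}

# Hypothetical titles, made up for this example.
for title in ["Extra Jabardasth Latest Promo",
              "Jabardasth Intro 7th January 2016",
              "Jabardasth Performance 14th January 2016"]:
    print(title, "->", extract_features(title))
```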
What do we understand from the features? What insights can we find?
Before proceeding to modelling (ML), we need to understand the data and try to get as many insights as possible.
Jabardasth vs Extra-Jabardasth
The plot above shows the views trend of Jabardasth vs Extra-Jabardasth.
In the graph above, there is a sudden bump in the Jabardasth plot (the tall blue line). This is the kind of insight we are looking for. If we go through the data, there is a reason behind it: it was a special event organised by ETV on 31 Dec 2015 as part of their New Year celebrations.
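A bump like this can also be found programmatically instead of by eye. Below is a minimal sketch that flags dates whose views are far above the typical (median) value. The view counts and the threshold factor of 3 are assumptions chosen for illustration, not figures from the collected data.

```python
from statistics import median

def find_spikes(views_by_date, factor=3.0):
    """Return (date, views) pairs whose views exceed factor * median."""
    typical = median(views_by_date.values())
    return [(day, views) for day, views in sorted(views_by_date.items())
            if views > factor * typical]

# Hypothetical view counts, made up for this example.
sample_views = {
    "2015-12-17": 1200000,
    "2015-12-24": 1500000,
    "2015-12-31": 9000000,
    "2016-01-07": 1300000,
    "2016-01-14": 1100000,
}

print(find_spikes(sample_views))
```

The median is used instead of the mean because a single very large value would pull the mean up and hide the spike.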
The plot above shows the likes trend of Jabardasth vs Extra-Jabardasth.
The plot above shows the comments trend of Jabardasth vs Extra-Jabardasth.
The plot above shows the dislikes trend of Jabardasth vs Extra-Jabardasth.
Jabardasth Intros vs Extra-Jabardasth Intros
Jabardasth’s Promo vs Extra-Jabardasth’s Promo
Skits trend (Artists)
Artist to Artist View
Artists graph with their top viewed skit marked
Time to jump into ML algorithms
We have seen the data insights so far; it is now time to apply some ML algorithms to the data. Before implementing any ML algorithm, we should thoroughly understand the details below.
- What is the business problem? (What does the client expect?)
- What kind of data is available to us? (Sometimes data volume also plays a major role.)
- What type of problem are we going to solve? (e.g. classification or regression)
- Understand the available data with respect to the given problem statement.
Assume our business request is to predict the “number of views of a skit”.
Key point: The business problem is a regression problem (linear regression).
Check the independent variables for correlation, since correlated predictors may lead to inaccurate results. Improve the model by adding features and applying proper data transformations. I applied the transformations below.
- Log transformation
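The correlation check and the log transformation can be sketched as below. This is a plain-Python illustration with made-up numbers; it uses a single predictor and ordinary least squares, which is an assumption to keep the example short and is not the full model from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_transform(values):
    """Apply log(1 + x) so that zero counts stay defined."""
    return [math.log1p(v) for v in values]

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Hypothetical counts, made up for this example.
likes = [2000, 5000, 9000, 20000, 45000]
comments = [100, 260, 450, 1000, 2300]
views = [300000, 800000, 1500000, 3500000, 8000000]

# A value close to 1 suggests keeping only one of the two predictors.
print("corr(likes, comments):", pearson(likes, comments))

# Fit log(views) against log(likes).
intercept, slope = fit_line(log_transform(likes), log_transform(views))
print("intercept:", intercept, "slope:", slope)
```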
Below are the linear model summary images.
Transform the Artist column into a categorical variable and apply linear regression.
We can observe a better adjusted R-squared value in the image above.
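For readers who have not seen it, the sketch below shows what turning Artist into a categorical variable amounts to (one 0/1 column per artist, with the first level dropped as the baseline) and how adjusted R-squared is computed from R-squared. The artist list and the numbers passed to the function are assumptions for illustration, not values from the model summaries.

```python
def one_hot(values):
    """Encode a categorical column as 0/1 columns, dropping the first level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level acts as the baseline
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows

def adjusted_r_squared(r_squared, n_rows, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Hypothetical artist column, made up for this example.
columns, encoded = one_hot(["aadi", "sudheer", "chandra", "aadi"])
print(columns)
print(encoded)

# Hypothetical R-squared, row count and predictor count.
print(adjusted_r_squared(0.5, 11, 2))
```

Adjusted R-squared penalises every extra predictor, so it only goes up when a new feature (such as Artist) explains more than chance would.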
Find the Artist, based on the Views and Likes
Key point: The business problem is a classification problem.
We will predict the Artist with the help of logistic regression.
Convert the Artist feature to a categorical type. As the features are highly correlated, we can drop one of each pair of highly correlated features. (We can drop multiple features based on the correlation between them.)
Since we are solving a simple classification problem, I have taken the data of two artists.
Our prediction worked very well. There are no false predictions, but we should be very careful about overfitting. If we overfit the model, it may fail when predicting unseen data.
Fine, let us try one more classification problem.
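One simple guard against overfitting is to hold back some rows and measure accuracy on them separately. The sketch below trains a small two-class logistic regression by gradient descent and checks it on held-out rows. The feature values are hypothetical, already standardized views and likes for two artists; the learning rate and the number of epochs are also assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grads[0] += err
            for j, v in enumerate(x):
                grads[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict(weights, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
    return 1 if p >= 0.5 else 0

def accuracy(weights, rows, labels):
    hits = sum(predict(weights, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)

# Hypothetical standardized (views, likes) rows; 0 and 1 stand for two artists.
train_rows = [(-1.0, -0.8), (-0.8, -1.1), (-1.2, -0.9),
              (1.0, 0.9), (0.9, 1.2), (1.1, 0.8)]
train_labels = [0, 0, 0, 1, 1, 1]
test_rows = [(-0.9, -1.0), (1.0, 1.1)]
test_labels = [0, 1]

weights = train_logistic(train_rows, train_labels)
print("train accuracy:", accuracy(weights, train_rows, train_labels))
print("test accuracy:", accuracy(weights, test_rows, test_labels))
```

If the training accuracy is high but the held-out accuracy is much lower, that gap is the sign of overfitting to watch for.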
Find the video type, based on the Views and Likes
Not a good prediction. We need to improve the accuracy by fine-tuning the features or adding more features to the model. (Simply adding features may not help improve the predictions.)
We will apply the standardization technique and try the KNN model.
It’s amazing: we see 100% accurate results. We need to keep improving the model until we get the desired accuracy. (Which measure matters most, accuracy, recall or precision, depends on the business problem.)
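The standardization plus KNN step can be sketched as follows. Views are in the millions while likes are in the thousands, so without standardization the distance would be decided almost entirely by views. The rows, the labels and the choice of k = 3 are assumptions made up for this example.

```python
import math
from collections import Counter

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; fall back to 1.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [scale_row(row, means, stds) for row in rows]
    return scaled, means, stds

def scale_row(row, means, stds):
    """Scale one row with the means and deviations of the training data."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k nearest rows by Euclidean distance."""
    nearest = sorted((math.dist(row, query), label)
                     for row, label in zip(train_rows, train_labels))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (views, likes) rows and video types, made up for this example.
raw_rows = [(3000000, 20000), (2800000, 18000), (3200000, 22000),
            (500000, 3000), (600000, 3500), (450000, 2500)]
labels = ["Skit", "Skit", "Skit", "Promo", "Promo", "Promo"]

scaled_rows, means, stds = standardize(raw_rows)
query = scale_row((2900000, 19000), means, stds)
print(knn_predict(scaled_rows, labels, query))
```

Note that the query row is scaled with the training means and deviations, not with its own.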
Text Mining — What people are commenting about
A word cloud is a graphical representation of frequently used words. The size of each word (height and width) in the picture indicates how frequently the word occurs in the entire text.
Why do we do this for comments data?
We have seen a few insights into the Jabardasth show using different metrics and ML techniques. If we would like to know what users are commenting about most, text mining will help us find those insights.
We apply a few pre-processing steps before drawing the word cloud.
- Remove punctuation
- Remove numbers (numbers alone can’t be useful)
- Uppercase/lowercase (keep any one)
- Remove stopwords (e.g. a, an, the, he, she, etc.)
- Strip whitespace
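The steps above can be sketched as one small cleaning function. The stopword list here is a tiny hypothetical one for illustration (a real list is much longer), and the sample comment is made up.

```python
import re
import string

# Tiny illustrative stopword list; an assumption, not the list used in the study.
STOPWORDS = {"a", "an", "the", "he", "she", "is", "it", "this", "and", "of", "to"}

# Map every punctuation character to a space so joined words stay separate.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_comment(text):
    """Apply the pre-processing steps to one comment."""
    text = text.translate(PUNCT_TO_SPACE)          # remove punctuation
    text = re.sub(r"\d+", " ", text)               # remove numbers
    text = text.lower()                            # keep one case
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    return " ".join(words)                         # split/join strips extra whitespace

print(clean_comment("Super skit!!! Aadi is the best 100%"))
```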
We can use the ‘removeWords’ function (as found in R’s ‘tm’ package) to remove specific words as required.
We need to sort the words in descending order of frequency. We may need to do a little more customized pre-processing. (Optional)
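Counting and sorting the words can be sketched like this. The `ignore` parameter is a hypothetical stand-in for the custom word removal mentioned above, and the comments are made up and assumed to be already cleaned.

```python
from collections import Counter

def top_words(cleaned_comments, n=20, ignore=()):
    """Return the n most frequent words, most frequent first."""
    counts = Counter()
    for comment in cleaned_comments:
        counts.update(w for w in comment.split() if w not in ignore)
    return counts.most_common(n)

# Hypothetical cleaned comments, made up for this example.
comments = ["super skit aadi", "aadi super", "sudheer nice skit", "aadi sudheer"]

print(top_words(comments))
print(top_words(comments, ignore={"super", "skit", "nice"}))
```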
Now we see the most frequently used words (top 20).
From the plot above, we can see the most used words in the comments. If we ignore generic words like ‘super’, ‘skit’ and ‘nice’, the next most popular words are ‘aadi’, ‘sudheer’ and ‘chandra’, followed by ‘roja’, ‘anasuya’ and so on.
Sentiment Analysis based on user comments