Sociofuzz: Movie Reviews, Ratings and Trends using Social Media Analysis !!!
Movie buffs rely on rating websites before deciding to watch a movie. A user-based rating system carries the inherent predilections of its users, while a purely critic-based system is biased as well: movies that are not critically acclaimed are often observed to go down very well with audiences.
People nowadays rely on what is being fed to them over the internet. In some cases one can get accurate movie ratings, but there is a huge chance of missing out on a great flick just because ten people with a different mindset gave it a bad review.
Social media has become an important community where people like to express their reviews and opinions. They tweet about a movie and write comments and posts, which together contain significant information that can be used to provide a social rating. I started working on this idea of capturing unbiased, genuine opinions from social media and building a complete platform that provides a compact analysis of movies.
The platform was meant not only to serve movie analytics insights but also to act as an information provider: a new experience of movie analysis supplemented by trivia, facts and pertinent information, box-office collections, trending news, and theater and show timings.
Product Features and Product Planning
Creating a scalable, polished and complete product is always driven by structured planning across every aspect: technical, requirements, business, domain and timelines. In the planning phase, tons of new use cases and add-ons were born, such as box-office predictions, a movie social graph, a viral gallery etc.
Data Mining: Python, Twitter API, Facebook Graph API, YouTube API
Natural Language Processing: NLTK, Stanford CoreNLP, Regex
Machine Learning: scikit-learn, NumPy
Backend Architectures: Flask, Tornado, Redis, Celery, MongoDB
Frontend Architectures: Angular.js, Bootstrap, HTML, D3.js, Google Charts
Mobile App: Ionic
Data Extraction Layer
The first layer of the Sociofuzz back end was an information and data extraction engine. A number of web listeners using REST APIs, along with web crawlers, were developed. These listeners were scheduled using cron jobs and hosted in the cloud to stream updated data and push it into the database layer.
Web Data Mining: To fetch relevant movie information such as meta info, song information, city-wise tickets and showtimes, box-office collections, and movie trivia and facts, I created different web mining modules using a seeding, crawling, parsing architecture wrapped in Celery, a distributed task queue that runs work across multiple worker processes. The data was stored in MongoDB, and Redis was used as the Celery broker.
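The seeding, crawling, parsing split can be sketched roughly as follows, using only the Python standard library. This is not the original code: the in-memory FAKE_PAGES dictionary, the URLs, and the "title" and "year" field names are all hypothetical stand-ins, and in a setup like the one described each stage would run as a Celery task doing real HTTP requests.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_PAGES = {
    "https://example.org/movies/1":
        "<html><h1 class='title'>Movie One</h1><span class='year'>2015</span></html>",
    "https://example.org/movies/2":
        "<html><h1 class='title'>Movie Two</h1><span class='year'>2016</span></html>",
}


def seed():
    """Seeding stage: produce the list of URLs to visit."""
    return sorted(FAKE_PAGES)


def crawl(url):
    """Crawling stage: fetch raw HTML (stubbed with a dictionary lookup)."""
    return FAKE_PAGES[url]


class MovieParser(HTMLParser):
    """Parsing stage: collect text from elements whose class is a known field."""

    FIELDS = {"title", "year"}  # assumed field names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current = css_class if css_class in self.FIELDS else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def parse(html):
    parser = MovieParser()
    parser.feed(html)
    parser.close()
    return parser.record


def run_pipeline():
    """Chain the three stages and attach the source URL to each record."""
    return [dict(parse(crawl(url)), url=url) for url in seed()]
```

Keeping the stages as separate functions is what makes it easy to hand each one to a task queue and retry a failed crawl without re-running the parse logic.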
Social Data Mining: To obtain social media data from different sources, I used Python wrappers for the Twitter REST API, Facebook Graph API and YouTube Data API. Throttling and rate limits on these APIs were a huge challenge, which I resolved by using multiple keys in a rotating, parallel manner. To get historical data from Twitter, I used a hack: generating new tweet IDs captured after every page scroll.
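One way to rotate keys around rate limits is a round-robin that skips keys that are still cooling down. The sketch below is an assumption about how such a rotator could look, not the original implementation; the KeyRotator class, the key strings and the cooldown length are hypothetical.

```python
import itertools
import time


class KeyRotator:
    """Round-robin over API keys, skipping keys that are cooling down."""

    def __init__(self, keys, cooldown_seconds=900, clock=time.monotonic):
        self._keys = list(keys)
        self._cycle = itertools.cycle(self._keys)
        self._cooldown = cooldown_seconds  # assumed length of a rate-limit window
        self._clock = clock
        # Every key is usable at the start.
        self._blocked_until = {key: float("-inf") for key in self._keys}

    def mark_rate_limited(self, key):
        """Call this when the API reports that `key` hit its limit."""
        self._blocked_until[key] = self._clock() + self._cooldown

    def next_key(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            if self._blocked_until[key] <= now:
                return key
        return None
```

A caller would ask for next_key() before each request, call mark_rate_limited() when the API answers with a rate-limit error, and sleep when next_key() returns None.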
Data Cleaning, Standardization and Deduplication Engine
The next layer was a data cleaning and standardization layer, where an engine was created to clean the data, remove noise, standardize variables and derive new variables in a structured manner. Social data is highly unstructured: it contains unknown slang, follows no proper grammar, and is full of spelling errors, non-standardized locations etc.
For example:
Tweet: “The muvie ws toooo gud #hppppy”
User Location: “Live in new delhi”, “near your place” etc.
This data was cleaned using text cleaning techniques such as punctuation removal, stop-word removal, text normalization, word stemming, slang standardization, regular-expression-based junk removal etc. Noisy, uncleaned user locations were processed using the Google Places API. The structured and parsed JSON was stored in the database.
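A few of these cleaning steps, applied to the example tweet above, might look like the sketch below. The SLANG and STOP_WORDS tables are tiny hypothetical samples, stemming is left out, and the function is an illustration rather than the engine that was actually built.

```python
import re
import string

# Hypothetical lookup tables; a real system would need far larger ones.
SLANG = {"muvie": "movie", "ws": "was", "gud": "good", "hppy": "happy"}
STOP_WORDS = {"the", "was", "a", "an", "is"}

# Delete ASCII punctuation (including '#') and curly quotes.
STRIP_TABLE = str.maketrans("", "", string.punctuation + "“”")


def clean_tweet(text):
    """Lowercase, strip punctuation, squeeze repeated letters, fix slang, drop stop words."""
    text = text.lower().translate(STRIP_TABLE)
    tokens = []
    for token in text.split():
        # Squeeze runs of three or more identical characters down to two:
        # "toooo" -> "too", "hppppy" -> "hppy".
        token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        token = SLANG.get(token, token)
        if token not in STOP_WORDS:
            tokens.append(token)
    return tokens


print(clean_tweet("“The muvie ws toooo gud #hppppy”"))
# prints ['movie', 'too', 'good', 'happy']
```

Squeezing repeated letters before the slang lookup keeps the lookup table small, since "hppppy" and "hpppppppy" both reduce to the same key.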
Data Science and Analysis: EDA, NLP, ML
Exploratory Data Analysis (EDA)
The immense amount of social data is composed of a variety of features at different levels: users, movies, opinions etc. I rearranged the data into a flat format, loaded it into the database and performed descriptive analysis using MongoDB's aggregation framework. Frequency distributions and top trends were calculated for different variables using univariate and bivariate analysis.
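As an illustration of a frequency distribution computed this way, the sketch below shows an aggregation pipeline of the kind that can be passed to pymongo's collection.aggregate(), next to a pure-Python equivalent so that it runs without a database. The "movie" field name and the sample documents are assumptions about the flattened data, not the real schema.

```python
from collections import Counter

# Pipeline for collection.aggregate(); "movie" is an assumed field name.
TOP_MOVIES_PIPELINE = [
    {"$group": {"_id": "$movie", "mentions": {"$sum": 1}}},  # count docs per movie
    {"$sort": {"mentions": -1}},                             # most mentioned first
    {"$limit": 2},                                           # keep the top two
]


def top_movies_locally(docs, limit=2):
    """Pure-Python equivalent of the pipeline above, for illustration only."""
    counts = Counter(doc["movie"] for doc in docs)
    return [
        {"_id": movie, "mentions": mentions}
        for movie, mentions in counts.most_common(limit)
    ]


# Hypothetical flattened documents, one per tweet.
sample_docs = [
    {"movie": "A"}, {"movie": "B"}, {"movie": "A"},
    {"movie": "C"}, {"movie": "A"}, {"movie": "B"},
]
print(top_movies_locally(sample_docs))
# prints [{'_id': 'A', 'mentions': 3}, {'_id': 'B', 'mentions': 2}]
```

A bivariate version would group on a compound key, for example movie and sentiment together, instead of a single field.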
Natural Language Processing (NLP)
I created different modules to extract user opinions from text data such as tweets, posts and comments. I used grammar-based sentence parsing with libraries such as Stanford CoreNLP and NLTK. I also created a grammar- and POS-tag-based sentiment analyzer to classify each opinion as positive, negative or neutral.
To find the important entities mentioned in the text, I used LDA (Latent Dirichlet Allocation) for topic modelling and NER (Named Entity Recognition).
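The general idea of a POS-tag-based sentiment classifier can be sketched as below. It is a much simpler stand-in for the analyzer described here: it assumes the input is already tagged with Penn Treebank tags (for instance by a tagger such as NLTK's), and the POLARITY lexicon and negation rule are hypothetical.

```python
# Hypothetical polarity lexicon; a real one would be far larger.
POLARITY = {
    "good": 1, "great": 1, "awesome": 1, "love": 1,
    "bad": -1, "boring": -1, "terrible": -1, "hate": -1,
}
NEGATORS = {"not", "n't", "never", "no"}
# Only adjectives, adverbs and verbs are treated as opinion-bearing words.
OPINION_TAG_PREFIXES = ("JJ", "RB", "VB")


def classify(tagged_tokens):
    """Classify a list of (word, tag) pairs as positive, negative or neutral."""
    score = 0
    negate = False
    for word, tag in tagged_tokens:
        word = word.lower()
        if word in NEGATORS:
            negate = True  # flip the polarity of the next opinion word
            continue
        if tag.startswith(OPINION_TAG_PREFIXES) and word in POLARITY:
            polarity = POLARITY[word]
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify([("The", "DT"), ("movie", "NN"), ("was", "VBD"),
                ("not", "RB"), ("good", "JJ")]))
# prints negative
```

Restricting the lexicon lookup to certain tags avoids scoring words like "love" when they appear as nouns in a movie title.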
Machine Learning (ML)
In the machine learning part, I used predictive analytics to forecast box-office collections for upcoming days, for which I created a linear regression model using scikit-learn in Python. I also created a three-layer neural net for movie engagement prediction.
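To show what a linear regression forecast does, here is a minimal single-feature least-squares fit written with the standard library only. It is not the scikit-learn model used in the project: the choice of "day since release" as the only feature and the collection figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    return intercept + slope * x


# Hypothetical data: day since release vs. daily collection (arbitrary units).
days = [1, 2, 3, 4]
collections = [10.0, 8.0, 6.0, 4.0]

slope, intercept = fit_line(days, collections)
print(predict(slope, intercept, 5))
# prints 2.0
```

A real model would take more features than the day index, and scikit-learn's LinearRegression handles the multi-feature case with the same fit-then-predict pattern.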
Back End Architecture
The complete data was stored in MongoDB hosted on EC2. The data was sharded across four clusters to make partitioning and querying faster.
The complete client side was backed by REST APIs built in Flask. The APIs were responsible for the interaction between the database and the client side, both the mobile application and the web application.
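The idea behind a four-way split can be illustrated with simple hash partitioning. This sketch only shows the concept of routing a document by a hash of its key; it is not how MongoDB routes data internally, and the movie_id shard key is an assumption.

```python
import hashlib

NUM_SHARDS = 4  # matches the four-way split described above


def shard_for(movie_id, num_shards=NUM_SHARDS):
    """Map a key to a shard index in a deterministic, roughly uniform way."""
    digest = hashlib.sha256(str(movie_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical keys: every document for the same movie lands on the same shard.
for movie_id in ["movie-101", "movie-102", "movie-103"]:
    print(movie_id, "-> shard", shard_for(movie_id))
```

Because the mapping depends only on the key, a query for one movie needs to touch a single shard, which is where the speed-up comes from.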
Front End Architecture
For the website UI I used Twitter Bootstrap as the CSS framework, and for UX I used Angular.js. The data visualizations were prepared using D3.js and Google Charts.
I used the Ionic Framework to make the iOS and Android apps of Sociofuzz. Ionic is awesome, easy to use, and based on Angular.js and HTML5.
Social Media Marketing
The product was built well for both web and mobile; however, no product can be influential if it is not marketed well. I used social media again to reach out to movie lovers and different customers. Dedicated pages on Facebook and Twitter were a great source. I also wrote a number of blog posts on Tumblr and on the website to increase the reach.
Feel free to reach out to me at firstname.lastname@example.org with ideas, suggestions and questions. I can also share the code and the data (~20GB) that I tracked for this product :)