Exploring Game of Thrones with DSX
The HBO series Game of Thrones has been a cultural phenomenon since its premiere in April 2011. The series is full of bloody battles and unpredictable allegiances, all in pursuit of one goal: ruling from the Iron Throne. Because the plot twists constantly, we thought it would be interesting to build a web application that helps us forecast who will be the final ruler of the Seven Kingdoms. The web application is built on an algorithm that uses machine learning and predictive analytics. We did all of this in the IBM Data Science Experience (DSX), which provides a collaborative platform for the data science community.
Goals of the project:
- Predict the final ruler of the Iron Throne by integrating data from multiple sources, including Kaggle and Twitter.
- Explore the data with Watson Analytics to gain better insight on the data collected.
- Showcase the data and final results with a user-friendly web application.
To make all of that happen, we used dashDB and DSX. Within DSX, we used RStudio and the plot.ly visualization package.
Steps to creating the application:
1. Data collection
Before starting any analytics or predictions, we needed data. We collected it from Kaggle and Twitter.
Kaggle provides structured data, and Twitter supplies unstructured data that we could clean and parse. Kaggle is a great resource because it provides an open source data set. Obtaining data from Kaggle was simple:
- Sign up for an account on the website.
- Click Datasets and scroll down to Game of Thrones, then download the data, which comprises three CSV files.
Twitter provided a supplemental data source for our analysis. DSX gave us easy access to both data sources, so we could begin the analysis quickly. Obtaining data from Twitter on Bluemix was also simple:
- Create a dashDB instance and select Load Twitter, which lets you obtain live Tweets from Twitter.
- Search the Tweets by character and label each resulting table with the character's name.
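The idea of searching Tweets by character and keeping one table per name can be sketched in a few lines of Python. This is only an illustration, not the dashDB Load Twitter feature itself: the character list, the sample Tweets, and the function name `group_tweets_by_character` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical character names, chosen for illustration only.
CHARACTERS = ["Jon Snow", "Daenerys", "Tyrion", "Cersei"]


def group_tweets_by_character(tweets, characters=CHARACTERS):
    """Return {character: [tweets mentioning that character]}.

    Matching is a simple case-insensitive substring test, so a tweet
    that names two characters lands in both groups.
    """
    tables = defaultdict(list)
    for tweet in tweets:
        lowered = tweet.lower()
        for name in characters:
            if name.lower() in lowered:
                tables[name].append(tweet)
    return dict(tables)


# Made-up sample tweets, not real collected data.
sample_tweets = [
    "Jon Snow knows nothing",
    "Team Daenerys all the way",
    "Tyrion and Jon Snow should team up",
    "No spoilers please",
]

tables = group_tweets_by_character(sample_tweets)
# One count per character, including characters nobody mentioned.
tweet_counts = {name: len(tables.get(name, [])) for name in CHARACTERS}
```

A per-character count like `tweet_counts` is the kind of value that could later feed a "tweets" attribute in a scoring model.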
2. Data ingestion
We took both data sets and integrated them into dashDB. This was a great choice because dashDB connects seamlessly to IBM Data Science Experience and Watson Analytics. We also had to do this because the beta version of Data Science Experience couldn't communicate directly with Watson Analytics. According to the DSX team, the final product will have a direct link to Watson Analytics. This is important because it will enable seamless collaboration between the business analyst and the data scientist. It also made sense to use a relational database, since we would compare the characters of the show across multiple tables.
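To show why a relational store suits this kind of comparison, here is a minimal sketch that joins two tables on the character name. It uses Python's built-in `sqlite3` as a stand-in for dashDB; the table names, column names, and values are placeholders invented for the example, not the project's actual schema or data.

```python
import sqlite3

# In-memory SQLite database standing in for dashDB (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table from the structured data set,
# one table of per-character tweet counts.
conn.executescript(
    """
    CREATE TABLE characters (name TEXT PRIMARY KEY, battles_won INTEGER);
    CREATE TABLE tweet_counts (name TEXT PRIMARY KEY, tweets INTEGER);
    """
)

# Placeholder rows, not real figures.
conn.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [("Character A", 3), ("Character B", 1)],
)
conn.executemany(
    "INSERT INTO tweet_counts VALUES (?, ?)",
    [("Character A", 120), ("Character B", 80)],
)

# Join the two sources so each character's attributes sit side by side.
rows = conn.execute(
    """
    SELECT c.name, c.battles_won, t.tweets
    FROM characters AS c
    JOIN tweet_counts AS t ON t.name = c.name
    ORDER BY t.tweets DESC
    """
).fetchall()
conn.close()
```

Each row in `rows` combines attributes from both sources for one character, which is the shape a scoring step needs.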
3. Exploratory analysis
Once all the data sets were loaded into dashDB, we imported the data into Watson Analytics. We started exploring there to gain better insight into the collected data and, possibly, a new perspective on it. In our research we found that, in many cases, business analysts explore data sets before working with a data scientist to start modeling the data or performing more in-depth analytics.
The following charts were created very quickly and easily using Watson Analytics.
4. Data modeling and visualization
RStudio, the most popular IDE for the R programming language, is integrated into DSX. RStudio makes it easy to manipulate the data and visualize the discoveries with plot.ly, a third-party visualization tool. Initially, we made a scoring system that aggregates points for each character from several weighted attributes. These attributes were allegiances (5%), tweets (20%), leads an army (20%), boats (5%), supernatural (20%), army size (5%), throne heir (15%), and battles won (10%). We used these attributes to classify the characters, then plotted the results using plot.ly. This let us visualize the data and integrate the model into the web application.
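A weighted scoring system of this kind can be sketched as a weighted sum. The weights below are the percentages listed above. Everything else is an assumption made for illustration: the project itself used R, each attribute is assumed to be normalized to the range 0 to 1 beforehand, and the character names and attribute values are placeholders.

```python
# Weights taken from the percentages in the text; they sum to 100%.
WEIGHTS = {
    "allegiances": 0.05,
    "tweets": 0.20,
    "leads_army": 0.20,
    "boats": 0.05,
    "supernatural": 0.20,
    "army_size": 0.05,
    "throne_heir": 0.15,
    "battles_won": 0.10,
}


def score(attributes, weights=WEIGHTS):
    """Weighted sum of attribute values.

    Assumes each value is already normalized to [0, 1]; attributes
    missing from the input count as 0.
    """
    return sum(weights[name] * attributes.get(name, 0.0) for name in weights)


def rank(characters):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, score(attrs)) for name, attrs in characters.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder characters and values, not the project's data.
characters = {
    "Character A": {"leads_army": 1.0, "tweets": 0.5},
    "Character B": {"throne_heir": 1.0},
}
ranking = rank(characters)
```

Because the weights sum to 1, a character with every attribute at its maximum scores 1.0, which keeps scores comparable across characters.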
Other businesses and industries can use the same tools to model factors within their own company and compare the results against other companies. In this example, HBO might compare viewership of its shows with that of shows on Netflix or Amazon Prime, such as the times when they are most frequently watched, or check whether weather affects viewing. These insights can help businesses make better decisions.
The following charts were created in DSX using plot.ly to visualize the data from RStudio.
The final step is to create and deploy a web application that displays all the discoveries. The web site is the culmination of the work done by the data scientist, business analyst, and developer. It all comes together to display the information in real time. We used the Node.js service in Bluemix and displayed the plot.ly charts with the built-in library extensions, which let the charts appear on an interactive web page. Building the web site encourages creativity and gives each user the ability to design what their viewers will see.
This post is a result of a collaborative project.
Originally published at datascience.ibm.com on September 13, 2016 by Cindy Kim.