DataScience for Developers: Build your first predictive model with R
As you have probably heard, ‘intelligent apps’ are the new black: today it’s easier than ever to enhance an application with features related to cognitive computing, neural networks or DataScience. These disciplines come with different levels of complexity, and if you want to embrace them to their fullest potential you’ll probably need to hire a specialist. However… we don’t always need 100%; sometimes 60% is good enough, right? :) Most developers have an engineering/math/physics background and are perfectly capable of understanding and ‘doing’ some DataScience on their own.
This post explains the basics: what a predictive model is and how to create one with R. In a future post we will see how to use that model from an application.
To illustrate the points, I’m going to use the domain and data set created for this Build 2016 session by the terrific PlainConcepts team (ibonilm, pablodoval, and more!) in collaboration with Tara Shankar and Pablo Castro. I recommend watching it to learn more ;)
R. A programming language that makes statistical and math computation easy and is therefore super useful for any machine learning/predictive analytics/statistics work. You can also try Python, F#, Octave, MATLAB…
How can we ‘predict’? We’re going to use well-known statistical methods (algorithms) to find the function (model) that best describes the dependency between a value and a set of variables (a.k.a. features). For instance, the value would be the price of a house and the variables would be the size, number of rooms, distance from a hospital, etc. Once we find the best model, we can feed it new values for the variables and get the estimated value at a certain confidence level :)
How do we find that relationship? By doing tons of math, but don’t worry, you don’t need to go back to your calculus books :D We just need to choose the right function call for the job. If you are really serious about optimizing it or using it for your core business (e.g. stock market…), you definitely need to talk to a specialist (DataScientist / math expert).
Predictive Analytics… Why? Great question. This is one of the many ways you can add ‘intelligence’ to your software. Data is great… but decisions and suggestions are even better: your apps can process and analyze tons of data and provide you, your business, and your customers with insights that can make the experience WAY better. I know that sometimes it’s difficult to come up with the right examples for our own domain, so let me list a few ideas around prediction: storage needs, revenue forecast, customer behavior, resource availability, lead quality, web traffic, weather?, recommendations... I’m sure you’ll find something to ‘predict’ in your domain based on the data you have.
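To make the house example concrete, here is a minimal sketch. The data is completely made up (the `houses` data frame, its columns and the numbers behind them are assumptions for illustration only); the point is just that a single call to `lm` does the math for us.

```r
# Made-up data for illustration only: price depends roughly linearly
# on size and number of rooms, plus some random noise
set.seed(42)
houses <- data.frame(
  size  = runif(50, min = 50, max = 200),   # square meters
  rooms = sample(1:5, 50, replace = TRUE)
)
houses$price <- 1000 * houses$size + 5000 * houses$rooms + rnorm(50, sd = 10000)

# lm() finds the coefficients of the linear function that best fits the data
house_model <- lm(price ~ size + rooms, data = houses)
print(coef(house_model))

# Feed the model new values for the variables to get an estimated price
new_house <- data.frame(size = 120, rooms = 3)
estimate <- predict(house_model, new_house)
print(estimate)
```

The fitted coefficient for `size` should land close to the 1000 used to generate the data, which is a quick way to convince yourself that the model really ‘found’ the relationship.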
Let’s build your first predictive model step by step:
- Download and install the tools. First of all, download and install Microsoft R Open (available on Windows, Ubuntu, RedHat, OS X…). I’m going to do the demo using the R Tools for Visual Studio, but you can use RStudio (how to make it work with Microsoft R Open) or the command line tools.
- Frame the problem. It’s critical to know what we are looking for. In this specific sample, we own a ski rental business and we want to predict the number of rentals that we will have on a future date. This information will help us get ready from a stock, staff and facilities perspective.
- Ingest the data. For this demo, download this data set, which contains info from previous years, and run these commands to load it into your R session:
>mydata = read.table("c:/pathtothefile/RentalFeatures.txt",header=TRUE)
#play with the data set, list the columns (for example with names)
>names(mydata)
#important...if your first column name is "X..Year",
#then let's change its name to "Year", or nothing will work :P
>colnames(mydata)[1] <- "Year"
#visualize the complete set (for example with View)
>View(mydata)
#or just few rows (for example with head)
>head(mydata)
#we are also going to create 3 additional columns on the data set
#this helps when building the model because we're explicitly saying
#that these values are categorical (kind of an enum in other languages)
>mydata$FHoliday = factor(mydata$Holiday)
>mydata$FSnow = factor(mydata$Snow)
>mydata$FWeekDay = factor(mydata$WeekDay)
#now let's split the data into 2 different sets
#one for training the model and the other one for validating it
>train_data = mydata[mydata$Year < 2015,]
>test_data = mydata[mydata$Year == 2015,]
#...and also save this specific column as a vector
#we will use it for a bulk check of the quality of the prediction
>test_counts <- test_data$RentalCount
When you think about how to apply this to your domain, you probably already have tons of data in your archive. Depending on the scenario you might need to capture more or different data, and if you do… remember to think about ‘all’ the data sources that you can use (transactions, customer behavior, telemetry, public sources such as weather or traffic…), not just the ones that you’ve traditionally used.
- Clean and prepare the data. These data sets are already ‘clean’ and ‘optimized’, but when you apply this to your own scenario you will probably find rows with wrong or empty values, and there are several techniques to deal with them properly. Also, taking into account that the process requires a lot of math, there are several best practices that are nice to have (e.g. scaling/normalizing all the values to simplify operations), especially if we’re talking about large data sets. You will probably spend most of your time on this phase O=)
- Play and plot (a.k.a. Exploratory Data Analysis). Plot your set! Based on its shape you will get a good intuition of which algorithm can be a good starting point (linear? logarithmic? high-order polynomial?). You might also discover that you can create new features (feature engineering) based on the ones you already have, remove others… The goal is to give the model the best context we can.
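The demo data doesn’t need cleaning, so here is a small sketch on a made-up data frame (`raw` and its values are assumptions, not part of the session’s data set) showing two common moves: dropping rows with empty or implausible values, and scaling a numeric column.

```r
# Hypothetical raw data with one empty value and one obviously wrong one
raw <- data.frame(
  Snow        = c(0, 1, 1, NA, 0),
  RentalCount = c(120, 540, -5, 610, 95)
)

# Keep only complete rows with a plausible (non-negative) rental count
clean <- raw[complete.cases(raw) & raw$RentalCount >= 0, ]

# scale() centers a column at mean 0 and rescales it to standard deviation 1
clean$RentalCountScaled <- as.numeric(scale(clean$RentalCount))
print(clean)
```

Dropping rows is the bluntest option; depending on your data you may prefer to fill the gaps instead, which is one of the places where an expert’s advice pays off.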
#let's play with visualizations
#it's super important to invest some time on plotting and playing
#with the data, it can help us get a good intuition to choose an
#algorithm, or help us identify issues
> plot(mydata$Snow, mydata$RentalCount)
> plot(factor(mydata$Snow), mydata$RentalCount)
> plot(factor(mydata$WeekDay), mydata$RentalCount)
> plot(ISOdate(mydata$Year, mydata$Month, mydata$Day), mydata$RentalCount)
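Plots are great for intuition, and a couple of numbers next to them don’t hurt. The sketch below uses a tiny made-up sample with the same column names as the rental data (the values are invented, and treating 6 and 7 as Saturday and Sunday is an assumption about the encoding) to summarize rentals per weekday and to derive a new feature from an existing one.

```r
# Made-up sample with the same column names as the rental data set
sample_data <- data.frame(
  WeekDay     = c(1, 2, 3, 4, 5, 6, 7, 6, 7, 1),
  RentalCount = c(40, 35, 50, 45, 80, 300, 320, 280, 310, 30)
)

# Average rentals per weekday: a numeric companion to the plots
by_day <- aggregate(RentalCount ~ WeekDay, data = sample_data, FUN = mean)
print(by_day)

# A simple engineered feature (assumes 6 and 7 are Saturday and Sunday)
sample_data$IsWeekend <- factor(sample_data$WeekDay %in% c(6, 7))
print(table(sample_data$IsWeekend))
```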
- Create the model and validate the prediction quality. Let’s start by using a linear regression algorithm (lm) to create a model, and use the test data set to check how good the model’s predictions are.
>model = lm(RentalCount ~ Month + Day + FWeekDay + FSnow + FHoliday, data = train_data)
>p = predict(model, test_data)
>plot(p - test_counts)
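The plot of the differences gives a visual check. If you also want a single number to compare experiments, two common choices are the root mean squared error (RMSE) and the mean absolute error (MAE). This is only a sketch: `rmse` and `mae` are helper functions defined here (they are not part of base R), and the two vectors are made-up stand-ins for `p` and `test_counts`.

```r
# Hypothetical helper functions, defined here for illustration
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
mae  <- function(predicted, actual) mean(abs(predicted - actual))

# Made-up numbers standing in for p and test_counts
predicted <- c(100, 250, 400)
actual    <- c(110, 240, 430)

print(rmse(predicted, actual))
print(mae(predicted, actual))
```

With the real model you would call `rmse(p, test_counts)`; lower is better, and the value is in the same units as the rental count.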
- Keep experimenting… A linear regression is probably not the best algorithm in this case. Try doing the same replacing ‘lm’ with ‘rpart’ and see how we get a better prediction (run help(lm) and help(rpart) to read about the difference). The domain and the plot & play will give you good intuitions, but it’s good to run several experiments… a good algorithm cheat sheet is also nice to have.
>library(rpart)
>model = rpart(RentalCount ~ Month + Day + FWeekDay + FSnow + FHoliday, data = train_data)
>p = predict(model, test_data)
>plot(p - test_counts)
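To get a feeling for why a tree can beat a linear regression, here is a sketch on simulated data. Everything in it is an assumption made for illustration: the `sim` data frame, the rule that rentals only take off when it snows on a weekend, and the `rmse` helper. It is not the session’s data, so take the numbers as a toy comparison.

```r
library(rpart)  # recommended package that ships with standard R distributions

set.seed(1)
n <- 400
sim <- data.frame(
  FWeekDay = factor(sample(1:7, n, replace = TRUE)),
  FSnow    = factor(sample(0:1, n, replace = TRUE))
)
# Made-up rule: rentals only take off when there is snow AND it is a weekend
busy <- sim$FSnow == "1" & sim$FWeekDay %in% c("6", "7")
sim$RentalCount <- ifelse(busy, 500, 50) + rnorm(n, sd = 20)

# First 300 rows to train, last 100 to validate
train <- sim[1:300, ]
test  <- sim[301:400, ]

# Hypothetical helper: root mean squared error
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

lm_model   <- lm(RentalCount ~ FWeekDay + FSnow, data = train)
tree_model <- rpart(RentalCount ~ FWeekDay + FSnow, data = train)

lm_rmse   <- rmse(predict(lm_model, test), test$RentalCount)
tree_rmse <- rmse(predict(tree_model, test), test$RentalCount)
print(c(lm = lm_rmse, rpart = tree_rmse))
```

The linear model adds up one effect per feature, so it struggles with a ‘snow AND weekend’ rule like the one simulated here, while a tree can split on one feature and then on the other.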
Great! We have a working predictive model, and we have tested it against some data and validated that it works quite well. Now it’s time to predict what’s going to happen in a given situation in the future… it’s this easy:
#let's try to predict the same data that they use at the session
>predict(model, data.frame(Month = 1, Day = 1, FWeekDay = factor(7), FSnow = factor(1), FHoliday = factor(0)))
#on a Sunday (WeekDay 7) the output is 645.7059
>predict(model, data.frame(Month = 1, Day = 1, FWeekDay = factor(4), FSnow = factor(1), FHoliday = factor(0)))
#on a Thursday (WeekDay 4) the output is ... way smaller :)
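A single number is nice, but earlier we talked about getting an estimate ‘at a certain confidence level’. For models created with `lm`, `predict` can also return a range around the estimate. The sketch below uses made-up data (`toy` and its `SnowDepth` column are hypothetical and not part of the rental data set), and it applies to `lm` models only; `predict` for `rpart` models has no `interval` argument.

```r
set.seed(7)
# Made-up training data: rentals grow with snow depth (hypothetical column)
toy <- data.frame(SnowDepth = runif(100, 0, 50))
toy$RentalCount <- 100 + 8 * toy$SnowDepth + rnorm(100, sd = 30)

toy_model <- lm(RentalCount ~ SnowDepth, data = toy)

# interval = "prediction" adds lower and upper bounds around the estimate
forecast <- predict(toy_model, data.frame(SnowDepth = 25),
                    interval = "prediction", level = 0.95)
print(forecast)  # columns: fit, lwr, upr
```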
Now that it’s working, we just need to package it and somehow call it from our application! We will see how to do that in the next post.
Really? Is that all? This is easy!
Well… yes and no. What we did is super basic. It can be applied to many scenarios, but as I mentioned at the beginning of the post, if you want to apply it to the core of your business I’d recommend getting advice from an expert: you can probably fine-tune the data better, find a better algorithm, optimize it... So yes, this was easy, but this is far from being ‘all’ :)