By Roman Daukuls for BigData Republic
Python is a great tool for doing machine learning: scikit-learn contains a bunch of useful machine learning tools and a pipeline to combine all data processing steps into a single flow. Of course, you are not limited to scikit-learn’s tools: you can also add your own estimators (see our previous blog post for a tutorial on how to do that). However, you might still come across a situation where you want to use a function from that one specific R library which is not available in Python. You then face the following options:
- implement the function in Python yourself
- move the whole pipeline to R
- use R and Python separately
- call R from within Python using rpy2
We’ll talk about the last option and show an example of how you can combine Python and R to perform contextual changepoint detection. rpy2 is a Python package that provides access to R from Python. It provides the capability to convert Python objects into R objects and vice versa. Thus, with the help of rpy2, you can integrate R’s functionality into your Python workflow.
To reproduce the example code in this post, you will need Python, R and rpy2 installed. We used R v.3.4.3, Python v.2.7.14 and rpy2 v.2.7.0.
One example where rpy2 may come in handy is contextual changepoint detection. Changepoint detection is used to detect whether a change has occurred in a sequence of observations (usually a univariate time series). A changepoint can be defined as a point in a sequence of observations after which the process that generates data has changed. For example, in the graph below, the process on the left side of the graph undergoes a mean shift, while the process on the right undergoes a variance shift at
R has some libraries that can be useful to detect changepoints in univariate series, for example, Twitter’s BreakoutDetection or cpm. But what if the outcome of our process is affected by many other variables, that is, by its context? In that case, it makes sense to remove the effect of these context variables first so that they do not interfere with the actual changepoint detection: we can build a model that predicts the outcome of the process and then apply a changepoint detection algorithm to its prediction errors. The example below shows this in more detail. We will build a predictive model using scikit-learn’s pipeline in Python and use rpy2 to call an R library that performs changepoint detection on the model’s prediction errors.
An Example: Truck Fuel Consumption Monitoring
Consider a situation where you would like to monitor fuel consumption of a truck. In our simplified reality, a truck’s fuel consumption depends on its weight, operating area and headwind. Our task is to detect if a truck’s fuel efficiency has started to deteriorate.
First, we define functions to generate a sample dataset:
Let’s generate 365 observations with a change starting on day 300 (fuel consumption starts to increase from that point on, perhaps due to a faulty engine or low tire pressure — something we do not observe in the data):
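The original listing is not reproduced here, so the following is only a minimal sketch of what such a generator could look like. The function name generate_truck_data, the column names and every coefficient are assumptions made up for illustration; the only details taken from the text are the three context variables, the 365 observations and the gradual increase starting on day 300.

```python
import numpy as np
import pandas as pd


def generate_truck_data(n_days=365, change_day=300, drift_per_day=0.05, seed=42):
    """Simulate daily fuel consumption (litres per 100 km) for one truck.

    All coefficients are illustrative assumptions, not values from the post.
    """
    rng = np.random.RandomState(seed)
    day = np.arange(n_days)
    weight = rng.uniform(10, 40, n_days)      # load in tonnes
    headwind = rng.normal(0, 10, n_days)      # km/h, negative means tailwind
    # The operating area alternates every 30 days between city and countryside.
    area = np.where((day // 30) % 2 == 0, "city", "countryside")

    fuel = (20.0 + 0.3 * weight + 0.1 * headwind
            + np.where(area == "city", 5.0, 0.0)
            + rng.normal(0, 0.5, n_days))
    # Unobserved deterioration: consumption creeps up after change_day.
    fuel = fuel + drift_per_day * np.clip(day - change_day, 0, None)

    return pd.DataFrame({"day": day, "weight": weight, "area": area,
                         "headwind": headwind, "fuel": fuel})


df = generate_truck_data(n_days=365, change_day=300)
```

Note that the cause of the drift is deliberately not stored in any column, so a model trained on the context variables cannot explain it.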
Now that we have our data frame, we proceed with the standard machine learning steps: we split the data into X and Y and form a training set and a test set. We leave the last 100 observations for the test set:
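A chronological split of this kind could be sketched as follows. The stand-in data frame and its column names are assumptions; the point is only that the rows are split by position rather than shuffled, so the test set contains the most recent 100 days.

```python
import numpy as np
import pandas as pd

# Stand-in for the generated data frame; the column names are assumptions.
df = pd.DataFrame({
    "weight": np.linspace(10, 40, 365),
    "fuel": np.linspace(25, 35, 365),
})

X = df.drop(columns="fuel")  # features (context variables)
y = df["fuel"]               # target: litres per 100 km

# Keep the time order intact: the last 100 rows become the test set.
n_test = 100
X_train, X_test = X.iloc[:-n_test], X.iloc[-n_test:]
y_train, y_test = y.iloc[:-n_test], y.iloc[-n_test:]
```

With 365 rows indexed from 0, the test set starts at row 265, so a change that begins on day 300 sits 35 observations into the test set.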
If we plot the target variable (liters per 100 km), there are some clearly visible changepoints in the graph of fuel consumption. However, these changepoints have to do with the context of the truck’s operation: in this case, its operating schedule alternates between the city and the countryside, and fuel consumption is higher in the city. Detecting changepoints on y itself without removing the context effects would result in false alarms.
So let’s try to remove the context effects and then apply a changepoint detection algorithm. To remove the context, we can build a model that predicts our truck’s fuel consumption. We can then see if there are systematic deviations (changepoints) in prediction errors. To build a predictive model, let’s use the scikit-learn pipeline with a random forest regressor. We first create some classes to facilitate our prediction pipeline:
We then construct the pipeline and fit it on our training data (of course, in a real-world scenario you should tune the parameters of your model; we assume here that the best model has already been found):
We are now ready to make predictions for the test set and compute prediction errors:
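The helper classes, the pipeline and the error computation described above might look roughly like this. It is a sketch under assumptions: DummyEncoder is a hypothetical transformer written for this example, the data is a compact stand-in with made-up coefficients, and the random forest settings are arbitrary rather than tuned.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


class DummyEncoder(BaseEstimator, TransformerMixin):
    """One-hot encode the given columns, keeping the layout seen during fit."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        self.feature_names_ = pd.get_dummies(X, columns=self.columns).columns
        return self

    def transform(self, X):
        encoded = pd.get_dummies(X, columns=self.columns)
        # Align with the training layout; missing dummy columns become 0.
        return encoded.reindex(columns=self.feature_names_, fill_value=0).astype(float)


# Compact stand-in data (all coefficients are illustrative assumptions).
rng = np.random.RandomState(0)
day = np.arange(365)
df = pd.DataFrame({
    "weight": rng.uniform(10, 40, 365),
    "headwind": rng.normal(0, 10, 365),
    "area": np.where((day // 30) % 2 == 0, "city", "countryside"),
})
df["fuel"] = (20 + 0.3 * df["weight"] + 0.1 * df["headwind"]
              + 5 * (df["area"] == "city")
              + 0.05 * np.clip(day - 300, 0, None)   # hidden drift after day 300
              + rng.normal(0, 0.5, 365))

X, y = df[["weight", "headwind", "area"]], df["fuel"]
X_train, X_test = X.iloc[:-100], X.iloc[-100:]
y_train, y_test = y.iloc[:-100], y.iloc[-100:]

pipeline = Pipeline([
    ("encode", DummyEncoder(columns=["area"])),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)

# Prediction error = observed minus predicted consumption on the test set.
y_pred = pipeline.predict(X_test)
errors = y_test - y_pred
```

Because the model only sees the context variables, anything it cannot explain, such as the hidden drift, ends up in errors, which is the series passed on to the changepoint detector.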
We can now plot the true signal and the predicted signal and show the prediction error:
Next, we perform changepoint detection. In order to accomplish that, we use the cpm package in R (make sure the cpm library is installed in your R environment or have a look here). We load the required classes from rpy2:
The FloatVector function converts a Python sequence (a pandas Series in our case) into an R object, while the function robj_to_dict converts a list-like R object into a Python dictionary.
importr makes it possible to use an R library in the same way you would use a Python module:
The cpm library contains a function detectChangePoint. In a nutshell, this function runs through a series sequentially, evaluating whether any point in the observations seen so far is a changepoint. This is decided based on the maximum value of a two-sample test statistic (sample 1 = observations before a candidate changepoint and sample 2 = observations after a candidate changepoint). If the maximum of the two-sample test statistics exceeds a precomputed threshold (cpm supplies these thresholds for you), a changepoint is declared.
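To make the idea concrete, here is a minimal pure-Python sketch of that statistic for a fixed batch of observations, assuming a Student t-test with pooled variance. It is not cpm's implementation: cpm processes the series sequentially and compares the maximum against precomputed thresholds, both of which this sketch leaves out. The function name max_two_sample_t and the toy series are made up for illustration.

```python
import math
from statistics import mean, variance


def max_two_sample_t(x, min_size=2):
    """Return (best_split, max_abs_t) over all candidate splits of x.

    best_split is the number of observations in the first sample.
    """
    n = len(x)
    best_split, best_t = None, 0.0
    for k in range(min_size, n - min_size + 1):
        left, right = x[:k], x[k:]
        # Pooled variance of the two samples (classic Student t-test).
        pooled = ((k - 1) * variance(left)
                  + (n - k - 1) * variance(right)) / (n - 2)
        if pooled == 0:
            continue  # statistic undefined when there is no variation at all
        t = abs(mean(left) - mean(right)) / math.sqrt(pooled * (1.0 / k + 1.0 / (n - k)))
        if t > best_t:
            best_split, best_t = k, t
    return best_split, best_t


# Toy series: 60 values around 0.5 followed by 40 values around 10.5.
series = [0.0, 1.0] * 30 + [10.0, 11.0] * 20
split, t_max = max_two_sample_t(series)
print(split, round(t_max, 1))  # the best split is after observation 60
```

A detector would declare a changepoint only if t_max exceeds a threshold chosen for the desired false alarm rate, which is the part cpm handles.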
detectChangePoint takes the following key arguments:
- x: the time series vector in which we look for changepoints
- cpmType: the statistical test used to compare the distributions before and after a potential changepoint
- ARL0: the average run length (adjusts the sensitivity of the changepoint detector)
We will use the t-test as a basis for changepoint detection, and set the average run length to 1000 samples (if there are no changepoints, a false changepoint is expected every 1000 observations).
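The call could be sketched as below. This assumes that R, the cpm package and rpy2 are installed. robj_to_dict is written here as a small hypothetical helper, since its original definition is not shown in this post, and detect_changepoint is a made-up wrapper name. The rpy2 imports sit inside the function so that the module can still be loaded on a machine without R.

```python
def robj_to_dict(robj):
    """Convert a named, list-like R object into a plain Python dict.

    Hypothetical helper: pairs each element name with its values.
    """
    return {name: list(values) for name, values in zip(robj.names, robj)}


def detect_changepoint(errors, cpm_type="Student", arl0=1000):
    """Run cpm's detectChangePoint on a sequence of prediction errors."""
    # Imported lazily: these lines need rpy2 and a working R installation.
    from rpy2.robjects.packages import importr
    from rpy2.robjects.vectors import FloatVector

    cpm = importr("cpm")  # fails if the cpm package is not installed in R
    result = cpm.detectChangePoint(FloatVector(errors),
                                   cpmType=cpm_type, ARL0=arl0)
    return robj_to_dict(result)
```

According to the cpm documentation, the returned list includes the fields changeDetected, detectionTime and changePoint, so a call such as detect_changepoint(errors) should yield a dictionary with those keys. The reported positions refer to the series that was passed in (here the test-set errors), not to the day number in the full dataset.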
We can see that once the context has been removed, we can detect the actual changepoint that was present in the generated data. Below we plot the
During data generation we specified a changepoint at t=300, so cpm does a reasonable job of detecting the correct time step of the changepoint.
This is obviously just the tip of the iceberg of changepoint detection, as there are many more algorithms available to perform this task. However, we hope this post serves as a good starting point if you are facing a similar problem. More generally, changepoint detection seems to be a useful tool to continuously monitor the performance of a predictive model.
The author of this article, Roman Daukuls, is a data scientist at BigData Republic, a data science consultancy company in the Netherlands. We try to hire the best of the best. If you are interested in using changepoint detection and other machine learning on practical use cases, feel free to contact us at firstname.lastname@example.org.
Our website: https://bigdatarepublic.nl/