Introduction To Analytics Modeling: Week 3 — Basic Data Prep & Change Detection
The third week's notes from Intro To Analytics Modeling! Check out the course on edX.
Introduction To Data Preparation
So far we haven’t really looked into the process of preparing a dataset. We’ve mainly focused on the analysis of data because we have been working with cleaned data sets.
In the real world though, most of the data that you receive — from a client or at work — will likely need to be preprocessed into a form a computer can work with, and checked for errors.
A couple of common issues to watch out for in data include:
- Scaling — As we have seen, some data, like income, can be orders of magnitude larger than other data, like credit score. We need to bring both onto the same scale so that one variable doesn't overpower the other (a quick sketch follows this list).
- Extraneous Information — Some data sets might contain information that we don't really need, which could make our models overly complicated and harder to interpret.
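To make the scaling point concrete, here's a rough sketch in Python (not from the course; the income and credit-score numbers are made up) showing two common ways to put features on comparable scales:

```python
import numpy as np

# Hypothetical data: income is orders of magnitude larger than credit score
income = np.array([42_000, 58_000, 125_000, 73_000], dtype=float)
credit_score = np.array([640, 710, 780, 695], dtype=float)

def min_max_scale(x):
    """Rescale a feature to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale a feature to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

# After scaling, both features live on a comparable scale,
# so neither one overpowers the other in a model.
print(min_max_scale(income), min_max_scale(credit_score))
print(standardize(income), standardize(credit_score))
```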
Outliers
Often, data sets will contain outliers — data points that are extreme compared to the rest. The simple answer would be to identify and remove the outliers from the data. This may be the appropriate approach sometimes, but depending on the context, we may be able to gain stronger insight into our process based on how the outliers are behaving.
The most obvious form of outlier is the point outlier (O1). A small group of data points may be outliers as well (O2), but they might have a good reason for it, so it's worth investigating.
Contextual Outliers
Even though the data point isn't any lower than the other low points, its appearance in this part of the cycle is unexpected. We can only see this difference in relation to the other points.
We may also find collective outliers. In fig 3, an entire set of data points appears to be outliers given what we expect to see at that point in time.
Finding Outliers
Box-and-whisker plots, often referred to simply as box plots, allow us to visualize outliers in one dimension or feature.
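As a rough sketch (my own, with made-up numbers), here is the same idea in code, using the usual 1.5 × IQR whisker rule that box plots are drawn with:

```python
import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 55])  # 55 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                 # interquartile range (the "box")
lower = q1 - 1.5 * iqr        # standard box-plot whisker limits
upper = q3 + 1.5 * iqr

print(data[(data < lower) | (data > upper)])   # -> [55]
```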
Sometimes it's not that easy to identify outliers — especially multidimensional ones. For example, if we are considering data with many features, labeling a data point an outlier just because one feature value looks extreme might not be the best idea.
In this case, we can fit a model to the data and observe where the model makes extremely large prediction errors.
Consider the example in fig 5.
If we fit an exponential smoothing model to this data, it would predict the next temperature to be right on the curve, and the prediction would be way off because the actual point is an outlier.
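Here's a rough sketch of that idea (my own, not the course's code): fit simple exponential smoothing, then flag points whose one-step-ahead prediction error is unusually large. The smoothing constant and cutoff are arbitrary choices.

```python
import numpy as np

def smoothing_outliers(x, alpha=0.3, k=2.0):
    """Flag points whose one-step-ahead exponential-smoothing prediction
    error is more than k standard deviations from the mean error."""
    x = np.asarray(x, dtype=float)
    pred = np.empty_like(x)
    level = pred[0] = x[0]
    for t in range(1, len(x)):
        pred[t] = level                        # forecast = current smoothed level
        level = alpha * x[t] + (1 - alpha) * level
    resid = x - pred
    z = (resid - resid.mean()) / resid.std()
    return np.where(np.abs(z) > k)[0]          # indices of suspect points

temps = [21, 22, 21, 23, 22, 45, 23, 22]       # the 45 is the "way off" point
print(smoothing_outliers(temps))               # -> [5]
```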
Dealing With Outliers
Keep in mind that not all outliers are created equal and context really matters.
It may be the case that sometimes the data is just bad. Remember that there is an entire back-end process that precedes all of our analysis — data collection! Sometimes events like power outages, sensor failures and natural disasters can mean problems for the data collection and can result in unusable data.
The only way to really know is to analyze the data further and ask yourself questions such as:
- How was the data collected?
- Where did it come from?
- What's the context?
Bad Data
- Erroneous data — We may simply be able to eliminate the suspects from our analysis and move on with our lives.
- Imputing data — Sometimes we have missing data, which generally presents itself as blank spaces in our data frames and spreadsheets! In this case we may be able to estimate the missing value from the rest of the data (a simple sketch follows this list).
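A minimal sketch of the simplest kind of imputation (made-up numbers; mean imputation is just one simple option among many, such as regression-based imputation):

```python
import pandas as pd

# Hypothetical frame with one missing income value
df = pd.DataFrame({
    "income":       [42_000, None, 125_000, 73_000],
    "credit_score": [640, 710, 780, 695],
})

# Simplest fix: fill the blank with the column mean (or median)
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```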
Real/Correct Data
Real data takes a little more work to handle — especially if there is true randomness in our data, which means, by definition, that some data points will be outliers.
Consider a normal distribution, where roughly 5% of the data falls more than 2 standard deviations from the mean.
So, for example, if you have collected 100,000 data points, roughly 5,000 of them will be outliers by that definition — but wait! This is useful data and could contain very important information about the system!
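A quick sanity check on that figure:

```python
from scipy.stats import norm

# Fraction of a normal distribution lying more than 2 standard
# deviations from the mean (both tails)
print(2 * norm.sf(2))   # ~0.0455, i.e. roughly 5%
```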
If we go around removing every outlier before fitting our model, like a dictator executing anyone who doesn't fall in line with everyone else, our model will be far too optimistic and will tend to make more mistakes in prediction.
It helps to keep the randomness of the system in mind when considering what data we want to use to fit our model. For example, if we fit a model for predicting traffic on a highway without taking accidents — which cause backups — into account, what might our result be? It will be far too optimistic in estimating traffic flow because, while they are random, accidents do happen and we must take them into account.
One way to further our understanding of the situation and perhaps improve our estimates is to:
- Estimate the probability of an outlier occurring — We may use a logistic regression to learn how likely it is for an outlier to occur based on different conditions.
- Create another model for estimates under normal conditions — Modeling sans outliers.
Together, these two models may give us a better idea of what to expect (a sketch of this two-model idea follows below).
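Here's a rough sketch of that two-model idea using the traffic example from above. Everything here is made up for illustration: the simulated data, the feature names, and the way the two models are combined are my own assumptions, not the course's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Simulated traffic data: hour of day, rain indicator, travel time,
# with an "accident" flag marking the outlier days
n = 500
hour = rng.integers(0, 24, n)
rain = rng.integers(0, 2, n)
accident = rng.random(n) < (0.03 + 0.05 * rain)        # rain makes accidents likelier
travel_time = 30 + 0.5 * hour + 5 * rain + 40 * accident + rng.normal(0, 3, n)

X = np.column_stack([hour, rain])

# Model 1: how likely is an outlier (accident) under given conditions?
outlier_model = LogisticRegression().fit(X, accident)

# Model 2: expected travel time under normal (no-accident) conditions
normal = ~accident
baseline_model = LinearRegression().fit(X[normal], travel_time[normal])

# Combine the two for a 5 pm rainy-day estimate
x_new = np.array([[17, 1]])
p_accident = outlier_model.predict_proba(x_new)[0, 1]
baseline = baseline_model.predict(x_new)[0]
typical_delay = travel_time[accident].mean() - travel_time[normal].mean()
print(baseline + p_accident * typical_delay)
```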
Sometimes, the good data must die.
It may be the case that we have to leave data out when fitting our model, even if it is real data, because some outliers are not predictable at all. These occurrences, if included in our data, will only adversely affect our predictions under normal conditions.
Imagine you own a restaurant. It’s not a bad restaurant but it’s nothing to write home to mum about — don’t get too excited. One fine summer’s day, Justin Bieber, suffering from a massive hangover, happens to pull his tour bus into your parking lot. Many selfies are taken, autographs are signed. Next thing you know, because he’s so grateful for your mediocre steak — following 35 hours of nothing but liquor and bad decisions — he tweets about your restaurant to his 100 million + followers. Your sales for the next day might look like this:
In this scenario, it would only make sense to remove this day's sales data, because the probability of this happening again is, unfortunately, very small, and keeping that day in the data would only be detrimental to making predictions under normal conditions.
Change Detection
As the name implies, this is nothing more than detecting changes in the system and responding accordingly, if necessary.
And since change is, by definition, in relation to a previous state, the data in question will be time-series / time-dependent data.
Change detection helps us answer some key questions:
- Do we need to take action? — Changes in the system can be due to expected random variation or unexpected/expected change that requires our attention: Do I need to be concerned that my girlfriend is annoyed or is it just an expected random fluctuation?
- Has some action we have taken had an effect? — Generally, when we make an intervention we expect some aspect of the outcome to change. However, we also need to be able to determine if the outcome changed because of our intervention or simply by chance: is my girlfriend happier because I've learned to put the toilet seat down or is it simply due to randomness?
- How have things changed over time? — Paying attention to how things have changed over time allows us to observe patterns and plan for the future: Is my girlfriend still into Deadmau5 or should I not get her his latest album?
Additionally, change detection can help us identify potential problems before they happen.
CUSUM Change Detection
CUSUM (Cumulative Sum) is a method of change detection based on monitoring a variable and an associated threshold. It is useful in that it helps to account for random variation so we’re not alarmed unnecessarily.
We can control CUSUM in such a way that depending on the context of the problem, we can adjust the sensitivity of the change detection.
Say we have time series data that looks like fig 8.
In this figure, X(t) can be any time series data with random fluctuation: price of a stock, blood glucose levels, annoyance-at-me level of my girlfriend, etc.
μ is the expected value of the observation if there is no change — that is, the mean of X under normal conditions.
Let’s define S(t) such that:
The equation in figure 10 defines S(t) as either 0 or the sum of the previous value of S and the difference between the current observed value and the mean — whichever is larger: S(t) = max{0, S(t-1) + (X(t) - μ)}. S(t) is an aggregate of all the positive differences between the observed value and the expected (mean) value.
Once we have this value we are going to be watching S(t) and asking:
Is S(t) ≥ T?
where T is some threshold that we have defined. As long as S(t) stays below the threshold value, we have nothing to be concerned about — axle temperatures are still fine, blood glucose levels normal, my girlfriend is no more annoyed at me than she usually is — all good. We don't register a change and thus can keep on keeping on.
Since we are only concerned about increases right now, any value that is negative gets set to zero. Later on, we can look at changing that if we wanted to watch for decreases.
As we keep watching S(t), if meaningful changes are occurring, eventually S(t) ≥ T and we register a change.
We can also control how sensitive the model is to fluctuations in the observed value. Depending on how much randomness we expect in the system, we can add and tune a parameter ‘C’.
Now, we can adjust ‘C’ so that it becomes harder or easier for the sum S(t-1) + (X(t) - μ - C) to be greater than 0; with the parameter added, S(t) = max{0, S(t-1) + (X(t) - μ - C)}.
The larger ‘C’ gets, the harder it is for S(t-1) + (X(t) - μ - C) to be greater than 0: X(t) must be much higher than the expected value μ to positively impact S(t), so the model is less sensitive to variation in the observed value X(t). If we expect a lot of random variation in the system, meaning the observed value varies a lot across time, a larger C value will be more appropriate — any change in a system that has high randomness is more likely to be a meaningless variation.
Conversely, the smaller ‘C’ gets, the easier it is for S(t-1) + (X(t) - μ - C) to be greater than 0: X(t) doesn't have to be much higher than the expected value μ to positively impact S(t), so the model is more sensitive to variation in the observed value X(t). If we don't expect a lot of random variation in the system, meaning the observed value doesn't vary much across time, a smaller C value will be more appropriate — any change in a system that has low randomness is more likely to be a meaningful variation.
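Putting that together, here's a rough sketch of the increase-detecting CUSUM in Python (the series, μ, C, and T values below are arbitrary and made up purely for illustration):

```python
def cusum_increase(x, mu, C, T):
    """Return the index at which an increase is detected, or None.

    S(t) = max(0, S(t-1) + (x(t) - mu - C)); a change is flagged once S(t) >= T."""
    s = 0.0
    for t, xt in enumerate(x):
        s = max(0.0, s + (xt - mu - C))
        if s >= T:
            return t
    return None

# Arbitrary example: a series that drifts upward after index 10
x = [50, 51, 49, 50, 52, 50, 49, 51, 50, 50,
     53, 55, 56, 58, 57, 59, 60, 62, 61, 63]
print(cusum_increase(x, mu=50, C=1, T=20))   # -> 14 (change detected at index 14)
```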
Luckily for me, I don’t have a very moody girlfriend so I can afford to be a little more sensitive than if her moods fluctuated a lot for no apparent reason.
CUSUM — Finding The Right ‘C’ and ‘T’ Values
Deciding on the right values for these two parameters depends a lot on the context and the specific application. There may be occasions where we would prefer a more sensitive system even if that results in some false positives, depending on how costly a false positive is to address — do we have to shut down an entire production line? Should we rush to the emergency room? Should I buy my girlfriend a fancy box of chocolates and apologize profusely?
CUSUM Detecting Decreases
We can use the same approach to detect decreases by simply rearranging the equation. We will still be watching for when S(t) ≥ T but it will now indicate a meaningful decrease.
Now we are subtracting the observed value from the expected value: S(t) = max{0, S(t-1) + (μ - X(t) - C)}. If μ, the expected value, is greater than X(t), there has been a decrease and we add that positive difference to S(t). Again, we use the parameter ‘C’ in the same way to control how sensitive the model is to decreases in X(t).
We can also watch for increases and decreases simultaneously by running both versions at once (a short sketch follows).
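The decrease-side version is the same loop with the difference flipped; running it alongside the increase-side sketch above watches both directions at once:

```python
def cusum_decrease(x, mu, C, T):
    """Flag a meaningful decrease: S(t) = max(0, S(t-1) + (mu - x(t) - C)),
    with a change registered once S(t) >= T."""
    s = 0.0
    for t, xt in enumerate(x):
        s = max(0.0, s + (mu - xt - C))
        if s >= T:
            return t
    return None
```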
Control Charts
Control charts provide an excellent way to visualize change detection. We can plot S(t) and the threshold as follows.
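A minimal matplotlib sketch of such a chart, reusing the made-up series and parameters from the CUSUM example above:

```python
import matplotlib.pyplot as plt

x = [50, 51, 49, 50, 52, 50, 49, 51, 50, 50,
     53, 55, 56, 58, 57, 59, 60, 62, 61, 63]
mu, C, T = 50, 1, 20

S, s = [], 0.0
for xt in x:
    s = max(0.0, s + (xt - mu - C))   # increase-side CUSUM statistic
    S.append(s)

plt.plot(S, label="S(t)")
plt.axhline(T, color="red", linestyle="--", label="threshold T")
plt.xlabel("t")
plt.ylabel("cumulative sum S(t)")
plt.legend()
plt.show()
```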
Hypothesis Testing
The homework for week 3 touched on hypothesis testing, as we needed to use Grubbs' test for outliers.
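For reference, here's a rough sketch (my own, not the homework solution) of the two-sided Grubbs' test using the standard critical-value formula; the data values are made up:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test: returns (suspect value, reject_null).

    Null hypothesis: there are no outliers in the (roughly normal) data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    g = np.max(np.abs(x - mean)) / sd                   # test statistic
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)    # critical t value
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    suspect = x[np.argmax(np.abs(x - mean))]
    return suspect, g > g_crit

data = [12, 14, 14, 15, 16, 17, 18, 19, 55]
print(grubbs_test(data))   # the value 55 is flagged as an outlier
```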
For an intuitive explanation of hypothesis testing, watch the Khan Academy video below; fig 15 can also serve as guidance.
Also watch this very insightful video, which highlights some of the problems with the scientific method even though, as the narrator points out, it is by far the best method for arriving at the truth that we have managed to come up with!