Last week was hackathon week at work. I decided to do an Azure Internet of Things (IoT) project to learn more about Azure’s IoT offerings.
My son and I built a hygrometer, and then I connected it to Azure IoT Central so we could see its telemetry in a real-time dashboard. A hygrometer is a device that measures both humidity (actually, relative humidity, as I discovered during the course of this project) and temperature. It’s good for monitoring attics, crawlspaces and other hard-to-reach locations. …
App instrumentation generally involves significant manual effort, with application code invoking logging/metrics/tracing SDKs when something interesting happens. This is useful, but not without its challenges. For one, it’s a lot of work. It also leads to a lot of code cruft. The most consequential challenge, however, is that it results in inconsistent observability data (e.g., free-form log messages, metrics embedded in log messages, unconventional metric and dimension names). There’s little leverage, and it’s hard to do anything systematic with the data.
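To make the problem concrete, here’s a sketch of the kind of hand-rolled instrumentation I mean. The handler and field names are hypothetical, but the pattern should look familiar:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bookings")

def create_booking(user_id):
    time.sleep(0.05)  # stand-in for real business logic
    return {"user": user_id, "status": "confirmed"}

def handle_booking(user_id):
    start = time.time()
    logger.info("handling booking request")  # free-form log message

    booking = create_booking(user_id)

    # Metric embedded in a log message, with ad hoc field names.
    # Another team might log "elapsed_millis" or "durationMs" instead.
    elapsed_ms = 1000 * (time.time() - start)
    logger.info("booking created, elapsedMs=%d user=%s", elapsed_ms, user_id)
    return booking

handle_booking("user-42")
```

Every service invents its own message formats and field names, so extracting consistent metrics across the system means writing brittle parsers for each one.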
$ docker run -d --name influxdb -p 8086:8086 influxdb:1.7.8
$ docker exec -it influxdb influx
Connected to http://localhost:8086 version 1.7.8
InfluxDB shell version: 1.7.8
> create database mydb
> use mydb
> insert bookings value=102
> insert bookings value=108
> insert bookings value=95
> select * from bookings
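If all went well, the select returns the three points we just wrote, each stamped with a server-assigned nanosecond timestamp. The output looks roughly like this (your timestamps will differ; they’re shown as placeholders here):

name: bookings
time                value
----                -----
<timestamp>         102
<timestamp>         108
<timestamp>         95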
This post presents time series from a technical perspective and lays out two key challenges for time series analysis. It is based on the dense theoretical treatment in Mathematical Foundations of Time Series Analysis: A Concise Introduction, by Jan Beran. Here the treatment is less dense, since I aim to make the material more accessible to practitioners like myself.
First we’ll define time series and related concepts. Then we’ll use this foundation to understand the two key challenges for time series analysis.
When we talk about time series, sometimes we’re talking about time series data (observations) and other times we’re…
On teams, decision-making by dictator and by committee both suck. Dictators generate mediocre decisions quickly, and committees generate mediocre decisions slowly, if at all. Over time both approaches kill team morale.
I had a manager, Joe Natoli, from whom I learned an effective, balanced approach. The idea is that every decision has an owner. If it’s unclear who owns a decision, start by identifying the owner. The rest of us support the owner by offering perspectives to inform the decision.
The team lead can override a decision, but this should happen only in extreme circumstances. I’ve never had to do it.
My team at work is building a time series anomaly detection system that automatically creates anomaly detectors to monitor application health. We started with the humble constant threshold detector, which uses a constant threshold to perform the normal-vs-anomaly classification task. We want to create constant threshold detectors for stationary time series, which are, roughly speaking, series whose statistical properties (e.g., mean, variance, autocorrelation) don’t change over time.
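To make that concrete, here’s a minimal sketch of a constant threshold detector in Python. This is my illustration rather than our production code, and the three-sigma threshold is just an assumed default:

```python
import numpy as np

def fit_constant_thresholds(train_series, n_sigmas=3.0):
    """Compute fixed lower/upper thresholds from a stationary training series."""
    mean = np.mean(train_series)
    std = np.std(train_series)
    return mean - n_sigmas * std, mean + n_sigmas * std

def classify(value, lower, upper):
    """Classify a single observation as normal or anomalous."""
    return "anomaly" if (value < lower or value > upper) else "normal"

# Example: train on a stationary series, then score new observations.
rng = np.random.default_rng(42)
train = rng.normal(loc=100.0, scale=5.0, size=1_000)
lower, upper = fit_constant_thresholds(train)
print(classify(103.0, lower, upper))  # normal
print(classify(140.0, lower, upper))  # anomaly
```

The whole approach rests on stationarity: because the statistical properties don’t drift, thresholds fit once on historical data stay valid, which is also why this detector breaks down on trending or seasonal series.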
When building models for forecasting time series, we generally want “clean” datasets. Usually this means we don’t want missing data, outliers, or other anomalies. But real-world datasets have missing data and anomalies. In this post we’ll look at using Hampel filters to deal with these problems, using R.
For the Jupyter notebook, see https://github.com/williewheeler/time-series-demos/blob/master/hampel/removing-outliers-from-time-series.ipynb.
A Hampel filter is a filter we can apply to a time series to identify outliers and replace them with more representative values. The filter is basically a configurable-width window that we slide across the time series. For each window, the…
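The post itself works in R (see the notebook above), but the core mechanic is easy to sketch in Python. This is a minimal version using the common defaults of a three-point half-width and a three-sigma threshold:

```python
import numpy as np

def hampel_filter(series, half_width=3, n_sigmas=3.0):
    """Replace outliers in a series with the local window median.

    For each point, take a window of half_width neighbors on each side,
    compute the window median and the median absolute deviation (MAD),
    and replace the point if it lies more than n_sigmas scaled MADs
    from the median.
    """
    x = np.asarray(series, dtype=float)
    filtered = x.copy()
    scale = 1.4826  # makes MAD a consistent estimator of sigma for Gaussian data
    for i in range(half_width, len(x) - half_width):
        window = x[i - half_width : i + half_width + 1]
        median = np.median(window)
        mad = scale * np.median(np.abs(window - median))
        if np.abs(x[i] - median) > n_sigmas * mad:
            filtered[i] = median
    return filtered

# The spike (90) gets replaced by the local median; everything else survives.
print(hampel_filter([10, 11, 10, 12, 11, 90, 10, 11, 12, 10, 11]))
```

Because medians and MADs are themselves robust statistics, the filter tolerates the very anomalies it’s trying to remove.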
In my post Reducible vs irreducible error, I briefly explained how you can decompose prediction errors into reducible and irreducible components. This time we’ll push the decomposition a little further, breaking the reducible error into error due to bias and error due to variance:

Expected prediction error = Bias² + Variance + Irreducible error
Prediction errors are closely related to how sensitive a model-building method is to the details of the training set: variance captures how much the fitted model changes from one training set to the next, while bias captures the systematic error that persists no matter which training set we use.
Suppose that we want to predict a value Y based upon a set X = (X1, X2, …, Xp) of variables. For the predictions to have any chance of being good, X needs to contain the core set of variables that drive the behavior of Y. But there will almost always be lesser variables, not included in X, that nonetheless exert some minor influence on Y. We capture the situation as follows:

Y = f(X) + ɛ
Here, f is the function describing the relationship between X and Y, and ɛ is an error term that accounts for all the unmeasured influences on Y…
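To see the bias/variance trade-off numerically, here’s a small simulation of my own (not from the original post). It assumes a sinusoidal f, repeatedly draws training sets from Y = f(X) + ɛ, fits polynomials of low and high degree, and estimates the bias² and variance of the prediction at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed "true" relationship for the demo

def bias_variance_at(degree, x0=0.25, n_trials=500, n_train=30, noise_sd=0.3):
    """Estimate bias^2 and variance of a degree-d polynomial fit at x0."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, noise_sd, n_train)  # Y = f(X) + eps
        coefs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance

for degree in (1, 9):
    bias_sq, variance = bias_variance_at(degree)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The inflexible linear fit shows high bias and low variance, while the degree-9 fit flips the trade-off: near-zero bias, but predictions that swing with every new training set.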