
Sometimes we need to estimate a probability for which we lack a simple analytic approach. Maybe we have some component distributions, but we're combining them in a complicated way. Or maybe we just have a dataset and don't know the underlying component distributions.

Probability simulations are a useful tool in situations like these. In this post I'll show how to do this for a simple example in which we assemble component distributions in a complicated way.
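Before getting to the exercise, here is the basic mechanic as a toy illustration (my own example, not the book's): estimate the probability that two fair dice sum to at least 10. This case is deliberately simple so we can check the simulation against the exact answer, 6/36 ≈ 0.167.

```python
import random

random.seed(42)  # make the simulation reproducible

n = 100_000
# Count the simulated rolls where the two dice sum to 10 or more
hits = sum(1 for _ in range(n)
           if random.randint(1, 6) + random.randint(1, 6) >= 10)
print(hits / n)  # should land close to 6/36 ≈ 0.1667
```

The same draw-many-samples-and-count pattern carries over directly to problems where no analytic shortcut exists.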

A sample problem

Our problem comes from exercise 5.2 in Regression and Other Stories by Gelman et al. …


After fitting a linear regression model, we want to understand how good its predictions actually are. In this post I’ll explain and demonstrate two common ways of evaluating such models, each corresponding to a different sense of what it means for a model to be “good”: root mean squared error (RMSE) and R².

The metrics

First we’ll look at both metrics: how to calculate them, and what they tell us.

Root mean squared error (RMSE)

The most common metric for evaluating linear regression model performance is root mean squared error, or RMSE. The basic idea is to measure how erroneous the model’s predictions are when compared…
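As a minimal sketch (with made-up numbers, not data from the post), RMSE is the square root of the mean of the squared prediction errors:

```python
import math

def rmse(actuals, predictions):
    # Square each error, average the squares, then take the square root
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actuals, predictions)) / len(actuals))

print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.5]))  # ≈ 0.645
```

Because the errors are squared before averaging, RMSE penalizes large misses disproportionately, and it comes out in the same units as the target variable.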



This afternoon I attended a presentation on virtual machine demand prediction. The idea is to use these predictions to drive near-real-time capacity allocations.

Something that caught my attention was the model evaluation procedure they were using. Traditionally people use metrics like RMSE or R². In this case, though, they were using under- and over-prediction rates, which I hadn’t encountered before.

The reason behind it is that there’s an asymmetry between under- and over-predicting demand:

  • Under-predicting demand leads to load/latency problems.
  • Over-predicting demand leads to wasting resources/money, and also to drawing resources away from where they’re actually…
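As a sketch of the idea (a hypothetical illustration, not necessarily the presenters' exact definitions), under- and over-prediction rates can be computed as simple proportions over the evaluation set:

```python
def prediction_rates(actuals, predictions):
    """Fraction of predictions falling below (under) or above (over) actual demand."""
    n = len(actuals)
    under = sum(1 for a, p in zip(actuals, predictions) if p < a) / n
    over = sum(1 for a, p in zip(actuals, predictions) if p > a) / n
    return under, over

# Example: three under-predictions and one over-prediction out of five
actuals = [10, 12, 9, 14, 11]
predictions = [8, 13, 9, 12, 10]
print(prediction_rates(actuals, predictions))  # (0.6, 0.2)
```

Reporting the two rates separately, rather than a single symmetric error like RMSE, makes the asymmetry above directly visible in the evaluation.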

My son wiring up a hygrometer

Last week was hackathon week at work. I decided to do an Azure Internet of Things (IoT) project to learn more about Azure’s IoT offerings.

Project overview

My son and I built a hygrometer, and then I connected it to Azure IoT Central so we could see its telemetry in a real-time dashboard. A hygrometer is a device that measures both humidity (actually, relative humidity, as I discovered during the course of this project) and temperature. It’s good for monitoring attics, crawlspaces and other hard-to-reach locations. …



App instrumentation generally involves significant manual effort, with application code invoking logging/metrics/tracing SDKs when something interesting happens. This is useful, but not without its challenges. For one, it’s a lot of work. It also leads to a lot of code cruft. The most consequential challenge, however, is that it mostly results in an inconsistent treatment of observability data (e.g., free-form log messages, metrics data embedded in log messages, unconventional metric and dimension names). There’s little leverage, and it’s hard to do anything systematic with the data.

While manual instrumentation isn’t going anywhere, we can automate more than we typically do…



1. Start an InfluxDB container

$ docker run -d --name influxdb -p 8086:8086 influxdb

2. Start the InfluxDB shell

$ docker exec -it influxdb influx
Connected to http://localhost:8086 version 1.7.8
InfluxDB shell version: 1.7.8
>

3. Create a database

> create database mydb
> use mydb

4. Insert some time series data

> insert bookings value=102
> insert bookings value=108
> insert bookings value=95

5. Query the data

> select * from bookings
name: bookings
time value
---- -----
1571211243950013300 102
1571211245822776400 108
1571211247850693200 95
>


This post presents time series from a technical perspective and lays out two key challenges for time series analysis. It is based on the dense theoretical treatment in Mathematical Foundations of Time Series Analysis: A Concise Introduction, by Jan Beran, but here the treatment is lighter, since I aim to make the material more accessible to practitioners like myself.

First we’ll define time series and related concepts. Then we’ll use this foundation to understand the two key challenges for time series analysis.

Understanding time series

When we talk about time series, sometimes we’re talking about time series data (observations) and other times we’re…



On teams, decision-making by dictator and by committee both suck. Dictators generate mediocre decisions quickly, and committees generate mediocre decisions slowly if at all. Over time both approaches kill team morale.

I had a manager, Joe Natoli, from whom I learned an effective, balanced approach. The idea is that every decision has an owner. When ownership of a decision is unclear, start by identifying the owner. The rest of us support the owner by offering perspectives to inform the decision.

The team lead can override a decision, but this should happen only in extreme circumstances. I’ve never had to do it.


My team at work is building a time series anomaly detection system that automatically creates anomaly detectors to monitor application health. We started with the humble constant threshold detector, which uses a constant threshold to perform the normal-vs-anomaly classification task. We want to create constant threshold detectors for stationary time series, which are, roughly speaking, series whose statistical properties (e.g., mean, variance, autocorrelation) don’t change over time.

We can use the Augmented Dickey-Fuller (ADF) test to identify stationary series. In this post I’ll show how to do this in R using the tseries package. …


When building models for forecasting time series, we generally want “clean” datasets: no missing data, and no outliers or other anomalies. But real-world datasets have missing data and anomalies. In this post we’ll look at using Hampel filters to deal with these problems, using R.

For the Jupyter notebook, see https://github.com/williewheeler/time-series-demos/blob/master/hampel/removing-outliers-from-time-series.ipynb.

What is a Hampel filter?

A Hampel filter is a filter we can apply to our time series to identify outliers and replace them with more representative values. The filter is basically a configurable-width window that we slide across the time series. For each window, the…
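Here is a minimal sketch of the idea in Python (the post uses R; the window half-width and threshold below are illustrative choices, not the post's settings). Each point is compared against the median of its window, using the median absolute deviation (MAD) as a robust spread estimate:

```python
import statistics

def hampel_filter(series, half_width=3, n_sigmas=3.0):
    """Replace points more than n_sigmas robust std devs from their window median."""
    k = 1.4826  # scales MAD to estimate the standard deviation for Gaussian data
    result = list(series)
    for i in range(len(series)):
        lo = max(0, i - half_width)
        hi = min(len(series), i + half_width + 1)
        window = series[lo:hi]
        med = statistics.median(window)
        mad = statistics.median([abs(x - med) for x in window])
        if mad > 0 and abs(series[i] - med) > n_sigmas * k * mad:
            result[i] = med  # replace the outlier with the window median
    return result

print(hampel_filter([1, 2, 1, 3, 2, 100, 2, 3, 1, 2]))
# the spike at 100 is replaced by its window median, 2
```

Because both the center (median) and the spread (MAD) are robust statistics, a single large spike barely moves them, which is why the filter can detect the spike at all.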

Willie Wheeler

Software developer with an interest in data science
