Sometimes we need to estimate a probability for which we lack a simple analytic approach. Maybe we have some component distributions, but we’re combining them in a complicated way. Or maybe we just have a dataset and don’t know the underlying component distributions.
Probability simulations are a useful tool for producing estimates in situations like these. In this post I’ll show how to do this for a simple example in which we assemble component distributions in a complicated way.
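To make the idea concrete before getting to the exercise, here’s a minimal simulation sketch in Python. The combination is made up for illustration (it isn’t the book’s problem): X uniform on [0, 1] plus an independent exponential with rate 1, and we estimate P(X + Y > 1) by drawing many samples and counting hits.

```python
import random

def estimate_probability(trials=100_000, seed=42):
    """Monte Carlo estimate of P(X + Y > 1), where X ~ Uniform(0, 1)
    and Y ~ Exponential(rate=1). An illustrative combination of
    component distributions; we just sample and count."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if rng.random() + rng.expovariate(1.0) > 1.0
    )
    return hits / trials

print(estimate_probability())
```

With enough trials the estimate settles near the true value; the same sample-and-count pattern works for combinations that have no tractable closed form.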
Our problem comes from exercise 5.2 in Regression and Other Stories by Gelman et al…
After fitting a linear regression model, we want to understand how good its predictions actually are. In this post I’ll explain and demonstrate two common ways of evaluating such models, each corresponding to a different sense of what it means for a model to be “good”: root mean squared error (RMSE) and R².
First we’ll look at both metrics: how to calculate them, and what they tell us.
The most common metric for evaluating linear regression model performance is called root mean squared error, or RMSE. The basic idea is to measure how erroneous the model’s predictions are when compared…
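As a concrete sketch (plain Python, from scratch, with hypothetical observed/predicted lists), here’s one way to compute both metrics:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared
    difference between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """R-squared: 1 minus the ratio of the residual sum of squares
    to the total sum of squares around the observed mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

RMSE is in the units of the response variable, while R² compares the model’s residuals against the baseline of always predicting the mean: a perfect model scores 1, and predicting the mean scores 0.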
This afternoon I attended a presentation on virtual machine demand prediction. The idea is to use these predictions to drive near-real-time capacity allocations.
Something that caught my attention was the model evaluation procedure they were using. Traditionally people use something like RMSE or R². In this case, though, they were using under- and over-prediction rates, which I hadn’t heard of before.
The reason behind it is that there’s an asymmetry between under- and over-predicting demand:
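For illustration, here’s a sketch of computing such rates in Python. The definitions are my assumption of common usage (the talk’s exact formulas weren’t given): the under-prediction rate is the fraction of points where the prediction falls below actual demand, and the over-prediction rate is the fraction where it falls above; exact matches count as neither.

```python
def prediction_rates(actual, predicted):
    """Fraction of predictions below (under) and above (over) the
    actual values. Assumed definitions; ties count as neither."""
    n = len(actual)
    under = sum(1 for a, p in zip(actual, predicted) if p < a) / n
    over = sum(1 for a, p in zip(actual, predicted) if p > a) / n
    return under, over
```

Reporting the two rates separately, rather than a single symmetric error like RMSE, lets you see whether the model errs on the costly side (under-predicting demand) more often than the cheap side.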
Last week was hackathon week at work. I decided to do an Azure Internet of Things (IoT) project to learn more about Azure’s IoT offerings.
My son and I built a hygrometer, and then I connected it to Azure IoT Central so we could see its telemetry in a real-time dashboard. A hygrometer is a device that measures both humidity (actually, relative humidity, as I discovered during the course of this project) and temperature. It’s good for monitoring attics, crawlspaces and other hard-to-reach locations. …
App instrumentation generally involves significant manual effort, with application code invoking logging/metrics/tracing SDKs when something interesting happens. This is useful, but not without its challenges. For one, it’s a lot of work. It also leads to a lot of code cruft. The most consequential challenge, however, is that it mostly results in an inconsistent treatment of observability data (e.g., free-form log messages, metrics data embedded in log messages, unconventional metric and dimension names). There’s little leverage, and it’s hard to do anything systematic with the data.
$ docker run -d --name influxdb -p 8086:8086 influxdb
$ docker exec -it influxdb influx
Connected to http://localhost:8086 version 1.7.8
InfluxDB shell version: 1.7.8
> create database mydb
> use mydb
> insert bookings value=102
> insert bookings value=108
> insert bookings value=95
> select * from bookings
This post presents time series from a technical perspective and lays out two key challenges for time series analysis. It is based on the dense theoretical treatment in Mathematical Foundations of Time Series Analysis: A Concise Introduction, by Jan Beran. But here the treatment is lighter, since I aim to make the material more accessible to practitioners like myself.
First we’ll define time series and related concepts. Then we’ll use this foundation to understand the two key challenges for time series analysis.
When we talk about time series, sometimes we’re talking about time series data (observations) and other times we’re…
On teams, decision-making by dictator and by committee both suck. Dictators generate mediocre decisions quickly, and committees generate mediocre decisions slowly if at all. Over time both approaches kill team morale.
I had a manager, Joe Natoli, from whom I learned an effective balanced approach. The idea is that every decision has an owner. When a decision is unclear, start by identifying its owner. The rest of us support the owner by offering perspectives that inform the decision.
The team lead can override a decision, but this should happen only in extreme circumstances. I’ve never had to do it.
My team at work is building a time series anomaly detection system that automatically creates anomaly detectors to monitor application health. We started with the humble constant threshold detector, which uses a constant threshold to perform the normal-vs-anomaly classification task. We want to create constant threshold detectors for stationary time series, which are, roughly speaking, series whose statistical properties (e.g., mean, variance, autocorrelation) don’t change over time.
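In sketch form (Python; the ±3-sigma rule here is an illustrative choice, not necessarily what our detectors use), fitting a constant threshold from a stationary training window and then classifying new points might look like:

```python
import statistics

def fit_constant_thresholds(train, k=3.0):
    """Fit lower/upper thresholds as mean +/- k standard deviations of
    a (presumed stationary) training window. The k=3 default is an
    illustrative convention, not a recommendation."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return mu - k * sigma, mu + k * sigma

def classify(series, lower, upper):
    """Label each observation normal or anomaly against fixed thresholds."""
    return ["anomaly" if x < lower or x > upper else "normal" for x in series]
```

The reason stationarity matters is visible in the code: the thresholds are fit once from historical data, so they stay valid only if the series’ mean and variance don’t drift afterward.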
When building models for forecasting time series, we generally want “clean” datasets. Usually this means we don’t want missing data and we don’t want outliers and other anomalies. But real-world datasets have missing data and anomalies. In this post we’ll look at using Hampel filters to deal with these problems, using R.
For the Jupyter notebook, see https://github.com/williewheeler/time-series-demos/blob/master/hampel/removing-outliers-from-time-series.ipynb.
A Hampel filter is a filter we can apply to our time series to identify outliers and replace them with more representative values. It is basically a configurable-width window that slides across the time series. For each window, the…
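The post itself uses R, but the algorithm is easy to sketch in Python (the parameter values are illustrative): for each point, compute the median and the scaled median absolute deviation (MAD) of its surrounding window; if the point sits more than a few scaled MADs from the median, replace it with the median.

```python
import statistics

def hampel_filter(series, half_width=3, n_sigmas=3.0):
    """Sliding-window Hampel filter. For each point, look at a window of
    half_width points on each side; flag the point as an outlier if it is
    more than n_sigmas scaled MADs from the window median, and replace it
    with that median. 1.4826 scales the MAD to estimate the standard
    deviation under normality."""
    result = list(series)
    k = 1.4826
    for i in range(len(series)):
        lo = max(0, i - half_width)
        hi = min(len(series), i + half_width + 1)
        window = series[lo:hi]
        med = statistics.median(window)
        mad = k * statistics.median([abs(x - med) for x in window])
        if mad > 0 and abs(series[i] - med) > n_sigmas * mad:
            result[i] = med
    return result
```

Because the median and MAD are robust statistics, the outlier itself barely perturbs the window estimates, which is what lets the filter both detect the spike and supply a sensible replacement value.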
Software developer with an interest in data science