TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Wiggly Distributions and Nonparametrics

6 min read · Jan 4, 2019


Larry Wasserman’s book, All of Nonparametric Statistics, opens by describing the kinds of distributions people tend to focus on when studying nonparametric estimators:

[Image: excerpt from the book’s opening, stating an integral smoothness condition on the densities under study]

Julia “@b0rk” Evans has a run-down of this book’s intro, where she asks:

What’s an example of a probability density function that doesn’t satisfy [the integral]? (Probably something with an infinite number of tiny wiggles, and I don’t think any distribution I’m interested in in practice would have an infinite number of tiny wiggles?)

Why does the density function being “too wiggly” cause problems for nonparametric inference?

Whenever I see a constraint like that, trying to bound the smoothness of a function, I always immediately jump to the world’s least-smooth condition: a discontinuity. Any kind of discontinuity in a random variable’s density or distribution — even a single one, even if it’s really tiny — will break Wasserman’s constraint.

Let’s talk about where we might see a discontinuous distribution! And then, muuuuuch more speculatively, let’s talk about why it might be bad for a non-parametric estimator.

Scenario One: Timing database queries

Suppose you run an app that lets people read blog posts. Whenever a reader loads any particular post, your app makes two queries, in parallel, to two separate databases: the first returns the blog author’s profile picture, and the second the actual text of the blog post.

You’d like to know which query typically returns first, and by how many milliseconds. Your hunch is that there’s an opportunity to use the idle fetch thread for some other prep work while the app is still waiting on its sibling query to finish.

To see what kind of downtime your faster fetch thread is dealing with, you modify the app to start tracking some stats, over which you’re hoping to perform some non-parametric estimates:

  • Track response-time of the profile pic, T_pic
  • Track response-time of the blog text, T_text
  • Calculate the difference, T_delta := T_pic - T_text, and log the result (a rough sketch of this instrumentation follows below)
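
Here’s roughly what that instrumentation might look like. This is only a sketch: fetch_profile_pic and fetch_blog_text (and their sleep times) are made-up stand-ins for whatever your app really calls, included just so the example runs on its own.

    import logging
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical stand-ins for the app's real database calls, each sleeping
    # for a made-up response time so the sketch is runnable by itself.
    def fetch_profile_pic(post_id):
        time.sleep(random.uniform(0.05, 0.25))

    def fetch_blog_text(post_id):
        time.sleep(random.uniform(0.05, 0.15))

    def timed_ms(fetch, post_id):
        """Run one fetch and return its wall-clock duration in milliseconds."""
        start = time.monotonic()
        fetch(post_id)
        return (time.monotonic() - start) * 1000.0

    def log_fetch_delta(post_id):
        """Time both fetches in parallel and log T_delta = T_pic - T_text."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            future_pic = pool.submit(timed_ms, fetch_profile_pic, post_id)
            future_text = pool.submit(timed_ms, fetch_blog_text, post_id)
            t_delta = future_pic.result() - future_text.result()
        logging.info("post %s: T_delta = %.1f ms", post_id, t_delta)
        return t_delta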

Turns out it’s your lucky day: you don’t know it yet, but the distribution underlying T_delta is Gaussian with mean of 100 milliseconds and standard deviation of 60 milliseconds.

As Evans shows in her original post, the Gaussian distribution definitely abides by Wasserman’s integral condition:

[Image: Evans’s calculation showing that the Gaussian density satisfies Wasserman’s integral condition]

So as far as Wasserman’s intro would have it, any nonparametric estimates you wanna make over this data should mostly come out okay. Or at least, you’ll have a good sense of what the performance of that estimator should be: any error bounds and guarantees that Wasserman’s book offers will still apply for your data, since its underlying distribution is so smooth and well-behaved.

To set us up for a comparison later, here’s what that nice, smooth Gaussian looks like, in terms of distribution and density:
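
If you want to draw those two curves yourself, here’s a minimal scipy/matplotlib sketch, using the mean and standard deviation assumed above (the plotting range and figure size are arbitrary):

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.stats import norm

    t = np.linspace(-150, 350, 1000)    # milliseconds
    t_delta = norm(loc=100, scale=60)   # T_delta ~ Normal(100 ms, 60 ms)

    fig, (ax_cdf, ax_pdf) = plt.subplots(1, 2, figsize=(10, 4))
    ax_cdf.plot(t, t_delta.cdf(t))
    ax_cdf.set_title("Distribution: P(T_delta <= t)")
    ax_cdf.set_xlabel("t (milliseconds)")
    ax_pdf.plot(t, t_delta.pdf(t))
    ax_pdf.set_title("Density of T_delta")
    ax_pdf.set_xlabel("t (milliseconds)")
    plt.show()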

Scenario Two: Timing database queries but sometimes something goes squirrelly

Let’s set everything back up as in Scenario One — two queries, two response-times, log the difference. Except now imagine, every once in a while, some phantom bug strikes. For some pathological cases, the picture always takes exactly 234 milliseconds longer to load than the text.

(For instance: maybe a well-meaning developer has included a code path that’s only meant to arise during unit or integration testing. And for test purposes, it was nice to be able to control the relative database fetch times. Hence the memorably human magic number of “234.” Except somehow, what was only supposed to happen at test time has leaked out into production!)

In this case, the distribution P(T_delta ≤ t) is going to have a sharp discontinuity at precisely t = 234:

Note that what’s important is not the size of that discontinuity, but that it’s discontinuous at all. Even if only one in a million T_delta’s is pathologically 234 milliseconds, the corresponding density will still have a big, infinitely-tall impulsive spike.
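
To make that concrete, here’s a small simulation sketch. The one-percent bug rate below is made up purely for illustration; the point is that the empirical distribution function picks up a jump at exactly 234 ms regardless of how small that rate is.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    p_bug = 0.01   # assumed fraction of pathological requests (made up for illustration)

    t_delta = rng.normal(loc=100, scale=60, size=n)   # Scenario 1 behavior
    t_delta[rng.random(n) < p_bug] = 234.0            # the phantom bug: exactly 234 ms

    # The gap between these two numbers is (roughly) the point mass at 234 ms.
    below = np.mean(t_delta < 234.0)
    at_or_below = np.mean(t_delta <= 234.0)
    print(f"P(T_delta <  234) ~ {below:.4f}")
    print(f"P(T_delta <= 234) ~ {at_or_below:.4f}   (jump of about {at_or_below - below:.4f})")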

What’s this mean for non-parametrics?

This distribution strikes me as super unsmooth. I’m not a great measure theorist, but I don’t believe this counts as differentiable, let alone as a density that satisfies Wasserman’s constraint. In fact, in the very next section of the intro, Wasserman talks about handling the case where a density is absolutely continuous, or where the distribution is discrete, and it makes me wonder if there’s a plan for hybrid cases like our Scenario 2.

Here’s where I gotta cop to not having actually read this book. Maybe there’s an answer in store!

But that aside, I can take a stab at why a hybrid distribution like Scenario 2 is particularly tricky in terms of playing nice with nonparametrics.

Imagine we’re performing kernel density estimation. This is a lot like taking a histogram, but without the harsh boundary jumps you see between bins. The intuition is that if we made an observation of, say, T_delta = -30.5 ms, we should smear that around a little bit. Treat it as if it’s telling us, “not only is -30.5 ms a possible outcome, but that probably means nearby values are also possible.” We treat the appearance of -30.5 ms in our dataset as a reason to increase our belief that -30.8 ms, and -29.2 ms, and other similar values are likely to arise in the future.
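
A hand-rolled version of that smearing might look like the sketch below, where the 10 ms bandwidth and the 500-draw sample are arbitrary choices for illustration:

    import numpy as np

    def kde(query_points, observations, bandwidth):
        """Gaussian kernel density estimate: each observation contributes a small
        normal bump of width `bandwidth`, and the bumps are averaged."""
        diffs = query_points[:, None] - observations[None, :]
        bumps = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
        return bumps.mean(axis=1)

    rng = np.random.default_rng(1)
    sample = rng.normal(100, 60, size=500)       # Scenario 1 draws of T_delta
    nearby = np.array([-30.8, -30.5, -29.2])     # values close to an observed -30.5 ms
    print(kde(nearby, sample, bandwidth=10.0))   # all three get similar density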

For a nice, smooth distribution, I can see how that would be the case. I bet that for nice, smooth distributions, you can arrive at some nice error bounds for how incorrect your kernel density estimate is after viewing N draws from the distribution. The smoothness really does mean that the probability of observing x is just about the same as the probability of seeing x +/- ε.

Scenario 2 doesn’t work like that, not in the same way. Because there’s that one magic number — 234 milliseconds — where any observation of that value is just of a completely different character than the rest of the T_delta space. The density just changes too rapidly around 234 ms for those samples to power a healthy kernel density estimate. And that’s true no matter the size of the distribution’s discontinuity gap.
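
One rough way to see that, sketched with scipy’s gaussian_kde and the same kind of made-up contamination rate as before: the density the estimator reports right at 234 ms swings with the bandwidth you happened to pick, rather than settling on anything the data can pin down.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(2)
    n, p_bug = 5_000, 0.01                     # made-up sample size and bug rate
    sample = rng.normal(100, 60, size=n)
    sample[rng.random(n) < p_bug] = 234.0      # Scenario 2's point mass

    # bw_method scales the kernel width: shrink it and the estimate at the magic
    # number balloons, widen it and the spike all but disappears.
    for bw in (0.5, 0.1, 0.02):
        estimate = gaussian_kde(sample, bw_method=bw)
        print(f"bw_method={bw}: estimated density at 234 ms = {estimate([234.0])[0]:.5f}")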

To consider another nonparametric estimator, take bootstrap sampling in Scenario 2. The cases where your bootstrap subpopulation doesn’t contain any bogus 234 ms points look exactly like Scenario 1 — and maybe those cases are different enough that it’s harder to combine them with the rest of the resamples. How often does that 234-less condition arise? What about the exclusively-234 case? What do these crazy corner-case samples do to your estimation procedure?
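
Here’s a small bootstrap sketch along those lines, again with made-up numbers, that counts how often a resample contains no 234 ms points at all:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p_bug = 2_000, 0.001                  # made-up sample size and bug rate
    sample = rng.normal(100, 60, size=n)
    sample[rng.random(n) < p_bug] = 234.0

    n_boot = 10_000
    bug_free = 0
    for _ in range(n_boot):
        resample = rng.choice(sample, size=n, replace=True)
        if not np.any(resample == 234.0):
            bug_free += 1

    print(f"pathological points in the original sample: {np.sum(sample == 234.0)}")
    print(f"fraction of resamples with zero 234 ms points: {bug_free / n_boot:.4f}")

If the original sample happens to contain k pathological points, the chance that a resample of the same size misses all of them is (1 - k/N)^N, roughly e^(-k), so for small k a sizable fraction of your resamples will look exactly like Scenario 1.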

So your error bounds for Scenario 2 will now hinge not just on the number of draws N you’ve observed, but also on how common it is to draw those pathological 234 ms cases. And the error bounds in Wasserman’s book probably don’t leave room for that degree of freedom.

Although, if the book covers all of nonparametric statistics, there must be something about these kinds of hybrid densities! I should… read the book.

Written by Brian Gawalt
