Clip it, clip it good…
Real data is messy. Whether you're exploring, transforming, visualizing, or modeling, outliers often cause headaches. One of the most common ways to deal with them is to clip values (a.k.a. clamp or trim) to a specific range. For example, clipping prices to the range $0 to $1000 means that any negative price gets replaced by $0 and any overly large price gets set to $1000. (Alternative approaches include dropping the record entirely, applying a transformation, or binning.) Traditional clipping works, but it tends to produce spikes at the ends of the clipping range, since every value outside the range lands exactly on an endpoint. As part of some airfare visualization work which I'll dive into in a future article, I explored the idea of "soft clipping" to avoid the visual artifacts these spikes create and wanted to share some of the insights I found along the way.
For motivation, we’ll look at a small sample of Hopper’s airfare pricing data used to create the image above. The full visualization renders a point for every one of the ~50 billion price quotes that Hopper collects during a single day, whereas this image is based on just 1% of our price quotes for a single route: about a million prices for New York-London round-trips.
Our first challenge is how to represent prices. A histogram of the raw price data (below left) shows a heavily skewed distribution, ranging from $301 to $58,121. Although a log transformation helps (below right), it doesn't solve the problem.
In the visualization (top), we actually plot a point for each price quote at a coordinate determined by features of the corresponding trip, so that multiple points often overlap in a single pixel. We think of each point as a light source whose color is based on price, with each pixel getting a color corresponding to a weighted average of the grouped prices. Grouping produces a histogram of weighted log prices (below left) which is still highly skewed: in fact, skew worsens because outlier prices are less likely to be grouped than "common" ones. One option is to apply traditional clipping (below center) for the range [6, 8], but this creates sharp spikes at the endpoints. By applying soft clipping instead (below right) we significantly reduce those artifacts.
When we map this price data to the visually uniform “plasma” color scale, we see striking visual differences between the data that is unclipped (below left), hard clipped (below center) and soft clipped (below right):
So what is this soft clipping magic? Mathematically, we can think of traditional “hard” clipping between a and b as a function f(x) that replaces x values outside the range [a, b] with the corresponding endpoint (below left, orange). To make the clipping “soft” we just smooth the corners of that function (below left, green). This “squeezes” all values outside the clipping range inside, as well as slightly squeezing interior values near the ends. What’s the point of that you ask? Like hard clipping, we limit the range of data to avoid problems with outliers, but we also preserve our original data ordering (handy for modeling), and reduce those characteristic spikes at the ends of the interval (nice for visualization). Using a smooth (differentiable) function also makes it easy to optimize the clipping parameters themselves.
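To see the endpoint pile-up concretely, here is a small sketch using NumPy's `np.clip` on a synthetic lognormal sample (a stand-in for skewed price data, not the actual Hopper sample):

```python
import numpy as np

# Synthetic skewed "price" data -- a lognormal stand-in, not real airfares.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=7.0, sigma=0.5, size=100_000)

a, b = 500.0, 2000.0
hard = np.clip(prices, a, b)

# Every value below a or above b now sits exactly on an endpoint,
# which is what creates the histogram spikes at a and b.
n_low = int(np.sum(hard == a))
n_high = int(np.sum(hard == b))
```

With this sample, thousands of distinct out-of-range prices collapse onto just the two values `a` and `b`.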
A soft clipping function on [a, b] should be a smooth sigmoid curve with f(x) ≃ a for x ≪ a, f(x) ≃ b for x ≫ b, and a near-unit slope within [a, b]. The canonical example is f(x) = tanh x, which clips to [-1, 1] and is a simple transformation of the logistic function commonly used as a neural network activation function. We can modify tanh x to clip to [a, b] by rescaling x to the appropriate range and y to maintain the unit slope, yielding
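The rescaling above can be sketched in a few lines (the midpoint and half-width constants follow from matching the clipping range and the unit slope at the center; function name is mine):

```python
import math

def tanh_clip(x, a, b):
    # Shift and scale tanh so it clips to [a, b]: center it on the
    # midpoint of [a, b] and scale by the half-width, which keeps
    # the slope at the midpoint equal to 1.
    m = (a + b) / 2.0   # midpoint
    h = (b - a) / 2.0   # half-width
    return m + h * math.tanh((x - m) / h)
```

For the [6, 8] range used above, `tanh_clip(7, 6, 8)` returns 7 exactly, while values far outside the range approach 6 or 8.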
This is great, but it'd also be nice to control the "sharpness" of the corners. This doesn't seem easy with tanh x — suggestions welcome! — but we can use a related function also inspired by artificial neural networks. Old-school neural nets sometimes used an activation function called a rectifier (above center, orange), which is just one-sided clipping on [0, ∞); nowadays they use a smooth approximation called the "soft plus" function, log (1 + eˣ), which is easily (and infinitely) differentiable for back-propagation learning (above center, green). With linear rescaling, we get a function f(x) = x - c⁻¹ log (1 + exp(c(x-b))) that soft clips below b, with larger c sharpening the corners. Similarly, we can transform the "soft minus" function to clip above a. Combining these yields a flexible soft clipping function:
We just set [a, b] to the desired clipping range and control corner sharpness with c (above right). For a one-sided clip, we just drop the corresponding term (indeed, dropping both gets us back to the identity function f(x) = x). N.B.: please don't code the function this way — it's not numerically stable — check out this gist instead.
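For reference, one numerically stable way to write it is via `np.logaddexp`, which computes log(1 + eᶻ) without overflow (a sketch along the same lines, not a copy of the linked gist):

```python
import numpy as np

def soft_clip(x, a=None, b=None, c=1.0):
    """Soft clip x to [a, b] with corner sharpness c.

    np.logaddexp(0, z) evaluates log(1 + exp(z)) stably, which is
    exactly the soft plus function. Pass a=None or b=None for a
    one-sided clip; with both None this is the identity.
    """
    x = np.asarray(x, dtype=float)
    y = x
    if a is not None:
        y = y + np.logaddexp(0.0, -c * (x - a)) / c  # soft floor at a
    if b is not None:
        y = y - np.logaddexp(0.0, c * (x - b)) / c   # soft ceiling at b
    return y
```

With `a=6, b=8` this reproduces the behavior described above; cranking up `c` approaches hard clipping.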
The other component of our visualization, along with choosing colors, is to set pixel intensity levels. We think of each point as a standard light source with an intensity based on quote recency, so that pixels with more and recent price quotes are brighter, and pixels with fewer and older quotes tend to fade away. Plotting a histogram of the raw intensity levels for non-empty pixels (below left) again shows huge skew, with some pixels up to 80 times oversaturated!
Simply rescaling the raw distribution to [0, 1] results in an almost black image (below left), while traditional clipping to [0, 1] generates a huge spike of indistinguishable white pixels (above center), creating an over-saturated image (below center) that loses relative intensity information for bright pixels. Soft clipping (above right) squeezes in the outliers and dampens some of the nearly white pixels, producing a slightly non-linear intensity scale but preserving much more visual information (below right). Combining the soft clipped intensity data with the soft clipped color data creates the visually pleasing image shown at the start of this article.
So that’s it: soft clipping, it’s fun, give it a try! Let me know what applications you come up with.
Bonus content
If you’ve made it this far and are still reading, I’m impressed. Here are some related ideas to explore.
More clippers? Are there other interesting families of soft clipping functions? The numerically stable way to implement soft plus (below left, green) is to add a “corner smoothing” function to the original rectifier (below left, orange), i.e. rewriting log (1 + eˣ) as max(x, 0) + log (1 + exp(-|x|)), where the second term smooths the corner (below left, red). Can we create other corner smoother functions s(x)? We might require symmetry (which tanh x, being an odd function, lacks!) so that s(x) = s(-x), and we also need s(x) → 0 as |x| → ∞. With symmetry we must also have s'(0⁻) = -s'(0⁺) = ½ to blend the rectifier’s jump in slope from 0 to 1 at x = 0.
One simple idea that satisfies these conditions is an exponential decay: s(x) = exp(-c|x|)/(2c). This gives a slightly “softer” corner than the soft plus/minus clipping (below center) but is visually similar (below right). Its odd derivatives beyond the first are also discontinuous at 0, but maybe that’s not a big deal? Are there other solutions that are infinitely differentiable?
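A sketch of this exponential-decay smoother, assembled the same way as the stable soft plus (rectifier plus corner term); the function names here are my own:

```python
import math

def exp_smoother(x, c=1.0):
    # Corner smoother s(x) = exp(-c|x|) / (2c): symmetric, decays to 0,
    # and has s'(0-) = -s'(0+) = 1/2 for any c, blending the rectifier's
    # jump in slope from 0 to 1 at x = 0.
    return math.exp(-c * abs(x)) / (2.0 * c)

def soft_plus_exp(x, c=1.0):
    # Rectifier max(x, 0) plus the exponential corner smoother: a
    # "softer"-cornered alternative to log(1 + exp(x)).
    return max(x, 0.0) + exp_smoother(x, c)
```

A quick finite-difference check confirms the first derivative really is continuous at 0, even though higher odd derivatives are not.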
Fun fact: the derivative of the soft plus function that we used to build our flexible soft clipping function is the logistic function, which itself is just a simple transformation of our canonical tanh x soft clipping function🤯:
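That identity is easy to verify numerically:

```python
import math

def logistic(x):
    # The logistic function, sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-x))

# sigma(x) = (1 + tanh(x / 2)) / 2: the logistic function is just
# tanh shifted and scaled to map onto [0, 1] at half scale.
for x in (-3.0, -0.5, 0.0, 2.0, 10.0):
    assert abs(logistic(x) - (1.0 + math.tanh(x / 2.0)) / 2.0) < 1e-12
```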
Soft square pulses: any function that clips on [0, 1] gives an approximate cumulative distribution function for the unit uniform distribution, so its derivative approximates the probability density for a simple square pulse (which becomes exact for the hard clipping function):
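Working this out for the flexible soft clip above, its derivative is a difference of logistic functions, which traces a smooth square pulse of total mass b - a; here is a sketch with a plain Riemann-sum check:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_pulse(x, a=0.0, b=1.0, c=10.0):
    # Derivative of the flexible soft clip on [a, b]: a smooth
    # square pulse that hardens toward a true square as c grows.
    return logistic(c * (x - a)) - logistic(c * (x - b))

# Total mass is b - a, so on [0, 1] this is (approximately) a
# probability density; check numerically with a Riemann sum.
xs, dx = np.linspace(-5.0, 6.0, 220_001, retstep=True)
mass = float(np.sum(soft_pulse(xs)) * dx)
```

The pulse is nonnegative everywhere (since a < b and the logistic function is increasing) and close to 1 across the interior of [0, 1].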
Want to help us solve complex problems? Hopper is hiring! Check out our career listings here: www.hopper.com/careers