Clip it, clip it good…

Patrick Surry · Published in Life at Hopper · Jan 6, 2020 · 7 min read

Real data is messy. Whether exploring, transforming, visualizing or modeling, outliers often cause headaches. One of the most common ways to deal with outliers is to clip values (aka clamp or trim) within a specific range. For example, clipping price values to the range $0 to $1000 means that any negative price gets replaced by $0 and overly large prices get set to $1000. (Alternative approaches include dropping the record completely, applying a transformation, or binning.) Traditional clipping is fine but tends to produce spikes at the ends of the clipping range since all values outside the range get set to the endpoint value. As part of some airfare visualization work which I’ll dive into in a future article, I explored the idea of “soft clipping” to avoid the visual artifacts these spikes create and wanted to share some of the insights I found along the way.

Visualizing a one-percent sample of a single day’s price quotes for New York to London round-trip flights: about a million quotes. Color represents price (yellow = high, blue = low) and position encodes various itinerary features through a sequence of nested coordinate transformations (price recency, departure advance and day of week, length of stay, departure and return time of day and flight duration). Both pixel color and intensity settings benefit from soft clipping as described below.

For motivation, we’ll look at a small sample of Hopper’s airfare pricing data used to create the image above. The full visualization renders a point for every one of the ~50 billion price quotes that Hopper collects during a single day, whereas this image is based on just 1% of our price quotes for a single route: about a million prices for New York-London round-trips.

Our first challenge is how to represent prices. Looking at a histogram of the raw price data (below left), we see that the distribution is heavily skewed, ranging from $301 to $58,121. Although a log transformation helps (below right), it doesn’t solve the problem.

Histogram of about a million New York-London round-trip price quotes (1% sample of what Hopper collected for that route during a single day), shown both as raw US$ price (left) and as log price (right).

In the visualization (top), we actually plot a point for each price quote at a coordinate determined by features of the corresponding trip, so that multiple points often overlap in a single pixel. We think of each point as a light source whose color is based on price, with each pixel getting a color corresponding to a weighted average of the grouped prices. Grouping produces a histogram of weighted log prices (below left) which is still highly skewed: in fact, the skew worsens because outlier prices are less likely to be grouped than “common” ones. One option is to apply traditional clipping (below center) over the range [6, 8], but this creates sharp spikes at the endpoints. By applying soft clipping instead (below right) we significantly reduce those artifacts.

Histogram of weighted log price after grouping by pixel (left). Traditional clipping (center) gives us more resolution in the central mass of the distribution but tends to leave spiky artifacts at the range endpoints. Soft clipping (right) reduces spikes by squeezing outliers while preserving ordering.

When we map this price data to the perceptually uniform “plasma” color scale, we see striking visual differences between the unclipped (below left), hard clipped (below center) and soft clipped (below right) data:

The same pricing data rendered to pixels. The unclipped data (left) only uses a fraction of the color range; the clipped data (center) results in visual artifacts corresponding to the endpoint spikes (too much yellow and blue), while soft clipping (right) reveals more subtlety in the structure.

So what is this soft clipping magic? Mathematically, we can think of traditional “hard” clipping between a and b as a function f(x) that replaces x values outside the range [a, b] with the corresponding endpoint (below left, orange). To make the clipping “soft” we just smooth the corners of that function (below left, green). This “squeezes” all values outside the clipping range inside, as well as slightly squeezing interior values near the ends. What’s the point of that, you ask? Like hard clipping, we limit the range of the data to avoid problems with outliers, but we also preserve the original data ordering (handy for modeling) and reduce those characteristic spikes at the ends of the interval (nice for visualization). Using a smooth (differentiable) function also makes it easy to optimize the clipping parameters themselves.

Soft clipping just smooths the corners of the traditional “hard” clipping function (left). Using the “soft plus” (and “soft minus”) functions, which are smooth approximations to the linear rectifier (center), we can build a soft clipping function where we can control corner “sharpness” (right).

A soft clipping function on [a, b] should be a smooth sigmoid curve with f(x) ≈ a for x ≪ a, f(x) ≈ b for x ≫ b, and a near unit slope within [a, b]. The canonical example is f(x) = tanh x, which clips to [-1, 1] and is a simple transformation of the logistic function commonly used as a neural network activation function. We can modify tanh to clip to [a, b] by rescaling x to the appropriate range and y to maintain the unit slope, yielding

f(x) = (a + b)/2 + ((b − a)/2) tanh((2x − a − b)/(b − a))

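As a minimal sketch of that rescaling (assuming numpy; `tanh_clip` is my own name, not from the article):

```python
import numpy as np

def tanh_clip(x, a, b):
    # Soft clip x to [a, b] with a rescaled tanh: unit slope at the
    # midpoint of the range, limits a and b at the extremes.
    m = (a + b) / 2   # midpoint of the clipping range
    s = (b - a) / 2   # half-width; dividing by s restores unit slope
    return m + s * np.tanh((np.asarray(x, dtype=float) - m) / s)
```

For example, tanh_clip(1e6, 6, 8) is essentially 8.0, while values near the middle of [6, 8] pass through almost unchanged.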
This is great, but it’d also be nice to control the “sharpness” of the corners. That doesn’t seem easy with tanh x (suggestions welcome!), but we can use a related function also borrowed from artificial neural networks. Old-school neural nets sometimes used an activation function called a rectifier (above center, orange), which is just one-sided clipping on [0, ∞), and a smooth approximation called the “soft plus” function, log (1 + eˣ​​), is easily (and infinitely) differentiable for back-propagation learning (above center, green). With linear rescaling, we get a function f(x) = x − c⁻¹ log (1 + exp(c(x − b))) that soft clips below b, with larger c sharpening the corners. Similarly, we can transform the “soft minus” function to clip above a. Combining these yields a flexible soft clipping function:

f(x) = x + c⁻¹ log (1 + exp(−c(x − a))) − c⁻¹ log (1 + exp(c(x − b)))
We just set [a, b] to the desired clipping range and control corner sharpness with c (above right). For a one-sided clip, we just drop the corresponding term (indeed, by dropping both we get back to the identity function f(x) = x). N.B. please don’t code the function this way (it’s not numerically stable); check out this gist instead.
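Since the article points to a gist rather than inline code, here is a rough numpy sketch of a numerically stable version, using the max/log1p identity discussed in the bonus section below; `soft_clip` and its signature are my own naming, not necessarily the gist’s:

```python
import numpy as np

def softplus(x):
    # stable: log(1 + e^x) = max(x, 0) + log(1 + e^-|x|), avoids overflow
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def soft_clip(x, a=None, b=None, c=10.0):
    # Smoothly clip x to [a, b]; larger c gives sharper corners.
    # Pass only a or only b for a one-sided clip; passing neither
    # leaves the data unchanged (the identity function).
    x = np.asarray(x, dtype=float)
    y = x
    if a is not None:
        y = y + softplus(-c * (x - a)) / c  # squeeze values below a up toward a
    if b is not None:
        y = y - softplus(c * (x - b)) / c   # squeeze values above b down toward b
    return y
```

Because every term is smooth and strictly monotone, the result preserves the original ordering of the data, as described above.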

The other component of our visualization, along with choosing colors, is to set pixel intensity levels. We think of each point as a standard light source with an intensity based on quote recency, so that pixels with more and recent price quotes are brighter, and pixels with fewer and older quotes tend to fade away. Plotting a histogram of the raw intensity levels for non-empty pixels (below left) again shows huge skew, with some pixels up to 80 times oversaturated!

Intensity data for our visualization after grouping to pixels. The x-axis measures “lightness”, where 0 is pure black and 1 is fully saturated (white); values above 1 indicate oversaturated pixels that collect many price points together. The distribution is highly skewed (left), and hard clipping to [0, 1] gives a spiky artifact of white pixels (center), which soft clipping ameliorates while preserving relative intensity (right).

Simply rescaling the raw distribution to [0, 1] results in an almost black image (below left), while traditional clipping to [0, 1] generates a huge spike of indistinguishable white pixels (above center), creating an over-saturated image (below center) that loses relative intensity information for bright pixels. Soft clipping (above right) squeezes the outliers and dampens some of the nearly white pixels, producing a slightly non-linear intensity scale but preserving much more visual information (below right). Combining the soft clipped intensity data with the soft clipped color data creates the visually pleasing image shown at the start of this article.

The intensity data from the prior figure rendered as greyscale. Rescaling without clipping loses almost all information (left), and hard clipping gives a large central cluster of oversaturated pixels (center), while soft clipping preserves more information by subtly squeezing the near-saturated and oversaturated intensities (right).

So that’s it: soft clipping, it’s fun, give it a try! Let me know what applications you come up with.

Bonus content

If you’ve made it this far and are still reading, I’m impressed. Here are some related ideas to explore.

More clippers? Are there other interesting families of soft clipping functions? The numerically stable way to implement soft plus computationally (below left, green) is to add a “corner smoothing” function to the original rectifier (below left, orange), i.e. rewriting log (1 + eˣ​​) as max(x, 0) + log (1 + exp(-|x|)), where the second term smooths the corner (below left, red). Can we create other corner smoother functions s(x)? We might require symmetry (which tanh x probably lacks!), so that s(x) = s(-x), and we also need s(x) → 0 as |x| → ∞. With symmetry we must also have s′(0⁻) = −s′(0⁺) = ½ to blend the rectifier’s jump in slope from 0 to 1 at x = 0.

One simple idea that satisfies these conditions is an exponential decay: s(x) = exp(-c|x|)/(2c). This gives a slightly “softer” corner than the soft plus/minus clipping (below center) but is visually similar (below right). Its odd derivatives beyond the first are discontinuous at 0, but maybe that’s not a big deal? Are there other solutions that are infinitely differentiable?
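A quick sketch comparing the two corner smoothers concretely (assuming numpy; the function names are mine, introduced for illustration):

```python
import numpy as np

def softplus_smoother(x, c=1.0):
    # corner term recovered from the stable soft plus identity:
    # softplus(c*x)/c = max(x, 0) + log(1 + exp(-c|x|))/c
    return np.log1p(np.exp(-c * np.abs(x))) / c

def exp_smoother(x, c=1.0):
    # simpler exponential decay; same +1/2 and -1/2 slopes at x = 0
    return np.exp(-c * np.abs(x)) / (2 * c)
```

Both are symmetric, decay to zero away from the corner, and blend the rectifier’s slope jump; at c = 1 they differ mainly in peak height (log 2 versus ½) and in higher-order smoothness.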

A numerically stable way to compute soft plus (or minus) is to add a decaying corner smoother to the rectifier (left). For soft plus, the corner smoother is symmetric, although that’s not true for tanh x, which suggests a simpler exponential decay form (center). The resulting soft clipping functions are very similar visually (solid lines, right), with small absolute differences from tanh x (dotted lines, magnified 10x, right).

Fun fact: the derivative of the soft plus function that we used to build our flexible soft clipping function is the logistic function, which is itself just a simple transformation of our canonical tanh x soft clipping function🤯:

σ(x) = d/dx log (1 + eˣ) = 1/(1 + e⁻ˣ) = (1 + tanh(x/2))/2

Soft square pulses: any function that clips on [0, 1] gives an approximate cumulative distribution function for the unit uniform distribution, so its derivative approximates the probability density for a simple square pulse (which becomes exact for the hard clipping function):

Differentiating our soft clipping functions gives us smooth approximations to the square pulse. The derivative of tanh x (left) is not a great approximation, but both our soft plus/minus clip (center) and soft exponential clip (right) generate increasingly accurate smooth approximations as we increase c to sharpen the corners. In fact, our fun fact means that the approximation via soft plus/minus simply combines a sigmoid with its mirror image.
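As a numerical sanity check (assuming numpy; `soft_clip01` is an illustrative name), the derivative of a soft plus/minus clip to [0, 1] really does behave like a density: it integrates to about 1 and sits near 1 inside the interval, just like the square pulse it approximates:

```python
import numpy as np

def soft_clip01(x, c=20.0):
    # soft plus/minus clip to [0, 1], written in the stable form
    sp = lambda t: np.maximum(t, 0) + np.log1p(np.exp(-np.abs(t)))
    return x + sp(-c * x) / c - sp(c * (x - 1)) / c

xs = np.linspace(-2.0, 3.0, 5001)
dy = np.gradient(soft_clip01(xs), xs)  # numerical derivative
area = np.sum(dy[:-1] * np.diff(xs))   # total mass under the pulse
```

Here area comes out very close to 1, confirming that sharpening c turns the derivative into an ever better approximation of the unit square pulse.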

Want to help us solve complex problems? Hopper is hiring! Check out our career listings here: www.hopper.com/careers
