Building the Global Heatmap
Our Global Heatmap is the largest, richest, and most beautiful dataset of its kind. It is a visualization of two years of trailing data from Strava’s global network of athletes. To give a sense of scale, the new heatmap consists of:
- 700 million activities
- 1.4 trillion latitude/longitude points
- 7.7 trillion pixels rasterized
- 5 terabytes of raw input data
- A total distance of 16 billion km (10 billion miles)
- A total recorded activity duration of 100 thousand years
Beyond simply including more data, a total rewrite of the heatmap code permitted major improvements in rendering quality. Highlights include twice the resolution, rasterizing activity data as paths instead of as points, and an improved normalization technique that ensures a richer and more beautiful visualization.
From 2015 to 2017, there were no updates to the Global Heatmap due to two engineering challenges:
- Our previous heatmap code was written in low-level C and was designed to run on a single machine. Updating the heatmap under this restriction would have taken months.
- Accessing stream data required one S3 GET request per activity, so reading the input data would have cost thousands of dollars and been challenging to orchestrate.
The heatmap generation code has been fully rewritten using Apache Spark and Scala. The new code leverages new infrastructure that enables bulk access to activity streams, and it is parallelized at every step from input to output. These changes resolved both scaling challenges: the full global heatmap was built across several hundred machines in just a few hours, at a total compute cost of only a few hundred dollars. Going forward, these improvements will let us update the heatmap on a regular basis.
The remaining sections of this post describe in some detail how each step of the Spark job that builds the heatmap works, and cover the specific rendering improvements we have made.
Input Data and Filtering
The raw input activity streams data comes from a Spark/S3/Parquet data warehouse. Several algorithms clean up and filter this data.
Most importantly, the heatmap only contains public activities and respects all privacy settings.
Additional filters remove erroneous data. Activities at higher than reasonable running speeds are excluded from the running heat layer because they are most likely mislabeled. There is also a higher speed threshold for bike rides to filter data from cars and airplanes.
The heatmap is intended to show only data from movement, and a new algorithm does a much better job of classifying stopped points within activities. If the magnitude of an activity stream's time-averaged velocity drops too low at any point, subsequent points from that activity are filtered until the activity moves beyond a certain radius from the initial stopped point.
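The stopped-point logic can be sketched roughly as follows. This is an illustrative sketch, not our production code: the window size, speed threshold, and resume radius here are placeholder values, and `Point` assumes planar coordinates in meters.

```scala
// Illustrative sketch of the stopped-point filter; thresholds are placeholders.
case class Point(x: Double, y: Double, t: Double) // planar coords in meters, time in seconds

def dist(a: Point, b: Point): Double =
  math.hypot(a.x - b.x, a.y - b.y)

def filterStopped(points: Seq[Point],
                  window: Int = 5,            // points in the moving-average window
                  minSpeed: Double = 0.5,     // m/s: below this, the athlete is "stopped"
                  resumeRadius: Double = 15.0 // meters: must leave this radius to resume
                 ): Seq[Point] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Point]
  var stoppedAt: Option[Point] = None
  points.sliding(window).foreach { w =>
    val p = w.last
    stoppedAt match {
      case Some(origin) =>
        // Drop points until the activity breaches the resume radius around the stop.
        if (dist(origin, p) > resumeRadius) { stoppedAt = None; out += p }
      case None =>
        // Time-averaged speed: displacement over the window's elapsed time.
        val dt = w.last.t - w.head.t
        val avgSpeed = if (dt > 0) dist(w.head, w.last) / dt else 0.0
        if (avgSpeed < minSpeed) stoppedAt = Some(p) else out += p
    }
  }
  out.toSeq
}
```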
Many devices (most notably the iPhone) will sometimes “correct” the GPS signal in urban areas by snapping it to known road geometry rather than reporting a raw position. This can lead to an ugly artifact where heat is only one pixel wide along some streets. We now correct for this by adding a random offset, drawn from a normal distribution with a 2 meter standard deviation, to all points of every activity. This level of noise is large enough to prevent the artifact without noticeably blurring other data.
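A minimal sketch of this jitter step (names are ours): one offset is drawn per activity and applied to every point, so the whole trace shifts together instead of dissolving into scattered noise.

```scala
import scala.util.Random

// Sketch: shift all points of one activity by a single 2D Gaussian offset.
// sigmaMeters is the standard deviation of the offset in each axis.
def jitterActivity(points: Seq[(Double, Double)],
                   sigmaMeters: Double = 2.0,
                   rng: Random = new Random()): Seq[(Double, Double)] = {
  val dx = rng.nextGaussian() * sigmaMeters
  val dy = rng.nextGaussian() * sigmaMeters
  points.map { case (x, y) => (x + dx, y + dy) }
}
```

Because the offset is shared across the activity, the relative geometry of the trace is preserved exactly.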
We also exclude any “virtual” activities, such as Zwift rides, because these activities contain fake GPS data.
After filtering, the latitude/longitude coordinates of all data points are translated into Web Mercator tile coordinates at zoom level 16. This zoom level tessellates the world into 2¹⁶ by 2¹⁶ tiles, each consisting of 256 by 256 pixels.
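This translation is the standard Web Mercator ("slippy map") formula; a minimal version, with names of our own choosing:

```scala
// Standard Web Mercator conversion: latitude/longitude to a tile index
// plus the pixel position inside that 256x256 tile.
case class TileCoord(tileX: Int, tileY: Int, px: Int, py: Int)

def toTileCoord(lat: Double, lng: Double, zoom: Int = 16): TileCoord = {
  val n = 1 << zoom                          // 2^zoom tiles per axis
  val x = (lng + 180.0) / 360.0 * n          // fractional tile x
  val latRad = math.toRadians(lat)
  // Mercator projection of latitude, scaled to fractional tile y.
  val y = (1.0 - math.log(math.tan(latRad) + 1.0 / math.cos(latRad)) / math.Pi) / 2.0 * n
  val (tx, ty) = (x.toInt, y.toInt)
  TileCoord(tx, ty, ((x - tx) * 256).toInt, ((y - ty) * 256).toInt)
}
```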
The old heatmap rasterized each GPS point as exactly one pixel. This simple approach suffers because activities record at most one data point per second, so the speed of travel often left visible multi-pixel gaps between points in areas of low heat. It also biased heat toward wherever activities generally move slower (consider uphill vs. downhill on the same road). Since the new heatmap renders an additional zoom level (pushing the highest spatial resolution of a single pixel from 4 meters to 2 meters), this problem would have been even more visible. The new heatmap instead draws each activity as a pixel-perfect path connecting subsequent GPS points. At zoom level 16, the segment between two points is 4 pixels long on average, so this change is very noticeable.
To accomplish this in a parallel computation, we needed to handle the case where adjacent points in an activity map to different tiles. Each such pair of points is resampled to include intermediate points along the original line segment at the boundary of each tile crossed. After this resampling, all line segments then start and end on the same tile, or have zero length and can be omitted. Thus we can represent our data as (Tile, Array[TilePixel]) tuples, where Array[TilePixel] is a continuous series of tile coordinates that describe the path of a single activity over a tile. This dataset is then grouped by tile, so that all of the data needed to draw a single tile is mapped to a single machine.
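The boundary resampling for a single segment can be sketched as follows; the representation and names are ours, and coordinates are fractional tile coordinates (tile index plus in-tile fraction):

```scala
// Sketch: split one segment into per-tile pieces, keyed by (tileX, tileY).
def splitAtTileBoundaries(ax: Double, ay: Double, bx: Double, by: Double)
    : Seq[((Int, Int), (Double, Double, Double, Double))] = {
  // Parameter values t in (0, 1) where the segment crosses an integer tile edge.
  def crossings(a: Double, b: Double): Seq[Double] =
    if (a == b) Seq.empty
    else {
      val (lo, hi) = (math.min(a, b), math.max(a, b))
      (math.ceil(lo).toInt to math.floor(hi).toInt)
        .map(k => (k - a) / (b - a))
        .filter(t => t > 0 && t < 1)
    }
  val ts = (0.0 +: (crossings(ax, bx) ++ crossings(ay, by)).sorted :+ 1.0).distinct
  ts.sliding(2).toSeq.flatMap { case Seq(t0, t1) =>
    val (x0, y0) = (ax + t0 * (bx - ax), ay + t0 * (by - ay))
    val (x1, y1) = (ax + t1 * (bx - ax), ay + t1 * (by - ay))
    // The piece's tile is read off its midpoint; zero-length pieces are omitted.
    val (mx, my) = ((x0 + x1) / 2, (y0 + y1) / 2)
    if (t1 - t0 > 1e-12) Some(((mx.toInt, my.toInt), (x0, y0, x1, y1))) else None
  }
}
```

Grouping the resulting tuples by tile key is then a straightforward `groupByKey`-style shuffle in Spark.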
Each sequential pair of tile pixel coordinates is next rasterized onto a tile as a line segment using Bresenham’s line algorithm. This line drawing step must be extremely fast, as it runs trillions of times. The tile itself at this stage is simply an Array[Double](256*256) representing the total count of lines to hit each pixel.
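A minimal version of this rasterization step, accumulating hit counts into the flat tile array (a sketch of the technique, not our exact code):

```scala
// Bresenham's line algorithm over a 256x256 tile stored as a flat
// Array[Double], incrementing the count of every pixel the line touches.
def drawLine(tile: Array[Double], x0: Int, y0: Int, x1: Int, y1: Int): Unit = {
  var x = x0; var y = y0
  val dx = math.abs(x1 - x0); val sx = if (x0 < x1) 1 else -1
  val dy = -math.abs(y1 - y0); val sy = if (y0 < y1) 1 else -1
  var err = dx + dy
  var done = false
  while (!done) {
    tile(y * 256 + x) += 1.0        // count one line hit on this pixel
    if (x == x1 && y == y1) done = true
    else {
      val e2 = 2 * err
      if (e2 >= dy) { err += dy; x += sx } // step in x
      if (e2 <= dx) { err += dx; y += sy } // step in y
    }
  }
}
```

Integer-only arithmetic in the inner loop is what keeps this step fast enough to run trillions of times.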
At the highest zoom level, we populate more than 30 million tiles. This poses a problem because directly holding every tile in memory as double arrays would require at least 30 million x 256 x 256 x 8 bytes = ~15TB of memory. While this amount of memory is not unreasonable to provision in a temporary cluster, it would be rather wasteful considering that the tiles can on average be highly compressed since the majority of entries are zero. For performance reasons, using a sparse array was not a viable solution. In Spark, we can greatly reduce the maximum memory requirement by ensuring that the parallelism of this stage is many times higher than the number of active tasks in the cluster. Tiles from a finished task are immediately serialized, compressed, and written to disk, so only the set of tiles corresponding to the active set of tasks need to be kept decompressed and in memory at any given time.
Normalization is the function that maps each pixel’s raw heat count after rasterization from the unbounded domain [0, Inf) to the bounded range [0, 1] of the color map. The choice of normalization has a huge impact on how the heatmap appears visually. The function should be monotonic so that higher raw values display as higher “heat”, but there are many ways to approach this problem. A single global normalization function would mean that only the few most popular areas on Strava would ever be rendered with the hottest colors.
A slick normalization technique is to use the CDF (cumulative distribution function) of the raw values. That is, the normalized value of a given pixel is the percentage of pixels with a lower heat value. This method yields maximal contrast by ensuring that there are an equal number of pixels of each color. In photo processing, this technique is known as histogram equalization. We use it with a slight modification to prevent quantization artifacts in areas of very low raw heat.
Computing a CDF only for the raw heat values in a single tile will not work well in practice because a map view typically shows at least 5x5 tiles (each tile is 256x256 pixels). Thus, we compute the joint CDF for a tile using that tile and all tile neighbors within a 5 tile radius. This ensures the normalization function can only vary at scales just larger than the size of a typical viewer’s screen.
To compute an approximation of the CDF that can be evaluated quickly, we simply sort the input data and then take a fixed number of samples from it. We also found that it is better to compute a biased CDF by sampling more heavily toward the end of the sorted array. This is because in most cases, only a small fraction of the pixels contain interesting heat data.
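A sketch of this biased-CDF construction; the sample count and bias exponent here are illustrative, not our tuned values:

```scala
// Build an approximate CDF from raw heat values: sort, sample quantiles with a
// bias toward the high end, and evaluate by binary search over the samples.
def buildCdf(raw: Array[Double], samples: Int = 256, bias: Double = 3.0): Double => Double = {
  val sorted = raw.sorted
  val n = sorted.length
  // Quantile sample points crowd toward 1.0, giving high-heat pixels finer resolution.
  val qs = Array.tabulate(samples)(i => 1.0 - math.pow(1.0 - i.toDouble / (samples - 1), bias))
  val xs = qs.map(q => sorted(math.min(n - 1, math.round(q * (n - 1)).toInt)))
  (v: Double) => {
    // Binary search for the largest sampled value <= v.
    var lo = 0; var hi = samples - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) / 2
      if (xs(mid) <= v) lo = mid else hi = mid - 1
    }
    qs(lo) // approximate fraction of pixels with heat <= v
  }
}
```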
An advantage of this approach is that the heatmap has perfectly uniform distribution over the colormap. In some sense, this causes the heatmap to convey maximum information about relative heat values. We also subjectively find that it looks really nice.
A disadvantage of this approach is that the heatmap is not absolutely quantitative: the same color only locally represents the same level of heat. For planning, safety, and transportation departments we offer a more sophisticated product, Strava Metro, with a quantitative version of the heatmap.
Interpolating Normalization Functions Across Tile Boundaries
So far we have a normalization function for each tile that represents a CDF of pixels within the local neighborhood of a few tiles. However, the CDF will still discontinuously change at tile boundaries in a way that is noticeable as an ugly artifact, especially in areas where there is a large gradient of absolute heat.
To address this, we used bilinear interpolation. Each pixel’s final value is computed as a weighted sum over the normalization functions of its four nearest tiles, with bilinear weights given by max(0, (1-x)(1-y)), where x and y are the pixel’s distances from a tile’s center, measured in tile units. This interpolation is expensive because we must evaluate four CDFs for every pixel instead of one.
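The blending can be sketched as follows; the representation is ours, with `cdfs` returning the per-tile normalization function and (gx, gy) giving the pixel's position in global fractional tile units:

```scala
// Blend four per-tile CDFs with bilinear weights max(0, (1-x)(1-y)),
// where x, y are distances from each tile's center in tile units.
def normalize(raw: Double, gx: Double, gy: Double,
              cdfs: ((Int, Int)) => Double => Double): Double = {
  // Indices of the four tiles whose centers are nearest to (gx, gy).
  val cx = math.floor(gx - 0.5).toInt
  val cy = math.floor(gy - 0.5).toInt
  val terms = for (i <- 0 to 1; j <- 0 to 1) yield {
    val (tx, ty) = (cx + i, cy + j)
    val x = math.abs(gx - (tx + 0.5)) // distance from this tile's center
    val y = math.abs(gy - (ty + 0.5))
    math.max(0.0, (1 - x) * (1 - y)) * cdfs((tx, ty))(raw)
  }
  terms.sum
}
```

The four weights always sum to one, so a pixel whose neighbors share the same CDF is unchanged by the interpolation.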
Recursion Across Zoom Levels
So far we have only talked about generating heat for a single zoom level. To process lower zoom levels, the raw heat data is simply summed: four tiles at one zoom level become one tile at the next, with half the linear resolution. The normalization step is then rerun on the result. This continues until the lowest zoom level is reached (a single tile for the whole world).
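The downsampling step above can be sketched as follows (argument names are ours): each parent pixel is the sum of the corresponding 2x2 block of child pixels, so total heat is conserved across zoom levels.

```scala
// Sum four 256x256 child tiles into one parent tile at half the linear resolution.
def downsample(tl: Array[Double], tr: Array[Double],
               bl: Array[Double], br: Array[Double]): Array[Double] = {
  val parent = new Array[Double](256 * 256)
  // Which child tile covers a given parent pixel quadrant.
  def child(px: Int, py: Int): Array[Double] =
    if (py < 128) { if (px < 128) tl else tr }
    else          { if (px < 128) bl else br }
  for (py <- 0 until 256; px <- 0 until 256) {
    val c = child(px, py)
    // Top-left corner of the 2x2 source block inside the child tile.
    val sx = (px % 128) * 2; val sy = (py % 128) * 2
    parent(py * 256 + px) = c(sy * 256 + sx) + c(sy * 256 + sx + 1) +
                            c((sy + 1) * 256 + sx) + c((sy + 1) * 256 + sx + 1)
  }
  parent
}
```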
It is very exciting when stages of a Spark job need to process an exponentially decreasing amount of data, and thus take an exponentially decreasing amount of time. After waiting about an hour for the first level stage to finish, it is a dramatic finish to see the final few levels take less than a second.
Serving It Up
We finally store the normalized heat value for each pixel as a single byte, since we display that value through a color map represented as a 256-entry array of colors. We store this data in S3, grouping several neighboring tiles into one file to reduce the total number of S3 files required. At request time, the server fetches and caches the corresponding precomputed S3 meta-tile, then generates a PNG on the fly from this data using the requested colormap. Our CDN (CloudFront) then caches all tile images.
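As a concrete sketch of the lookup (the grayscale ramp here is a placeholder, not one of our actual colormaps):

```scala
// Placeholder colormap: a 256-entry grayscale ramp of packed 0xRRGGBB values.
val colormap: Array[Int] = Array.tabulate(256)(i => (i << 16) | (i << 8) | i)

// One stored byte indexes straight into the colormap.
// The & 0xFF is needed because bytes are signed on the JVM.
def colorOf(heatByte: Byte): Int = colormap(heatByte & 0xFF)
```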
We also made various front-end updates. We are now using Mapbox GL. This allows for smooth zooming, as well as fancy rotation and pitch controls. We hope that you will enjoy exploring this updated heatmap.