
Incorporating privacy into counts in Metroview


At Metro, we provide aggregated and de-identified data to municipalities, NGOs, and advocacy groups to promote bicycle and pedestrian infrastructure. One of the most valuable pieces of data that we provide is counts of activities traversing particular sections of road, or originating from a particular section of the map. This data provides valuable insights into the highest-traffic arteries of a region or the most popular neighborhoods to commute to/from.

Basic counts are given for each edge, represented as the shade of the edge
Our Origins and Destinations product counts trips starting/ending in certain locations

In the Metroview interface, we also provide filters to break down the counts further. For example, a city planner might want to look only at commutes on weekdays, or only look at data from February, or only look at usage of a particular artery in the morning.

Some filters that Metro partners can apply to our counts data

Retrieving count data: easy mode

Let’s consider just the street counts data. (At Strava Metro, we call each section of street an “edge”. See here and here for more information on the Strava basemap.) If we were providing this data in an internal analytics platform, these filters would be easy to implement. We could write a query against our data warehouse to represent the filters:

SELECT
  count(*)
FROM
  activity_aligned_edges
WHERE
  commute = TRUE
  AND activity_type = 'ride'
  AND datepart('hour', traversed_at) < 11
  AND datepart('hour', traversed_at) > 5
  AND datepart('weekday', traversed_at) < 5
  AND datepart('month', traversed_at) = 2
  AND datepart('year', traversed_at) = 2022;

This would be fine if we could sit around and wait for the query to complete on our aligned edges table, which contains on the order of hundreds of billions of rows (terabytes of valuable database storage). We’d run this query once for each edge within the viewport on the client. The problem is that we don’t have the luxury of time: we aren’t providing this data in an analytics platform, we are feeding a real-time Metro web interface, and users want to see this data quickly. In fact, using the above query is out of the question; storing all of the data we have in activity_aligned_edges in our data warehouse (Snowflake) would be too expensive to even consider, especially if it only served this one use case.

Retrieving count data: medium mode

Let’s consider another approach to storage. Our team already preprocesses Metro data in monthly batches to refresh the Metroview UI, so one way to get around the poor performance of the previous query is to precompute the counts: store a count for every permutation of possible filters, giving us a key-value pair per permutation for every edge in our basemap. We could use a key-value datastore for this, like DynamoDB or the in-memory PalDB. With this solution, however, the key space is very large:

(edgeId, tripPurposes, timeOfDays, activityTypes, dayOfWeeks, timespan) => count

450 million edges * 3 trip-purpose selections (any non-empty subset of the 2 trip purposes) * 5 times of day (the 4 times of day plus “any time”) * 31 activity-type selections (any non-empty subset of run, walk, hike, bike, ebike) * 3 day-of-week selections (any non-empty subset of the 2 day-of-week modifiers) * 52 timespans (month granularity and year granularity over Metro’s 48-month lookback) = 32,643 billion permutations.
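As a quick sanity check, here is the arithmetic behind that estimate as a throwaway Scala sketch; the value names are just for illustration:

// Back-of-the-envelope key-space estimate for the "store every permutation" approach.
// A user can select any non-empty subset of each categorical dimension.
val edges         = 450000000L
val tripPurposes  = (1 << 2) - 1   // 3 non-empty subsets of the 2 trip purposes
val timesOfDay    = 4 + 1          // 4 times of day plus "any time"
val activityTypes = (1 << 5) - 1   // 31 non-empty subsets of {run, walk, hike, bike, ebike}
val daysOfWeek    = (1 << 2) - 1   // 3 non-empty subsets of the 2 day-of-week modifiers
val timespans     = 48 + 4         // 48 months plus 4 years

val permutations = edges * tripPurposes * timesOfDay * activityTypes * daysOfWeek * timespans
println(permutations)  // 32643000000000, i.e. 32,643 billion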

Anyone familiar with Amazon Web Services (and without unlimited financial resources) will balk at that number of keys needing to be updated monthly in DynamoDB. We could certainly store multiple permutations of the same edge in the same key-value pair, but even that is cost prohibitive, with DynamoDB costs on the order of hundreds of thousands of dollars per year.

Before implementing the solution I discuss below, we used this approach to build the filters. Because of this hefty data problem, we chose to reduce the number of filters available on our Streets view when we launched Streets in 2019: we limited selection to just the broader categories of “bike” and “pedestrian”, and didn’t allow selection by time of day or day of week. We stored the counts in PalDB, a memory-optimized binary storage format for key-value pairs. These PalDB files are loaded onto the server at startup, meaning that service deployment times are constrained by the size of the counts files and the available network bandwidth.

Let’s add some more data

Since the launch of Streets, we’ve launched the Origins and Destinations product on Metroview. When creating the filters for that product, we were led to revisit the filters available in the Streets view. We’d like to enable the same set of filters on both feature sets, in the eventual hope of condensing the map visualizations into one screen.

Of course, there is a better way to structure the data to enable more filters. We don’t actually need to store all the possible permutations of the filters; we just need to store the count for the smallest unit of each filter. We can then combine rows to calculate the total count for any filter that requires a combination of rows (e.g., commute and leisure activities on edge 123 in the morning on the weekend by bike in June 2022 adds the counts from two different rows). This calls for a relational database, ideally one optimized for analytics and aggregation queries. With this method, we can reduce the number of rows above by getting rid of the combinatorial elements.

(edgeId, tripPurpose, timeOfDay, activityType, dayOfWeek, month) => count

450 million edges * 2 trip purposes * 4 times of day * 5 activity types * 2 days of week * 48 months = 1,728 billion permutations.

This amount of data is much more palatable: in a packed binary format in S3, it came out to about half a terabyte. I’ve dubbed each row in this dataset an “atomic edge count” to indicate that it is the count for a single “atomic” filter.
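To make the idea concrete, here is a minimal sketch of how a server could combine atomic edge counts to answer an arbitrary filter. The case class and field names here are hypothetical, for illustration only:

// Minimal sketch of combining atomic edge counts at query time.
// Names (AtomicEdgeCount, CountFilter, etc.) are illustrative, not Metro's actual API.
case class AtomicEdgeCount(
  edgeId: Long,
  tripPurpose: String,   // e.g. "commute" or "leisure"
  timeOfDay: String,     // e.g. "morning" (illustrative label)
  activityType: String,  // e.g. "bike"
  dayOfWeek: String,     // e.g. "weekday" or "weekend"
  month: String,         // e.g. "2022-06"
  activityCount: Long
)

case class CountFilter(
  tripPurposes: Set[String],
  timesOfDay: Set[String],
  activityTypes: Set[String],
  daysOfWeek: Set[String],
  months: Set[String]
)

// The total for any filter combination is just the sum of the matching atomic rows.
def totalCount(rows: Seq[AtomicEdgeCount], f: CountFilter): Long =
  rows.filter { r =>
    f.tripPurposes(r.tripPurpose) &&
    f.timesOfDay(r.timeOfDay) &&
    f.activityTypes(r.activityType) &&
    f.daysOfWeek(r.dayOfWeek) &&
    f.months(r.month)
  }.map(_.activityCount).sum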

One more thing…

Done, right?

Not quite. We have one more wrench to throw into the problem.

What if we have one Strava athlete in a rural or suburban neighborhood, named Sonya, who likes to record her neighborhood dog walks every day? What if no one else uploads data in the same neighborhood? The counts will be high on the edges in her neighborhood, but that data is very specific to her. It might be overly representative of her habits, which is good for neither the Strava athlete nor our Metro partners. This single contributor to the Metro dataset should not be counted.

In all Metro datasets, we require that enough athletes contribute to an aggregate in order for it to appear in the platform at all. This is essential to ensure the privacy of our athletes as well as the usefulness of Metro as a platform. It doesn’t tell us anything significant about a town’s infrastructure that one person walks their dog frequently on the neighborhood streets.

With this in mind, if we go back to the edge counts, we see we have a problem. If we’re using atomic edge counts to create aggregate filters (to answer questions like “how many people rode a bike across this edge, on any day of the week, at any time of day, in June 2022?”), we can’t determine from the counts how many athletes are represented. We only have the activity aligned edge count. Our friend Sonya’s activities will still be represented in the count, even if she was the only one to use that section of road. So the question becomes: How do we augment our dataset to ensure we only show counts when there are greater than n athletes contributing to a count?

The most obvious answer is to store the count of athletes alongside the count of activities in the same row. Then we add the athlete counts, and if the athlete count is greater than n, show the data. That is, for each atomic filter key, have a tuple of values:

(edgeId, tripPurpose, timeOfDay, activityType, dayOfWeek, month) => (activityCount, athleteCount)

But this doesn’t work: if Sonya walks her dog every month of the year, the athlete count on a full-year query will be 12 (the sum of twelve monthly rows, each with an athleteCount of 1), even if she is the only person recording her walks in the neighborhood. In other words, storing the athleteCount doesn’t keep any information about who that athlete is, so counting the total number of athletes as a sum of athleteCounts is ineffective.

The next solution that comes to mind is to store all of the athleteIds contributing to a row in the key/value pair. We can just count the number of distinct athleteIds on the server:

(edgeId, tripPurpose, timeOfDay, activityType, dayOfWeek, month) => (activityCount, list[athleteId])

It’s probably clear that this is quite space inefficient. For popular edges like Market Street in San Francisco or the Central Park loop in New York or Victoria Embankment in London, storing all athlete ids would give row sizes on the order of hundreds of kilobytes. Given we have 450 million edges, this sort of storage footprint is untenable. Not only that, but the variable size of the column would lead to page load inefficiencies within any database.

But let’s recall the problem again. There’s actually only one thing we need to know: are there more than n athletes traversing an edge?

The clean and imperfect solution

We instead use a basic hash function on the athleteId and leverage the Hamming weight of the result of a bitwise OR operation across the rows aggregated within a filter to determine a low fidelity approximation for the number of athletes traversing an edge.

Let’s break that down. We essentially store an athleteIdBitmap that’s a representation of the hashed values of athlete ids.

  • Use a basic hash function on the athleteId: take the athlete id mod 64 and map it to a bit in a Long bitmap:
// athleteId 64 -> bit 0 -> 0b1, 65 -> bit 1 -> 0b10, 66 -> bit 2 -> 0b100, 67 -> bit 3 -> 0b1000, etc.
val bitwiseAthleteId = 1L << (athleteId % 64)
  • Bitwise OR the bitwiseAthleteIds together and store the resulting Long alongside the activity count those athletes contributed to:
val athleteIdBitmap = athlete1.bitwiseAthleteId | athlete2.bitwiseAthleteId | …
// For example: 00..0010 BITWISE OR 00..0100 = 00..0110 => 2 flipped bits
  • Store these rows:
(edgeId, tripPurpose, timeOfDay, activityType, dayOfWeek, month, activityCount, athleteIdBitmap)
  • On the server side, when summing the activity counts for a combination of rows, also bitwise OR the athleteIdBitmaps to get a complete athleteIdBitmap for that combination of rows.
  • Determine the Hamming weight (the number of 1s in the binary representation) of the combined athleteIdBitmap to get the number of athletes represented in a count, and ensure that the number is greater than n before returning the count to the client.
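Putting the pieces together, a minimal sketch of the bitmap bookkeeping might look like this; the helper names are mine, not from Metro’s actual code:

// Sketch of the athleteIdBitmap scheme described above. Helper names are illustrative.
object AthleteBitmap {
  // Map an athlete id into the 64-slot hash space and set the corresponding bit.
  def bitwiseAthleteId(athleteId: Long): Long =
    1L << (athleteId % 64)

  // Fold the contributing athletes of an atomic row into a single 64-bit bitmap.
  def buildBitmap(athleteIds: Iterable[Long]): Long =
    athleteIds.foldLeft(0L)((bitmap, id) => bitmap | bitwiseAthleteId(id))

  // Combine the bitmaps of several atomic rows, exactly as the server does at query time.
  def mergeBitmaps(bitmaps: Iterable[Long]): Long =
    bitmaps.foldLeft(0L)(_ | _)

  // Hamming weight of the bitmap: a lower bound on the number of distinct athletes.
  def approxAthleteCount(bitmap: Long): Int =
    java.lang.Long.bitCount(bitmap)

  // Only return the activity count if more than n athletes are represented; otherwise zero it out.
  def privacyThresholdedCount(activityCount: Long, bitmap: Long, n: Int): Long =
    if (approxAthleteCount(bitmap) > n) activityCount else 0L
}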

There’s a flaw in this solution, too. We hash athleteIds into a hash space of only 64 slots. The probability of hash collisions is fairly high even for a relatively small number of distinct ids, as we understand from the birthday problem.

The probability of getting at least one collision when hashing k distinct ids into a 64-slot hash space is

P(k) = 1 - \prod_{i=0}^{k-1} \frac{64 - i}{64}

For example, P(1) = 0, P(3) = 4.6%, P(5) = 14.8%, P(10) = 52.3%.

This implies that if we have 10 distinct IDs, there’s a 52.3% chance that our scheme will undercount the number of athletes.
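Those percentages can be reproduced with a few lines of Scala (a quick check, not production code):

// Probability of at least one collision when k distinct athlete ids are hashed
// uniformly into 64 slots (the birthday problem).
def collisionProbability(k: Int, slots: Int = 64): Double =
  1.0 - (0 until k).map(i => (slots - i).toDouble / slots).product

Seq(1, 3, 5, 10).foreach { k =>
  println(f"P($k) = ${collisionProbability(k) * 100}%.1f%%")
}
// P(1) = 0.0%, P(3) = 4.6%, P(5) = 14.8%, P(10) = 52.3%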

This sounds bad, but we have to zoom back out to the problem. We want to know when the count is composed of activities from at least n athletes, because we want to protect the privacy of athletes while keeping the aggregated Metro dataset useful. No one is going to complain if we show a zero count for an edge that already has only a small number of athletes traversing it. In fact, because hash collisions can only undercount athletes, our solution guarantees that any count we show is backed by at least n athletes. It doesn’t guarantee that every edge with more than n athletes escapes being zeroed out, but the chances of zeroing out an edge become vanishingly small as the number of athletes grows above n. Put another way, we are biased towards zeroing out edges that have low athlete counts in the first place.

Let’s take n = 10 for example. Calculating the probability of an edge getting zeroed out with k input ids is a variant of the coupon collector’s problem. The expected number of trials k needed to fill j of the m = 64 buckets can be formalized as:

E[k_j] = \sum_{i=0}^{j-1} \frac{m}{m - i}

For j = 10, the expected number of trials is approximately 10.78. That is, on average, 10.78 distinct athlete ids will flip 10 bits in the athleteIdBitmap.
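That 10.78 figure is easy to verify with the same kind of quick check, with m = 64 buckets:

// Expected number of uniformly hashed distinct ids needed to fill j of m buckets:
// E[k_j] = sum over i in [0, j) of m / (m - i)
def expectedIdsToFill(j: Int, m: Int = 64): Double =
  (0 until j).map(i => m.toDouble / (m - i)).sum

println(f"${expectedIdsToFill(10)}%.2f")  // ≈ 10.78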

To calculate the probability distribution function of the random variable kⱼ, we can use the probability that exactly j buckets are empty, per the formula for the occupancy problem found here:

P(X_{k,m} = j) = \binom{m}{j} \sum_{i=0}^{m-j} (-1)^i \binom{m-j}{i} \left(1 - \frac{j+i}{m}\right)^k

where k is the number of ids hashed, m is the number of buckets (64 in our case), and Xₖ,ₘ is the random variable indicating the number of empty cells. We assume the hashing function gives a uniform distribution of output (it should, as our athlete ids are evenly distributed). The probability that fewer than 10 buckets are used is equivalent to the probability that more than 54 buckets are empty. For this discrete random variable, that is given by:

P(X_{k,m} > p) = \sum_{j=p+1}^{m} P(X_{k,m} = j)

where p is the lower bound on the number of empty buckets (here p = 54). Below are results for varying values of k. Note that P(Xₖ,ₘ > 54) represents the probability that more than 54 buckets are empty (i.e., fewer than 10 buckets are filled) and we zero out an edge erroneously (i.e., enough athletes have traversed the edge, but we still show zero counts):

Probability distribution function for the likelihood that less than 10 buckets are full
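As a cross-check on the closed-form results, the same probabilities can be approximated with a quick Monte Carlo simulation; this is just a sketch, not how we compute the published numbers:

import scala.util.Random

// Estimate the probability that k uniformly hashed athlete ids fill fewer than
// 10 of the 64 buckets, i.e. that an edge with k distinct athletes is zeroed out.
def probZeroedOut(k: Int, threshold: Int = 10, slots: Int = 64, trials: Int = 1000000): Double = {
  val rng = new Random(42)
  val hits = (1 to trials).count { _ =>
    Seq.fill(k)(rng.nextInt(slots)).distinct.size < threshold
  }
  hits.toDouble / trials
}

// For k = 10, the estimate should land near 0.523, matching the birthday-problem
// figure above; it falls off quickly as k grows.
Seq(10, 12, 15, 20).foreach { k =>
  println(f"k = $k%2d: P(zeroed out) ≈ ${probZeroedOut(k)}%.3f")
}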

This confirms what we already know intuitively: the chances of zeroing out an edge become vanishingly small as the number of athletes grows above n. We ultimately determined that this room for error is acceptable, especially given the use case, where we would rather take the conservative approach and zero out more data rather than less. Our partners generally should not care whether ten or thirteen people traversed a particular edge in our basemap. With the benefits gained from using this approach, going forward with this migration was a no-brainer.

Storage and loading

The final set of rows looks like this:

(edgeId, tripPurpose, timeOfDay, activityType, dayOfWeek, month, activityCount, athleteIdBitmap)

All in, our dataset is about a terabyte in a bitpacked format in S3. Since all of these rows are pregenerated in a Spark batch job, we can apply a sorting and partitioning scheme to the data within S3 and forgo storing the full terabyte in a relational database. We can then leverage the sorted and partitioned properties of the dataset to build an efficient retrieval path combining S3 reads with an in-memory cache. See here for more details on this strategy, which we also use for activity aligned edges in Metro.
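For illustration, a Spark write along these lines produces a sorted, partitioned layout in S3. This is a hedged sketch: it writes Parquet for simplicity, whereas our actual job writes its own bitpacked format, and the helper name is hypothetical:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: lay out atomic edge counts in S3 partitioned by month and
// sorted by edgeId within each partition, so the server can find an edge's rows
// with a small, predictable number of ranged reads.
def writeAtomicEdgeCounts(spark: SparkSession, counts: DataFrame, outputPath: String): Unit = {
  import spark.implicits._

  counts
    .repartitionByRange($"month", $"edgeId")  // co-locate nearby edge ids
    .sortWithinPartitions($"edgeId")          // sorted within files enables range lookups
    .write
    .mode("overwrite")
    .partitionBy("month")                     // one S3 prefix per monthly batch
    .parquet(outputPath)                      // Parquet stand-in for the bitpacked format
}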

The improvement of edge count filters involved two innovations. First, we moved from storing counts for every permutation of filters to storing atomic edge counts, with athlete-level privacy preserved via the athleteIdBitmap method. Second, we moved from an in-memory PalDB file to an S3-backed in-memory cache with good cache locality properties. Together, these two changes delivered:

Latency improvements after shift to the new data store (first line is the old backend)
  1. Up to a 2x latency improvement on edge count requests
  2. A 20x improvement in service deployment times, since we no longer need to load the PalDB key-value file into memory on startup
  3. 700x more permutations of filters (not including dates); technically speaking, over 150 quadrillion new filter permutations are enabled
  4. A roughly 20GB reduction in memory footprint per server instance
  5. Replacement of the heatmap in areas that are well mapped in OpenStreetMap, with arbitrary filters available on the aligned heatmap

We’re really excited about the possibilities that this approach enables in the future and how it could help our partners unlock new valuable insights into bicycle traffic flows.

Apply for a Metro partnership here.

Strava is hiring!

Resources

Scaling activity aligned edges

The Occupancy Problem

Variations on the Birthday problem

The Coupon Collector problem

Hash analysis

Written by Derick Yang

Constantly learning. Running, cycling, and going places.
