How We Built the World’s Biggest Real-Time DB of Mobile Devices
As a world-leading attribution company, we know in Appsflyer that clients all around the globe count on our data to be accurate at all times. That means the data we present must be bulletproof, and above all — real. We were aware that fraud data exists, and draws an incorrect picture to our customers. We also knew that we need to stop this, and if we’re going to this — we’re going to do something big, something huge — something that nobody did before us and will change the way the ecosystem fights fraud. And so we built DeviceRank.
DeviceRank was a first of its kind project in Appsflyer — changing the way we looked at our data. Its purpose is to scan the activity of each and every device passing through our data pipelines, and identify patterns in its present and past behavior which might indicate its being used for fraudulent actions. We’ve neglected all of our standard segregations, based on apps, media-sources and time-zones, and for the first time examined the data as the end-user — the person holding the phone — creates it; how it really occurs in the real world. The conclusions and findings were absolutely astonishing, and quickly made DeviceRank an Appsflyer best-seller.
But knowing which devices are fraudulent and which aren’t, only solves half of the problem — we need to block actions coming from these devices in real-time, or we did nothing. Making the data accessible in real-time becomes an even more complex problem when facing the scale of the database we’re about to create — each day approximately 10M new devices are being added, and many others’ authenticity-rank changes; one year after DeviceRank’s launch, its database holds over five billion devices IDs and ranks, growing and changing every day.
First attempt: Bloom-filters
Each record accessible to the real-time pipeline is pretty short: it only contains an (encrypted) device ID and a character symbolizing its rank (‘A’, ‘B’, ‘C’, etc). So, we have a very (very) long list of very short pieces of data.
We decided to give Bloom-filters a chance here — our experiments showed we can reduce our dataset to only 4.5% of its original size, and allowing only 0.1% chance of false-positives. The plan was to create a separate Bloom filter for each possible rank, and these will be uploaded to Redis, and queried using the SETBIT and GETBIT commands. Our ground-truth data, the list of the devices and ranks, the output of the rank-calculation process, was stored on S3 as a parquet file, as we use Spark for that process. It only makes sense to use Spark to translate it to Bloom-filters — as seen in the below drawing.
While this solution seemed to be working, we quickly decided to replace it; each Bloom-filter query required several requests (for every bit), and we had to query several Bloom-filter at once for each rank-request. This was too much weight on our network, and we weren’t sure it will survive the massive scale we were headed towards.
Our next solution was to replace Redis with S3, and have the the real-time API read the Bloom-filters from S3 when they are updated and store them cached in-memory, as seen in drawing below
We quickly ran into our next bump in the road: The new architecture of our Spark job made the Bloom-filter data travel from the slaves to the master and back again, which means they must be serialized. This, apparently, wasn’t an easy task for Spark. After much trial and error (and mostly errors), we’ve managed to push Spark’s Kryo serializer to its limits, and were able to create Bloom-filters of up to 40 million devices each, meaning several Bloom-filters for each possible rank. Each Bloom-filter contained devices based on their hash-code, so we didn’t need to scan all the Bloom-filters of each rank, but only one — so the querying action was unharmed.
Still, we weren’t happy with this solution — the memory required to cache the Bloom-filters will scale, and we didn’t like the fact we had a production process that literally pushes its framework to its absolute limit. So we changed it again.
Second attempt: Aerospike
You can call it great timing, you can call it love at first sight — call it however you’d like, this was exactly what it was. Aerospike was something new in Appsflyer at that time, and we’ve decided that DeviceRank will be its first baptism of fire. For those of you who are not familiar with it, Aerospike is a key-value database that has since became a resident in Appsflyer. Using a real-time designated database will, other than solve our storage issue, allow us a 100% accurate response at all times (unlike the 99.9% we got from the Bloom-filters).
While we were very satisfied with Aerospike’s real-time querying capabilities, we had to change the way we update the data in it — Aerospike doesn’t support batch-writing, but only one record at a time. Rewriting our entire dataset became a very long and inefficient task, and of course added a lot of load to the database, while only a fraction of the data really changed. So this time too — Spark to the rescue! As now the data published by our Spark job wasn’t bundled together, we could easily stop writing the entire dataset every time, and simply update the data already on Aerospike. So now our Spark job read not only the last output of DeviceRank’s scoring mechanism, but also the one before it — and using the except method of Spark’s DataFrame, figured out what were the changes and pushed only them to Aerospike. So now our system looked like this:
While it worked, we still weren’t pleased. As the writings are performed record-by-record, they were much slower than the batch-writings we are so used to from working with Kafka. Also, as all our Spark slaves run on AWS spots, which technically could be lost at any given moment, slow writings meant more chance for failures. We also felt we do not have full control on the throttling between the two. And so we found our final design -
Third attempt: Kafka and Aerospike
We’ve decided to add an extra buffer between Spark and Aerospike — our beloved Kafka. Spark will launch a Kafka producer on each of its slaves, and write the data updates in bulks to designated topic. Consumers will then read the data and write it to Aerospike:
This made our system far more resilient and effective. Spark’s slaves writing time decreased dramatically — from about half an hour to around 5 minutes, what also allowed other jobs run sooner, and if by chance a Kafka consumer would fail, no data would have been lost.
The final architecture was designed and implemented not so long after the release of DeviceRank, and is still in production one year later — handling flawlessly the massive growth of the dataset, which have tripled (!!) itself since our first testings, growing from ~1.4 billion devices at launch to 5B today. And if that’s not enough, we proudly admit this is one of our most stable and resilient systems. Planning on buying a new phone any time soon? Don’t worry about us — we’re covered.