Our antivirus backend needs to handle many concurrent requests which probe the Android package database to determine whether a given package is safe or not.
The server needs to process up to 100 million packages per day, which leads to approximately 1,200 packages being queried per second. These requests are sent from mobile devices which check the status of the packages installed on the device, and the response should be returned within 100 ms. Packages which are not yet known are inserted into the database, and the response contains information about both existing packages and any new ones. We also include information about the safety of each package based on the results of manual evaluation, as well as a threat index determined by a machine learning algorithm (currently under development).
Initially, we used PostgreSQL both to store information about packages and to query that information. This turned out to be an impossible task because the SQL engine cannot handle thousands of rows being inserted AND queried at the same time. We tried all the usual remedies, such as adding proper indexes and keeping the inserted data minimal. With 1,000 packages inserted per second, some queries took more than 20 seconds, and performance kept degrading. Requests started to time out, and eventually the processing slowed down to snail speed. Furthermore, if you try to delete any data while the database is in this state, the database engine may completely halt processing on the table for a long time.
Clearly, there must be a better way: Meet Redis, the in-memory data store/cache.
Note: Our solution is based on this blog entry about insert writes.
Persisting New Records Into Redis
Instead of inserting each row directly into the SQL database, each record is first inserted into Redis. New records are inserted using bulk operations for better performance. Note that many duplicate records may be received from mobile clients, but only one record per key is needed. Thanks to the key-value nature of Redis, there is only one record for insertion per key.
We use the MSET operation to insert multiple records at the same time.
In Spring, this translates to the following code:
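A minimal sketch of how this might look with Spring Data Redis (the class and field names are our own; we assume the entity values have already been serialized into JSON strings and the record keys into Strings via getKey()):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.springframework.data.redis.core.StringRedisTemplate;

public class PackageInsertBuffer {

    static final String INSERT_PREFIX = "packageInsert::";

    private final StringRedisTemplate redisTemplate;

    public PackageInsertBuffer(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    /** Builds the Redis key: the packageInsert:: prefix plus the serialized record key. */
    static String insertKey(String serializedKey) {
        return INSERT_PREFIX + serializedKey;
    }

    /**
     * Inserts all records in a single round trip using MSET (multiSet).
     * Map keys are the String form of PackageInfoKey, values the JSON-serialized entity.
     */
    public void insertAll(Map<String, String> recordsJsonByKey) {
        Map<String, String> bulk = new LinkedHashMap<>();
        // Duplicate record keys collapse naturally: the map keeps one entry per key.
        recordsJsonByKey.forEach((key, json) -> bulk.put(insertKey(key), json));
        redisTemplate.opsForValue().multiSet(bulk); // one MSET command
    }
}
```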
Let’s quickly go through the code. The PackageInfoEntity class is the value entity with information about packages. The PackageInfoKey class contains the key values; the key can be serialized into a String by calling the getKey() method. We store all packages which need to be inserted into the SQL database under the packageInsert:: prefix followed by the record key converted into a String. The code inserts records into Redis using a bulk MSET operation, which is blazing fast.
Moving Data from Redis into PostgreSQL and Distributed Locking
At this point, we already have a super-easy way to insert records into Redis. However, we also need to make sure the insert buffer is processed periodically and inserted into the SQL database. This way, we make sure both that the data is persisted in reliable storage and that it is available for analytical purposes as soon as possible.
Initially, we had multiple writer threads inserting data from Redis into the SQL database. Based on our experiments, we found that the best performance is achieved when only one writer inserts data at a time. We also need to make sure concurrency is handled properly. Luckily, distributed locking can be implemented easily using Redis.
The implementation is quite simple:
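A sketch of the lock acquisition with Spring Data Redis; the DbWriteLock class name and the writer identifier are our own illustrations, and the expiration time is our addition as a safety net against a crashed writer holding the lock forever:

```java
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;

public class DbWriteLock {

    static final String LOCK_KEY = "dbWriteLock";

    private final StringRedisTemplate redisTemplate;
    // Unique value identifying this writer instance.
    private final String writerId = UUID.randomUUID().toString();

    public DbWriteLock(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    /** Atomically tries to acquire the lock; returns true if this writer won. */
    public boolean tryLock() {
        // setIfAbsent maps to SETNX; the TTL is an assumption of this sketch,
        // not part of the original call.
        Boolean acquired = redisTemplate.opsForValue()
                .setIfAbsent(LOCK_KEY, writerId, 30, TimeUnit.SECONDS);
        return Boolean.TRUE.equals(acquired);
    }
}
```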
Each writer identifies itself with a unique value stored under the shared lock key. Using the SETNX Redis command (setIfAbsent in Java), a thread either acquires the lock or fails to do so. This operation is atomic: only one writer can obtain the write lock. The writer which acquires the lock processes the insert buffer: all new packages are stored in the database, and the data is transferred in a concurrency-friendly way (see the section below).
Once all data is persisted into the database, the lock is released using the following command:
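In its simplest form this is a single DELETE of the lock key (a production variant might first verify that the stored value still belongs to the current writer):

```java
// DEL dbWriteLock: any waiting writer can now acquire the lock.
redisTemplate.delete("dbWriteLock");
```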
Handling Concurrent Record Inserts and Data Transfer
We need to make sure that while data is being processed by a writer thread (transferred from Redis into PostgreSQL), new records can still be inserted at the same time. For this reason, we introduced a distributed counter in Redis, which separates each round of inserted records.
A distributed counter value is increased using the INCR command.
The Java implementation is super easy:
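Something along these lines, assuming the counter lives under an insertCounter key:

```java
// INCR insertCounter: atomically increments the counter and returns the new value.
Long insertCounter = redisTemplate.opsForValue().increment("insertCounter", 1);
```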
When inserting new records into Redis, we also store the counter value for each record key:
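A sketch of this bookkeeping step; we assume the insertBuffer ZSET stores the full packageInsert::* key of each record, with the current counter value as the score:

```java
// ZADD insertBuffer <counter> <recordKey> for every record in this round.
for (String redisKey : insertedRedisKeys) {
    redisTemplate.opsForZSet().add("insertBuffer", redisKey, insertCounter.doubleValue());
}
```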
All these new records can easily be found by querying the insertBuffer ZSET, which is used later when processing the insert buffer. Thanks to the counter value, it is clear which records should be processed: only records up to the current counter value should be inserted into PostgreSQL; newer records are still being inserted and will be handled in a future round.
The ZSET is also great for performance: it would be too slow to scan through all packageInsert::* keys. Note that the ZSET score is simply the counter value converted into a double.
Querying Records for Transfer
Once the new records have been inserted into Redis, we need a fast and easy way to obtain all records with the counter value up to the current value. This means obtaining keys of all records which have not been processed yet while ignoring records inserted by other threads with higher counter values.
This can be done easily using the Redis ZRANGEBYSCORE command.
The Java implementation is again very simple:
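A sketch, again under the assumption that the insertBuffer members are the full record keys (rangeByScore maps to ZRANGEBYSCORE):

```java
// ZRANGEBYSCORE insertBuffer -inf <insertCounter>:
// all record keys inserted up to the current counter value.
Set<String> keysToProcess = redisTemplate.opsForZSet()
        .rangeByScore("insertBuffer", Double.NEGATIVE_INFINITY, insertCounter.doubleValue());
```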
This piece of code extracts keys from the insertBuffer record whose scores are up to the insertCounter value.
Removing Records from Redis
After the data has been persisted into the SQL database, the processed records must be removed from Redis: both the processed entries in the insertBuffer ZSET and the corresponding packageInsert::* records. The Java implementation of both delete operations should be easy to understand:
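A sketch of the two cleanup operations, under the same assumption that the ZSET members are the full packageInsert::* keys:

```java
// 1) ZREMRANGEBYSCORE: drop the processed entries from the insertBuffer ZSET.
redisTemplate.opsForZSet()
        .removeRangeByScore("insertBuffer", Double.NEGATIVE_INFINITY, insertCounter.doubleValue());

// 2) DEL: remove the transferred packageInsert::* records themselves.
redisTemplate.delete(keysToProcess);
```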
Putting it all Together
Let’s put all the pieces from previous chapters together and summarize the data insert process.
- Whenever new data is received via the REST API, it is inserted into Redis.
- For each round of inserts, the insertCounter is atomically increased, and the keys of new records are stored together with the current counter value in the insertBuffer ZSET.
- New record keys and values are saved into Redis as packageInsert::* records. Each row can be easily identified using the record key serialized into a String. The record value is serialized using JSON and contains minimal record data.
- The current thread tries to obtain the distributed lock using the dbWriteLock record in Redis in order to start transferring new records into the SQL database.
- If the distributed lock is acquired, the current thread starts transferring the data into the SQL database. Otherwise, the processing ends, and some other thread will transfer the new data into the SQL database in the future.
- The data is transferred by querying all keys in insertBuffer with a score lower than or equal to the current counter value. Newer records concurrently inserted by other threads are ignored.
- Once the keys are obtained, the data can be extracted and persisted into the SQL database. Thanks to the distributed lock and distributed counter, we make sure concurrency is handled correctly during data transfer.
- The transferred data is then deleted from Redis, both from the insertBuffer ZSET and the packageInsert::* records.
- Finally, the distributed dbWriteLock lock is released so that other threads can transfer new records in the future.
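The steps above can be sketched as a single method; names such as process and persistIntoPostgres, and the JSON value handling, are our own illustrations, not the original implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import org.springframework.data.redis.core.StringRedisTemplate;

public class PackageTransferService {

    private final StringRedisTemplate redis;
    private final String writerId;

    public PackageTransferService(StringRedisTemplate redis, String writerId) {
        this.redis = redis;
        this.writerId = writerId;
    }

    /** Buffers new records in Redis and, if the lock is free, flushes them to PostgreSQL. */
    public void process(Map<String, String> newRecordsJsonByKey) {
        // Steps 1-3: start a new round and buffer the records.
        long round = redis.opsForValue().increment("insertCounter", 1);
        Map<String, String> bulk = new LinkedHashMap<>();
        newRecordsJsonByKey.forEach((key, json) -> bulk.put("packageInsert::" + key, json));
        redis.opsForValue().multiSet(bulk); // MSET
        bulk.keySet().forEach(k -> redis.opsForZSet().add("insertBuffer", k, (double) round));

        // Steps 4-5: only one writer transfers data at a time (SETNX).
        if (!Boolean.TRUE.equals(redis.opsForValue().setIfAbsent("dbWriteLock", writerId))) {
            return; // another writer will pick up the buffer later
        }
        try {
            // Steps 6-7: everything up to the current round is safe to transfer.
            Set<String> keys = redis.opsForZSet()
                    .rangeByScore("insertBuffer", Double.NEGATIVE_INFINITY, (double) round);
            persistIntoPostgres(keys); // hypothetical batch-insert helper

            // Step 8: clean up the transferred records.
            redis.opsForZSet()
                    .removeRangeByScore("insertBuffer", Double.NEGATIVE_INFINITY, (double) round);
            redis.delete(keys);
        } finally {
            // Step 9: release the lock so other writers can run.
            redis.delete("dbWriteLock");
        }
    }

    private void persistIntoPostgres(Set<String> redisKeys) {
        // Read each buffered JSON value and batch-insert it into PostgreSQL
        // (e.g. via JdbcTemplate.batchUpdate); omitted here.
    }
}
```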
This article described how we use Redis as an insert buffer for PostgreSQL in a scenario which requires both high insert performance and fast responses to mobile clients. Redis is used to handle inserts during high load, and new records are inserted into PostgreSQL with a small delay using batch insert operations.
Instead of overloading PostgreSQL with too many insert queries from different clients, the Redis insert buffer allows extremely fast inserts into an in-memory data store and also handles merging redundant records. The new records are inserted into the SQL database asynchronously, one writer thread at a time, so the database engine does not have to spend much time handling lock contention. This way, we give PostgreSQL time to breathe in a high-performance scenario, and other queries can still be performed while data is being inserted, because the database is not overloaded by inserts.
We use Redis for many additional purposes in our antivirus software. For example, we also cache all records to avoid querying the database when it is not absolutely necessary; this topic is already well described in many existing articles, as it is a common use case. We also leverage record expiration in Redis to keep only relevant records in the cache. We consider the insert buffer use case to be the most interesting one, still undiscovered by many developers.
Please let us know about your experience with Redis and PostgreSQL in comments below this article.