Aasaanjobs is a job listing platform for blue-collar jobs, and like many other listing platforms, ours is a read-heavy system. About 96% of the API calls we receive are GET requests, and as the business grows, the database schema becomes more complicated: more fields are added, more table relationships are added, and the APIs get slower.
We heavily use Django & Django REST Framework (DRF) with PostgreSQL as the database in our backend services, which helps us quickly build robust APIs. We rely heavily on a SQL database for its ACID properties, and many of our major business entities (like Applications and Invoicing) are modelled as Finite State Machines. However, when fetching models with nested relationships, we encounter huge performance issues as the database size increases.
Hence, we came up with a solution somewhere in between: sync our major entities (like Jobs, Companies, etc.) into Elasticsearch using PostgreSQL functions.
Elasticsearch is a distributed, open-source search and analytics engine for all types of data, built on Apache Lucene. For more info, refer to the official Elasticsearch website.
We are using Elasticsearch 6.x for this article.
What are PostgreSQL Functions (PG Functions)?
A PostgreSQL function, or stored procedure, is a set of SQL and procedural statements (declarations, assignments, loops, flow of control, etc.) that are stored on the database server and can be invoked through the SQL interface. It lets you carry out operations that would normally take several queries and round trips in a single call within the database.
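As a minimal sketch, such a function can assemble a job and its nested relations into a single JSON document in one round trip; the table and column names here are assumed for illustration, not taken from our actual schema:

```sql
-- Hypothetical schema: job(id, title, company_id), company(id, name).
-- Builds the full nested document for one job in a single query.
CREATE OR REPLACE FUNCTION job_document(job_id integer)
RETURNS json AS $$
    SELECT json_build_object(
        'id',      j.id,
        'title',   j.title,
        'company', json_build_object('id', c.id, 'name', c.name)
    )
    FROM job j
    JOIN company c ON c.id = j.company_id
    WHERE j.id = job_document.job_id;
$$ LANGUAGE sql STABLE;

-- Invoked with a single round trip:
-- SELECT job_document(42);
```

The JSON returned by such a function is exactly what gets indexed into Elasticsearch, so the database does the joining work once instead of the ORM doing it per request.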
All POST, PATCH, or PUT APIs now, along with updating data in the database, need to synchronize that data with Elasticsearch. This can be achieved with the help of Django’s transaction module: first the database changes are made, then the document is indexed into Elasticsearch; if the latter fails, the database changes are rolled back as well.
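The rollback behaviour is the important part, so here is a self-contained sketch of the write path. The `atomic` context manager and `index_document` stub are stand-ins for `django.db.transaction.atomic()` and an Elasticsearch index call, simulated with plain Python so the pattern is visible without a running database:

```python
from contextlib import contextmanager
import copy

# In-memory stand-in for the relational database.
DB = {"jobs": {}}

@contextmanager
def atomic(db):
    # Mimics django.db.transaction.atomic(): snapshot state,
    # restore it if anything inside the block raises.
    snapshot = copy.deepcopy(db)
    try:
        yield db
    except Exception:
        db.clear()
        db.update(snapshot)
        raise

def index_document(doc, fail=False):
    # Stand-in for the Elasticsearch index call (hypothetical helper).
    if fail:
        raise ConnectionError("Elasticsearch is unreachable")

def create_job(job_id, payload, es_down=False):
    try:
        with atomic(DB):
            DB["jobs"][job_id] = payload           # 1. write to the database
            index_document(payload, fail=es_down)  # 2. index into Elasticsearch
        return True
    except ConnectionError:
        return False  # the database write was rolled back

create_job(1, {"title": "Driver"})               # succeeds: job persisted
create_job(2, {"title": "Cook"}, es_down=True)   # fails: job rolled back
print(sorted(DB["jobs"]))  # [1]
```

If Elasticsearch rejects the document, the job never lands in the database either, so the two stores cannot silently diverge on writes.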
For benchmarking the above architecture, let’s take the example below.
You can find the full implementation of the above here:
For the purpose of benchmarking, we exposed the following APIs:
GET /api/job/elastic/ lists jobs as per the architecture described above.
GET /api/job/database/ lists jobs using standard DRF serializers.
POST /api/job/database/ posts a job with the standard DRF create view.
POST /api/job/elastic/ posts a job with our overridden create view, which indexes the job document along with inserting it into the database.
We will be using the Apache Bench (ab) tool to load-test our APIs.
First, we execute 500 GET calls to the above URLs with a concurrency of 10.
Now, let’s see how much of a performance drop we get after introducing synchronization of Elasticsearch with the database. We execute 100 POST calls with a concurrency of 10.
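The ab invocations look like this; the host and port are illustrative, and `job.json` is a hypothetical file holding a sample job payload:

```shell
# 500 GET requests, concurrency 10, against both read paths:
ab -n 500 -c 10 http://localhost:8000/api/job/elastic/
ab -n 500 -c 10 http://localhost:8000/api/job/database/

# 100 POST requests, concurrency 10:
ab -n 100 -c 10 -p job.json -T application/json http://localhost:8000/api/job/elastic/
ab -n 100 -c 10 -p job.json -T application/json http://localhost:8000/api/job/database/
```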
From the above benchmarks it is quite evident that the GET APIs see a significant performance improvement (around 690%), while the POST APIs slow down (around 42%). As mentioned before, since ours is a read-heavy system, the drop in POST performance is negligible compared to the massive improvement in the GET APIs.
Now, let’s list some obvious pros of this architecture, along with some cons.
- Huge read performance improvement.
- Since Elasticsearch provides a robust filtering, querying and sorting DSL, we don’t need to build any filtering or search backend for the APIs.
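For instance, filtering, full-text search, and sorting can all be expressed directly in the Elasticsearch query DSL, so no custom filter backend is needed. The field names (`title`, `city`, `salary`, `created_at`) are illustrative, and the commented-out `es.search` line is the standard elasticsearch-py client call:

```python
# A combined search / filter / sort request in the Elasticsearch query DSL.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "delivery"}}],       # full-text search
            "filter": [
                {"term": {"city": "Mumbai"}},                 # exact match
                {"range": {"salary": {"gte": 15000}}},        # numeric filter
            ],
        }
    },
    "sort": [{"created_at": {"order": "desc"}}],              # newest first
}

# With an elasticsearch-py client instance `es`:
# results = es.search(index="job", body=query)
```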
- Technical debt. For example, to add a field to the Job table, you now need to declare it in the Django model, the corresponding PostgreSQL function, and the Elasticsearch mapping.
- Managing the Elasticsearch cluster. Of course, as your scale increases, a lot of services get introduced into the infrastructure, but if you are deploying a small-scale service, managing an Elasticsearch cluster (and it is no joke to manage one) becomes an overhead.