Part 1: The beginning

Gaurav Toshniwal · Doubtnut · 5 min read · Jan 31, 2020

Almost all startups have quite modest beginnings, but only time tells how much and how quickly they grow or perish.

We’re serving more than a million users every day, which makes us the most used educational app in the country, multiples ahead of the second most used.

In our case, growth so far has been tremendous and fast: we have grown 30x in the past 12 months. That meant we had to not only adapt to the growing traffic and complexity, but also deliver better products at a quicker pace to reach more users.

Current Stack

Backend App

Our primary APIs are built with Node.js. The app is deployed using pm2 across 2–4 EC2 instances (autoscaling).

Databases

For our primary databases, we use RDS MySQL (AWS’s managed database service). We run a master-slave configuration, with 3–4 read-only slaves replicating from the master. All read queries are routed to the reader endpoint, while write queries go to the writer endpoint.

There’s also one read-only replica dedicated to analytics. It sits under a separate parameter group with longer query timeouts, since analytics queries run far longer than transactional ones.

Caching

There’s a lot of caching that we do on different levels, for which we use Redis.

There are a couple of rules we always follow when using Redis:

  1. Always put a TTL on keys, even near-static ones. Otherwise, years from now, when you’re analysing why Redis memory usage is so high, you’ll find millions of keys that are no longer used but still exist because they were created with no expiry.
  2. For keys consumed by frequent transactions, never clear the cache; always rebuild it. The difference is subtle but important. If you clear the cache, every request that arrives before it is repopulated falls through to the DB, which can cause DB spikes and even downtime, compromising user experience. If you rebuild instead, a separate process overwrites the cached value in place, so the key’s value is never null and transactions never hit the DB.
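The two rules above can be sketched with an in-memory Map standing in for Redis; the key names, TTLs, and the `computeFresh` callback are all illustrative:

```javascript
// Minimal in-memory stand-in for Redis, to illustrate the two caching rules.
const cache = new Map();

function cacheSet(key, value, ttlSeconds) {
  // Rule 1: every key gets a TTL, even near-static ones.
  cache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
}

function cacheGet(key) {
  const entry = cache.get(key);
  if (!entry || entry.expiresAt < Date.now()) return null;
  return entry.value;
}

// Rule 2: refresh hot keys by overwriting them in place. Readers keep
// getting the old value until the new one lands; the key is never absent,
// so no request falls through to the database.
function rebuild(key, computeFresh, ttlSeconds) {
  const fresh = computeFresh(); // e.g. re-query the DB in a background job
  cacheSet(key, fresh, ttlSeconds);
}

cacheSet('homepage:feed', ['v1'], 3600);
rebuild('homepage:feed', () => ['v2'], 3600);
console.log(cacheGet('homepage:feed')); // ['v2'] -- never null in between
```

With real Redis, `cacheSet` maps to `SET key value EX ttl`, and the rebuild job runs on a schedule slightly shorter than the TTL.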

Asynchronous Processing

We use SQS + Lambda (serverless) for a lot of asynchronous processing. The typical flow is that the backend application sends data to SQS, and SQS triggers a Lambda function.
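A hypothetical handler for the Lambda end of that flow might look like the sketch below. SQS invokes the function with a batch of records, each `body` being the JSON the backend enqueued; the payload shape is an assumption. In a real deployment this function would be exported as `exports.handler`:

```javascript
// Process a batch of SQS messages delivered by the Lambda trigger.
const handler = async (event) => {
  const ids = [];
  for (const record of event.Records) {
    const payload = JSON.parse(record.body); // raw SQS message body
    // ...do the actual async work here (send a notification, write a log row)
    ids.push(payload.id);
  }
  // Returning normally tells Lambda the whole batch succeeded; throwing
  // would make SQS redeliver the messages after the visibility timeout.
  return { processedCount: ids.length, ids };
};

handler({ Records: [{ body: '{"id":7}' }, { body: '{"id":8}' }] })
  .then((res) => console.log(res)); // { processedCount: 2, ids: [ 7, 8 ] }
```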

Telemetry

We want to gain as much insight as possible from our running infrastructure and spot points of failure as early as possible. To that end, we have a few dashboard screens (with more on the way). Currently we use Geckoboard, CloudWatch, and Grafana with Prometheus for infrastructure monitoring.

To gain custom insights into key events and failures in real time, we use Grafana with InfluxDB (time-series data). The application sends events to InfluxDB via an SQS buffer; Grafana then turns that data into visualisations and dashboards, and also drives event-based alarms. For example, on a payment failure we push a datapoint named “payment_failed” into InfluxDB. We can then see on the Grafana dashboard how often the event is happening, and configure an alarm such as: if this event occurs more than 10 times in an hour, raise an alert.
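Before a datapoint like “payment_failed” goes onto the SQS buffer, it has to be serialised in a form InfluxDB accepts; the usual wire format is InfluxDB line protocol. The tag and field names below are illustrative, not our actual schema:

```javascript
// Format an event as an InfluxDB line-protocol point:
//   measurement[,tag=value...] field=value[,field=value...] timestampNs
function toLineProtocol(measurement, tags, fields, timestampNs) {
  const tagStr = Object.entries(tags).map(([k, v]) => `,${k}=${v}`).join('');
  const fieldStr = Object.entries(fields)
    .map(([k, v]) => `${k}=${typeof v === 'number' ? v : `"${v}"`}`)
    .join(',');
  return `${measurement}${tagStr} ${fieldStr} ${timestampNs}`;
}

const point = toLineProtocol(
  'payment_failed',           // measurement queried from Grafana
  { gateway: 'upi' },         // indexed tag (hypothetical)
  { count: 1 },               // field value
  1580428800000000000         // nanosecond timestamp
);
console.log(point); // payment_failed,gateway=upi count=1 1580428800000000000
```

(A production formatter would also escape spaces and commas in tag values, which this sketch skips.)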

We also use New Relic quite heavily to measure the health of our APIs as well as the end-user experience (via the Apdex score).
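For readers unfamiliar with Apdex: it scores response times against a configurable threshold T, counting satisfied requests (≤ T) fully, tolerating requests (≤ 4T) half, and frustrated ones not at all. The T = 0.5 s below is an assumed value, not our actual setting:

```javascript
// Apdex = (satisfied + tolerating / 2) / total
//   satisfied:  response time <= T
//   tolerating: T < response time <= 4T
//   frustrated: response time > 4T (contributes 0)
function apdex(responseTimesSec, T = 0.5) {
  let satisfied = 0;
  let tolerating = 0;
  for (const t of responseTimesSec) {
    if (t <= T) satisfied++;
    else if (t <= 4 * T) tolerating++;
  }
  return (satisfied + tolerating / 2) / responseTimesSec.length;
}

console.log(apdex([0.2, 0.4, 1.0, 3.0])); // 0.625
```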

Analytics

For running OLAP queries, we use Redshift. To ship data into Redshift, we use AWS DMS (Database Migration Service).

Redshift can query terabytes of data in a fraction of the time MySQL takes. It’s quickly replacing MySQL for analytics, which also gives us the opportunity to remove analytics-only indices from MySQL tables and hence improve performance.

To collect data at high throughput and low latency, we’re using AWS Kinesis, which helps us deliver analytics data directly to S3 or Redshift.
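One practical detail when producing to Kinesis at high throughput is batching: the PutRecords API accepts at most 500 records per call, and each record needs a partition key to spread load across shards. A sketch of that batching step (the event shape and choice of `userId` as partition key are assumptions):

```javascript
// Batch analytics events for Kinesis PutRecords (max 500 records per call).
const MAX_RECORDS_PER_CALL = 500;

function toKinesisBatches(events) {
  const batches = [];
  for (let i = 0; i < events.length; i += MAX_RECORDS_PER_CALL) {
    batches.push(
      events.slice(i, i + MAX_RECORDS_PER_CALL).map((e) => ({
        Data: JSON.stringify(e),
        PartitionKey: String(e.userId), // spreads load across shards by user
      }))
    );
  }
  return batches;
}

const batches = toKinesisBatches(
  Array.from({ length: 1200 }, (_, i) => ({ userId: i, event: 'video_play' }))
);
console.log(batches.map((b) => b.length)); // [ 500, 500, 200 ]
```

Each batch would then be passed to the AWS SDK’s PutRecords call; a production producer would also retry the per-record failures that PutRecords reports.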

A/B Testing

We’re quite heavy on A/B/N testing and rely on it for almost all our decision making.

A/B tests can be run on the frontend (app) or the backend, and we use a different stack for each.

If an experiment is backend-driven, we use Flagr, an open-source microservice written in Go that makes variant decisions with sub-millisecond latency.

One example of where we use Flagr: distributing traffic across, and measuring the efficiency of, our match-algorithm variants. For every request, Flagr decides, based on its configured distribution, which algorithm to use to match the user’s question against our catalog questions. Turning experiments on and off is now just a few clicks away.
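The key property such a system needs is determinism: the same user must land on the same variant on every request. Below is a simplified illustration of that idea, not Flagr’s actual algorithm (in production the decision comes from Flagr’s evaluation API); the variant names are hypothetical:

```javascript
// Deterministically assign an entity (e.g. a user) to one of N variants by
// hashing its ID into a bucket. Same ID always yields the same variant.
function assignVariant(entityId, variants) {
  let hash = 0;
  for (const ch of String(entityId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return variants[hash % variants.length];
}

const variants = ['match_v1', 'match_v2', 'match_v3'];
console.log(assignVariant('user-42', variants)); // stable for this user
console.log(
  assignVariant('user-42', variants) === assignVariant('user-42', variants)
); // true
```

Sticky assignment like this is what makes the downstream efficiency metrics comparable: each user’s experience stays within one variant for the life of the experiment.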

Microservices

We are just getting started with microservices and are moving forward quite pragmatically. Microservices are amazing when used for the right reasons in the right situations. For example, we created a microservice whose only job is to find question matches given a search string and the algorithm to use. This has let us iterate on Search independently.

Collaboration

Project Management

We use Jira for project management. There are multiple small teams which have their own Jira projects and boards where they collaborate.

Scrum/Sprints

Scrum is a valuable technique and works well in the software development domain. The idea is to plan well and in advance, avoiding changes later (which are the most expensive). Sprints also give people control over their own time: as long as I deliver what I committed to in the sprint, it doesn’t matter how or when I do it.

Weeklies

We encourage collaboration by sharing ideas. We hold weekly Backend, Web, and Android discussions where the teams sit together to talk through improvement and optimisation opportunities.

Talk to an Expert

Of course, not all knowledge lies within the organisation; there’s far more outside it that we can learn from. So we invite experts to come and speak to the team. We recently hosted one such talk by an Engineering Manager at LinkedIn (HQ).

Learning

There’s no better investment than learning. We strongly encourage and support learning. We have a growing library of Software Engineering books and also encourage people to take up courses that the company will pay for.

What does the future hold?

DB Bottleneck

Of course the DB will eventually become a bottleneck, so we are in the process of moving as much as possible away from it: offloading work to async processing (with bulk inserts) and caching data wherever we can.

Eventually, we’ll need to start sharding data across multiple clusters.
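What sharding could eventually look like, in its simplest form, is routing each user’s rows to one of several MySQL clusters by a stable function of the user ID. The shard hostnames below are placeholders, and modulo-based routing is only the starting point:

```javascript
// Hypothetical shard endpoints -- placeholders, not real hosts.
const SHARDS = [
  'shard-0.db.internal',
  'shard-1.db.internal',
  'shard-2.db.internal',
  'shard-3.db.internal',
];

function shardFor(userId) {
  // Modulo on a numeric ID keeps the mapping stable while the shard count
  // is fixed; growing the cluster later would need consistent hashing or a
  // directory service, since plain modulo remaps most keys on resize.
  return SHARDS[userId % SHARDS.length];
}

console.log(shardFor(7));  // shard-3.db.internal
console.log(shardFor(12)); // shard-0.db.internal
```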

Microservices

We plan to create a cookiecutter template for microservices, so that spinning up a new microservice becomes a breeze and every service comes with automated deployments and testing from the get-go.

RFCs

We are going to adopt RFCs as a prelude to any significant project, so that before we jump into development we already have multiple perspectives and options on how to do something, and can choose the best way forward.

AI and ML

There are 3 broad areas where we know that AI and ML will give us huge wins:

  1. Vision and OCR
  2. Search Relevance
  3. Recommendations

Naturally, we are quite bullish on these areas and are doubling down on our AI and ML investment.

Also, we are hiring.

Do check the open positions here.
