Koo Engineering — Organised Chaos

Phaneesh Gururaj · Published in Koo App · Jun 14, 2022

Koo is solving one of the most important and interesting problems for India.

Getting folks from all walks of life to participate in and embrace a thoughts-and-opinions platform across all languages — folks:

  • who are comfortable in their mother tongue.
  • who don’t have to worry about what other folks think when they comment or post.
  • who can relate to the content and the language the community speaks.
  • who can connect with other like-minded folks.
  • who can follow the content and people they want to.

Basically, language should not be a barrier — this is our core mission as well. We are bringing in a shift, creating a product that lets the broader Bharath user base experience, as first-class citizens, what is currently available to English audiences across the globe.

As an engineer, Koo as a product/platform offers a huge canvas for folks who want to learn and build features across a wide spectrum — personalisation, search, video engineering, data platforms, chat, algorithms, mobile apps, graphs, NLP, scale, unique problem statements around language translation, fraud, trending topics, feed ranking, mobile UX and much more (and, as we venture into Web 3.0 — NFTs, blockchain, decentralised systems, etc.). The right technology stack is used to solve each of these problems; as an engineer one is exposed to various data stores, platforms, languages and frameworks — the right tool for the right problem.

In this blog, we will talk about

  • Koo’s Engineering Culture → organised chaos.
  • A few core problems we are solving → why, how.
  • How to do more with less → infra cost, engineering bandwidth.

Organised Chaos

As start-ups grow — in users, volumes and employees — many engineering teams grapple with:

  • are we doing it the right way?
  • how fast can we ship these features — the MVP?
  • if feature A scales, what circuit breakers are in place so that the main flow isn’t affected?
  • how can we ensure that all critical discussions, architecture thoughts and other nuances are well captured and broadcast across all teams?
  • are we leaving behind a large tech debt?
  • do we have the right folks with the attitude and aptitude for the problem?

To arrive at a well-rounded solution we need to look at these problems holistically rather than in isolation. We cannot have a master plan that covers the whole spectrum of scenarios and start over-engineering; business demands are at times as chaotic as scaling a monolith. The mindset that is essential in engineering teams during this phase is more of:

  • come with an open mind.
  • be ready to write throwaway code at times, till we are sure we need to iterate more OR it has hit some scale.
  • the heart of your code — the part that does the magic and is mostly the IP — is where one needs to invest more. Things on the fringe can move fast. E.g. in social → feed and graph are the core; in travel → inventory freshness is core; in payments → the transaction flow is core; in eCommerce → inventory + personalisation is core; and so on.
  • invest in battle-tested frameworks and data stores that scale well with your growth.
  • spend time choosing a framework → marry the scenarios (user flows and use cases) with a few positive/negative cases to see that it is watertight before choosing it.
  • share flow diagrams, flow charts and simple documentation across teams so that visualisation is easier and folks understand the flow and context across systems.

We call this SDLC process organised chaos. Simply because it is chaotic at times; but all of us know what we are getting into and have the attitude/skills to get past it. Of course, as we move forward this might not sustain, and we will get into a more mature process that demands the right time investment and mindshare. Currently, we are somewhere in the middle, moving towards a mature SDLC.

Core problems we are solving

Here we will try to articulate some of the interesting problems (1–5) we are solving at Koo — why, how and what we measure. Part 2 will cover the rest.

  1. Search
  2. Feed — Content
  3. Feed — People
  4. Graph — social graph / connections
  5. On-boarding experiments
  6. Topics
  7. Trending Hashtags
  8. Video / Image Uploads
  9. Notifications
  10. Location — Near By
  11. Language Translation

Search

Koo — search

Search is our discovery platform, used by users and media houses to look for original and interesting content (koos) and to search for people. One notable thing about our search traffic is that many folks search in vernacular languages as well.

why → discover new users to follow and new content

how → Elasticsearch

  • we index all information about our users: handle, name, headline, etc.
  • we transliterate these fields into the other languages available on our platform so that we can enable vernacular search (a rough indexing sketch follows this list).
  • we index content (koos).
  • we enrich the koos and tag them with topics (NERs) so as to enable people to search within Koo.
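
As a rough sketch of that indexing path — the index name, field names and transliterated variants below are illustrative assumptions, not our actual mapping — using the official Elasticsearch Python client:

```python
# Illustrative only — index and field names are hypothetical, not Koo's live schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

user_doc = {
    "handle": "virat.kohli",
    "name": "Virat Kohli",
    "headline": "Cricketer",
    # transliterated variants (generated upstream) power vernacular search
    "name_translit": {
        "hi": "विराट कोहली",
        # ... one entry per language available on the platform
    },
}

# Index (or re-index) the profile; a stable id keeps updates idempotent.
# On older 7.x clients the keyword argument is body= instead of document=.
es.index(index="users", id="user-123", document=user_doc)
```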

top metrics → we measure

  • searches that return 0 results
  • the position of the results that get clicked → quality of our results

technical metrics →

  • 3 master + 10 data node Elasticsearch cluster
  • Peak indexing rate / m → 15K [definition : The number of indexing operations per minute. For example, a single call to the _bulk API that adds two documents and updates two counts as four operations, plus four additional operations per replica. Document deletions do not count towards this metric].
  • Search rate / m → 6K [definition : The number of search operations per minute for all shards in the cluster. For example, a single call to the _search API that returns results from five shards counts as five operations.]
  • Search Latency → 20ms

We are just getting started on search and plan to introduce a lot of interesting techniques — take a peek into analysers on ES. We plan to:

  • enrich our ES database with other data points and labels generated by the ML team.
  • translate and transliterate content to improve results for our vernacular users (e.g. when a user in Hindi searches for विराट कोहली, we should be able to surface content generated across all other languages) — a query sketch follows this list.
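
A hedged sketch of what such a cross-language lookup could look like — a multi_match over the original and transliterated fields (indices and field names are assumptions carried over from the indexing sketch above):

```python
# Illustrative only — indices and fields are assumptions, not the live schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="users,koos",
    query={
        "multi_match": {
            "query": "विराट कोहली",
            "fields": ["name", "name_translit.*", "content", "content_translit.*"],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name") or hit["_source"].get("content"))
```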

Onboarding Experiments

In our onboarding flow, we try to understand our users’ preferences so that we can curate a strong personalised feed for them post their selection.

onboarding flow

why → Onboarding users is a challenging product problem especially for a category building product. It is always about fine tuning things to give users what they are looking for, at every step.

how → adding the right UX elements and information architecture, strong analytics and, more importantly, a robust experimentation set-up on both the client side and the server side.
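
For flavour, a minimal sketch of deterministic bucketing for such experiments — the experiment name and bucket labels are made up for illustration, not our actual framework:

```python
# Illustrative deterministic bucketing for server-side experiments.
import hashlib

def variant(user_id: str, experiment: str, buckets=("control", "new_onboarding")) -> str:
    """Hash user + experiment so a user always lands in the same bucket."""
    h = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return buckets[h % len(buckets)]

print(variant("user-123", "onboarding_topics_v2"))
```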

metrics →

  • completion of the funnel % at every step
  • click and select events
  • post onboarding, how strong the engagement of these users is

challenge →

  • we need to quickly create a feed for first-time users who have expressed some interests. This is challenging from a compute standpoint as we need to fetch the latest koos

a) across users they follow

b) content / topics / interests they like

We have a strong caching framework built on top of Aerospike. The keys in this cache infra are carefully curated with user, user_lang, topic, vip, etc. so that our overall response time is fast. More details here
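
A rough illustration of the idea, assuming the official Aerospike Python client — the namespace, set, key layout and bin names below are hypothetical, not our production schema:

```python
# Illustrative sketch — namespace, set, key layout and bin names are hypothetical.
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Composite key (language + topic) keeps onboarding-feed lookups narrow and fast.
key = ("cache", "onboarding_feed", "hi:cricket")

# Write a precomputed list of recent koo ids with a short TTL.
client.put(key, {"koo_ids": [101, 102, 103]}, meta={"ttl": 300})

# Read path during onboarding: one low-latency get per selected interest.
_, _, bins = client.get(key)
print(bins["koo_ids"])

client.close()
```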

Feed Content

Feed Content is the first screen of the app. When a logged-in user opens the app, the home feed is the first thing they see. Our home feed comprises koos from the user’s followings and other activity (likes, comments, rekoos) on koos.

why → to show a user the activity of the people they follow

how → Aerospike and Graph DB

  • The home feed is a re-ranked version of the user’s chronological timeline feed, so that we surface more relevant content at the top since the time they were last active on the app.
  • We generate it at runtime for a user when they are eligible for the rank feed.
  • The rank feed is basically a computation over the user’s timeline feed based on signals and weightages (a simplified sketch follows this list).
  • To avoid latency when fetching the signal values, we use high-read-throughput, low-latency data stores.
  • Mainly we use Aerospike and ArangoDB (a graph DB).
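
A very simplified sketch of that computation — the signal names and weights below are invented for illustration; in production the signal values come from Aerospike/ArangoDB rather than being inlined:

```python
# Toy re-ranking of a chronological timeline; signals and weights are made up.
WEIGHTS = {"likes": 1.0, "comments": 2.0, "rekoos": 3.0, "age_hours": -0.1}

def score(koo: dict) -> float:
    # Weighted sum of engagement signals, penalised by age.
    return sum(weight * koo.get(signal, 0) for signal, weight in WEIGHTS.items())

timeline = [
    {"id": 1, "likes": 10, "comments": 2, "rekoos": 1, "age_hours": 5},
    {"id": 2, "likes": 50, "comments": 8, "rekoos": 4, "age_hours": 30},
]

rank_feed = sorted(timeline, key=score, reverse=True)
print([koo["id"] for koo in rank_feed])  # higher-scoring koos surface first
```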

Top metrics we measure → reactions and TSOA (time spent on app)

  • All the content the user is likely to like should be on the first page of the feed.

Technical Metrics →

  • 5 node Aerospike Cluster
  • 3 node ArangoDB Graph Cluster
  • Peak write throughput on Aerospike → 75K TPS (75,000 write operations per second)
  • Peak read throughput on Aerospike → 10K TPS
  • Latency → 150 ms (average)

We have just started on the rank feed. Some plans the team is looking to execute next:

  • Introduce an ML model to modify the weightages based on user inputs/interests.
  • Use UDFs in Aerospike to compute the rank feed at the Aerospike server layer rather than the application layer.
  • Integrate Aerospike with big data platforms (Snowflake, Pinot, etc.) to create data pipelines for more analytical and real-time, event-driven applications.

Social Graph

This system houses our social graph, i.e. the connections among people and groups. Whether you like watching content from someone, follow someone, view content in other languages or show interest in a topic — all this information is kept in our social graph.

why → to understand a user’s affinity towards other users, languages and topics, and use it to improve their experience on the platform.

how → ArangoDB

  • We store our social graph as a property graph using ArangoDB.
  • All the entities are stored as vertex documents.
  • The relations between the entities are stored as edge documents.
  • Updates to the social graph are asynchronous; we do periodic cleanup/reconciliation to maintain data consistency across the stores.
  • To speed up AQL queries we use ArangoSearch Views and primary sort.
  • We use satellite graphs to do optimised joins in a clustered setup.
  • We maintain multiple replicas of the social graph for different use cases and SLAs, e.g. point lookups, graph traversals, iterative graph processing/Pregel (a small traversal sketch follows this list).
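
For flavour, a minimal python-arango sketch of a two-hop traversal over a named graph — the database, graph and attribute names are assumptions, not our actual schema:

```python
# Illustrative only — database, graph and attribute names are assumptions.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("koo", username="reader", password="secret")

# Who do my followings follow? Count the distinct paths to each candidate.
aql = """
FOR v IN 2..2 OUTBOUND @start GRAPH 'social'
  FILTER v._id != @start
  COLLECT candidate = v WITH COUNT INTO paths
  SORT paths DESC
  LIMIT 20
  RETURN { user: candidate.handle, mutual_paths: paths }
"""
for row in db.aql.execute(aql, bind_vars={"start": "users/123"}):
    print(row)
```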

Top Metrics →

  • Incoming/outgoing connections.
  • Influencer nodes — nodes which are connected with a lot of other nodes.
  • Query latency for intersections and ranking.
  • Traversal time per level

Technical Details

  • A 3-node enterprise cluster hosted on high-I/O compute instances.
  • Several community clusters for analytical workloads like iterative graph processing.
  • Data is persisted on EBS gp3 volumes with provisioned IOPS for predictable performance.
  • A Spark connector is used to run graph analytics on ArangoDB data.

Throughput

  • Peak ingestion → 72K RPM
  • Reads → 150K RPM

Latency

  • General lookups → ~20 ms
  • Traversal queries → 100–500 ms

The social graph has given us rich query/traversal capabilities at a linear scale and reduced operational complexity. It’s especially useful for understanding our highly connected data.

There is a lot of interesting work that we plan to do in the future:

  • Build a knowledge graph which can answer generic & complex questions about the network.
  • Community Detection — Applying various clustering techniques to understand the interconnectedness in the graph and find communities and hidden patterns.
  • Run graph analytics to surface trends and sharpen our recommendations with near real time data.

People Feed

Users trying to discover folks around a particular interest or profession is quite a common thing in a social network like Koo. Our people recommendation feed kicks in to help users easily find people to follow. It manifests as:

  • Profession carousel (sports, media, politicians etc ..)
  • RFY (recommended-for-you carousel — a sharper list)
  • Location based carousel (nearby)

why → discovery of creators for a user, so that they can subscribe to relevant (personalised) content.

how → ArangoDB and Postgres

  • In ArangoDB we store the documents of creators with all the necessary attributes that can be used as filters at runtime.
  • Before serving people, one of the challenges is negating those already followed/blocked and other unwanted people — which is powered by Arango again (a rough query sketch follows this list).
  • Our ML team runs models that compute the user-specific carousel (RFY), and data pipelines are set up in the flow to consume the data from the ML model and eventually show the recommendations to the consumer.
  • Once the user comes to the app, depending on the app flow we fetch all types of carousel data from different sources.
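
A hedged sketch of that negation step expressed as AQL — the collection and edge names are illustrative, not the live schema:

```python
# Illustrative only — collection and edge names are assumptions.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db("koo", username="reader", password="secret")

aql = """
LET excluded = (
  FOR v IN 1..1 OUTBOUND @user follows, blocks
    RETURN v._id
)
FOR creator IN creators
  FILTER creator.profession == @profession
  FILTER creator._id NOT IN excluded
  SORT creator.follower_count DESC
  LIMIT 20
  RETURN { handle: creator.handle, followers: creator.follower_count }
"""
for row in db.aql.execute(aql, bind_vars={"user": "users/123", "profession": "sports"}):
    print(row)
```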

The next set of optimisations/improvements the team is working on for the people feed is as follows:

  • Improve the RFY model data flow to make recommendations fresher.
  • Migrate all the people feed data from the shared DB to a true microservice-based architecture.
  • Use ArangoDB graph functionality to go deeper into the user network; enrich the graph data with more signals and weights.
  • Improve the friends/contacts-based flow for inviting/suggesting users and get it powered via ArangoDB.

Do more with less

In this section, we will talk a bit about cost. All the engineering magic and user delight come with infra cost. Engineering teams, along with DevOps, can label their infra, track it, optimise it, etc. — these are well-established BAUs that many teams follow. However, there should also be one driving metric that makes sense as a unit cost.

We have developed a simple formula to track this.

CMUS — cost per million user-seconds

CMUS = ICU / (TSOA * DAU/1,000,000)

ICU → Infra Cost

TSOA → average time spent on the app per user (in seconds)

DAU → Daily active users
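
A tiny worked example with made-up numbers — the cost and DAU figures are illustrative, not Koo’s actual figures — just to make the units concrete:

```python
# CMUS with illustrative numbers.
def cmus(infra_cost: float, tsoa_seconds: float, dau: int) -> float:
    """Cost per million user-seconds spent on the app."""
    return infra_cost / (tsoa_seconds * dau / 1_000_000)

# e.g. 50,000 (cost units) of infra, 600 s average time spent, 1,000,000 DAU
print(round(cmus(50_000, 600, 1_000_000), 2))  # -> 83.33
```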

As we optimise infra and build new things simultaneously, we try to keep a tab on this number. It gives us an idea of how efficiently we are utilising our infra month on month, and at the same time how the business metrics are playing out. It is still a bit early to say what this number should ideally be, but we know we are building efficiently as long as the numbers don’t increase linearly and CMUS remains relatively within bounds.

As a closing thought, engineering should be fun — period!

