Koo Foo Bar

Phaneesh Gururaj · Published in Koo App · 13 min read · Oct 30, 2023

A lot of action has happened at Koo this year — 2023. We have had a few challenging moments, hair-raising events and hair-pulling moments too. Our engineering team (currently about 45 people) has been battle-tested well, and every individual has packed in a few years' worth of growth, purely based on the problem statements, constraints, outcomes and overall tech dope they had to soak in.

Some background and context leading to 2023

We ended 2022 on a high with our entry into the Brazil market, clocking 1 million new users on our platform within 48 hours of launch. Those 2 days merit a blog of their own. Our tech stack at that time was good (I would not say the best optimised one). We were on AWS and using:

  • Kubernetes — a decent cluster size of around 2,000–2,400 cores managing roughly 200 services.
  • Our core back-end was built on Kotlin (Java stack).
  • Aurora (PostgreSQL) — this housed our main DB, which consisted of all our core domain objects (Koos → content, Reactions → likes and comments, Users, Topics).
  • ArangoDB — graph; the social graph mapping the follow/follower network of our ~60 million users and ~1 billion edges lived in this data store.
  • DynamoDB — content tables, reaction stores and registration systems.
  • Aerospike cluster — caching infra; this was in good shape, our main feed runs off this store, and it was well architected for horizontal scale.
  • Data Platform — this was evolving and we were mapping the pipelines onto Apache Hudi. The ML pipelines, Spark jobs and EMR workloads ran on top of this.
  • Android App — we had a few issues around performance and jank during scrolling.
  • ML and model building — we had tackled many use cases in the NLP space: topic classification, feed recommendation, algorithms for trending, etc.
  • Video engineering — a good part of our content is videos. Transformation and analytical pipelines were in place to a good extent, but came at some cost.

The onset of 2023 demanded that we:

  • build a world-class trust / safety platform that is unmatched.
  • get the social graph ready for high scale when millions of users arrive within a short window.
  • smoothen out the ML and Data Engineering interface and make it easier for Data Scientists / ML teams to use the data for model building and training.
  • optimise costs across our infra at all levels.
  • strengthen the data platform for larger capacity and mutation of data sets, and control the TCO (total cost of ownership).
  • stop the bots and improve our overall WAF configuration.
  • build the monetisation rails / features → ads, premium content, boosted profiles and super likes.
  • build retention and growth features → jackpot, gamifying the overall experience by bringing value to the users.
  • build new experiences like stories, sharper recommendations, news content and self-verification.
  • build better video engineering pipelines → compression, conversion to various formats using ffmpeg open-source libs, and video content embeddings for recommendations.

This pipeline itself was daunting, looking at the resources we had and the overall macro scenario. But as an engineering unit, we looked at these things as a once-in-a-lifetime opportunity to get our hands dirty and actually build, heads down.

As we went into executing these things, we had to address the elephant in the room — infra costs had to be optimised and overall TCO had to be reduced. We took a couple of calls with the long term in mind on how to position ourselves on the TCO front. After doing some intense cost projections, vendor negotiations and growth estimations, we concluded two things:

  1. Use a multi-cloud set-up and migrate workloads across vendors as necessary.
  2. Migrate our WAF and CDNs from Cloudflare to Akamai.

Yes, we did all the items planned, plus the migration! Broadly, for the migration we thought about:

  • how to migrate fast
  • which layers to migrate first
  • DB migrations → the most complex of all

In the next part of this blog, we will deep dive into the execution of these items and the tech stack.

Trust / safety platform, Content Moderation

We have a detailed post on the objectives of this subject and what we have in place here. To accomplish this,

  • Every piece of content — be it an image, video, koo, profile name or profile image — had to be scanned for bad/NSFW content. The accuracy needed to be very high.
  • Every text content had to be scanned for foul words.
  • News — we had to fact-check with trusted APIs, backtrack and remove fake content.

All of this had to be accomplished across languages and almost in real time, to ensure engagement and other metrics weren't affected across the platform.

Tech Stack → Kafka and Flink, as we needed to do this in near real time. Every message passed through this pipeline, where we scanned it against various rules and updated the corresponding status on the resource. To facilitate this we needed a strong event-driven framework with high consistency and no drops. The Kafka <-> Flink combination worked well.
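
As a rough illustration of this flow, here is a minimal PyFlink sketch. The topic name, the foul-word rule and the flag payload are all hypothetical; the real pipeline applies many more rules (NSFW image/video models, fact-check lookups) and writes the statuses back to the relevant stores.

```python
import json

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

FOUL_WORDS = {"badword1", "badword2"}  # placeholder word list

def moderate(raw: str) -> str:
    """Attach a moderation verdict to every content event."""
    event = json.loads(raw)
    text = (event.get("text") or "").lower()
    flags = [w for w in FOUL_WORDS if w in text]
    event["moderation"] = {"status": "flagged" if flags else "clean", "flags": flags}
    return json.dumps(event)

env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_topics("content-events")            # hypothetical topic name
    .set_group_id("moderation-pipeline")
    .set_starting_offsets(KafkaOffsetsInitializer.latest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

(
    env.from_source(source, WatermarkStrategy.no_watermarks(), "content-events")
    .map(moderate, output_type=Types.STRING())
    .print()  # stand-in sink; the real job updates the resource status downstream
)

env.execute("content-moderation-sketch")
```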

Social graph

Any social platform needs to understand the user network (topics of interest, follower networks, groups, reacting vs. passive users, creators) and the various relationships between these entities. Using this rich and dense network enables the platform to:

  • build strong recommendation systems
  • target ads
  • prioritise notifications for new content

The previous version was built on a simple data model on PostgreSQL, and the architecture served the purpose well until a point in time. This model started breaking when some of our VIP users gained very high follower counts (> 100K). This caused massive load on our DB due to the query patterns, updates and other mutations. The underlying TOAST tables in PostgreSQL were frequently being vacuumed, locking the main tables and causing many issues. We had identified this issue and moved to ArangoDB, which worked well — again, until a point in time.

What we realised is that the graph component had two distinct data / access patterns:

  1. real-time updates to the following network
  2. OLAP use cases to traverse the graph for classification (say → give me all users from Bangalore who are active in the last 30 days and following Indian Politics as a topic and are separated from my 1st degree network by 1 hop)

Understanding this was critical. We built separate systems to handle each pattern and eventually moved back to PostgreSQL with an efficient and highly scalable data model. For the OLAP use cases, we built pipelines to move the data to S3 and offloaded those query patterns to the data platform.
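
To make the real-time side concrete, here is a minimal sketch of the kind of edge-table model this implies on PostgreSQL, using psycopg2. The table, column names, DSN and pagination scheme are illustrative assumptions, not our actual schema.

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS follow_edges (
    follower_id BIGINT NOT NULL,
    followee_id BIGINT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (follower_id, followee_id)
);
-- reverse index so "who follows X" is as cheap as "whom does X follow"
CREATE INDEX IF NOT EXISTS idx_follow_edges_followee
    ON follow_edges (followee_id, follower_id);
"""

def follow(conn, follower_id: int, followee_id: int) -> None:
    # idempotent edge insert; a real system would also maintain cached counters
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO follow_edges (follower_id, followee_id) "
            "VALUES (%s, %s) ON CONFLICT DO NOTHING",
            (follower_id, followee_id),
        )
    conn.commit()

def followers_page(conn, followee_id: int, after_id: int = 0, limit: int = 100):
    # keyset pagination keeps VIP accounts (100K+ followers) cheap to page through
    with conn.cursor() as cur:
        cur.execute(
            "SELECT follower_id FROM follow_edges "
            "WHERE followee_id = %s AND follower_id > %s "
            "ORDER BY follower_id LIMIT %s",
            (followee_id, after_id, limit),
        )
        return [row[0] for row in cur.fetchall()]

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=social")  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    follow(conn, follower_id=1, followee_id=2)
    print(followers_page(conn, followee_id=2))
```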

Data Platform — Apache Iceberg

While building our data platform we aimed for two key objectives:

  1. Establish a unified data store serving as both a data warehouse and datalake. This minimizes data duplication, streamlines engineer onboarding, and mitigates data inconsistency issues stemming from maintaining multiple data copies.
  2. Implement a transparent system that manages the intricacies of data platforms out of the box, including partitioning, schema evolution, and transactional operations.

Iceberg has emerged as an industry-standard solution adept at addressing these challenges. Initially, we utilized Apache Hudi for our transactional datalake, which performed well for many use cases but lacked support for crucial requirements like schema and partition evolution. In a dynamic environment like Koo, this rigidity prompted us to switch to Iceberg. It also gave us good leverage as both AWS and GCP offer native Iceberg support in their ecosystems.
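
The sort of operations that pushed us towards Iceberg look roughly like the PySpark sketch below. The catalog name, warehouse path and table are assumptions for illustration; the point is that schema and partition evolution are metadata-level DDL rather than table rewrites.

```python
from pyspark.sql import SparkSession

# Assumption: a Hadoop-style Iceberg catalog named "koo" backed by an S3 warehouse.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.koo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.koo.type", "hadoop")
    .config("spark.sql.catalog.koo.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# A partitioned Iceberg table for engagement events (illustrative schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS koo.analytics.engagement_events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_type STRING,
        created_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")

# Schema evolution: adding a column is a metadata-only change.
spark.sql("ALTER TABLE koo.analytics.engagement_events ADD COLUMNS (country STRING)")

# Partition evolution: move from daily to hourly partitions without rewriting old data.
spark.sql("""
    ALTER TABLE koo.analytics.engagement_events
    REPLACE PARTITION FIELD days(created_at) WITH hours(created_at)
""")

# Row-level mutations (deletes/updates/merges) are transactional in Iceberg.
spark.sql("DELETE FROM koo.analytics.engagement_events WHERE event_type = 'bot_flagged'")
```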

Monetisation Platforms

Rewards, Jackpot — monetisation flows

This is our money-making product. Per our business plan this was slated for a bit later, and our focus had always been on how to grow big. With the macro scenario changing, we had to pivot immediately towards executing monetisation features. The product and business teams were fast in narrowing down how to approach this and what to build. We introduced the following on our platform in record time:

  1. Ads → GAM integration across our videos
  2. Jackpots → The reward centre on our platform incentivises users for their activity on the platform; we act as an intermediary, bringing in brands and passing their rewards on to users. This involved segmenting users based on the activity they perform. We built a sophisticated algorithm (almost an Operations Research problem) to optimise for users vs. retention. To experiment quickly with the algorithm and its various avatars, we needed a strong data platform to capture impressions and activities and to suggest the payout for every jackpot run, in real time, for a million+ users.
  3. Premium Content → We have a very healthy number of creators who keep putting unique content on Koo. The plan was to give them a platform whereby:
  • the creators keep putting interesting and unique content on Koo
  • we bring in users to subscribe to their content across various plans, which the creators can configure (all self-serve)
  • we build teasers about their content and promote them
  • we secure the paid content
  • we pass on the earnings to the creators
  • we grow the creators' follower base

These features needed a strong OLTP system and we used PostgreSQL.
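
For a sense of what that OLTP layer can look like, here is a deliberately simplified, hypothetical PostgreSQL schema for creator plans and subscriptions, with a single access check. The tables, columns and currency are illustrative only, not our production model.

```python
import psycopg2

# Hypothetical, simplified schema for creator plans and subscriptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS creator_plans (
    plan_id     BIGSERIAL PRIMARY KEY,
    creator_id  BIGINT NOT NULL,
    price_inr   NUMERIC(10, 2) NOT NULL,
    period_days INT NOT NULL,                  -- e.g. 30 for a monthly plan
    is_active   BOOLEAN NOT NULL DEFAULT TRUE  -- creators self-serve enable/disable
);

CREATE TABLE IF NOT EXISTS subscriptions (
    subscription_id BIGSERIAL PRIMARY KEY,
    plan_id         BIGINT NOT NULL REFERENCES creator_plans (plan_id),
    subscriber_id   BIGINT NOT NULL,
    starts_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at      TIMESTAMPTZ NOT NULL,
    UNIQUE (plan_id, subscriber_id)
);
"""

def can_view_premium(conn, subscriber_id: int, creator_id: int) -> bool:
    """Gate paid koos: serve content only if an unexpired subscription exists."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM subscriptions s "
            "JOIN creator_plans p ON p.plan_id = s.plan_id "
            "WHERE s.subscriber_id = %s AND p.creator_id = %s AND s.expires_at > now() "
            "LIMIT 1",
            (subscriber_id, creator_id),
        )
        return cur.fetchone() is not None
```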

Premium Content — Creator Economy
Premium Content for creators — monetisation in play

Bots — Self Verification, Fingerprint

Self-Verification — using fingerprint and other properties

During our international launch in Brazil, we did encounter some percentage of bots and had to quickly set up a mechanism to prevent them. We also wanted to ensure Koo could be accessible across all geographies, and for this a robust bot-prevention system was critical. We tackled this via two things:

  • self verification feature
  • pattern matching / fingerprints

We used a combination of enterprise captcha (Google APIs), fingerprint libs, fingerprint scores from WAF providers, IP addresses (honey-pots) and block-lists, and came up with a confidence score to detect whether a user was a bot. We productised this feature, wherein an actual human user can go to their profile and self-verify with a single click.
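
As an illustration of how such signals can fold into one decision, here is a tiny sketch with made-up weights and thresholds; the real scoring, signal set and cut-offs are different and tuned continuously.

```python
# Hypothetical weights; the real scoring blends the captcha, fingerprint,
# WAF and IP-reputation signals described above.
SIGNAL_WEIGHTS = {
    "captcha_failed":      0.35,  # enterprise captcha verdict
    "fingerprint_reused":  0.25,  # same device fingerprint across many accounts
    "waf_bot_score_high":  0.20,  # bot score reported by the WAF provider
    "ip_on_blocklist":     0.15,  # honey-pot / block-list hit
    "impossible_velocity": 0.05,  # e.g. hundreds of follows per minute
}

def bot_confidence(signals: dict) -> float:
    """Combine boolean signals into a 0..1 confidence that the account is a bot."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))

def classify(signals: dict, block_at: float = 0.7, challenge_at: float = 0.4) -> str:
    score = bot_confidence(signals)
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "challenge"  # ask the user to self-verify
    return "allow"

# Example: a reused fingerprint plus a WAF bot flag pushes the account to "challenge".
print(classify({"fingerprint_reused": True, "waf_bot_score_high": True}))
```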

We also looked at certain behaviours and created algorithms / rules based on events to flag suspicious accounts. We have been able to thwart a large number of bots, especially in the international regions. In India we took a conscious call at the beginning that users needed a mobile number to register on Koo, and this has helped us to a large extent.

ML — Recommendation, Topic Classification, Moderation

ML Stack

Features

  • Topics
  • Recommendation Systems → For You Tab (content reco) and People recommendations (discover new creators)
  • Content Safety

Some of these features were built earlier with an older stack and a not-so-mature MLOps setup. We upgraded our stack for scale and performance by adopting best-in-class practices.

BentoML, Ludwig and BERT models were used to solve these problems. For topic classification we went with BERT. We had been labelling our data corpus quite well, and as a result we were able to open-source our own model on HuggingFace (Koo-BERT). We trained this model on our own data sets (1B tokens and 68M koos). We have been able to achieve high accuracy on topic classification across Indic languages. We use this model heavily for internal content classification and, in some places, for toxicity detection as well.
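
Inference for topic classification looks roughly like the snippet below. The model id is a placeholder, not the actual published checkpoint, and the example labels are illustrative.

```python
from transformers import pipeline

# "koo/topic-classifier" is a placeholder id, not the actual published checkpoint.
classifier = pipeline("text-classification", model="koo/topic-classifier")

koos = [
    "The new budget focuses heavily on rural infrastructure spending.",
    "What a finish in the last over, unbelievable batting!",
]

# top_k=3 returns the three most likely topics per koo; truncation guards long posts.
for koo, topics in zip(koos, classifier(koos, top_k=3, truncation=True)):
    print(koo[:40], "->", [(t["label"], round(t["score"], 2)) for t in topics])
```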

We introduced a vector DB — Milvus — for some of our use cases to store the Koo embeddings. This enabled us to introduce a lot more variables while building our recommendation pipelines.
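
A minimal pymilvus sketch of the store-and-search flow for "more koos like this" style candidates; the collection name, embedding dimension and index parameters are assumptions.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility
)

connections.connect(host="localhost", port="19530")

DIM = 768  # assumption: BERT-sized koo embeddings

fields = [
    FieldSchema(name="koo_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
]
schema = CollectionSchema(fields, description="koo content embeddings")

if not utility.has_collection("koo_embeddings"):
    collection = Collection("koo_embeddings", schema)
    collection.create_index(
        field_name="embedding",
        index_params={"index_type": "IVF_FLAT", "metric_type": "IP",
                      "params": {"nlist": 1024}},
    )
else:
    collection = Collection("koo_embeddings")

collection.load()

# Nearest-neighbour lookup; the real query vector comes from the encoder.
results = collection.search(
    data=[[0.0] * DIM],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    output_fields=["koo_id"],
)
```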

Video Engineering

Video content is quite popular, and as a social platform we produce a good amount of it. To ensure conversion is done well, we were earlier running AWS MediaConvert pipelines. As the percentage of videos on our platform increased, we saw an opportunity to optimise cost here.

We built our own ffmpeg transcoders and ran them on AWS Lambda. This reduced the pipeline cost by almost 90%. We had to ensure the corner cases were well handled, as there are a few nuances around video types. But for any company dealing with a lot of videos, it makes economic sense to run this in house; a minimal sketch of such a handler follows the list below.

  • MediaConvert charges are as per video length so it is expensive.
  • We have migrated to a custom ffmpeg based pipeline with AWS Lambda which works well.
  • Custom pipeline is 1/10th the cost of managed transcoders.
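
Here is that sketch — a Lambda handler that pulls the source from S3, runs a static ffmpeg binary and uploads the rendition back. The event shape, the layer path and the encoding settings are illustrative assumptions.

```python
import os
import subprocess

import boto3

s3 = boto3.client("s3")
FFMPEG = "/opt/bin/ffmpeg"  # assumption: a static ffmpeg binary shipped via a Lambda layer

def handler(event, context):
    # event carries the uploaded object; key names here are illustrative
    bucket, key = event["bucket"], event["key"]
    src = f"/tmp/{os.path.basename(key)}"
    dst = "/tmp/out.mp4"

    s3.download_file(bucket, key, src)

    # transcode to a 720p H.264/AAC MP4 suitable for mobile playback
    subprocess.run(
        [FFMPEG, "-y", "-i", src,
         "-vf", "scale=-2:720",
         "-c:v", "libx264", "-preset", "veryfast", "-crf", "23",
         "-c:a", "aac", "-b:a", "128k",
         "-movflags", "+faststart",
         dst],
        check=True,
    )

    out_key = key.rsplit(".", 1)[0] + "_720p.mp4"
    s3.upload_file(dst, bucket, out_key)
    return {"output_key": out_key}
```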

To ensure the quality stayed the same, we ran the output through standard benchmarks, and overall, for our size and cost, we felt this was good enough.

We started out measuring our video watch time with 3rd-party tools. We soon learnt that building our own would be much more efficient; we built our video analytics platform with ClickHouse and Grafana.
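
The shape of that analytics path, roughly: watch events land in a ClickHouse table and Grafana reads aggregates off it. The table and query below are illustrative, using the clickhouse-driver client.

```python
from datetime import datetime

from clickhouse_driver import Client  # assumption: the clickhouse-driver Python client

client = Client(host="localhost")

# Hypothetical watch-event table; Grafana panels read aggregates like the one below.
client.execute("""
    CREATE TABLE IF NOT EXISTS video_watch_events (
        video_id   UInt64,
        user_id    UInt64,
        watched_ms UInt32,
        event_time DateTime
    ) ENGINE = MergeTree()
    ORDER BY (video_id, event_time)
""")

client.execute(
    "INSERT INTO video_watch_events (video_id, user_id, watched_ms, event_time) VALUES",
    [(42, 1001, 15000, datetime.utcnow())],
)

# Total watch time per video over the last day.
top_videos = client.execute("""
    SELECT video_id, sum(watched_ms) / 1000 AS watch_seconds
    FROM video_watch_events
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY video_id
    ORDER BY watch_seconds DESC
    LIMIT 10
""")
print(top_videos)
```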

Android — moving to Jetpack Compose

Our Android app has seen a lot of experimentation and had issues around performance, code maintainability and, to some extent, feature bloat. Modules weren't reused efficiently, and overall we could sense it was slowing us down.

We had a few objectives and goals in mind when we took on overhauling our Android code base:

  • use Jetpack Compose and reduce the XML bloat
  • reuse the Views and child components as efficiently as possible → inefficient reuse is one of the main causes of slowness that app developers face.
  • decide how to handle events across the app (Rx vs. Flow) and dependency injection (Dagger vs. Hilt).
  • CI to enable PR gates (Checkstyle, PMD, SpotBugs, ktlint).
  • CD to distribute builds to QA daily for testing. This removes the culture of long-lived feature branches.

Summary

If someone had told us at the beginning of this year that these are the things we would be tackling and building, we would have said it was not going to happen. This journey and its outcome would not have been possible without a group of engineers who enjoy working together and are curious about building things. It is that simple, and nothing else — that is what we have come to realise. There is still a lot to accomplish at Koo, and we very much want to continue what we have been doing so far.

Collectively, within the engineering leadership team, we feel we have gravitated towards a certain way of executing things — call it guiding principles or an engineering philosophy.

  1. It is totally fine if one engineer can do everything related to a feature — end to end. Seniors/peers can come in to support and help the developer integrate with existing platforms and frameworks. This reduces context switching while executing the feature and gives the developer a better understanding of all the systems and integration points. Questions get asked at the right time. The engineering team's strength need not be bloated, as at no point will we be executing a large number of features in parallel.
  2. Build platforms on which we can enable other features and workflows fairly quickly. Invest in the tech platform and use the right tech stack to solve the problem in front of us. We have seen, time and again, a lot of duplicate code get built just to do the basics. All it takes is one curious engineer to say — let me make that a service / SDK / plug-in. Encourage this and drive it towards perfection.
  3. Data model and database design is an art. We should not hurry to set this in stone and start implementing. Engineering teams should have experts who can review these designs, and enough bandwidth should be made available to them so that we get this right. A small mistake or an oversight will have a catastrophic effect on timelines and cost. Build this skill across the broader engineering teams and evangelise why the design was done in that fashion.
  4. Channel development (Android, iOS and PWA) needs folks who understand the internal workings of, say, React, Android internals and iOS frameworks. It is a must to have folks who possess these skills. The other folks can rally around them, and usually they all scale similarly. If not done correctly, it slows down development, needs more developers to ship, and performance is often overlooked.
  5. Developers need to think like actual users — this is the hard part. Great developers cultivate this habit quite early, and you can make it out from the questions they ask. As a hiring tip, we look for this, and usually it trumps everything else — their GitHub repo, LeetCode, etc.
  6. Data Engineering is underrated. Though everyone wants to use data for ML and build intelligent systems, many folks make the mistake of building the application layers before the data platform. This causes delays in shipping ML systems and leaves data scientists without enough time to tune their models and apply their theory. The investment and tech stack at the data platform layer shows the maturity and collective thought process of the engineering teams. The TCO of a data platform needs a single engineering owner — the buck stops with this person. True ROI can be understood from the workloads being run vs. the features being used/built.
  7. Security, DevOps, SRE, DBA, MLOps → unicorn. As an engineer it is hard to be an expert in all these areas, but with a small team, how do we solve this as an organisation? That's where we need to hire generalists; they are not sharpshooters but very fast learners with strong basics. They can apply their collective experience to solve a new problem with ease. Making these silos causes delays — we end up in deadlock and nothing moves. So it is quite important to keep this group small and cross-train folks with some SMEs.

We enjoy building things. If you are keen to explore our openings, want to chat about how we do things, or want to know more as a technologist — drop a note at ta@kooapp.com.
