The Data Story of Powerplay (Part 2)

Shubham Goyal · Published in Powerplay · 3 min read · Jan 8, 2022

Data has been core to our product development strategy since the early days. This blog covers how Powerplay built out data engineering as it grew past 50k+ users (20k+ businesses).

To learn how we started using data for product development in the early days, read The Data Story of Powerplay (Part 1).

At that scale, we were dumping all our users' journey data into BigQuery and storing persistent data in MongoDB (a NoSQL DB). On top of these, we were using Redash for analytics and for building dashboards. That's when we hired our first data analyst, who raised the bar further.

Unified data platform

The different databases held different data points about our users, so analysts had to unify data from multiple DBs to get a holistic picture. That came with its own challenges:

  1. Redash doesn't allow writing queries that span multiple DBs.
  2. Data analysts usually don't know NoSQL query languages.
  3. It's not straightforward to combine NoSQL data or convert it into SQL data.
  4. Every user event had its own table in the database, and we often needed to query across multiple events. Due to this, we ended up writing queries with many join operations.

To solve this, we needed a single place to store all our data and run some post-processing on it. After some research, we settled on an AWS S3 bucket for data warehousing, with AWS Athena to write SQL queries on top of it. To collate all the data in the S3 bucket:

  1. We wrote ETL logic in Python to flatten our MongoDB documents into SQL-friendly rows and store them in the S3 bucket (a sketch follows this list).
  2. We redirected our user events from Segment to another S3 bucket. On top of that bucket, we wrote a data pipeline on AWS EMR to convert the event data into an easily queryable format.
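To make the first step concrete, here is a minimal sketch of such an ETL job. Every name in it is hypothetical (connection string, database, collection, bucket, key, and field names), and the real pipeline covers many more collections.

```python
# Minimal sketch of a Mongo-to-S3 ETL job; all names below are hypothetical.
import json

import boto3
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"   # assumption: connection string
S3_BUCKET = "powerplay-data-warehouse"    # assumption: bucket name


def flatten_project(doc):
    """Flatten a nested MongoDB document into a flat, SQL-friendly row."""
    return {
        "project_id": str(doc["_id"]),
        "name": doc.get("name"),
        "owner_id": str(doc.get("ownerId", "")),
        "created_at": doc["createdAt"].isoformat() if doc.get("createdAt") else None,
    }


def export_projects_to_s3():
    client = MongoClient(MONGO_URI)
    collection = client["powerplay"]["projects"]   # hypothetical DB/collection

    # Newline-delimited JSON is a format Athena can query via an external table.
    rows = "\n".join(json.dumps(flatten_project(d)) for d in collection.find())

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=S3_BUCKET,
        Key="warehouse/projects/projects.json",
        Body=rows.encode("utf-8"),
    )


if __name__ == "__main__":
    export_projects_to_s3()
```

Once these newline-delimited JSON files land in S3, Athena can query them through an external table, so analysts get plain SQL over what used to be nested NoSQL documents.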

Third-party integrations

We integrated with many third-party services for different use cases, like WebEngage for user engagement, Sendbird for the chat feature, and Adjust for ads tracking. Unfortunately, our persistent DB and user events did not capture all the data points that were readily available with these third parties.

When we partnered with third parties, we ensured that they provided us with data-export webhooks, which we could use to channel data from their platform to ours. We created API endpoints to listen to their outbound webhooks, and scheduled jobs to fetch data from the third parties that expose APIs.

To ensure that these APIs and jobs scale, we deployed them as AWS Lambda functions (an AWS managed service). These Lambda functions channel data from the third party into our S3 bucket. Further on, we wrote some ETL logic to make this data more queryable.
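As an illustration, a webhook-receiving Lambda can be as small as the sketch below. It assumes the function sits behind API Gateway, and the bucket name, key layout, and WebEngage prefix are placeholders rather than our exact setup.

```python
# Minimal sketch of a webhook-receiving Lambda; bucket and key names are hypothetical.
import json
import time
import uuid

import boto3

S3_BUCKET = "powerplay-third-party-data"   # assumption: bucket name

s3 = boto3.client("s3")


def handler(event, context):
    """Receive a third-party webhook payload (via API Gateway) and persist it to S3."""
    payload = json.loads(event.get("body") or "{}")

    # Partition keys by source and day so downstream ETL and Athena scans stay cheap.
    key = "webhooks/webengage/dt={}/{}.json".format(
        time.strftime("%Y-%m-%d"), uuid.uuid4()
    )
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))

    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```

Keeping the handler this thin means the Lambda only lands raw payloads; all the heavier transformation stays in the ETL layer that runs over the S3 bucket.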

What’s next?

As we have grown further, we are ready to take on a new set of challenges.

  1. The increasing number of data sources has made writing queries and building dashboards complex and time-consuming. We need to collate data from the different sources into one place, which will make our data more accessible to all teams.
  2. Data sanity is another huge challenge. We need to ensure that all our sources stay in sync with each other, so that everyone can have high confidence in the data.
  3. Data size and analytics workloads have grown rapidly over time, which has slowed down our queries. We have done some optimisation by indexing our DB (a sketch follows this list), but there is a long way to go!
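To give a flavour of that optimisation, the sketch below shows the kind of index we add with pymongo; the collection and field names are hypothetical.

```python
# Hedged example of a query-speeding index; collection and field names are hypothetical.
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumption: connection string
tasks = client["powerplay"]["tasks"]                # hypothetical collection

# A compound index covering a common analytics filter + sort,
# e.g. "all tasks of a project, newest first".
tasks.create_index([("projectId", ASCENDING), ("createdAt", DESCENDING)])
```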

We are pumped up to solve these challenges. To mark this journey, we are also looking for stellar data engineers to join our team.
