Starship | The Data Lake for our next PetaBytes of data — A Prologue

NoBroker.com
NoBroker Engineering
Nov 24, 2020

If the only tool you have is a hammer, you tend to see every problem as a nail

— Abraham Maslow

There was a time when NoBroker.com was thought to be just a brokerage-free house-hunting platform. Today, we are the biggest player not only in this segment but also across a spectrum of associated real estate technology services.

Our gate management solution, NoBrokerHood, is one of the best products of its kind in the country. Our HomeServices vertical provides a wide range of solutions including painting, house cleaning, carpentry, and handyman services. Our Legal Services, Rental Agreements, Packers & Movers, HomeLoans, Rent Payments, and more have made us a one-stop solution for all your housing-related needs.

We are now a holistic ecosystem for all your real estate needs, with products and innovations that change the way real estate works.

With so many products and business functions in place, data flows into our backend in massive quantities. Transactions, click events, uploads, updates, geospatial events, and more flow into the system in large volumes and varieties. Constant monitoring of KPIs for all the product and business functions is an integral part of our success.

“Not everything that can be counted counts, and not everything that counts can be counted”

— Albert Einstein

With this, data becomes a first-class citizen and analytics a primary function within the organisation. With many products and countless data points residing in a variety of databases, analytics was becoming a nightmare. Business and product heads relied heavily on data scientists for standard tasks like extracting data from databases, lookups, and merging two data sets. Hence, we thought of a new paradigm for analytics within the organisation.

An analytics-friendly Data Lake with both batch & stream processing capabilities

We call it Starship.

Before we go into further details on Starship, it makes sense to look at what the pre-Starship era of analytics looked like in the organisation. Most of our analytics & BI happened on the Kibana & Elasticsearch stack. Developers made sure that any application they built had a replica of its data in Elasticsearch, with a pipeline that performed the necessary transformations on the dataset. Product managers would decide what data they needed in Kibana, and developers would make sure that data became available.

Kibana is an amazing tool. With powerful query performance backed by Elasticsearch and great visualisation capabilities, it was becoming the go-to analytics tool for the entire organisation. However, as we grew, the number of databases grew, the number of Elasticsearch clusters grew, the number of Kibana interfaces grew, and then came trouble.

“How seamless seemed love and then came trouble!”

— Khaled Hosseini

Elasticsearch can be thought of as a NoSQL-like store with extensive indexing. With this, filters and aggregations worked like a charm, provided we gave the cluster enough resources. It was good for counting and filtering, as long as we organised the data in that form in Elasticsearch. It lacked the capability to merge two data sources or to write advanced queries for reporting. Also, from an analyst's standpoint, there were only two ways one could speak to Elasticsearch: the notorious Elasticsearch query DSL, or Lucene queries on Kibana. Data scientists wrote lengthy code to bring data into Pandas or Spark environments for the things that could not be done with Kibana. It was also a developer's pain point to make sure that data got replicated into Elasticsearch the way it was needed for analytics.

When someone wanted to look at, say, how chats and calls were correlated, there was no way to do it other than downloading large datasets into Excel or raising a request to the Data team.
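For the Data team, fulfilling such a request typically meant glue code along these lines. This is a minimal, purely illustrative sketch, assuming hypothetical index names, host, and fields rather than our actual schema:

```python
# Purely illustrative: pulling two Elasticsearch indices into Pandas just to do a join.
# The host, index names, and field names are hypothetical placeholders.
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://es-cluster:9200"])

def index_to_df(index, query):
    """Scroll through an index and load the matching documents into a DataFrame."""
    hits = scan(es, index=index, query={"query": query})
    return pd.DataFrame([hit["_source"] for hit in hits])

# Two separate indices that Kibana cannot join for us.
chats = index_to_df("chat_events", {"range": {"created_at": {"gte": "now-7d"}}})
calls = index_to_df("call_events", {"range": {"created_at": {"gte": "now-7d"}}})

# The "merge two data sets" step that had to happen outside Elasticsearch.
chat_call = chats.merge(calls, on="user_id", how="inner", suffixes=("_chat", "_call"))
print(chat_call.groupby("user_id").size().describe())
```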

This was lame, and so we built Starship.

Starship — The Ideology

When we thought of an enhanced analytics experience platform, we wanted it to have the following capabilities:

  • A Data Warehouse that is easy for non-technical people to interact with.
  • Pipeline components that are reusable for event-driven stream processing.
  • A pipeline that scales to accommodate any volume of data flow.
  • ACID transactional capabilities, such as inserts, updates, and deletes on the data store, in near real-time.
  • ANSI SQL compatible query execution capabilities.
  • The ability to work in Spark & Hive environments for data science and machine learning workloads.
  • A client-agnostic data store that is cheap and HDFS-like, and can also function as a warehouse.
  • A stateless, decoupled query execution engine that can speak to this data store and scale horizontally to interact with PBs of data.
  • Data organised in a way that follows the principles of Tidy Data.
  • Painless ad hoc query execution.
  • Near real-time syncing of the data into the data store.

We achieved these using the following setup:

  • An event-driven pipeline with an event stream at the source and modern cloud-based object storage at the sink, enabling both event-driven analytics and batch analytics. We did this with Apache Kafka at the source and Google Cloud Storage at the sink of the pipeline.
  • A compressed columnar file format in the warehouse with near real-time updates and transactional capabilities. We did this with Apache Hudi; a minimal sketch of this ingestion path follows the list.
  • A distributed SQL query engine that sits on this warehouse and can be scaled seamlessly from GBs to PBs. We did this with Presto running on Kubernetes; a sample ad hoc query sketch also follows the list.
  • A user-friendly interface for communicating with the store, where anyone with a reasonable understanding of data can ask questions of the data store and create reports. We did this with Metabase.
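To make the ingestion path concrete, here is a minimal sketch, not our production code, of a Spark Structured Streaming job that reads events from a Kafka topic and upserts them into a Hudi table on Google Cloud Storage. The topic name, event schema, record key, and bucket paths are assumptions made for the example, and the job presumes the Hudi and Kafka connector packages are on the Spark classpath.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> Apache Hudi on GCS.
# Topic, schema, bucket, and key fields are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("starship-ingest-sketch").getOrCreate()

# Assumed shape of an incoming event; adapt to the real payload.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "app_events")          # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

hudi_options = {
    "hoodie.table.name": "app_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "event_type",
    "hoodie.datasource.write.operation": "upsert",
}

query = (events.writeStream
         .format("hudi")                              # Hudi Spark datasource
         .options(**hudi_options)
         .option("checkpointLocation", "gs://starship-lake/checkpoints/app_events")
         .outputMode("append")
         .start("gs://starship-lake/warehouse/app_events"))  # hypothetical GCS path

query.awaitTermination()
```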
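On the consumption side, here is an equally hypothetical sketch of an ad hoc ANSI SQL join across two tables in the lake, run through Presto with the presto-python-client; in practice the same query can simply be typed into Metabase. The coordinator host, catalog, schema, and table names are assumptions.

```python
# Hypothetical consumption-side sketch: an ad hoc ANSI SQL join over the lake via Presto.
# Coordinator host, catalog, schema, and table names are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.starship.internal",  # hypothetical Presto coordinator
    port=8080,
    user="analyst",
    catalog="hive",                   # Hudi tables registered in the Hive metastore
    schema="starship",
)

cur = conn.cursor()
cur.execute("""
    SELECT c.user_id,
           COUNT(DISTINCT c.event_id) AS chats,
           COUNT(DISTINCT k.event_id) AS calls
    FROM chat_events c
    JOIN call_events k
      ON c.user_id = k.user_id
    WHERE c.event_ts >= date_add('day', -7, current_date)
    GROUP BY c.user_id
""")
rows = cur.fetchall()
```

The point is that the chats-versus-calls question from earlier becomes a single query instead of an export-and-merge exercise.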

Each of these components is an entire topic in itself. In the following series of blogs, we will go deeper into the principles and details of each component. Watch this space and follow us for more on the amazing things we do at NoBroker.com.
