Big-Data vs. Small Start-up

Chasing the pink unicorn of cheap big-data processing.


Last week I had a meeting with a small and promising start-up looking to offer a SaaS web-analytics product. They already have a working site, a few beta users (medium-sized companies) and a full stack of software they wrote. They contacted me after running into performance issues with their site. The simpler data was not an issue, but complex queries and drill-downs were the source of their pain. They had chosen Google BigQuery as the data backbone for the complex queries, but they ended up paying a lot of money and not getting the performance they needed. Don’t get me wrong, there’s no fault in Google BigQuery or its business model; it’s just not the right tool for the job they needed. Their data was not in an ideal format/layout to be processed by Google BigQuery.

The start-up had already begun a few small POCs to evaluate which solution to migrate to. One of the main candidates was MongoDB, and based on my experience working with it for The joola.io Framework I was asked for advice. We had a very long conversation about their data, needs, roadmap and more. We reviewed the different alternatives out there that could be considered good tools for the job. For example, MongoDB would be very useful for dealing with their specific data (multi-level JSON documents), but its sharding and replica sets can make it quite expensive. We spoke about Cassandra and its merits. All in all, a good and productive talk.

BUT what surprised me most was the lack of a basic understanding that the equation here is axiomatic:

Very data + wow performance = Much money

I realized that the topic of the conversation (from their perspective) was not how to balance the equation, but rather how to keep spending as low as possible. This makes perfect sense whether you’re small or big; it’s just good business. However, unlike in the software industry, where you can increase margins by optimizing your code, with data this is a trickier thing to achieve.

Their business was based on users pushing extremely large volumes of data in and later querying that data for a tailored analytics view, which was expected to appear in a split second. Let’s review the main concepts available today for providing fast big-data:

  • In-memory stores, which are very expensive and demanding on IT.
  • Map-reduce, which means that raw data goes in and aggregated (reduced) data comes out.
  • Sharding, which splits the data between stores/servers/nodes/etc. and offers better performance because each unit deals with fewer data points (see the sketch after this list).
  • Traditional data-warehousing with indexes.
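To make the sharding idea a little more concrete, here is a minimal sketch in Python; the event records, the shard count and the hash-based routing function are all invented for illustration, not any particular product’s implementation:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(user_id):
    """Route a record to a shard by hashing its key."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard holds only its own slice of the events.
shards = {i: [] for i in range(NUM_SHARDS)}
events = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u2", "page": "/pricing"},
    {"user_id": "u3", "page": "/home"},
]
for event in events:
    shards[shard_for(event["user_id"])].append(event)

# A query about one user touches a single shard, not the full data set.
target = shard_for("u2")
print([e for e in shards[target] if e["user_id"] == "u2"])
```

The point is simply that each node scans a fraction of the data, which is exactly where the cost comes from: more data means more nodes.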

There are more concepts for tackling big-data, but the major proven techniques are listed above. Regardless of which technique you choose, there is another axiomatic fact: analyzing fewer data points is faster than analyzing more data points. So, the more data you need to process in order to answer a query, the more resources you’ll need, be it shards, RAM, storage or whatever. Simples.

So what can a small start-up aiming to tackle big-data do to keep costs under control? Start-ups have something that big companies rarely have: agility. My suggestion to the company I met with was to focus on their ETL (Extraction-Transformation-Loading) process. If the “trick” is to end up with small data, then we funnel the process in a way that map-reduces the extremely large volumes of data into smaller chunks of meaningful insight that can be accumulated, aggregated and delivered very fast. It’s not as easy as it sounds, because the more we reduce the data, the less flexible it becomes; we move from data to an answer. If the raw data includes an IP address but the reduced data does not, then there is no way to summarize the analytics by IP.
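As a rough illustration of that trade-off, here is a minimal map-reduce-style sketch in Python (the event fields are made up for the example): raw events go in, a much smaller aggregate comes out, and because the IP address is not carried into the reduced data, no later query can break the numbers down by IP.

```python
from collections import Counter

# Raw events as they arrive from users (fields are illustrative).
raw_events = [
    {"ip": "10.0.0.1", "page": "/home",    "ms": 120},
    {"ip": "10.0.0.2", "page": "/home",    "ms": 95},
    {"ip": "10.0.0.1", "page": "/pricing", "ms": 300},
]

# "Map" each event to the dimension we keep, then "reduce" by counting.
page_views = Counter(event["page"] for event in raw_events)

print(page_views)  # Counter({'/home': 2, '/pricing': 1})
# The aggregate is tiny and fast to query, but the IP dimension is gone:
# "views per IP" can no longer be answered from page_views alone.
```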

The answer we found was offering their users a very detailed breakdown of the last 24 hours, then a less detailed breakdown of the last 3 days, less for the month, and so on. They handpicked the dimensions and metrics that were no longer relevant (or less relevant) after each period and dropped them from the map-reduce of the next period. But what if a user does want to drill down and see a very detailed view of the data? This is also possible, by offloading old information to a less expensive server and storage system. Since users will not be using the slow system for general analytics, but only for very specific time-bound queries, a proper MySQL server with the correct indexes can serve that detailed breakdown in a split second.
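A sketch of that rollup schedule might look like the following; the tier names, periods and dimension lists are assumptions for illustration, not the start-up’s actual configuration:

```python
from collections import Counter

# Each tier keeps less detail than the one before it; as data ages it is
# re-aggregated with the dropped dimensions removed (tiers are invented).
ROLLUP_TIERS = [
    {"name": "last_24h", "dimensions": ["page", "country", "browser", "referrer"]},
    {"name": "last_3d",  "dimensions": ["page", "country", "browser"]},
    {"name": "last_30d", "dimensions": ["page", "country"]},
]

def rollup(events, dimensions):
    """Re-aggregate events, keeping only the tier's dimensions."""
    return Counter(tuple(event[d] for d in dimensions) for event in events)

events = [
    {"page": "/home", "country": "US", "browser": "Chrome", "referrer": "google"},
    {"page": "/home", "country": "US", "browser": "Chrome", "referrer": "twitter"},
]
# The 24-hour tier still separates the two events by referrer...
print(rollup(events, ROLLUP_TIERS[0]["dimensions"]))
# ...while the 3-day tier collapses them into a single row.
print(rollup(events, ROLLUP_TIERS[1]["dimensions"]))
```

The fully detailed raw events that age out of these hot tiers are what get offloaded to the cheaper, indexed MySQL server for the occasional drill-down query.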

They’re now working on building a small POC of the new flow, and hopefully we’ll see them go to production and release a better-performing product.