Analyzing Bigdata

With recent changes and development in our product we now have thousands of new users using our product on a daily basis. With huge influx of traffic, came another thing — huge data. We have millions of records flowing into our database daily. With such scale, storage was not the primary problem as Google’s Cloud Datastore (NDB) handled it efficiently. In our case, the problem was analyzing this data to extract meaningful information. We wanted to analyze millions of rows within seconds. So we came up with two solutions.

Our problem: Analyzing the data stored in our database on a regular basis (not daily).

Solution 1: Querying NDB using MapReduce — creating a job for each query, that would use sharding and slicing techniques to query millions of rows

Drawback: We use MapReduce for migrations today. Using that technique to query our data meant waste of resources, time and money. Writing jobs for each kind of query meant reinventing the wheel. Further, it was not fast enough to provide results as per our requirements. And especially, performing a JOIN operation meant busting our head around it for quite a while. Conclusion — Out of our table!

Solution 2: GAE provides an awesome tool for querying big data — BigQuery. It is crafted so beautifully that it perfectly satisfies all data-querying requirements. For us our only need was to analyze the data. So we went with BigQuery.

BigQuery is an efficient tool for analyzing patterns in data — huge data. With simple SQL like syntax we were able to query terabytes of data and get results within seconds. Further, it’s server less. So we do not require any dedicated instances running for it. One of the best things about BigQuery is that loading and exporting data in and out of BigQuery is absolutely free. We are charged only for our queries and streaming inserts. Moreover it provides a simple API that can easily be integrated with our systems.

Putting it all together

As we require the data to be analyzed at regular intervals, we regularly backup our NDB models that need to be analyzed and store them in Cloud Storage. Once the backup is finished, we simply provide the storage bucket to BigQuery UI and it handles everything from schema creation to loading the data. Loading process is fast and by fast I mean it completes within seconds. Once the data is up for querying, we use simple SQL like syntax for performing joins over tables and analyzing the results.

To visualize our query results, we developed an API on top of BigQuery APIs and passed our results to Google Charts.

Thus with a simple work-flow and BigQuery we solved our entire problem of “Analyzing BigData”.