Free, Open Source Analytics: Leveraging Unstructured Data and the Cloud

Jorge Chang
3 min read · Mar 8, 2016


With the maturation of the mobile app economy, several trends point to a pressing need for a free, open source solution for moving unstructured data into the cloud for analysis and data science. At OneFold, this is exactly what we are building, because we are developers who feel the pain that all developers feel. Recently we’ve added the ability to move MongoDB data into Google BigQuery, so NoSQL data can be queried using SQL.

Trend Number 1: Capture Everything. Mobile app developers are capturing everything. Yes, they are literally instrumenting their apps to capture every single thing users do. Whether it will ever be analyzed is a secondary question. The first principle is to capture everything. The sheer number of data points being captured every day is enormous. Some apps we know capture over 1 billion data points a day.

Trend Number 2: Capture Everything in JSON. Semi-structured data is exactly that. It’s not normalized and structured like SQL data. The new standard for semi-structured data is JSON. Frankly, it’s easier to capture semi-structured data as JSON (or to capture unstructured data and convert it into JSON through an adapter) than to “plan” in advance how it will be analyzed and try to capture it in a structured format that anticipates this analysis. Who knows how it will be analyzed? Even if we think we know, we are probably not thinking of every analytical query that we’ll some day come up with.
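To make the "capture first, decide on structure later" idea concrete, here is a minimal Python sketch that takes JSON events as they were captured and works out a flat, SQL-style set of columns after the fact. The event fields, the `flatten` and `infer_columns` helpers, and the underscore naming are all made up for illustration; this is not OneFold's actual ETL code.

```python
import json


def flatten(doc, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat


def infer_columns(events):
    """Collect every flattened field seen so far, with the Python type
    name of the first value observed for it (a stand-in for a SQL type)."""
    columns = {}
    for event in events:
        for name, value in flatten(event).items():
            columns.setdefault(name, type(value).__name__)
    return columns


# Hypothetical events, captured with no schema planned in advance.
raw_events = [
    '{"user": {"id": 7, "plan": "free"}, "action": "open_app"}',
    '{"user": {"id": 9}, "action": "purchase", "amount": 4.99}',
]
events = [json.loads(line) for line in raw_events]
columns = infer_columns(events)
print(columns)
```

The second event introduces an `amount` field the first one never had, and the inferred columns simply grow to include it. Nothing about the capture side had to change.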

Trend Number 3: Analyze Everything using a Vendor Software Package or Service. Here comes the tough part. It all goes into a cloud, whether Amazon, Google, Azure, or maybe a vendor-run private cloud, and then you get charged for the analytics software based on the amount of data you collect. If you collect only a little, it’s $X. But if you collect a lot, it’s 10X or 100X. This really gets you. Many developers we know start out at X, and then as their app or service starts to scale, they get hit with huge monthly costs for the analytics software. In a way, that’s to be expected because the vendor is trying to make money. But so many developers we know start to worry about the costs, because the amount of data they collect keeps increasing as they scale. Even as the cost of cloud computing is decreasing rapidly, the cost of analytics in the cloud is increasing rapidly.

One way out is to buck Trend Number 1: Stop collecting everything. It’s called sampling. Try telling that to a start-up founder. It kills you. As a service is scaling, you don’t want to sample. You want to know everything. Because if you miss something important through sampling, then you miss a big opportunity your competitor might not miss. And that could mean the difference between success and failure.

Nobody we know wants to sample. But they are forced to sample. Because of Trend Number 3. So the solution is simple. The same thing happened with expensive databases, when MySQL came around. Open Source.

OneFold open sources the end-to-end analytics stack: collection, JSON API, automated ETL, cloud storage, and SQL query and visualization. It collects data from NoSQL databases like MongoDB, services like Stripe, and mobile apps, and supports SQL queries on Amazon Redshift, Google BigQuery and Hadoop/Hive. We also built most of the chart.io-like functionality using Facebook React for visualization. So far only the OneFold team has been doing this.

As you can imagine, collecting data and loading it onto any cloud is an m*n problem: m data sources times n destinations. Recently, one of our users, Raghav Rastogi, who is a data scientist at GotIt! (a top EDU app in the app store), added support for loading MongoDB data into Google BigQuery. That’s his use case, but I’m pretty sure it’s a common one. I’m really excited by this contribution. This year, I’m hoping we will see more contributions. We plan to let developers use the software for free, but if people want support, we will use the standard open source business model of a support fee.
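One common way to tame an m*n problem like this is to route everything through a shared intermediate format, so that each source and each destination needs only one adapter. The Python sketch below illustrates that general pattern with plain dicts standing in for JSON records. The reader functions, the in-memory `warehouse`, and the `transfer` helper are hypothetical placeholders, and this is not necessarily how OneFold itself is structured.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]  # stand-in for one JSON document


# Hypothetical source adapters: each one only knows how to emit records.
def read_mongodb() -> Iterable[Record]:
    yield {"source": "mongodb", "id": 1}


def read_stripe() -> Iterable[Record]:
    yield {"source": "stripe", "id": 2}


SOURCES: Dict[str, Callable[[], Iterable[Record]]] = {
    "mongodb": read_mongodb,
    "stripe": read_stripe,
}

# Hypothetical destination adapters: each one only knows how to accept
# records. An in-memory dict stands in for the real cloud warehouses.
warehouse: Dict[str, List[Record]] = {}


def make_sink(name: str) -> Callable[[Iterable[Record]], None]:
    def write(records: Iterable[Record]) -> None:
        warehouse.setdefault(name, []).extend(records)
    return write


SINKS = {name: make_sink(name) for name in ("bigquery", "redshift", "hive")}


def transfer(source: str, sink: str) -> None:
    """Move records from any source to any sink via the shared format."""
    SINKS[sink](SOURCES[source]())


transfer("mongodb", "bigquery")

adapters_written = len(SOURCES) + len(SINKS)  # m + n pieces of code
pairs_supported = len(SOURCES) * len(SINKS)   # m * n combinations covered
print(adapters_written, pairs_supported)
```

With two sources and three destinations the saving is small, but each new source or destination adds one adapter instead of one per counterpart, which is what makes a single contributed loader useful to everyone else.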
