Is the Array Database the answer to faster analytics on Big Data?
While Hadoop’s epitaph is being written and the industry repositions itself with new paradigms like Hive-based execution engines on top of HDFS and data-processing frameworks like Apache Spark, are we paying enough attention to something very significant: the emerging new generation of database systems that claim processing speeds up to 100 times faster than Hadoop? Somehow, in today’s market, Hadoop has become synonymous with Big Data. But with real-world experience and a better understanding of its limitations, are we ready to look beyond the Hadoop ecosystem? And could these new avatars of the database be the answer to our Big Data analytics woes?
The problem arises from the belief that Hadoop is the silver bullet for any kind of Big Data problem. Hadoop was conceived to address a very specific problem: parallel processing of documents in a distributed architecture. The idea was divide and conquer, and it still works great for problems where data can be split into independent chunks and processed efficiently in parallel. But divide and conquer cannot be applied to all types of problems, especially when it comes to running complex analytics. That is the core of the problem.
Mike Stonebraker, a leading authority on databases, the father of Postgres, Ingres, Vertica, VoltDB, and SciDB, and a professor at MIT, has always been very vocal about Hadoop’s shortcomings. He says, “Hadoop is extremely good at only embarrassingly parallel jobs, its performance is disastrous when it comes to any other kind of processing.”
Hadoop was originally designed as a batch processing system. There is no doubt that it is useful for ingesting and preparing Big Data, but MapReduce is extremely slow at data crunching and can take days or even weeks to return results. If you want quick answers and need to run complex queries that require extensive optimization, Hadoop is definitely not the right tool.
Today the industry is finding that out and looking for alternatives to replace the MapReduce layer. Apache Spark is trying to solve some of the problems in this area and is gaining a lot of prominence, but Spark comes with its own limitations. Hadoop proponents like Cloudera, Hortonworks, and Facebook are focusing on building execution engines that run Hive queries without going through the MapReduce layer. Cloudera’s Impala is one such execution engine: it implements Hive’s interface directly on top of HDFS, which again might not be a great choice for storing your data for direct analytics, depending on your problem.
The last decade has seen a lot of interesting ideas and innovations in the database world, and a lot of new products addressing different kinds of problems. There are columnar databases for data-warehouse workloads and main-memory databases for faster OLTP processing. And then there are the promises of graph databases and array databases for faster analytics on graph-shaped and array-shaped data. Given these developments in the modern database space, we are bound to explore whether there is a better answer for running faster analytics on Big Data.
Big Data is an amalgamation of many different types of data at massive scale, and every day businesses want to integrate more and more data sources. If we look at this data, it is primarily machine generated or machine logging of user actions (browsing data, Twitter data, SMS data, etc.). These kinds of data are naturally represented as multidimensional arrays and two-dimensional matrices. While we can choose to store them in file systems like HDFS or in traditional row stores or column stores, the array model is an easier way to store, retrieve, and run faster queries on this kind of data.
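To make the point concrete, here is a minimal sketch (using NumPy and invented clickstream numbers) of what storing machine-logged activity natively as a two-dimensional array looks like: the dimensions themselves act as the index, so slicing along either dimension is a direct lookup rather than a scan-and-filter over rows.

```python
import numpy as np

# Hypothetical example: click counts for 4 users over 6 hourly buckets.
# In an array data model this lives natively as a 2-D array indexed by
# (user, hour) instead of as rows in a table.
clicks = np.array([
    [3, 0, 5, 2, 1, 0],
    [1, 4, 0, 0, 2, 3],
    [0, 0, 7, 1, 0, 2],
    [2, 2, 2, 2, 2, 2],
])

# Slicing along a dimension is a cheap, index-based operation:
hour_2_activity = clicks[:, 2]   # every user's activity at hour 2
user_1_history = clicks[1, :]    # one user's full time series
print(hour_2_activity)           # [5 0 7 2]
print(user_1_history)            # [1 4 0 0 2 3]
```

In a row store, the same two queries would require either a secondary index or a full scan with a predicate; in the array model the coordinates are the index.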
The competitive advantage of Big Data lies in our ability to run complex analytics on it: predictive models, clustering, principal component analysis, and so on. Complex analytics lets us look at how a system changes over time, location, position, price, or any other ordered dimension, and compare those changes to changes in other people, devices, and information. To do this, we need to find the next, previous, or neighboring values along whatever dimension we are analyzing.
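A short sketch of what such "vicinity" queries look like when the data sits in an array (NumPy here, with invented sensor readings): next/previous values and neighborhoods along the ordered time dimension become one-line operations, whereas in SQL they typically require self-joins or window functions.

```python
import numpy as np

# Hypothetical hourly sensor readings, ordered along the time dimension.
readings = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 13.0])

# "Next minus previous" along the ordered dimension:
deltas = np.diff(readings)               # change between consecutive readings

# A vicinity query: moving average over each window of 3 neighbors.
kernel = np.ones(3) / 3
moving_avg = np.convolve(readings, kernel, mode="valid")

print(deltas)       # [ 2. -1.  4. -1. -1.]
print(moving_avg)   # averages of [10,12,11], [12,11,15], [11,15,14], [15,14,13]
```

Because the array encodes order directly in its index, neither operation needs a join or a sort; the engine can walk adjacent cells.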
These analytics operations can be viewed as linear-algebra computations over arrays. While traditional databases can certainly answer such queries, they are expensive in both performance and complexity. If the work is fundamentally array calculation, using an array database as the native storage model makes things much simpler and more efficient. Array databases’ built-in math functions make it easy to run complex analytics on Big Data. There are systems that provide an array SQL loaded with quant-oriented linear algebra built in, which is much faster and more powerful than Hadoop-based analytics. Systems like SciDB also support user-defined functions, enabling users to write their own fancy analytics.
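As an illustration of the kind of linear algebra such systems expose as built-in operators, here is a hedged sketch of principal component analysis written directly against arrays (plain NumPy, with randomly generated data): centering, a covariance matrix, and an eigendecomposition. In an array database these steps map onto native operators instead of being reassembled from rows.

```python
import numpy as np

# Invented data: 100 samples of 3 features, with feature 2 made to
# depend on feature 0 so there is a dominant principal component.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

Xc = X - X.mean(axis=0)                 # center each feature
cov = (Xc.T @ Xc) / (len(Xc) - 1)       # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (symmetric)

order = np.argsort(eigvals)[::-1]       # largest variance first
components = eigvecs[:, order]
projected = Xc @ components[:, :2]      # project 3 features down to 2
```

The whole pipeline is four array expressions; the same analysis expressed over a row-oriented table would first have to pivot the rows back into a matrix before any of the math could run.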
As we become more data savvy, move toward a more complex world of analytics, and at the same time grow impatient for more processing speed and power, the array database is one direction we cannot ignore. It will gain traction as we move toward more sophisticated data science, and it is a direction well worth exploring.