Cloudera RecordService — The answer to Big Data Security?

One of the great things about the Hadoop ecosystem is that it has evolved to apply a wide range of scalable computing patterns to an equally wide range of scalable data stores. Hadoop started by allowing users to write Java programs that processed arbitrary files in the Hadoop Distributed File System using the MapReduce pattern. But now it also allows users to write Hive and Impala queries in SQL, or to write Spark programs in Python, Scala, or R to perform machine learning analysis, and more. And it enables users to apply those programs to data stored in ORC files, Parquet files, HBase, Cassandra, and many other formats and data stores.

This Hadoop ecosystem has developed rapidly. The 1.0 release of Hadoop arrived in December 2011 (after a six-year gestation in 0.x releases). But over that period, the platform evolved primarily to allow small groups of people to process huge quantities of data. This is in contrast to the evolution of corporate databases, which throughout the 1990s and 2000s extended computing to allow online access (direct and/or through corporate OLTP systems) to broad audiences with diverse needs and diverse rights of data access. So while corporate databases have evolved sophisticated record-level (row-level) and attribute-level (column-level) access control mechanisms, these capabilities have been almost completely lacking in the Hadoop ecosystem. One counterexample is the Accumulo database — a wide column store that emphasized access control from the get-go. But access control applicable to the full ecosystem (all compute engines, all data stores) has so far been largely absent from the premier open source Big Data stack.

Which brings us to now. Cloudera recently released a beta of RecordService — an access control enforcement layer for Hadoop — as an Apache incubator project. When combined with Apache Sentry for policy management, and linked into a corporate identity management system via LDAP, RecordService provides a uniform layer that intermediates between the compute engine and the data source to enforce record-level and attribute-level permissions in Hadoop. I have not had a chance to try it out yet, so I cannot endorse the implementation. But I do applaud the effort, and hope it lives up to its potential.
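To make the idea concrete, here is a toy sketch of what record- and attribute-level enforcement between a compute engine and a data store might look like. To be clear: this is not the RecordService API (which, as noted, I haven't tried); the records, policies, and `scan` function below are all hypothetical, just to illustrate the two kinds of filtering.

```python
# Toy illustration of record-level (row) and attribute-level (column)
# access control -- the kind of enforcement a layer like RecordService
# would apply between the compute engine and the data store.
# All names here are hypothetical, not the RecordService API.

RECORDS = [
    {"name": "alice", "dept": "sales", "salary": 90000},
    {"name": "bob",   "dept": "eng",   "salary": 120000},
    {"name": "carol", "dept": "sales", "salary": 95000},
]

# Per-user policy: which columns may be read (attribute-level),
# and a predicate deciding which rows may be read (record-level).
POLICIES = {
    "sales_analyst": {
        "columns": {"name", "dept"},
        "row_filter": lambda r: r["dept"] == "sales",
    },
    "hr_admin": {
        "columns": {"name", "dept", "salary"},
        "row_filter": lambda r: True,
    },
}

def scan(user, records):
    """Return only the rows and columns the user's policy permits."""
    policy = POLICIES[user]
    return [
        {k: v for k, v in r.items() if k in policy["columns"]}
        for r in records
        if policy["row_filter"](r)
    ]
```

The key point is that this filtering happens in one uniform place, below the compute engines, so a Spark job and an Impala query see exactly the same sanitized view of the data.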

Beyond RecordService, though, I think we have a ways to go before Big Data security is truly satisfactory. Heck — even corporate databases are still woefully inadequate at protecting important records. If Big Data stacks are to become trusted not just as a place for specialized analytics teams to mine data dumps, but as a centralized data store around which to build corporate systems, it will take more than strong access control mechanisms. In my view, pervasive encryption of data at rest (not just in transit), integrated with the access control system so that the two reinforce each other, will be a big step in the right direction. I’m interested to see the next developments in this critical area of Big Data technology.