Retrospective on ODPi: Does Hadoop Really Need a Drum Major?

Greg Chase
Business and Marketing
4 min read · Apr 4, 2016
Photo credit: Wikipedia

How do you buy Hadoop? Well, according to the Apache Hadoop website, there are over 30 vendors that include Apache Hadoop, offer derivative works, or provide commercial support.

In fact, Hadoop isn’t a single technology. It’s a vibrant ecosystem of dozens of technologies handling different aspects of data ingestion, storage, management, querying, security, and so on. Each of these technologies is built collaboratively within its own open source project at The Apache Software Foundation, on its own release cadence and with its own dependencies. And this doesn’t include the hundreds of other software and hardware technologies that try to host or integrate with the Hadoop ecosystem.

In short, anyone trying to create a solution out of these various technologies has a systems integration challenge on their hands. This has led to the rise of commercial Hadoop distributions, which would do just fine if a distribution were all an end user needed. The problem is that Hadoop is usually just one part of an end-user data and analytics solution, which also involves incorporating non-Hadoop data sources, data infrastructure, and visualization technologies, and living within cloud or legacy IT environments and operations.

Mitigating this complexity is part of why an organization called ODPi was founded. Last week, this “shared industry effort” published its first official runtime specification and a corresponding test suite that allow makers of Hadoop distributions to prove compliance with the ODPi spec. Interestingly, this announcement seemed to dominate the news coming out of Strata + Hadoop World.

Much of the commentary about this work was positive, but some people asked why Hadoop really needed a “project for projects.”

Does a customer of Hadoop really need ODPi? Not really, if you think about it. Customers can continue to choose a preferred commercial distribution provider and shove the delivery responsibility onto a systems integrator or cloud provider. This increases costs for the customer and stifles adoption, but it’s not the end of the world. There’s a lot of money to be made by the people who deliver these deployed solutions.

The parties who really need ODPi are the hundreds of new and existing technology providers that want to build onto the Hadoop ecosystem. As I was walking through Strata the other day, I was ticking off the many software developers that would greatly benefit from improved cooperation within the Hadoop ecosystem. Wouldn’t it be nice if a single version of a connector could work with all distributions at a certain version level of Hadoop? In fact, the only company I could think of that would not benefit was a well-known driver developer, since it can currently make tons of money developing distribution-specific connectors.
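To make that concrete, here is a minimal, hypothetical Java sketch of the idea: a connector that codes only against the stable public Hadoop APIs that every spec-compliant distribution must ship, so a single binary can target all of them. The class name and probe path are illustrative, not taken from any real connector.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: a "connector" that touches nothing distribution-specific,
// just the public Hadoop FileSystem API any compliant runtime provides.
public class PortableConnector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);     // default filesystem of whatever distro is installed
        Path probe = new Path("/tmp/odpi-probe"); // hypothetical path, for demonstration
        System.out.println("Default FS: " + fs.getUri());
        System.out.println("Probe path exists: " + fs.exists(probe));
    }
}
```

The point is that nothing above names a vendor: if every distribution honors the same runtime contract, this code needs no per-distribution variants.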

ODPi is a bit of a different animal from the other open source projects that make up the Hadoop ecosystem. As a project under The Linux Foundation, it is a collaboration between companies agreeing on release standards. The work of ODPi is open source as well, so there are no secrets. In fact, the technical committee developing the test suite contributed much of that code to the latest release of Apache Bigtop. Apache Bigtop, if you aren’t familiar with it, is a coordinating project for packaging and testing multiple big data technologies. So it only makes sense for ODPi to contribute to this project.
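To give a flavor of what a runtime compliance suite checks, here is a small, self-contained Java sketch that verifies a set of standard Hadoop environment variables are present on a host. The variable names below are common Hadoop conventions used purely for illustration; the authoritative checks live in the ODPi spec and the test code contributed to Bigtop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of one kind of check a runtime compliance suite performs:
// confirm the environment variables a spec expects are actually set on this host.
public class RuntimeEnvCheck {
    public static void main(String[] args) {
        List<String> expected = Arrays.asList(   // illustrative list, not quoted from the spec
                "HADOOP_COMMON_HOME", "HADOOP_HDFS_HOME",
                "HADOOP_YARN_HOME", "HADOOP_MAPRED_HOME");
        boolean ok = true;
        for (String var : expected) {
            String value = System.getenv(var);
            if (value == null || value.isEmpty()) {
                System.err.println("MISSING: " + var);
                ok = false;
            } else {
                System.out.println(var + " = " + value);
            }
        }
        System.exit(ok ? 0 : 1); // nonzero exit signals a failed check
    }
}
```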

A company does not have to become an actual member of ODPi to take advantage of this standardization; membership is only needed to influence future iterations of the spec or to “certify” compliance.

So far, results from ODPi have come deliberately slowly: it took over a year to release the runtime spec after the idea was first announced in February 2015, jointly by Pivotal and Hortonworks. The reason is that we hope ODPi will be a cooperative force, not a ruling force. For this force to do real good, some providers may have to give up their temporary hegemonies in favor of enhancing innovation in other parts of the ecosystem.

The truth, however, is that much of the new development in the analytics world concerns analyzing data stored in Hadoop, or processing data before it gets into Hadoop. Even the core Hadoop providers are all actively developing new analytics and querying technologies rather than making drastic changes to Hadoop itself. Some people even suggest that technologies such as Spark might displace many Hadoop workloads, relegating the elephant to cold storage of data.

The purpose of ODPi is not to stifle future innovation that might eventually disrupt Hadoop. Rather, ODPi will simply help ensure that the Hadoop ecosystem doesn’t become a victim of its own complexity.

If you work for a software vendor or solution provider, and are interested in participating in the work of the ODPi, check out the members’ link.

If you are interested in analyst and press reaction about the recent announcement of the ODPi runtime spec release, check out this list of articles.
