What is Apache Hadoop?

John Thuma · Published in DataSeries
6 min read · Jul 9, 2018

A LITTLE HISTORY: This is the second installment of the ‘For Non-Unicorns’ series; the last article focused on What is Apache Spark. This article focuses on the big elephant in the room: Apache Hadoop. Why name something ‘Hadoop’? Doug Cutting was inspired by his son, and in his words: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.” Cutting founded Lucene and, with Mike Cafarella, Nutch, both open-source search technology projects now managed through the Apache Software Foundation. Cutting and Cafarella are also the co-founders of Apache Hadoop.

WHO:

Over the past 10 years we have witnessed the phenomenon of ‘big data,’ which we will discuss in the WHAT section of this document. Hadoop has become almost a synonym for big data: roughly 35% of major enterprises have implemented some form of Apache Hadoop. According to Forrester, the Hadoop market grew from $435 million in revenue in 2015 to $768 million in 2017. Apache Hadoop job titles include Hadoop software developer, architect, and administrator.

Another aspect of Apache Hadoop is its distributors. The three major distributors are Cloudera, Hortonworks, and MapR. Amazon Web Services is a significant player in the market with its Elastic MapReduce (EMR), and Microsoft has also entered the market with HDInsight. So buckle up: the who’s who in this zoo is yet to be determined.

BOTTOM LINE: Apache Hadoop was inspired by Google’s 2003 ‘Google File System’ paper. It was named after Doug Cutting’s child’s stuffed elephant. It is a big data platform used by developers, administrators, and big data architects.

WHAT:

Hadoop is a collection of open-source programs and procedures that anyone can use as the “backbone” of their big data operations. It runs on commodity servers; commodity is a fancy name for affordable. Open source means the source code is freely available for anyone to use, at their own risk. Hadoop handles massive amounts of data and encompasses both storage and compute. It is made up of roughly four components: the Hadoop Distributed File System (HDFS), MapReduce, Hadoop Common, and YARN.

HDFS, the Hadoop Distributed File System, allows data to be stored across a linked set of computers. File systems are simple things: they allow data to be stored and easily found by users. HDFS splits each file into blocks and replicates those blocks across machines, so a single failed server does not lose your data.
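To give a feel for it, here is a minimal sketch using Hadoop’s Java FileSystem API. The paths and file names are hypothetical, and it assumes your classpath and configuration already point at a running cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in your config points at the cluster,
        // e.g. hdfs://namenode:8020
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS, then list the target directory.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"),
                             new Path("/data/raw/sales.csv"));
        for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

To your program it looks like one big file system; behind the scenes, HDFS is spreading and replicating those blocks across the cluster.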

MapReduce is a system that lets you write programs that read and write data stored in HDFS. A Map filters or transforms data, and a Reduce performs a summary, or grouping, on that data. The big advantage is that when you run your program through MapReduce, it automatically runs on all nodes in a Hadoop cluster, so you process data in a “divide and conquer” fashion.
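The classic example is word count, shown below essentially as it appears in the Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's slice of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts that arrive for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice what you do not write: Hadoop splits the input, shuffles the (word, 1) pairs so all counts for the same word reach the same reducer, and retries failed tasks. Your code only describes the map and reduce steps.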

Hadoop Common is a collection of shared utilities and libraries that support the other Hadoop modules: the Java libraries, OS-level abstractions, and the files and scripts needed to start Hadoop. (In case you’re not familiar with Java, it is a popular programming language released by Sun Microsystems in 1995.)

YARN is the resource management and job scheduling technology in the Hadoop distributed processing framework. It is responsible for allocating system resources and scheduling tasks within the Hadoop ecosystem. In shared environments like Hadoop, you don’t want any program to hog resources and overwhelm the system, so you use YARN as a way to play nice with the other programs.
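As a hedged illustration, two of the knobs that enforce this live in yarn-site.xml; the values below are made-up examples, not recommendations:

```xml
<!-- yarn-site.xml (illustrative values only; tune for your hardware) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- total RAM each node offers to YARN containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>  <!-- the largest container any single task may request -->
</property>
```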

BOTTOM LINE: Apache Hadoop is a collection of open-source tools and components that enable people to manage big data. It includes storage (HDFS) and compute (MapReduce) managed by YARN, plus a set of shared interfaces and components called Hadoop Common.

WHERE:

Apache Hadoop can be installed on premises, on your laptop, or in the cloud. You can download and install Hadoop from any of the major distribution providers (Cloudera, Hortonworks, and MapR) to a virtual machine and get the basics down very quickly. You can also spin up an environment in Amazon Web Services or Microsoft Azure.

BOTTOM LINE: Each distribution provider has something a little different to offer. I would recommend trying them out on your local laptop, if you are so inclined, and going through their sample exercises. If you want to dip your toe into the Apache Hadoop environment, go to the MapR, Cloudera, or Hortonworks website and give them a try. Each one will require about an afternoon to get running and a bit more time to explore the exercises. There is also a ton of help available on the web.

WHEN:

If you want to process large amounts of data (terabytes or petabytes), then Apache Hadoop might be a good solution for you. If you want to offload extract-transform-load (ETL) operations from a traditional SQL-based solution, Hadoop might also be a decent fit.
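This article doesn’t prescribe a tool for that offload, but as one hedged example, Apache Sqoop (a Hadoop-ecosystem tool for bulk transfer between relational databases and HDFS) can pull a SQL table into HDFS in parallel. The connection details below are made up:

```
# Import the 'orders' table from a hypothetical MySQL database into HDFS,
# using 4 parallel map tasks (-P prompts for the password).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```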

Hadoop is great for storing and processing a variety of data, including flat files, complex data, images, and video. You can also process data within the Hadoop platform itself with MapReduce and other utilities, which saves you from having to move, or make copies of, data across your network.

MapReduce enables you to process massive amounts of data in batch fashion over a parallel set of linked computers.
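Returning to the WordCount example from the WHAT section, running such a batch job is a two-command affair (the jar name and paths are hypothetical):

```
# Run the job across the cluster, then peek at the results in HDFS.
hadoop jar wordcount.jar WordCount /data/raw/books /data/out/wordcounts
hdfs dfs -cat /data/out/wordcounts/part-r-00000 | head
```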

BOTTOM LINE: Landing data and processing data are two different things. Landing data is simple in Apache Hadoop, whereas processing data in Hadoop can be challenging. The skill sets required to program in MapReduce may not be available to your organization, so you must take that into consideration. Tools like Arcadia Enterprise are evolving to enable your organization to leverage data in Hadoop easily and rapidly.

WHY:

Apache Hadoop was constructed to provide a toolset that could handle big data use cases. You might have heard about the three Vs of big data: volume, variety, and velocity. Volume is the sheer size of the data, which can be measured in many ways; here we are generally talking in terms of terabytes or even petabytes.

Variety implies that data comes in many different forms, including traditional or flat data. Flat data fits nicely into a table or an Excel spreadsheet. Complex data includes data in XML or JSON format. Think of complex data as having many dimensions: for example, a single record describing a customer together with all of that customer’s orders. This data does not fit nicely into rows and columns. It can also include relationships, and relationships can be complex. Hence, complex data.
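To make that concrete, here is a hypothetical nested customer record in JSON; the repeated orders inside one record are exactly what refuses to flatten into a single table row:

```json
{
  "customerId": 1042,
  "name": "Jane Doe",
  "orders": [
    {"orderId": "A-100", "total": 59.99, "items": ["widget", "gadget"]},
    {"orderId": "A-101", "total": 12.50, "items": ["sprocket"]}
  ]
}
```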

Velocity means that data travels at different speeds, and combined with volume that can make things pretty interesting. As an example, think of automotive telematics: the sensors on a single automobile can transmit thousands of data points per second. Now multiply that by the number of connected cars on the road. The velocity and volume of that data are significant and would choke most modern SQL-based platforms. For examples of high-speed streaming data, take a look at Oil & Gas IoT and connected-vehicle use cases.

BOTTOM LINE: Traditional relational database management systems can be very expensive for landing big data and preparing it for use. Offloading data to Apache Hadoop allows the enterprise to scale affordably through commodity hardware.

HOW:

To get started with Apache Hadoop you have several options. Go check out MapR, Cloudera, and Hortonworks; they offer free versions of their products to test and learn on. Cloudera also has a cloud-based service called Cloudera Altus, which lets you automate massive-scale data engineering and analytic database workloads in your public cloud without the headache of managing the infrastructure yourself.

BOTTOM LINE: It is easy to get started with Apache Hadoop using one of the distributions. Going it alone can be very challenging, but there are cloud providers that will enable you to get there quickly and painlessly.

Check out Arcadia Data, the only business intelligence tool built for Hadoop and big data. You can download a free version of our tool, Arcadia Instant, to explore our visualization capabilities. If you are curious to learn more, ping me and I will show you around!

John Thuma

Experienced Data and Analytics guru. 30 years of hands-on keyboard experience. Love hiking, writing, reading, and constant learning. All content is my opinion.