Hadoop Basics

Aditya Tambe
Jul 20, 2017 · 4 min read

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.

Hadoop was created by Doug Cutting.
Hadoop is written in Java.
Hadoop runs in a distributed environment and is built to store, handle, and process very large data sets (petabytes, exabytes, and beyond). Although Hadoop stores petabytes of data, this does not mean Hadoop is a database. It is a framework that handles large amounts of data for processing.

Advantages of Hadoop:

1. Open-source — Apache Hadoop is an open source project, which means its code can be modified to suit business requirements.
2. Distributed Processing — Because data is stored in a distributed manner in HDFS across the cluster, it is processed in parallel on a cluster of nodes.
3. Fault Tolerance — By default, three replicas of each block are stored across the cluster, and this replication factor can be changed as required. If a node goes down, the data on it can easily be recovered from other nodes; failed nodes and tasks are recovered automatically by the framework. This is what makes Hadoop fault tolerant.
4. Reliability — Thanks to data replication, data is stored reliably on the cluster of machines despite machine failures. Even if a machine goes down, your data remains safely stored.
5. High Availability — Data is highly available and accessible despite hardware failure, because multiple copies of it exist. If a machine or a piece of hardware crashes, the data can be accessed through another path.
6. Scalability — Hadoop is highly scalable: new hardware can easily be added to existing nodes. It also provides horizontal scalability, meaning new nodes can be added on the fly without any downtime.
7. Economic — Hadoop is inexpensive to run because it uses a cluster of commodity hardware; no specialized machines are needed. It also yields large cost savings, since adding nodes on the fly is easy: as requirements grow, you can add nodes without downtime and without much advance planning.
8. Easy to use — The client does not need to deal with distributed computing; the framework takes care of everything, which makes Hadoop easy to use.
9. Data Locality — Hadoop works on the data locality principle: move computation to the data instead of the data to the computation. When a client submits an algorithm, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the algorithm was submitted and processing it there.
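The replication factor described in point 3 is controlled by the `dfs.replication` property in `hdfs-site.xml`. A minimal sketch of how it might be set (the value shown is the default; treat the surrounding file layout as illustrative):

```xml
<!-- hdfs-site.xml: number of replicas HDFS keeps for each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- default is 3; raise or lower as your durability needs require -->
    <value>3</value>
  </property>
</configuration>
```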

Disadvantages of Hadoop:

1. Security Concerns

Just managing a complex application such as Hadoop can be challenging. A classic example is the Hadoop security model, which is disabled by default due to its sheer complexity. If whoever manages the platform lacks the know-how to enable it, your data could be at serious risk. Hadoop also lacks encryption at the storage and network levels, which is a major concern for government agencies and others that prefer to keep their data under wraps.

2. Vulnerable By Nature

Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches. For this reason, several experts have suggested dumping it in favor of safer, more efficient alternatives.

3. Not Fit for Small Data

While big data isn’t exclusively for big businesses, not all big data platforms are suited to small data needs, and unfortunately Hadoop happens to be one of them. Because of its high-capacity design, the Hadoop Distributed File System (HDFS) cannot efficiently support random reads of small files. As a result, it is not recommended for organizations with small quantities of data.

4. Potential Stability Issues

Hadoop is an open source platform, which essentially means it is built by the contributions of the many developers who continue to work on the project. While improvements are constantly being made, Hadoop, like all open source software, has had its fair share of stability issues. To avoid these issues, organizations are strongly advised to run the latest stable version, or to run Hadoop under a third-party vendor equipped to handle such problems.

Hadoop is composed of two main components:

1. HDFS (Hadoop Distributed File System), for storage
2. MapReduce, for processing
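To give a feel for the MapReduce model, here is a toy, single-process word-count sketch in plain Java. It is not Hadoop API code (real jobs extend `org.apache.hadoop.mapreduce.Mapper` and `Reducer` and run on the cluster); it only mimics the map, shuffle-by-key, and reduce phases in memory, and the class name `WordCountSketch` is invented for this illustration.

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce model -- NOT actual Hadoop API code.
public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Shuffle" groups the pairs by key; "Reduce" sums the values per key.
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big cluster", "data locality");
        System.out.println(run(input));
    }
}
```

In real Hadoop, the shuffle happens across the network between cluster nodes, and the map tasks are scheduled on the nodes that already hold the data blocks (the data-locality principle described above).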
