Introduction to Hadoop and HDFS

Tuba Ozdemir
BilgeAdam Teknoloji
Jul 15, 2022

Apache Hadoop

As Wikipedia defines it, Apache Hadoop is a collection of open-source software tools that make it easy to use a network of many computers to solve problems involving large amounts of data and computation.(*)

Additionally, let’s clear up a common misconception: Hadoop is not a database. It is an open-source Apache project (with several sub-projects).

Apache Hadoop provides scalable and economical data storage, processing, and analysis. It is fault-tolerant and distributed.

Hadoop’s core projects are HDFS, MapReduce, and YARN. In this article, we will talk about HDFS.

There are also Hadoop distributions such as Cloudera, MapR, and Hortonworks.

Common use cases include data storage and analysis, collaborative filtering, text mining, ETL, index building, and more.

HDFS

HDFS is a file system written in Java and based on the Google File System. It provides redundant storage for massive amounts of data and runs on readily available, industry-standard hardware.

Let’s talk about a few features of HDFS:

HDFS performs best with large files, not small ones.

Files in HDFS are write-once, i.e., you cannot update a file after it is written, and random writes to files are not allowed (see the sketch after this list).

HDFS is optimized for large, streaming reads rather than random reads, so it is not a substitute for an RDBMS.
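
To make the write-once model concrete, here is a minimal, hypothetical Java sketch (the NameNode address and file path are placeholders, not values from this article): a file is created, written in one streaming pass, and closed; there is no call for seeking back and overwriting bytes in the middle of an existing file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nnhost:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/notes.txt"); // placeholder path

        // Write the whole file in a single streaming pass, then close it.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("HDFS files are written once, in a streaming fashion.\n");
        }
        // To change the contents later you rewrite (or append to) the file;
        // there is no in-place update of existing bytes.
        fs.close();
    }
}
```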

How to access and use HDFS:

HDFS requires a Linux operating system; it runs on Linux, not Windows. You can access it from the command line, from Spark (by URI: hdfs://nnhost:port/file…), or through the Java API, as sketched below.
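
As a minimal sketch of the Java API route (again, the NameNode address and file path below are placeholders), a client reads a file from HDFS roughly like this:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nnhost:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/input.txt"); // placeholder path

        // Open the file and print it line by line.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```

Running a class like this assumes the Hadoop client libraries are on the classpath and the machine can reach the cluster; the same hdfs://nnhost:port/file style of URI is what Spark uses when it reads from HDFS.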

HDFS is also used by MapReduce, Impala, Hue, and Sqoop.

We have covered the definitions and requirements of Hadoop and HDFS. In the next article, we will access HDFS from the command line and transfer data.

If you have any feedback about this story, I would be very happy to hear from you. Thank you for your time.
