Introduction to Hadoop and HDFS
Apache Hadoop
By Wikipedia's definition, Apache Hadoop is a collection of open-source software utilities that make it easy to use a network of many computers to solve problems involving large amounts of data and computation.(*)
Additionally, let’s clear up a common misconception: Hadoop is not a database. It is an open-source project (with several sub-projects) under the Apache Software Foundation.
Apache Hadoop provides scalable and economical data storage, processing, and analysis. It is distributed and fault tolerant.
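To see why the storage is fault tolerant, here is a minimal, hypothetical sketch (plain Python, not actual Hadoop code) of the replication idea: each block is copied to several distinct nodes, so no single node failure loses the only copy. HDFS defaults to a replication factor of 3; the round-robin placement below is a simplification of the real placement policy.

```python
# Hypothetical sketch of block replication (not actual Hadoop code).
# Each block is assigned to `replication` distinct nodes, round-robin,
# so losing any single node never loses the only copy of a block.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["node1", "node2", "node3", "node4"])
# Every block keeps surviving copies if any single node fails.
for holders in placement.values():
    for failed in ["node1", "node2", "node3", "node4"]:
        assert any(h != failed for h in holders)
```

In real HDFS the NameNode tracks where each block's replicas live and re-replicates blocks when a DataNode dies, which is what makes the cluster fault tolerant without operator intervention.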
Hadoop’s core components are HDFS, MapReduce, and YARN. In this article, we will focus on HDFS.
There are also Hadoop distributions such as Cloudera, MapR, and Hortonworks.
Common use cases include data storage and analysis, collaborative filtering, text mining, ETL, and index building.
HDFS
HDFS is a file system written in Java and based on the Google File System. It provides redundant storage for massive amounts of data and runs on readily available, industry-standard hardware.
Let’s talk about a few features of HDFS:
HDFS performs best with large files, not small ones.
Files in HDFS are write-once; that is, you cannot update files, and random writes to files are not allowed.
HDFS is optimized for large, streaming reads rather than random reads, so it is not suitable as a replacement for an RDBMS.
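These features follow from how HDFS stores data: a file is split into fixed-size blocks (128 MB by default in recent Hadoop versions) that are written once and read sequentially. A rough illustration in plain Python (not Hadoop code, and with a tiny block size so the numbers are easy to follow):

```python
# Illustration of fixed-size block splitting. HDFS defaults to 128 MB
# blocks; a 128-byte block size is used here purely for readability.

def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks, like HDFS splits a file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44] -- only the last block is short
```

This also hints at the small-files problem: every file, however tiny, costs at least one block entry in the NameNode's in-memory metadata, which is why HDFS performs best with a smaller number of large files.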
How to access and use HDFS:
HDFS runs on the Linux operating system, not Windows. You can access it from the command line, from Spark (by URI: hdfs://nnhost:port/file…), or through the Java API.
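The hdfs:// URI mentioned above follows the standard scheme://host:port/path shape, so an ordinary URL parser can pick it apart. A quick sketch (the NameNode host, port, and file path below are placeholders, not a real cluster):

```python
from urllib.parse import urlparse

# Placeholder NameNode host/port and path, matching the
# hdfs://nnhost:port/file pattern from the text.
uri = urlparse("hdfs://nnhost:8020/user/data/input.txt")
print(uri.scheme)    # hdfs
print(uri.hostname)  # nnhost
print(uri.port)      # 8020
print(uri.path)      # /user/data/input.txt
```

When a client (Spark, the `hdfs` CLI, or the Java API) sees such a URI, the host and port identify the NameNode to contact, and the path names the file within the distributed file system.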
HDFS is also used by MapReduce, Impala, Hue, and Sqoop.
We have covered the definitions and requirements of Hadoop and HDFS. In the next article, we will access HDFS from the command line and transfer data.
If you have any feedback about this story, I would be very happy to hear from you. Thank you for your time.