Apache Hadoop: A Review on Security Issues and Solutions for HDFS
A deep dive into the security issues that arise in HDFS and the technologies available to protect it.
Big data is trending. Smart devices, the Internet and related technologies allow data to be generated and transmitted without practical limit, and new information is gained from that data. Big data comes in various forms: structured, semi-structured or unstructured. Traditional data processing techniques such as the Relational Database Management System (RDBMS) can no longer store or process big data, because it has wide variety, extremely large volume, and is generated at high speed. This is where Hadoop comes in. Hadoop can process all types of data quickly, close to real time, and at minimal cost.
Hadoop is an open source Apache framework written in the Java programming language. It is designed to support distributed parallel processing of large-scale datasets across clusters of computers using a simple programming model. The two main components of Hadoop are the Hadoop Distributed File System (HDFS) for big data storage and MapReduce for big data processing, as shown in Figure 1. Both components implement a master and slave architecture: every cluster consists of one master node and multiple slave nodes. In HDFS, the master node is the Name Node and the slave nodes are Data Nodes. The Name Node stores the metadata, for example the filename, the file attributes and the locations of the blocks where the data is stored. The Data Nodes store the file contents themselves, and each block is replicated to Data Nodes located in other racks. In MapReduce, the master node is the Job Tracker and the slave nodes are Task Trackers. The Job Tracker distributes the jobs, while the Task Trackers perform them.
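The division of labour between the Name Node and the Data Nodes can be illustrated with a minimal in-memory sketch. It is not Hadoop code: the class names, the tiny block size, the replication factor of three and the round-robin placement are all assumptions made for illustration, and real HDFS uses a rack-aware placement policy instead.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4    # bytes; deliberately tiny so a short string spans several blocks
REPLICATION = 3   # illustrative replication factor (assumption)


@dataclass
class DataNode:
    """Slave node: holds the actual block contents."""
    name: str
    rack: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    """Master node: holds only metadata (which blocks make up a file, and where they live)."""
    datanodes: list
    metadata: dict = field(default_factory=dict)  # filename -> [(block_id, [datanode names])]

    def write(self, filename: str, data: bytes) -> None:
        entries = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            # Naive round-robin placement; it does not enforce cross-rack copies.
            copies = min(REPLICATION, len(self.datanodes))
            targets = [self.datanodes[(index + k) % len(self.datanodes)]
                       for k in range(copies)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            entries.append((block_id, [node.name for node in targets]))
        self.metadata[filename] = entries

    def read(self, filename: str) -> bytes:
        by_name = {node.name: node for node in self.datanodes}
        # Read each block from its first listed replica and reassemble the file.
        return b"".join(by_name[locations[0]].blocks[block_id]
                        for block_id, locations in self.metadata[filename])
```

The sketch makes the security implication visible: the Name Node never holds file contents, but losing its metadata map makes every block unlocatable, which is why the Name Node is such an attractive target.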
Managing big data in a distributed programming framework is a challenge. Various issues can arise when handling big data, for instance management, processing, storage and security issues. The security issues begin when a mammoth volume of data is stored in a database in an irregular format or without encryption. Moreover, some of the tools and technologies used for handling large datasets were not developed with proper security and policy certificates. By attacking the data storage, system hackers and data hackers can steal the data and copy it to any type of storage device, such as a hard disk. The types of attack they can launch include Denial of Service (DoS), spoofing, brute force and cross-site scripting (XSS).
Flaws or weaknesses in design, implementation, internal controls or system security procedures make distributed systems vulnerable. Security breaches can result from vulnerabilities that are intentionally exploited or accidentally triggered. Such vulnerabilities expose components of the Hadoop environment, such as Storm and Flink, to attack. One classification separates vulnerabilities into three categories: infrastructure security, data management and data privacy. These categories can be further divided along three dimensions: the architecture dimension, the data value chain dimension and the data life cycle dimension. Infrastructure security belongs to the architecture dimension and refers to hardware and software vulnerabilities. Data privacy belongs to the data life cycle dimension and covers data in transit and data at rest.
Another classification groups the vulnerabilities of Hadoop into three categories: software / technology, web interface / configuration and security / network policy. According to the author, Hadoop is written in Java, a programming language that cybercriminals have exploited and compromised in various security breaches. This is the technology or software vulnerability. Hadoop is also vulnerable in its configuration because it ships with various default settings, for example default ports and IP addresses, which have been exploited. The Hadoop web interfaces, such as Hue (an open source SQL Cloud Editor licensed under the Apache License 2.0), are weak against XSS attacks. Moreover, the Hadoop framework consists of multiple databases that deploy different policies, and policies that are not configured properly lead to vulnerabilities.
The consequences of a lack of security can be critical to an organization. HDFS is the base layer of the Hadoop architecture, as shown in Figure 1. Its main functions are data storage and data processing, and hence it is more sensitive to security issues. Without appropriate security measures, unauthorized data access, data theft and unwanted disclosure of information can happen. Losing profit because proprietary information is stolen, or losing some stored data, brings trouble to the organization, but the consequences are relatively small and recoverable. However, if the data theft discloses customers' private information, the image of the organization can be harmed and customers can lose trust in it. Leakage of private information from a financial agency can cause a series of misfortunes for the customers themselves, for example financial scams, the use of customer information to acquire funds, or impersonation of customers to borrow from their acquaintances. The consequences can be worse when the affected organization holds confidential information, for example a government department, where chaos might result even if the hacker merely manipulates the data. Because the security level of Hadoop technology is not satisfactory, most government departments and organizations do not use the Hadoop environment to store valuable data.
HDFS security is crucial to organizations that store their valuable data in the Hadoop environment. HDFS is vulnerable to various forms of attack, such as the DoS attack, which is accomplished by causing a crash or by flooding the target with traffic. The Name Node in HDFS is vulnerable to DoS attacks. The Name Node coordinates with the Job Tracker in MapReduce to execute data processing tasks, so a DoS attack on the Name Node can stop the read-write operations of HDFS and, in turn, disrupt data processing. Prior work proposes three approaches to secure the data in HDFS: the Kerberos Mechanism, the Bull Eye Algorithm Approach and the Name Node Approach. This paper examines those three security solutions, together with other security solutions and tools applicable to HDFS, and discusses their pros and cons. With a comparison among the solutions, HDFS users can decide which solution to use based on their context.
II. Available Security Tools
Various tools and solutions are available to secure the HDFS environment, each with different features and different effectiveness in different contexts.
The security tools and solutions can be divided into four categories: encryption, authentication, authorization and audit. Authentication is the verification of a system or user identity before access to the system is granted; in other words, it is the procedure of confirming that users are who they claim to be. Two common authentication technologies are the Lightweight Directory Access Protocol (LDAP), for directory, identity and other services, and Kerberos.
Authorization is the process of determining the access rights of a user, specifying what they can do with the system. Because Hadoop mixes various systems in its environment, it requires numerous authorization controls with different granularities. In Hadoop, setting up and maintaining authorization controls is simplified by dividing users into groups defined in an existing LDAP directory or Active Directory (AD). Authorization can also be set up by granting role-based access control to similar connection methods. The popular tool for authorization control is Apache Sentry.
Data encryption is the process of converting data from a readable format into an encoded format that can only be read or written after it is decrypted. Encryption ensures the privacy and confidentiality of the data and secures the sensitive data stored in Hadoop. There are two types of data encryption: encrypting data in transit and encrypting data at rest. For HDFS, encryption of data in transit can be enabled through configuration, but Kerberos must be enabled first. HDFS transparent encryption, as provided in Cloudera, applies transparent, end-to-end encryption to data read from and written to HDFS blocks across the cluster. It is built around keys that are used to encrypt and decrypt each file.
Audit refers to the periodic verification of the entire Hadoop ecosystem and the deployment of a log monitoring system. HDFS and MapReduce provide basic audit support. Because security breaches can result from vulnerabilities that are intentionally exploited or accidentally triggered, auditing is important for meeting security compliance requirements.
The following sections discuss the security tools and solutions available for HDFS.
A. Kerberos Protocol
Kerberos, developed by MIT, is the most popular tool for authentication and the primary authentication mechanism for Hadoop. The Kerberos protocol provides secure communication over a non-secure network by using secret-key cryptography. The protocol is shown in Figure 2. The client first requests a Ticket Granting Ticket (TGT) from the Authentication Server (AS) of the Key Distribution Centre (KDC). After receiving the TGT, the client uses it to request a Service Ticket (ST) from the Ticket Granting Server (TGS). The client can then use the ST to authenticate to a Name Node. The TGT and ST are renewed during long-running jobs. The greatest benefit of Kerberos is that a stolen ticket cannot be renewed. Kerberos provides powerful authentication for Hadoop: instead of relying on a password alone, a cryptographic mechanism is used when requesting services.
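The ticket exchange described above can be mimicked with a small simulation. This is a sketch of the message flow only, not of Kerberos itself: real Kerberos encrypts tickets and uses session keys and authenticators, whereas this sketch merely signs tickets with HMAC-SHA256 from the standard library. The class names, key values, `as-req` challenge string and ticket lifetimes are all hypothetical.

```python
import hashlib
import hmac
import json
import time


def seal(key: bytes, payload: dict) -> dict:
    """Issue a ticket: serialise the payload and attach a MAC under the given key."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {"body": body, "mac": hmac.new(key, body, hashlib.sha256).hexdigest()}


def unseal(key: bytes, ticket: dict) -> dict:
    """Accept a ticket only if its MAC verifies and it has not expired."""
    expected = hmac.new(key, ticket["body"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["mac"]):
        raise PermissionError("ticket was not issued with this key")
    payload = json.loads(ticket["body"])
    if payload["expires"] < time.time():
        raise PermissionError("ticket expired")
    return payload


class KDC:
    """Key Distribution Centre hosting both the AS and the TGS (simplified)."""

    def __init__(self, user_keys: dict, service_keys: dict):
        self.user_keys = user_keys
        self.service_keys = service_keys
        self.tgs_key = b"tgs-secret"  # hypothetical key known only to the KDC

    def authentication_server(self, user: str, proof: str) -> dict:
        # Step 1: the client proves knowledge of its long-term key and receives a TGT.
        expected = hmac.new(self.user_keys[user], b"as-req", hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, proof):
            raise PermissionError("unknown user or wrong key")
        return seal(self.tgs_key, {"user": user, "expires": time.time() + 600})

    def ticket_granting_server(self, tgt: dict, service: str) -> dict:
        # Step 2: a valid TGT is exchanged for a Service Ticket sealed for one service.
        user = unseal(self.tgs_key, tgt)["user"]
        return seal(self.service_keys[service],
                    {"user": user, "service": service, "expires": time.time() + 300})


class NameNodeService:
    def __init__(self, key: bytes):
        self.key = key

    def authenticate(self, service_ticket: dict) -> str:
        # Step 3: the Name Node checks the ST with its own key; no password is sent.
        return unseal(self.key, service_ticket)["user"]
```

The point of the sketch is that the Name Node never sees the user's secret: it trusts the ST because only the KDC shares its key, and every ticket carries an expiry time.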
B. Bull Eye Algorithm
The Bull Eye Algorithm is another approach from the same source, which claims that it can provide 360° security for sensitive data from node to node in HDFS. According to the author, the approach is used by Dataguise's DgSecure and Amazon Elastic MapReduce. The algorithm concentrates on sensitive data only. It scans the data before the Data Node stores it into blocks, and then scans the blocks to determine whether the sensitive data is stored properly and free of risk. It allows only authorized persons to read and write the data, and during read-write operations it ensures that the relations between the racks are safe. In a nutshell, the algorithm enhances the security level of the Data Node in HDFS.
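The internals of the Bull Eye Algorithm are not given here, so the following is only a loose sketch of the three behaviours described: scan before storing, re-scan stored blocks, and restrict access to sensitive data. The regular expressions, the `GuardedStore` class and its labelling scheme are assumptions invented for illustration and should not be read as the algorithm itself.

```python
import re

# Hypothetical detectors for "sensitive" content; a real deployment would use many more.
SENSITIVE_PATTERNS = {
    "credit_card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_sensitive(text: str) -> list:
    """Return the sorted names of all sensitive patterns found in the text."""
    return sorted(name for name, pattern in SENSITIVE_PATTERNS.items()
                  if pattern.search(text))


class GuardedStore:
    """Toy block store that labels sensitive blocks and checks who may touch them."""

    def __init__(self, authorized_users):
        self.authorized = set(authorized_users)
        self.blocks = {}  # block_id -> {"data": str, "labels": [str]}

    def write(self, user: str, block_id: str, text: str) -> None:
        labels = find_sensitive(text)  # first scan: before the data reaches a block
        if labels and user not in self.authorized:
            raise PermissionError(f"{user} may not write sensitive data")
        self.blocks[block_id] = {"data": text, "labels": labels}

    def read(self, user: str, block_id: str) -> str:
        block = self.blocks[block_id]
        if block["labels"] and user not in self.authorized:
            raise PermissionError(f"{user} may not read sensitive data")
        return block["data"]

    def audit(self) -> list:
        """Second scan: list blocks whose content no longer matches their labels."""
        return [block_id for block_id, block in self.blocks.items()
                if find_sensitive(block["data"]) != block["labels"]]
```

In this sketch, a non-empty result from `audit()` signals a block that was modified or stored outside the guarded write path.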
C. Name Node Approach
The Name Node Approach is the third solution from the same source, which proposes using two Name Nodes in HDFS. A single Name Node makes HDFS more vulnerable, because the system goes down when the Name Node goes down. This approach provides a “backup plan”: the system keeps running when one of the Name Nodes is down. The two redundant Name Nodes are provided by Name Node Security Enhance (NNSE), which prevents new issues from arising when there are two Name Nodes. When both Name Nodes are alive, one acts as the Master Node and the other as the Slave Node. When the current Master Node crashes, the Slave Node, with permission from NNSE, takes over in a secure manner to cover data unavailability and time lag.
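The failover behaviour can be sketched as follows. How NNSE actually grants permission is not described here, so the `NNSE` class below is a hypothetical coordinator that simply promotes the surviving Name Node when the current master is found dead.

```python
class NameNodeReplica:
    """One of the two redundant Name Nodes."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True


class NNSE:
    """Hypothetical stand-in for Name Node Security Enhance: decides which node is master."""

    def __init__(self, first: NameNodeReplica, second: NameNodeReplica):
        self.nodes = [first, second]
        self.master = first  # the other node starts as the slave

    def active(self) -> NameNodeReplica:
        """Return the current master, promoting the slave if the master has crashed."""
        if not self.master.alive:
            standby = next((node for node in self.nodes
                            if node is not self.master and node.alive), None)
            if standby is None:
                raise RuntimeError("no Name Node available")
            self.master = standby  # promotion happens only through NNSE
        return self.master
```

Routing every promotion through a single coordinator is what keeps the two Name Nodes from both acting as master at the same time.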
D. Apache Knox
Apache Knox Gateway (“Knox”) is a single access point to one or more Hadoop clusters. Knox provides perimeter security, which allows an organization to extend Hadoop access to more users while complying with enterprise security policies. Kerberos completes the security of a Hadoop cluster, but its client-side configuration is complex. Knox encapsulates Kerberos, which eliminates client-side configuration and simplifies the model. Furthermore, Knox can authenticate user credentials against AD/LDAP through its REST API-based perimeter security system. Knox supports multi-cluster security management and integrates with existing IdM systems, for example providing SSO for Hadoop UIs (Ranger).
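As an illustration of the single access point, the sketch below builds the kind of URL a client would use to reach WebHDFS through a Knox gateway, together with an HTTP Basic credential header of the sort Knox can check against LDAP/AD. The host name, topology name and credentials are placeholders, and the port `8443` and the `gateway` path segment are assumed to match a common default deployment; an actual installation may differ.

```python
import base64
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host: str, topology: str, hdfs_path: str, op: str,
                     port: int = 8443, gateway_path: str = "gateway") -> str:
    """Build a WebHDFS URL that goes through the gateway instead of the Name Node."""
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return (f"https://{gateway_host}:{port}/{gateway_path}/{topology}"
            f"/webhdfs/v1/{path}?{query}")


def basic_auth_header(user: str, password: str) -> dict:
    """HTTP Basic credentials; the gateway, not the client, talks Kerberos to the cluster."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```

The client only needs HTTPS and a username and password; no Kerberos client configuration appears anywhere in the sketch, which is the simplification Knox is credited with.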
E. Apache Ranger
Apache Ranger is an authorization system that allows authenticated users to access Hadoop cluster resources such as Hive tables and HDFS files. Ranger provides comprehensive security across the Hadoop elements. Its goals are to centralize security administration, provide standardized and fine-grained authorization, support enhanced authorization methods, and centralize auditing of security-related administrative actions and user access across all Hadoop components. For data protection, Ranger uses wire encryption.
F. Apache Sentry
Apache Sentry is an open source project by Cloudera that serves as an authorization module for Hadoop. Sentry supports role-based authorization, multi-tenant administration and fine-grained authorization. It provides unified administration of metadata and shared data for access frameworks such as HDFS and Hive, and it is a pluggable authorization engine for HDFS, Hive and other Hadoop elements. In other words, Sentry is used to define what users and applications can do with data, with different users holding different authorizations.
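Role-based authorization of this kind can be sketched as a chain of lookups from user to group, group to role, and role to privilege. The sketch below is not Sentry's API; the users, groups, roles and resource paths are invented for illustration.

```python
# Hypothetical directory data: which groups each user belongs to (for example from LDAP/AD).
GROUPS = {"alice": {"analysts"}, "bob": {"interns"}}

# Hypothetical role grants: which roles each group holds.
ROLES = {"analysts": {"sales_reader"}, "interns": set()}

# Hypothetical privileges: which (resource, action) pairs each role allows.
PRIVILEGES = {"sales_reader": {("hdfs:/warehouse/sales", "read")}}


def is_allowed(user: str, resource: str, action: str) -> bool:
    """A user is allowed if any role of any of their groups grants the privilege."""
    for group in GROUPS.get(user, ()):
        for role in ROLES.get(group, ()):
            if (resource, action) in PRIVILEGES.get(role, ()):
                return True
    return False
```

Because privileges attach to roles and not to individual users, changing what a whole team can do means editing one role, which is the administrative simplification that role-based control is meant to deliver.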
Table 1 shows a comparison of the security tools and solutions described in the previous section in terms of features and functionalities.
Six security tools and solutions are discussed in this paper: the Kerberos Mechanism and Apache Knox for authentication, Apache Sentry and Apache Ranger for authorization, and the Bull Eye Algorithm and the Name Node Approach for audit.
For authentication, Apache Knox is better than the Kerberos Mechanism. Knox not only eliminates the client-side configuration that is so complex in Kerberos, but also simplifies the model. Moreover, besides authentication, Knox is also capable of authorization control and audit, while Kerberos supports authentication only.
Apache Sentry and Apache Ranger are both capable of authorization and authentication, and Apache Ranger also supports audit. However, this information is not enough to judge which tool is better, so the availability of vendor support is also considered. Apache Sentry is supported by Cloudera, a leading company in big data solutions, while Apache Ranger has no formal support. Considering support availability, Apache Sentry is the better solution.
For audit, the two solutions suggested in this paper are the Bull Eye Algorithm and the Name Node Approach. The Bull Eye Algorithm can audit the entire HDFS to prevent security breaches. It scans the data before it is stored into blocks, and then scans the blocks to check whether the data is stored securely. The Bull Eye Algorithm enhances the security of the Data Node in HDFS, while the Name Node Approach secures HDFS continuity by ensuring continuous service from the Name Node: a second Name Node replaces the current one when it goes down. If there is only one Name Node and it goes down because of an attack, the whole HDFS becomes unavailable. Hence, the Name Node Approach is the system's second chance. It would be good to have both the Bull Eye Algorithm and the Name Node Approach in HDFS, but the Bull Eye Algorithm is better than the Name Node Approach for auditing the whole HDFS system. This is because the Bull Eye Algorithm covers the entire data read-write operation and the authorization of users for that operation, while the Name Node Approach focuses only on the Name Node itself.
Data encryption solutions are not discussed in this paper, because encryption can be enabled through configuration in HDFS. However, it is worth noting that the vulnerability and effectiveness of data encryption depend on the security of the keys. In most cases the keys are stored on local disk drives, so they have a high chance of being stolen by data hackers. This problem can be avoided by using a key management service to distribute the keys and certificates. It is more effective when combined with HDFS encryption zones, where different keys are used for each user, application and tenant. The combination might require extra setup steps, but it is essential.
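The benefit of combining a key management service with per-zone keys can be sketched with envelope encryption: each file gets its own data key, and only a wrapped copy of that key, protected by the zone key held in the key service, is stored alongside the file. Everything in the sketch is an assumption made for illustration. In particular, `keystream_xor` is a toy cipher built from SHA-256 that stands in for a real cipher such as AES and must not be used to protect real data.

```python
import hashlib
import os


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Applying it twice restores the input."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class KeyManagementService:
    """Hypothetical key service: zone keys never leave it, only wrapped file keys do."""

    def __init__(self):
        self._zone_keys = {}

    def create_zone_key(self, zone: str) -> None:
        self._zone_keys[zone] = os.urandom(32)

    def new_file_key(self, zone: str):
        """Return a fresh per-file key and its wrapped form for storage with the file."""
        file_key = os.urandom(32)
        return file_key, keystream_xor(self._zone_keys[zone], file_key)

    def unwrap(self, zone: str, wrapped_key: bytes) -> bytes:
        """Recover a file key; only callers allowed to use this zone should get this far."""
        return keystream_xor(self._zone_keys[zone], wrapped_key)
```

With this layout, a key stolen from a local disk is at most a wrapped file key, which is useless without the zone key, and a zone key compromised for one tenant does not unlock the files of another.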
For the future of Hadoop security, a security tool or solution does not need to cover every security aspect of Hadoop. It can focus on one aspect only, for example authentication, and improve its security level over time. As mentioned in the earlier sections, one source of Hadoop's vulnerability is the improper configuration of the different policies of different databases. In future development, developers can focus on providing proper configurations for the policies of different databases.
In a business context, the security tools and solutions should also improve their model and configuration by simplifying setup and maintenance, so that the tools can reach more users. When the process is too complex, an organization that deploys the security tools might need to hire an expert for setup and maintenance, which can deter the organization from using the tool. In addition, the availability of documentation for the latest updates of the security tools, and of direct technical support, is critical when an organization selects its security tools. Developers of Hadoop security solutions and tools should always consider how to attract more users, because a product is meaningless when nobody uses it.
Traditional databases are insufficient to handle big data, so Hadoop was introduced. However, the vulnerabilities of the system increase as the size of the data increases. This paper explained the types of vulnerability in Hadoop: software / technology vulnerability, web interface / configuration vulnerability and security / network policy vulnerability. The consequences of a lack of security vary with the data held. Insufficient security in the Hadoop ecosystem can lead not only to loss of data, but also to unwanted disclosure of users' private data. HDFS, as the base of the Hadoop architecture, is more sensitive to security issues, and an HDFS malfunction can cause other Hadoop elements such as MapReduce to malfunction. Thus, it is essential to improve the security of HDFS.
There are four security aspects to consider when setting up security solutions: authentication, authorization, audit and data encryption. For authentication, the security tools discussed in this paper are the Kerberos Mechanism and Apache Knox. For authorization, the tools are Apache Ranger and Apache Sentry. For audit, the Bull Eye Algorithm and the Name Node Approach are studied. Data encryption tools are not discussed, because encryption can be enabled through configuration in HDFS and its effectiveness is directly related to the security of the keys. The tools are compared on the four security aspects, the complexity of usage and the availability of support. Based on the comparison, Apache Knox, Apache Sentry and the Bull Eye Algorithm are the better security solutions for authentication, authorization and audit respectively. Future development in Hadoop security can aim at more focused security solutions with stronger security, and at more technical support to extend the security solutions to more users.
 P. Vijay and B. Keshwani, “Emergence of Big Data with Hadoop : A Review,” IOSR Journal of Engineering (IOSRJEN), vol. 06, no. 03, pp. 50–54, 2016.
 B. Saraladevi, N. Pazhaniraja, P. V. Paul, M. S. Basha and P. Dhavachelvan, “Big Data and Hadoop-A Study in Security Perspective,” Procedia Computer Science, vol. 50, pp. 596–601, 2015.
 G. S. Bhathal and A. Singh, “Big Data: Hadoop framework vulnerabilities, security issues and attacks,” Array, vol. 1, p. 100002, 2019.
 Y. H, C. X, Y. M, X. L, G. J and C. C, “A survey of security and privacy in big data,” in 16th international symposium on communications and information technologies (ISCIT), Qingdao, 2016.
 P. P. Sharma and C. P. Navdeti, “Securing Big Data Hadoop: A Review of Security Issues, Threats and Solution,” International Journal of Computer Science and Information Technologies (IJCSIT), vol. 5, no. 2, pp. 2126–2131, 2014.
 J. Natkins, “Authorization and Authentication In Hadoop,” Cloudera Inc, 20 March 2012. [Online]. Available: https://blog.cloudera.com/authorization-and-authentication-in-hadoop/. [Accessed 1 July 2020].
 “Authentication,” Cloudera Inc., 2020. [Online]. Available: https://docs.cloudera.com/documentation/enterprise/latest/topics/sg_authentication.html#xd_583c10bfdbd326ba--5a52cca-1476e7473cd--7f90. [Accessed 1 July 2020].
 “Authorization,” Cloudera Inc, 2020. [Online]. Available: https://docs.cloudera.com/documentation/enterprise/latest/topics/sg_authorization.html. [Accessed 1 July 2020].
 “What is Data Encryption?,” Kaspersky Lab, 2020. [Online]. Available: https://www.kaspersky.com/resource-center/definitions/encryption. [Accessed 1 July 2020].
 “Configuring Encrypted Transport for HDFS,” Cloudera Inc, 2020. [Online]. Available: https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_sg_hdfs_encrypt_transport.html. [Accessed 1 July 2020].
 “HDFS Transparent Encryption,” Cloudera Inc, 2020. [Online]. Available: https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_hdfs_encryption.html. [Accessed 2 July 2020].
 “Apache Knox Gateway,” Cloudera Inc, 2020. [Online]. Available: https://www.cloudera.com/products/open-source/apache-hadoop/apache-knox.html. [Accessed 1 July 2020].
 “Apache Ranger,” The Apache Software Foundation, 7 August 2019. [Online]. Available: https://ranger.apache.org/. [Accessed 1 July 2020].
 “Apache Sentry,” Cloudera Inc, 2020. [Online]. Available: https://www.cloudera.com/products/open-source/apache-hadoop/apache-sentry.html. [Accessed 1 July 2020].
 A. Lane, “Securing Hadoop: Security Recommendations for Hadoop Environments,” Securosis, L.L.C., Arizona, 2016.
 G. Kapil, A. Agrawal, A. Attaallah, A. Algarni, R. Kumar and R. A. Khan, “Attribute based honey encryption algorithm for securing big data: Hadoop distributed file system perspective,” PeerJ Computer Science, vol. 6, p. e259, 17 February 2020.