How to Set Up a Big Data Analysis Cluster
We demonstrate how to set up a cluster for Big Data analysis with tools such as Hadoop and Spark. Along the way, we highlight some pitfalls we ran into during the installation, hoping that this document helps other researchers to set up or maintain their own Big Data cluster.
Since Spark and Hadoop are open-source frameworks, vendors have developed their own distributions with additional functionality and an improved code base. Using such a vendor distribution has advantages: the vendors usually offer technical support for the installation process and provide further tools for specific tasks. We decided to use Cloudera's distribution, as Cloudera is the market leader with a large community and proper installation documentation.
Since a Big Data analysis cluster usually consists of several computing nodes, we use virtualization to manage these nodes. The rationale is that only one machine has to be set up, which can then easily be duplicated several times to build an entire cluster from scratch. Virtualization also increases security: when a node is under attack, it can simply be shut down and replaced with another clone almost in real time, and the attacked node can be kept for forensic analysis. We decided to use Oracle’s VirtualBox for this purpose, as it is free of charge and runs on both Windows and Linux.
The first part of this document focuses on how to set up a virtual machine which is used as a computing node for the cluster. In the second part, we demonstrate how to set up the physical machines of the cluster where one or more virtual machines can run.
We use Cloudera version 5.8.0 and VirtualBox version 5.0.32. For both the physical and the virtual machines, we use Ubuntu 14.04, as this was the most recent version supported by Cloudera at the time of writing.
While installing the cluster you might face some issues. First, make sure you have enough disk space on the root file system (“/”) of your virtual machine before installing Cloudera. Under some circumstances, Cloudera displays neither an error message nor an obvious visual hint when it hits the disk space limit, which results in a broken installation. Please see  for details on how to set up the disk space on the virtual machines. Second, once the baseline virtual machine is set up you may want to clone it. For that, use the clone functionality of VirtualBox; simply copying and pasting the virtual machine is not the correct way. Third, make sure you reinitialize the MAC addresses of your virtual machine during the cloning process. Otherwise the cluster’s network will not work properly, and you will not get a proper error message. Fourth, all machines need Internet access for the Cloudera installation. In our case, only one machine has Internet access over one interface and reaches the other cluster machines via another interface. You can use this machine as a gateway for the other ones by running
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
where eth0 is the interface that has access to the Internet.
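For the second and third pitfalls, cloning can also be scripted with VBoxManage instead of the GUI. The following is a minimal sketch under assumptions: the VM name baseline and the prefix node are hypothetical, and the function only prints the commands so that they can be reviewed before being run. To our knowledge, VBoxManage clonevm assigns fresh MAC addresses unless --options keepallmacs is passed, but this is worth verifying against the manual of your VirtualBox version.

```shell
#!/bin/sh
# Print one VBoxManage clonevm command per clone (dry run, nothing is executed).
# Arguments: name of the baseline VM, name prefix for the clones, number of clones.
clone_commands() (
    base=$1
    prefix=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # No "--options keepallmacs" here: each clone should get new MAC addresses.
        printf 'VBoxManage clonevm "%s" --name "%s%d" --register\n' "$base" "$prefix" "$i"
        i=$((i + 1))
    done
)

# Hypothetical names: "baseline" is the prepared VM, the clones become node1..node3.
clone_commands baseline node 3
```

Once the printed commands look right, they can be executed by piping the output into sh.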
Good installation guides for Cloudera are easy to find online. In particular, Cloudera provides detailed documentation . We also like to highlight a YouTube video by “masterschema”  that shows, almost in real time, how to set up a cluster on one machine with virtual instances in about 30 minutes.
First Part: Set Up a Virtual Machine for Cloudera
1.) Set up a new virtual machine with VirtualBox and install Ubuntu 14.04. Once Ubuntu is installed, install the VirtualBox Guest Additions , which enable, for example, a proper display resolution (see ). Update and upgrade Ubuntu as soon as possible.
2.) In VirtualBox, under Settings→Network, attach the virtual machine to a Bridged Adapter that refers to the network interface of the physical machine connected to the network containing all cluster nodes. The virtual machine then behaves like a physical machine in that network.
3.) On the virtual machine, install an SSH server and any further tools you need or prefer for your work. For example, we like to work with zsh instead of bash, the text editor joe, and git:
sudo apt-get install openssh-server
sudo apt-get install git-all
sudo apt-get install zsh
sudo apt-get install joe
chsh -s /usr/bin/zsh
The last command permanently changes the login shell to zsh.
4.) For easy maintenance, it is a good idea to generate an SSH key and add it to the list of authorized keys. Since every clone then carries the same key, all machines can easily be accessed from all machines.
cp .ssh/id_rsa.pub .ssh/authorized_keys
Additionally, disable strict host key checking: in the file /etc/ssh/ssh_config, set StrictHostKeyChecking no.
5.) We like some cosmetic changes for the shell, so we add the following lines to ~/.zshrc
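The key setup of step 4 can be sketched as two small helpers. This is an illustration under assumptions, not a fixed recipe: the function names ensure_key and authorize_key are hypothetical, the key is created without a passphrase, and the public key is appended to authorized_keys rather than copied over it, so that existing entries survive.

```shell
#!/bin/sh
# Create an RSA key pair without passphrase unless the key file already exists.
ensure_key() (
    key_file=$1
    mkdir -p "$(dirname "$key_file")"
    [ -f "$key_file" ] || ssh-keygen -q -t rsa -N "" -f "$key_file"
)

# Append a public key to an authorized_keys file unless it is already listed.
authorize_key() (
    pub_key_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -q: quiet, -x: match the whole line, -F: treat the key as a fixed string
    if ! grep -qxF -- "$(cat "$pub_key_file")" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
)
```

On the baseline machine, one would call ensure_key ~/.ssh/id_rsa followed by authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys before cloning.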
alias l='ls -Glh --color'
autoload -U promptinit && promptinit
alias sudo='sudo '
6.) Write the hostname of your machine to the file /etc/hostname.
7.) Change file /etc/hosts to
This step is crucial for the Cloudera installation. Of course, you can choose your own IP range.
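As an illustration of the kind of entries such a file holds, the sketch below prints one line per node in the order IP address, fully qualified name, short name. Everything specific here is a hypothetical placeholder: the helper hosts_entries, the prefix node, the domain cluster.local and the range 192.168.1.101 onwards. Adapt them to your own network before copying anything into /etc/hosts.

```shell
#!/bin/sh
# Print /etc/hosts lines for consecutively numbered nodes.
# Arguments: host name prefix, domain, first three octets, first host octet, node count.
hosts_entries() (
    prefix=$1
    domain=$2
    network=$3
    first=$4
    count=$5
    i=0
    while [ "$i" -lt "$count" ]; do
        n=$((i + 1))
        # Format: <IP> <fully qualified name> <short name>
        printf '%s.%s\t%s%d.%s\t%s%d\n' \
            "$network" "$((first + i))" "$prefix" "$n" "$domain" "$prefix" "$n"
        i=$((i + 1))
    done
)

# Hypothetical example: three nodes, printed to the terminal only.
printf '127.0.0.1\tlocalhost\n'
hosts_entries node cluster.local 192.168.1 101 3
```

The output is meant to be reviewed and then pasted into /etc/hosts on every node, so that all machines share the same name resolution.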
8.) Give your machine a fixed IP address, subnet mask and gateway. To do so, change the file /etc/network/interfaces to something like
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback

auto eth1
iface eth1 inet static
address <IP>
netmask <NETMASK>
gateway <GATEWAY>
dns-nameservers 18.104.22.168 22.214.171.124
Replace <IP>, <NETMASK> and <GATEWAY> with your desired IP address, subnet mask and gateway. This step is crucial for the Cloudera installation.
9.) Your machine must not ask for a password when using sudo. This is crucial for the Cloudera installation. Hence, create a new sudoers file via
sudo visudo -f /etc/sudoers.d/myOverrides
and add the line
<USER> ALL=NOPASSWD: ALL
where <USER> is the current user of the machine with sudo rights (see ).
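The result of step 9 can be verified from the shell. The sketch below is an assumption-laden illustration with hypothetical helper names: sudoers_line prints the entry for a given user, check_sudoers_file asks visudo to validate a file without installing it, and passwordless_sudo_works uses sudo -n, which fails instead of prompting when a password would be required.

```shell
#!/bin/sh
# Print the sudoers entry that lets a user run sudo without a password.
sudoers_line() (
    printf '%s ALL=NOPASSWD: ALL\n' "$1"
)

# Validate the syntax of a sudoers file (visudo check mode, no changes are made).
check_sudoers_file() (
    sudo visudo -c -f "$1"
)

# Succeed only if sudo currently works without asking for a password.
passwordless_sudo_works() (
    sudo -n true 2>/dev/null
)
```

After editing /etc/sudoers.d/myOverrides, running passwordless_sudo_works && echo OK in a fresh shell shows whether the setting is in effect.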
10.) In our experience, it is sometimes useful to have access to the GUI of all the cluster’s nodes. We use Remmina for that. To set up Remmina, follow the instructions of . For easy maintenance, you can also disable the lock screen of your machines under the Ubuntu setting “Brightness & Lock”.
Second Part: Set Up Physical Machines Where Virtual Machines Run
Your cluster consists of physical machines on which one or more virtual machines run. The first part of this document focused on how to set up the virtual machines. The next step is to clone the virtual machine and to copy the clones to your physical machines (make sure you reinitialize the MAC addresses, see above). Here, we give some tips on how to configure your physical machines.
1.) Give your machine a fixed IP by changing file /etc/network/interfaces (see Point 8 above). Thereby, maintenance of your cluster becomes much easier.
2.) From time to time you will restart your physical machines. VirtualBox should then automatically start the virtual machines as soon as the physical machine is running again. Unfortunately, we were not able to start the virtual machine from the command line alone (VBoxManage startvm <NAME VIRTUAL MACHINE> --type headless). We assume that there might be a bug related to the 3D support, since the command line worked once the VirtualBox GUI had been started. To start the VirtualBox GUI automatically after a restart, create a file in .config/autostart with the following content:
3.) Due to space limitations, your virtual machine will likely need access to further parts of the host’s disk space. For that, run
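A minimal sketch of such an autostart entry is given below, written as a small helper that creates the file. The details are assumptions rather than a definitive configuration: the helper name write_autostart_entry and the file name virtualbox.desktop are hypothetical, and the entry simply launches the virtualbox command when the user's desktop session starts.

```shell
#!/bin/sh
# Write an XDG autostart entry that launches the VirtualBox GUI at desktop login.
# Argument: target directory, for example "$HOME/.config/autostart".
write_autostart_entry() (
    dir=$1
    mkdir -p "$dir"
    # The file name virtualbox.desktop is a hypothetical choice.
    cat > "$dir/virtualbox.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=VirtualBox
Exec=virtualbox
EOF
)
```

Calling write_autostart_entry "$HOME/.config/autostart" creates the entry for the current user.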
VBoxManage sharedfolder add <NAME VIRTUAL MACHINE> --name "hdd" --hostpath "<PATH TO YOUR HOST DISK FOLDER>" --automount
sudo usermod -a -G vboxsf <USER>
so that you gain access to your host’s disk space.
Once all physical machines are set up as cluster nodes (second part) and one or more virtual machines (first part) are running on each cluster node, Cloudera can be installed on the cluster by following, for example, .
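Before starting the installation, the disk-space pitfall mentioned at the beginning can be checked on every node. The following sketch reads the free space of the root file system with df. The helper names and the threshold of 30 GB are hypothetical placeholders, not a requirement stated by Cloudera; choose a value that matches your planned installation.

```shell
#!/bin/sh
# Print the free space of a mount point in whole gigabytes.
# df -P gives one line per file system, -k reports 1024-byte blocks,
# and column 4 of the second line is the available space.
free_gb() (
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
)

# Succeed if the mount point has at least the given number of gigabytes free.
has_enough_space() (
    mount_point=$1
    min_gb=$2
    [ "$(free_gb "$mount_point")" -ge "$min_gb" ]
)

# Hypothetical threshold; override it via the MIN_ROOT_GB environment variable.
MIN_ROOT_GB=${MIN_ROOT_GB:-30}
if has_enough_space / "$MIN_ROOT_GB"; then
    echo "root file system: OK"
else
    echo "root file system: less than ${MIN_ROOT_GB} GB free" >&2
fi
```

Run on each virtual machine, this gives an explicit warning where the installer itself may stay silent.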