Mastering Data Ingestion: Apache NiFi Setup on Ubuntu 22.04 EC2

Buse Kaylan
8 min read · May 3, 2024

--

Welcome back to our ongoing exploration of AWS and cloud technologies! As we continue our journey through the realms of data engineering and data science, we delve deeper into the essential tools and platforms that power modern data workflows. In this installment, we shift our focus to Apache NiFi, a robust data integration and processing tool renowned for its flexibility and scalability. Building upon our previous discussions on AWS EC2 instances and cloud infrastructure, today we’ll guide you through the process of deploying Apache NiFi on Ubuntu 22.04 instances hosted on AWS EC2. Whether you’re a seasoned data engineer or just dipping your toes into the world of cloud-based data solutions, join us as we unlock the potential of Apache NiFi in the cloud and pave the way for streamlined data management and analysis.

Let’s start!

To begin, make sure the EC2 security group allows traffic on the port Apache NiFi will listen on. In the AWS Management Console, open EC2, go to the Instances section, and choose the relevant instance to view its summary. Scroll down to the Security section and click the security group to reach its settings, then edit the inbound rules. Add a new rule for port 443 (HTTPS), the port we will assign to the NiFi web interface later in this guide. Setting the source to 0.0.0.0/0 allows traffic from any IP address, which is convenient for a quick test but exposes the login page to the whole internet; for anything beyond that, restrict the source to trusted IP addresses or ranges. Once the inbound rules are saved, you’re ready to deploy and configure Apache NiFi on your Ubuntu 22.04 EC2 instance.
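
For readers who prefer the command line, the same inbound rule can be added with the AWS CLI. The sketch below is a minimal illustration, assuming the AWS CLI is installed and configured with permission to modify the group. The security group ID and the client IP address are placeholders, not values from this walkthrough, and ingress_cmd is a hypothetical helper name. It only prints the command so it can be reviewed before running, and it limits the source to a single address instead of 0.0.0.0/0.

```shell
# Build (but do not run) the AWS CLI call that opens port 443 to one source range.
# Both arguments are placeholders that must be replaced with your own values.
ingress_cmd() {
  local group_id="$1" cidr="$2"
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 443 --cidr %s\n' \
    "$group_id" "$cidr"
}

# Hypothetical security group ID and a documentation-range IP address.
ingress_cmd sg-0123456789abcdef0 203.0.113.25/32
```

Copy the printed line into a terminal once the group ID and address are correct.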

After connecting to the EC2 instance, the first step is to ensure that the system is up to date and has the necessary dependencies installed. Run the following commands in the terminal:

sudo apt update
sudo apt install openjdk-11-jdk
sudo apt install unzip

The sudo apt update command refreshes the package index so that the system has current information about available packages. Next, sudo apt install openjdk-11-jdk installs the OpenJDK 11 development kit; NiFi runs on the Java Virtual Machine, and OpenJDK 11 is the Java version this guide uses. Finally, sudo apt install unzip installs the unzip utility, which we need to extract the NiFi ZIP archive. Once these commands complete successfully, the EC2 instance has everything required to install and configure Apache NiFi.

To install Apache NiFi, we first switch to the superuser (sudo su) so that we have the permissions needed for system-wide operations. Then we change to /opt, a commonly used location for third-party software on Unix-like systems. Next, we use wget to download the Apache NiFi binary distribution (nifi-1.25.0-bin.zip) from the Apache download servers. If the link returns a 404 error, the release has probably been moved to archive.apache.org, where Apache keeps older versions.

sudo su
cd /opt
wget https://dlcdn.apache.org/nifi/1.25.0/nifi-1.25.0-bin.zip
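
Before extracting the archive, it can be worth confirming that the download is intact. The following is a minimal sketch, assuming the sha256sum utility is available (it is on Ubuntu) and that you take the expected checksum from the Apache NiFi download page yourself; verify_sha256 is a hypothetical helper name. The demonstration uses an empty scratch file, whose SHA-256 digest is well known, so that the snippet runs anywhere.

```shell
# verify_sha256 FILE EXPECTED_HEX: succeed only if FILE's SHA-256 matches EXPECTED_HEX.
verify_sha256() {
  local file="$1" expected actual
  # Normalise the expected value to lowercase, since published checksums vary in case.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Self-contained demonstration: mktemp creates an empty file, and the digest
# below is the SHA-256 of empty input.
sample=$(mktemp)
if verify_sha256 "$sample" e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855; then
  echo "checksum OK"
else
  echo "checksum MISMATCH" >&2
fi
rm -f "$sample"
```

On the server, the call would take nifi-1.25.0-bin.zip as the file and the published checksum as the second argument.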

After downloading, we use the unzip command to extract the contents of the ZIP file. Once the extraction is complete, we rename the extracted directory from nifi-1.25.0 to nifi for simplicity. Finally, we change into the conf directory within the Apache NiFi installation directory (/opt/nifi/conf) to access the configuration files for further setup and customization.

unzip nifi-1.25.0-bin.zip
mv nifi-1.25.0 nifi
cd /opt/nifi/conf

To configure Apache NiFi, begin by opening the nifi.properties file using the nano text editor:

nano nifi.properties

Within this file, set the nifi.web.https.host property to local.nifi.com, defining the hostname under which NiFi will be accessed. Additionally, assign port 443 to nifi.web.https.port, ensuring secure HTTPS communication. Next, specify the proxy host by setting nifi.web.proxy.host to your AWS EC2 instance's public IP address.
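
If you would rather script these three changes than make them by hand in nano, something like the following could work. It is only a sketch: set_prop is a hypothetical helper, the IP address is a documentation-range placeholder for your instance's public IP, and the demonstration runs against a scratch file with stand-in starting values, not the real nifi.properties.

```shell
# set_prop FILE KEY VALUE: rewrite the "KEY=..." line in FILE, or append it if absent.
# Assumes VALUE contains no "|" or "&", which are special in the sed expression below.
set_prop() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    # Write through a temporary copy, then overwrite in place so that the
    # original file keeps its owner and permissions.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" \
      && cat "${file}.tmp" > "$file" \
      && rm -f "${file}.tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Demonstration on a scratch file with stand-in starting values.
props=$(mktemp)
printf '%s\n' 'nifi.web.https.host=127.0.0.1' 'nifi.web.https.port=8443' 'nifi.web.proxy.host=' > "$props"

set_prop "$props" nifi.web.https.host local.nifi.com
set_prop "$props" nifi.web.https.port 443
set_prop "$props" nifi.web.proxy.host 203.0.113.10   # placeholder for the instance's public IP

cat "$props"
```

Pointing the helper at /opt/nifi/conf/nifi.properties would apply the same edits to the real file; keeping a backup copy first is a sensible precaution.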

To facilitate access, add an entry mapping your AWS EC2 instance’s IP address to the hostname local.nifi.com in your local computer's hosts file. This file can typically be found at C:\Windows\System32\drivers\etc\hosts on Windows systems. By adding this entry, your computer will resolve local.nifi.com to the correct IP address, allowing seamless access to Apache NiFi's web interface.
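
On a Linux or macOS workstation the equivalent file is /etc/hosts, and the entry is a single line: the IP address followed by the hostname. As a rough sketch, assuming a placeholder public IP and a hypothetical add_host_entry helper, the entry could be added like this without creating duplicates. The demonstration writes to a scratch file; on a real machine the target would be /etc/hosts and the call would need root privileges.

```shell
# add_host_entry FILE IP NAME: append "IP NAME" to FILE unless NAME already has an entry.
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  if grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$file"; then
    echo "entry for ${name} already present in ${file}"
  else
    printf '%s %s\n' "$ip" "$name" >> "$file"
  fi
}

# Demonstration against a scratch file standing in for the hosts file.
hosts=$(mktemp)
add_host_entry "$hosts" 203.0.113.10 local.nifi.com   # placeholder public IP
add_host_entry "$hosts" 203.0.113.10 local.nifi.com   # second call changes nothing
cat "$hosts"
```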

Next comes user management and permissions. Rather than running NiFi as root, we create a dedicated user account for it. The useradd -m nifi command creates the nifi user along with a home directory. We then run chown -R nifi:nifi /opt/nifi to give the nifi user and group ownership of the installation directory and everything in it, so that NiFi can read and write its own configuration, repositories, and logs without needing root privileges.

useradd -m nifi
chown -R nifi:nifi /opt/nifi
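
To confirm that the ownership change covered the whole tree, one option is to list anything that is still owned by someone else. This is a small sketch with a hypothetical not_owned_by helper; the demonstration uses a scratch directory owned by the current user so that it runs without root.

```shell
# not_owned_by DIR OWNER: print every path under DIR whose owner is not OWNER.
# OWNER may be a user name or a numeric user ID. Empty output means full coverage.
not_owned_by() {
  local dir="$1" owner="$2"
  find "$dir" ! -user "$owner" -print
}

# Demonstration on a scratch directory owned by the current user.
scratch=$(mktemp -d)
touch "$scratch/example.txt"
if [ -z "$(not_owned_by "$scratch" "$(id -u)")" ]; then
  echo "all paths owned by the expected user"
fi
rm -rf "$scratch"
```

On the server, the call would be not_owned_by /opt/nifi nifi, and any path it prints was missed by the chown.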

Next, we edit the /etc/hosts file on the EC2 instance itself. This file holds static hostname-to-IP mappings that are typically consulted before DNS. Opening it with the nano text editor, we add an entry that maps the hostname local.nifi.com to 0.0.0.0, that is, a line reading 0.0.0.0 local.nifi.com. Note that 0.0.0.0 is not the loopback address (that is 127.0.0.1); it is the unspecified, or wildcard, address. Because nifi.web.https.host is set to local.nifi.com, NiFi resolves that name when it starts and, with this mapping, should bind its HTTPS listener to all network interfaces, so it can accept connections arriving from outside the instance. Editing /etc/hosts affects name resolution only on the machine where the file lives, which is why your local computer needs its own entry pointing local.nifi.com at the instance's public IP address.

nano /etc/hosts

Finally, we switch to the nifi user with su nifi so that the NiFi commands that follow run with that account's permissions instead of as root.

su nifi

To ensure that Apache NiFi is configured to use the correct Java environment, we begin by editing the nifi-env.sh file. This file contains environment variables that are used to configure Apache NiFi's runtime environment. By opening the nifi-env.sh file using the nano text editor, we can modify these variables as needed.

cd /opt/nifi/conf
nano nifi-env.sh

In this case, we set the JAVA_HOME variable to point to the installation directory of the Java Development Kit (JDK) version 11. The export JAVA_HOME='/usr/lib/jvm/java-11-openjdk-amd64/' command specifies the path to the JDK installation directory. This ensures that Apache NiFi uses the appropriate Java version for its operations. It's crucial to verify that the specified path matches the actual location of the JDK on your system. Once this configuration is in place, Apache NiFi will be able to utilize the Java environment effectively, enabling smooth and reliable operation of the data integration platform.
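
One way to check the path before writing it into nifi-env.sh is to derive it from the java binary that is actually on the PATH. The snippet below is a sketch under that assumption; java_home_from_bin is a hypothetical helper that simply strips the trailing /bin/java from a resolved path.

```shell
# java_home_from_bin PATH: strip the trailing /bin/java to obtain the JDK directory.
java_home_from_bin() {
  local java_bin="$1"
  printf '%s\n' "${java_bin%/bin/java}"
}

if command -v java >/dev/null 2>&1; then
  # Follow symlinks (such as /usr/bin/java) to the real binary.
  resolved=$(readlink -f "$(command -v java)")
  echo "export JAVA_HOME='$(java_home_from_bin "$resolved")'"
else
  echo "java not found on PATH; install openjdk-11-jdk first" >&2
fi
```

The printed line is the one to place in nifi-env.sh; if it differs from the path used in this guide, prefer the one reported by your system.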

Two more adjustments are needed before NiFi can serve HTTPS on port 443. First, on Linux only privileged processes may bind to ports below 1024, and NiFi now runs as the unprivileged nifi user. We therefore grant the Java executable the CAP_NET_BIND_SERVICE capability with setcap 'CAP_NET_BIND_SERVICE=+ep' /usr/lib/jvm/java-11-openjdk-amd64/bin/java, which lets it bind to privileged ports such as 443. Setting file capabilities requires root, so run this from the root shell: type exit to leave the nifi user's shell, run the command, and switch back with su nifi once the root-only steps are done.

setcap 'CAP_NET_BIND_SERVICE=+ep' /usr/lib/jvm/java-11-openjdk-amd64/bin/java

To verify that the capability has been successfully applied, we utilize the getcap command to inspect the capabilities of the Java executable. Executing getcap /usr/lib/jvm/java-11-openjdk-amd64/bin/java allows us to confirm that the necessary capability has been set, ensuring that Apache NiFi can bind to port 443 without any issues.

getcap /usr/lib/jvm/java-11-openjdk-amd64/bin/java

Finally, we allow incoming connections on port 443 in the instance's own firewall. Running ufw allow 443 adds a rule to the Uncomplicated Firewall (UFW) that permits inbound traffic on that port; like setcap, it must be run as root. The rule only matters if UFW is active (ufw status shows this); when it is inactive, the EC2 security group configured earlier is what controls access from outside.

ufw allow 443

To set the login for the NiFi web interface, we run the set-single-user-credentials command of nifi.sh as the nifi user, passing the username admin and the password admin@123456789. NiFi's single-user login then accepts only this one account, which has full access to the instance. The nifi.sh script lives in /opt/nifi/bin, so we call it by its full path here because we are still in the conf directory. The password shown is only an example; choose your own, and note that NiFi requires it to be at least 12 characters long.

bash /opt/nifi/bin/nifi.sh set-single-user-credentials admin admin@123456789

With Apache NiFi configured and credentials set, we’re now ready to start the data integration platform!

By navigating to the /opt/nifi/bin directory using the cd /opt/nifi/bin command, we position ourselves within the directory containing Apache NiFi's startup script. Once there, we initiate the start process by executing bash nifi.sh start. This command triggers the startup script (nifi.sh), which initializes the Apache NiFi service, initiating the processing of data flows and enabling data ingestion, transformation, and routing. As the service starts up, Apache NiFi undergoes internal initialization routines, including loading configurations and establishing connections to external systems.

cd /opt/nifi/bin
bash nifi.sh start
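
NiFi can take a minute or more to finish starting, so the web interface may not respond right away. A simple way to wait for it is a retry loop. The sketch below defines a hypothetical retry helper and demonstrates it with a command that succeeds immediately so that it runs anywhere.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY_SECONDS between tries; fail after ATTEMPTS unsuccessful runs.
retry() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Self-contained demonstration with a command that succeeds on the first try.
if retry 3 0 true; then
  echo "ready"
fi
```

On the server, a call such as retry 30 5 curl -ksSf -o /dev/null https://local.nifi.com/nifi/ would poll the interface every five seconds; the -k flag skips certificate verification and is only appropriate for this kind of local check. Running bash nifi.sh status or reading /opt/nifi/logs/nifi-app.log are other ways to follow the startup.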

To access the Apache NiFi web interface, navigate to https://local.nifi.com/nifi in your web browser. There, you'll be prompted to enter the username and password you configured earlier.

If the page does not load, check the network path first: the EC2 security group and, if it is active, UFW must both allow inbound connections on port 443, and the hosts file on your local computer must map local.nifi.com to the instance's current public IP address. Also expect a certificate warning in the browser, because NiFi uses a self-signed certificate unless you configure your own; you need to accept it to continue. Once through, you can log in and begin managing your data flows.

As we conclude this guide to setting up Apache NiFi on AWS EC2, we’ve embarked on a journey to harness the power of cloud technologies for efficient data engineering and data science workflows. With Apache NiFi configured and operational, you’re now equipped to streamline data processing and orchestrate complex data flows with ease. As I continue to explore the intricacies of cloud technologies and their practical applications, stay tuned for future blog posts delving into the usage and contextual understanding of Apache NiFi, along with other tools and platforms shaping the landscape of modern data management. Your feedback and engagement are invaluable as I navigate this exciting terrain with you, seeking to unlock the full potential of cloud-based data solutions.

Thank you for joining me on this journey, and I look forward to sharing more insights and discoveries in the realm of data engineering and beyond!
