Building a Powerful Document Search Engine: Leveraging HDFS, Apache Tika, SFTP, NiFi, MongoDB, Elasticsearch, Logstash, FastAPI, and React.js
Introduction
Our project tackles the challenge of searching across a large collection of documents by integrating a series of technologies: HDFS, Apache Tika, SFTP, NiFi, MongoDB, Elasticsearch, Logstash, FastAPI, and React.js. This article details how these components work together to create a powerful and user-friendly document search system.
Beginning the Pipeline: SFTP Server and NiFi ETL Process
The pipeline begins with an SFTP server, which acts as the entry point for the Extract, Transform, Load (ETL) process managed by NiFi. Users can upload files to an “uploads” directory, either locally or directly on the SFTP server. NiFi continuously monitors this directory and picks up newly added files.
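To make the "monitor the directory for new files" step concrete, here is a minimal Python sketch of the idea. It is not how NiFi is implemented or configured in this project: NiFi handles this with its own processors and state tracking, and the article does not name the processors used. The function name `detect_new_files`, the local temporary "uploads" folder, and the file names are all hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Set


def detect_new_files(directory: Path, seen: Set[str]) -> List[str]:
    """Return names of files that were not present on earlier polls.

    `seen` is updated in place so the next poll only reports newer files.
    This mimics, in a simplified way, a listing step that remembers what
    it has already handed to the rest of the pipeline.
    """
    current = {entry.name for entry in directory.iterdir() if entry.is_file()}
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


if __name__ == "__main__":
    # Assumption: a local temporary folder stands in for the SFTP "uploads" directory.
    with tempfile.TemporaryDirectory() as tmp:
        uploads = Path(tmp) / "uploads"
        uploads.mkdir()
        seen: Set[str] = set()

        (uploads / "report.pdf").write_bytes(b"%PDF-")
        print(detect_new_files(uploads, seen))  # ['report.pdf']

        # Polling again without new uploads reports nothing.
        print(detect_new_files(uploads, seen))  # []

        (uploads / "notes.docx").write_bytes(b"PK")
        print(detect_new_files(uploads, seen))  # ['notes.docx']
```

In a real deployment the polling loop, retries, and bookkeeping of already-processed files are left to NiFi; the sketch only shows why the monitor needs to remember what it has already seen, so that each upload enters the pipeline once.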
File Processing with HDFS and Apache Tika
After files are uploaded, NiFi transfers them to an HDFS cluster composed of 3 data nodes, designed for efficient storage of original files. NiFi then determines each file’s…