Building a Powerful Document Search Engine: Leveraging HDFS, Apache Tika, SFTP, NiFi, MongoDB, Elasticsearch, Logstash, FastAPI, and React.js

Stefentaime
6 min read · Dec 7, 2023

Introduction

Searching and retrieving content across large, heterogeneous document collections is a common engineering challenge. Our project tackles it by integrating a series of proven technologies: HDFS, Apache Tika, SFTP, NiFi, MongoDB, Elasticsearch, Logstash, FastAPI, and React.js. This article details how these components work together to create a powerful and user-friendly document search system.

Beginning the Pipeline: SFTP Server and NiFi ETL Process

The pipeline begins with an SFTP server, which acts as the entry point for the Extract, Transform, Load (ETL) process managed by NiFi. Users can upload files to an “uploads” directory, either locally or directly on the SFTP server. NiFi then continuously monitors this directory to detect newly added files.
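In NiFi, this kind of polling is typically implemented with the ListSFTP and FetchSFTP processor pair: ListSFTP periodically lists new files in the remote directory, and FetchSFTP downloads each one for processing. A sketch of the key ListSFTP properties follows; the hostname, credentials, and path are placeholders, not values from this article:

```properties
# ListSFTP — polls the uploads directory for newly added files
Hostname: sftp.example.com       # hypothetical host
Port: 22
Username: nifi                   # hypothetical credentials
Remote Path: /uploads
Minimum File Age: 10 sec         # skip files still being written
# Downstream, a FetchSFTP processor retrieves each listed file
```

Setting a minimum file age is a common safeguard so the flow does not pick up files that are only partially uploaded.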

File Processing with HDFS and Apache Tika

After files are uploaded, NiFi transfers them to an HDFS cluster composed of three data nodes, which provides durable storage for the original files. NiFi then determines each file’s…
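In a NiFi flow, this detection step is usually handled by the IdentifyMimeType processor, which uses Apache Tika under the hood. As a minimal illustration of the idea, the Python sketch below guesses a MIME type from the file extension using only the standard library; note that Tika itself is more robust, since it inspects file content rather than relying on the name:

```python
import mimetypes

def guess_mime(filename: str) -> str:
    """Extension-based MIME guess -- a lightweight stand-in for
    Apache Tika's content-based detection used in the real pipeline."""
    mime, _ = mimetypes.guess_type(filename)
    # Fall back to the generic binary type when the extension is unknown
    return mime or "application/octet-stream"

print(guess_mime("report.pdf"))   # application/pdf
print(guess_mime("notes.txt"))    # text/plain
```

Knowing the MIME type up front lets the flow route each file to the right downstream parser before text extraction and indexing.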

