A Tool for Every Data Engineer’s Toolbox
Collecting data from edge devices in manufacturing, processing medical records from electronic health systems, and analyzing text all sound like very different problems, each requiring a unique solution. While that is certainly true, the tasks share some commonalities. Each requires a scalable method of data ingestion, predictable performance, and capabilities for management and monitoring. Projects like these also typically need to track data lineage as data moves through the pipeline and to replay data. Once we abstract out these commonalities, the projects turn out to be not all that different. In each case, data is consumed and ingested so that it can be analyzed or processed.
A tool that satisfies those common requirements would be invaluable to a data engineer. One such tool is Apache NiFi, an application that allows data engineers to create directed graphs of data flows using an intuitive web interface. Through NiFi’s core construct, the processor, data can be ingested, manipulated, and persisted. For many pipelines, data and software engineers no longer have to write custom code. With Apache NiFi, creating a pipeline can be as simple as dragging processors onto its canvas and applying the appropriate configuration.
To illustrate the capabilities of Apache NiFi, consider a recent project that required translating documents in various languages, stored in an Apache Kafka topic, into a single language. The pipeline had to consume the documents from the topic, determine the language of each document, and select the appropriate translation service. Apache NiFi’s ConsumeKafka processor handled the ingestion of documents, an InvokeHTTP processor made the web service request to determine each document’s source language, and a RouteOnAttribute processor directed the flow, based on that language, to the appropriate InvokeHTTP processor, which sent the text to a translation service. The resulting translated documents were then persisted to S3.
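The routing step in that pipeline can be sketched in a few lines of plain Python. This is only an illustrative model of the decision a RouteOnAttribute processor makes, not NiFi code or NiFi's API: the FlowFile class, the "language" attribute name, the language codes, and the translation service names are all assumptions made up for this example. In NiFi itself, a comparable rule would be configured on the processor rather than written as code.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Simplified stand-in for a NiFi FlowFile: content plus string attributes."""
    content: str
    attributes: dict = field(default_factory=dict)


# Hypothetical routing table: detected language code -> translation service name.
ROUTES = {
    "es": "spanish-translator",
    "fr": "french-translator",
    "de": "german-translator",
}


def route_on_language(flowfile, routes=ROUTES):
    """Pick a destination from the 'language' attribute.

    Documents with a missing or unknown language go to 'unmatched',
    so nothing is silently dropped.
    """
    language = flowfile.attributes.get("language")
    return routes.get(language, "unmatched")


def partition(flowfiles, routes=ROUTES):
    """Group FlowFiles by the destination chosen for each one."""
    grouped = {}
    for flowfile in flowfiles:
        destination = route_on_language(flowfile, routes)
        grouped.setdefault(destination, []).append(flowfile)
    return grouped
```

Calling partition on a batch of documents returns one list per destination, which mirrors how each route in the flow feeds its own downstream translation request. Keeping an explicit "unmatched" bucket makes documents in unsupported languages visible instead of losing them.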
A few years ago, building such a pipeline would likely have required writing custom code for each step: consuming from a queue, communicating with the language translation services, and persisting the results to a remote store. Avoiding custom code usually saves time and money. Apache NiFi is one tool that belongs in every data engineer’s toolbox. As with any tool, it is important to understand NiFi’s capabilities and limitations. The Apache NiFi User Guide is a great place to start.