Harnessing the Power of Apache NiFi and Amazon Polly for Machine Learning

Tim Spann
Cloudera
Jan 26, 2024

NiFi Processors: GetAwsPollyJobStatus, StartAwsPollyJob

Introduction

Machine learning is revolutionizing the way we extract insights from data, but it often requires substantial data preprocessing and transformation. Cloudera DataFlow powered by Apache NiFi can streamline this process by efficiently collecting, routing, and transforming data from various sources. When combined with Amazon Polly, a cloud-based text-to-speech service, you can create sophisticated machine learning pipelines that process and analyze both textual and spoken data. In this article, we will explore how to use Apache NiFi and Amazon Polly in tandem to enhance your machine learning workflows.

What is Apache NiFi?

Apache NiFi is an easy-to-use, powerful data integration tool that enables the automation of data flows between various systems. It excels at collecting, routing, and transforming data in real-time, making it a valuable asset in the context of machine learning. NiFi offers a web-based user interface for designing data flows, and its flexibility allows it to be seamlessly integrated with other services and systems.

Amazon Polly: Transforming Text into Speech

Amazon Polly is a cloud-based service that uses advanced deep learning technologies to convert text into lifelike speech. Polly offers a wide range of voices in multiple languages and can produce speech that sounds natural and engaging. It’s a perfect choice when you need to convert textual data into spoken format for machine learning tasks such as voice recognition, sentiment analysis, or chatbots.

Building a Flow

Amazon provides a large number of powerful services for various machine learning tasks. Cloudera DataFlow (CDF) integrates with a number of these, so you can easily include them in your streaming data pipelines. I will be looking at several of them in my data flows; the first is Amazon Polly, a service that turns text into lifelike speech. CDF is integrated with Amazon Polly via two processors: StartAwsPollyJob and GetAwsPollyJobStatus. We will use both to run our job and confirm that it has completed.

The most important thing to prepare is a JSON Payload that will be used to run and configure our Polly job. We need to pick our language, output format, S3 output bucket, the text we are passing in, text type and one of the voices.

StartAwsPollyJob — JSON Payload

{
"Engine": "standard",
"LanguageCode": "en-US",
"OutputFormat": "mp3",
"OutputS3BucketName": "tspann",
"OutputS3KeyPrefix": "transit",
"SampleRate": "8000",
"Text": "${speaktext:trim()}",
"TextType": "text",
"VoiceId": "Joanna"
}

The JSON Payload supports many optional fields, and whatever you set must form valid JSON. All of these values can be set dynamically via NiFi attributes or Expression Language. The most important fields are the output format (I chose mp3), the text to speak in your chosen voice, and the name of the S3 bucket where the mp3 will be deposited.
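As a rough illustration of what a valid payload looks like once the attribute values are filled in, here is a minimal Python sketch that builds the same JSON outside of NiFi. The function name `build_polly_payload` and its defaults are assumptions made for this example; only the field names and sample values come from the payload above. It trims the text the way `${speaktext:trim()}` does and lets the JSON serializer escape quotes, which matters when the spoken text itself contains quotation marks.

```python
import json


def build_polly_payload(text, bucket, key_prefix, voice_id="Joanna",
                        language_code="en-US", output_format="mp3",
                        sample_rate="8000", engine="standard"):
    """Return a StartAwsPollyJob-style JSON payload as a string.

    Hypothetical helper for illustration; field names follow the
    payload shown in the article.
    """
    cleaned = text.strip()  # mirrors ${speaktext:trim()}
    if not cleaned:
        raise ValueError("Text must not be empty")
    payload = {
        "Engine": engine,
        "LanguageCode": language_code,
        "OutputFormat": output_format,
        "OutputS3BucketName": bucket,
        "OutputS3KeyPrefix": key_prefix,
        "SampleRate": sample_rate,
        "Text": cleaned,
        "TextType": "text",
        "VoiceId": voice_id,
    }
    # json.dumps escapes quotes and newlines so the result stays valid JSON
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    print(build_polly_payload('  The 8:15 train is "delayed"  ', "tspann", "transit"))
```

Inside NiFi itself, the Expression Language `escapeJson()` function may serve the same purpose when the text is substituted directly into the payload.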

Once we have started the job, we’ll get the status right away. Most likely it won’t be ready yet, so we’ll route to ControlRate to pause for 15 seconds before we try again. We will continue this loop until the status changes from running to success. We may also receive a status of throttled, which means we could pause longer and retry, but I am going to send that to failure.
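The check, pause, and retry loop can be sketched in plain Python. This is not how the NiFi processors are implemented; it only mirrors the flow logic. The `get_status` callable is a hypothetical stand-in for whatever fetches the task status, and the status strings are Polly's task states (`scheduled`, `inProgress`, `completed`, `failed`) rather than the NiFi relationship names used above. The attempt cap is an added assumption so the loop cannot run forever.

```python
import time


def wait_for_task(get_status, delay_seconds=15, max_attempts=20, sleep=time.sleep):
    """Poll get_status() until the task completes, fails, or attempts run out.

    get_status is a hypothetical callable returning a Polly task status
    string. Returns the number of checks it took to see "completed".
    """
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status == "completed":
            return attempt
        if status == "failed":
            raise RuntimeError("synthesis task failed")
        # Still scheduled or in progress: pause, like ControlRate in the flow
        sleep(delay_seconds)
    raise TimeoutError(f"task not finished after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated status sequence; no real pause so the demo returns at once
    statuses = iter(["scheduled", "inProgress", "completed"])
    attempts = wait_for_task(lambda: next(statuses), sleep=lambda seconds: None)
    print(f"completed after {attempts} checks")
```

Passing `sleep` as a parameter keeps the sketch testable; in real use the default `time.sleep` would provide the 15 second pause.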

GetAwsPollyJobStatus Results in Attributes

GetAwsPollyJobStatus Results JSON

{
"sdkResponseMetadata" : {
"requestId" : "a1a93990-ebd7-482c-a20a-f14ad817ad28"
},
"sdkHttpMetadata" : {
"httpHeaders" : {
"Content-Length" : "505",
"Content-Type" : "application/json",
"Date" : "Tue, 05 Sep 2023 23:46:21 GMT",
"x-amzn-RequestId" : "a1a93990-ebd7-482c-a20a-f14ad817ad28"
},
"httpStatusCode" : 200,
"allHttpHeaders" : {
"x-amzn-RequestId" : [ "a1a93990-ebd7-482c-a20a-f14ad817ad28" ],
"Content-Length" : [ "505" ],
"Date" : [ "Tue, 05 Sep 2023 23:46:21 GMT" ],
"Content-Type" : [ "application/json" ]
}
},
"synthesisTask" : {
"engine" : "standard",
"taskId" : "cff12658-d33f-421a-a521-12ee38f17a42",
"taskStatus" : "scheduled",
"taskStatusReason" : null,
"outputUri" : "https://s3.us-east-1.amazonaws.com/tspann/transit.cff12658-d33f-421a-a521-12ee38f17a42.mp3",
"creationTime" : 1693957582665,
"requestCharacters" : 1,
"snsTopicArn" : null,
"lexiconNames" : null,
"outputFormat" : "mp3",
"sampleRate" : "8000",
"speechMarkTypes" : null,
"textType" : "text",
"voiceId" : "Joanna",
"languageCode" : "en-US"
}
}
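To show which parts of this response matter downstream, here is a small Python sketch that pulls out the task status and splits `outputUri` into an S3 bucket and key. The helper name `summarize_task` is an assumption for this example, and the parsing assumes the path-style URI seen in the sample response (bucket as the first path segment).

```python
import json
from urllib.parse import urlparse


def summarize_task(response_json):
    """Extract the status and S3 location from a status response.

    Hypothetical helper; assumes a path-style outputUri of the form
    https://s3.<region>.amazonaws.com/<bucket>/<key>.
    """
    task = json.loads(response_json)["synthesisTask"]
    path = urlparse(task["outputUri"]).path.lstrip("/")
    bucket, _, key = path.partition("/")
    return {
        "taskId": task["taskId"],
        "taskStatus": task["taskStatus"],
        "bucket": bucket,
        "key": key,
    }


if __name__ == "__main__":
    sample = json.dumps({"synthesisTask": {
        "taskId": "cff12658-d33f-421a-a521-12ee38f17a42",
        "taskStatus": "scheduled",
        "outputUri": "https://s3.us-east-1.amazonaws.com/tspann/"
                     "transit.cff12658-d33f-421a-a521-12ee38f17a42.mp3",
    }})
    print(summarize_task(sample))
```

Note that the output key is the `OutputS3KeyPrefix` from the request followed by the task ID and the file extension, so the mp3 can be located as soon as the task ID is known.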

One source of data to feed Amazon Polly could be Apache Kafka. In this example, we read records from Kafka and write them out as JSON.

Once we have our data from Kafka, we standardize the name of the text field so we can build our payload.

SplitRecord

If we have received many records, we will need to split them into one record at a time, which makes sense for conversion jobs since each Polly task converts a single piece of text.
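Conceptually, the split and the text extraction that follows amount to something like the Python sketch below. The field name `message` is purely hypothetical (the real name depends on your Kafka records), and the function is an illustration of the idea, not of how SplitRecord or EvaluateJsonPath work internally.

```python
import json


def split_records(batch_json, text_field="message"):
    """Yield one (record, speaktext) pair per record in a JSON array.

    text_field is a hypothetical field name; records with no usable
    text are skipped so no empty Polly job would be started.
    """
    for record in json.loads(batch_json):
        text = str(record.get(text_field, "")).strip()
        if text:
            yield record, text


if __name__ == "__main__":
    batch = json.dumps([
        {"message": " Service change on the A line "},
        {"message": ""},
        {"message": "Elevator out of service"},
    ])
    for _, speaktext in split_records(batch):
        print(speaktext)
```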

BuildText (EvaluateJsonPath)

StartAwsPollyJob (start our task with our JSON Payload)

GetAwsPollyJobStatus (Get Status of our task from Amazon)

Once we get our results back we will recreate the results as a new JSON record.

AttributesToJSON (Build a New Record)
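In spirit, this step selects a handful of flow file attributes and serializes them as one JSON object. A minimal Python sketch of that idea follows; the attribute names in the demo are hypothetical placeholders, since the actual names depend on what the upstream processors write.

```python
import json


def attributes_to_json(attributes, keep):
    """Serialize only the listed attributes, in the order given.

    Names missing from the attribute map are skipped rather than
    written out as nulls (a choice made for this sketch).
    """
    selected = {name: attributes[name] for name in keep if name in attributes}
    return json.dumps(selected)


if __name__ == "__main__":
    # Hypothetical attribute names, for illustration only
    flowfile_attributes = {
        "speaktext": "Elevator out of service",
        "taskStatus": "completed",
        "uuid": "not-needed-downstream",
    }
    print(attributes_to_json(flowfile_attributes, ["speaktext", "taskStatus"]))
```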

We then check our AWS S3 bucket for the mp3 results.

We don’t need too many parameters; the most important is the connection to AWS.

Conclusion

By combining Apache NiFi’s data integration capabilities with Amazon Polly’s text-to-speech service, you can create powerful machine learning pipelines that can handle both textual and spoken data efficiently. This integration opens up a wide range of possibilities for applications such as chatbots, voice assistants, and sentiment analysis. With the flexibility and scalability of these services, you can build robust machine learning solutions that can adapt to evolving data needs.

Tim Spann
Cloudera

Principal Developer Advocate, Cloudera. Principal Engineer - Big Data, IoT, Deep Learning, Streaming, Machine Learning, Cloud. https://www.datainmotion.dev/