Data Engineering: text-to-speech data collection with Kafka, Airflow, and Spark

Niyomukiza Thamar
6 min read · Mar 12, 2023


i. Introduction

To build a good model, the availability of good-quality data is critical for machine learning engineers. For some languages, data pipelines that produce the desired output are scarce. For that reason, this article gives an overview of an approach a data engineer can take to design and build a robust, large-scale, fault-tolerant, highly available Kafka-based tool that posts a sentence and receives back an audio file, using text corpora for the Amharic and Swahili languages. The reasons for choosing this approach and the purpose of each tool are also described.

ii. Project Objective

The main aim of this project is to build a large-scale, stream-processing-based distributed system that enables data collection for a speech-to-text model. As new samples become available, the tool will allow continuous retraining of the speech-to-text model. The general approach we chose is described in the next section.

iii. Approach and Setup

The main components of the project are shown in the diagram. The tools we are using in this project are Apache Kafka, Apache Airflow, and Apache Spark; details below:

  • Frontend : a React front-end lets users record or upload an audio file and validate audio submissions. It interacts with a Flask back-end API, from which it gets new sentences and to which it sends the text-audio pairs.
  • Backend : implemented with Flask. It consumes new sentences and forwards them to the front-end, and it also acts as a publisher client, sending the text-audio pairs it receives from the front-end to another Kafka topic (a sketch of this back-end follows the list below).
  • Airflow : orchestrates the Kafka message flow and kicks off the transformation and loading of data with Spark.
  • Spark : converts and loads the data while standardizing it to ensure consistency and dependability.
  • Kafka cluster : the brain of the entire structure; it carries the messages produced on each topic and publishes them to the consumers.
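As a rough sketch of how the back-end ties these pieces together, the Flask API below consumes new sentences from one Kafka topic and publishes submitted text-audio pairs to another. The broker address, topic names, and endpoint paths are illustrative assumptions, not the project's exact configuration.

```python
# A minimal sketch of the Flask back-end bridging the front-end and Kafka.
# Topic names, broker address, and endpoint paths are assumptions.
import json

from flask import Flask, jsonify, request
from kafka import KafkaConsumer, KafkaProducer

app = Flask(__name__)
BROKERS = "localhost:9092"  # assumed broker address

# Consumer for new sentences published from the text corpus (assumed topic name).
sentence_consumer = KafkaConsumer(
    "text-corpus",
    bootstrap_servers=BROKERS,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Producer for the text-audio pairs submitted by users (assumed topic name).
pair_producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.route("/api/sentence", methods=["GET"])
def get_sentence():
    """Return the next unread sentence from the Kafka topic to the front-end."""
    record = next(sentence_consumer)  # blocks until a message is available
    return jsonify({"sentence": record.value})

@app.route("/api/submission", methods=["POST"])
def post_submission():
    """Publish the text-audio pair sent by the front-end to another topic."""
    pair = request.get_json()  # e.g. {"text": ..., "audio_s3_key": ...}
    pair_producer.send("text-audio-pairs", value=pair)
    pair_producer.flush()
    return jsonify({"status": "queued"}), 201

if __name__ == "__main__":
    app.run(port=5000)
```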

iv. Tools Used

The following tools are being used for the project:

Apache Kafka

A popular messaging platform based on the publish-subscribe mechanism. It is a distributed, fault-tolerant system with high availability, widely adopted by large companies. Its strongest features are that it is independent of programming languages, its service-oriented architecture, and its broad applicability. In our case, Kafka is used for publishing and subscribing to streams of audio and text records/messages. It serves as the data bus through which all other components communicate.
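The topics that make up this data bus can be created programmatically. The snippet below is a small sketch using kafka-python; the topic names and partition/replication settings are chosen purely for illustration.

```python
# A sketch of creating the topics that serve as the project's data bus.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topics = [
    # sentences pulled from the text corpus and served to readers
    NewTopic(name="text-corpus", num_partitions=3, replication_factor=2),
    # text-audio pairs submitted by users through the front-end/back-end
    NewTopic(name="text-audio-pairs", num_partitions=3, replication_factor=2),
]

admin.create_topics(new_topics=topics)
admin.close()
```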

Apache Airflow

A workflow manager to schedule, orchestrate and monitor workflows. According to its definition, it is “A platform by Airbnb to programmatically author, schedule, and monitor data pipelines.” It streamlines increasingly complex enterprise procedures and is built on the idea of “configuration as code.” Airflow uses directed acyclic graphs (DAGs) to control workflow orchestration, enabling data engineers to assemble and manage workflows that span many data sources.
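As a taste of what “configuration as code” looks like, here is a minimal Airflow 2.x DAG; the DAG id, schedule, and task are placeholders, not part of the project.

```python
# A minimal "configuration as code" DAG sketch (Airflow 2.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_pipeline",
    start_date=datetime(2023, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = BashOperator(task_id="say_hello", bash_command="echo 'pipeline ran'")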

Apache Spark

An analytics engine for unstructured and semi-structured data with a wide range of use cases, including stream processing and machine learning. Spark was created as a new framework optimized for fast iterative workloads such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce, which is why we chose it for this project. It is used for processing, querying and analyzing big data, and because computation happens in memory it is many times faster than competitors such as MapReduce.

v. Data

We are using the Amharic news text classification dataset in this work, primarily as the source for the text part of the final dataset. The sentences will be displayed to users on the front-end, allowing them to record the text’s audio equivalent and send it back to us. The dataset contains news articles collected from various sources, originally assembled to classify Amharic news with machine learning classifiers. The data comes from previous work by the researcher Israel Abebe and consists of six columns: headline, category, date, views, article, and link.

Reading the data set
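For illustration, loading the dataset for exploration might look like the sketch below, assuming a local CSV export of the dataset (the file name is hypothetical).

```python
# A minimal sketch of loading the dataset for exploration.
import pandas as pd

df = pd.read_csv("amharic_news.csv")  # hypothetical local export of the dataset

print(df.shape)
print(df.columns.tolist())  # expected: headline, category, date, views, article, link
df.head()
```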

vi. Exploratory data analysis and insights drawn from the data

The data contains articles with a large number of words. As a simple preprocessing step, letters that sound the same were replaced with one common letter. Each article and its category were then paired to form a new preprocessed dataset.
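The homophone normalization could look roughly like the sketch below. The character mapping shown is a partial, illustrative example of common Amharic homophone groups, not the exact table used in the project, and it reuses the df loaded earlier.

```python
# An illustrative sketch of replacing Amharic letters that sound the same
# with one common letter; the mapping is a partial, assumed example.
CHAR_MAP = {
    "ሐ": "ሀ", "ኀ": "ሀ",  # 'ha' variants collapsed to one form
    "ሠ": "ሰ",            # 'se' variants
    "ዐ": "አ",            # 'a' variants
    "ፀ": "ጸ",            # 'tse' variants
}

def normalize(text: str) -> str:
    """Replace homophone characters so each sound maps to a single letter."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

# Pair each normalized article with its category to form the preprocessed dataset.
preprocessed = [
    (normalize(article), category)
    for article, category in zip(df["article"].astype(str), df["category"])
]
```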

The distribution of article lengths was reviewed and visualized.

The distribution of news categories in the dataset was also visualized; nationwide news is the dominant category.
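The two visualizations described above can be reproduced with a few lines of pandas and matplotlib, as in this sketch (reusing the df from the loading step).

```python
# A short sketch of the two EDA plots: article length distribution and
# the number of articles per news category.
import matplotlib.pyplot as plt

article_lengths = df["article"].astype(str).str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(article_lengths, bins=50)
axes[0].set_title("Distribution of article length (words)")
axes[0].set_xlabel("words per article")

df["category"].value_counts().plot(kind="bar", ax=axes[1])
axes[1].set_title("Articles per news category")

plt.tight_layout()
plt.show()
```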

vii. Implementation of the Project

The steps below explain how we implemented the project.

1. Record or upload audio

We developed the front-end with React, where users click a button to get a text. Users record themselves reading the generated text out loud using the recording interface provided, and then upload the recording by clicking the upload button.

Home page

2. Stream the data

We used the Kafka cluster to stream the generated data to the front-end for the user to see, sending and publishing random text data. To avoid overlaps in the data, Flask API configurations were used. We created the topics to consume, as shown in the picture below.

Topics
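The producer side of this step might look like the sketch below, which publishes random sentences from the corpus to the topic the Flask API consumes. The topic name, broker address, file name, and message format are assumptions.

```python
# A sketch of publishing random sentences from the corpus to the Kafka topic.
import json
import random
import time

import pandas as pd
from kafka import KafkaProducer

df = pd.read_csv("amharic_news.csv")  # hypothetical corpus export
sentences = df["article"].dropna().astype(str).tolist()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    text = random.choice(sentences)
    producer.send("text-corpus", value={"sentence": text})
    producer.flush()
    time.sleep(5)  # throttle so the topic is not flooded
```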

3. Scheduling jobs with Airflow in the Kafka cluster

We constructed an end-to-end Kafka data pipeline and orchestrated the tasks with Airflow by following these steps: trigger the pipeline, read raw data, produce messages, consume messages, transform messages, add metadata, and finally load the messages into the data warehouse.

Airflow DAG
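A DAG mirroring the steps listed above could be declared as in the sketch below; the task callables are placeholders standing in for the project's actual pipeline functions, and the DAG id and schedule are assumptions.

```python
# A sketch of a DAG chaining the pipeline steps described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _noop(**_):
    """Placeholder for the real pipeline step."""

with DAG(
    dag_id="kafka_speech_pipeline",
    start_date=datetime(2023, 3, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    steps = [
        "read_raw_data",
        "produce_message",
        "consume_message",
        "transform_message",
        "add_metadata",
        "load_to_warehouse",
    ]
    tasks = [PythonOperator(task_id=name, python_callable=_noop) for name in steps]

    # Chain the tasks in the order described above.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```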

4. Spark DataFrame and streaming (still in progress)

This component receives a list of audio-text pairs from an S3 bucket and loads them into a Spark DataFrame, where the work is distributed across the Spark cluster. We have to remove long silences in the audio clips and then remove any noise. Finally, the cleaned audio-text list is stored in an Amazon S3 bucket for cleaned data.
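A rough sketch of this in-progress component is shown below. It assumes the raw pairs are described by a CSV manifest of S3 keys and transcripts, uses librosa to trim long silences (noise removal is left as a placeholder), and distributes the per-clip work across the Spark executors. All bucket names, paths, and column names are assumptions.

```python
# A sketch of the Spark cleaning step under the assumptions stated above.
import io

import boto3
import librosa
import soundfile as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean_audio_text_pairs").getOrCreate()

# Manifest of raw submissions: one row per pair (audio S3 key + transcript).
pairs = spark.read.csv("s3a://raw-audio-bucket/manifest.csv", header=True)

def clean_audio(key: str) -> str:
    """Download one clip, trim leading/trailing silence, and upload the result."""
    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket="raw-audio-bucket", Key=key)["Body"].read()
    audio, sr = librosa.load(io.BytesIO(raw), sr=None)
    trimmed, _ = librosa.effects.trim(audio, top_db=25)  # drop long silences
    # TODO: noise reduction would go here before re-encoding.
    buffer = io.BytesIO()
    sf.write(buffer, trimmed, sr, format="WAV")
    s3.put_object(Bucket="clean-audio-bucket", Key=key, Body=buffer.getvalue())
    return key

# Distribute the cleaning work across the Spark executors.
cleaned_keys = pairs.rdd.map(lambda row: clean_audio(row["audio_key"])).collect()
print(f"cleaned {len(cleaned_keys)} clips")
```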

viii. Lessons Learned

During this project, I personally learned how to use Kafka, Airflow, and Spark together to build a strong and robust data engineering pipeline. Moreover, I learned how to interact and communicate with teammates using Git.

ix. Future Works

Due to time constraints, we could not finish everything we wanted to do on this project. In the future, we want to finish the Spark part so that we can provide good-quality audio for building and testing the NLP model.

The code implementation is available here on GitHub.

References

· Installing a Kafka Cluster and Creating a Topic — Hands-on Labs | A Cloud Guru

· Breaking News: Everything Is An Event! (Streams, Kafka And You) — DEV Community

· How to process streams of data with Apache Kafka and Spark (microsoft.com)

· Keeping your ML model in shape with Kafka, Airflow and MLFlow | by Mike Kraus | VantageAI | Medium

· Why Apache Airflow Is a Great Choice for Managing Data Pipelines | by Kartik Khare | Towards Data Science

· https://www.coursera.org/learn/etl-and-data-pipelines-shell-airflow-kafka?specialization=ibm-data-engineer
