Project Garble: Long Audio to Short Text Summarizer

Joseph Kim
Institute for Applied Computational Science
Dec 14, 2021

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.

Authors: Joseph Kim, Malla Reddy Adaboina, Bharat Ramanathan

Introduction

We’re going to build an app that takes a long audio file and generates a paragraph-long summary — with all the main ideas included!

💡 Note: It’s named Project Garble because we’d like to take all the audio or garble online and put them into a useful format for you 😄

Background

  • There is a ton of content online — podcasts, lectures, keynotes.
  • While it would be great to go through them all, there is simply too much content, and most of it is too long, for anyone to listen to everything.
  • What one needs is a condensed version of this content, one that ideally captures all the main points of the audio. This would allow users to quickly read through more audio content and pick out the pieces they would like to listen to in full.

Goal

Create an app that allows the user to

  • Input/Upload an audio file
  • Get/Download a short summary text of audio file

Architecture

At a high level, we have 3 main parts of the tech stack that we need to build out and connect.

  • Frontend: We keep this simple and use React with basic functionality. We use off-the-shelf components from the Material UI library.
  • Backend: We also keep this simple and use Python’s FastAPI to serve the two endpoints (/transcribe and /summarize) needed for the app to work.
  • Model: This is where the bulk of the work is and what we’ll spend the most time diving into in this article. We source a massive dataset of >1.6M documents, preprocess them, and feed them into a Longformer Encoder Decoder (LED) model.

We containerize the different apps using Docker and deploy them using Ansible scripts and Kubernetes. The containers are

  • Frontend app container (React)
  • API service container (FastAPI)
  • Nginx container (web server/proxy)

We use Google Cloud Platform (GCP) to host our app and run Speech-to-Text transcription. Our fine-tuned summarization model is also hosted on GCP and accessed by our backend API service to generate the transcript summaries.

Tech Stack and Tools

If you’d like to follow along and try this out for yourself, here’s a list of tools that we used to get this entire app up and running.

Data

Data Collection

Our text summarization corpus consists of text documents and their corresponding summaries. The text documents serve as the input training data, and the corresponding summaries serve as the labels the model should learn to generate.

A document is loosely defined as a news article, a meeting transcript, an interview dialogue or a talk transcript. A summary is loosely defined as shorter text that captures the meaning and essence of the document.

We combine a variety of existing datasets across domains ranging from dialogue summaries, meeting minutes, news summaries, to podcast and talk descriptions.

Preprocessing

We preprocess the text data using spaCy and tokenize both the documents and their corresponding summaries into sentences. This results in the following format:

{
  "document": [
    "Sentence 1 in document",
    "Sentence 2 in document",
    ...
  ],
  "summary": [
    "Sentence 1 in summary",
    "Sentence 2 in summary",
    ...
  ]
}
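
A minimal sketch of this preprocessing step, assuming spaCy’s small English pipeline and using made-up document and summary strings:

import spacy

# Assumes the en_core_web_sm pipeline has been downloaded.
nlp = spacy.load("en_core_web_sm")

def to_sentences(text):
    """Split raw text into a list of sentence strings."""
    return [sent.text.strip() for sent in nlp(text).sents]

# Placeholder inputs; in practice these come from the collected datasets.
raw_document = "The meeting covered the quarterly results. Revenue grew by ten percent."
raw_summary = "Quarterly revenue grew ten percent."

record = {
    "document": to_sentences(raw_document),
    "summary": to_sentences(raw_summary),
}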

We do not do any further preprocessing such as stop-word and punctuation removal or case-folding. This is because we intend to fine-tune the model to produce human-readable summaries that are as close to the reference summaries as possible.

Exploratory Data Analysis

We perform EDA relevant to the task of text summarization and compute descriptors that measure the quality of the data, such as sentence density, word density, and document and summary lengths.

We visualize the distribution of these descriptors across all datasets. Here’s an example of the visualization for the Spotify podcast dataset.

Additional details on these measurements and their corresponding distributions across datasets are presented in the EDA Notebook.

Clean Up Dataset

Based on the EDA above, we clean up and filter the datasets using the following criteria to remove low-quality documents and summaries.

  • Filter by density: Retain only documents and summaries whose sentence density and word density are greater than 10%.
  • Filter by lower bound: Retain only documents with more than 100 words and summaries with more than 50 words.
  • Filter by upper bound: Truncate documents longer than 16,000 words and summaries longer than 500 words.

After cleanup, we tokenize and store the dataset in the Apache Arrow format. We combine the various datasets into training, validation, and test splits; the total size of the combined dataset is 26 GB.
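
To illustrate, here is a minimal sketch of how this kind of filtering, truncation, and Arrow-backed storage could look with the Hugging Face datasets library; the field names, the word-count thresholds drawn from the criteria above, and the output path are assumptions, not the project’s actual pipeline.

from datasets import Dataset

# Tiny in-memory stand-in for the combined corpora.
raw = Dataset.from_dict({
    "document": ["word " * 200, "too short"],
    "summary": ["word " * 60, "tiny"],
})

def keep(example):
    """Lower-bound filter: >100 words per document, >50 words per summary."""
    return (len(example["document"].split()) > 100
            and len(example["summary"].split()) > 50)

def truncate(example, max_doc_words=16_000, max_sum_words=500):
    """Upper-bound truncation of overly long documents and summaries."""
    example["document"] = " ".join(example["document"].split()[:max_doc_words])
    example["summary"] = " ".join(example["summary"].split()[:max_sum_words])
    return example

cleaned = raw.filter(keep).map(truncate)
cleaned.save_to_disk("cleaned_dataset")  # datasets persists this in Apache Arrow format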

Model

We fine-tune AllenAI’s pre-trained Longformer Encoder-Decoder (LED) model from the Hugging Face transformers library for the summarization task.

Aimed at long-range language modeling (up to 16K tokens), the model performs well on Long Range Arena, a unified benchmark specifically focused on evaluating model quality in long-context scenarios.

The LED model achieves this by:

  • reducing the self-attention complexity of transformers from O(n^2) to O(n)
  • introducing sparse attention mechanisms: sliding window, dilated sliding window, and global attention

“A Longformer variant that has both the encoder and decoder Transformer stacks but instead of the full self-attention in the encoder, it uses the efficient local+global attention pattern of the Longformer. The decoder uses the full self-attention to the entire encoded tokens and to previously decoded locations.”

source: Longformer: The Long-Document Transformer, Iz Beltagy and Matthew E. Peters and Arman Cohan

Moreover, the LED model was developed specifically for summarization and question-answering tasks, which makes it an appropriate choice for this application.

Training

We train with a max_input_length of 4096 tokens for the documents (97% of the documents have fewer than 4096 tokens) and a max_output_length of 512 tokens for the generated summaries.
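
As a reference point, here is a minimal sketch of how one document/summary pair could be tokenized to these lengths with the LED tokenizer; the base checkpoint name and the global-attention convention (global attention on the first token) are assumptions based on the public LED examples, not the project’s exact code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")  # assumed base checkpoint

max_input_length = 4096   # document tokens
max_output_length = 512   # summary tokens

def tokenize_pair(document, summary):
    """Tokenize one document/summary pair with truncation to the max lengths."""
    inputs = tokenizer(document, max_length=max_input_length,
                       truncation=True, padding="max_length")
    labels = tokenizer(summary, max_length=max_output_length,
                       truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    # LED also expects a global attention mask; a common convention for
    # summarization is to place global attention on the first token only.
    inputs["global_attention_mask"] = [1] + [0] * (len(inputs["input_ids"]) - 1)
    return inputs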

Other notable training parameters include:

per_device_train_batch_size=12  # batch size per device
gradient_accumulation_steps=4   # gradient accumulation steps
learning_rate=5e-05
weight_decay=0.0
adam_beta1=0.9
adam_beta2=0.999
adam_epsilon=1e-08
max_grad_norm=1.0
num_train_epochs=10
fp16=True
group_by_length=True  # group sequences by length for dynamic padding

The model is trained using categorical cross-entropy loss measured over the vocabulary tokens.
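
For reference, here is a minimal sketch of how these hyperparameters could be wired into the Hugging Face Seq2SeqTrainer; the checkpoint name, output directory, and dataset variables are placeholders rather than the project’s actual training script.

from transformers import (AutoModelForSeq2SeqLM, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")  # assumed checkpoint

training_args = Seq2SeqTrainingArguments(
    output_dir="led-summarizer",        # placeholder output path
    per_device_train_batch_size=12,
    gradient_accumulation_steps=4,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
    num_train_epochs=10,
    fp16=True,
    group_by_length=True,               # group sequences by length for dynamic padding
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # tokenized splits from the preprocessing step
    eval_dataset=val_dataset,
)
trainer.train()  # the Trainer applies the cross-entropy loss over vocabulary tokens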

Results

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

It is a set of metrics for evaluating automatic summarization tasks. It works by comparing an automatically produced summary against a set of reference summaries (typically human produced).

We measure ROUGE scores and report the F1 values on the train, validation, and test sets. ROUGE-1 represents the percentage of unigrams that overlap between the generated and reference summaries; ROUGE-2 represents the percentage of bigrams that overlap.
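
A minimal sketch of how ROUGE-1/ROUGE-2 F1 could be computed for a single generated/reference pair using the rouge_score package (the example strings are made up):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

reference = "the model summarizes long audio transcripts into short paragraphs"
generated = "the model turns long audio transcripts into short summaries"

# score(target, prediction) returns precision/recall/F1 for each ROUGE type.
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)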

Frontend

We use React to create a simple page on which you can upload an audio file and get a NLP-generated summary.

The user can upload an audio file by clicking on the grey box on the left. The app will automatically call our /transcribe endpoint to transcribe the audio file, as shown below.

Once the app is finished transcribing the audio, it will also call the /summarize endpoint to generate a summary of the audio transcript.

The summary is about a paragraph long.

Backend

We use Python’s FastAPI to build out our backend.

Our Transcribe service uses Google Cloud’s Speech-to-Text service. The audio file is first uploaded to the platform, then the speech recognition service is called.

#! transcription.py
from google.cloud import speech

def transcribe(audio_path):
    # Audio upload code here
    # ...
    # Transcribe
    audio = speech.RecognitionAudio(
        uri=f"gs://{bucket_name}/{audio_files}/audio.flac"
    )
    config = speech.RecognitionConfig(
        language_code="en-US", enable_automatic_punctuation=True
    )
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=180)
    # return transcription results

--------------------------------------------------------------------

#! service.py
@app.post("/transcribe")
async def transcribe(file: bytes = File(...)):
    with TemporaryDirectory() as audio_dir:
        audio_path = os.path.join(audio_dir, "audio.mp3")
        with open(audio_path, "wb") as output:
            output.write(file)
        transcription_results = transcription.transcribe_audio_file(audio_path)
        return transcription_results
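
The upload step is elided in the snippet above. For completeness, here is a minimal sketch of how an audio file might be pushed to a Cloud Storage bucket before calling the recognizer; the bucket and object names are placeholders, not the project’s actual configuration.

from google.cloud import storage

def upload_audio(audio_path, bucket_name, blob_name):
    """Upload a local audio file to GCS and return its gs:// URI."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(audio_path)
    return f"gs://{bucket_name}/{blob_name}"

uri = upload_audio("audio.flac", "my-audio-bucket", "audio_files/audio.flac")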

Our summarization endpoint feeds the transcript to our fine-tuned LED model to generate a summary.

#! summarization.py
class SummarizationPipeline:
    summarization_pipeline = pipeline(
        "summarization",
        model=local_models_path,
        tokenizer=LEDTokenizerFixed.from_pretrained(local_models_path),
    )

    def make_prediction(self, text: str) -> Dict[str, str]:
        """
        Makes a prediction using the model
        """
        summary = self.summarization_pipeline(text)[0]["summary_text"]
        summary = summary.split("---")[0]
        return {"summary": summary}

--------------------------------------------------------------------

#! service.py
@app.post("/summarize")
async def summarize(request: SummarizationRequest):
    pipeline = model.SummarizationPipeline()
    data = request.dict()
    transcript = data["transcript"]
    summary = pipeline.make_prediction(transcript)
    return summary
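
The SummarizationRequest body is not shown above; here is a minimal sketch of what it might look like as a Pydantic model, with the single transcript field inferred from the handler:

from pydantic import BaseModel

class SummarizationRequest(BaseModel):
    """Request body for /summarize; the handler reads the transcript field."""
    transcript: str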

Deployment

Our frontend and backend applications are containerized using Docker and the images are pushed to Google Container Registry for deployment.

In order to host the app on the Google Cloud Platform (GCP), we enable the following API services in GCP:

  • Compute Engine API
  • Service Usage API
  • Cloud Resource Manager API
  • Google Container Registry API
  • Kubernetes Engine API
  • Cloud Speech-to-Text API

App Deployment

We use Google Kubernetes Engine to create a Kubernetes cluster with two nodes of machine type “n1-standard-8” (8 vCPUs per node). We deploy two containers on each node:

  1. React frontend container
  2. API service container

Web Server Deployment

The nginx container is deployed as a web server that routes requests between the frontend and the backend API service. Because audio files can often be large, we increased the maximum accepted request payload size in the nginx configuration.

Additionally, larger audio files mean longer processing times. To prevent connections from being dropped, we also increased the connection timeout.

Model Deployment

For our LED model, we manually upload the saved model weights to a GCP bucket. Our backend API downloads the model from the bucket to a /persistent folder in the container. The downloaded model is then accessed locally within the API service to generate the summaries.

def download_model_dir(
    bucket_name=bucket_name, source_dir=source_dir, destination_dir=local_models_path
):
    """Downloads the model blobs from the bucket."""
    storage_client = storage.Client(project=gcp_project)
    bucket = storage_client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=source_dir)  # Get list of files
    if not os.path.exists(destination_dir):
        os.makedirs(destination_dir)
    for blob in blobs:
        blob_name = blob.name
        if blob.name.endswith("/"):
            continue
        dst_file_name = os.path.join(destination_dir, blob_name.split("/")[-1])
        logging.info("destination file: {}".format(dst_file_name))
        if not os.path.isfile(dst_file_name):
            blob.download_to_filename(dst_file_name)

download_model_dir()

Test Deployment

Once everything is set up, try the commands below to shell into the pod and make sure the setup is working and the environment is up and running!

kubectl get pods --namespace=garble-app-cluster-namespace
kubectl get pod api-5d4878c545-47754 --namespace=garble-app-cluster-namespace
kubectl exec --stdin --tty api-5d4878c545-47754 --namespace=garble-app-cluster-namespace -- /bin/bash

To view the application

Future Work/Improvements

While the app provides basic functionality, it could continue to be improved in three key ways.

Improve the UI/UX:

  • Because the focus of this project was more on deploying an AI application, we spent minimal time optimizing the user flow and experience. The barebones user interface could be made more aesthetically pleasing and user friendly.

Improve the summarization:

  • A CUDA-optimized version of the Longformer model could be used to reduce training and inference time.
  • A hierarchical transformer encoder could be used to leverage the natural hierarchical structure of speech.

Improve the architecture:

  • Change long-polling HTTP calls to an event-driven mechanism using message queues such as Kafka
  • Use cloud functions to perform on-demand inference

