Automatic detection and analysis of relationships between prosody and co-speech gesture

Ahmed Ayman
May 6, 2023

This blog is maintained by Ahmed Ayman about the updates on the progress of the GSoC 2023 project with Red Hen Lab.

Hello Everyone!

My name is Ahmed Ayman Abdel-Hakeem, and I am a graduate of the Faculty of Engineering at Shoubra, Computer Engineering Department. I have a keen interest in and solid experience with Python, machine learning, artificial intelligence, and natural language processing. This summer, as part of Google Summer of Code 2023, I am grateful for the opportunity to make a valuable contribution to Red Hen Lab.

As a student developer, I will be working with Red Hen Lab, and this blog documents my weekly progress on the project I proposed.


Project Details

The project I am working on is ‘Automatic detection and analysis of relationships between prosody and co-speech gesture’.

For the GitHub repo of the work on this project, check here: https://github.com/ahmedayman9/GSOC23_Redhenlab

The proposal is also available on the GSoC website.

Mentors

  • Anna Wilson, Cristóbal Pagán Cánovas, Peter Uhrig, Raúl Sánchez.

I would like to express my gratitude to Ilya Burenko for his invaluable assistance.

Abstract

The study of discourse dynamics provides insights into social, political, and cultural processes by analyzing language use. Manual annotation of discourse data is time-consuming, but automated discourse analysis using machine learning and natural language processing techniques can provide an efficient and reliable means of analyzing large datasets. However, there is a need for a comprehensive and user-friendly tool that can accurately annotate and label various discourse features. The proposed project aims to develop an automated annotation tool that can accurately identify and label various features of discourse data, such as speech acts, sentiment, and rhetorical devices. This tool will reduce the time and effort required for manual annotation, provide a more objective and consistent means of annotating, and enable the analysis of larger datasets. Challenges include the accuracy, generalizability, and usability of the tool, which will be addressed through the development of robust and adaptable machine-learning models and a user-friendly interface with necessary documentation and support.

Project Goals

The goal of the proposed project is to develop an automated annotation tool for analyzing discourse dynamics, which incorporates machine learning and natural language processing techniques to identify and label various features of discourse data, such as speech acts, sentiment, and rhetorical devices. The tool aims to provide an efficient and reliable means of analyzing large datasets of discourse data and will be available on an open-source platform as a functional annotation tool, along with user documentation and test data.

Community Bonding Period (May 4–28)

During the Community Bonding Period, my engagement with the Red Hen Lab project began on May 5th, when I received confirmation of my proposal’s acceptance. I promptly initiated discussions with my mentors to delve deeper into the project’s specifics. To overcome the challenge of time zone differences, we established a regular meeting schedule that accommodated everyone’s availability.

In addition to my involvement with Red Hen Lab, I also obtained access to the High-Performance Computing (HPC) clusters at Case Western Reserve University (CWRU) by receiving my CWRU ID. I successfully established a connection to the HPC clusters via VPN access, which proved instrumental in setting up my server and enhancing my capabilities.

Furthermore, I actively participated in the welcome meeting organized by the Red Hen mentors and founders. This meeting provided me with a comprehensive understanding of the Red Hen project and an opportunity to familiarize myself with the entire team.

Overall, the Community Bonding Period was a productive and enlightening phase, laying the groundwork for the forthcoming stages of the project.

Coding Period

The highly anticipated coding period for GSoC 2023 has commenced, and I am absolutely thrilled and filled with excitement! This marks the beginning of an exciting journey, and I can’t wait to dive into the coding tasks and make meaningful progress on the project.

Week 1 & 2 (29 May–15 June 2023):

  • We had a meeting with two mentors, Dr. Peter Uhrig and Raúl Sánchez. This meeting was crucial to setting the stage for our project.
  • During the meeting, we had a comprehensive discussion about our project’s direction and how to access the dataset. Dr. Uhrig and Raúl Sánchez provided invaluable insights and guidance based on their expertise. They helped me refine my approach and better understand the goals we wanted to achieve with the dataset.
  • Dr. Uhrig highlighted the importance of understanding the dataset’s structure and content before proceeding with the download. He emphasized that this understanding would be key to effectively utilizing the dataset for our analysis.
  • Raúl Sánchez shared his experiences in locating and downloading datasets of this magnitude. He emphasized the significance of finding reliable sources to ensure the authenticity of the data.
  • Both mentors provided practical advice on managing the challenges that could arise during the download process, such as slow download speeds and intermittent internet connectivity. They stressed the need for patience, perseverance, and download management techniques to overcome these obstacles.
  • My main focus was on downloading the Boston University Radio Speech Corpus, the NXT Switchboard Corpus, and the Santa Barbara Corpus of Spoken American English (SBCSAE), the datasets Dr. Uhrig made available for me to work on.
  1. Understanding the Dataset: Familiarizing yourself with the dataset’s structure and content is a crucial step. This understanding helps you determine how the dataset can be used effectively and ensures you’re aware of its potential applications.
  2. Locating the Dataset: Researching the availability of the dataset online and accessing it through reputable sources is important. Working with reliable sources ensures the data’s authenticity and minimizes the risk of using incorrect or misleading information.
  3. Downloading the Corpus: Downloading large datasets like the SBCSAE can be time-consuming and resource-intensive. Planning ahead and allocating the necessary resources, including a stable internet connection and sufficient storage space, is crucial for a successful download.
  4. Managing the Data: Organizing the downloaded files systematically enhances your workflow efficiency. Creating clear directory structures and using consistent labeling make it easier to navigate and work with the dataset during subsequent analysis and processing stages.
  5. Verifying the Downloaded Data: Verifying the integrity of the downloaded files is essential. By performing validation checks such as comparing file sizes, confirming the presence of expected components, and using checksums (see the sketch after this list), you can ensure that the dataset hasn’t been corrupted during the download process.
  6. Challenges and Lessons Learned: Dealing with challenges like slow download speeds and connectivity issues is common when working with large datasets. These challenges highlight the importance of patience, persistence, and utilizing strategies to manage and overcome technical difficulties.
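
As a concrete illustration of step 5, here is a minimal integrity-check sketch in Python, assuming the corpus distributor publishes MD5 checksums; the file name and digest below are placeholders, not real values.

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder file name and digest -- substitute the values
# published by the corpus distributor.
expected = {"sbcsae_part1.zip": "d41d8cd98f00b204e9800998ecf8427e"}
for name, digest in expected.items():
    ok = md5sum(Path("downloads") / name) == digest
    print(f"{name}: {'OK' if ok else 'CORRUPTED'}")
```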

Week 3 (15–21 June 2023):

I had the privilege of engaging in a meeting with mentors Anna Wilson, Dr. Peter Uhrig, and Ilya Burenko, and our discussions have been truly enlightening.

During the meeting, we delved deeply into the intricacies of the datasets at hand. We meticulously examined the structure, content, and potential applications of each dataset. This process helped us identify the most suitable dataset for our project, and we collectively decided to focus on the Boston University Radio Speech Corpus for our work.

Moreover, our discussion revolved around the fundamental concepts of prosody, pitch accents, and boundary phrases. With Dr. Peter Uhrig’s guidance, I gained a comprehensive understanding of these concepts and their significance in the realm of spoken language analysis. Dr. Uhrig’s insights have proven invaluable in shaping my grasp of these intricate linguistic elements.

Another meeting with Dr. Peter Uhrig gave me a comprehensive understanding of what break files are and how they can be leveraged to enhance our analysis. Break files essentially delineate the boundaries between various linguistic units, such as phrases and intonation groups, within spoken discourse. By carefully examining these boundaries, we can unveil the cadence and rhythm of speech, ultimately contributing to our exploration of prosody.

Guided by Dr. Peter Uhrig’s expertise, we are now on the cusp of applying this newfound knowledge to our chosen dataset, the Boston University Radio Speech Corpus. Understanding how break files interact with pitch accents and boundary phrases is proving to be a fascinating endeavor. This deep dive is opening doors to a more nuanced interpretation of the spoken word, enabling us to capture the essence of prosody with greater accuracy.
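
To make this concrete, here is a minimal parsing sketch, assuming the break files follow the xlabel-style format used for the BU corpus annotations (a header terminated by a ‘#’ line, then one ‘time color break-index’ line per word); the file name is hypothetical.

```python
def parse_break_file(path):
    """Parse an xlabel-style break file into (time, break_index) pairs."""
    breaks = []
    with open(path) as f:
        # Skip the header, which ends with a line containing only "#"
        for line in f:
            if line.strip() == "#":
                break
        for line in f:
            parts = line.split()
            if len(parts) >= 3:  # time, color, break index
                breaks.append((float(parts[0]), parts[2]))
    return breaks

# Example: in ToBI, break indices 3 and 4 conventionally mark intermediate
# and full intonational phrase boundaries. File name is hypothetical.
for t, label in parse_break_file("f1as01p1.brk"):
    print(t, label)
```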

Week 4 & 5 (22 June–7 July 2023):

In the past weeks, we’ve been focused on a critical phase that involves the manual collection of information from break files. This targeted effort aims to enhance the efficiency and accuracy of annotation, making it a smoother process for both mentors and myself.

During this period, I’ve had the privilege of working closely with mentors Dr. Peter Uhrig and Ilya Burenko, whose guidance and insights have been instrumental in shaping our approach. Our primary objective has been to extract valuable insights from break files, which serve as the cornerstones of prosodic elements such as pitch accents and boundary phrases within spoken language.

To optimize our annotation efforts, we meticulously compiled this extracted data into a structured CSV file. This strategic step serves two crucial purposes. Firstly, it facilitates the collection of valuable statistics pertaining to the distribution and patterns of breaks. Secondly, it lays the foundation for a more informed and streamlined process of annotating prosody.

Our well-structured CSV file serves as a compass for enhancing our understanding of the dataset’s prosodic structure. It also significantly simplifies the collaborative annotation process that lies ahead. As we prepare to undertake the task of manual annotation, this file will prove to be a valuable resource for both my mentors and myself.

[Screenshot: the CSV file we compiled]
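
A sketch of how such a CSV can be assembled, reusing parse_break_file from the earlier sketch; the directory layout and column names are assumptions, not the exact file we produced.

```python
import csv
from pathlib import Path

# Hypothetical layout: one .brk file per utterance under corpus_dir.
# One row per break mark makes distribution statistics easy to compute later.
corpus_dir = Path("bu_radio/breaks")
with open("breaks.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "time", "break_index"])
    for brk_path in sorted(corpus_dir.glob("*.brk")):
        for time, label in parse_break_file(brk_path):  # earlier sketch
            writer.writerow([brk_path.stem, time, label])
```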

We have successfully gathered all audio files into a single, centralized folder. This organization not only ensures a structured workflow but also simplifies access for annotation purposes. Additionally, recognizing the importance of compatibility, we have converted the audio file extensions from SPH and SPN to the widely recognized .wav format. This conversion guarantees uniformity and ease of processing, setting the stage for comprehensive analysis.
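One way to batch this conversion, assuming ffmpeg is installed and its NIST SPHERE support can read these .sph/.spn files; this is a sketch of the step, not the exact commands we ran.

```python
import subprocess
from pathlib import Path

audio_dir = Path("bu_radio/audio")  # hypothetical folder
for src in sorted(list(audio_dir.glob("*.sph")) + list(audio_dir.glob("*.spn"))):
    dst = src.with_suffix(".wav")
    # -y overwrites an existing output file without prompting
    subprocess.run(["ffmpeg", "-y", "-i", str(src), str(dst)], check=True)
```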

In addition to the progress made, I am pleased to mention that we have conducted around four productive meetings within this two-week period. These meetings have played a vital role in ensuring alignment, exchanging insights, and refining our strategy moving forward.

Week 6 (8 July–14 July 2023):

During that week, we had two meetings with Dr. Peter Uhrig to help me access the HPC cluster.

I wanted to share some insights into the challenges I’ve encountered while striving to access the HPC (High-Performance Computing) Cluster and subsequently run our model through Jupyter. These difficulties have presented valuable learning experiences, and I’m eager to provide you with an overview of the hurdles I faced and how we’ve been working to overcome them.

Accessing the HPC Cluster, which houses the computational resources needed for executing our model, has indeed posed some challenges. While the cluster offers powerful capabilities, the process of connecting to it and setting up the necessary environment can be intricate, particularly for individuals new to this system.

Among the difficulties I encountered:

  1. Authentication and Permissions: Establishing the necessary authentication credentials and permissions to access the HPC Cluster can be complex. Negotiating this process required careful coordination with the HPC administrators and familiarizing myself with the cluster’s access protocols.
  2. Environment Setup: Configuring the environment within the HPC Cluster to accommodate our project requirements proved to be a multifaceted task. This included installing the required dependencies, ensuring compatibility with our model, and optimizing resource allocation.
  3. Jupyter Integration: Integrating Jupyter into the HPC Cluster environment for executing our model added another layer of complexity. Ensuring that Jupyter operates seamlessly within the cluster’s infrastructure required careful configuration and troubleshooting.

Despite these challenges, I am grateful for the support and guidance I’ve received from my mentors and the HPC administrators. Their expertise has proven invaluable in navigating these intricacies and troubleshooting any roadblocks encountered.

In light of these challenges, the journey has not only broadened my technical skills but also deepened my appreciation for the collaborative and dynamic nature of research and development. These experiences underscore the significance of meticulous planning, continuous communication, and the ability to adapt to changing circumstances.

Week 7 (14 July–21 July 2023):

Navigating Prosody Analysis: Unveiling Insights with Wav2Vec 2.0 and Mentorship

In the quest to unravel the intricacies of prosody and spoken language, we embark on a transformative journey infused with guidance from mentors Dr. Peter Uhrig and Ilya Burenko. Join us as we dive deep into the world of audio analysis, utilizing the prowess of Wav2Vec 2.0 and the invaluable expertise of our mentors.

The Essence of Frame-Level Features:

Prosody, the rhythm and melody of spoken language, is a cornerstone of effective communication. To capture its nuances, we delve into frame-level features, which offer a window into the prosodic structure of speech. This exploration guides us in accurate annotation and automated analysis, setting the stage for a richer understanding of spoken discourse.

Wav2Vec 2.0: Empowering Audio Analysis

At the heart of our journey lies Wav2Vec 2.0, a groundbreaking model developed by Meta AI and distributed through Hugging Face’s Transformers library. Its feature extractor transforms raw audio into interpretable representations. With Dr. Peter Uhrig and Ilya Burenko as our mentors, we harness this technology to amplify our exploration of prosody.

Navigating the Journey: A Step-by-Step Approach

Our path is paved with meticulous planning and collaboration. Here’s an insight into the steps we undertake:

1. Preparing with Mentorship

Guided by Dr. Peter Uhrig and Ilya Burenko, we approach the project with a solid foundation. Their mentorship fuels our understanding of prosody, frame-level analysis, and the intricate workings of Wav2Vec 2.0.

2. Harnessing the Model’s Power

We load the pre-trained Wav2Vec 2.0 model and its feature extractor, tapping into its capabilities for transforming audio data into meaningful features.
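
In code, this loading step looks roughly like the following; the checkpoint name is an assumption, since the post does not record which pre-trained variant we used.

```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-base-960h"  # assumed checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2Model.from_pretrained(checkpoint)
model.eval()  # inference only, no training
```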

3. Organizing Audio Files

Under the guidance of our mentors, we meticulously organize audio files into a coherent structure, optimizing our workflow for annotation and analysis.

4. Frame Snippets: Capturing Prosodic Nuances

Our codebase enables us to loop through frame snippets within each audio file. These snippets encapsulate vital prosodic cues, laying the groundwork for deeper analysis. We used a frame size of 1 second and a step size of 200 ms, as sketched below.
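
A minimal sketch of the framing loop, assuming mono audio (the file name is hypothetical):

```python
import soundfile as sf

audio, sr = sf.read("radio_clip.wav")  # hypothetical file, assumed mono
frame_len, step = int(1.0 * sr), int(0.2 * sr)  # 1 s frames, 200 ms step

snippets = []
for start in range(0, len(audio) - frame_len + 1, step):
    snippets.append({
        "start_time": start / sr,
        "end_time": (start + frame_len) / sr,
        "samples": audio[start:start + frame_len],
    })
```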

5. Extracting Features and Embeddings

Utilizing the feature extractor, we transform frame snippets into feature representations compatible with Wav2Vec 2.0. This transformation yields embeddings, condensed representations of the audio’s underlying patterns.

6. Guided Annotation and Analysis

With the mentorship of Dr. Peter Uhrig and Ilya Burenko, we store essential frame-level information, including start and end times, embeddings, and audio file details. Their guidance ensures that our annotations are informed and meaningful. Finally, we have a file called “all_frame_snippets.npz”, which holds the extracted features as NumPy arrays.
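
Putting steps 5 and 6 together, a sketch of the extraction loop, continuing from the loading and framing sketches above. Mean-pooling the hidden states into one vector per snippet is my assumption; the post does not record how frames were summarized. Wav2Vec 2.0 expects 16 kHz input, so resample first if needed.

```python
import numpy as np
import torch

embeddings, starts, ends, names = [], [], [], []
for snip in snippets:  # from the framing sketch above
    inputs = feature_extractor(snip["samples"], sampling_rate=sr,
                               return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, time, 768)
    # Assumed summary: mean-pool hidden states into one vector per snippet
    embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())
    starts.append(snip["start_time"])
    ends.append(snip["end_time"])
    names.append("radio_clip.wav")  # hypothetical file name

np.savez("all_frame_snippets.npz",
         name=np.array(names),
         start_time=np.array(starts),
         end_time=np.array(ends),
         embedding=np.stack(embeddings))
```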

Week 8 (22 July–29 July 2023):

In essence, this week’s code implements a process of data integration, comparison, and calculation. It combines data from the frame snippets and two separate DataFrames to generate a new column that holds meaningful ‘target’ values. The code showcases how to manipulate and analyze data, making it a valuable building block for more comprehensive data analysis pipelines or applications.

Steps Breakdown:

  1. Loading Frame Snippets: The code starts by loading a set of frame snippets from a saved NumPy .npz file. This collection of frame snippets contains important information like ‘name’, ‘start_time’, ‘end_time’, and possibly other attributes.
  2. Extracting Attributes: The code then goes through each frame snippet and extracts relevant attributes such as ‘name’, ‘start_time’, and ‘end_time’. These attributes describe different aspects of each snippet and will be used for comparisons later.
  3. Identifying Unique Names: Unique names are extracted from the ‘name’ column of both the ‘data’ and ‘df’ DataFrames. This step sets the foundation for later comparisons between the frame snippets and the two DataFrames.
  4. Creating the ‘target’ Column: For each unique name in the ‘data’ DataFrame, the code checks if that name also exists in the ‘df’ DataFrame. If there’s a match, a function called ‘new_col’ is defined. This function performs intricate calculations involving time intervals and specific conditions from the ‘df’ DataFrame to determine the ‘target’ value for each row in the ‘data’ DataFrame.
  5. Applying the ‘target’ Function: The ‘apply()’ function is employed to iterate through each row in the ‘data’ DataFrame and apply the ‘new_col’ function to calculate the ‘target’ values based on the conditions set in the ‘df’ DataFrame. This is an operation that integrates the data from both the ‘data’ and ‘df’ DataFrames.
  6. Final Output: After the ‘target’ values have been calculated for all relevant rows in the ‘data’ DataFrame, the modified DataFrame is displayed. It now includes the newly created ‘target’ column, the result of the calculations involving the ‘df’ DataFrame and the conditions derived from the frame snippets (a reconstruction sketch follows this list).
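
Below is a hypothetical reconstruction of this labeling step. The column names, file names, and the overlap rule (a snippet gets target = 1 if any annotated break falls inside its time span) are my assumptions, not the project’s exact code.

```python
import numpy as np
import pandas as pd

arrays = np.load("all_frame_snippets.npz", allow_pickle=True)
data = pd.DataFrame({"name": arrays["name"],
                     "start_time": arrays["start_time"],
                     "end_time": arrays["end_time"]})
df = pd.read_csv("breaks.csv")  # file, time, break_index (Week 4 & 5 sketch)

def new_col(row):
    """target = 1 if any annotated break falls inside the snippet's span."""
    marks = df[df["file"] == row["name"].rsplit(".", 1)[0]]
    inside = marks["time"].between(row["start_time"], row["end_time"])
    return int(inside.any())

data["target"] = data.apply(new_col, axis=1)
print(data.head())
```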

Week 9 (30 July–6 Aug 2023):

This week built on a meeting with Dr. Peter Uhrig and Ilya Burenko.

In the realm of natural language processing and speech analysis, prosody plays a vital role in conveying the emotional and pragmatic aspects of human communication. Prosody encompasses various acoustic features, such as pitch, duration, and intensity, which together shape the rhythmic and melodic patterns of speech. Detecting intonational phrase boundaries, where speech naturally pauses and transitions, is crucial for understanding the rhythm and meaning behind the spoken language. In this blog post, we’ll explore how Multilayer Perceptron (MLP) neural networks are harnessed for automatic prosody detection, with a specific focus on detecting intonational phrase boundaries.

The Significance of Intonational Phrase Boundaries: Intonational phrase boundaries are like linguistic signposts that guide the listener through a conversation. These boundaries indicate where speakers take a breath, pause, or change pitch and intensity patterns, which influence the interpretation and emphasis of spoken words. Detecting these boundaries computationally helps in understanding the natural rhythm of speech, enabling more accurate speech synthesis, emotion recognition, and even automatic captioning for audio content.

The Power of MLP Neural Networks: MLP neural networks are a fundamental class of artificial neural networks that have shown their effectiveness in a wide range of tasks, including image recognition, text classification, and speech analysis. Their ability to model complex relationships in data makes them well-suited for tasks involving sequential patterns, such as prosody detection. MLPs consist of multiple layers of interconnected neurons, each layer processing the input data and passing it through nonlinear activation functions. This architecture allows the network to capture intricate patterns in the input data.

Prosody Detection with MLPs: The process of detecting intonational phrase boundaries using MLPs involves two main steps: feature extraction and boundary prediction. The acoustic features, such as pitch contours and energy patterns, are extracted from the audio signal using signal processing techniques. These features are then used as input to the MLP model. The model learns to recognize the patterns associated with intonational phrase boundaries by adjusting its internal weights during training.

Model Training and Evaluation: To train the MLP model for prosody detection, labeled audio data is required, where intonational phrase boundaries are annotated. The model learns from these annotated examples and fine-tunes its parameters to make accurate predictions. The model’s performance is evaluated using metrics like precision, recall, and F1-score, which quantify how well it identifies intonational phrase boundaries.
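
As a sketch of this pipeline, assuming X holds the pooled Wav2Vec 2.0 snippet embeddings and y the 0/1 boundary targets from the Week 8 sketch (the targets file name and the network sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

arrays = np.load("all_frame_snippets.npz", allow_pickle=True)
X = arrays["embedding"]     # one pooled embedding per snippet
y = np.load("targets.npy")  # hypothetical: saved 'target' column from Week 8

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hidden layer sizes are illustrative, not the project's actual settings.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=200)
clf.fit(X_train, y_train)

# Precision, recall, and F1 for the boundary / non-boundary classes
print(classification_report(y_test, clf.predict(X_test)))
```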

Benefits and Applications: Automatic prosody detection using MLPs offers numerous benefits, such as:

  1. Improved Speech Synthesis: Better detection of intonational phrase boundaries can lead to more natural-sounding synthesized speech.
  2. Emotion Recognition: Accurate prosody detection aids in recognizing emotions and sentiments conveyed through speech.
  3. Language Understanding: Identifying intonational phrase boundaries aids in the segmentation and comprehension of spoken language.
  4. Speaker Adaptation: Prosody detection can be tailored to individual speakers, enhancing speech recognition systems.

The system already works, but it still needs some improvement, and we are working on it.
