GSoC 2024 Experience with Red Hen Labs

Prakriti Shetty
25 min read · May 3, 2024


Hi! I’m Prakriti Shetty, an undergraduate student from IIT Bombay, India. I’m pursuing GSoC 2024 under Red Hen Labs, and I’m going to be working on an extremely exciting project on the detection of intonational units, and an overhaul of the AuToBI system.

I will be updating my progress here weekly, stay tuned!

Week 1– 13th May, 2024 to 19th May 2024.

  1. Project Summary: “Detection of Intonational Units — An Overhaul of AuToBI” aims to refurbish the AuToBI system first proposed by Andrew Rosenberg in 2009 with today’s state-of-the-art algorithms and models, so as to meet the compute, speed and accuracy requirements of today and the future. The main idea is still the automatic detection and classification of prosodic events (Rosenberg’s thesis considers two events of interest, pitch accents and phrase boundaries, but we can target a wider array of events), and our broad goal is to establish a new baseline by employing some of today’s more sophisticated ML methods.
  2. Goals:
    - Goal 1: Try to incorporate speaking rate and speech rhythm as additional prosodic events.
    - Correlated Goal 1A: Analyse the correlation between the prosodic events of speech rate/rhythm with phrase boundaries.
    - Goal 2: Try to incorporate CTC alignment to do away with disparities due to frame-level alignment. Analyse the benefits of the change.
    - Goal 3: Employ techniques like Tandem DNN-HMMs, RNN-Transducers and WFSTs, and analyse the benefits, if any.
    - Ambitious Goal 4: Employ state-of-the-art techniques to improve speaker diarization. Can try to generalise speakers to larger groups based on ethnicity/region to employ region-specific analysis

Week 2– 20th May, 2024 to 26th May 2024.

Covered:

  1. Understanding the thesis
  2. Understanding Andrew Rosenberg’s Code
  3. Overview of goals identified in the proposal

Section 1: Understanding the thesis

Automatic detection and classification of prosodic events

1.1: Introduction

Q] What is prosody?

Words — lexical content of speech, Prosody — manner in which these words are spoken

Q] Why is prosody important?

  • Conveys the desired interpretation of an utterance / disambiguates between multiple syntactic interpretations of an utterance.
  • Conveys additional information (emotional states/ speaker states)

Q] 2 prosodic events considered -

  • Pitch Accents: The act of making a word acoustically prominent relative to its surroundings
  • Phrase Boundaries: Perceived disjuncture between words

Q] 3 goals in the thesis -

  • Novel techniques in detection and classification of prosodic events: New ML techniques to leverage lexico-syntactic and acoustic info to improve detection and classification performance
  • Improved understanding of prosodic events: Through error analysis, descriptive stats, comparing classification performance of distinct feature sets.
  • 3 PoC applications — Speech summarization, Story segmentation, Non-native speech assessment.

1.2 ToBI Standard of Intonation

Tone and Break Indices — standard to describe intonation of SAE

The idea is that in the pitch accent and phrase ending classification chapters, you predict ToBI tones associated with the events.

4 parallel time-aligned tiers.

TONE

  • Linear sequence of pitch events aligned in time.
  • 5 types of pitch accents: H*, L*, L+H*, L*+H, H+!H*
  • Also, high tones produced in compressed pitch range: !H*, L+!H*, L*+!H* (these downstepped tones can only occur following a previous high tone (pitch range compression))
  • Also included in this tier — phrase accents and boundary tones (describe intonation preceding prosodic phrase boundaries)
  • 2 levels of prosodic phrasing — intermediate phrase, intonational phrase

BREAK:

  • Level of disjuncture between words indicated on BREAKS tier.
  • Each word boundary has a break index. Typical word boundary: BI=1, intermediate phrase boundary: BI=3, intonational phrase boundary: BI=4
  • Phrase accent: describes pitch movement between ultimate pitch accent and phrase boundary H-, !H-, L-
  • Intermediate phrases must have at least 1 pitch accent
  • Intermediate phrases have associated phrase accent
  • Intonational phrase boundaries have associated phrase accents and an additional boundary tone, to describe a final pitch movement (H%, L%)

MISCELLANEOUS

  • Breaths, laughter, coughing, disfluency

1.3 Datasets

Boston Directions Corpus (BDC)

  • Monologues by 4 speakers, 1F, 3M
  • Spontaneous direction-giving task and then after 2 weeks, read material
  • Spontaneous speech (60min, 11k words) and read speech (50mins, 10k words) treated as different corpora
  • Distribution of pitch-accent types in BDC-read and BDC-spontaneous: H*, !H*, L+H*, L+!H*, L*, L*+H, L*+!H, H+!H*, X*?
  • Distribution of phrase-ending tones in BDC-read and BDC-spontaneous: L-L%, L-H%, H-L%, !H-L%, H-H%, X-?X%…

Boston University Radio News Corpus (BURNC)

  • Radio news (recorded in a radio station during broadcast) + lab news (recorded in a lab)
  • Professional speaking —
    - Adv: speech clear and free of disfluencies,
    - Disadv: frequent accenting of discourse-given words and deaccenting of discourse-new words. Different from natural speech
  • Manual ToBI annotations, but whole material has output of a forced aligner (transcription of speech time aligned to a speech signal). Problems:
    => Location of break indices and word boundaries don’t always align perfectly. (linear assignment — nth break index is nth word boundary regardless of time)
    => Number of break indices != number of word boundaries. E.g. “school-based” is one word according to the word boundaries, but two words according to the ToBI annotation.
  • BURNC has a lexicon which contains syllabification info for each lexical item spoken in the corpus. But the phonetic inventory used in the lexicon doesn’t correspond to the phonetic inventory in the forced-alignment output. Therefore, syllable boundaries from the lexicon phone sequence are aligned to the phone sequence contained in the forced-alignment material using DP (minimum edit distance)

TDT-4 Corpus

  • Newswire text and broadcast news audio in English, Mandarin and Arabic (raw audio) plus auto-produced annotations (ASR transcripts with word boundaries and inter-word durations, sentence boundaries, and speaker segmentation hypotheses)
  • For prosodic event detection, an expert ToBI labeller labeled pitch accent presence (not type) and intonational phrase boundaries (not boundary tones)

1.4 Pitch Accent Detection

Accenting — acoustic highlighting of a word through some modification of its associated speech signal

Pitch excursions -> corr b/w speech energy and pitch accent -> auto-detection of pitch accent -> HMM -> TDRNN -> CHMM

Related work (for auto pitch detection):

  • supervised HMM, trained with smoothed pitch and intensity features to detect emphasis
  • HMM, with speaker normalised pitch and energy values to detect pitch accent.
  • Sequential modelling — TDRNNs. Distributed TDRNNs have 4 models, one for pitch, intensity, duration, and filtered energy coefficients (based on wavelet transforms of energy in speech signal -> DCT -> discrete coeffs -> SBCC spectral balance cepstral coeffs )
  • CHMMs coupled multistream HMMs (uses inputs from pitch, energy, duration domains and trains a coupled model to combine info)
  • Pitch accent detection concurrent with phone recognition. Uses a standard HMM and normalised pitch values along with MFCCs to recognise phones in accented and non-accented forms.
  • Instead of 10ms windows, aggregate info over syllables or words
  • Decision trees to simultaneously predict accented syllables and phrase boundaries
  • Inputs are pitch features aggregated over syllables, syllable structure, diff from surrounding syllables, position within a word, lexical stress
  • Outputs go to a HMM model to model likelihood of observing a given sequence of prosodic labels.
  • Stochastic modelling framework to simultaneously predict accent and phrase boundary locations. Input are pitch, duration and energy features.
  • Ensemble learning techniques (boosting and bagging with CART models). Inputs are acoustic and syntactic features.
  • Kmeans, fuzzy kmeans, GMM clustering
  • Inputs are intensity, pitch, duration and pause-based acoustic features, plus lexico-syntactic features capturing syllable identity, PoS and lexical stress
  • Laplacian SVMs

Using only text features: Prosodic Assignment

Using lexico-syntactic features: Prosodic Analysis

The features derived and the ML algorithm used (RIPPER rule induction, CART modelling, memory-based learning, CRF modelling, ANN models) make the difference.

Thesis:

  • Features: Pitch (min, max, z-score, SD, mean), Energy (same) (pitch and energy contours for each token are calculated using Praat), Duration (a single feature, duration in seconds) (generate vowel-, syllable- and word-level segmentation) and voice quality features. Context-normalised (z-score using the mean and SD from context regions); see the sketch after this list.
  • Models: J48 decision trees, Logistic regression (max-entropy models using the Weka toolkit), Sequential Minimal Optimisation (SMO), i.e. SVMs trained with linear kernels.
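
For concreteness, here is a minimal sketch of how such context-normalised pitch aggregations could be computed in Python with Parselmouth (a Praat wrapper). The function name, window choices and file paths are mine, not the thesis’s.

```python
import numpy as np
import parselmouth  # Python interface to Praat (assumed installed)

def context_normalised_pitch_stats(wav_path, word_start, word_end, ctx_start, ctx_end):
    """Pitch aggregations for one word token, z-scored against a surrounding context region."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=0.01)             # 10 ms pitch contour, as in Praat
    times = pitch.xs()
    f0 = pitch.selected_array["frequency"]
    f0 = np.where(f0 == 0, np.nan, f0)               # Praat marks unvoiced frames with 0

    word = f0[(times >= word_start) & (times < word_end)]
    ctx = f0[(times >= ctx_start) & (times < ctx_end)]
    mu, sd = np.nanmean(ctx), np.nanstd(ctx)
    z = (word - mu) / sd                             # context normalisation (z-score)

    return {"min": np.nanmin(word), "max": np.nanmax(word),
            "mean": np.nanmean(word), "sd": np.nanstd(word),
            "z_mean": np.nanmean(z), "z_max": np.nanmax(z)}
```

The same aggregation can be repeated on an intensity contour (snd.to_intensity()) for the energy features, and the duration feature is simply word_end - word_start.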

Using filtered energy features to detect pitch accents

  • The feature vector contains only features derived from the energy of the speech signal (spectral emphasis)
  • J48 decision trees
  • Accent realised through increased energy in a particular frequency subband

Corrected energy based classifier

  • Combining results from the filtered energy experiments with pitch and duration features to create a robust pitch accent detection module.
  • Voting classifiers

Using PoS tag info in pitch accent detection

  • A system with syntactic information can hypothesize that if a noun is encountered, it is more likely to be accented.
  • Feature vector which has both acoustic and PoS based features.
  • Two classifiers are trained and their hypotheses merged.

1.5 Phrase Boundary Detection

Related work -

  • Decision trees (detects pitch accents and intonational phrase boundary) -> HMM classifier (location)
  • NN syntactic prosodic model
  • GMM acoustic model
  • Memory based learning

Thesis:

  • Features: Acoustic phrase boundary detection — Presence of silence, Pitch and energy reset, Pre-boundary lengthening. See representations of pitch and energy reset and representations of pre-boundary lengthening
  • Models: LR/ decision trees/ J48/ SVM

Lexico-syntactic phrase boundary detection:

  • We don’t have punctuation etc, hence just syntactic parse tree information and part-of-speech based word-class features
  • First, sentence boundary hypothesizer, then, parsing for syntactic features: Constituent features, Positional features, Parse-tree distance features, PoS-based features (certain sequences of word-class tokens are more or less likely to precede or surround an intonational phrase boundary than others.), Decision trees

Detection of intermediate phrase boundaries

  • Everything is the same, except less dramatic, plus no silence feature

SPEAKING RATE AND RHYTHM

1.6 Pitch Accent Type Classification

(not done)

Related work, examples of pitch accent types, and a descriptive analysis of how pitch accents are classified

Experiments:

  • Acoustic aggregations
  • Context-normalized aggregations
  • Shape modeling — features that attempt to capture narrower phenomena within each accented word by modeling the contour shape. (extrema-based, tilt coeffs, lot of math)
  • Sampling strategies
  • Impact of phrase accents
  • PoS info

1.7 Phrase final Type Classification

(not done)

Related work, examples of phrase final types

Experiments:

  • Acoustic features (pitch contour — slope aggregations, tilt, extrema location etc)
  • Syntactic features (parse-tree, positional, PoS etc)
  • Regions of Analysis (full word, last 200ms etc)
  • Quantized Contour Modeling (a Bayesian technique)
  • Final segment class modeling

Section 2: Understanding the code

2.1 Prediction divided into 6 tasks

  1. Detection of pitch accents
  2. Classification of pitch accent types
  3. Detection of intonational phrase boundaries
  4. Classification of intonational phrase ending tones (phrase accents and boundary tone pairs)
  5. Detection of intermediate phrase boundaries
  6. Classification of intermediate phrase ending tones (phrase accents)

Each task requires distinct features to be extracted from the words — each task has an associated FeatureSet (describes the features required for classification)

When extracting features for a FeatureSet, AuToBI calls only those FeatureExtractors necessary to generate the required features

2.2 run() function of AuToBI.java

  1. Get input:
    => read audio data from wav file,
    => get annotations file
  2. Decide on a WordReader:
    => If there is an annotations file, use a word reader over it
    => If there is no annotations file, use a pseudo-syllable word reader based on silences in the audio file
  3. FeatureSet (set datapoints as words)
  4. Create a tasklist from all the command line args given
  5. For each task,
    => retrieve a FeatureSet, a Classifier
    => Initialise a feature registry by registering necessary feature extractors
    => Extract features
    => Construct feature objects
  6. Evaluate task performance
  7. Merge hypotheses

2.3 Broad overview of what each function does

(in notebook)

Inheritance, and which file is used where

2.4 Proposal Goals

(in proposal)

# Meeting 1: 27th May, 2024

Peter Sir gave an introduction to Andrew Rosenberg. He is at Google, working on language modelling.

Peter Sir focused on the importance of maintaining the blog. Made a mental note to update the blog before the meeting every week.

Walkthrough of the goals I had identified in my proposal

  1. Goal 1 — Speaking rate and rhythm are more like prosodic properties, and not events.
  2. Goal 2 — Ahmed Sir talked about model training a bit, then Peter Sir gave an overview of the datasets.
    Said that the datasets have varying levels of phonetic annotations.
    BURNC
    NXT: Switchboard Corpus
    Santa Barbara: Not as deeply annotated
  3. Goal 3 — This is the research question: To what extent and how much do very deep pretrained models (transformer models) affect the accuracy?
    All 3 methods I have mentioned in my proposal are pretty much outdated because of transformers.
  4. Goal 4 — Not really required at this point, only at end if time permits.

Other advice

  1. Not very useful to look at Andrew’s code.
  2. No need to extract features manually, because pretrained transformers will do it itself.

Work for the next week

  1. What technology should I use? How to leverage all the pretraining people have done all this while?
    Peter Sir: It’s an open question, since there is a debate that handcrafted features are better. Andrew’s method works in principle, but the accuracy is lower.
    Ahmed Sir: Maybe a combination of handcrafted and extracted features?
  2. Ahmed Sir: Transformer models to look at
    wav2vec and later
    Models that have most support in the community — wav2vec2.0 => Meta HuBERT => Microsoft wavLM (most recent)
  3. Datasets structure
  4. Should show some code in the coding phase: Github — data loader, or converting to a more readable format etc
  5. Blog update

Advice on how to go about researching the models

  1. WavLM paper (technique, results — how it compares to others aspects)
  2. HuggingFace — Try out the models yourself, how it is performing on similar tasks

Logistics

  1. Meeting time fixed — Mondays, 1.30pm IST
  2. Signal group formed for mid-week communication
  3. Blog update

Week 3– 27th May, 2024 to 2nd June 2024.

Section 1: Understanding the timeline of models

1.1 A brief history of LLMs (and a little bit of random info)

  1. 1966: ELIZA (the first chatbot)
  2. 1986: RNNs — Able to answer questions based on context because they store information, but suffer from short-term memory loss (vanishing gradients)
  3. 1997: LSTMs — Can remember information over long sequences (through gates) (coreference resolution)
  4. *(Random) 1999: Nvidia introduced GPUs
  5. *(Random) 2006: Facebook FAIR (Facebook AI Research Team) created
  6. 2014: GRUs (Gated Recurrent Units) — Simplified version of LSTMs, with only 2 gates instead of 3, hence less computationally intensive
  7. *(Random) 2014: Google Brain Project created
  8. 2014: Attention — RNNs tried to cram all the information of a sequence into a fixed-length context vector, but attention allows the model to look back at the entire sentence, selecting different parts based on relevance.
  9. 2017: Transformers (Attention is all you need) — Ditched recurrence entirely. Multihead attention focuses on different parts of the sentence simultaneously. Ability to process sequences in parallel.
  10. *(Random) 2017: Tensorflow framework released
  11. 2018:
    => Google BERT — Bidirectional encoder representation from transformers
    => OpenAI GPT
  12. 2019:
    => OpenAI GPT-2
    => Google T5
    => Microsoft LXMERT
    => Facebook RoBERTa
  13. 2020:
    => OpenAI GPT-3
    => Google GShard, mT5
    => Nvidia Megatron
  14. 2021:
    => Huawei PanGu-a
    => OpenAI DALL-E, Codex, Web-GPT
    => AI21 Labs Jurassic-1
    => EAAI CPM-2
    => Baidu Ernie 3.0, Ernie 3.0 Titan
    => BigScience T0
    => HyperCLOVA Naver
    => Google FLAN, GLaM
    => Inspur Yuan 1.0
    => AnthropicAI Claude
    => EleutherAI GPT-Neo
    => Deepmind Gopher
    => Facebook XLM-R
  15. 2022:
    => OpenAI InstructGPT, ChatGPT
    => Google LaMDA, UL2, PaLM, FLAN-T5, FLAN-PaLM,
    => Salesforce CodeGen
    => Microsoft MT-NLG
    => Deepmind AlphaCode, Chinchilla, Sparrow
    => Meta OPT, NLLB, Galactica, OPT-IML
    => EleutherAI GPT-NeoX-20B
    => AllenAI Tk-Instruct
    => Cohere LLM
    => Yandex YaLM
    => Aleph Alpha Luminous
    => WeChat WeLM
    => Amazon AlexaTM
    => Tsinghua University GLM
    => BigScience BLOOM, mT0, BLOOMZ
  16. 2023:
    => EleutherAI Pythia
    => LM-SYS Vicuna
    => Google Bard, AI Garden
    => OpenAI GPT-4, APIs, Microsoft Bing+OpenAI
    => Huawei Pangu-e
    => Meta LLaMA (65B, 33B, 7B), LLaMA2, Code LLaMA, SeamlessM4T
    => Falcon 40B
    => AI21 Labs Jurassic-2
    => Baidu Ernie Bot
    => Nvidia NeMO
    => Anthropic Claude2
    => MistralAI 7B

1.2 Timeline for Large Speech Transformers

  1. 2000s to 2010s: Traditional models for speech processing
    Primary objective is to extract significant features from the speech signal through mathematical operations (Fourier transforms, wavelet transforms, LPC) to serve as inputs to classification/regression models.
    => GMMs: Generative models that represent the probability distribution of a speech feature vector using a weighted sum of Gaussian distributions. Tasks include speaker identification and speech recognition.
    => SVMs: Supervised learning. Tasks include speech classification.
    => HMMs: Model the probability distribution of speech sounds by incorporating a sequence of hidden states along with corresponding observations. Tasks include predicting the most probable sequence of speech sounds given an input speech signal.
    => KNN: Tasks include speaker identification and language recognition.
    => Decision Trees: Tasks include speech classification

2. Deep Learning Architectures

=> RNNs: Can model time-varying speech processing tasks otherwise hard to capture by feed-forward networks
(1) vanillaRNN
(2) bidirectional RNN
(3) LSTMs — to address vanishing gradient
(4) GRUs — computationally efficient LSTMs
(5) CTC — scoring and output function used to train LSTMs for sequence-based problems (e.g. phoneme recognition, ASR) with variable timing (a minimal CTC-loss sketch appears just after this RNN block)
Benefits — handles unknown alignment between input and output; input and output can vary in size; specifies the position of the character in the output. Can transform neural network output to the final text without post-processing.
RNNs are good where context plays a vital role in outcome prediction.
Diff from CNNs as they utilize feedback loops to process a data sequence for final output.
E2E RNN models + CTC Loss
Attention-based encoder-decoder models
Bimodal RNNs
LSTMs (processed over time) + CNNs (extract local features)
Bidirectional LSTM
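
Since CTC also shows up as Goal 2 of the project, here is a minimal PyTorch sketch of the idea referenced above. The shapes and the stand-in bidirectional LSTM encoder are arbitrary choices of mine, not anything prescribed by the papers.

```python
import torch
import torch.nn as nn

T, N, C = 100, 4, 30                       # time steps, batch size, label vocabulary (incl. blank=0)
encoder = nn.LSTM(input_size=40, hidden_size=64, bidirectional=True)
proj = nn.Linear(128, C)

feats = torch.randn(T, N, 40)              # e.g. 40-dim filterbank frames
log_probs = proj(encoder(feats)[0]).log_softmax(dim=-1)     # (T, N, C)

targets = torch.randint(1, C, (N, 12))     # label sequences; no frame-level alignment is given
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                  # sums over all possible alignments of input to output
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```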

=> CNNs: Deep neural architecture consisting of 1/more pairs of alternating convolutional or pooling layers
Convolutional layers apply filters that process small local regions of the input, and the filters are replicated over the whole input space
Pooling converts convolutional layer activations to low resolution.
2DCNNs, 1DCNNs
Applications:
(1) Facebook AI (2021) wav2vec 2.0 [Hybrid ASR: CNN to learn representations of raw speech signals, then fed to transformer-based LMs],
(2) Google VGGVox [CNN with VGG architecture to learn speaker embeddings from mel spectrograms]
(3) Deep Noise Suppression (DNS)
=> Temporal CNNs: TCNNs better than RNNs as they allow faster training by allowing parallel computation plus no vanishing/ exploding gradients.
1D fully convolutional network with causal convolutions
Can also deal with lengthy input sequences more efficiently than LSTM/ GRU because of shared filters.

=> Transformers:
2018:
SpeechTransformer:
A no-recurrence seq2seq model for speech recognition. To reduce the dimension difference between the input and output sequences, CNN layers are applied before feeding features to the transformer. Then, CTC is integrated with the transformer model.
Tacotron2, Deepvoice3, TransformerTTS
2019:
VQ-wav2vec (315M)
wav2vec: 2 CNNs
Continuous acoustic features -> discrete units -> used to train BERT (transformer encoder network)
Mockingjay (35M) — National Taiwan University
Continuous acoustic features ->Transformer encoders. Also, masking
DiscreteBERT (110M)
Benefits of using discrete units as inputs to transformer encoder.
2020:
Conformer (118M)
CNN+ Transformer
wav2vec 2.0 / XLSR-53 (317M)
Uses discrete speech units like VQ-wav2vec, and replaces the convolutional context network with a transformer encoder
But keeps the original contrastive objective rather than BERT’s masked language modelling objective.
wav2vec — Conformer (1B)
Conformer architecture merged with wav2vec2.0 pre-training objective and noisy student training.
DeCoAR 2.0 (317M)
DeCoAR was a non-transformer LSTM model
The ELMo model inspires DeCoAR
DeCoAR 2.0 replaces the bidirectional LSTM with a transformer encoder
2021:
UniSpeech
Combines an SSL objective (same as wav2vec 2.0) with a supervised ASR objective (CTC)
Joint optimisation allows better alignment of discrete speech units with phonetic structure of audio, hence better performance
HuBERT
reuses wav2vec 2.0 architecture, but replaces contrastive loss with BERT’s original masked language modelling objective
Pretraining process has two steps: 1.) Clustering = Pseudo labels assigned to short segments of speech, 2.) Prediction = Model trained to predict these pseudo labels at randomly masked positions in the original audio
w2v-BERT
Similar to w2v Conformer, but combines contrastive loss of wav2vec 2.0 with masked language modeling objective of BERT, hence end-to-end training of MLM without need to alternate between processes like HuBERT.
XLS-R
scaled-up version of XLSR-53. 2B params
data2vec
Learns representations across multiple modalities (self-supervised)
Whisper
Minimalist data processing and weak supervision. Multitasking model
VALL-E, VALL-E X
Big-SSL
Scaled up w2v Conformer. Scale up model size and data. 8B parameters.
UniSpeechSAT/ wavLM
follows HuBERT framework while focusing on data augmentation during pretraining to improve speaker representation learning and speaker-related downstream tasks.
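
All three model families discussed above for this project (wav2vec 2.0, HuBERT, WavLM) live on the HuggingFace Hub and share the same loading pattern; a quick sanity check might look like this (checkpoint names as listed on the Hub, dummy audio):

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

checkpoint = "microsoft/wavlm-base-plus"   # or "facebook/wav2vec2-base", "facebook/hubert-base-ls960"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

waveform = torch.randn(16000)              # 1 s of dummy 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # roughly one 768-dim vector every 20 ms
print(hidden.shape)
```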

=> Seq2Seq models
Stacks of RNNs/ Transformers/Conformers
Attention based approaches
CTC

=> Reinforcement Learning
Can learn directly from raw audio, eliminating need for hand-engineered features
Value-based DRL (deep reinforcement learning), Policy based DRL, Model-based DRL

=> Graph Neural Network
Graph Convolutional layers, Graph Attention layers, Graph transformer
Applications like multichannel speech enhancement.

=> Diffusion Probabilistic Models
Speech synthesis and enhancement
FastDiff, SRTNet, DiffWave

# Meeting 2: 3rd June, 2024

Understanding the variety of different models that can be used:

wav2vec may have a better ecosystem of models and support.

But Peter Sir said he wouldn’t spend crazy amounts of time on model selection; choose one and move forward. We can always come back to this stage if we get weird results.

Clarified the doubt about training a model from scratch: we will be using a model off HuggingFace directly and just adding some classification layers on top of it (fine-tuning a pretrained model).
Ahmed Sir mentioned some toolkits for fine-tuning: SpeechBrain, ESPNet

Clarified doubt about dataset download (winscp, rsync)

Week 4– 3rd June, 2024 to 9th June 2024.

Section 1: Understanding the datasets

1.1 NXT Switchboard Corpus

The Switchboard Corpus is a collection of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a predetermined list.

[Penn Treebank project]: A subset of 1M words annotated for syntactic structure and disfluencies.
[Intl CS Institute, UCB] Phonetic transcripts generated
[Mississippi State Univ] Corrected the phonetic transcripts
Penn Treebank Transcript provided a basis for NXT Switchboard corpus. Noun phrases from that subset annotated for animacy
Treebank Transcript aligned with corresponding subset from the corrected MS-State transcript in order to provide word timing info.

1.1.1 What is NXT
NXT is an open-source toolkit that enables multiple linguistic annotations to be assembled into a unified database. It uses a standoff XML format, i.e. several XML files that point to each other. The NXT format provides a data model that describes how the various annotations for a corpus relate to one another, without imposing any particular linguistic theory or markup structure.
Users define annotations in a metadata file that expresses contents, and how they relate to each other in terms of a graph structure for the overall corpus annotations.
Graph structure: Relationships that can be defined in the data model draw annotations together into a set of intersecting trees, but also allow arbitrary links between annotations over the top of this structure, giving a representation that is highly expressive, easier to process than arbitrary graphs.
NXT’s other core component is the query language, which is designed specifically for working with data conforming to this data model.
Together, the data model and query language allow annotations to be read as 1 coherent set containing both structural and timing info.

1.1.2 Structure of the data
The Penn Treebank bracketed format data was first extracted into multiple XML files associated with 1 dialogue using an XML-based tool (set of XSL stylesheets) for syntactic query.
The data was divided into separate XML files representing the (1) Orthographic transcription (flat list of terminals: words, punctuation, traces etc), (2) Syntax (starts with a flat list of parses and works down through nonterminals, grounding in terminals, which are in the transcription file but are referenced by pointers), (3) Turn structure (flat list of turns that themselves contain parses as children, again via pointers into the syntax file), (4) Disfluencies (couples reparanda and repairs by pointing to the appropriate nonterminals using named roles) and (5) Movement, or the relationship between traces and their sources.

1.1.3 Advantages of separate XML files
=> Information in a single tree structure, with co-indexing for the crossing links that are sometimes required for disfluency and movement,
=> Facilitates querying the crossing structures, since they are treated on a par with other structures within the data. Although this ease is not particularly important for the initial, syntactic data, it is crucial for a correct understanding of discourse phenomena such as coreference.
=> Separating the tags into their various types makes it easier to add data using external processes (part-of-speech taggers, named entity recognizers, and the like)

1.1.4 Actual Information included in the NXT Corpus Data
From the Penn Treebank transcript:

  • terminals: Includes words, punctuation and silence, as well as traces marking the origin of ‘moved’ syntactic elements. Part-of-speech information is included. The Treebank transcript did not originally include timing information, so word timings have been derived by automatic alignment with the MS-State version of the transcript.
  • syntax: The hierarchical syntax structure is represented by parent-child relationships in the XML. The syntactic phrase category (e.g. VP, NP), optional sub-category (e.g. SBJ, MNR), timing information and a word count of the phrase is included.
  • movement: marks the link between traces and antecedents as co-indexed in the Treebank annotation. For example, in “What book_i did you buy t_i?”, what book is the antecedent of the trace (t).
  • turns: encodes the speaker turns within each conversation, i.e. the approximate linear order in which the sentences were said by each speaker (note that overlapping speech can appear before or after the other speaker’s turn).
  • disfluency: coding of disfluent speech from the Treebank release. Disfluencies consist of a reparandum, i.e. the words where the speaker hesitated or made a false start, and a repair, where the speaker corrected the error, e.g., “the-_reparandum the government_repair”.
  • active: sentences which have been automatically identified as being in the active voice.
  • markable: encoding of selected NPs at Edinburgh and Stanford for information status (old, mediated or new) and animacy (e.g. human, animal, non-concrete). Only a portion of the corpus is annotated for information status.
  • coreference: marks the relationship between each anaphor (i.e., NP marked as old) and its antecedent i.e. the previous mention of the referent of that NP in the discourse. This was done as part of the information status annotation.
  • kontrast: encoding of selected content words (e.g., nouns, verbs, adjectives) at Edinburgh as to whether they are kontrastive, i.e. made salient to distinguish them from alternatives to that word which could have been used in the context. Coding was done according to certain categories of kontrast, e.g., contrastive, subset or answer. Only the portion of the corpus annotated for information status was annotated for kontrast.
  • trigger: encodes the relationship between certain kontrasts and the word(s) that motivated their marking. For example, if A says “I live in Garland”, and B replies “Well, I prefer San Antonio”, then “Garland” motivates the marking of “San Antonio” as contrastive (a type of kontrast).
  • dialAct: the dialogue (or speech) act of units of the discourse, e.g., statement, question (following Shriberg et al. 1998). Note the units used were based on conversational purpose, not syntax. They are roughly equivalent to syntactic sentences, but often do not align with them.

From the MS-State transcript:

  • phonwords: representation of the corrected, time-aligned MS-State transcript of the corpus. Includes words, laughter and noise. Timing information, and the stress profile of the syllables in the word is included, e.g. “agree” has the profile ‘np’ i.e. no stress-primary stress.
  • syllables: automatically derived syllable information, done at Stanford. Includes stress information (primary, secondary or no stress).
  • phones: automatically derived phone information based on MS-State transcript, done at Stanford. Includes phone identity and timing information. Users should be aware of technical issues in the automatic phone boundary detection when using the phone times.
  • accent: pitch accents, associated with words in the MS-State transcript. The time of the peak of the accent, as annotated, and the strength of the accent (weak, full) are given. Annotations fall into three sets according to their source: some of the annotation was done at the University of Washington (UW prosody), some was converted from the UW set to NXT standards by Edinburgh, and some are original Edinburgh annotations. For the Edinburgh annotations, word association was marked manually, whereas for the UW annotations it was derived automatically from word timing. Accent type is also given for accents in the Edinburgh/Stanford set (nuclear, pre-nuclear, plain). Note prosody annotation has only been completed on a portion of the corpus.
  • phrase: grouping of words in the MS-State transcript into prosodic phrases. Timing information is given, as well as the phrase type, determined by the ToBI break index following the last word in the phrase (minor, major, disfluent, backchannel). Note in the Edinburgh/Stanford annotations phrases were directly marked by annotators, whereas for the UW annotations, they were determined automatically from the break information and silences.
  • breaks: ToBI break index, for the UW annotations only. Breaks have been aligned automatically with the nearest word boundary in the MS-State transcript. The ToBI break index, and phrase and boundary tone (where applicable), are given, along with the time of the break in the original UW annotation. For the Edinburgh original set, these were generated automatically from the phrase files and do not include boundary tones.
  • prosnotes: notes made by annotators during the prosody annotations, for the Edinburgh set only. Includes comments on errors in the transcription, f0 tracking problems, etc.
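
Because each layer is its own standoff XML file, reading one layer is mostly a matter of walking that file. A rough sketch with ElementTree follows; the file name, element tag and the nite:start/nite:end attribute names are my assumptions about the layout, not something verified against the corpus.

```python
import xml.etree.ElementTree as ET

NITE = "{http://nite.sourceforge.net/}"    # assumed namespace used for NXT start/end attributes

def read_layer(path, tag):
    """Collect (start, end, attributes) for every annotation element in one standoff file."""
    events = []
    for el in ET.parse(path).getroot().iter(tag):
        start = float(el.get(NITE + "start", "nan"))
        end = float(el.get(NITE + "end", "nan"))
        events.append((start, end, dict(el.attrib)))
    return events

# e.g. accents = read_layer("sw2005.A.accent.xml", "accent")   # file and tag names are illustrative
```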

1.2 BURSC Corpus

8 different speakers, F1A, M1B, F2A, M2B, F3A, M3B…
Datafiles in any of the subdirectories for a speaker:
1. “filename.txt” — transcript (text form for humans, whole story)
2. “filename.sph” — SPHERE-format waveform file (if the final character is “n” (“txn”, “spn”), the file was recorded with significant background noise)
3. “filename.f0” — ESPS-format pitch file (can be displayed with ERL “waves”)
4. “filename.f0a” — ASCII-format pitch table (content equivalent to the “f0” file); F0 values and pitch detection data are given at 10 msec intervals
5. “filename.pos” — part-of-speech tags
6. “filename.ala” — automatic phone alignments (phones & words, BU format)
7. “filename.aln” — phone alignments (phones & words, BU format), translated from hand-corrected *.lbl files
8. “filename.lba” — automatic alignments in Waves label (ASCII) format
9. “filename.lbl” — hand-corrected alignments in Waves label (ASCII) format
10. “filename.wrd” — word boundary markers, Waves label (ASCII) format, currently aligned with *.lba
11. “filename.brk” — prosodic phrase breaks in ToBI Waves (ASCII) format
12. “filename.ton” — accent and boundary tone labels in ToBI (ASCII) format
13. “filename.msc” — ToBI misc file, contains breath markers and some f0 errors so far (ASCII)

[“filename” always has the following format: sssxyyzz, where sss — speaker identifier (e.g. “f1a”), xyy — story identifier (radio news: x=s, yy= story number (s01, s02, …), lab news: x in (j,p,r,t), yy in (nr — nonradio, rl — radio lab)) and zz — paragraph number (p1, p2, …)]

Other whole-story files in the story directories (radio or labnews):
1. “filename.prn” — dictionary pronunciations for words in this story
2. “filename.net” — pronunciation network built from .prn, used by the recognizer

[“filename” for these cases generally has the following format: sssxyy, where sss — speaker identifier, xyy — story identifier or “lab” (e.g. f1as01 or f1alab); or sometimes “filename” will simply be sss]
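
Given the sssxyyzz convention above, pulling the speaker, story and paragraph out of a file stem is a small regex exercise. A sketch (untested against the whole corpus, so treat the pattern as an assumption):

```python
import re

# sss = speaker (e.g. "f1a"), xyy = story id (e.g. "s01" or "lab"), zz = paragraph ("p1", "p2", ...)
FNAME = re.compile(r"^(?P<speaker>[fm]\d[ab])(?P<story>[a-z][a-z0-9]{2})(?P<para>p\d+)?$", re.I)

def parse_bursc_name(stem):
    m = FNAME.match(stem)
    return m.groupdict() if m else None

print(parse_bursc_name("f1as01p1"))   # {'speaker': 'f1a', 'story': 's01', 'para': 'p1'}
print(parse_bursc_name("f1alab"))     # {'speaker': 'f1a', 'story': 'lab', 'para': None}
```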

1.3 Santa Barbara Corpus

Part I:
3 CD-ROM volumes; each volume has a README.doc, table.doc, SPEECH (*.wav, *.trn, *.flt) and DOC (table files *.tbl)
Each speech file (*.wav) is accompanied by a transcript (phrases time stamped, personal phones/names/places are altered to preserve anonymity (audio files are filtered to make these portions unrecognisable) — separate *.flt files to list beginning and end of filtered regions)

Part II, III
segment.tbl, segment.txt, segment_summaries.txt, speaker.tbl, speaker.txt, table.txt, annotations
Each speech file is accompanied by 2 transcripts in which intonation units are timestamped. The text and coding content is the same for the 2 transcripts, but the *.ca file has transcripts in the CHAT format and the *.trn is structured according to the LDC Callhome format.

Section 2: HuggingFace Audio Classification Tutorial

  1. Loading the dataset
  2. Preprocess [Feature Extractor to process audio signal]
  3. Evaluate [Define evaluation method]
  4. Train
  5. Inference [pipeline()]
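
Condensing those five steps into one hedged sketch (the dataset layout, label set and hyperparameters below are placeholders; the checkpoint is the WavLM base-plus model discussed in the meetings):

```python
from datasets import load_dataset, Audio
from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/wavlm-base-plus"
ds = load_dataset("audiofolder", data_dir="BURSC_DATA_TRIAL/audio")    # 1. load (placeholder layout)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))              #    resample to 16 kHz

fe = AutoFeatureExtractor.from_pretrained(checkpoint)                  # 2. preprocess

def preprocess(batch):
    audio = [x["array"] for x in batch["audio"]]
    return fe(audio, sampling_rate=16_000, max_length=16_000,
              truncation=True, padding="max_length")

ds = ds.map(preprocess, batched=True, remove_columns=["audio"])

model = AutoModelForAudioClassification.from_pretrained(checkpoint, num_labels=2)
args = TrainingArguments(output_dir="wavlm-prosody", per_device_train_batch_size=8,
                         learning_rate=3e-5, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=ds["train"]).train()     # 4. train
```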

Section 3: My Analysis

3.1 What is my Goal?
I have a pretrained model wavLM.
I need to send an audio file to wavLM, and get it to make a prosody annotation file.
Then compare this file with the ground-truth prosody annotations produced by manual ToBI labelling.
What are prosody annotations? Pitch accent detection and classification, and phrase boundary detection and classification, as per Andrew’s thesis.
Also, Andrew handpicked features from the data, but we will use an AutoFeatureExtractor.

3.2 What are prosody annotation files? How do they look like?

  1. NXT
    Which: Separate XML files under
    /accent [Start and end timepoints, strength (full/weak), type (nuclear/prenuclear/plain)]
    /phrase [start and end timepoints, type (minor/major/disfluent/backchannel)]
    /breaks [start and end timepoints, break index]
    /prosnotes
    Format: NITE Object Model format
  2. BURSC
    Which: filename.brk [prosodic phrase breaks]
    filename.ton [pitch accents and boundary tones],
    filename.msc [breath markers, f0]
    Format: Files ?
  3. SBC
    Which: annotated within transcript (2 types — *.trn and *.ca)

Section 4: Hugging Face Audio Course

Went through the HuggingFace Audio course.

# Meet 3: 10th June, 2024

Agenda

  • Work done in the past week -
  • Section 1: Understanding the datasets,
  • tutorial on wav2vec finetuning,
  • My analysis (and doubt)
  • HF Audio course (ongoing)
  • Doubt about input output
  • Knowledge: Ask about RL, GNNs, Diffusion probabilistic models

Doubt:

So I have been looking at the datasets and all of them have different formats for presenting their prosody annotations

1. NXT has XML annotation files structured in the NITE Object Model format.

2. BURSC has files in the Waves Label or ASCII format.

3. Santa Barbara is different from both since it has the annotations incorporated into the transcript file itself, in two different formats- LDC Callhome and CHAT format.

At the moment I am trying to get down the basic principles and answer the question of what is our input and output, considering the model and implementation as a blackbox.

Input is an audio file, divided into clips of say “X” ms duration each

Which of the three outputs are we targeting to replicate?

As in, do we want separate XML files for each audio clip, or do we want the annotations interspersed in the transcript?

Also, if we want the annotations interspersed in the transcript, do we already have the transcript as a lateral input, or do we have to generate the transcript as well?

Plan for this week:

  1. Finish HF audio course
  2. Dataset preprocessing (to streamline input and output structure)
  3. If possible, a basic implementation of the model, or at least think about the classification layer etc.

Discussed:

BURSC 3 and 4 phrase boundary are events of interest. All 4s are 3s.

Format ELAN (.eaf) and PRAAT (.txt)

Python libraries like pympi-ling

Final file with two columns, for intonational and intermediate boundaries, with corresponding timestamps.
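
For the ELAN route, pympi exposes the tiers directly; a small sketch (the file and tier names are placeholders for whatever we end up exporting):

```python
import pympi

eaf = pympi.Elan.Eaf("f1as01p1.eaf")        # hypothetical ELAN export of one BURSC paragraph
print(eaf.get_tier_names())

rows = []
for start_ms, end_ms, value in eaf.get_annotation_data_for_tier("boundaries"):
    # value would carry the break index, e.g. "3" (intermediate) or "4" (intonational)
    rows.append((start_ms / 1000.0, end_ms / 1000.0, value))
```

The rows can then be split into the two columns (intonational vs intermediate) described above.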

# Meet 4: 17th June, 2024

Agenda

  • Work done in the past week -
  • Input Structure
  • File formats
  • Bursc docs
  • Model finetuning learn (little)

Input Structure

  1. NXT

Raw:

Model input:

The last column is the break index according to the ToBI standard.

[123]p stands for disfluent phrases
X stands for backchannel phrases (these are short feedback phrases like ‘okay’, ‘yeah’, etc.)
3 stands for a minor (or intermediate) phrase boundary
And 4 stands for a major (or intonational) phrase boundary

  2. BURSC

Raw

Model Input

For the BURSC dataset, prosody annotations are not available for the complete dataset, which means that only a subset is labelled.

In each of the annotation labels we have the timestamp and break index. The break index signifies the phrase boundaries (3 for intermediate and 4 for intonational)

It is different from NXT in the sense that it just has a point timestamp and not a start and end like NXT, but it is similar in its usage of break indices.

  3. SBC

Raw

For the Santa Barbara corpus, we can get the start and end timestamps just like NXT, but the annotations are interspersed within the transcript, so that will require string parsing to identify and isolate the specific annotations (according to the LDC Callhome format)

# Meet 5: 24th June, 2024

  • Dataset preparation almost complete (separating samples, and unique identifiers)
  • Began model code

Dataset code

BURSC

NXT

Dataset Doubt:

Separate .wav files (timestamp such that before and after, or at end (can’t keep a fixed interval) or iterate over frames?

Unique identifiers — separate filenames if separate .wav files, else ..

NXT speech files

Model Code:

  1. Dataloader script — dummy data from the Minds dataset as two separate folders for audio and text
  2. Upsample, feature extractor,
  3. WavLM base plus
  4. TrainingArguments, Trainer

Model code doubt:
I have two labels, so how do I define my output?

import torch.nn as nn

class WavLMNew(nn.Module):
    def __init__(self, wavlm_model):
        super().__init__()
        self.wavlm = wavlm_model                   # pretrained WavLM backbone (with its classification head)
        self.timestamp_layer = nn.Linear(..)       # extra head for timestamps; dimensions still to be decided

    def forward(self, inputs):
        outputs = self.wavlm(……)                   # forward pass through WavLM; arguments still to be filled in
        logits = outputs.logits                    # label head (e.g. break index)
        hidden_states = outputs.hidden_states      # hidden states fed to the extra timestamp head
        timestamps = self.timestamp_layer(hidden_states)
        return logits, timestamps

# Meet 6: 1st July, 2024

  • Data Preprocessing -
  • Extract info from the break files. Save output in df.
  • Extract all frames from the audio files. Save output in data.
  • Compare the break files and audio files to make a target label column in data (one break index corresponding to each frame; see the sketch after this list)
  • Final data structure — zip folder BURSC_DATA_TRIAL, nested with zip folders for audio and text. Audio has all the audio files that were in wav_BURSC_AUDIO_FINAL. Text has bursc_annotations.csv
  • Model
  • Dataloader Script
  • Info
  • Split generator (text path till bursc_annotations.csv), (audio path till folder)
  • Generate examples (read the csv line by line and extract)
  • Dataset_encoded = Feature Extractor
  • Multi Label Classification problem
  • Training Arguments
  • Define evaluate function
  • train
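
A hedged sketch of the target-label step from the list above: map each break timestamp onto the nearest audio frame, assuming a fixed 20 ms frame rate (roughly WavLM's output stride) and a DataFrame with time and break_index columns (the column names are mine, not the .brk format's):

```python
import numpy as np
import pandas as pd

FRAME_SEC = 0.02   # ~20 ms per output frame (assumption)

def frame_labels(breaks_df, audio_duration_sec):
    """breaks_df: DataFrame with 'time' (seconds) and 'break_index' columns parsed from a break file."""
    n_frames = int(np.ceil(audio_duration_sec / FRAME_SEC))
    labels = np.zeros(n_frames, dtype=int)                 # 0 = no boundary at this frame
    for _, row in breaks_df.iterrows():
        idx = min(int(round(row["time"] / FRAME_SEC)), n_frames - 1)
        labels[idx] = int(row["break_index"])              # e.g. 3 (intermediate) or 4 (intonational)
    return labels

# e.g. labels = frame_labels(pd.read_csv("bursc_annotations.csv"), audio_duration_sec=12.7)
```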
