Aastha Punjabi
3 min read · May 3, 2024

GSOC-24 Speech and Language Processing for a multimodal corpus of Farsi

Hi there! I’m Aastha Punjabi, a third-year undergraduate at the Indian Institute of Technology Kanpur.

I’m super excited to share my weekly updates through this blog as I dive into Speech and Language Processing for a multimodal corpus of Farsi for Google Summer of Code 2024 with the Red Hen Lab.

**Week 0 : Community Bonding Period**

Project Summary:

CWRU HPC:
- I created an affiliate account in the mbt8 group on CWRU HPC, installed the authorization keys, and made a directory for my project at /mnt/rds/redhen/gallina/home/akp165
All subsequent files and code will be stored in this directory.

**Week 1 and 2 (27th May to 9th June)**

- In the first meeting with Peter Sir, we went through the steps to be done over the following weeks. These were:
- Forked the repository youtube_pipeline (https://github.com/RedHenLab/youtube_pipeline) and then cloned the fork, since it was important to keep my changes mergeable with the original repository.
- Downloaded test videos and their respective JSON and subtitle files in Persian (Peter Sir gave me a list of channels to download the videos from) using yt-dlp
- Use `pip install yt-dlp` for installation
- Many modules are already available on Case HPC, and one can load them using `module load MODULE_NAME`
- Available modules can be listed using `module avail`
- Any task on HPC that requires more memory has to go through the scheduler, using job commands such as sbatch or srun
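The scheduler bullet above can be made concrete with a batch-job template. This is only a rough sketch: the job name, memory, time limit, and module name below are placeholders I picked, not actual Case HPC settings.

```shell
#!/bin/bash
#SBATCH --job-name=farsi-pipeline   # any label you like
#SBATCH --mem=16G                   # placeholder; size it to the actual step
#SBATCH --time=02:00:00             # wall-clock limit for the job
#SBATCH --output=%x-%j.out          # log file named after job name + job id

# load whatever the step needs (check `module avail` for the exact names)
module load Python

bash run_corpus_pipeline.sh
```

Submit it with `sbatch job.sh`, or use `srun --mem=16G --pty bash` to get an interactive shell on a compute node instead.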
Command used:

yt-dlp -i -o "%(id)s.%(ext)s" "URL of the video" --write-info-json --write-auto-sub --sub-lang en --verbose

- This downloads the .info.json file along with the mp4/webm video, and subtitles if available.
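Since the .info.json that yt-dlp writes is plain JSON, the metadata is easy to inspect from Python. A small sketch (the keys `id`, `title`, `duration`, and `upload_date` are standard yt-dlp fields; the helper name is my own):

```python
import json

def video_metadata(info_json_path):
    """Read a yt-dlp .info.json file and return the fields we care about."""
    with open(info_json_path, encoding="utf-8") as f:
        info = json.load(f)
    return {
        "id": info.get("id"),
        "title": info.get("title"),
        "duration_s": info.get("duration"),
        "upload_date": info.get("upload_date"),
    }
```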

- For the English pipeline, the UDPipe model for English had to be downloaded
- Downloaded SoMaJo for tokenization in English (it can be used for Russian and Bangla too, but different tools need to be used for other languages)
- Downloaded the Red Hen punctuation restoration tool (cloned the repo)
- Installed and built udpipe-1.2.0-bin from this link:
https://github.com/ufal/udpipe/releases/download/v1.2.0/udpipe-1.2.0-bin.zip
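UDPipe's output is CoNLL-U: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, …), blank lines between sentences, and `#` comment lines. A minimal reader that keeps only the first four columns — the function name is my own, and it skips details like multiword-token ranges:

```python
def parse_conllu(text):
    """Parse CoNLL-U text (as produced by UDPipe) into sentences of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # blank line ends the current sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments, e.g. "# text = ..."
        cols = line.split("\t")
        tokens.append({"id": cols[0], "form": cols[1], "lemma": cols[2], "upos": cols[3]})
    if tokens:
        sentences.append(tokens)
    return sentences
```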

- There was an issue running the files due to Windows line endings, which was resolved by running `dos2unix *.sh`
- Note: try to install the other required modules without pinning versions, and run each script one by one if errors arise while running run_corpus_pipeline.sh

- There were certain changes to be made in the original repository before running the pipeline:

In annotate_english_pos_sent.sh, change

`find en.conll.tok $inpath -type f`

to

`find $inpath -type f -name '*conll.tok'`
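The difference matters: `find en.conll.tok $inpath` looks for a literal `en.conll.tok` path (and errors when it doesn't exist), while the fixed command matches every `*conll.tok` file under `$inpath`. A quick check in a scratch directory (the file names are made up):

```shell
# build a scratch tree with two tokenized files and one unrelated file
inpath=$(mktemp -d)
touch "$inpath/video1.conll.tok" "$inpath/video2.conll.tok" "$inpath/notes.txt"

# the fixed command picks up exactly the two *conll.tok files
find "$inpath" -type f -name '*conll.tok'
```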

In tok_conll_merge.sh, rewrite the whole file as:

```bash
#!/bin/bash
CONLL_IN="$1"      # e.g. "conll_input"
TOKENIZED_IN="$2"  # e.g. "puncttext"
OUTPATH="$3"       # e.g. "conll_tokenized"
for c in "$CONLL_IN"/*conll_input
do
    SUPERBASENAME=${c%.conll_input}
    SUPERBASENAME=${SUPERBASENAME##$CONLL_IN}
    python3 merge_conll_somajo.py "$c" "$TOKENIZED_IN/$SUPERBASENAME.punct" > "$OUTPATH/$SUPERBASENAME.conll.tok"
done
```
- Successfully tested the English pipeline by running run_corpus_pipeline.sh, creating a final VRT file
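For reference, VRT is the verticalized text format used by corpus tools such as CWB: XML-like structural tags with one token per line and tab-separated positional attributes. Roughly like this (the exact attribute columns depend on the pipeline configuration, so this is only an illustration):

```
<text id="video_id">
<s>
This	DT	this
is	VBZ	be
a	DT	a
test	NN	test
.	SENT	.
</s>
</text>
```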

**Week 3 and 4 (10th June to 23rd June)**

- Tested the georgian_pipeline using as inputs a subtitle file with a .txt extension and a time_information.xlsx file, which I generated using the speech-to-text AI tool voiser.net
- Downloaded the TreeTagger for Georgian from here: https://github.com/SophikoComp/TreeTagger-for-Georgian/
- Used OpenAI Whisper for videos that do not have subtitles; it can be used to generate .srt, .vtt, etc. files
Used these commands:

pip install -U openai-whisper
sudo apt update && sudo apt install ffmpeg
pip install setuptools-rust

For our purpose, we used the large model for Whisper, which requires around 10 GB of VRAM; this can be allotted using the srun command.

whisper "path_to_mp4_file" --model large --language Persian -f srt
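On the cluster, the GPU memory for the large model has to come from the scheduler first. Something along these lines — the `--gres` syntax is standard Slurm, but the exact resource names and limits on Case HPC are assumptions to verify against the cluster docs:

```shell
# request a GPU node interactively, then run whisper inside the allocation
srun --mem=16G --gres=gpu:1 --time=02:00:00 --pty bash
whisper "path_to_mp4_file" --model large --language Persian -f srt
```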

Whisper can also be used from a Python script, and it works on plain audio files as well.
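A sketch of the Python route: `whisper.load_model` and `model.transcribe` are the library's documented API, while `format_timestamp` and `transcribe_to_srt` are helper names I made up, with a minimal hand-rolled SRT writer rather than whisper's built-in one.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe_to_srt(video_path, out_path, language="fa"):
    """Transcribe with Whisper and write an SRT file (needs openai-whisper + ffmpeg)."""
    import whisper  # imported here so format_timestamp stays usable without the package
    model = whisper.load_model("large")
    result = model.transcribe(video_path, language=language)
    with open(out_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")
```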