k2-fsa:text_search Install TextSearch

Nadira Povey
Jul 24, 2023

This tutorial shows how to install TextSearch (text_search) from the k2-fsa GitHub organization with pip, and how to run its libriheavy example successfully.

First, clone the repositories below and download the Libri-Light small subset from Facebook. Then rename and move the folders as shown so that they match the paths the provided scripts expect. Note that the provided code is demonstrated only with the Libri-Light dataset.

git clone https://github.com/k2-fsa/text_search.git
git clone https://huggingface.co/datasets/pkufool/librilight-text
git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
# download the Libri-Light small subset from Facebook and unpack it
wget https://dl.fbaipublicfiles.com/librilight/data/small.tar
tar -xf small.tar
# rename folders
mv librilight-text librilight_text
mv icefall-asr-librispeech-zipformer-2023-05-15 exp
# make the download folder in libriheavy
mkdir text_search/examples/libriheavy/download
mv librilight_text text_search/examples/libriheavy/download
# make the libri-light directory
mkdir text_search/examples/libriheavy/download/libri-light
mv small text_search/examples/libriheavy/download/libri-light
# run.sh reads the model from exp/ relative to examples/libriheavy,
# so the renamed model folder has to end up there as well
mv exp text_search/examples/libriheavy
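
Before going further, it can save time to confirm that the folders ended up where run.sh looks for them. The sketch below is only a sanity check: check_layout is a hypothetical helper of mine (not part of text_search), and the list of paths is taken from the variables and comments in the run.sh shown later in this post, so adjust it if your copy of the script differs.

```shell
# Hypothetical helper (not part of text_search): report which of the paths
# referenced by run.sh exist under the given example directory.
check_layout() {
  local root="$1" missing=0 p
  for p in \
    download/libri-light/small \
    download/librilight_text/output_text_small_cleaned \
    download/librilight_text/recording2book_small.json \
    exp/exp/jit_script.pt \
    exp/data/lang_bpe_500/tokens.txt
  do
    if [ -e "$root/$p" ]; then
      echo "ok       $p"
    else
      echo "MISSING  $p"
      missing=1
    fi
  done
  # Return non-zero if anything was missing.
  return "$missing"
}

# Usage, from the directory that holds the clones:
#   check_layout text_search/examples/libriheavy
```

Any line marked MISSING points at a folder that still needs to be moved or renamed.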

Next, make a virtual environment and activate it.

python -m venv venv_ts
source venv_ts/bin/activate
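
If you are not sure whether the environment is really active in the current shell, a small check like the one below can help. It is a minimal sketch: in_venv is a hypothetical helper name, and it relies only on the VIRTUAL_ENV variable that the activate script exports.

```shell
# Hypothetical helper: succeeds only when a virtual environment is active.
# The activate script exports VIRTUAL_ENV, so an empty value means that
# it has not been sourced in this shell.
in_venv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
  echo "active venv: $VIRTUAL_ENV"
else
  echo "no virtual environment is active"
fi
```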

Install the packages below with pip.

pip install fasttextsearch
pip install tqdm
pip install lhotse
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

Finally, run the script (from the directory ‘text_search/examples/libriheavy’).

./run.sh

(Note: if you get the error “ModuleNotFoundError: No module named ‘regex’”, install the module with the command below.)

pip install regex
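
To catch this kind of error before starting a long run, you can ask Python up front which modules it cannot import. This is a rough sketch: missing_modules is a hypothetical helper, and the module list covers only tqdm, lhotse, torch and regex. I left out fasttextsearch because I have not verified its import name.

```shell
# Hypothetical helper: print each module that the active Python cannot import.
# Set PYTHON to use a different interpreter; it defaults to python3.
missing_modules() {
  local m
  for m in "$@"; do
    "${PYTHON:-python3}" -c "import $m" 2>/dev/null || echo "$m"
  done
}

# Prints nothing when every module imports cleanly.
missing_modules tqdm lhotse torch regex
```

Each name it prints can then be installed with pip install before you run the script.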

Also note that if your GPU has less than 32 GB of memory, you may get the following error:

RuntimeError: CUDA out of memory.

If so, open run.sh, find stage 3, and lower the value on this line:

--max-duration 2400

I changed max-duration to 1200 for my 9 GB GPU; the value that fits will depend on how much memory your GPU has.
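
If you would rather not edit the file by hand, the change can be scripted. The sketch below assumes the option appears in run.sh as `--max-duration` followed by a number; set_max_duration is a hypothetical helper, and it rewrites every such occurrence in the file, including commented-out ones.

```shell
# Hypothetical helper: replace the number after every "--max-duration"
# in a script. It writes to a temporary file first, so it does not depend
# on the in-place option that differs between GNU and BSD sed.
set_max_duration() {
  local file="$1" value="$2" tmp
  tmp="$(mktemp)" || return 1
  sed -E "s/(--max-duration[[:space:]]+)[0-9]+/\1${value}/" "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # cat keeps the executable bit of the original
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Usage, from text_search/examples/libriheavy:
#   set_max_duration run.sh 1200
#   grep -n -- '--max-duration' run.sh
```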

Below is my run.sh, edited to run only the small subset.

#!/usr/bin/env bash

# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
#export PYTHONPATH=/mnt/speech1/anna/text_search/build/textsearch/python:$PYTHONPATH
#export CUDA_VISIBLE_DEVICES="0,1,2,3"
export CUDA_VISIBLE_DEVICES="0,1"
set -eou pipefail

# This script is used to recognize long audios. The process is as follows:
# 1) Split long audios into chunks with overlaps.
# 2) Perform speech recognition on chunks, getting tokens and timestamps.
# 3) Merge the overlapped chunks into utterances according to the timestamps.

# Each chunk (except the first and the last) is padded with extra audio on its left and right sides.
# The chunk length is: left_side + chunk_size + right_side.
chunk=30.0
#chunk=10.0
extra=2.0

stage=1
stop_stage=30

# We assume that you have downloaded the LibriLight dataset
# with audio files in $corpus_dir and texts in $text_dir
# The corpus_dir looks like:
# .
# |-- large
# |-- medium
# `-- small
#
# The text_dir looks like:
# .
# |-- output_text_large_cleaned
# |-- output_text_medium_cleaned
# |-- output_text_small_cleaned
# |-- recording2book_large.json
# |-- recording2book_medium.json
# `-- recording2book_small.json

corpus_dir=$PWD/download/libri-light
text_dir=$PWD/download/librilight_text
# Path to save the manifests
output_dir=$PWD/data

. parse_options.sh || exit 1

#world_size=4
world_size=2
log() {
  # This function is from espnet
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  # We will get librilight_raw_cuts_{subset}.jsonl.gz
  # saved in $output_dir/manifests
  log "Stage 1: Prepare LibriLight manifest"
  python prepare_manifest.py \
    --corpus-dir $corpus_dir \
    --books-dir $text_dir \
    --output-dir $output_dir/manifests \
    --num-jobs 10
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  # We will get librilight_chunk_cuts_{subset}.jsonl.gz
  # saved in $output_dir/manifests
  log "Stage 2: Split long audio into chunks"
  # for subset in small medium large; do
  for subset in small; do
    ./tools/split_into_chunks.py \
      --manifest-in $output_dir/manifests/librilight_raw_cuts_${subset}.jsonl.gz \
      --manifest-out $output_dir/manifests/librilight_chunk_cuts_${subset}.jsonl.gz \
      --chunk $chunk \
      --extra $extra # Extra duration (in seconds) at both sides
  done
fi
# Original setting in stage 3: --max-duration 2400 (lowered to 1200 below to fit a smaller GPU)
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
  # This script loads torchscript models, exported by `torch.jit.script()`,
  # and uses it to decode waves.
  # You can download the jit model from
  # https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15

  # We will get librilight_asr_cuts_{subset}.jsonl.gz
  # saved in $output_dir/manifests
  log "Stage 3: Perform speech recognition on splitted chunks"
  for subset in small; do
  #for subset in small medium large; do
    ./tools/recognize.py \
      --world-size $world_size \
      --num-workers 8 \
      --manifest-in $output_dir/manifests/librilight_chunk_cuts_${subset}.jsonl.gz \
      --manifest-out $output_dir/manifests/librilight_asr_cuts_${subset}.jsonl.gz \
      --nn-model-filename exp/exp/jit_script.pt \
      --tokens exp/data/lang_bpe_500/tokens.txt \
      --max-duration 1200 \
      --decoding-method greedy_search \
      --master 12346
  done
fi

if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
  # Merged results are saved in $output_dir/manifests/librilight_cuts_merged_{subset}.jsonl.gz
  log "Stage 4: Merge splitted chunks into utterances."
  #for subset in small medium large; do
  for subset in small; do
    ./tools/merge_chunks.py \
      --manifest-in $output_dir/manifests/librilight_asr_cuts_${subset}.jsonl.gz \
      --manifest-out $output_dir/manifests/librilight_cuts_merged_${subset}.jsonl.gz \
      --extra $extra
  done
fi


if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
  log "Stage 5: Align the long audios to corresponding texts and split into small piece."
  #for subset in small medium large; do
  for subset in small; do
    ./matching_parallel.py \
      --manifest-in $output_dir/manifests/librilight_cuts_merged_${subset}.jsonl.gz \
      --manifest-out $output_dir/manifests/librilight_cuts_${subset}.jsonl.gz
  done
fi
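
Every stage in the script is guarded by the same pair of tests on stage and stop_stage, so a stage runs only when its number lies between the two values. The sketch below isolates that logic; should_run is a hypothetical helper that does not exist in run.sh. Because run.sh sources parse_options.sh, I would expect the two variables to be settable from the command line in the usual Kaldi style (for example `./run.sh --stage 3 --stop-stage 3` to redo only the recognition stage after an out-of-memory error), but that depends on the parse_options.sh that ships with the repository, so treat it as an assumption.

```shell
# Sketch of the stage gating used in run.sh: stage N runs only when
# stage <= N <= stop_stage. should_run is a hypothetical helper.
should_run() {
  local n="$1"
  [ "$stage" -le "$n" ] && [ "$stop_stage" -ge "$n" ]
}

stage=3       # first stage to run
stop_stage=4  # last stage to run

for n in 1 2 3 4 5; do
  if should_run "$n"; then
    echo "stage $n: run"
  else
    echo "stage $n: skip"
  fi
done
```

With these values, stages 3 and 4 are reported as run and the others as skipped.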
