Removing all lines containing a specific string from a text file with sed

I am working with the TAC KBP sequencing dataset, which includes brat annotation files. Brat is a nice tool for annotation and visualization, but this time all the unrelated annotations in the dataset made it impossible for me to see the data for my specific task: sequence detection.

One sentence from an annotated document before pruning @brat

So all I needed was to get rid of the other relations in the .ann files. A brat annotation file contains one annotation per line, so deleting the lines with coreference annotations would make the visualization a lot nicer.

sed -i '' "/Coreference/d" /path/to/file

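Sed did the job for me. Here `-i` edits the file in place, and the empty `''` after it tells the BSD sed that ships with OS X not to keep a backup copy (GNU sed takes `-i` with no separate argument). Of course, I needed to run this not just on one file but on all the annotation files in the folder.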

for file in data/training/*.ann; do sed -i '' "/Coreference/d" "$file"; done
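Since the argument that `-i` expects differs between BSD and GNU sed, the same pruning can also be done without `-i` at all. Below is a minimal sketch, not the exact commands I ran: `prune_ann` is a hypothetical helper name, and it assumes the pattern is a plain sed regular expression that contains no `/`.

```shell
# Hypothetical helper: delete every line matching PATTERN from all .ann files in DIR.
# It writes to a temporary file instead of using `sed -i`, so it behaves the same
# with BSD (OS X) and GNU sed. Assumes PATTERN contains no "/" characters.
prune_ann() {
    dir=$1
    pattern=$2
    for f in "$dir"/*.ann; do
        # If there are no .ann files the glob stays unexpanded, so skip it.
        [ -e "$f" ] || continue
        tmp=$(mktemp) || return 1
        # Filter into the temp file, then copy back so the original file
        # keeps its permissions.
        sed "/$pattern/d" "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

With this defined, `prune_ann data/training Coreference` would do the same job as the loop above.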

The visualization still looks messy with long-distance relations, but that helped to tidy it up a little.

One sentence from an annotated document after pruning @brat

I also wanted to get rid of the line breaks, because the source txt files had them in the middle of sentences everywhere. This time, however, I had to replace them with some other character rather than delete them, since I needed the offsets to stay unchanged. So I replaced every line break with a space. (I needed gnu-sed on OS X to run this second one.)

gsed -i ':a;N;$!ba;s/\n/ /g' data/training/$file
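For anyone without gnu-sed, a one-for-one character swap like this can also be sketched with `tr`, which is available on both OS X and Linux. This is an alternative I am suggesting, not what I ran: `flatten_newlines` is a hypothetical helper name, and unlike the sed command it also turns the final newline of the file into a space. It compares the byte count before and after, so the file is only overwritten if the offsets are guaranteed to still line up.

```shell
# Hypothetical helper: replace every newline in FILE with a single space.
# Each newline becomes exactly one space, so character offsets are unchanged.
flatten_newlines() {
    f=$1
    tmp=$(mktemp) || return 1
    before=$(wc -c < "$f" | tr -d ' ')
    tr '\n' ' ' < "$f" > "$tmp"
    after=$(wc -c < "$tmp" | tr -d ' ')
    # Only overwrite the original if the size is identical.
    if [ "$before" -eq "$after" ]; then
        cat "$tmp" > "$f"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Files with Windows line endings would keep their `\r` characters, so this sketch assumes plain `\n` line breaks.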
