Quick Hacks #3

Maxim Shevelev
Published in chemoinformatics
Mar 12, 2021

Quick Hacks #3: Counting the number of SMILES strings in a .smi file.

Hi there. Here’s yet another common situation when working with chemoinformatics-related data. Imagine that a colleague sends you a .smi file and says that there is one SMILES string on each line of this file. He doesn’t mention how many compound representations there are, though. So, is there any quick way to count the number of compounds in such a file?

Yes, there is! Just run this command, and you will be good to go:

wc -l file_with_smiles.smi

The output will be a single number, the line count (which here tells you how many compounds this .smi file has), followed by the filename itself.

Q: Well, that was pretty quick. But what’s the deal with wc and the -l flag?

A: The wc command (short for “word count”) is a standard Unix utility that counts the lines, words, and bytes in a given text file. If you run it like this: wc file_with_smiles.smi, it prints out three numbers, along with the file’s name: the number of lines, words, and, well, bytes (thanks, Captain Obvious; for plain-ASCII SMILES a byte is the same as a character). But for a file where one line corresponds to one SMILES string, it doesn’t really make sense to count anything other than the lines. That is why we run it with the -l flag, which basically says “print only the line count” (the filename is still shown next to it). Essentially, we assume that the number of lines corresponds to the number of compounds this file contains.
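To see the difference for yourself, here is a minimal sketch on a throwaway file. The name demo.smi and the three SMILES in it are made up for illustration; swap in your own file.

```shell
# Create a tiny demo file; the name and contents are hypothetical.
printf 'CCO\nc1ccccc1\nCC(=O)O\n' > demo.smi

# No flags: line, word, and byte counts (bytes equal characters
# for plain ASCII), followed by the filename.
wc demo.smi

# With -l: just the line count, followed by the filename.
wc -l demo.smi

# Reading from stdin leaves the filename out, which is handy
# when you want to store the bare number in a variable.
n_lines=$(wc -l < demo.smi)
echo "$n_lines"
```

The last form is probably the one to reach for in scripts, since there is no filename to strip from the output.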

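Be wary, though: wc -l doesn’t care whether a line is empty, so any empty lines in your input file are counted as well. Strictly speaking, it counts newline characters, so a last line that lacks a trailing newline is not counted. Still, it’s a good and quick trick to estimate how many compounds you have in a file that is formatted this way.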
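If you suspect the file has empty lines, one possible workaround is to count only the lines that contain something. The sketch below uses grep for that; it is a swapped-in alternative to wc -l, not part of the original trick, and demo_gaps.smi with its contents is a made-up example.

```shell
# Demo file with one empty line and one whitespace-only line
# (hypothetical name and contents).
printf 'CCO\n\nc1ccccc1\n   \nCC(=O)O\n' > demo_gaps.smi

# Every line, blank or not: 5 here.
all_lines=$(wc -l < demo_gaps.smi)

# Only lines with at least one non-whitespace character: 3 here.
# -c makes grep print the number of matching lines instead of the lines.
compounds=$(grep -c '[^[:space:]]' demo_gaps.smi)

echo "lines: $all_lines, non-blank lines: $compounds"
```

Unlike wc -l, grep also counts a final line that has no trailing newline. It still assumes that every non-blank line is a SMILES string, so header or comment lines would be counted too.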

Hope it will be useful for you. Cheers!
