Quick Hacks #1

Maxim Shevelev
Published in chemoinformatics
2 min read · Feb 22, 2021
A cover picture featuring a Unix fgrep command applied on files with SDF and MOL2 file extensions.

Hello!

Imagine a very common situation in the life of a chemoinformatician: you are working on Linux (or any Unix machine) and have just downloaded a .sdf or a .mol2 file with several molecules. Now you want to quickly know how many structures this file contains.

The fastest way would be to use one of these command-line expressions:

For .sdf:

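fgrep -c '$$$$' path_to_sdf_file

or

fgrep -c "M  END" path_to_sdf_file

(note the two spaces between M and END, and the plain straight quotes: curly quotes pasted from a web page will not work in a shell)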

For .mol2:

fgrep -c "@<TRIPOS>ATOM" path_to_mol2_file

Each of these commands prints a single number: the count of lines that contain the given character sequence ($$$$, M  END, or @<TRIPOS>ATOM). Since each of these markers normally appears once per molecule, that number is the number of compounds in the file.
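To see the counting in action without a real dataset, here is a minimal sketch. It assumes a throwaway file created with mktemp and two stripped-down placeholder records (they are not chemically meaningful structures), and it uses grep -F, which behaves the same as fgrep.

```shell
# Build a toy two-record SDF-like file (placeholder content, hypothetical data).
tmp_sdf="$(mktemp)"
printf '%s\n' \
  'mol_1' '  toy' '' '  0  0  0  0  0  0  0  0  0  0999 V2000' 'M  END' '$$$$' \
  'mol_2' '  toy' '' '  0  0  0  0  0  0  0  0  0  0999 V2000' 'M  END' '$$$$' \
  > "$tmp_sdf"

# Count the lines containing the record delimiter.
n_records="$(grep -F -c '$$$$' "$tmp_sdf")"

# Count the lines containing the molecule-block terminator (two spaces after M).
n_molblocks="$(grep -F -c 'M  END' "$tmp_sdf")"

echo "$n_records $n_molblocks"   # prints: 2 2

rm -f "$tmp_sdf"
```

Both counts agree here because every toy record has exactly one M  END line and one $$$$ line.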

Q: But what is this fgrep command? Why can we use it like this?

A: fgrep is a utility available on virtually any Unix-like operating system. It is the older name for grep -F, and recent GNU grep versions print a warning recommending grep -F instead. In essence, it searches text files for a fixed character string, with no regular-expression handling, which is why characters like $ and * need no escaping. When used without any options (i.e., fgrep <character_string> <path_to_file>), it prints every line that contains this string. We add the -c option (which stands for count) after the utility's name to print the number of matching lines instead. Because, by design, .sdf and .mol2 files mark every molecule with special character strings on lines of their own (such as $$$$, M  END, or @<TRIPOS>ATOM), counting the lines that contain these strings gives us the number of separate molecules in a given file.
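The difference between plain matching and -c is easy to check on a tiny file. The sketch below is an illustration only: the file content and variable names are made up, and grep -F stands in for fgrep.

```shell
# A three-line toy file; the string "bar" occurs three times on two lines.
tmp_txt="$(mktemp)"
printf '%s\n' 'foo bar' 'bar bar' 'baz' > "$tmp_txt"

# Without -c: the matching lines themselves are printed.
matching_lines="$(grep -F 'bar' "$tmp_txt")"
echo "$matching_lines"

# With -c: only the number of matching lines is printed.
line_count="$(grep -F -c 'bar' "$tmp_txt")"
echo "$line_count"   # prints: 2

rm -f "$tmp_txt"
```

Note that -c counts lines, not individual occurrences. For structure files this makes no difference, because each delimiter sits on its own line.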

Q: Why can we use the fgrep command with .sdf and .mol2 files?

A: Because files of this type are basically text files: they contain characters and strings, and the fancy file extension doesn’t matter much.

Q: Can we use the fgrep command on files of any other formats?

A: Yes, absolutely! For example, if you have a text file that contains multiple entities separated by the same character string (let's say, you have a molecules.stuff file separating molecules by ****), you could execute the command in the same fashion, fgrep -c '****' ./molecules.stuff, and get the number of lines containing the **** string, i.e., the number of separate entities (provided that every entity is followed by its own separator line).
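The same idea can be wrapped in a small helper. This is only a sketch: count_records is a hypothetical function name, the toy file stands in for molecules.stuff, and grep -F is used in place of fgrep.

```shell
# count_records DELIMITER FILE
# Prints the number of lines in FILE that contain the literal DELIMITER.
# -e keeps delimiters that start with a dash from being read as options.
count_records() {
  grep -F -c -e "$1" "$2"
}

# Toy stand-in for molecules.stuff: three entities, each followed by ****.
tmp_stuff="$(mktemp)"
printf '%s\n' 'entity A' '****' 'entity B' '****' 'entity C' '****' > "$tmp_stuff"

n_entities="$(count_records '****' "$tmp_stuff")"
echo "$n_entities"   # prints: 3

rm -f "$tmp_stuff"
```

With this in place, count_records '$$$$' path_to_sdf_file gives the same number as the first command in this post.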

I hope that this quick hack will be useful for you. Cheers!
