Quick Hacks #1
Hello!
Imagine a very common situation in the life of a chemoinformatician: you are working on Linux (or any Unix machine) and have just downloaded a .sdf or a .mol2 file with several molecules. Now you want to quickly know how many structures this file contains.
The fastest way would be to use one of these command-line expressions:
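For .sdf:
fgrep -c '$$$$' path_to_sdf_file
or
fgrep -c "M  END" path_to_sdf_file
(note the straight quotes and the two spaces between M and END, exactly as the line appears in the file)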
For .mol2:
fgrep -c "@<TRIPOS>ATOM" path_to_mol2_file
The output of each command is a single number: how many lines contain the given character sequence ($$$$, M  END, or @<TRIPOS>ATOM). Because each of these markers normally appears once per molecule, that number is the count of compounds in the file.
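To try this without downloading anything, here is a minimal sketch that builds a toy two-record file and counts its markers. The file is not a valid SDF (the atom and bond blocks are left out), and the names demo_dir and toy.sdf are made up for this illustration. It uses grep -F, which is the standard spelling of fgrep.

```shell
# Create a throwaway directory for the toy file (names are hypothetical).
demo_dir=$(mktemp -d)

# Two stripped-down records; a real SDF would also contain atom and bond blocks.
printf '%s\n' \
  'molecule_1' 'M  END' '$$$$' \
  'molecule_2' 'M  END' '$$$$' > "$demo_dir/toy.sdf"

# Count the lines containing each marker; both counts should agree.
sdf_count=$(grep -F -c '$$$$' "$demo_dir/toy.sdf")
end_count=$(grep -F -c 'M  END' "$demo_dir/toy.sdf")

printf 'records counted via $$$$:   %s\n' "$sdf_count"
printf 'records counted via M  END: %s\n' "$end_count"

# Remove the temporary directory again.
rm -r "$demo_dir"
```

If the two numbers ever differ on a real file, that is a hint that a record is truncated or malformed.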
Q: But what is this fgrep command? Why can we use it like this?
A: fgrep is a utility program available on virtually any Unix-like operating system (it is equivalent to grep -F, and recent GNU grep releases print a warning recommending that spelling). In essence, it searches text files for fixed character strings. When used without any options (i.e., fgrep <character_string> <path_to_file>), it prints every line that contains the string. We add the -c option (which stands for count) after the utility’s name to print only the number of matching lines. By design, .sdf files end each record with a $$$$ line and each structure block with an M  END line, and .mol2 files start each molecule’s atom section with @<TRIPOS>ATOM, so counting the lines that contain one of these strings gives the number of separate molecules in the file.
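Strictly speaking, -c reports the number of lines that contain the string, not the number of individual occurrences. The sketch below shows the difference on a made-up file (lines.txt) in which one line holds the marker twice. The -o option used for comparison is available in GNU and BSD grep but is not part of POSIX, so treat that line as an assumption about your system.

```shell
# Toy file: the string AB appears twice on line 1, once on line 2, never on line 3.
demo_dir=$(mktemp -d)
printf '%s\n' 'AB AB' 'AB' 'CD' > "$demo_dir/lines.txt"

# Without -c: prints every line that contains the string.
grep -F 'AB' "$demo_dir/lines.txt"

# With -c: prints the number of matching lines.
line_count=$(grep -F -c 'AB' "$demo_dir/lines.txt")

# -o prints each match on its own line, so wc -l counts occurrences;
# tr strips the padding that some wc implementations add.
occurrence_count=$(grep -F -o 'AB' "$demo_dir/lines.txt" | wc -l | tr -d ' ')

echo "matching lines: $line_count"
echo "occurrences:    $occurrence_count"

rm -r "$demo_dir"
```

For molecule files this distinction rarely matters, because each marker sits alone on its own line.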
Q: Why can we use the fgrep command with .sdf and .mol2 files?
A: Because files of this type are basically text files: they contain characters and strings, and the fancy file extension doesn’t matter much.
Q: Can we use the fgrep command on files of any other formats?
A: Yes, absolutely! For example, if you have a text file that contains multiple entities separated by the same character string (let’s say, a molecules.stuff file that separates molecules with ****), you could run the command in the same fashion:
fgrep -c '****' ./molecules.stuff
This prints the number of lines containing the **** string, which equals the number of separate entities provided each entity ends with the delimiter.
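The same idea can be wrapped in a small shell function. This is only a sketch: the function name count_records, the file molecules.stuff, and its **** delimiter are all hypothetical, and it assumes every entity is followed by a delimiter on its own line.

```shell
# count_records DELIMITER FILE
# Prints the number of lines in FILE that contain the fixed string DELIMITER.
count_records() {
  # -F: treat the delimiter as a literal string, not a regular expression.
  # --: end of options, so a delimiter starting with "-" is not read as a flag.
  grep -F -c -- "$1" "$2"
}

# Toy input: three entities, each followed by a **** line.
demo_dir=$(mktemp -d)
printf '%s\n' 'first' '****' 'second' '****' 'third' '****' > "$demo_dir/molecules.stuff"

entity_count=$(count_records '****' "$demo_dir/molecules.stuff")
echo "entities: $entity_count"

rm -r "$demo_dir"
```

Quoting the delimiter in single quotes matters here: unquoted, the shell would try to expand **** as a filename pattern before grep ever sees it.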
I hope that this quick hack will be useful for you. Cheers!