Unix: Data Mining from a Large Number of Files
A guide to extracting information of interest in data science.
In data science, sometimes you get lucky and all of the data you need is located in a single database or file. In reality, you will likely need to review multiple files. Doing this by hand quickly becomes tedious, especially if the single value you want is buried among many other lines across many files.
In this tutorial, I will go over how to accomplish this task using Unix shell commands. Unix is a useful way to approach this problem in data science because large data repositories, like the ones I use for my work, are often based on Linux.
While a user can type the commands out on the command line to accomplish this task, I prefer to use scripts because it keeps everything organized and speeds up debugging. This requires using the ‘sh’ (shell script) file format, demonstrated by a file such as addTwo.sh.
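The name suggests a script that adds two numbers; as a minimal sketch of the format (the body below is my own illustration, not code from the original):

#!/bin/sh
# addTwo.sh - print the sum of the two numbers passed as arguments
# usage: sh addTwo.sh 3 4
echo $(($1 + $2))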
Since we can expect to run the script multiple times as we get things working, and then more times as we process new groups of samples, I like to add the following line of code to start:
echo > output.txt
This creates a new text file called output.txt, or truncates an existing one, so each run starts with an effectively empty file. (Strictly speaking, echo with no arguments writes a single blank line.)
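A quick way to confirm this from the command line (strictly an aside; echo leaves a single newline, so the file is one byte rather than zero):

echo > output.txt
wc -c output.txt   # prints: 1 output.txt
: > output.txt     # alternative that leaves a truly empty file
wc -c output.txt   # prints: 0 output.txt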
We then want to iterate through all the files in a given folder. As in other programming languages, from R to Java to Python, we can do this using a for loop. Set this up using the following:
for I in $PWD/*.out
This code iterates through every file ending with “.out” in the current working directory and stores each file path in the “I” variable.
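To see the loop in action before adding any real logic, a throwaway version can simply print each matching file name:

for I in $PWD/*.out
do
    echo "Processing $I"
done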
Next, we set up the actions to perform on each file; in the shell this block is delineated by do…done rather than the { } used in other languages. Now that we are in the meat of the loop, I want to skip files that contain an error message, marking that they either lack the data I need or cannot be trusted. A user can do this with the following:
do
    # skip files flagged with the error message
    if grep -Fq "Cannot read file" "$I"
    then
        continue
    fi
    ...
done
The grep function searches through a file for a target string, “Cannot read file” in this case. Per the GNU documentation, the -F option treats the pattern as a fixed string rather than a regular expression, and -q suppresses all output so that grep communicates only through its exit status. If the pattern is present in the file, the loop continues to the next file without running the remaining commands. $I refers to the I variable, with the dollar sign indicating variable expansion; quoting it as "$I" guards against file names that contain spaces. If statements in the shell must end with fi.
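Because -q makes grep report only through its exit status, you can test this check on its own before wiring it into the loop (sample.out is a hypothetical file name):

# exit status 0 means the string was found
if grep -Fq "Cannot read file" sample.out
then
    echo "sample.out is flagged and would be skipped"
fi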
Next, I want to write the code for files that do contain the string of interest. I will again use the grep function, albeit with different follow-up operations (trimming the output and then appending a line with the returned value to the output file). A user can do this with the following:
grep "Current sample" $I |cut -c15- >> output.txt
This grep function does the same as the previous one; it searches for the pattern in the file stored in $I. The | notation is a pipe, like a pipeline in R: it takes the output of one command and immediately feeds it as input to the next. The cut function with -c15- keeps each line from the 15th character onward, dropping the first 14 characters (exactly the length of “Current sample”). The >> appends the result to output.txt as new lines.
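A quick illustration of the trimming, using a made-up sample line:

# "Current sample" is 14 characters, so -c15- drops exactly that prefix
echo "Current sample ID_042" | cut -c15-
# prints: " ID_042" (the space at character 15 is kept)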
In total, the code should look something like this:
#!/bin/sh
# start each run with a fresh output file
echo > output.txt
for I in $PWD/*.out
do
    # skip files flagged with the error message
    if grep -Fq "Cannot read file" "$I"
    then
        continue
    fi
    # pull the sample line, trim the first 14 characters, and append it
    grep "Current sample" "$I" | cut -c15- >> output.txt
done
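To run the finished script, save it under a name of your choice (extract.sh below is just a placeholder) and invoke it with sh; the collected values then sit in output.txt:

sh extract.sh
cat output.txt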