Unix: Data Mining from a Large Number of Files

Guide on extracting information of interest in data science.

Julian Willett, MD
Mar 3 · 3 min read

In data science, sometimes you get lucky and all of the data you need sits in a single database or file. More often, you need to review many files. Doing that by hand is tedious, especially when you want only one line of output from each of many long files.

In this tutorial, I will go over how to accomplish this task with a Unix shell script. The shell is a useful way to approach this problem in data science because large data repositories, like the ones I use for my work, are often hosted on Linux systems.

While a user can type commands directly on the command line to accomplish this task, I prefer to use scripts because they keep everything organized and speed up debugging. A shell script is a plain-text file of commands, conventionally saved with the ‘.sh’ extension, for example: addTwo.sh.
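As a rough illustration of what saving and running a script looks like, here is a minimal sketch. The file name demo.sh and its contents are hypothetical stand-ins invented for this example; they are not the contents of the addTwo.sh file mentioned above.

```shell
# Create a throwaway directory so the example leaves nothing behind.
workdir=$(mktemp -d)

# Write a tiny script to disk. demo.sh is a hypothetical name and body.
cat > "$workdir/demo.sh" <<'EOF'
#!/bin/sh
# Print the first argument plus two.
echo $(( $1 + 2 ))
EOF

# Run the script with sh and capture what it prints.
result=$(sh "$workdir/demo.sh" 5)
echo "$result"

rm -r "$workdir"
```

Running a script through sh like this does not require making the file executable first, which is convenient while you are still debugging it.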

Because we can expect to run the script many times, first while getting things working and later as we process new groups of samples, I like to start with the following line of code:

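: > output.txt

This creates a new, empty text file called output.txt, or empties the file if it already exists. The : command does nothing by itself; the redirection does the work. Writing echo > output.txt instead would leave a blank first line in the file, because echo prints a newline.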
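One detail is easy to check for yourself: redirecting echo into a file writes a single newline, while redirecting the no-op : command leaves the file truly empty. The sketch below compares the two in a temporary directory; the "stale result" line is made-up data standing in for a previous run.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Pretend a previous run left data behind (made-up content).
printf 'stale result\n' > output.txt

# Truncate with the no-op command: the file exists and holds zero bytes.
: > output.txt
empty_size=$(wc -c < output.txt | tr -d ' ')

# For comparison, echo writes one newline character.
echo > output.txt
echo_size=$(wc -c < output.txt | tr -d ' ')

echo "after ': >'    : $empty_size byte(s)"
echo "after 'echo >' : $echo_size byte(s)"
```

The difference matters later if another program reads output.txt and does not expect a blank first line.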

We then want to iterate through all the files in a given folder. As in other programming languages, from R to Java to Python, we can do this using a for loop. Set this up using the following:

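for I in "$PWD"/*.out

This code iterates through every file ending with “.out” in the current working directory and stores each file's path, in turn, in the variable “I”. The double quotes around $PWD keep the loop working when the directory name contains spaces.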
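To see the loop in action before adding any real work to it, here is a small sketch that runs in a temporary directory. The file names are hypothetical, and the loop body (between do and done) only prints each match. The [ -e ] guard is an addition of mine: when no file matches, the shell leaves the pattern as literal text, and the guard skips it.

```shell
# Set up a throwaway directory with hypothetical file names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.out b.out notes.txt

count=0
for I in "$PWD"/*.out
do
    # Skip the unexpanded pattern if no .out file exists.
    [ -e "$I" ] || continue
    # ${I##*/} strips the directory part, leaving just the file name.
    echo "found: ${I##*/}"
    count=$((count + 1))
done
echo "$count file(s) matched"
```

Only a.out and b.out are visited; notes.txt is ignored because it does not match the pattern.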

Next, we define the actions to perform on each file. In the shell, the loop body is delimited by do…done rather than the { } used in many other languages. Now that we are in the body of the loop, I want to skip files that contain an error message, which marks that they either lack the data I need or cannot be trusted. A user can do this with the following:

do
    if grep -Fq "Cannot read file" "$I"
    then
        continue
    fi
    ...
done

The grep command searches a file for a target pattern, “Cannot read file” in this case. Per the GNU documentation, the -F option treats the pattern as a fixed string rather than a regular expression, and -q suppresses all output, so grep reports whether it found a match only through its exit status. If the string is present in the file, continue skips the rest of the loop body and moves on to the next file. $I refers to the value of the variable I; the dollar sign is how the shell expands a variable, and wrapping it in double quotes, as in "$I", keeps file names that contain spaces intact. An if statement in the shell must end with fi.
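The skip logic can be tried on its own with two small test files. This is a sketch under assumptions: the file names and their contents are invented for the demonstration, and the loop only records which files were skipped and which were kept.

```shell
# Build two hypothetical input files in a throwaway directory.
workdir=$(mktemp -d)
printf 'Cannot read file\n' > "$workdir/bad.out"
printf 'Current sample S001\n' > "$workdir/good.out"

skipped=""
kept=""
for I in "$workdir"/*.out
do
    # grep -Fq prints nothing; its exit status drives the if.
    if grep -Fq "Cannot read file" "$I"
    then
        skipped="$skipped ${I##*/}"
        continue
    fi
    # Only reached for files without the error string.
    kept="$kept ${I##*/}"
done
echo "skipped:$skipped"
echo "kept:$kept"

rm -r "$workdir"
```

Because if tests the exit status of grep directly, there is no need to capture grep's output or compare it to anything.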

Next, I want to write the code for the files that contain the strings of interest. I will again use grep, this time followed by further steps: trimming the matched line and then appending the result to the output file. A user can do this with the following:

grep "Current sample" "$I" | cut -c15- >> output.txt

This grep call works like the previous one: it searches for the pattern in the file whose path is stored in $I, but without -q it prints each matching line. The | notation is a pipe, much like the pipe operator in R: it sends the output of one command straight into the next command as its input. The cut command with -c15- keeps character 15 through the end of each line, dropping the first 14 characters. The >> operator appends the result to output.txt rather than overwriting the file.
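It helps to check the character offset against a real line before trusting it. The sketch below uses a made-up line; the actual format of your files may differ, so treat the offset as something to verify rather than a fixed value.

```shell
# Hypothetical matching line; the real file format may differ.
line='Current sample S001'

# Keep character 15 through the end of the line.
trimmed=$(printf '%s\n' "$line" | cut -c15-)

# Brackets make any leading whitespace visible.
echo "[$trimmed]"
```

Here “Current sample” is 14 characters long, so character 15 is the space that follows it and the result keeps that leading space. Using -c16- instead would drop it.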

In total, the code should look something like this:

: > output.txt
for I in "$PWD"/*.out
do
    if grep -Fq "Cannot read file" "$I"
    then
        continue
    fi
    grep "Current sample" "$I" | cut -c15- >> output.txt
done
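To try the whole workflow end to end, the sketch below builds three small input files in a temporary directory and runs the same steps on them. The file names and contents are invented for the demonstration. It quotes the variables and truncates output.txt with : so that the result has no blank first line.

```shell
# Work in a throwaway directory with hypothetical input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'header\nCurrent sample S001\nfooter\n' > run1.out
printf 'Cannot read file\nCurrent sample S002\n' > run2.out
printf 'Current sample S003\n' > run3.out

# Start from an empty output file.
: > output.txt
for I in "$PWD"/*.out
do
    # Skip any file that reports a read error.
    if grep -Fq "Cannot read file" "$I"
    then
        continue
    fi
    # Keep the matching line from character 15 onward.
    grep "Current sample" "$I" | cut -c15- >> output.txt
done

result=$(cat output.txt)
printf '%s\n' "$result"
```

With these inputs, output.txt ends up with one line each for run1.out and run3.out, while run2.out is skipped because it contains the error string.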
