Filtering CSV files outside of a spreadsheet

Method 1: The Command Line

Alicia Williams
Jan 3, 2018

Spreadsheets are often my one-stop shop for viewing, editing, joining, filtering, preparing, and visualizing data. But importing large CSV files into products like Google Sheets and Excel has its limits. I recently hit one of those limits while preparing data for my Google Sheets + Cloud Natural Language demo.

The error that turned into an opportunity.

The reviews.csv file used in the demo contained over 100,000 rows of data. I needed only about 400 for my analysis, which I knew would easily fit in a Google Sheet. Luckily, this roadblock turned into some (forced) fun in learning a few new technologies.

For when you find yourself in the same boat, I’d like to share a few methods for filtering a CSV file outside of Google Sheets. I’ll show solutions tailored to my specific filtering need in this case: producing a CSV file that contains only the rows from reviews.csv with a listing_id (column 1) equal to “66288”. The first method, which I’ll outline in this post, uses command line tools.

The command line

As you likely know, you can use the command line to interact with and give commands to your computer using a text interface. While I tend to spend most of my day using the mouse to interact with applications and files, the command line definitely has its time and place! I didn’t originally think to use the command line for this problem, but after a colleague suggested that I “grep” the relevant rows, I immediately had to look it up.

In my search, I learned about two command line tools, grep and awk, that can quickly grab rows of a CSV that match certain criteria and create an output file for these rows. These tools are available on Mac and Linux. (There are similar tools that you can use on a Windows PC, which I plan to cover in a future post.) I work on Mac, so my screenshots will show the Mac environment. In either case, the first step is getting the command line opened and set up.

Getting set up

Using the command line can be daunting, so it can be helpful to use a starter’s tutorial such as Codecademy’s Learn the Command Line. I’ll also show you how to get started here, and you might find it’s not so hard to follow the same steps on your own computer.

First I’ll create a folder that contains the reviews.csv file. You can place it where you like, such as on your desktop or within your Documents folder.

Then, I will open the command line interface; it’s called Terminal.

When I open Terminal on my Mac, I see my last login time and a command line prompt, which by default lists the computer name, current directory, and my user name, followed by a dollar sign ($). The dollar sign indicates that it is ready to accept a command, which will be applied in the current directory. The ~ (tilde) you see in this example represents my home directory.

In my case, I need to make sure the current directory is the one that contains the reviews.csv file. There are lots of quick ways to get there, but for a new user I recommend using the commands pwd, ls, and cd.

Here’s the process, step by step. First, I check my current directory (using pwd), then I list the files and folders in my current directory using ls. I want to move down into my “Dev” folder, so I type cd Dev. Then I check again which files are available in this directory using ls. My reviews.csv file is in the “grep” folder, so I’m close. I use cd grep to move into the grep folder. One final ls confirms that the reviews.csv file is in my current directory.
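That walk can be sketched as commands, using a stand-in folder layout (Dev/grep is just my setup; substitute your own folders):

```shell
mkdir -p Dev/grep   # stand-in for my folder layout
pwd                 # print the current directory
ls                  # "Dev" now shows up in the listing
cd Dev/grep         # move down into the folder holding reviews.csv
pwd                 # confirms the new current directory
```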

This Macworld article is a concise and easy-to-understand resource if you are having trouble finding your directory.

Now that I’m in the correct directory, I can use grep or awk.

Using grep

grep is the simplest solution in my case. Because I need to filter based on values in the first column, listing_id, I can ask grep to grab each row that starts with “66288” followed by a comma (so I don’t also match IDs like 662880). I will use two commands to complete the operation:

  1. Use head to copy the header row from reviews.csv to a new output file, which I will name 66288.csv
  2. Use grep to search all rows from reviews.csv that start with “66288,” and copy them into the output file

Here’s what that looks like on the command line:

$ head -n 1 reviews.csv > 66288.csv
$ grep '^66288,' reviews.csv >> 66288.csv

Once I’ve run the two commands, I will see the output file, 66288.csv, in the same folder as reviews.csv. I’m curious how many rows made it into the output. From the command line, I can use the following two commands to count the number of rows in the input and output files and see how they differ.

$ wc -l reviews.csv
$ wc -l 66288.csv

In this case, the data went from 119,206 rows to 405 rows. I can import the CSV into Google Sheets now for further analysis.
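As an aside, the two steps can also be grouped into a single command that shares one output redirect. Here’s a self-contained sketch using a tiny made-up sample file (so it won’t touch your real data):

```shell
# Stand-in input file (hypothetical rows, same shape as reviews.csv)
printf 'listing_id,comments\n66288,Great stay\n12345,Too noisy\n' > sample.csv

# The { ...; } group runs both commands and sends all output to one file
{ head -n 1 sample.csv; grep '^66288,' sample.csv; } > filtered.csv
```

The braces tell the shell to treat the two commands as one unit, so a single `>` captures the header and the matching rows together.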

Want to adjust these commands for your own data?

  • Change the names of the origin and output files from reviews.csv and 66288.csv to the names of your own files.
  • Replace ^66288, with your own regular expression to tell grep what to search for and match on each input line. Here is a good resource with examples.
  • If you need to match data from a specific column beyond the first column, or extract rows based on more complex criteria such as a value greater than a certain number, consider using awk (see next section).
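To make those adjustments concrete, here’s a hedged sketch on a made-up sample file, filtering on a different (hypothetical) listing_id and using grep’s -c flag to preview how many rows would match before writing anything:

```shell
# Stand-in input file (hypothetical rows)
printf 'listing_id,comments\n42,Nice\n66288,Great\n42,Cozy\n' > sample.csv

grep -c '^42,' sample.csv     # count matching rows first; prints 2
head -n 1 sample.csv > 42.csv
grep '^42,' sample.csv >> 42.csv
```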

Using awk

awk requires a little more work, but it is a more robust tool that can understand and search for matches within specific columns of data. After telling awk the column delimiter (a comma, in the case of a CSV), I can specify a regular expression to match within a specific column. I will use two commands for this operation:

  1. Use head to copy the header row from reviews.csv to a new output file, which I will name 66288_awk.csv
  2. Use awk to search all rows from reviews.csv, test for “66288” in the first column, and copy the matching rows into 66288_awk.csv

Here’s what that looks like on the command line:

$ head -n 1 reviews.csv > 66288_awk.csv
$ awk -F, '$1 ~ /^66288$/ {print}' reviews.csv >> 66288_awk.csv

Just like with grep, once I’ve run the two commands, I will see the output file, 66288_awk.csv, in the same folder as reviews.csv. I can count the number of rows in the input and output files:

$ wc -l reviews.csv
$ wc -l 66288_awk.csv

I see the same row counts as I did with grep (119,206 rows input to 405 rows output). And finally, I can import 66288_awk.csv into Google Sheets for further analysis.
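Since awk evaluates a condition against every line, the two steps can also be collapsed into one command: keep line 1 (the header) or any line whose first field matches. A self-contained sketch on a made-up sample file (an equivalent variant, not the exact commands above):

```shell
# Stand-in input file (hypothetical rows)
printf 'listing_id,comments\n66288,Great\n12345,Noisy\n66288,Cozy\n' > sample.csv

# NR == 1 keeps the header row; $1 == "66288" keeps matching rows
awk -F, 'NR == 1 || $1 == "66288"' sample.csv > filtered.csv
```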

Want to adjust these commands for your own data?

  • Change the names of the origin and output files from reviews.csv and 66288_awk.csv to the names of your own files.
  • Replace $1 with the column number you would like awk to test against on each row.
  • Replace ^66288$ with your own regular expression to tell awk what to test for and match on each input line. Here is a good resource with examples.
  • If your criteria are more complex and require matching multiple columns and/or comparing values, you can find resources online that give examples for these cases. I’ll also be publishing two additional blog posts in the coming weeks that show alternative tools for filtering on complex criteria.
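As a small taste of those more complex cases, here’s a hedged sketch on a made-up sample file with a hypothetical numeric score column: keep only rows that match on column 1 and have a score of at least 4.

```shell
# Stand-in input with a hypothetical numeric "score" column
printf 'listing_id,comments,score\n66288,Great,5\n66288,Meh,2\n12345,Nice,5\n' > sample.csv

# Combine tests with &&: column 1 equals 66288 AND column 3 is >= 4
awk -F, '$1 == "66288" && $3 >= 4' sample.csv > filtered.csv
```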

Resources for learning more

For more information about grep and awk, here are some resources I found useful on my learning journey:

Stay tuned for my follow-up blog posts on how to use two more tools, BigQuery and Trifacta Wrangler, to filter CSV files outside Google Sheets.


Alicia Williams

Google Cloud Developer Relations. Learning how to do cool things with @GoogleCloudTech and @GoogleWorkspace. Opinions are my own, not those of my company.