Filtering CSV files outside of a spreadsheet

Method 2: Trifacta Wrangler

Alicia Williams
5 min readJan 11, 2018

When I signed off at the end of Part 1 of this series, we had just filtered a CSV file (reviews.csv) using grep and awk from the command line. The original file had over 100,000 rows and was too large to import into Google Sheets. Having identified the specific data we were trying to analyze (only rows with a listing_id of “66288”), we used command line tools to output a new CSV file that was ready to be imported into Google Sheets.

The first method came with a few costs: time spent becoming comfortable with the command line, learning grep and awk, deciphering regular expressions, and ultimately, wondering how easy it would be to filter on different and more complex criteria.

Today, I’ll show you how to use Trifacta Wrangler, a free application that helps users to explore, understand, and prepare data. The first time using the application, I found the experience intuitive and accessible, even without any specific technical expertise. Let’s get started!

No coding experience required!

Download and install the software

After entering your name, phone number, and email to register, you can download the free version of Trifacta Wrangler (Mac or Windows) to your desktop.

Create your first flow and import your data

Once you’ve installed and opened the application, the first window will direct you to create your first “flow”. A flow is similar to a project, a sort of container that holds all of the datasets involved in the data preparation as well as the recipes you build to clean and transform the data. After creating and naming the flow, you can import reviews.csv by simply dragging and dropping the file onto the application window.

Create a recipe to filter on listing_id column (and more!)

A recipe refers to the series of steps that you define to clean and transform your data. Trifacta Wrangler can help you build this recipe through several tools, including a transform builder and suggested transformations that appear as you highlight different columns of data. For this exercise, you will use the transformation options in the column menu to keep rows in which the listing_id is 66288. The animation below shows exactly how to make that happen.

At this point the data grid will preview the results of the current recipe. This preview is showing “No valid values”. Say what?!

This is due to the way Trifacta Wrangler samples the data. Click “Initial Sample” in the top left corner of the application window, and you will see that it is using the first 26,180 rows as sample data.

The file, reviews.csv, has over 100,000 rows, and I suspect that it happens that entries for listing_id 66288 do not appear in the first 26,180 rows. Since the free version does not provide alternate sampling options, you will have to wait until you generate the results to see if the recipe is working correctly. But first, let’s add one more transformation.

When reviewing the data, I noticed that the comments column had an inconsistent smattering of quotes encasing the values. Specifically, quotes only surrounded those fields that contain a comma.

Exhibit A: inconsistent leading and trailing quotes

Normally, I might turn a blind eye and hope the less-than-desirable formatting affected neither my analysis nor the potential reporting and visualization. But Trifacta Wrangler makes these sorts of transformations very easy. On the same menu under the comments column, there is a formatting option to “Trim leading and trailing quotes”. Done in one click!

Generate the results

Now you have two steps added to your recipe, which are viewable on the right side of the application window.

The final step is to ask Trifacta Wrangler to apply the recipe to the entire dataset, and generate the results. You can kick this off by clicking “Generate results” in the upper right corner of the application window. A pop-up window will ask you to confirm the output format(s) (in this case, a CSV). After the results have been generated, click the “Summary” button. On this page you can see the number of resulting rows (404, which is same as with Method 1 once you add in the header row) as well as open the resulting CSV file (click “Open Results” and a Finder window brings you directly to where the output file is stored on your computer).

Resources for learning more

Trifacta Wrangler proved to be a capable and easy-to-use tool for the above exercise. However, this post barely scratched the surface of what the tool can do to clean and enrich your data! Here are some ideas to get a better understanding of the possibilities:

Google Cloud Dataprep provides functionality similar to Trifacta Wrangler, plus scalability in the cloud and integrations with data in Cloud Storage and BigQuery. Dataprep has a user interface that is much like Trifacta Wrangler’s (it was built as a partnership between Google Cloud Platform and Trifacta).

--

--

Alicia Williams

Google Cloud Developer Relations. Learning how to do cool things with @GoogleCloudTech and @GoogleWorkspace. Opinions are my own, not that of my company.