Filtering CSV files outside of a spreadsheet

Method 3: BigQuery

Alicia Williams
5 min readJan 22, 2018

This third and final post of the series once again tackles the task of filtering a CSV file outside of a spreadsheet. A quick recap: while preparing data for a demo, I was unable to import the source data file into Google Sheets due to its size. Knowing that I was targeting a small subset of the data for analysis (rows with a listing_id of 66288), I started exploring options for filtering the data prior to importing it into Google Sheets. The first method used command line tools and the second method used a desktop application called Trifacta Wrangler. In this post, I will be showing the tool that I ultimately chose to use to quickly explore, filter, and export the data. It’s called BigQuery.

BigQuery, part of the Google Cloud Platform, is a fully-managed data warehouse where users can store and analyze data using SQL. Its claim to fame is its ability to process large amounts of data, fast. But BigQuery is fast and easy to use on data of any size. BigQuery has a free tier for up to 1TB of data analyzed each month and 10GB of data stored.

Sign up for the GCP free trial

The easiest way to get your first BigQuery project up and running is to follow the free trial sign-up at cloud.google.com/free. While BigQuery has a free tier, users are required to have a billing account enabled (part of the free trial process) in order to store their own tables (free for first 10GB).

Create a BigQuery project

After you’ve signed up, navigate to bigquery.cloud.google.com. If you signed up for the free trial, you will already see “My First Project” in the left navigation. Projects are the top-level containers that store billing information and access control.

Create a new dataset

You will need to create a new dataset in order to upload a table. I think of datasets as subprojects that allow you to organize tables and provide finer-grained access control. Click the down arrow icon next to “My First Project”, choose “Create new dataset”, and give your dataset a name. I named mine “Sheets_NaturalLanguage” since I’ll be feeding the results into a demo showing how to connect to Cloud Natural Language from Google Sheets.

Add the CSV file to Google Drive and load it into BigQuery

The integration between BigQuery and Google Drive makes it easy to create tables directly from files in your Google Drive. Navigate to drive.google.com and add the reviews.csv file by dragging and dropping it onto your Google Drive. Next, right-click the file to get the shareable link.

Back in the BigQuery console, click the down arrow icon next to your dataset name and choose “Create New Table”

The “Create Table” menu will appear to the right. Choose “Google Drive” as the source data location, include the shareable link you obtained in the last section, and choose CSV for the file format. Give the table the name “reviews” and ask BigQuery to automatically detect the schema. In the “Options” section, you will need to tell BigQuery that the CSV file has one header row.

After clicking “Create Table”, give BigQuery permission to access your Drive, and the new table will be populated.

Query the table and output results to Google Sheets

Once the table is loaded, you will see the “Table Details” screen, which includes a button that directly links you to the query interface.

In the query interface, you will now write a simple query to filter only the records where listing_id equals “66288”. Here’s what it looks like:

#standardSQL
SELECT
*
FROM
`striking-decker-187523.Sheets_NaturalLanguage.reviews`
WHERE
listing_id = 66288

BigQuery defaults to using “BigQuery legacy SQL”, which is a non-standard SQL dialect. In order to enable standard SQL, add the #standardsql prefix to your query. You will also need to adjust the table name from the legacy notation to standard notation; in my case I changed [striking-decker-187523:Sheets_NaturalLanguage.reviews] to `striking-decker-187523.Sheets_NaturalLanguage.reviews`.

Finally, run your query by clicking the big, red “RUN QUERY” button.

Once the query completes, you can load the results into Google Sheets with one click by using the “Save to Google Sheets” button. You’ll find 404 resulting rows (plus the header row) which is exactly in line with our results from method #1 and method #2. Now you’re ready to perform additional analysis in the comfort of a spreadsheet!

Resources for learning more

The query in this case was very simple, but BigQuery can filter data based on complex criteria, join data from across multiple CSV files, or both! What you might try next depends on your level of proficiency in SQL:

  • If you are already familiar with SQL: go forth and query to your heart’s delight.
  • If you are looking to get more comfortable with BigQuery and/or SQL: start exploring the BigQuery public datasets. You can query up to a terabyte of data for free each month, and documentation for each dataset includes example queries.
  • If you aren’t familiar with SQL: start using a SQL-like language in Google Sheets with the QUERY function. Get started by following along Ben Collins’ blog post on this function. I learned QUERY and then transitioned to SQL and BigQuery using SQL in 10 Minutes and Codecademy’s Learn SQL course.

If you have any questions, leave a comment or find me on Twitter @presactlyalicia.

--

--

Alicia Williams

Google Cloud Developer Relations. Learning how to do cool things with @GoogleCloudTech and @GoogleWorkspace. Opinions are my own, not that of my company.