Google Refine

Tow Center
Tow Center
Published in
5 min readFeb 21, 2012

Google Refine is a free tool designed, as the tagline puts it, for “dealing with messy data.” Although Refine opens in a browser window, it is a desktop install and can therefore be used even when no internet connection is available. It is extremely useful for quickly transforming, assessing, and cleaning data from a variety of source formats. While its functions for joining, splitting, and summing data are more involved than other spreadsheet programs, its automatic macro-writing feature makes developing these a worthwhile investment in situations where these operations are needed on a regular basis (such as with a weekly or monthly data release).

High-level analysis
Google Refine is especially good at providing quick, top-line overviews of imported data. For example, it provides an automatic row-count when data is loaded, while the “faceting” does the same for each unique value that appears in a given column. The summary window that this generates can be sorted both alphabetically and by frequency, making small variations or errors in the data easier to identify and update. Finally, selecting one or more of these values filters the dataset to contain only matching rows. These filters can be applied across many columns simultaneously, allowing for quick drilldown to specific value intersections. Any and all of these filters can be removed by closing the corresponding facet summary window.

Format conversion
Google Refine is also useful for transforming data formats. For example, it can automatically parse well-formed XML or JSON into spreadsheet format, an otherwise difficult and time-consuming manual or programmatic task.

Recording actions on the data
As mentioned above, Google Refine also keeps a step-by-step record of any transformative data manipulation (such as merging, splitting, removing or renaming columns), which can viewed and stepped through at any time using the Undo/Redo tab. Although these manipulations are more complex to execute in Refine when compared to other spreadsheet programs, the resulting commands can be applied wholesale to another file of the same structure. This is particularly useful when certain calculations or manipulations must be made on a regular basis, as with monthly economic data and the like.

Google Refine Screenshot

The main interface of Google Refine is essentially a spreadsheet layout. The small arrows to the left of each column header provide access to the faceting, sorting, splitting and other features, while the Facet/Filter and Undo/Redo tabs appear next to the current row count.

Google Refine screenshot 2

Unlike most spreadsheet programs, however, a maximum of 50 rows can be viewed at any given time, and beyond this are paginated at the upper right. At any time, the currently selected rows can be exported as a .csv or other format.

After downloading and installing Google Refine, double-click on the program’s blue diamond icon — it will open a new browser window or tab (on a PC, a small, black window may also appear — you can minimize, but do not close, this window).

Loading & parsing data
Select the file you want to open and click Next. Refine will guess the file type you have selected and offer a visual preview in the top half of the contents in the top half of the window. In the case of non-spreadsheet formats, such as XML and JSON, you simply need to highlight with your mouse the first element you want converted into a row - it will then show you a preview of the resulting file.

By using the Pick Record Elements and Update Preview buttons at middle right, you can experiment with your selection until it's correct. Then simply click Create Project at the upper right.

Faceting

Select the arrow to the right of the column you want to facet and choose Facet>>Text facet from the dropdown. The resulting summary will be shown at right. By selecting one of the values, the dataset will be filtered to show only those containing that value.

To add values to the selection, roll over them and choose “Include” to the right of its value.

The latest version of Google Refine can be downloaded for free for all platforms at code.google.com. As noted earlier, though the interface operates through a browser window, Refine does not need a web connection to operate, nor does it transmit information to a remote server by default.

While its tools for quickly “faceting” and cleaning data are extremely useful, there are some standard spreadsheet operations (such as concatenating columns and calculating derived values) that are more complex to execute in Google Refine than in programs like OpenOffice or Microsoft Excel, because it uses the custom GREL expression language. However, for complex analyses that must be performed on a regular basis, developing them in Google Refine is often more efficient in the long term, as the macros can be stored separately and so easily reapplied.

Several in-program and online tutorials have are available at Google Refine’s project home page. The downloads page also includes a Getting Started guide and FAQ section. Additional tutorials will also be available on this site in the near future.

--

--

Tow Center
Tow Center

Center for Digital Journalism at Columbia Graduate School of Journalism