OpenRefine — The power tool

Ekta Mishra
Published in Code for Cause
4 min read · May 27, 2020

OpenRefine, formerly known as Google Refine, is open source software for working with messy data. It provides features for data cleaning, data processing, data transformation, reconciliation, uploading data to Wikidata, and support for other external web services.

With OpenRefine, you can inspect your data for errors, correct them, and keep a history of every change. Better still, you can undo your actions back to any earlier stage. So there is no need to worry about whether you should apply a certain operation: you can always try it and see how it affects the dataset.

Many of you may have already used OpenRefine for data analysis and for linking your data to the web. Still, some of you might not have a detailed picture of what OpenRefine actually does. In this introductory article, we'll talk about its applications, how it works, how to get started with it, and how to contribute to it.

To use OpenRefine, you can either download the latest version from the official website or build it from the source code available on its GitHub page.

Applications:

OpenRefine gives you an interface for creating a project from CSV, XML, Excel (.xls and .xlsx), JSON, and Google data documents. You can also create a project by providing a valid URL from which the data is downloaded. Once the project is created, you can use facets to inspect the nature of the data, for example to find out whether there are null values, duplicate values, and so on. With a text facet you can also group the cells of a column by value and give every variant of the same entry exactly the same value, which makes further processing easier.
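To make the idea of a text facet concrete, here is a minimal Python sketch of what it computes: a count of each distinct value in a column, plus the number of blank cells. This is not OpenRefine's code; the `text_facet` function, the `city` column, and the sample rows are all hypothetical.

```python
from collections import Counter


def text_facet(rows, column):
    """Count each distinct value in one column; None and empty strings count as blanks."""
    counts = Counter()
    blanks = 0
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            blanks += 1
        else:
            counts[str(value)] += 1
    return counts, blanks


# Hypothetical sample rows, as if loaded from a CSV file.
rows = [
    {"city": "Delhi"},
    {"city": "delhi"},
    {"city": "Delhi "},
    {"city": ""},
    {"city": "Mumbai"},
    {"city": "Delhi"},
]

counts, blanks = text_facet(rows, "city")
for value, n in counts.most_common():
    print(repr(value), n)
print("blank cells:", blanks)
```

Seeing "Delhi", "delhi", and "Delhi " listed as separate values is exactly the kind of inconsistency a facet makes visible, so you can edit them to one spelling.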

OpenRefine can also merge groups of near-identical values without you having to visit each one manually. The clustering feature groups values that are likely to be different spellings of the same thing. Operations like these present the data in a form that lets you quickly judge its nature. Done carefully, they also help a great deal with reconciling the data, which we'll talk about later in this article.
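One common way to cluster values is key collision: reduce every value to a normalized key and group the values whose keys collide. The Python sketch below is loosely modeled on that "fingerprint" idea. It is an approximation for illustration only; OpenRefine's own clustering methods may differ in detail, and the `fingerprint` and `cluster` helpers and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: ASCII-fold, trim, lowercase, drop punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")  # drops accents and other non-ASCII marks
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint and keep only groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical column values with inconsistent spellings.
values = ["New Delhi", "new delhi", "Delhi, New", "New  Delhi.", "Mumbai", "Mumbaí", "Pune"]

clusters = cluster(values)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single value; a person still decides whether the merge is correct.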

You can also add columns based on existing columns or on data fetched from URLs. You can write expressions in GREL (General Refine Expression Language) to transform the values of a column to suit your use case. GREL is to OpenRefine roughly what formulas are to Excel and queries are to SQL.

For instance, suppose you have data on branch post offices in which a single column holds each office's name together with its pincode, and you want the names and pincodes in separate columns. Using GREL expressions, you can take substrings of the values in that column and generate two new columns, which makes the data more readable. Beyond this, GREL expressions give you a lot of power to work with your data.
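The post office example can be sketched in Python. This is an analogue of the transformation, not GREL itself. It assumes, hypothetically, that each cell holds an office name followed by a six-digit pincode; the `split_office` helper and the sample cells are made up for illustration.

```python
import re

# Assumed cell layout: office name, optional separator, six-digit pincode at the end.
PATTERN = re.compile(r"^(?P<name>.*?)[\s,-]*(?P<pincode>\d{6})\s*$")


def split_office(cell):
    """Return (name, pincode); pincode is None when the cell does not end in six digits."""
    match = PATTERN.match(cell)
    if match is None:
        return cell.strip(), None
    return match.group("name").strip(), match.group("pincode")


# Hypothetical cells from the combined column.
cells = ["Rampur B.O 110001", "Sadar Bazar - 282001", "No Pincode Here"]

for cell in cells:
    name, pincode = split_office(cell)
    print(name, "|", pincode)
```

Cells that do not fit the assumed layout are returned unchanged with no pincode, so they are easy to find and fix by hand.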

Public and open datasets are often inconsistent and messy. Sometimes they contain values and columns that you don't need for your use case, because public open databases are generally published in a form meant to serve a larger audience. Using OpenRefine, you can clean such datasets to fit your needs.

That covers the analysis side of working with a dataset. You can also link your dataset to the web in just a few steps. OpenRefine supports many reconciliation web services. Reconciliation is the process of matching your dataset's values to entries in an open database such as Wikidata; it semi-automates the matching of data in OpenRefine fields with more authoritative data in external sources. Wikidata (wikidata.org) is a free, open database (one of the Wikimedia Foundation's projects) that anyone with a Wikidata account can edit. Once you are done reconciling the data, you create a valid Wikidata schema to upload your edits to Wikidata.
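The "semi-automated" part of reconciliation can be illustrated with a small local stand-in. The Python sketch below does not call Wikidata or any reconciliation service; it simply scores a value against a tiny hypothetical authority list using string similarity from the standard library. The identifiers, labels, and both thresholds are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical authority records: (identifier, label). These are not real Wikidata IDs.
AUTHORITY = [
    ("X1", "University of Delhi"),
    ("X2", "University of Mumbai"),
]


def reconcile(value, authority, auto_match=0.9, min_score=0.5):
    """Score a value against every authority label and decide how much human review it needs."""
    scored = []
    for ident, label in authority:
        score = SequenceMatcher(None, value.lower(), label.lower()).ratio()
        if score >= min_score:
            scored.append((score, ident, label))
    scored.sort(reverse=True)  # best candidate first
    if scored and scored[0][0] >= auto_match:
        return {"status": "matched", "match": scored[0][1], "candidates": scored}
    status = "needs review" if scored else "no match"
    return {"status": status, "match": None, "candidates": scored}


for value in ["University of Delhi", "Univ of Delhi", "qqq"]:
    result = reconcile(value, AUTHORITY)
    print(value, "->", result["status"], result["match"])
```

High-scoring values are matched automatically, middling ones are left for a person to confirm, and the rest stay unmatched, which mirrors the division of labor described above.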

Thank You

Ekta Mishra
Code for Cause

Software Engineer @PhonePe | Former @RedHat'21 & Outreachy’20 intern @OpenRefine | Google Code-In’19 Mentor @JBoss | Teaching Assistant @Coding Blocks