Google sheets: Basic scraping, sorting and cleaning data.

Google sheets is a great starting point for exploring many of the basic processes of data journalism.

20 min readJul 16, 2021

In this tutorial, I want to work through an end-to-end process of scraping data from a page, organising it and cleaning it up so we have some structured data to analyse.

We’ll use a couple of formulas and some standard spreadsheet tools that make up the backbone for grabbing content from a web page. It’s a bit long. But I wanted to try and put something together that was detailed enough to follow from start to finish but also include some useful asides.

So we’ll look at a couple of formulas:

=importHTML()=Split()

We’ll also use the standard find and replace function to tidy up the data and explore how regular expressions can help add rocket fuel to the process.

Nothing extra added. Just Google Sheets.

Hopefully, by the end, you’ll see how the process could be adapted to work more generally.

The setup

Let’s imagine a scenario.

This list of deaths in immigration custody contains loads of useable data but its ‘trapped’ in a list

Lets’s say I’ve been researching stuff around immigration centres and came across a Wikipedia entry for UK detention centres. As part of the page, I found a list of Deaths in immigration custody which contains some interesting content that I might want to use later or build on as I do more research.

The individual elements look useful — age for example might be helpful in getting an idea of any pattern. But all the content is stored as a list and clumped together as blocks of text. Breaking down that list into its constituent parts and putting them in a spreadsheet means I can organise it and possibly add to it as my research develops. Putting it into a spreadsheet also means I can analyse it using the common tools in a spreadsheet e.g. using a pivot table.

We’re adding some structure to the data.

To get that, there are a few steps I need to go through:

I need to get that content into a spreadsheet
I need to organise that content into something a bit more structured and consistent
Save the content in a format that I can return to later

Scraping a page

The first thing I’m going to do is grab that list from Wikipedia.

Now I could copy and paste and see what happens. It’s a gamble — in this case, it works — but it may not always be the case. Website code can often hold on to content pretty tightly and we can often bring a lot of extra stuff with content when we cut and paste. You may have experienced that with all the formatting that carries over when you cut and paste text

So a little more reliable process would be good. So here’s our first bit of code in the form of a formula.

=importHTML()

It does pretty much what you’d expect. It imports HTML from a site based on some parameters

=importHTML( the link, the HTML element we want -a list or a table, if there’s more than one list of table on the page which occurrence of that element )

In a new spreadsheet add the following formula to A1

=IMPORTHTML(“https://en.wikipedia.org/wiki/Immigration_detention_in_the_United_Kingdom#Removal_centres","List", 5)

In this formula, we’re asking Sheets to go to the Wikipedia page

=IMPORTHTML(“https://en.wikipedia.org/wiki/Immigration_detention_in_the_United_Kingdom#Removal_centres”

Then get the HTML and look for the specific HTML code for a list. The formula only does Lists and Tables.

,”List”

Then grab the contents of the fifth list on the page. A little trial and error adding numbers got me to five. You just need to tweak the number and compare the results to the page until you get what you’re looking for.

, 5)

And here is the result.

You can tweak the number in the formula to ‘search’ for lists on the page.

A little tip. If you double-click between columns A and B the column expands out to show the contents of each cell.

Double-click between column headers to expand the cells to fit the content

Capture the results

The next step is to think about how we can organise that content.

But before we do, I’m going to copy the results and add them to a new sheet.

Copying the values of a formula into a new sheet ensures we capture the actual values rather than simply the result of the formula which are often dynamic and not visible to a lot of the tools in a spreadsheet

Copy the columns with all the results in and the easiest thing to do is just select the whole sheet
Open a new sheet by clicking the + tab
Select Edit > Paste Special > Paste Values Only

This might seem like extra work. But because the original results were created using a formula, the content is dynamic — it can change as the formula changes and some formulas and spreadsheet functions can’t see the content in the same way it would see plain text and numbers. So it’s hit-and-miss as to what we can do with the results.

Doing a paste values only in a new sheet converts the results into ‘normal’ text that we can work with more reliably but it allows us to keep our original source material

So now I can think about organising that content.

Breaking down the details with split()

A look through the results and the page itself shows the content follows this structure.

Name | Cause of death | Age | Date of Death | Location.

But it’s by no means consistent. The first two rows, for example, are missing data.

But it seems to be the dominant structure. So the next step is to see if we can break down each result into those blocks and create separate columns to organise things.

Again we could do a cut and paste across columns. But that’s time-consuming — even for this relatively small number of rows. I want something to automate the process.

To do that we’ll use another formula

=split(CONTENT,DELIMITER,split_by_each,remove_empty_text).

As the name suggests the formula works by splitting the content of a cell-based on a specific character or characters, what’s called a delimiter.

Here’s an example:

Some text, separated by a delimiter, can be separated using the split() formula

If we use the comma (,) as the delimiter we could break that down into three blocks

The split formula breaks content up based on a character or characters

Some textseparated by a delimitercan be separated using the split() formula

Obviously, the key is to find a consistent delimiter. So let’s take one entry from the beginning, middle and end as a sample of our results.

*Siho Iyiguveni* — *8 October 1989 — Harmondsworth Detention Centre*[33]*Rene Frings* — Heart disease, aged 44, *23 November 2011 — HMP Wormwood Scrubs*[36]*Carlington Spencer* — Stroke, aged 27, *3 October 2017 — IRC Morton Hall Lincoln*[72]

There are some possible candidates.

The comma (,) separates the content as the name and cause of death; age; the date and location of death.
The asterisk(*) separates the content as name; cause of death, age; date and location
The hyphen (-) separates the content as the name; cause of death, age and date of death; location.

So, there isn’t a delimiter that clearly separates all the individual elements.

But out of those three, the asterisk seems like the most consistent across all the elements and a good place to start. I can break them down into sub-chunks and then look at them independently.

The asterisk is also a good delimiter to use as it also separates these square brackets. These are Wikipedia footnotes that we don’t need.

In the new sheet we created add the formula =split(A1,”*”, TRUE, TRUE)to the Cell B2

The formula takes the content of A1 and splits it up:

=split(A1

Using the asterisk (*) as the delimiter:

,”*”

The two final elements set the parameters for the two options in the formula split_by_each and remove_empty_text.

, TRUE, TRUE)

Adding the split formula helps break up the content.

Split() options

It’s worth diving into the split options a little more. Setting the split_by_each option to TRUE we tell the formula to split around each delimiter. We’ve only got one in this formula — the *

But you can have more than one, so we need to have a little control over what happens when the formula finds each one — that’s what split_by_each does

Let’s take this as an example.

=split(“Maybe we could build a fire and sing a couple of songs, huh? Why don’t we try that?!”, “,?”, TRUE, TRUE)

If I run that formula, I get the result

Maybe we could build a fire and sing a couple of songshuhWhy don’t we try that!

You can see the delimiters, the comma and question mark have been removed and the text split at the point that the formula found a comma OR a question mark.

If I do the same thing again but with the split_by_each set to FALSE, the split doesn’t work. Setting it to FALSE effectively tells the formula to look for ,? together.

=split(“Maybe we could build a fire and sing a couple of songs, huh? Why don’t we try that?!”, “,?”, FALSE, TRUE)

The split_by_each parameter is actually set to True by default so technically we don’t need to include the statement. And we only have one delimiter, the asterisk, so the default is fine. But we’ll see later that changing this setting can be useful so better we know why it's there.

What about theremove_empty_textparameter. This attribute removes any empty results from the split.

Take the example below. If we have the formula

=split(“Text * split by asterisk * but ** there is a gap*”,”*”, TRUE, TRUE)

We get the following

Textsplit by asteriskbutthere is a gap

The formula sees that there is nothing between the two asterisks after BUT and ‘removes’ it.

But set the remove_empty_text attribute to False

=split(“Text * split by asterisk * but ** there is a gap*”,”*”, TRUE, FALSE)

And we get

Textsplit by asteriskbutthere is a gap

You can see that the formula keeps the gap.

There are times that this would make sense to keep the gap.

Looking through the data we can assume that the two commas after Paul in that second row show an empty place in Lane Two rather than no content. So setting the remove_empty_text attribute to False means we can maintain the structure of the data.

This can be important stuff more generally when we’re working with content that uses delimiters like commas to denote a more formal structure — like a csv data file.

When working with CODE, no content doesn’t always mean no content!

But we digress. For now, we can go with

=split(A1,”*”, TRUE, TRUE)

We can drag the Fill Handle down to the end of the results to copy the formula.

You can use the drag handle to copy a formula down the sheet

You can see that new columns are created. There are some gaps but the general structure remains:

Name | Cause of death and age | Date and location of death | Wikipedia footnote

And even where there are gaps we can see the content lines up. You can see that all the asterisks have also disappeared.

That’s a part of the way the SPLIT formula works, the delimiters are always removed. Great news for us — we don’t need them — But it’s worth remembering for later.

A quick scan of the results shows there’s one that hasn’t split.There is an asterisk missing before the date of death (24th Jan).

*Robertus Grabys* — Suicide by hanging, aged 49 24 January 2000 — Harmondsworth Detention Centre[35]

It’s an early indicator that this data may not be as ‘clean’ as we would like. In other words, the formatting and content are not consistent. We can easily solve this by quickly adding the * to bring it in line with the others.

Hand editing results like this might seem daunting when you have a very large data set. But the truth is, it may come to that. You may have to go in and pick out the stray ‘dirty data’.

For now, we’ll correct errors as we see them and get the content sorted as much as we can which should hopefully make it easier to find and clean data as we go and minimise the hassle.

So, let’s look at breaking down some of those subgroups. Let’s start with the cause of death and age and explore another set of random results from Column C

– Suicide by hanging, aged 26,– Suspected heart attack, aged 52,– Murdered, aged 49,

More splits

So we are looking to use our split function again and there are a couple of possible delimiters here we could pick. The commas are still there separating the elements, So I could try that

I’m using the next free column for the formula because the Split() is pretty greedy. It will take as many columns as it needs to do the split and display the results. But if there’s any content in the way, it throws up an error. So at this stage, I tend to build up the content column by column with a view to sorting it later if required.

So the next free column for me column F and I’m going to add.

=SPLIT(C1,”,”)

Take the content of C1 (the newly filtered result)

=SPLIT(C1

And split the contents on a comma.

,”,”)

The split formula isn’t picky when it comes to using the comma as a delimiter.

On first inspection, that looks good.

But a quick scan of the columns shows that one entry has one more column than the others

– Schizophreniahypertensionaged 35

If we look at the original entry, we can see this particular case has two issues listed as the cause of death separated by a comma. But the split formula simply splits on the comma. It can’t differentiate the context.

As I noted earlier, we could correct the anomaly. in a small data set that’s not too much of a problem. But this suggests a pattern that in a large data set might be more common — there may be more than one death with multiple factors. It highlights why a human, journalistic eye, is key here. We know the context of the data. So, better to find a more reliable delimiter.

Remember a delimiter doesn’t have to be one character. It can be more than one — a string of characters. Or put another way ‘a word’. So we can use that by changing the formula in F1 to the following and doing the copy down in all the cells in F.

=SPLIT(C1,”aged”, FALSE)

There’s a little more going on here than with the other uses of split.

First, we say split the contents of C1.

=SPLIT(C1

Based on the delimiter “aged”

,”aged”

Now you might remember that earlier, I detoured a little to look at how setting split_by_each changed how the formula treated the delimiter.

So If we leave the split_by_each as the default TRUE, each letter of aged would be treated as a separate delimiter

- Collapsed while running, aged 34,

Would break down the text anytime there was an a, g, e or d.

- Collpswhilrunnin,34,

So we set the split_by_each element to FALSE so it’s treated as one word.

Using the split_by_each set to False means we can split by a word.

We don’t need to add the remove_empty_text value here as it’s set to TRUE by default.

Avoiding deleting content with split

You’ll notice that, as per the normal behaviour, the formula removed the delimiter and aged has been deleted. In this case, that is ok. We would only want the age expressed as a number anyway. But it’s a reminder to be careful when using strings to split content. We need to be careful that the string isn’t included in other areas. eg. what would have happened if one of the rows had the phrase managed in it somewhere?

We need to look for cells where that pattern appears and check for outliers. Luckily there’s a quick way we can do that using conditional formatting

Select the whole table
Select Format > Conditional formatting
Set the Format Rules to Text Contains
Add aged as the pattern
Set the Formatting style as a colour that will stand out.

Conditional formatting helps identify data elements

In this case, we would expect the Cause of death and age column to change colour. But in a large data set, we could also look for stray cells of that colour.

More ‘dirty’ data

So the final block to break up here is the date of death and location.

That content seems to be separated by a dash. So we can use our Split() formula again to give us the two columns of date and location.

Now we’ve got quite a few columns to work with here generated by a few formulas.

Organising the data

The general structure is where I want it, but there’s a lot of stuff I don’t need and some of this content still needs tidying up — there’s a lot of stray dashes for example.

I’m like to keep a backup at key points (years of video games perhaps). So at this stage, I tend to make a new sheet, copy and paste values again to give me raw text to work with for the final step which is to make sure my content is properly organised and consistent.

Looking through my sorted content, the first thing I notice is that there’s a weird column. What should be the date is in a weird number format. That was something the formula covered for us, but now we’ve just got text, the spreadsheets best guess is as to the format of the number is, well, wrong.

But we can easily change that by setting the FORMATof the column to date. We’ll need to check this with all of our columns as we go through cleaning up.

Setting the FORMAT for the column can catch weird errors as we gather values from formulas

The next thing is to get rid of the columns I don’t need — that includes the intermediate splits and the Wikipedia references.

Just to say that personally, I’ve got into the habit, where possible, to include the raw content in my sheets along with the sorted stuff. It means I can always fall back on the original to check or analyse in a different way. So I’m going to keep that first column.

Now we have the data elements separated we can get rid of our working out and redundant columns.

More cleaning with FIND and REPLACE.

Now that each element is separated, I need to look at making sure the data is formatted correctly and there are no stray elements. column. Let’s start by getting rid of the comma at the end of the age.

You can see the third entry doesn’t have one and the spreadsheet has placed that to the right of the cell. The rest are to the left. Sheets is seeing that as a number and the others as text. So I need to get rid of that trailing comma and make sure the format is correct.

We can do that easily using the Find and Replace function.

Highlight the whole column.
Select Edit > Find and Replace
Put the comma in the Find section, leave the Replace with section empty
Click Replace all

Magic!

We can do something similar with the dashes in the Cause of death column.

Select Edit > Find and Replace
Put a dash in the Find column
Nothing in the Replace with column
Replace all.

Using Find and replace doesn’t seem to work with the hyphen

But hold on! There are still dashes in there. But here’s the trick. These aren’t dashes! These are emdashes. It’s a great example of how inconsistent use of characters is a real issue when ‘cleaning data’.

Copying stray characters and using them to drive find and replace is a great way to tidy data.

In this instance, we can simply run another find and replace using the emdash ( — ). But, as I’ve said before. Sometimes it’s not that straightforward. You may have to do multiple passes to clean stuff up. I can also see that we also have a trailing comma in our Cause of death column. Could we use Find and replace again? Not this time.

REGEX and Find and Replace

Remember we know that we have at least one entry where a comma separates two factors so I can’t just go at it with split(). I need to be able to specifically define that it’s the end comma we want to get rid of.

Luckily, we can still use the find and replace function to do this. But instead of just searching for the comma we can add some extra power to by using a bit of code in the form of a regular expression.

A regular expression or REGEX is a way of describing a pattern of content to look for rather than a specific term. They use what can be a complex collection of letters, numbers and symbols to build these patterns. For example, a caret (^) denotes the start of a string and the dollar symbol ($) the end. Here’s an example.

Let’s say I want to anonymise the location details in some data by removing the last three characters of a list of postcodes in my data. UK postcodes (e.g. SW1A 2LW) always end with three characters that refine the location. But we don’t know what they might be in our data. A regular expression would describe the pattern to look for.

simple regular expression to do that would be

.{3}$

Here’s how that regex breaks down.

The . (dot) says find any character
The {3} says do this three times
The $ says start from the end of the string.

So let’s try some regex to get a little more control over where our comma is.

Highlight the column with the cause of death in it
Select Edit > Find and replace
Check the Search Using a Regular Expression box
Add , $ to the find box (there’s a space between the comma and dollar)
Leave the Replace With box blank
Click Replace All.

regex can help us focus what we find and replace

All the trailing commas go but that comma between the cause of death remnains. Let’s break down that regular expression:

The , is the thing we are looking for, the comma
The $ is a special character that says ‘look from the end of the string’

You can see how we could change the comma for another character. But be careful. Some characters have a special meaning in REGEX. We’ve already seen how the dot means ‘any character’. So if we wanted to get rid of the full stop at the end of any sentence we might try

.$

But following the rules of regex, would find any character at the end of the line. Similarly, if we wanted to remove the dollar sign from the beginning of a list of costs, we can’t use the following. The dollar sign here, as we have seen, means the end of the string

^$

To get around this we use an ‘escape symbol’ in the form of a backslash. (\). So the examples above would become \.$and ^\$

I think it’s fair to say that at first glance, regex statements are not very intuitive. There’s a bewildering array of operators and loads of combinations you can apply. The positive is that makes them really powerful and you’ll find them in most programming languages. So it’s worth knowing about them. But personally, I find them a bit of a brain squeeze.

There are lots of cheat sheets and tutorials kicking around to help. But my approach is to keep a note of patterns that work for tasks I do a lot. I have a little notepad file full of them, so I can reuse them as required. But as with most things coding related, Google is your friend.

So that’s the comma at the end taken care of. But let’s not waste the trip into regex. Checking through the final column of locations I can see we still have a few of the Wikipedia Refs in there e.g.

Colnbrook Detention Centre[36]

Getting rid of these would seem like an ideal thing for regex. So, I’m going to select all the columns apart from the first, raw data column. Trawl through to see if there are any hiding and get rid of them.

The escape character might be needed in your Regex statement.

Select the columns I want to look in.
Go to Edit > Find and Replace
Check the Search Using a Regular Expression box
Add \[.+\] to the find box
Leave the Replace With box blank
Click Replace All.

The \[ and \] represent the square brackets. Square brackets are reserved in REGEX so we have to note we want to use the actual character by escaping them — using the backslash. The . means find any character. The + means one or more times till you have everything.

Clean up what we can’t see

Finally, I’m going to tidy things up by making sure I have no spaces at the beginning or end of the entries. Whitespace as we call it can cause all kinds of problems. In spreadsheets, for example, it can make it hard to filter columns.

Trimming the spaces at the start and end of cell content (whitespace) is a common data cleaning task

Stripping out whitespace is a very common function in clearing data, so it’s something built into sheets.

Select the whole spreadsheet,
Select Data > Trim Whitespace.

Describing the data

The final step is to add some headings that reference what the data is.

Add a new row at the top and add column headings.

Headings are really important to be able to identify content.

Now I can use the built-in filter options to quickly sort the data as well as more advanced options like a pivot table to really drill down into the data.

Once the data is organised and structured we can interrogate it using tools like Pivot Tables.

I can also download the sheet as a new spreadsheet file so that I have my data in a form I can share and re-use or add to.

Conclusions

We’ve covered a lot of ground here.

We grabbed the content of a list from a WIKIPEDIA page using the importHTML formula. It only works with LISTS and TABLES so if you’re using it outside of Wikipedia you’ll need to do some digging into the html of the page you’re after. But play nice ! Just because we can scrape pages, it doesn’t mean we are actually allowed (copyright anyone!) or that we should.
We then used the =SPLIT() formula and tweaked its attributes to break the content down and organise it into columns.

We had to do some checking and some cleaning up along the way. But those two formulas alone save us from a huge amount of cut and paste when it comes to grabbing and organising content.

Finally, we did some cleaning — getting rid of unwanted symbols and stray punctuation. We used find and replace function along with some regular expressions to make sure our content was as clean and consistent as possible.

It’s easy to get focused on the details of the data here. But this isn’t about the immigration data. Ideally, you can see how the formulas and functions alone could form the basic building blocks of any part of the data collection process.

Hopefully, you’ll also get an idea of the ways ‘dirty data’ can trip us up. But it’s a good reminder that working with data is a process — this stuff isn’t a magic bullet. The process of working with data can be time-consuming, hard or frustrating. It might be all three. The key is that there are large chunks of the process where tools like Google sheets can act as reliable pain management!

Next steps.

If you want to develop some of the things we’ve explored here then you might want to consider the following:

The importHTMLformula is limited to lists and tables. Have a look at the =importXML() formula. How could you use it to unlock content on other pages? tip — the HTML for a list is uland a list item is li
How could you use a combination of REGEX and FIND and REPLACE to remove the dashes and emdashes at the same time?