If in Doubt, Paste to Excel

Web Scraping with Bash, Grep and Excel

The web is full of information, when you come across something really useful but it turns out that it’s hard to copy and paste then do something along the lines of this.

First, take a look at the source of the page to see where the information is in the page structure. Could be that you are looking for a specific class or id for a div element or a unique string. Once you know what you are looking for, and in this example we are looking for email address within the page, create a simple bash script which looks like something like this:

#!/bin/bash
for i in `seq 1 10`; do
echo “www.examplepage/users/page/$i/"
curl -v — silent https://www.examplepage/users/page/$i/ 2>&1 | grep @
done

This script will go through 10 pages (www.examplepage/users/page/1/, www.examplepage/users/page/2/ etc) and output a strings where the character “@” is within the page structure.

The important line is

curl -v — silent https://www.examplepage/users/page/$i/ 2>&1 | grep @

where we only output the return of the curl command (no errors/warnings) and then then use the incredibly useful grep command to look for our string (in this case the character “@”).

An example output is something like this:

<a class=”keywordclass” href=”https://www.examplepage/users/user@example.com/">

every time there is a line matching the “@” character. Say we have a 100 of those entries, how do we semi-automate turning that into a list of nice email addresses?

Here’s one way: Just paste all these lines into Excel and use the Excel String functions and filter to turn it into an actionable list.

In this case we have pasted the string output from the Bash script into column B.

The C column is a simple MID function where we take all (ok, 100) characters from character 61.

=MID(B2;61;100)

In column D we just cut off all the characters to right and including the “/” character

=LEFT(C2;FIND(“/”;C2;1)-1)

What we like to do is then clean the list using filters. For example you may have duplicates, or entries which do not contain in a email. With Excel it’s super easy to filter for those entries.

While you could just to all of this programmatically what we like about this semi-automatic way is that in Excel it’s very easy to play around with potential filters as you see the effects immediately.

For Growth Hackers this may be a good technique, you can always automate it further.

We would go so far as to say “If in doubt, paste to Excel” and take it from there.

— —

We are always looking for fun projects or good people to join our team either in Houston, London or Madrid or remotely. Get in touch here.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.