In this document I'm going to explain a task I did for the TBIJ. The task was about Safeguarding Adults Reviews (SARs) and Serious Case Reviews (SCRs) in England: the idea was to take a set of links to Local Authority pages and see which of them relate to a homeless death.
The links sent to me were already scraped, so that part was sorted for me. But when I searched those pages alone, some of them had no mention of SAR/SCR in connection with homelessness, so a better approach was to search across each site rather than just the individual links.
Paul Bradshaw, my professor, said it would be a good idea to create a Google Sheet, copy those links into it, and then create a formula that searches Google for words like homeless, homelessness or rough sleeping. This only has to be done once, for the first link, and then the formula can simply be copied for the rest.
For this, you first go to Google, type site: followed by the link, and then add the word you're looking for in double quotes. The resulting URL looks like this: https://www.google.co.uk/search?q=site%3Ahttp%3A%2F%2Finfo.westberks.gov.uk%2Findex.aspx%3Farticleid%3D30162+%22homeless%22&oq=site&aqs=chrome.5.69i57j69i61l2j69i60j69i59l2.6866j0j7&sourceid=chrome&ie=UTF-8
Decoded, that query is simply: site:http://info.westberks.gov.uk/index.aspx?articleid=30162 "homeless"
Then, to create a formula, take the link above and swap the pasted link for the address of the cell that contains it, wrapped in double quotes and ampersands, straight after the 'q=site%3A'. After that add '+%22homeless%22' (%22 is the URL-encoded double quote) and remove the rest of the URL. Like this:
="https://www.google.co.uk/search?q=site%3A"&B2&"+%22homeless%22"

B2 is the cell address of the first link. The fixed parts of the URL sit in double quotes after an equals sign, which is how the spreadsheet knows it's a formula. To do this for all the links, just double-click the little square in the bottom-right corner of the cell you're working on:

The formula will be copied down for all the links, and this can be repeated for other search terms in separate columns. I did the same with the words 'homelessness' and 'rough sleepers'.
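If you would rather do this step in code, the same search URLs can be built in Python too. Here is a rough sketch, assuming the scraped Local Authority links are held in a list I've called site_links (the one link shown is just the West Berkshire example from above):

# building the Google search URLs in Python instead of a spreadsheet
# site_links is a placeholder for the full list of scraped Local Authority links
site_links = ['http://info.westberks.gov.uk/index.aspx?articleid=30162']
terms = ['homeless', 'homelessness', 'rough sleepers']

searchurls = []
for link in site_links:
    for term in terms:
        # %22 is the URL-encoded double quote; spaces in a term become + signs
        query = 'site%3A' + link + '+%22' + term.replace(' ', '+') + '%22'
        searchurls.append('https://www.google.co.uk/search?q=' + query)

print(len(searchurls))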

After that, I 'joined' all the links into one cell so that copy-pasting them while writing a script would be easy. In Google Sheets this can be written in any cell like this: =JOIN(" ", C2:C155), where C2 is the cell with the first link and C155 is the cell with the last link. Now all the links sit in a single cell, so we can copy-paste just one cell into the script I wrote below.
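One way to get that joined cell into the script is to paste it in as a single string and split it back into a list in Python. A minimal sketch, with the pasted text shortened to two made-up entries:

# pasted_links stands in for the contents of the joined cell (shortened here)
pasted_links = 'https://www.google.co.uk/search?q=site%3Ahttp://example-council-one.gov.uk+%22homeless%22 https://www.google.co.uk/search?q=site%3Ahttp://example-council-two.gov.uk+%22homeless%22'

# split on whitespace to get one search URL per list item
searchurls = pasted_links.split()
print(len(searchurls))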
The second stage is to write a script that visits each link and returns a true or false value (which in this case tells me whether Google has returned any search results or not). I did this using Python, which is a popular language for scraping websites. I wrote it in QuickCode (quickcode.io), a website you can use to code in Python. The reason I chose it is that it runs scraperwiki, and scraperwiki is helpful because it can use many libraries.
The code I used was created by Paul; I just had to adapt it to suit the needs of this project:
#!/usr/bin/env python
import scraperwiki
import requests
import urlparse
import lxml.html
import urllib2

# only one search URL is shown here; the full list has 154 of them
searchurls = ['https://www.google.co.uk/search?q=site%3Ahttp://careandsupport.lbbd.gov.uk/kb5/barkingdagenham/asch/adult.page?adultchannel=7_7+%22homelessness%22']
print len(searchurls)

record = {}
for url in searchurls:
    # pretend to be a normal browser so Google doesn't refuse the request straight away
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    timeout = 0.25
    req = urllib2.Request(url, None, headers)
    # adapting some of the code from https://stackoverflow.com/questions/3569152/parsing-html-with-lxml
    doc = lxml.html.parse(urllib2.urlopen(req))
    print doc
    # Google only shows a "resultStats" div when a search returns results,
    # so len(divs) is 1 if there were results and 0 if there were none
    divs = doc.xpath('//div[@id="resultStats"]')
    print len(divs)
    record['resultsornot'] = len(divs)
    record['url'] = url
    # save one row per URL into a table called "googleresults"
    scraperwiki.sqlite.save(['url'], record, table_name="googleresults")

# this fetch of flourish.org is separate from the Google loop above
html = requests.get("http://www.flourish.org/blog")
print html.content
# Saving data:
# unique_keys = [ 'id' ]
# data = { 'id':12, 'name':'violet', 'age':7 }
# scraperwiki.sql.save(unique_keys, data)
Quick explanation of the code:
· All lines starting with a hashtag are comments: either notes to help me remember things, or lines that were already provided by QuickCode.
· The import lines load libraries into QuickCode for the script to use.
· searchurls is a name I created for the list that stores all the links to be searched.
· The searchurls line above shows only one link; if I had pasted the whole thing with all 154 links it would take pages in this document.
· print displays values while the script runs, so I can see the results as it goes.
· for url in searchurls: this loops through all the links and checks for results.
· “Headers” sends information to the site I’m scraping from and “pretends” to be a browser and not a scraper.
· The information is saved in a table with the name googleresults and can be accessed in another tab called “view data in a table” in Quickcode.
· The rest of the code is just something I copied from Paul and from different websites; I don't necessarily understand all of it. (A rough, more modern version of the same loop is sketched below.)
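For anyone trying to run this outside QuickCode, here is a rough Python 3 sketch of the same check using requests and lxml instead of urllib2 and scraperwiki, saving to a CSV rather than a scraperwiki table. It assumes Google's results page still includes a div with the id 'resultStats', which may well no longer be the case:

import csv
import requests
import lxml.html

searchurls = ['https://www.google.co.uk/search?q=site%3Ahttp://careandsupport.lbbd.gov.uk/kb5/barkingdagenham/asch/adult.page?adultchannel=7_7+%22homelessness%22']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

with open('googleresults.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'resultsornot'])
    for url in searchurls:
        response = requests.get(url, headers=headers)
        doc = lxml.html.fromstring(response.content)
        # as above, the "resultStats" div only appears when Google found something
        divs = doc.xpath('//div[@id="resultStats"]')
        writer.writerow([url, len(divs)])
        print(url, len(divs))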
The one issue with this is that I was scraping Google search results, and Google actively blocks scrapers. One way of getting around that is to use headers to 'pretend' to be a browser. Even after doing this, though, Google only allowed the first 102 links to be scraped, after which I'm assuming it temporarily blocked my IP address. So I waited another hour or so and ran the script again, and still only got 145 links scraped. For the rest, I just looked them up manually. This process was time consuming because I had to repeat it for the other search terms as well.
I'm sure there's a better way to do this. One thing I'm still looking at is using proxies in the script to make Google think the requests are coming from different IP addresses; that would cut out the time I wasted waiting for my IP address to be unblocked.
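I haven't built this yet, but as a very rough sketch of the idea: requests lets you send a request through a proxy, so the script could pick a different proxy for each request and pause for a few seconds between them. The proxy addresses below are made up, and whether this would actually keep Google happy is another question:

import random
import time
import requests

# made-up proxy addresses; a real list would come from a proxy provider
proxies_pool = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:3128'},
]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
searchurls = ['https://www.google.co.uk/search?q=site%3Ahttp://careandsupport.lbbd.gov.uk/kb5/barkingdagenham/asch/adult.page?adultchannel=7_7+%22homelessness%22']

for url in searchurls:
    # route each request through a randomly chosen proxy
    proxy = random.choice(proxies_pool)
    response = requests.get(url, headers=headers, proxies=proxy)
    print(url, response.status_code)
    # wait a little so the requests look less robotic
    time.sleep(random.uniform(5, 15))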
You may be thinking, "why don't you just use another search engine?" I tried that, but for some reason Bing, Yahoo, DuckDuckGo and others just didn't return any results for these searches.


I know I am doing something wrong here, but I haven’t figured that out. Let me know if you know what I am missing!
Anyway, once I finished scraping the results I just manually visited the sites with a positive result and checked whether the terms “homeless”, “homelessness”, and “rough sleepers” were mentioned in relation to SAR/SCR. After that I highlighted those links in green.
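If I were doing this again, some of that manual checking could probably be sped up with a short script that fetches each positive page and flags which of the three terms actually appear in its text (deciding whether the mention relates to a SAR/SCR would still need a human). A rough sketch, with a placeholder list of pages:

import requests
import lxml.html

# placeholder: the pages that came back with a positive Google result
positive_pages = ['http://careandsupport.lbbd.gov.uk/kb5/barkingdagenham/asch/adult.page?adultchannel=7_7']
terms = ['homeless', 'homelessness', 'rough sleepers']

for page in positive_pages:
    response = requests.get(page)
    # pull out just the visible text of the page, lower-cased for matching
    text = lxml.html.fromstring(response.content).text_content().lower()
    found = [term for term in terms if term in text]
    print(page, found)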
