Legality of web scraping in PropTech

Lack of data in PropTech has been widely cited as a major barrier, but no one in the big boys’ network really wants to address it. So what should startups do? Naturally, web scraping comes to mind. The truth is that everyone does it. Corporates do it too, to gain an edge over the competition, but they do not want startups to do the same and take away their business.

Robots.txt

How could you possibly get round that? The obvious answer is to read the terms and conditions. A quicker way of checking what startups can and cannot access is to read the robots.txt file for each site. Let’s take Rightmove as an example:

# robots.txt for http://www.rightmove.co.uk

User-agent: Mediapartners-Google
Disallow:

User-Agent: MJ12bot
Disallow:

User-agent: *
Disallow: /login.html?*
Disallow: /register.html?*
Disallow: /addtoshortlist.html*
Disallow: /*/contactBranch.html*
Disallow: /*/save-search.html*
Disallow: /feedback/feedback.html*
Disallow: /price-information.html*
Disallow: /property/reportPropertyErrorForm.html*
Disallow: /property/mediaViewer.html?*
Disallow: /rss/*
Disallow: /user/*

A bot named Mediapartners-Google can access everything, because its Disallow line is empty. Startups come under User-agent: *, which denotes everyone else. According to the file, Rightmove does not allow startups to scrape much. No surprise there! Startups should check the whole list and think of ways to make the most of what’s on offer.
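To make the reading of the file above a little more concrete, here is a minimal Python sketch that checks whether a given path is open to a given bot. It is a simplified illustration, not a full robots.txt implementation: the function names (parse_groups, pattern_to_regex, is_allowed), the bot name MyStartupBot and the test paths are all made up for this example, the rules are only an excerpt of the Rightmove listing, and user agents are matched by exact name only. It assumes the common convention that * in a path matches any run of characters and that the longest matching rule wins.

```python
import re

# Excerpt of the Rightmove rules quoted above (not the full file).
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /login.html?*
Disallow: /rss/*
Disallow: /user/*
"""


def parse_groups(text):
    """Return {lowercased user agent: [(directive, pattern), ...]}."""
    groups = {}
    agents = []            # user agents of the group currently being read
    reading_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue                          # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:                 # previous group has ended
                agents, reading_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_rules = True
            if value:                         # empty value = no restriction
                for agent in agents:
                    groups[agent].append((field, value))
    return groups


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, '$' end anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(text, user_agent, path):
    """True if `path` is not disallowed for `user_agent` (simplified)."""
    groups = parse_groups(text)
    # Assumption: exact name match, otherwise fall back to the '*' group.
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = ("allow", "")                      # no matching rule means allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie = len(pattern) == len(best[1]) and directive == "allow"
            if longer or tie:
                best = (directive, pattern)
    return best[0] == "allow"


# "MyStartupBot" and the paths below are hypothetical examples.
print(is_allowed(ROBOTS_TXT, "MyStartupBot", "/user/profile"))          # False
print(is_allowed(ROBOTS_TXT, "MyStartupBot", "/some-listing.html"))     # True
print(is_allowed(ROBOTS_TXT, "Mediapartners-Google", "/user/profile"))  # True
```

A check like this only tells you what the site owner has asked crawlers to do. It says nothing about the terms and conditions, which still need reading.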

Be brave but not stupid.

Web crawling is what Google does. It indexes and uses everything on the web, and makes a profit doing so. Sometimes there are objections, as in the WSJ case (read it here), but most publishers do not object. Web scraping is a grey area, but it is possible to navigate the labyrinth cleverly. There is an interesting article by data expert Gene Ekster, CFA, about mitigating the data compliance risks associated with web crawling. An infographic from the article shows the well-known scraping cases up to May 2016. Most were ruled in favour of the defendant (the web crawler).

Conclusion

Data is essential infrastructure, and big corporations do not want startups to use it. Imagine the lack of progress in society if you could not walk down the road without paying hundreds of toll charges along the way to reach your destination.

Corporates are sharks, but startups are clever, and the technology is going to get them.

Disclaimer:

This article does not constitute legal advice in any respect. The legality cannot be generalised, as the laws differ from country to country.