Building Scrape.it in 2019 in 24 hours.

Terry Kim
6 min read · May 9, 2019


It is 11AM PST as I begin writing this sentence to start the countdown.

So a few days ago I dropped my laptop. It had all the code for the new launch of Scrape.it API & Scrape.it Terminal that I was working on, plus some experimental p2p stuff I was wrestling with and eventually threw out. Anguish, then sadness, then finally laughter: it was a piece of shit laptop that I overpaid for in cash in a bid.

Fortunately, I’ve been trying to build the perfect SaaS for crawling the web since 2009, so I think I have the blueprint in my head clear as day by now. This is probably my 10th attempt. There are a lot of factors behind why I would just focus on this one thing. I don’t know, maybe an obsession. I *must* complete what I set out to do….10 fucking years ago…and in 24 hours.

(And you know it’s hard owning up to this fact but fuck it: I was depressed as shit for a while, then began having side effects from the medication I was prescribed, and hence I couldn’t do anything for the past few years beyond maintaining the few loyal subscribers remaining.

Fast forward to today and I’m feeling much fucking better. Quitting anti-depressants is the best fucking decision I’ve made. There was this mental block, like a significant degradation in cognitive function. I couldn’t read anything for long, let alone find the motivation to even get out of bed.)

So, unlike all those times when I just went completely radio silent, I’m gonna record the journey to re-launch/rebuild Scrape.it for the 10th fucking time, which also keeps me motivated and accountable.

Here’s what I’m thinking:

You send me WSL (Web Scraping Language, docs here) wrapped in `` (called backticks, had to google that lol) and separated by pipes |, case-insensitive.

https://scrape.it/wsl/`GOTO asdf.com|CRAWL a; |EXTRACT {'titre': '<title>'}`

(Check that URL at 8pm PST, I’m gunning for a proof-of-concept up and running by then, and then integrating the 3 Stripe subscriptions on AWS Lambda ($49, $99, $199/month) for a monolithic “legacy” architecture built with Python Flask, Xvfb, Chrome, Celery, RabbitMQ, SQLite.)

You get back a normal JSON array like this when it finishes crawling the first three links from http://asdf.com

[{'titre': 'About'}, {'titre': 'Asdf.'}, {'titre': 'Forum'}] 
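For the curious, here’s roughly how you could hit that endpoint from a script. This is just my illustration (the `requests` client and percent-encoding are my choice, not a prescribed SDK); the URL and WSL are the ones from above:

```python
import requests
from urllib.parse import quote

# The WSL script goes straight into the URL path, wrapped in backticks.
wsl = "`GOTO asdf.com|CRAWL a; |EXTRACT {'titre': '<title>'}`"

# Percent-encode it so the spaces, quotes and braces survive the trip.
resp = requests.get("https://scrape.it/wsl/" + quote(wsl, safe=""))
resp.raise_for_status()

# Expecting the JSON array shown above once the crawl finishes.
print(resp.json())
```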

You can use Xpath //title or HTML tags like <title> as in the example above.

Key Ingredients:

Background: I got confused by Google Cloud Tasks so I just went with AWS SQS.

Architecture in a nutshell:

[GET scrape.it/wsl/`GOTO...`] -> [Gateway] -> [Cognito Pool] -> [SQS] -> [λ]

Scrape.it will run your WSL script and return 200 OK on success and 500 on ERR.
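To make the tail end of that diagram concrete, here’s a rough sketch of what the λ side could look like. It assumes the standard SQS-to-Lambda event shape; `run_wsl` is a hypothetical stand-in for the actual Flask/Xvfb/Chrome worker, not the real code:

```python
import json

def run_wsl(script):
    # Placeholder for the real worker that drives Xvfb + Chrome and executes a WSL script.
    raise NotImplementedError(script)

def handler(event, context):
    """Hypothetical Lambda entry point: each SQS record carries one WSL script."""
    results = []
    for record in event.get("Records", []):
        wsl_script = json.loads(record["body"])["wsl"]
        try:
            results.append(run_wsl(wsl_script))
        except Exception as exc:
            return {"statusCode": 500, "body": str(exc)}   # 500 on ERR
    return {"statusCode": 200, "body": json.dumps(results)}  # 200 OK
```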

WSL was designed from the ground up to handle the AJAXy SPAs of the past 20 years, built on all sorts of stacks, and it starts with the least amount of assumptions about edge cases like:

1. Navigation to the previous page may function unexpectedly.
2. URL doesn’t change at all; a link to take you back may not exist.
3. Page doesn’t reload; instead it updates via AJAX sorcery.
4. HTML source has no IDs or classes, only line breaks to separate data.
5. Automating permutations of FORM inputs, real and faked.
6. An IP address doesn't ensure network throughput over time.
7. Long ETL cycle: Extract -> Transform/clean data -> Load/deploy.
8. Can’t get text using Xpath because Computer Vision is needed.

The Scrape.it Way:

1. Navigation to the previous page may function unexpectedly.

We only go forward. Redundant pages, like the initial landing or search results page, can be skipped by prepending ^.

ex) `GOTO github.com | ^CRAWL //h3/a IN .px-2` will crawl each repo link for each keyword search, directly visiting the href URLs instead of the default mouse click.

2. URL doesn’t change at all; a link to take you back may not exist.

If the web app’s URL never changes, we can only go forward and must start from the beginning each time. When we encounter multiple branches of a path, like the previous search results page, the CRAWL action generates additional WSL scripts, one for each of the matching elements from the provided selector. In the example above, ^ indicates we can skip the redundant `GOTO github.com` action by navigating directly to each URL; we expect to get back 30 links to crawl (10 links per search results page), so 30 WSL scripts.

ex) `GOTO https://github.com/BootstrapCMS/CMS`, `GOTO ...`, `GOTO https://github.com/Kooboo/CMS`, `GOTO https://github.com/bevacqua/js`, `GOTO ...`, `GOTO https://github.com/hakimel/reveal.js`

This really is WSL's core fundamental. We can apply it to CLICK on forms, dropdowns, buttons, any element: with CRAWL we have a way of expressing a known/unknown collection of links or elements and repeating any sequence of actions over it. These generated WSL scripts get queued and processed by a bunch of "Scrape.it Workers in the Cloud", each with a capricious IP address.
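If it helps, here’s how I picture that expansion step in plain Python. The function name and shapes are mine, not the shipped worker code; the links are the ones from the example above:

```python
def expand_crawl_direct(hrefs):
    """The ^CRAWL case: each matched link becomes its own script that GOTOs the href directly."""
    return ["GOTO " + href for href in hrefs]

# Pretend the worker ran `GOTO github.com | ^CRAWL //h3/a IN .px-2` and the
# XPath matched these repo links on the search results page:
matched = [
    "https://github.com/BootstrapCMS/CMS",
    "https://github.com/Kooboo/CMS",
    "https://github.com/hakimel/reveal.js",
]
for script in expand_crawl_direct(matched):
    print(script)  # GOTO https://github.com/BootstrapCMS/CMS, and so on
# Each generated script then gets queued for a Scrape.it Worker with its own IP.
```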

3. Page doesn’t reload; instead it updates via AJAX sorcery.

So in some SPA sites the URL never changes, we can only go forward, and we’re never notified about page reloads. Then we rely on the same core fundamental of generating additional WSL scripts, but instead of navigating directly to an href URL like above, we need to start from scratch each time. Imagine an alternate-universe GitHub from 2002.

ex) `GOTO github.com | CRAWL //h3/a | EXTRACT {title: h3}` will generate the following WSL scripts: `GOTO github.com | CLICK[1] //h3/a | EXTRACT {title: h3}`, `GOTO github.com | CLICK...`, `GOTO github.com...`, `GOTO github.com | CLICK[10] //h3/a | EXTRACT {title: h3}`

If the website URLs were partially dynamic, we could use ^ again.
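Same idea in Python for the no-URL case: we only know how many elements matched, so each generated script replays the whole prefix and clicks the i-th match. Again, the helper is mine, just to illustrate:

```python
def expand_crawl_by_index(prefix, selector, tail, count):
    """URL-never-changes case: replay the prefix, CLICK the i-th matched element, then continue."""
    return [f"{prefix} | CLICK[{i}] {selector} | {tail}" for i in range(1, count + 1)]

for script in expand_crawl_by_index("GOTO github.com", "//h3/a", "EXTRACT {title: h3}", 10):
    print(script)
# GOTO github.com | CLICK[1] //h3/a | EXTRACT {title: h3}
# ...
# GOTO github.com | CLICK[10] //h3/a | EXTRACT {title: h3}
```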

4. HTML source has no IDs or classes, only line breaks to separate data.

Xpath can do stuff CSS can't, but CSS is elegant for named classes and ids, so we support both, although Xpath is strongly recommended and our preferred way going forward. The WSL from above, `^CRAWL //h3/a IN .px-2`, is a perfect example.
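A tiny illustration of what I mean by "Xpath can do stuff CSS can't", using lxml purely for the demo. With no ids or classes to hook into, XPath can still anchor on text and walk siblings:

```python
from lxml import html

# Markup with no ids or classes, just bold labels and <br> separating the data.
doc = html.fromstring("<div><b>Price</b> $9.99<br><b>Stock</b> 3 left</div>")

# Anchor on the label text, then grab the text node right after it.
# CSS selectors can't match on text content or select text nodes; XPath can.
price = doc.xpath('//b[text()="Price"]/following-sibling::text()[1]')[0].strip()
print(price)  # $9.99
```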

5. Automating permutations of FORM inputs, real and faked.

Once again, WSL's core fundamental comes into play here. By nesting multiple CRAWL actions, we end up with a cartesian product of the nested CRAWL actions.

ex) `GOTO redux-form.com/6.6.3/examples/simple | TYPE input[@name="email"] [user1@x.com, user2@x.com] | CRAWL select/options`

The list of generated WSL scripts is too large to show, but rest assured, every possible combination of options & provided keywords will be generated.
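The "cartesian product" bit is easiest to see in plain Python. Assume the worker has already pulled the dropdown's option values off the page (the values below are made up); the rest is itertools:

```python
from itertools import product

emails = ["user1@x.com", "user2@x.com"]   # the keywords handed to TYPE
options = ["opt-1", "opt-2", "opt-3"]     # hypothetical values scraped from select/options

# Every (email, option) pair gets its own generated WSL script; nesting more
# CRAWL/TYPE actions just multiplies the product further.
combos = list(product(emails, options))
print(len(combos))  # 2 x 3 = 6 scripts to queue
for email, option in combos:
    print(email, option)
```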

6. An IP address doesn't ensure network throughput over time.

Sometimes a website is only available in a given country, or shows completely different content based on cities & provinces, or just straight up doesn't show it to you for whatever reason. A merry-go-round of wildly capricious IP addresses may be required to ensure high crawl throughput and success rate, because a single IP address, given enough time, will no longer have proper network connectivity with the target website for whatever reason.
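Not the production setup, just a sketch of the merry-go-round idea with `requests` and a throwaway proxy list (the proxy URLs are placeholders):

```python
import itertools
import requests

# Hypothetical pool of proxies in different regions; the real workers each sit
# behind their own rotating IP, but the round-robin idea is the same.
proxy_pool = itertools.cycle([
    "http://proxy-us-east.example:8080",
    "http://proxy-eu-west.example:8080",
    "http://proxy-ap-ne.example:8080",
])

def fetch(url):
    """Route each request through the next proxy in the rotation."""
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```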

7. Long ETL cycle: Extract data -> Transform/clean -> Load/deploy.

Now your crawling has finished but the work is not done! You have to check that you got everything, clean the damn thing, and finally deploy it online or on an intranet to a Data Lake. The appropriate solution is to simply let you access the data you just scraped via the same HTTP API you used above:

https://scrape.it/data/`GOTO asdf.com|CRAWL a; |EXTRACT {'titre': '<title>'}`

View the data, already "loaded" and deployed on a fast REST API. Congratulations! You've "APIfied" an APIless website in 4 chars.

https://scrape.it/edit/`GOTO asdf.com|CRAWL a; |EXTRACT {'titre': '<title>'}`

Edit, clean, and wrangle your extracted data stored at the above URL.
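Same client sketch as earlier, just pointed at the /data/ prefix instead of /wsl/ (the encoding is my illustration again):

```python
import requests
from urllib.parse import quote

wsl = "`GOTO asdf.com|CRAWL a; |EXTRACT {'titre': '<title>'}`"

# Same script, different prefix: /data/ serves what the /wsl/ run already extracted.
data = requests.get("https://scrape.it/data/" + quote(wsl, safe="")).json()
print(data)  # e.g. [{'titre': 'About'}, {'titre': 'Asdf.'}, {'titre': 'Forum'}]
```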

8. Computer Vision is needed in extreme cases.

Sometimes you can't get anything from the HTML until it is rendered in a browser; usually there's some black magic in javascript/html5, or the text lives inside images (feelsbadman.jpg). AWS Textract is perfect for this but I applied and never heard back from Amazon!
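If that Textract access ever comes through, the call itself is short. A sketch with boto3 (the screenshot bytes would come from the headless Chrome worker; nothing here is wired into Scrape.it yet):

```python
import boto3

def ocr_screenshot(image_bytes):
    """Pull raw text lines out of a page screenshot with AWS Textract."""
    textract = boto3.client("textract")
    resp = textract.detect_document_text(Document={"Bytes": image_bytes})
    return [block["Text"] for block in resp["Blocks"] if block["BlockType"] == "LINE"]
```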

Conclusion:

It is with great hope & confidence that I say WSL can crawl almost any website without you having to write boilerplate code or set up servers & proxies, expressed via an intuitive & succinct syntax that aims to be maintainable.

That’s All Folks!

It is 1:35 PM PST as I finish writing this sentence and submit on HN.
