Real Time scraping using Puppeteer

Patrick Arminio
Published in Team Stink
6 min read · Jul 11, 2018

A few weeks ago, I was browsing Hacker News and stumbled upon an unofficial API for the World Cup. I quickly messaged one of my friends and coworkers, Kiko, to see if he had any ideas on how we could use that API.

Coincidentally, we were having an internal sweepstake at Stink Studios London, so we went for a real-time score and leaderboard for the World Cup. Luckily, it was almost the end of the week, so we spent a bit of our 3pm Friday working on this side project. Kiko quickly designed a couple of nice and simple mockups for the scoreboard. We had an unused vertical screen, so he decided to create a layout for it.

I created a mockup using React JS and, since I knew this would only be shown on that screen, I decided to not care about responsiveness of the prototype, at least for the initial version.

I won’t describe the “architecture” of the application, since there are plenty of blog posts for that and I’ve based it on our custom version of the Create React App scripts.

When I finished the first version of the prototype with static data, I started working on the API and database. I was going to try AWS AppSync, but I didn’t really want to set up a new AWS account, so I went for Firebase’s Firestore database. This allowed me to quickly create a reactive application where the data would be updated automatically after any change in the database.

To make things simple, I decided to go with two collections — one to store the list of teams (with some metadata, like colours, name and the “owner” in our internal sweepstakes) and one to store today’s matches with their status and score.

This is a simplified snippet of the component I used to fetch the current games from Firestore. I used the Render Props pattern to keep the data fetching separate from the actual presentation and styling of the data. In componentDidMount, I created an observer that listens for updates to the collection. Unfortunately, I wasn’t able to find a nice way to fetch the refs while fetching a collection without creating two observers and then merging the data manually. Here’s the implementation of the component:
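(The original embed isn’t reproduced here. What follows is only a minimal sketch of the manual merge step, not the component itself. It leaves out React and the Firestore listeners, and every field name in it, such as `homeTeamId` or `owner`, is a hypothetical stand-in. The idea is that each of the two observers hands its latest snapshot to one pure function.)

```typescript
// Hypothetical shapes for the two collections: teams and today's games.
interface TeamDoc {
  id: string;
  teamName: string;
  colour: string;
  owner: string; // sweepstake participant who drew this team
}

interface GameDoc {
  id: string;
  homeTeamId: string; // stands in for a Firestore reference to a team
  awayTeamId: string;
  homeScore: number;
  awayScore: number;
  state: "upcoming" | "live" | "finished";
}

interface GameView {
  id: string;
  home: TeamDoc;
  away: TeamDoc;
  homeScore: number;
  awayScore: number;
  state: GameDoc["state"];
}

// Joins the latest snapshot of each collection. Games whose teams have not
// arrived yet are skipped, so a render never sees a half-resolved game.
function mergeGamesWithTeams(games: GameDoc[], teams: TeamDoc[]): GameView[] {
  const teamsById: { [id: string]: TeamDoc } = {};
  for (const team of teams) {
    teamsById[team.id] = team;
  }

  const merged: GameView[] = [];
  for (const game of games) {
    const home = teamsById[game.homeTeamId];
    const away = teamsById[game.awayTeamId];
    if (home === undefined || away === undefined) {
      continue;
    }
    merged.push({
      id: game.id,
      home: home,
      away: away,
      homeScore: game.homeScore,
      awayScore: game.awayScore,
      state: game.state,
    });
  }
  return merged;
}
```

In a render-props component, the result of `mergeGamesWithTeams` would be what gets passed to the render function whenever either observer fires.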

Thanks to the simplicity of Firestore, I was able to make it work in less than one hour. The next task was to fetch the game data from the API. I was already using Firebase for the database, so I decided to use Firebase Cloud Functions to fetch the data from the API and then store the result in Firestore.

This part was also quite easy. The only issue I found was that Firebase Cloud Functions don’t have any way to run on a schedule, like a cron job. Instead of creating a project on App Engine just for its cron feature, I decided to keep it simple and run a watch command on the Raspberry Pi that was driving the scoreboard screen. I left the API endpoint without authentication in the first prototype since it wasn’t designed to be public.

After a few days, we noticed some errors in the data that was returned from the API, especially with times. As we know, dealing with time (and timezones) is hard.

I didn’t realise at first, but this year’s World Cup matches are in a few different timezones, so I guess that could be one of the reasons the API broke a few times.

For a work project, I used Puppeteer to render videos for some WebGL scenes, so I wondered if I could use it again to scrape data from the FIFA website in real time. Turns out, it is possible and pretty easy, too!

The first step was to have a look at the FIFA website and get a sense of the markup that was being used. Luckily for me, the structure was pretty simple and had plenty of classes that I was able to use with Puppeteer.

The first thing I had to do was to fetch today’s games. Remember, we are only storing today’s games in the database, so I don’t really care about other games. The FIFA website has a nice today class on the list of today’s games, so I was quickly able to find the correct elements (yay semantics!).

Unfortunately, this class wasn’t used on all the rounds of the games, so I had to get today’s games by parsing the date in the HTML and do a manual filter on all the games.
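A minimal sketch of that manual filter, assuming the date appears on the page as text like "11 Jul 2018" (the real format on the FIFA site may well differ) and that "today" is taken as the current UTC calendar day. The `ScrapedFixture` shape and both function names are made up for illustration:

```typescript
// Hypothetical shape of one game as scraped from the page.
interface ScrapedFixture {
  dateText: string; // assumed format: "11 Jul 2018"
  home: string;
  away: string;
}

const MONTH_INDEX: { [abbreviation: string]: number } = {
  jan: 0, feb: 1, mar: 2, apr: 3, may: 4, jun: 5,
  jul: 6, aug: 7, sep: 8, oct: 9, nov: 10, dec: 11,
};

// Returns the UTC timestamp of midnight on the given day,
// or null when the text does not look like "<day> <month> <year>".
function parseDayText(dateText: string): number | null {
  const match = /^(\d{1,2})\s+([A-Za-z]{3})[A-Za-z]*\s+(\d{4})$/.exec(dateText.trim());
  if (match === null) {
    return null;
  }
  const month = MONTH_INDEX[match[2].toLowerCase()];
  if (month === undefined) {
    return null;
  }
  return Date.UTC(Number(match[3]), month, Number(match[1]));
}

// Keeps only the fixtures whose parsed date falls on the same UTC day as `day`.
// Fixtures with unparseable dates are dropped.
function onlyFixturesOn(fixtures: ScrapedFixture[], day: Date): ScrapedFixture[] {
  const target = Date.UTC(day.getUTCFullYear(), day.getUTCMonth(), day.getUTCDate());
  return fixtures.filter((fixture) => parseDayText(fixture.dateText) === target);
}
```

Calling `onlyFixturesOn(allFixtures, new Date())` would then give the list to store for the current day.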

After that, I needed to fetch the game information, but, again, this was pretty simple since we have access to the DOM. I really should say thanks to FIFA’s developers for having plenty of classes in the markup.

The only remaining challenge was to get the date and time in UTC. At first, I was happy to see that there was a data attribute, utcdate, on the elements that were showing the date. Unfortunately, the content of that attribute was wrong.

My second option was to take the local time shown on the page and the timezone of the venue, create a new date object in the correct timezone, and store that in the database. After finding all the timezones for the cities in which the games are played, I wrote this small piece of code using moment.js to get the date and time in our local time:
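The moment.js snippet isn’t reproduced here. As a stand-in, here is a minimal sketch of the same idea using plain `Date` arithmetic instead of moment.js. It assumes the date arrives as "YYYY-MM-DD" and the kick-off as "HH:mm", and it leans on the fact that Russia did not observe daylight saving time in 2018, so a fixed offset per city is enough. The table only lists a handful of host cities, and the names `VENUE_UTC_OFFSET_HOURS` and `kickoffToUtc` are invented for this example:

```typescript
// UTC offsets in hours for a few 2018 host cities (no daylight saving in Russia in 2018).
const VENUE_UTC_OFFSET_HOURS: { [city: string]: number } = {
  Kaliningrad: 2,
  Moscow: 3,
  "Saint Petersburg": 3,
  Samara: 4,
  Yekaterinburg: 5,
};

// Converts a venue-local date and time into a Date holding the matching UTC instant.
// Assumed input formats: dateIso "YYYY-MM-DD", timeText "HH:mm".
function kickoffToUtc(dateIso: string, timeText: string, city: string): Date {
  const offsetHours = VENUE_UTC_OFFSET_HOURS[city];
  if (offsetHours === undefined) {
    throw new Error(`Unknown venue city: ${city}`);
  }
  const dateParts = dateIso.split("-").map(Number); // [year, month, day]
  const timeParts = timeText.split(":").map(Number); // [hours, minutes]

  // Date.UTC reads its fields as UTC, so subtracting the venue offset from the
  // local hour gives the UTC instant. Out-of-range hours roll over to the
  // previous or next day automatically.
  const utc = new Date(
    Date.UTC(dateParts[0], dateParts[1] - 1, dateParts[2], timeParts[0] - offsetHours, timeParts[1])
  );
  if (isNaN(utc.getTime())) {
    throw new Error(`Could not parse "${dateIso} ${timeText}"`);
  }
  return utc;
}
```

Once the value is stored as a UTC instant, the browser showing the scoreboard can format it in its own local time without knowing anything about the venue.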

Cool, done! We have a way to get the data from the website, but can we get this data in real time and update the db accordingly? Well, first we had to update the db with the data. I didn’t want to make the database publicly writable, so I used the Firebase Admin SDK with a service account key that I got from the Firebase console. I also made sure to delete old games, so the app would always show only today’s games.
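The Admin SDK calls themselves are left out here. This sketch only covers the bookkeeping behind "write today’s games, delete the old ones", assuming each game has a stable `id` that is also used as its document ID. That assumption, the `StoredGame` shape and the `planGameSync` name are all hypothetical:

```typescript
// Hypothetical shape of a game document as written to the database.
interface StoredGame {
  id: string;
  homeScore: number;
  awayScore: number;
}

interface SyncPlan {
  toWrite: StoredGame[]; // documents to create or overwrite
  toDelete: string[]; // IDs of documents that are no longer today's games
}

// Compares the IDs currently in the collection with the freshly scraped games
// and works out what to write and what to delete.
function planGameSync(existingIds: string[], scraped: StoredGame[]): SyncPlan {
  const scrapedIds: { [id: string]: true } = {};
  for (const game of scraped) {
    scrapedIds[game.id] = true;
  }
  const toDelete = existingIds.filter((id) => scrapedIds[id] !== true);
  return { toWrite: scraped, toDelete: toDelete };
}
```

With the Admin SDK, each entry in `toWrite` could become a `set` and each entry in `toDelete` a `delete`, for instance grouped into one batched write so the collection never shows a mix of old and new games.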

OK, what about that real-time thing? Once again I was “saved” by the FIFA website, which was already polling the data for me, so I didn’t have to refresh the page every time I wanted new updates. Instead, I asked Puppeteer to read the content of the page whenever I wanted. This was pretty simple: I wrapped all the data-fetching logic in an infinite loop and added a promise-based sleep timeout so it wouldn’t update too often.
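The loop described above could look roughly like this. It is a sketch, not the code that ran on the Raspberry Pi: `scrapeOnce` stands for whatever reads the already-open page and writes to the database, and the optional `shouldStop` callback is an extra added here so the loop can be ended cleanly (a truly infinite loop would simply omit it).

```typescript
// Promise-based sleep: resolves after roughly `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

interface PollOptions {
  intervalMs: number; // pause between two scrapes
  shouldStop?: () => boolean; // optional exit condition, checked before each scrape
}

// Runs `scrapeOnce` over and over with a pause in between. A failed scrape is
// logged and retried after the pause instead of killing the loop.
async function pollForever(scrapeOnce: () => Promise<void>, options: PollOptions): Promise<void> {
  while (!(options.shouldStop !== undefined && options.shouldStop())) {
    try {
      await scrapeOnce();
    } catch (error) {
      console.error("Scrape failed, retrying after the pause:", error);
    }
    await sleep(options.intervalMs);
  }
}
```

Because the page keeps itself up to date, `scrapeOnce` only has to read the current DOM, so there is no page reload inside the loop. Something like `pollForever(scrapeAndStore, { intervalMs: 30000 })` would read it every 30 seconds, where `scrapeAndStore` and the interval are placeholders.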

Considerations

Puppeteer is an awesome tool. If you haven’t watched it yet, I recommend the following talk by Eric Bidelman: The power of Headless Chrome and browser automation. It’s a short showcase of some of the things you can do with the tool.

I can see Puppeteer being used more and more in the future. It can be helpful, for example, to deal with control panels that don’t have an API. Instead of reverse engineering the backend “API” for, let’s say, the control panel of an air conditioning system, we could use Puppeteer to build an API on top of the control panel (if that exists, of course).

I’ve been using TypeScript for the backend code and, while it can be tedious at times, I’m quite sure it saved me a bit of time thanks to the typings for the Firebase library. There were a few occasions where I didn’t read the docs properly and TypeScript helped me find issues even before running the code.

Overall, I had a nice experience using Puppeteer, TypeScript and Firebase and I definitely see myself using these tools again in the future.

Patrick Arminio
Team Stink

Full Stack developer @Stinkstudios. Chair of Python Italia.