Politely scraping by — tracking my passport’s whereabouts with Puppeteer and Docker
From time to time, I make the quest home to update a sticker in my passport so that I can keep working overseas. It is normally a stressful three-week or longer trip that carves a nice hole in my bank account, while I pretend that everything is not resting on a stranger’s decision as to whether I can return to the life I have built over the last few years. Of course I am immensely grateful for the opportunities I have been afforded to work abroad, and it’s well worth the stress and the paperwork.
The process involves having your passport held while the new sticker is issued and the paperwork is finalised, followed by a courier delivering it back to you within a reasonable timeframe. The whole process is rather nerve-wracking in general until you have the passport safely back in your hands and it’s a massive relief when it’s all done.
🏛 🙍🏻 🕰 🚗 📬 📘 🙆🏻 🎉
Depending on which country and city the passport is processed in, the local courier company used can vary quite a bit in quality of service. In my home country, the courier of choice does not offer modern features such as SMS tracking updates. The alternative is to keep the tracking webpage open in a tab at all times, feverishly hitting refresh and hoping for an update. This is very important, because if you miss the first delivery attempt, the package is taken to the depot for personal pickup, which causes further delays and stress.
I’m currently going through this process at home. Only this time, I decided I might be able to put some more chill in my life and figure out a solution to get SMS updates on my passport’s delivery journey. That way I’ll know where it is without the paranoid page refreshing when with family or friends (“sorry, I’m just checking on my passport”), or when trying to concentrate on work.
I sat down and ran through the different ways I could set this up. A lot of it boiled down to things I knew and services I’d used in the past.
For sending the text message updates, I figured I’d use Twilio as I already have an account and a credit balance with them from previous personal projects. That and they’re a rad company with good quality SDKs. I wanted this project to run without my laptop having to be powered on and connected to the (dire) hotel internet, so naturally the code should run in the cloud somewhere. The courier service doesn’t offer an API for tracking packages, so web scraping would be the only way to do this. Web scraping is a bit cheeky and I recommend it as the last resort for this kind of project. After some hesitation I decided that if I just ran a scrape every 10 minutes, it wouldn’t be causing any noticeable stress on the website at all.
Google recently released a great tool called Puppeteer.
Puppeteer is a NodeJS library that provides great abstractions for running a Chrome browser instance in a headless fashion. I’d been looking for an excuse to give it a try. Today seemed like a good day to do so!
I ended up writing a small NodeJS script which uses both Puppeteer and Twilio to complete the task at hand.
So what exactly does the script do? Here is what happens every 10 minutes:
- Launch Puppeteer, and open the tracking URL as a page.
- Extract the text content of the delivery status DOM element returned by the query selector.
- Compare the text content with the previous extraction that ran 10 minutes prior.
- If the two don’t match, the status must have changed. Cache the new result and dispatch a text message via Twilio with the new status in the message body.
- Wait 10 minutes, and repeat all steps.
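The steps above can be sketched roughly as follows. This is an illustration rather than the exact code from the project: `checkOnce`, `memoryStore`, `twilioNotifier` and `runForever` are names made up for this sketch, and the scraper and the SMS sender are passed in as plain functions so each piece can be swapped or stubbed. The only third-party call is Twilio’s `client.messages.create`, and the library is loaded lazily inside the notifier.

```javascript
// Sketch only: all function names here are hypothetical, not from the original project.

// Simple in-memory holder for the last status we saw.
function memoryStore(initial = '') {
  let value = initial;
  return { get: () => value, set: (next) => { value = next; } };
}

// One pass of the loop: fetch, compare, and notify only when the status changed.
// Resolves to true if a message was sent, false otherwise.
async function checkOnce({ fetchStatus, notify, store }) {
  const latest = await fetchStatus();
  if (!latest || latest === store.get()) {
    return false; // nothing scraped, or nothing new, so stay quiet
  }
  store.set(latest); // cache the new result first
  await notify(`Passport update: ${latest}`);
  return true;
}

// Builds a notify function backed by Twilio. The caller supplies credentials and numbers.
function twilioNotifier({ accountSid, authToken, from, to }) {
  const client = require('twilio')(accountSid, authToken); // loaded lazily
  return (body) => client.messages.create({ body, from, to });
}

const TEN_MINUTES = 10 * 60 * 1000;

// Repeat forever, pausing between passes. A failed pass is logged
// instead of being allowed to crash the process.
async function runForever(deps, intervalMs = TEN_MINUTES) {
  for (;;) {
    try {
      await checkOnce(deps);
    } catch (err) {
      console.error('Status check failed:', err);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Wiring it together would look something like `runForever({ fetchStatus, notify: twilioNotifier({ ... }), store: memoryStore() })`, with the Twilio credentials and phone numbers read from environment variables rather than hard-coded.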
The diagram below is a visual representation of these steps:
A simplified extract of the Puppeteer code:
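As a rough stand-in (a sketch, not the exact code from the project), the scraping step might look like the following. The function names are made up for illustration, and the tracking URL and CSS selector are left as parameters because the courier’s real ones aren’t given here. The Puppeteer calls used are `launch`, `newPage`, `goto`, `waitForSelector` and `$eval`.

```javascript
// Sketch only: fetchStatus and normaliseStatus are hypothetical names,
// and trackingUrl / statusSelector are placeholders supplied by the caller.

// Collapse runs of whitespace so cosmetic markup changes don't look like a new status.
function normaliseStatus(text) {
  return String(text || '').replace(/\s+/g, ' ').trim();
}

async function fetchStatus(trackingUrl, statusSelector) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when a scrape runs
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before reading the DOM.
    await page.goto(trackingUrl, { waitUntil: 'networkidle2' });
    await page.waitForSelector(statusSelector);
    // Runs in the page context and returns the element's text content.
    const text = await page.$eval(statusSelector, (el) => el.textContent);
    return normaliseStatus(text);
  } finally {
    // Always close the browser, even if the page failed to load.
    await browser.close();
  }
}
```

Closing the browser in a `finally` block matters for a long-running job like this one: a leaked Chrome process every 10 minutes would eventually exhaust the container’s memory.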
I then bundled everything up into a Docker container so it would be easy to deploy and run. Docker allows me to set up Puppeteer just once, and I can expect my code to work in the same way in production as it did on my machine when developing. I also have a number of convenient places for my container to run in the cloud without having to manually configure any VMs. I’ll cover more on the advantages of using a container throughout this post. Here’s a handy resource for getting Puppeteer up and running in a Docker container. 📦 🐋
To mitigate my container possibly crashing (uncaught promise rejection, anyone?) I wanted a more permanent solution to store the latest delivery status rather than relying on a variable at runtime. That way if the container restarted due to an exception in the NodeJS process, I wouldn’t receive duplicate delivery status updates. A database is overkill for one tiny piece of data (a string!), so a text file would definitely do the trick. Where can I persist this file so that it survives a container restart though?
That’s where Docker Volumes come in handy.
Docker Volumes allow you to have persisted data that containers can share and access. You can read and write from these volumes, by ‘mounting’ them to a location of your choice within the container’s file system.
I created a file called status.txt in a mounted volume for this purpose, and added some code to my script to write the new status to the file whenever it changed. When the script boots for the first time, it reads the contents of the file and caches it as a variable, ready to match against a newly fetched status from Puppeteer. When deployed, this file lives in an Azure File Storage share for reasons I’ll explain soon.
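A minimal sketch of that read-on-boot, write-on-change behaviour, assuming the volume is mounted somewhere like `/data` (a made-up path) and using a hypothetical helper named `fileStore`; only Node’s built-in `fs` and `path` modules are involved:

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: fileStore is a hypothetical helper, not the project's actual code.
// It reads the saved status once at start-up and rewrites the file on every change.
function fileStore(filePath) {
  let cached = '';
  try {
    cached = fs.readFileSync(filePath, 'utf8').trim();
  } catch (err) {
    // A missing file just means no status has been recorded yet.
    if (err.code !== 'ENOENT') throw err;
  }
  return {
    get: () => cached,
    set: (status) => {
      fs.mkdirSync(path.dirname(filePath), { recursive: true });
      fs.writeFileSync(filePath, status, 'utf8');
      cached = status;
    },
  };
}
```

Usage would be along the lines of `const store = fileStore('/data/status.txt')`, then `store.get()` before each comparison and `store.set(newStatus)` whenever the status changes. Because the file sits on the mounted volume, a restarted container picks up where the last one left off.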
After testing locally, the passport tracker was ready to run in the cloud! There are many strategies for deploying Docker containers. This scenario didn’t have to scale, and only needed to run as a single instance. Therefore, I decided to deploy it on Azure Container Instances (ACI), which is currently in public preview. Deploying a container instance can be done in as little as one command with the Azure CLI in your terminal, and only two if you don’t already have a resource group set up for it in your Azure account. ACI supports Docker images pushed to either Docker Hub or Azure Container Registry.
You can specify a volume for your container from an Azure Storage file share quite easily with a few extra options in your deployment command, which is why I chose to store my status text file in this way. It saved me from writing a chunk of extra code just to fetch the remotely stored status file and write to it. Instead, I could just pretend it was a local file and interact with it that way. This feature demonstrated to me that Azure Container Instances are pretty magical, and this project was a perfect use case for it ✨
I tagged and pushed my Docker image to an Azure Container Registry of my own, but you can also push to Docker Hub and deploy from there if you like!
The final deployment command looked like this:
az container create \
--resource-group <resource-group-name> \
--name <container-instance-name> \
--image <docker-image-name> \
--azure-file-volume-account-name <file-storage-name> \
--azure-file-volume-account-key <file-storage-key> \
--azure-file-volume-share-name <file-share-name> \
--azure-file-volume-mount-path <mount-path>
Within 60 seconds, my script was up and running and tracking delivery status every 10 minutes. Over the next couple of days, the realtime delivery updates started turning up on my phone.
They continued to roll in…
Oops! That was ‘out for delivery’ much earlier than I expected. I quickly contacted my sister and asked her to keep an eye out for the courier as I had ducked out to do an errand in the city.
Sure enough — success!
Me with my passport back in my hand, feeling relieved 😌
In all honesty, that ‘out for delivery’ text message kinda saved the day. There’s a chance we might have missed the courier without it. It was well worth the time to code this little service.
In the future this should be possible to do in a serverless environment, which would be preferable over the work of setting up a Docker container. There are still a couple of challenges in getting Chromium and Puppeteer working reliably in this scenario (see issues #515 and #603). Perhaps the next time I file paperwork I’ll be running this in an Azure Function instead! Or, maybe the courier company will implement their own solution which would be even better 😉
I hope you found this post informative and entertaining! You can find the complete code for this project on GitHub if you are interested in checking it out.
🙋🏻 🌏 ✈️️️️ ✈️ 🌎 👮🏾 🛂 ✈️ 🗽 🏙 ☃️ 💗
Thanks to James, my great friend and mentor, for hacking with me on Azure Container Instances this past summer. It got me excited about containers again.
The diagrams in this article feature a combination of original and sourced work. I hand coloured and made other modifications to the following icons from the Noun Project:
- ‘TXT File’ by Nikita Kozin
- ‘text message’ by Ben Davis
- ‘container’ by DPIcons
The Puppeteer logo belongs to Google and the Twilio logo is, well, Twilio’s.