Automatic Web Scraping In The Cloud, For Free

Arief Bayu Purwanto
Bits and pieces of my mind
4 min read · Jun 1, 2020

Background

My local government has a website that shows public CCTVs. After tinkering with it for a few days, I became curious about how many CCTV points there are. Opening the developer console and running a few XPath queries gave me the number. Up to this point, I was quite satisfied.

However, a few days later, when I ran the same XPath queries, I got a different, bigger number. A few weeks later, the number had grown again. Then I realized that the government keeps adding point after point. This piqued my curiosity. My premise was simple: if they keep adding points, there may also be a point in time where they remove some. I wanted to know when, and what was being added or removed.

The premise for scraping was simple: I wanted to know when and where CCTV points were being added or removed.

The Execution

Since I had no VPS ready for this setup, I decided to use GCP. In a way, this also served as a learning process: I had been meaning to check out what GCP has to offer, and the last time I played with it was years ago; they didn't even have PHP back then. However, I chose Python for this exercise, in part because my team is also working on Python projects and I wanted to refresh my skill set.

How I set up the scraper

Here is the explanation, step by step; a rough code sketch of the function follows below.

  1. I set up Cloud Scheduler to publish to a Cloud Pub/Sub topic called “log-functions” every day at a set time after working hours.
  2. Next, I created a Cloud Function called “process_logs” that is triggered by every message sent to “log-functions”.
  3. The Cloud Function then scrapes the contents and stores them as a JSON file, named with the date it was created.
  4. The generated JSON file is then stored in two Cloud Storage buckets, “logger” and “public”.
  5. The copy stored in “public” is renamed to “latest.json” for personal consumption. I’ll explain why later.
  6. Files in “logger” are kept for historical purposes.
  7. “process_logs” also diffs the produced JSON against yesterday’s data and emails me the result.
  8. At this point, the process is basically finished.
Now I can sleep peacefully, knowing that tomorrow morning I’ll have the latest list of active CCTV points 😂
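For reference, here is a minimal sketch of what such a Pub/Sub-triggered function could look like. The source URL, the XPath, the shape of the scraped data, and the send_report helper are all hypothetical placeholders of mine; only the bucket and topic names and the overall flow come from the steps above.

```python
# main.py - a minimal sketch of a "process_logs"-style Cloud Function.
# The URL, XPath, and data shape are assumptions; adapt them to the real site.
import json
from datetime import date, timedelta

import requests
from lxml import html
from google.cloud import storage

SOURCE_URL = "https://example.go.id/cctv"  # hypothetical CCTV listing page
LOGGER_BUCKET = "logger"
PUBLIC_BUCKET = "public"


def scrape_points():
    """Fetch the page and return the CCTV points as a list of dicts."""
    tree = html.fromstring(requests.get(SOURCE_URL, timeout=30).content)
    # Hypothetical XPath; the real page structure will differ.
    return [{"name": el.text_content().strip()}
            for el in tree.xpath("//ul[@id='cctv-list']/li")]


def process_logs(event, context):
    """Entry point, triggered by the 'log-functions' Pub/Sub topic."""
    client = storage.Client()
    today = date.today()
    points = scrape_points()
    payload = json.dumps(points, indent=2)

    # Historical copy, named by date, e.g. 2020-06-01.json
    client.bucket(LOGGER_BUCKET).blob(f"{today.isoformat()}.json") \
        .upload_from_string(payload, content_type="application/json")

    # Public copy, always called latest.json
    client.bucket(PUBLIC_BUCKET).blob("latest.json") \
        .upload_from_string(payload, content_type="application/json")

    # Diff against yesterday's file, if it exists, and report the changes.
    yesterday = (today - timedelta(days=1)).isoformat()
    old_blob = client.bucket(LOGGER_BUCKET).blob(f"{yesterday}.json")
    if old_blob.exists():
        old = {p["name"] for p in json.loads(old_blob.download_as_text())}
        new = {p["name"] for p in points}
        send_report(added=new - old, removed=old - new)


def send_report(added, removed):
    """Placeholder: email the diff however you prefer (SMTP, SendGrid, etc.)."""
    print(f"Added: {sorted(added)}; Removed: {sorted(removed)}")
```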

How Much Does It Cost?

I’m not trying to be clickbaity by putting “Free” in the title above: I did not pay a single penny to Google. Check out the detailed cost breakdown below as proof.

One month of the scraping process

As you can see, all usage is well within the free tier/“always free” quota. The biggest cost was Google Maps, which I’ll explain below; it should not really be counted as part of the scraping process, and even that usage is still within the free tier quota.

Immediate Usage

Now that I have the data, the first use I could think of was to plot the CCTV points on a map, since the original website only shows the CCTVs as a plain list, which makes it hard to visualize which spots are covered and which are not.
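I plotted the points with Google Maps (hence the Maps line on the bill). As a rough illustration of the idea, here is a minimal sketch that does the same thing with folium instead, assuming “latest.json” is a list of objects with name, lat, and lng fields; that structure is my assumption, not what the real file necessarily looks like.

```python
# plot_cctv.py - a rough sketch of plotting the scraped points on a map.
# Uses folium rather than Google Maps; the JSON field names are assumptions.
import json

import folium


def build_map(json_path="latest.json", out_path="cctv_map.html"):
    with open(json_path, encoding="utf-8") as fh:
        points = json.load(fh)  # assumed: [{"name": ..., "lat": ..., "lng": ...}, ...]

    # Center the map on the average of all point coordinates.
    center = [
        sum(p["lat"] for p in points) / len(points),
        sum(p["lng"] for p in points) / len(points),
    ]
    cctv_map = folium.Map(location=center, zoom_start=13)

    # One marker per CCTV point, with its name as the popup.
    for p in points:
        folium.Marker([p["lat"], p["lng"]], popup=p["name"]).add_to(cctv_map)

    cctv_map.save(out_path)


if __name__ == "__main__":
    build_map()
```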

Want access to this map? Sorry, no chance. I’m not ready to spend money on Google Maps usage ✌

Fun fact: there is a bridge near my neighborhood that is quite prone to crime, especially at night. After looking at the generated map, I saw that this bridge has 3 CCTV points assigned: one pointing in and two pointing out on both ends. Well done!

Conclusions

This has been a great exercise and very educational for me. I’m also preparing another write-up with step-by-step instructions that can be followed, since what I did here is quite an isolated problem and I don’t think it is wise to share the code as-is. I might have missed something in this write-up, so if you have questions, feel free to ask me here or on Twitter @ariefbayu.
