How to keep your app serverless with Google Sheets as a data store
This is the third article in a series covering the worldwide spread of COVID-19 from a data-journalism perspective. The first article was a general description of the solution behind our interactive map. The second was about working with various data sources and rendering custom features on the map.
This one is about the technical aspects of building a serverless Python app for fetching, combining, storing, and serving JSON data.
Cloud Function
Initially, the scraping app was intended to be run as a simple Python script from the CLI, as a Google Cloud Function, an AWS Lambda, or a Docker container. And it worked fine on my Ubuntu server as a cron job.
However, there was an obstacle with manual data input: I had to update the COVID-19 data for Kosovo and Belarus in the source code and then push it to the repo. That is neither a convenient nor a resilient solution.
But how do we store and update this data while keeping the app as lightweight as it is? How do we make updating the data smooth for people without a tech background (journalists)?
There is a solution.
Google Spreadsheet as a Storage Engine
Imagine you have to store and update data that your app uses regularly. The data has to be securely updatable by several people and has to have a user-friendly interface.
Of course, you could spend a few hours setting up a whole virtual machine, configuring a web server and a web framework, and managing user credentials.
But what if we simply use a Google Spreadsheet as a storage engine and API layer?
Spreadsheet Key Benefits
- Access Control Model
- Well-known UI and workflow
- Data validation
- Data export as CSV via URL (we ignore the Google Sheets API here)
The Solution
Here is our data-storage in Google Spreadsheet:
To make it easier for users to update the data and to avoid input errors, we may implement data validation and formatting for the columns.
Each field in the COUNTRY column becomes a dropdown with a list of countries from another sheet. It's kind of a ForeignKey relationship in your model, only without unique indexes.
Now we have to make the Spreadsheet available as a CSV file via URL: File → Publish to the web.
Done! Now you have a URL with your data in a readable format. You may share the Spreadsheet with designated people and let Google manage authentication and authorization for you.
Obviously, it’s possible to use the Google Sheets API to fetch the data. However, CSV export seems much faster and easier in this particular case.
Python Cloud Function
The last thing to do is to write some code. Here is a primitive function that fetches the content of the Spreadsheet and returns a list of records:
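A minimal, standard-library-only sketch of such a function could look like this (the real gist may differ; the function and helper names here are illustrative):

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Parse CSV text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_manual_data(csv_url):
    """Download a published Google Sheet as CSV and return a list of records."""
    # csv_url is the "Publish to the web" CSV link of the Spreadsheet
    with urlopen(csv_url) as response:
        return parse_csv(response.read().decode("utf-8"))
```

Each record comes back as a dict keyed by the column headers (e.g. COUNTRY), which is convenient for merging with other sources later.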
However, that’s not enough for a serverless app. We have to fetch data from the CSSE at JHU ArcGIS map feed and GitHub repo, and from the Worldometer website, and, finally, combine that data with our Google Sheets source.
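The combining step can be sketched as a merge of per-country records, with the manually curated Google Sheets data applied last so that human corrections win. This is an illustrative sketch, not the project's actual merge logic:

```python
def combine_sources(*sources):
    """Merge lists of per-country records; later sources override earlier ones.

    Each source is a list of dicts with a COUNTRY key. Pass the
    manual Google Sheets records last so they take precedence over
    the automatically scraped feeds.
    """
    combined = {}
    for records in sources:
        for record in records:
            country = record["COUNTRY"]
            combined.setdefault(country, {}).update(record)
    return list(combined.values())
```

Called as `combine_sources(jhu_records, worldometer_records, manual_records)`, this keeps the scraped numbers except where the sheet provides an override (e.g. Kosovo and Belarus).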
So we start with installing the dependencies (see requirements.txt). To store our JSON output in an Amazon S3-compatible storage, we have to set the AWS_* environment variables. Otherwise, the app falls back to storing the data in Redis.
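The backend selection might look like the following sketch. The exact environment variable names beyond the standard AWS credentials (e.g. AWS_BUCKET) and the function names are assumptions, not taken from the project:

```python
import json
import os


def choose_backend():
    """Pick S3 when AWS credentials are configured, else fall back to Redis."""
    if os.environ.get("AWS_ACCESS_KEY_ID") and os.environ.get("AWS_SECRET_ACCESS_KEY"):
        return "s3"
    return "redis"


def store_json(data, key="covid.json"):
    """Serialize data and write it to the selected storage backend."""
    payload = json.dumps(data)
    backend = choose_backend()
    if backend == "s3":
        import boto3  # third-party, listed in requirements.txt
        boto3.client("s3").put_object(
            Bucket=os.environ["AWS_BUCKET"], Key=key, Body=payload
        )
    else:
        import redis  # fall-back storage when no AWS credentials are set
        redis.Redis().set(key, payload)
    return backend
```

Deferring the boto3/redis imports into the branches keeps the Cloud Function's cold start lean when only one backend is ever used.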
In the gist above, the Google Spreadsheet URL was moved into an environment variable to make the app more flexible.
Running Google Cloud Function
Adding your Python function in the Cloud Console is quite straightforward:
- Select how to trigger function execution (e.g. Cloud Pub/Sub Topic)
- Select source for your code (in our case we use inline editing)
- Copy & paste the function code and provide the list of required packages
- Set environment variables (we set MANUAL_DATA_SOURCE_URL with the URL of the Google Sheet) and click Deploy
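For a Pub/Sub-triggered background function, Cloud Functions calls a Python entry point with `(event, context)` arguments. A minimal skeleton might look like this; `run_update` is a hypothetical placeholder for the real fetch-combine-store pipeline:

```python
import os


def main(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function.

    The Pub/Sub topic serves only as a scheduler tick, so the event
    payload itself is ignored; we just run the scrape-and-store cycle.
    """
    sheet_url = os.environ["MANUAL_DATA_SOURCE_URL"]
    return run_update(sheet_url)


def run_update(sheet_url):
    # hypothetical stand-in for fetching, combining, and storing the data
    return {"source": sheet_url, "status": "ok"}
```

The entry point name you select in the Cloud Console ("Function to execute") must match the function name, `main` in this sketch.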
That’s all.
Going Docker with Redis Support
You may want to go a step further and wrap the whole thing in a Docker container. No problem, folks.
Thanks for reading! You can find a detailed README, all the project sources, and the Docker configuration for an Ubuntu container with a cron job on GitHub.