Google Cloud Functions — A Brief Tutorial
Loading data from Cloud Storage into BigQuery with Cloud Functions
The Portuguese version of this article can be read here.
There are many instances in which we need to store data for future use. With Google Cloud Platform we can use Google Cloud Storage (GCS) as a data lake to store and load data into BigQuery, a powerful analytics database for building a data warehouse, which lets us store vast amounts of data and run fast SQL queries without having to manage its infrastructure. One way to automatically import data from GCS into BigQuery is to use Cloud Functions as a middleman in the ETL process described below:
The idea of this article is to introduce Google Cloud Functions by building a data pipeline within GCP in which files are uploaded to a bucket in GCS and then read and processed by a Cloud Function, which loads the data into a BigQuery table.
But what exactly are Google Cloud Functions?
To understand Cloud Functions, we first need to know what the different computing service models are.
The image above displays the spectrum of the different available services: Infrastructure as a Service (IaaS), Containers as a Service (CaaS), Platform as a Service (PaaS) and Functions as a Service (FaaS). Notice that although the services on the left provide a higher level of control over their configuration, they require a specialized team to maintain their infrastructure. The services on the right, even though they don't offer as much control, have a higher level of abstraction, and their maintenance is handled by the service provider.
Cloud Functions sit on the right of this spectrum, as FaaS. Typically these services execute a piece of code written by a developer in response to a pre-defined event, with an execution time limit and with scalability managed by the provider. This kind of architecture is referred to as a serverless architecture, where pricing is calculated from the number of times the function is called and the amount of memory and CPU used. This kind of service is attractive because, with little effort, the function is up and running without anyone worrying about managing a server.
The diagram below shows the different kinds of computing services offered by Google Cloud Platform, ranging from IaaS to FaaS.
In summary, Google Cloud Functions is a FaaS offering: a serverless execution environment where code runs in a fully managed environment and single-purpose functions are written and triggered by pre-defined events emitted by other GCP services. Its payment model is based on total function invocations and the computing resources used.
Setting up a Cloud Function
The idea behind this function is to process a JSON file the moment it's stored in a Cloud Storage bucket and load its content into BigQuery.
Below we define the bucket test_function where the files will be stored.
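For those who prefer code over the console, here is a minimal sketch of creating the bucket with the google-cloud-storage client library (bucket names are global, so this exact name may already be taken in your project):

```python
from google.cloud import storage

# Assumes credentials are already configured (e.g. GOOGLE_APPLICATION_CREDENTIALS)
client = storage.Client()

# Creates the bucket used throughout this tutorial
bucket = client.create_bucket("test_function")
print(f"Created bucket {bucket.name}")
```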
Now we'll define a BigQuery table where the data will be loaded. We'll call it my_table, and it has the following columns:
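The same table can also be created with the BigQuery client library. The sketch below is only a reference: the schema is an assumption based on the sample NDJSON used later in this article, and my_project / my_dataset are placeholders for your own project and dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema inferred from the sample NDJSON shown later
schema = [
    bigquery.SchemaField("key", "INTEGER"),
    bigquery.SchemaField("value", "STRING"),
]

# "my_project" and "my_dataset" are placeholders; the dataset must already exist
table = bigquery.Table("my_project.my_dataset.my_table", schema=schema)
table = client.create_table(table)
print(f"Created table {table.full_table_id}")
```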
To set up a Cloud Function we have to go to the Cloud Functions dashboard and click on Create Function:
Then we'll fill in our function's configuration.
First we'll need to fill in the name and region for the function.
In the trigger section we'll select the Cloud Storage option. For the event type we'll go with Finalize/Create. Finally, we'll specify the Cloud Storage bucket; the function will be triggered for each file created in that bucket.
In Runtime, Build, Connections and Security Settings it's possible to configure other settings. Under Advanced there are two mandatory configurations: the amount of memory and the timeout limit.
The defaults are 256MB of memory and a 60-second timeout, but it's possible to allocate up to 8GB of memory and up to 540 seconds (9 minutes) until timeout.
With all that set we can proceed to the code tab. There's a handful of language options and ways to set up our code. We'll choose Python 3.7 and use Google's own editor to write our code.
It is also necessary to specify the function's entry point, meaning we need to define which method will be called when the code is executed. It is also possible to specify which Python packages will be used in the requirements.txt file.
Below is the code written for this function:
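A minimal sketch of such a function, assuming the entry point is named load_to_bigquery, that the destination table lives in a dataset called my_dataset (a placeholder, adjust it to your project), and that google-cloud-bigquery is declared in requirements.txt:

```python
from google.cloud import bigquery

# Placeholder destination table; use "project.dataset.table" if needed
BQ_TABLE = "my_dataset.my_table"

def load_to_bigquery(event, context):
    """Entry point. Triggered when a file is finalized in the bucket.

    `event` carries the bucket and file name of the uploaded object;
    the file is loaded into BigQuery straight from its gs:// URI.
    """
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()  # wait for the load job to complete
    print(f"Loaded {event['name']} into {BQ_TABLE}")
```

Note that the entry point configured in the console has to match the function name in the code, and the keys in the NDJSON file need to match the columns of my_table.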
It's important to note that BigQuery only accepts a specific JSON file format called NDJSON, a JSON file where every line contains a single JSON object, like the example below:
{"key": 1, "value": "my_value"}{"key": 2, "value": "another value"}{"key": 3, "value": "this is NDJSON"}
With everything set we can click on Deploy and see the magic happen =)
Running the Cloud Function
Now we can finally put our function to use! All of the function's logs are registered in GCP's Logging tool. There's also a Logs tab for each function.
To run our function we'll upload an NDJSON file to the test_function bucket in GCS. The function's logs will be available right after its execution.
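The upload can be done through the console, with gsutil, or from code. Here's a minimal sketch using the google-cloud-storage client, assuming the sample.ndjson file from the snippet above:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("test_function")

# Uploading the file fires the Finalize/Create event and triggers the function
blob = bucket.blob("sample.ndjson")
blob.upload_from_filename("sample.ndjson")
print("Uploaded sample.ndjson")
```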
Now we can query the data in BigQuery!
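Either the BigQuery console or the client library works for this. A quick sanity-check sketch with the latter, again assuming the my_dataset placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Simple check on the freshly loaded rows
query = "SELECT key, value FROM `my_dataset.my_table` ORDER BY key"
for row in client.query(query).result():
    print(row.key, row.value)
```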
Final Remarks
Google Cloud Functions are a great tool for painless ETL jobs due to their versatility and the agility of deploying single-purpose functions triggered by events from services such as Pub/Sub and Cloud Storage, or by HTTP calls.
Still, it's important to consider their limitations. The 9-minute execution limit can make more demanding jobs unfeasible, restricting their use to lighter workloads such as micro-batching.