How We Do Serverless ETL with Google Cloud Functions

Published in Life at Mapan · 4 min read · Sep 20, 2018
Source: Mapan’s Creative Team

At Mapan, we strive to provide thousands of lower-income communities with better access to valuable products and services through technology. To achieve this goal, we operate across many areas of the business: communities (online and offline), order management, customer service, and supply chain. These areas are intertwined and produce a large amount of data in a variety of formats.

The Data & Business Intelligence team at Mapan is responsible for producing insights and analytics from the available data. We have data stored in many sources: MySQL, PostgreSQL, Couchbase, third-party systems, and even Google Spreadsheets. Consuming data from all of these sources is a big challenge, since each source has a different structure and some of the data is unstructured.

Due to the complex nature of our business, rapidly changing and complex business rules are inevitable. We frequently have to change the business logic in our ETL system within a limited time. Sometimes we are overwhelmed because we need to take care of the business logic and the ETL infrastructure at the same time. We want to focus on the thing that really matters, the business rules, instead of managing ETL infrastructure. This is where the Google Cloud Platform stack comes to the rescue!

Serverless ETL with Google Cloud Functions

According to the official documentation,

Google Cloud Functions is a serverless execution environment for building and connecting cloud services. With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services. Your function is triggered when an event being watched is fired. Your code executes in a fully managed environment. There is no need to provision any infrastructure or worry about managing any servers.

In this post, I will demonstrate how we use Google Cloud Functions for data transformation, using data from the Zendesk API as an example. I will not cover data extraction here, since our extraction system is separate from the transformation. The result of the transformation is loaded into Google BigQuery.

Architecture Diagram of Data Transformation with Google Cloud Function

The Problem

Suppose we want to track the history of assignee changes on Zendesk tickets. We can use the Zendesk Ticket Events API, since the assignee-change information can be retrieved from there. We already have a job that copies data from the Zendesk API into our data lake on Google Cloud Storage, so we just need to use the data from Google Cloud Storage. Here is an example of a Zendesk Ticket Events record:

{
  "id": 1717,
  "ticket_id": 27,
  "timestamp": 138561439,
  "updater_id": -1,
  "via": "Email",
  "child_events": [{
    "id": 120120,
    "via": "Web form",
    "via_reference_id": null,
    "comment_present": true,
    "comment_public": false,
    "event_type": "Comment"
  }, {
    "id": 120121,
    "via": "Web form",
    "via_reference_id": null,
    "assignee_id": 1231251235,
    "event_type": "Change",
    "previous_value": "12412512"
  }, {
    "id": 120122,
    "via": "Web form",
    "via_reference_id": null,
    "group_id": 14125124,
    "event_type": "Change",
    "previous_value": "13623423"
  }]
}

The Ticket Events data doesn't have a consistent schema, because child_events contains various kinds of events and every event type has a different structure. Hence, we need a function to preprocess this data before loading it into Google BigQuery for further analysis.
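To make the flattening concrete, here is a minimal sketch of the preprocessing step, based on the sample record above. Note that only "Change" child events that carry an assignee_id are assignee changes; group changes also have event_type "Change" but no assignee_id. The output column names (changed_at, previous_assignee_id) are our own choice, not a Zendesk or BigQuery requirement.

```python
import json

# A record in the shape shown above, truncated to the relevant fields.
ticket_event = json.loads("""
{
  "ticket_id": 27,
  "timestamp": 138561439,
  "child_events": [
    {"id": 120120, "event_type": "Comment", "comment_present": true},
    {"id": 120121, "event_type": "Change", "assignee_id": 1231251235,
     "previous_value": "12412512"},
    {"id": 120122, "event_type": "Change", "group_id": 14125124,
     "previous_value": "13623423"}
  ]
}
""")

def extract_assignee_changes(event):
    """Flatten one ticket event into one row per assignee change.

    The presence check on "assignee_id" filters out other "Change"
    child events, such as group changes.
    """
    return [
        {
            "ticket_id": event["ticket_id"],
            "changed_at": event["timestamp"],
            "assignee_id": child["assignee_id"],
            "previous_assignee_id": child.get("previous_value"),
        }
        for child in event.get("child_events", [])
        if child.get("event_type") == "Change" and "assignee_id" in child
    ]

rows = extract_assignee_changes(ticket_event)
# Only the child event that carries an assignee_id survives.
```

Each resulting row has a fixed, flat schema, which is exactly what BigQuery needs.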

In this case, we will use a GCS trigger for the Cloud Function, so whenever new data is loaded to GCS, the function is executed automatically.

Writing the Function

The function reads the data from Google Cloud Storage, transforms it, and loads the result into Google BigQuery.
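The original post embedded the function's source, which did not survive here, so what follows is a sketch of what a GCS-triggered background function for this pipeline could look like, not Mapan's actual code. It assumes the extraction job writes newline-delimited JSON, and the destination table name zendesk.ticket_assignee_history is hypothetical. The google-cloud-storage and google-cloud-bigquery clients are imported lazily so that transform() stays testable without them installed.

```python
import json

def transform(record):
    """Flatten a ticket event into assignee-change rows."""
    return [
        {"ticket_id": record["ticket_id"],
         "changed_at": record["timestamp"],
         "assignee_id": child["assignee_id"],
         "previous_assignee_id": child.get("previous_value")}
        for child in record.get("child_events", [])
        if child.get("event_type") == "Change" and "assignee_id" in child
    ]

def ticket_events_handler(data, context):
    """Background function triggered by google.storage.object.finalize.

    `data` describes the finalized object via data["bucket"] and
    data["name"].
    """
    # Lazy imports: keeps the pure transform above unit-testable offline.
    from google.cloud import bigquery, storage

    blob = storage.Client().bucket(data["bucket"]).blob(data["name"])
    # Assumption: the file contains newline-delimited JSON records.
    records = (json.loads(line)
               for line in blob.download_as_text().splitlines() if line)
    rows = [row for record in records for row in transform(record)]
    if not rows:
        return  # nothing relevant in this file

    # Hypothetical destination table; wait for the load job to finish
    # so errors surface in the function's logs.
    client = bigquery.Client()
    client.load_table_from_json(rows, "zendesk.ticket_assignee_history").result()
```

Keeping the transformation in a separate pure function also makes the business logic easy to change without touching the GCS/BigQuery wiring, which is the main point of this setup.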

Deployment

Deployment is very easy. We can deploy our function with a single command:

gcloud beta functions deploy ticket_events_handler \
--runtime python37 \
--trigger-resource my-bucket \
--trigger-event google.storage.object.finalize

Each trigger type has different deployment options. You can read more in the Google Cloud Functions documentation.

Final Thoughts

Once the function has been deployed, it will transform the data and load it into Google BigQuery whenever new data arrives in Google Cloud Storage. By taking a serverless approach to data transformation, we greatly reduce the time spent maintaining data infrastructure and can focus on the business logic. Furthermore, the event-driven approach enables us to deliver insights in near real-time. Last but not least, whatever tools we use for data transformation, what matters is that we can deliver insights to the business.

This article was written by Muhammad Redho Ayassa.

If you are interested in knowing more on how your skill in data engineering or data science can help society, send your latest CV to recruitment@ruma.co.id and let’s embark on an exciting journey together!
