Gateway to PubSub . . . and beyond . . .

Using a single Google Cloud API Gateway to ingest data

Lee Doolan
Qodea Google Cloud Tech Blog
9 min read · Oct 31, 2022


Photo by Florian Wehde on Unsplash

Introduction

In my last blog post I wrote about a new feature enabling us to more easily ingest streaming data into BigQuery from a PubSub queue. That was pretty cool, but that’s only half the data journey.

But how do we stream data into PubSub in the first place, and in this case from outside of Google Cloud?

Luckily we can make use of Google Cloud’s API Gateway and Cloud Functions to achieve just that.

This blog post is about how we set this up and securely stream data at scale into a PubSub queue, using a single HTTP endpoint.

You can read about how this is done in the excellent Google documentation here. This blog will also detail a practical implementation of these services.

Practical usage

In one of my previous roles, working for an e-commerce company, I was responsible for ingesting data into BigQuery from a variety of data sources, ranging from data pulled from traditional databases and files to data pushed to us in real time.

Streaming real-time data seemed the most problematic. The volumes and timing of data being pushed to us were largely unknown, and the number of data sources was plentiful and growing.

Examples of these data sources include:

  • Webhooks: 3rd parties sending us data via our subscriptions to their webhooks
  • Event Streaming: To receive real time messages from our marketplace e-commerce platform
  • Migration to serverless resources: An IT Dev team moving their monolithic applications to microservice architectures
  • Ad hoc Customer Requests: To action various customer data requests, such as data deletions, honouring our GDPR obligations

We needed a simple consistent way to ingest data into our data platform.

Problem statement: the business wanted to analyse data and gain insights faster, and our IT Dev teams wanted help storing their application messages in a single place.

Overview of the Method

What we ideally wanted was a single HTTP gateway enabling us to do the following:

  • Categorise the incoming data by a business function. This helps us identify where the data has come from, and what we need to do with it further downstream.
  • Add an additional field indicating the event type. This will help us parse/process the messages at a later point.

This is what we want from the URL:

https://<api-gateway>/<business-function>/<event-id>?key=<api-key>

To get this to work, we need an API Gateway, a Cloud Function to process the message, and the target PubSub topics. The topology looks like this:

Single API Gateway, 2 PubSub topic targets

The Build

The best way to approach this is to build the dependencies first, which means working backwards! That is: create the PubSub topics, then the Cloud Function, and finally the API Gateway and its components.

For convenience I’ve included all the steps in this handy GitHub repo, and each step can be run with a quick makefile command. Edit the makefile during the process as required.

Code Assumptions & Notes
All code examples assume the API project is called ‘api-stream-project’. This will obviously need changing for your requirements.

This repo assumes you will be building a brand new project and attempts to create one. This makes removing all objects super easy by just deleting the project, but for convenience I’ve also included makefile delete scripts.

This repo assumes you are running locally, and not in Google Cloud Shell etc, so authenticate as required.

PubSub Topics
We’re going to assume we only have 2 business functions, Sales and Product, so we will set up a topic for each as below. If you are new to PubSub you can read more about them here.
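
If you’re not using the repo’s makefile, creating the topics by hand looks something like this. The topic names sales-topic and product-topic are just the convention I’ll use for illustration in this post; match whatever your own function and config expect.

    # Create one topic per business function (names are illustrative)
    gcloud pubsub topics create sales-topic --project=api-stream-project
    gcloud pubsub topics create product-topic --project=api-stream-project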

Note: I’m not going to go through subscribing to these topics and ingestion into BigQuery here. For info, I covered how to do this in a previous blog here.

Cloud Function
The Cloud Function is fairly simple Python code: it processes the incoming message, extracts the business function and event parameters from the incoming URL, and publishes to the applicable PubSub topic.

You can see the full Cloud Function code and requirements file in the github repo here:
https://github.com/leedoolan77/blog-api-gateway/tree/main/Cloud%20Function
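
As a rough sketch of the approach (this is illustrative only, not the repo’s actual code; the <business-function>-topic naming and the query-parameter fallback are my assumptions here):

    # Minimal sketch only; see the repo above for the real implementation.
    import json
    import os

    import functions_framework
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    PROJECT_ID = os.environ.get("GCP_PROJECT", "api-stream-project")


    @functions_framework.http
    def process_message(request):
        """Extract <business-function>/<event-id> from the request and publish to PubSub."""
        # Depending on the gateway's path translation, the values arrive either
        # on the path (/sales/order-created) or as query parameters.
        parts = [p for p in request.path.split("/") if p]
        business_function = request.args.get("business_function") or (parts[0] if len(parts) > 0 else None)
        event_id = request.args.get("event_id") or (parts[1] if len(parts) > 1 else None)
        if not business_function or not event_id:
            return ("Expected /<business-function>/<event-id>", 400)

        payload = request.get_json(silent=True) or {}
        payload["event_id"] = event_id  # add the event type field for downstream parsing

        # Assumed naming convention: one topic per business function, e.g. sales-topic
        topic_path = publisher.topic_path(PROJECT_ID, f"{business_function}-topic")
        publisher.publish(topic_path, json.dumps(payload).encode("utf-8")).result()

        return (json.dumps({"status": "published", "topic": topic_path}), 200,
                {"Content-Type": "application/json"})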

This code can be deployed easily using these gcloud commands. Deploying a Cloud Function usually takes around 2–5 minutes. If you are new to Cloud Functions you can read more about them here.
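
The deployment looks roughly like this; the function name, region and runtime are placeholders, so adjust them to match the repo and your own setup.

    # Deploy the HTTP-triggered function (name, region and runtime are illustrative)
    gcloud functions deploy stream-to-pubsub \
      --project=api-stream-project \
      --region=europe-west2 \
      --runtime=python310 \
      --entry-point=process_message \
      --trigger-http \
      --source=. \
      --no-allow-unauthenticated

If you deploy the function without public access, remember to grant the gateway’s backend service account the Cloud Functions Invoker role on it, so the gateway is able to call it.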

Create API Gateway
Setting up the API Gateway includes a few steps, first creating the API, then an API config and finally an API Gateway to join those elements together. You can read more about these here.

Note: Each step can take anywhere between 5–10 minutes to set up, so be prepared. Also note that occasionally your gcloud commands may appear to hang, so I’d recommend monitoring/refreshing the console if you’re not sure what’s going on.

I’m not going to dwell too much on the API and API Gateway creation. They’re pretty straightforward, and the deployment commands are shown below. But I do want to talk a little more about the API config file, as that’s where the ‘magic’ happens . . .

The config file is a YAML-formatted OpenAPI spec, and is where these features are configured:

  • Which Cloud Function will be invoked when a message is sent to the gateway.
  • The allowable paths of an incoming message, and which parameters will be taken from the path. We will use this for our business function and event id.
  • Whether we should expect an API key and what it will be called.
  • The expected output type, i.e. application/json.
  • How to handle successful and failed responses.

You can view my config file here, but note you will need to change the Cloud Function reference and configure it to your requirements.
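
For orientation, a cut-down config along these lines covers the pieces listed above. The API title, backend address and parameter names are placeholders, and the real file in the repo may differ.

    # Illustrative OpenAPI 2.0 config; names and the backend address are placeholders
    swagger: '2.0'
    info:
      title: stream-api
      description: Single gateway routing events to PubSub via a Cloud Function
      version: 1.0.0
    schemes:
      - https
    produces:
      - application/json
    paths:
      /{business_function}/{event_id}:
        post:
          summary: Publish an event to the matching PubSub topic
          operationId: publishEvent
          parameters:
            - in: path
              name: business_function
              required: true
              type: string
            - in: path
              name: event_id
              required: true
              type: string
          security:
            - api_key: []
          x-google-backend:
            address: https://europe-west2-api-stream-project.cloudfunctions.net/stream-to-pubsub
            path_translation: APPEND_PATH_TO_ADDRESS
          responses:
            '200':
              description: Message published
            '400':
              description: Bad request
    securityDefinitions:
      api_key:
        type: apiKey
        name: key
        in: query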

We can now use this to deploy the API Gateway using the gcloud commands below.
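
The three resources can be created roughly as follows; the IDs, region, spec file name and service account are placeholders to adjust for your own project.

    # 1. Create the API
    gcloud api-gateway apis create stream-api --project=api-stream-project

    # 2. Create an API config from the OpenAPI spec above
    gcloud api-gateway api-configs create stream-api-config \
      --api=stream-api \
      --openapi-spec=api-config.yaml \
      --project=api-stream-project \
      --backend-auth-service-account=gateway-sa@api-stream-project.iam.gserviceaccount.com

    # 3. Create the gateway that serves the config
    gcloud api-gateway gateways create stream-gateway \
      --api=stream-api \
      --api-config=stream-api-config \
      --location=europe-west2 \
      --project=api-stream-project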

Adding the API Key
The final step is to set up the API key and restrict its use to this API only. You could also limit the API usage to configured IP addresses, but we won’t cover that here.

Important Security Note

An API key only indicates whether the requestor has permission to use the API and which services it may access; it’s useful for blocking anonymous traffic and helping limit usage to legitimate traffic.

Because of this, other authentication methods such as OAuth should be used alongside HTTPS and API keys if transferring sensitive data.

You can read more about this here:
https://cloud.google.com/api-gateway/docs/authenticate-api-keys

Continuing on, we first need to make a note of two things, namely the Managed Service name of the API and the host name of the API Gateway. We can get these using the console, or by running these describe commands . . .
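
For example, using the placeholder resource names from earlier:

    # Managed service name of the API (ends in .apigateway.<project>.cloud.goog)
    gcloud api-gateway apis describe stream-api \
      --project=api-stream-project --format='value(managedService)'

    # Default host name of the gateway (something like <gateway>-<hash>.<region>.gateway.dev)
    gcloud api-gateway gateways describe stream-gateway \
      --location=europe-west2 --project=api-stream-project \
      --format='value(defaultHostname)'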

Now we need to enable the Managed Service using this command. Once run, it may take a few minutes before it’s available for the next step . . . creating the API key.
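
Assuming you stored the managed service name from the previous step in a variable:

    # MANAGED_SERVICE is the value returned by the apis describe command above
    gcloud services enable "${MANAGED_SERVICE}" --project=api-stream-project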

And after a few minutes (it needs time to process) you can generate the key, and restrict it to use your API only.
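
A sketch of this step is below; the display name is arbitrary, and depending on your gcloud version the api-keys commands may sit under the alpha or beta track.

    # Create an API key and restrict it to the API's managed service only
    gcloud services api-keys create \
      --display-name="stream-api-key" \
      --api-target="service=${MANAGED_SERVICE}" \
      --project=api-stream-project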

Once run, you can jump onto the console here, and make a note of your actual generated key.

Testing

Now that all the components are in place, it’s time to test the full process. The best way to do this is to send a message to our API using curl, and then monitor our PubSub queues for message delivery.

First we can set up a pull subscription on each topic (we can delete these afterwards).
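
For example, one temporary pull subscription per topic, using the illustrative topic names from earlier:

    gcloud pubsub subscriptions create sales-test-sub --topic=sales-topic --project=api-stream-project
    gcloud pubsub subscriptions create product-test-sub --topic=product-topic --project=api-stream-project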

Now we can send the test messages to our API Gateway using curl. Hopefully you saved the API Gateway host address earlier!
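
Using the host name and API key noted earlier, a test call might look like this; the business function (sales), event id (order-created) and payload are just example values.

    # POST a test event through the gateway (path and payload are illustrative)
    curl -X POST \
      "https://${GATEWAY_HOST}/sales/order-created?key=${API_KEY}" \
      -H "Content-Type: application/json" \
      -d '{"order_id": 123, "value": 42.50}'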

If successful, we can run the following to pull from each subscription, and all being well we should see our messages!
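
For the temporary subscriptions created above:

    # Pull (and acknowledge) any delivered messages from each test subscription
    gcloud pubsub subscriptions pull sales-test-sub --auto-ack --limit=5 --project=api-stream-project
    gcloud pubsub subscriptions pull product-test-sub --auto-ack --limit=5 --project=api-stream-project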

Conclusion / Next Steps

That’s pretty much all the steps you need to start ingesting data from the web and into Google PubSub queues. The next steps could be to ingest into BigQuery, write to Cloud Storage, or any action you wish.

One thing to remember is that if you have concerns about sensitive data being sent over, you will need to add additional security protocols/methods.

This method vastly expedited the development of data pipelines into Google Cloud that receive data in real time. Our IT Dev team could start sending us data simply by altering the URL, for instance, with no input from a Data Engineer.

Obviously this method raises questions around data quality and loose data contract schemas, but maybe that’s a discussion for another blog.

Thanks

Also, a big thanks to my colleague Keven Pinto for helping test and debug this. Make sure you check out some of his blogs, and others from our great colleagues at CTS, here.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!


Lee Doolan
Qodea Google Cloud Tech Blog

Cloud Data Warehouse Architect & Data Engineer | UK Based | https://www.linkedin.com/in/leedoolan77 | Thoughts are my own and not of my employer