Inner workings of Harness’s Cloud Billing Data Ingestion Pipeline for GCP
In this article series, we will see how we use an event-driven architecture to ingest GCP, AWS, and Azure billing data into our data pipeline for further processing and analysis.
In this article, we will look at the inner workings of the GCP billing data ingestion. Part 2 and Part 3 of this series cover the AWS and Azure billing data ingestion pipelines respectively.
Some background first:
GCP generates billing data for a project multiple times a day and stores it in a BigQuery dataset. The times at which GCP generates this data are not fixed, and there is not yet any direct eventing mechanism to tell us that new billing data has been generated. Also, billing data export as JSON/CSV is deprecated; the recommended way to export billing data is into a BigQuery dataset.
This left us with only a few options that could work well.
Schedule a query, say once every day, that copies the data from the customer’s BQ dataset into our BQ dataset, irrespective of whether new billing data has been generated. Once the data is available for consumption, we can process it and store it in any target datastore as needed.
Or run a job periodically that looks for changes in the ‘last updated’ timestamp of the source billing dataset/table, and start the copy only if ‘last updated’ has changed since the last execution.
We went with the latter approach. It has some benefits over the former: we learn of any change in the customer’s billing dataset sooner, so we can consume and process this data sooner. It also required fewer changes to our existing system, which means less chance of regression and a faster time to production.
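The ‘last updated’ check itself is cheap: BigQuery exposes a last-modified timestamp on every table. Below is a minimal sketch of that check, assuming the google-cloud-bigquery Python client; the table reference and the persisted last-sync timestamp are placeholders, not our actual schema.

```python
from datetime import datetime, timezone

from google.cloud import bigquery


def has_new_billing_data(client: bigquery.Client,
                         source_table: str,
                         last_synced_at: datetime) -> bool:
    """Return True if the customer's billing table changed since our last sync."""
    # e.g. source_table = "customer-project.billing_export.gcp_billing_export_v1_XXXX"
    table = client.get_table(source_table)
    # `table.modified` is the table's last-modified time as a UTC datetime.
    return table.modified > last_synced_at


# Example usage: compare against the timestamp we persisted after the previous run.
client = bigquery.Client()
if has_new_billing_data(client,
                        "customer-project.billing_export.gcp_billing_export_v1_XXXX",
                        datetime(2020, 1, 1, tzinfo=timezone.utc)):
    print("New billing data available; trigger the sync query.")
```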
At a high level, the pipeline looks like this:
1) GCP ingests new billing data for the customer’s project into the customer’s BQ dataset.
2) and 3) Our sync service impersonates a service account that has been granted the ‘BigQuery Data Viewer’ role on the customer’s BQ table, and checks whether ‘last updated’ has changed for this customer (rough sketches of steps 2 through 9 follow this list).
4), 5) and 6) If it has, we trigger a sync BQ query (transfer job) that copies the data into our BQ dataset.
(Steps 4 and 5 can be further broken up into PubSub + CloudFunction to make this operation asynchronous.)
7) Once the BQ query (transfer job) finishes, we publish an event to a PubSub topic to let downstream services know that new data is available for further processing and storage.
8) On the other side of the topic, a CloudFunction listens for this event and processes the raw billing data.
9) The CloudFunction loads the processed data into the destination BQ dataset tables.
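To make steps 2 through 6 concrete, here is a rough sketch of the impersonation and the sync query, assuming Google’s google-auth and google-cloud-bigquery Python libraries. The service account, project, dataset, and table names are hypothetical, and the query is deliberately simplified.

```python
from google.auth import default, impersonated_credentials
from google.cloud import bigquery

# Hypothetical service account that holds 'BigQuery Data Viewer' on the
# customer's billing table (and write access to our destination dataset).
SYNC_SA = "billing-sync@our-project.iam.gserviceaccount.com"
SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]


def run_sync_query(source_table: str, destination_table: str) -> None:
    """Copy the customer's billing export table into our BQ dataset."""
    source_creds, _ = default()
    creds = impersonated_credentials.Credentials(
        source_credentials=source_creds,
        target_principal=SYNC_SA,
        target_scopes=SCOPES,
    )
    client = bigquery.Client(project="our-project", credentials=creds)

    job_config = bigquery.QueryJobConfig(
        destination=destination_table,  # e.g. "our-project.billing.customer_raw"
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # In practice the query would be scoped to the partitions / usage months
    # that actually changed; a full SELECT * keeps the sketch simple.
    query = f"SELECT * FROM `{source_table}`"
    client.query(query, job_config=job_config).result()  # wait for the copy to finish
```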
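Step 7, publishing the “new data available” event, could look roughly like this with the google-cloud-pubsub client; the topic name and message schema here are illustrative.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical topic that downstream services subscribe to.
topic_path = publisher.topic_path("our-project", "billing-data-ready")


def notify_downstream(customer_id: str, raw_table: str) -> None:
    """Tell downstream services that fresh billing data has landed in BQ."""
    payload = json.dumps({"customerId": customer_id, "table": raw_table}).encode("utf-8")
    future = publisher.publish(topic_path, data=payload)
    future.result()  # block until PubSub acknowledges the message
```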
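And steps 8 and 9, the PubSub-triggered CloudFunction, might be sketched like this. The aggregation query is only a stand-in for the actual processing we do, and the destination table layout is hypothetical.

```python
import base64
import json

from google.cloud import bigquery


def process_billing_event(event, context):
    """Entry point for a background CloudFunction triggered by the PubSub topic."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(
        destination=f"our-project.processed_billing.{message['customerId']}",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Stand-in transformation: aggregate cost per service per usage window,
    # using columns from the standard GCP billing export schema.
    query = f"""
        SELECT service.description AS service_name,
               usage_start_time,
               SUM(cost) AS total_cost
        FROM `{message['table']}`
        GROUP BY service_name, usage_start_time
    """
    client.query(query, job_config=job_config).result()
```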
This is how we ingest the GCP billing data.
Last but not least, we use Terraform to manage our cloud resources, in this case the PubSub topic, the CloudFunction, and Stackdriver monitoring. Setting up appropriate monitoring for your cloud resources is also very important.
Using PubSub and CloudFunctions made the pipeline even better as these services are scalable and performant.
In Part 2 and Part 3 of this article series, we will see the billing data ingestion pipelines for AWS and Azure respectively.
That’s it for now!
Thank you for reading.
Please leave comments or suggestions.