Inner workings of Harness’s Cloud Billing Data Ingestion Pipeline for Azure
In this article, we will look at the internals of Harness CCM's data ingestion pipeline for Azure. We have built a multi-tenant system using an event-driven architecture that ingests Azure's Usage and Charges CSV file (the equivalent of the AWS Cost and Usage Report CSV) into our data pipeline.
I recommend reading our earlier articles on AWS and GCP billing data ingestion (Link 1 and Link 2).
In particular, we will see how we ingest the Azure Usage and Charges (UCF) data into a GCP BigQuery dataset. Harness runs on GCP, and we use BigQuery to analyze cloud spend.
Azure generates the UCF CSV once a day, based on the schedule set in the UCF export configuration in your Azure account.
Depending on the type of Azure account, this CSV file contains detailed billing data for one or all of your Azure subscriptions.
At a high level, the pipeline looks like this:
Prerequisite 1: Set up a daily export of month-to-date costs under the Cost Management service in your Azure account. Once set up, the CSV is delivered to an Azure storage account. We use this CSV for further processing and storage downstream.
We now have a daily month-to-date costs CSV available in the storage account.
This is a one-time activity.
Prerequisite 2: Create a multi-tenant Azure app in the destination (Harness) Azure account and register this app as a service principal in the source (customer) Azure account.
Creating an Azure app is pretty straightforward using the 'App registrations' service. Once the app is created, we get the app's client ID and client secret, which we use for authentication later.
By default, app registrations in Azure AD are single-tenant. To make the app multi-tenant, go to the Authentication panel of your app registration page in the Azure portal and set the supported account types to 'Accounts in any organizational directory'.
So why a multi-tenant app? What are the advantages?
The short answer is that we do away with storing customer app credentials within Harness. It also makes our onboarding flow easier.
This multi-tenant app approach can be extended to perform many more tasks on the source Azure account, provided the necessary permissions are granted to the corresponding service principal.
Register this app in the source Azure account as a service principal. This can be done using the following az commands in bash or Azure Cloud Shell.
1. Log in to the az CLI (source account) and switch to the subscription where the storage account is present:
$ az account set -s <subs id/name>
2. Register the service principal in the source Azure account:
$ az ad sp create --id <app id of multitenant app>
3. Apply the necessary permissions on the source storage account:
$ SCOPE=`az storage account show --name <storage account name> --query "id" | xargs`
The scope looks like this:
echo $SCOPE
/subscriptions/XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/<resourcegroupname>/providers/Microsoft.Storage/storageAccounts/<storage account name>
Then assign the role:
$ az role assignment create --assignee <app client id> --role 'Storage Blob Data Reader' --scope $SCOPE
At this point, we have a multi-tenant app whose home tenant is the Harness Azure account and which is registered as a service principal in the customer's Azure account, with the necessary permissions on the storage account where the CSV files are present.
This is also a one-time activity.
Now, the next sequence of steps consumes these CSV files whenever they are updated.
Step 1) Sync data across Azure storage accounts.
The "azcopy sync" CLI command supports syncing Azure storage accounts across regions.
Example:
azcopy sync "https://sourceaccount.blob.core.windows.net/container/directory/reportname?<SAS token>" "https://destinationaccount.blob.core.windows.net/container/directory/...?<SAS token>" --recursive
Using the Azure SDK and the app credentials, it is possible to generate short-lived SAS tokens for the storage accounts in question, as shown in the sketch below.
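As a rough illustration (not our production code), here is how such a short-lived, read-only SAS token could be generated with the azure-identity and azure-storage-blob Python packages using the multi-tenant app's credentials. The placeholder values in capitals are assumptions you would fill in; the resulting token is what gets appended to the container URL passed to azcopy sync above.
from datetime import datetime, timedelta, timezone

from azure.identity import ClientSecretCredential
from azure.storage.blob import (
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

# Placeholders: fill in with your own values.
TENANT_ID = "<source (customer) tenant id>"
CLIENT_ID = "<multi-tenant app client id>"
CLIENT_SECRET = "<multi-tenant app client secret>"
ACCOUNT_NAME = "<storage account name>"
CONTAINER_NAME = "<container name>"

# Authenticate as the multi-tenant app against the customer's tenant.
credential = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
service = BlobServiceClient(
    account_url="https://{}.blob.core.windows.net".format(ACCOUNT_NAME),
    credential=credential,
)

# A user delegation key lets us sign the SAS with Azure AD credentials
# instead of the storage account key.
start = datetime.now(timezone.utc)
expiry = start + timedelta(hours=1)  # keep the token short-lived
delegation_key = service.get_user_delegation_key(start, expiry)

sas_token = generate_container_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER_NAME,
    user_delegation_key=delegation_key,
    permission=ContainerSasPermissions(read=True, list=True),
    start=start,
    expiry=expiry,
)
print(sas_token)
Keeping the expiry to an hour or so means a leaked token is only briefly useful, which is the main point of generating the tokens on the fly rather than storing long-lived ones.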
This sync step needs to run periodically to pick up any updates in the source Azure storage account (we do this using Spring Boot Batch).
Having our own Azure storage account in the pipeline gives us a few advantages: it acts as a store from which we can replay data into the pipeline, and it keeps the pipeline simpler to set up without complicating the customer onboarding steps.
Step 2) Sync data from the Azure storage account to a GCS bucket.
We use GCP's Storage Transfer Service to sync from Harness's Azure storage account to a GCS bucket.
To accomplish this, we use a SAS token for the Harness Azure storage account.
At the end of this step, we have the data in GCS.
Configuring GCP Storage Transfer via the console does not allow setting an hourly frequency, nor does it allow adding a Pub/Sub topic for transfer completion events. This is, however, possible via the client SDKs/REST APIs.
Here is some sample Python code:
import datetime
import json

import googleapiclient.discovery

GCP_PROJECT_ID = ""
AZURE_SOURCE_CONTAINER = ""
AZURE_SOURCE_SAS = ""
AZURE_SOURCE_STORAGE = ""
GCS_SINK_BUCKET = ""
GCP_PUB_SUB_TOPIC = "projects/{}/topics/blah".format(GCP_PROJECT_ID)

def create():
    """Create an hourly recurring transfer job from Azure to GCS."""
    storagetransfer = googleapiclient.discovery.build('storagetransfer', 'v1')

    # Naming convention: storageaccount/container-gcsbucket_frequency
    description = "AZURE TO GCS"

    start_date = datetime.datetime.now()
    # startTimeOfDay is interpreted in UTC. Assuming this runs on a machine in
    # IST (UTC+5:30), subtracting 5h20m yields the current UTC time + 10 minutes,
    # i.e. the first run starts about 10 minutes after the job is created.
    start_time = datetime.datetime.now() - datetime.timedelta(hours=5, minutes=20)

    # Edit this template with the desired parameters.
    transfer_job = {
        'description': description,
        'status': 'ENABLED',
        'projectId': GCP_PROJECT_ID,
        'schedule': {
            'scheduleStartDate': {
                'day': start_date.day,
                'month': start_date.month,
                'year': start_date.year
            },
            'startTimeOfDay': {
                'hours': start_time.hour,
                'minutes': start_time.minute,
                'seconds': start_time.second
            },
            'repeatInterval': '3600s'
        },
        'transferSpec': {
            'azureBlobStorageDataSource': {
                'storageAccount': AZURE_SOURCE_STORAGE,
                'azureCredentials': {
                    'sasToken': AZURE_SOURCE_SAS,
                },
                'container': AZURE_SOURCE_CONTAINER,
            },
            'gcsDataSink': {
                'bucketName': GCS_SINK_BUCKET
            },
            'objectConditions': {
                # Only pick up objects modified within the last hour.
                'maxTimeElapsedSinceLastModification': '3600s',
                'includePrefixes': []
            },
            'transferOptions': {
                'overwriteObjectsAlreadyExistingInSink': True
            }
        },
        'notificationConfig': {
            'pubsubTopic': GCP_PUB_SUB_TOPIC,
            'eventTypes': ['TRANSFER_OPERATION_SUCCESS', 'TRANSFER_OPERATION_FAILED', 'TRANSFER_OPERATION_ABORTED'],
            'payloadFormat': 'JSON'
        },
    }

    result = storagetransfer.transferJobs().create(body=transfer_job).execute()
    print('Returned transferJob: {}'.format(json.dumps(result, indent=4)))

create()
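To verify that the hourly runs are succeeding, you can also list the job's operations via the same API (transferOperations.list). This is an optional, illustrative sketch in the same style as the sample above; TRANSFER_JOB_NAME is a placeholder for the job name returned by the create call.
import json

import googleapiclient.discovery

GCP_PROJECT_ID = ""        # same project as in the sample above
TRANSFER_JOB_NAME = ""     # the 'name' field returned by transferJobs().create()

def list_transfer_operations():
    """Print the metadata (status, counters, errors) of recent runs of the job."""
    storagetransfer = googleapiclient.discovery.build('storagetransfer', 'v1')
    # The filter is a JSON string identifying the project and the job(s) of interest.
    operation_filter = json.dumps({
        'project_id': GCP_PROJECT_ID,
        'job_names': [TRANSFER_JOB_NAME]
    })
    result = storagetransfer.transferOperations().list(
        name='transferOperations', filter=operation_filter).execute()
    for operation in result.get('operations', []):
        print(json.dumps(operation.get('metadata', {}), indent=4))

list_transfer_operations()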
Step 3) Downstream processing.
This is where we perform data processing using a set of GCP Pub/Sub topics and Cloud Functions.
A Cloud Function is triggered whenever there is an event on the Pub/Sub topic. The Cloud Function then processes the CSVs and ingests them into the destination BigQuery tables.
Depending on your needs, you can have multiple topics and Cloud Functions.
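As a rough illustration (not our actual function), a Pub/Sub-triggered Cloud Function that loads the synced CSVs into BigQuery could look something like the following. The GCS path, the table name, and the fields read from the notification payload are all hypothetical placeholders.
import base64
import json

from google.cloud import bigquery

# Hypothetical placeholders, not Harness's real names.
GCS_SOURCE_URI = "gs://<gcs sink bucket>/<container>/<directory>/*.csv"
BQ_TABLE = "<project>.<dataset>.azure_billing_raw"

def ingest_azure_csv(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st gen signature)."""
    # The Storage Transfer Service notification arrives as a base64-encoded
    # JSON payload describing the completed transfer operation.
    notification = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print("Transfer notification received:", notification.get("name"))

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,          # the export CSV has a header row
        autodetect=True,              # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(GCS_SOURCE_URI, BQ_TABLE, job_config=job_config)
    load_job.result()  # wait for the load to complete
    print("Loaded {} rows into {}".format(load_job.output_rows, BQ_TABLE))
In practice the function would do more processing than this before loading, but it shows the core trigger-and-load shape of this step.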
And that is it. That is how we ingest the Azure cloud spend data into BigQuery.
If the customer's storage account is behind a firewall, the above approach still works, provided the necessary IPs are whitelisted so that azcopy sync can run.
As you can see, we can sync data from N different source Azure accounts, and the system functions in exactly the same manner.
We use short-lived SAS tokens generated programmatically, which keeps this process secure.
And yes, we use Terraform to manage this pipeline: in this case GCS, Pub/Sub, Cloud Functions and Stackdriver monitoring. Don't forget to set up appropriate monitoring for your cloud resources!
Using Pub/Sub and Cloud Functions made the pipeline even better, as these services are scalable and performant.
That’s it for today. Please leave comments or questions below. Thank you for reading.