Our Azure bill is HOW much?? — Technology Tips

Dave Rothwell · Published in OS TechBlog · 11 min read · Jun 3, 2022

If you missed the introduction on process and tooling, click here.

This is quite a weighty post, but I have started with the technologies that we spent the most money on as a percentage of overall spending. Costs are based on April ’21 to March ’22.

Photo by Ian Battaglia on Unsplash

The Big Players

Virtual Machines (31% of spend)

VMs are our most expensive resources in Azure by far. Avoid VMs where possible. We didn't get into this game to run virtual machines as we did on VMware in 2004! Can you use App Services, Logic Apps, Table Storage or PaaS databases? Can you containerise the application? Containerisation doesn't work for all COTS workloads, but is your COTS application even the right tool for this particular job? If not, you may save licensing costs as well as Azure provisioning costs. We find anecdotally that applications that use more PaaS services cost (like-for-like) less money.

Here are some more ideas:

· Enable Auto Shutdown/Start-up where possible. This can be achieved with Azure Automation — see the sketch after this list.

· For build servers, add a pipeline step that checks whether the server is running and starts it on demand for DevOps; otherwise keep the machine turned off.

· Check machine sizes are on the lowest viable size using the recommendations in Azure Advisor.

· Proactively remove machines that are no longer required.

· Consider auto-scaling options (scale sets, Azure Automation).

· Consider purchasing Reserved Instances or Spot Instances. Azure VMs and other technology types can be ‘reserved’ and paid in advance for a minimum period of typically a year at a cheaper rate than the standard PAYG model. If you have a technology that you know will be running almost continually for a year, there may be a commercial benefit in purchasing the associated Reserved Instance. Azure Advisor will also recommend this if it sees prolonged usage. See Microsoft’s information on the topic.

· Consider using cheaper "burstable" B-series VMs, especially in scenarios where a VM must be running all the time but rarely performs heavy tasks, e.g. Azure DevOps self-hosted agents.

· Request the use of Azure Hybrid Use Benefits (AHUB) where appropriate (generally if a Windows machine will be on more than 50% of the time). Note that you may have to pay for the whole Enterprise Agreement year if that’s the deal you are on, regardless of when it is applied. This is one to evaluate three months before the true-up — add it to your calendar.
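To illustrate the Auto Shutdown/Start-up idea above, here is a minimal sketch of a runbook-style script using the azure-mgmt-compute SDK. The `AutoShutdownSchedule` tag name and the subscription placeholder are conventions invented for this example, not built-in Azure features.

```python
# Minimal sketch: deallocate tagged VMs out of hours, e.g. run nightly from
# Azure Automation. "AutoShutdownSchedule" is an example tag convention.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"

def deallocate_tagged_vms() -> None:
    client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    for vm in client.virtual_machines.list_all():
        if (vm.tags or {}).get("AutoShutdownSchedule") == "weeknights":
            resource_group = vm.id.split("/")[4]  # 5th segment of the resource ID
            # Deallocate (not just power off) so the compute is no longer billed.
            client.virtual_machines.begin_deallocate(resource_group, vm.name)

if __name__ == "__main__":
    deallocate_tagged_vms()
```

The key design point is deallocation rather than a plain stop: a stopped-but-allocated VM still incurs compute charges, while a deallocated one only pays for its disks.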

See also Azure Virtual Machine pricing documentation.

Storage (21% of spend)

This is another big area for an organisation such as OS. Key points are:

  • Use standard tiers over premium if viable.
  • Proactively remove storage that is no longer required.
  • There are special "Cool" and "Archive" tiers for data which won't be accessed regularly. Consider these if appropriate, but be cognisant of the access charges associated with them.
  • Ensure you use appropriate redundancy. From least to most expensive, the options are LRS -> ZRS -> GRS -> RA-GRS -> GZRS -> RA-GZRS. Most SLAs will be met with LRS or ZRS; ensure you have a sound reason to go beyond ZRS.
  • Consider Blob Storage Lifecycle Management, which lets you move data automatically between the Hot, Cool and Archive tiers OR delete blobs and containers automatically. It is a free service, although you are charged for the "List Blobs" and "Set Blob Tier" API calls. A minimal policy sketch follows this list.
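Here is a minimal sketch of what such a lifecycle policy looks like when scripted with the azure-mgmt-storage SDK (the account, resource group and "logs/" prefix are hypothetical placeholders):

```python
# Minimal sketch: tier blobs down and delete them as they age.
# Resource names and the "logs/" prefix are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

policy = {
    "policy": {
        "rules": [{
            "name": "age-off-logs",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blob_types": ["blockBlob"], "prefix_match": ["logs/"]},
                "actions": {
                    "base_blob": {
                        "tier_to_cool": {"days_after_modification_greater_than": 30},
                        "tier_to_archive": {"days_after_modification_greater_than": 180},
                        "delete": {"days_after_modification_greater_than": 365},
                    }
                },
            },
        }]
    }
}

# The policy name must be "default": there is one policy per storage account.
client.management_policies.create_or_update("example-rg", "examplestore", "default", policy)
```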

See also Azure storage pricing documentation.

Databases (18% overall)

I’ll start with general tips on databases and then break it down into some of the individual technologies that we use:

  • Do you even need a full RDBMS or Cosmos DB? Would using Azure table storage on Storage Accounts be enough to store non-relational data?
  • Scale down pricing tiers when not in use.
  • Check pricing tiers are at the minimum viable levels and not over-engineered.
  • Proactively remove databases that are no longer required.
  • Consider auto-scaling options.

Azure SQL (14%)

Some tips for saving in this area:

  • Scale down databases/pools when they are not in active use, such as nights, weekends, and holidays. This can be automated using tags — see the sketch after this list.
  • Enable auto-optimisation of databases if appropriate; this may reduce your DTU requirements.
  • Use the performance monitoring in the Azure Portal to see if you’re using the allocated DTUs.

    ◦ Note that DTUs and vCores can be interconverted — like inches and centimetres — eg: "1 vCore of General Purpose = 100 DTUs Standard"; more examples here.
    ◦ You may want to use DTUs for very small workloads.
    ◦ You may want to use vCores if you want finer control over memory vs. CPU allocations rather than throwing more DTUs at it. Also, if you use Azure Hybrid Benefits, you must use vCores (for no good reason that I can see).

  • Consider using elastic pools where possible:
    ◦ Particularly in dev and test environments.
    ◦ Even in production, where different databases' utilisation spikes at different times.
    ◦ See Microsoft guidance on patterns.
  • Serverless is another option, for single databases in the vCore model, where compute is charged based on usage. Serverless compute pauses during inactive periods. It is suitable for workloads which can tolerate some delay and have unpredictable usage patterns. More information can be found here.
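Here is a minimal sketch of the out-of-hours scale-down itself, using the azure-mgmt-sql SDK (the names and the target SKU are illustrative placeholders):

```python
# Minimal sketch: scale an Azure SQL database down to a cheaper tier out of
# hours; a morning job would run the same call with the larger SKU.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.databases.begin_update(
    resource_group_name="example-rg",
    server_name="example-sqlserver",
    database_name="example-db",
    parameters={"sku": {"name": "S0", "tier": "Standard"}},  # 10 DTUs
)
poller.result()  # block until the scale operation completes
```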

See also Azure SQL DB Pricing & Documentation.

Azure Database for PostgreSQL (3%)

Azure Database for PostgreSQL includes the following design considerations:

· Hyperscale (Citus) provides dynamic scalability without the cost of manual sharding and with little application re-architecture required. Distributing table rows across multiple PostgreSQL servers is a key technique for scalable queries in Hyperscale (Citus). Together, multiple nodes can hold more data than a traditional database and, in many cases, can use worker CPUs in parallel to execute queries, potentially lowering database costs. Follow the Shard data on worker nodes tutorial to try this cost-saving architecture pattern.

· Consider using the Flexible Server SKU for non-production workloads. Flexible servers provide better cost optimisation controls, with the ability to stop and start your server, and a burstable compute tier that is ideal for workloads that don't need continuous full compute capacity.

· Plan your Recovery Point Objective (RPO) according to your operational requirements. There's no extra charge for backup storage for up to 100% of your total provisioned server storage. Extra consumption of backup storage is charged per GB/month.

· Take advantage of the scaling capabilities of Azure Database for PostgreSQL to lower consumption costs whenever possible. The Microsoft Support article How to autoscale an Azure Database for MySQL/PostgreSQL instance with Azure runbooks and Python covers using runbooks to scale your database up and down as needed — a minimal sketch of the scale call follows this list.

· The cloud-native design of the Single Server service allows it to support 99.99% availability, eliminating the cost of a passive hot standby.
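To make the runbook idea concrete, here is a minimal sketch of the scale call using the azure-mgmt-rdbms SDK (the names and target SKU are illustrative placeholders):

```python
# Minimal sketch: scale an Azure Database for PostgreSQL (Single Server)
# down to 2 vCores out of hours. Names and SKU are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.rdbms.postgresql import PostgreSQLManagementClient

client = PostgreSQLManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.servers.begin_update(
    resource_group_name="example-rg",
    server_name="example-pgserver",
    parameters={"sku": {
        "name": "GP_Gen5_2",        # general purpose, Gen5, 2 vCores
        "tier": "GeneralPurpose",
        "family": "Gen5",
        "capacity": 2,
    }},
)
poller.result()
```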

Cost optimisation recommendation:

  • Consider reserved capacity for compute, via the Azure Database for PostgreSQL Single Server and Hyperscale (Citus) reservation discounts. Once you have determined the total compute capacity and performance tier for Azure Database for PostgreSQL in a region, you can use that information to reserve capacity. The reservation can span one or three years, and you can realise significant cost optimisation with this commitment.

Azure Cosmos DB (1%)

Only a small one for us, but make sure you are appropriately allocating Cosmos Request Units (RUs) — see How I learned to stop worrying and love Cosmos DB's Request Units | by Thomas Weiss | Medium. A minimal sketch of provisioning an explicit RU budget follows.
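For illustration, here is a minimal sketch of creating a container at a deliberate RU/s budget with the azure-cosmos SDK (the endpoint, key and names are placeholders):

```python
# Minimal sketch: create a container with explicit provisioned throughput
# rather than accepting defaults. Endpoint, key and names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("example-db")

# 400 RU/s is the minimum for manual provisioned throughput; start low and
# raise it only when the metrics show throttling (429 responses).
container = db.create_container_if_not_exists(
    id="items",
    partition_key=PartitionKey(path="/id"),
    offer_throughput=400,
)
```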

See also Azure Cosmos DB Pricing & Documentation.

The Long Tail

App Service (8% of spend)

Use the appropriate tier for a web application. Start with the “Shared” tier for dev and test instances wherever possible.

For production workloads, if multiple instances are needed, use "Custom Autoscale" for scale-out rather than manual scale (which gives a fixed instance count). Within autoscale, set an appropriate range for the instance limits — for example:
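As a sketch of what that looks like when scripted with the azure-mgmt-monitor SDK (the plan ID, region and thresholds are placeholders):

```python
# Minimal sketch: CPU-based autoscale on an App Service plan with explicit
# instance limits. Resource IDs, region and thresholds are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

PLAN_ID = ("/subscriptions/<subscription-id>/resourceGroups/example-rg"
           "/providers/Microsoft.Web/serverfarms/example-plan")

client.autoscale_settings.create_or_update(
    "example-rg",
    "example-plan-autoscale",
    {
        "location": "uksouth",
        "target_resource_uri": PLAN_ID,
        "enabled": True,
        "profiles": [{
            "name": "default",
            # Explicit limits: never fewer than 1, never more than 4 instances.
            "capacity": {"minimum": "1", "maximum": "4", "default": "1"},
            "rules": [{
                "metric_trigger": {
                    "metric_name": "CpuPercentage",
                    "metric_resource_uri": PLAN_ID,
                    "time_grain": "PT1M",
                    "statistic": "Average",
                    "time_window": "PT10M",
                    "time_aggregation": "Average",
                    "operator": "GreaterThan",
                    "threshold": 70,
                },
                "scale_action": {
                    "direction": "Increase",
                    "type": "ChangeCount",
                    "value": "1",
                    "cooldown": "PT10M",
                },
            }],
        }],
    },
)
```

In practice you would pair this with a matching "Decrease" rule (e.g. below 30% CPU) so the plan scales back in.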

Databricks (5% of spend)

Azure Databricks bills you for virtual machines (VMs) provisioned in clusters and Databricks Units (DBUs) based on the VM instance selected. A DBU is a unit of processing capability per hour.

In addition to virtual machines (instances), Azure Databricks will also bill for managed disks, blob storage, and public IP addresses.

The cost per DBU is dependent on Tier (pricing plan) and Workload.

Details of standard vs premium can be found here: https://databricks.com/product/azure-pricing
You are also charged depending on the extent to which you take advantage of Databricks capabilities (this is described as "Workload"). There are three "workload types" defining that functionality:

  1. Data Engineering Light — Run Apache Spark batch applications
  2. Data Engineering — Run batch applications on Databricks’ optimized runtime for higher reliability and performance
  3. Data Analytics — Use the Azure Databricks workspace to collaborate on projects, notebooks and experiments

Databricks processing clusters

Processing instances are grouped into clusters. Choosing the minimum required number and type of instances in a cluster can deliver significant savings. Make sure you consider the cost implications when setting a cluster autoscaling range (min/max instances) — for example:
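A minimal sketch of a cost-conscious cluster definition via the Databricks Clusters REST API (the workspace URL, token and node type are placeholders):

```python
# Minimal sketch: create a Databricks cluster with a tight autoscale range
# and an auto-termination timeout. URL, token and node type are placeholders.
import requests

WORKSPACE = "https://<workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "example-etl",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    # Keep the range narrow: every extra worker is an extra VM plus DBUs.
    "autoscale": {"min_workers": 1, "max_workers": 4},
    # Terminate idle clusters so you stop paying for them.
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```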

There are 5 major instance types:

  • General purpose
  • Memory optimized
  • Storage optimized
  • Compute optimized
  • GPU

In addition to that, there are specific categories within each type for particular performance requirements. Details of how your choice translates into a cost per hour are available here: https://azure.microsoft.com/en-us/pricing/details/databricks/

  • If Databricks runs out of memory on the cluster, it saves data to the GRS Databricks storage by default.
  • Geo-redundant storage (GRS) can be very expensive. To avoid GRS costs, where possible don't use the DBFS storage provided by Databricks (which is GRS by default and cannot be deactivated). Instead use an external data lake / storage account.
  • In tech terms, this means avoiding .saveAsTable and any location that starts with /dbfs/…. Always use a mounted storage account or a wasb/abfss path — see the sketch after this list.
  • Moving data between different regions is costly.
  • Don’t transfer data between regions (eg: UK South to North Europe) if you can avoid it.

Monitoring and Logging (3% of spend)

· Ask why you need to collect data and ensure you collect only the data you need.

· We accept that this is one that will increase as we mature especially as we add Azure Policy to enforce logging.

· Log once and use it many times where possible.

· Log to a common Log Analytics workspace — a minimal sketch of routing diagnostics there follows.
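For illustration, a minimal sketch of routing one resource's diagnostics to a shared workspace with the azure-mgmt-monitor SDK (the resource ID, workspace ID and log category are placeholders):

```python
# Minimal sketch: send one resource's diagnostics to a shared Log Analytics
# workspace. IDs and the log category are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

RESOURCE_ID = ("/subscriptions/<subscription-id>/resourceGroups/example-rg"
               "/providers/Microsoft.Web/sites/example-app")
WORKSPACE_ID = ("/subscriptions/<subscription-id>/resourceGroups/shared-rg"
                "/providers/Microsoft.OperationalInsights/workspaces/shared-logs")

client.diagnostic_settings.create_or_update(
    RESOURCE_ID,
    "send-to-shared-workspace",
    {
        "workspace_id": WORKSPACE_ID,
        # Collect only what you need: each enabled category has a cost.
        "logs": [{"category": "AppServiceHTTPLogs", "enabled": True}],
        "metrics": [{"category": "AllMetrics", "enabled": False}],
    },
)
```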

Data Transfer (1.5% of spend)

This was 1% on Express Route PLUS 0.5% on Bandwidth.

Be mindful of data transfer charges for data going out of Azure — this is about £60 per TB, so it can add up! Use Express Route where you have it and where possible, as this is only £10 per TB.

Data upload and internal region transfers are free.

See also Azure data transfer pricing documentation.

Azure Site Recovery (1% of spend)

At Ordnance Survey we use ASR to replicate from on-prem to Azure. We are considering it for Azure-to-Azure in the future.

Only replicate servers where the SLA calls for it. This should be based on the list of priority services, which is kept under review. Of course, dependent services and possibly key test or dev systems may need to be included to meet that requirement.

Ensure the replica is deleted when a machine is decommissioned; otherwise it will show on the list of broken machines, so keep an eye on this list.

Logic Apps (1% of spend)

Default to the Consumption Plan pricing model (i.e. multi-tenant Azure Logic Apps).

Azure Logic Apps uses Azure Storage for any storage operations. With multi-tenant Azure Logic Apps, any storage usage and costs are attached to the Logic App. However, with single-tenant Azure Logic Apps, you can use your own Azure storage account. This capability gives you more control and flexibility with your Logic App’s data but note that the storage costs will accumulate on your storage account directly, so keep an eye on this.

Different triggers, actions, and payloads result in different storage operations and needs. For single-tenant Azure Logic Apps, you can get some idea about the number of storage operations that a workflow might run and their cost by using the Logic Apps Storage Calculator.

After you delete a Logic App, the Logic Apps service won’t create or run new workflow instances. However, all in-progress and pending runs continue until they finish and will continue to generate charges.

See also Plan to manage costs for Azure Logic Apps — Azure Logic Apps | Microsoft Docs.

Application Gateways (1% of spend)

Azure Application Gateways can be stopped when they are not in use (nights, weekends, holidays, etc). This will reduce the “compute” element cost of the service. See Application Gateways — Stop — REST API (Azure Application Gateway) | Microsoft Docs.​

You can automate stop/pause and start based on schedules. We have sometimes implemented this using start and end time tags — for example:
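A minimal sketch of the stop/start calls with the azure-mgmt-network SDK (resource names are placeholders):

```python
# Minimal sketch: stop an Application Gateway out of hours and start it again
# in the morning. Resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Evening schedule: stopping removes the compute element of the charge.
client.application_gateways.begin_stop("example-rg", "example-appgw").result()

# Morning schedule:
# client.application_gateways.begin_start("example-rg", "example-appgw").result()
```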

Use Application Gateway v2 where possible to benefit from the auto-scaling feature and better performance.

Kubernetes (1% of spend)

This is an area of growth for us, but the initial points are:

· Choose the correct node SKUs by understanding SLAs. Consider spot node pools and reserved instances for stable base load work.

· Shut down clusters when not in use and set auto shut down where appropriate as you would with virtual machines.

· Use limits and quotas to prevent inappropriate auto-scaling or scaling beyond reasonable limits (eg: max 10 instances) — see the sketch below.
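As a sketch, here is one way to cap an AKS node pool's autoscaler with the azure-mgmt-containerservice SDK (cluster, pool and VM size are placeholders):

```python
# Minimal sketch: an AKS user node pool whose autoscaler can never grow
# beyond 10 nodes. Cluster/pool names and VM size are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

client = ContainerServiceClient(DefaultAzureCredential(), "<subscription-id>")

client.agent_pools.begin_create_or_update(
    "example-rg",
    "example-aks",
    "workerpool",
    {
        "vm_size": "Standard_D4s_v3",
        "mode": "User",
        "count": 1,
        "enable_auto_scaling": True,
        "min_count": 1,
        "max_count": 10,  # hard ceiling on scale-out
    },
).result()
```

Kubernetes-side ResourceQuota and LimitRange objects complement this by stopping individual workloads from requesting the capacity that triggers scale-out in the first place.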

Messaging (0.2% of spend)

If you do use the full functionality of Service Bus, then ensure you match the appropriate tier of Service Bus to your specific requirements, as per this comparison table: https://azure.microsoft.com/en-gb/pricing/details/service-bus/. Premium Tier, for example, can be used to push messages from the bus to an Azure Event Grid (and Microsoft do promote this pattern — https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-to-event-grid-integration-concept), but the additional price you pay for this capability is considerable, so do think about alternative implementation approaches early to avoid the Premium Tier where possible.

Also, consider whether your messaging requirement really needs a Queue or Service Bus — both require consumers/subscribers to poll for messages, and costs build up if this is done regularly. Many lightweight messaging scenarios are better handled with events using Azure Event Grid, and are considerably cheaper as a result, particularly where there are only occasional events. Event Grid now offers advanced capabilities like a dead-letter queue, "at least once" delivery and automatic retry, just like Service Bus. A minimal publishing sketch follows.
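A minimal publishing sketch with the azure-eventgrid SDK (the topic endpoint, key and event shape are placeholders):

```python
# Minimal sketch: publish a lightweight event to an Event Grid topic instead
# of having consumers poll a queue. Endpoint, key and payload are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridPublisherClient, EventGridEvent

client = EventGridPublisherClient(
    "https://example-topic.uksouth-1.eventgrid.azure.net/api/events",
    AzureKeyCredential("<topic-key>"),
)

client.send([EventGridEvent(
    subject="orders/1234",
    event_type="Example.OrderPlaced",
    data={"orderId": "1234"},
    data_version="1.0",
)])
```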

Bit Players

Some other technologies are used at Ordnance Survey to a more minor extent or where we don’t have whole year figures. Some tips are below.

Functions

Use the Consumption plan to minimise costs, and only upgrade to the Premium plan if you need ultra-low-latency responses such as synchronous request/response, e.g. for UI-type scenarios.

Azure Machine Learning

Positives

  • On-demand scale-up and scale-down of compute clusters for large jobs — an intrinsic part of the design to avoid wasted cost by leaving compute running after processing is done
  • GPU and CPU clusters with configurable idle timeouts and support for low priority instances (significant savings for dev/test work)
  • Container-based deployment of jobs — ease of dependency management without having to do custom deployment
  • Straightforward to manage via CI — e.g. the nightly pipeline to shut down notebook VMs after 7pm

Negatives

  • Lack of detail in cost reporting (e.g. you can’t see specific compute instances and types, only workspaces and resource groups)
  • Potential hidden costs that surface over time and add up, though small (e.g. the issue we found with load balancers being billed for stopped VMs, or workspace storage accumulating a lot of experimental data that’s not easy to automatically prune)
  • Lack of programmatic access to monitoring information (this is apparently something that MS is working on) including no AML-specific breakdown within the Sustainability Calculator (like there is for Databricks)

There are some technologies that are part of the cost of using Azure where it's difficult to have much impact. Here is the percentage of our Azure spend on those items.

  • Advanced Threat Protection — 2%
  • Azure DevOps — 1%, make sure you only add testing capability and users when needed, and tidy up when people move on to different projects
  • Virtual Network — 0.7%
  • Load Balancer — 0.5%, which is quite a lot for the amount we use it!
  • Backup — 0.5%, although we use other technologies too for backup
  • API Management — 0.4%
  • Azure Cognitive Search — 0.3%
  • Marketplace — 0.3%, make sure you remove something when you’ve finished with it
  • Event Hubs — 0.2%
