Azure cost containment exercise

Raja Shekhar Chava
Jan 29 · 8 min read

It has been quite a journey for TeleTracking Technologies to design, implement and deploy a new platform on the cloud. It was a memorable experience evaluating different PaaS tools and finally concluding that building our own platform and offering it as SaaS (on Azure) to our customers was the best option.

Once we embarked on this journey, we scrambled our jets and went in full force designing, developing and hosting the new cloud platform. It was kind of a free-for-all when we were creating cloud resources for the different environments. Our primary goal was to create the skeleton of the platform and showcase the minimum viable products as soon as possible.

The environments multiplied, and so did the number of cloud resources. Who wouldn't appreciate the speed and convenience of adding a few more VMs (Virtual Machines) without any additional physical space or immediate CapEx (Capital Expenditure)?

The best part of cloud hosting is zero CapEx; there is only OpEx (Operating Expenditure). Those hourly resource charges may seem small, but running resources 24x7 (regardless of usage) makes OpEx costs creep up quickly. For our own internal reasons, we also ended up maintaining multiple versions of the software in both production and non-production.

Over time, cloud spending inched higher month over month until it reached a couple million dollars a year. It did not take much to realize that our hosting costs were not sustainable.


Cost Containment Exercise

A team was formed to reduce the cloud costs. We quickly evaluated the monthly hosting costs across our subscriptions, prepared a list of resources with their types and monthly spend, and came up with a plan.

We focused our efforts on three main areas.

  • Delete: remove unused resources
  • Right size: evaluate and choose the correct size for each resource
  • Shutdown: shut down or downsize environments when not in use

Once the plan had been finalized, we rolled our sleeves up and swung into action.

Delete

As part of this exercise, we identified the billable resources and broadcast the list to all potential owners in the organization. We then waited for the teams to claim their resources and document their purposes.

Over time, we learned that an eye-popping number of resources were either unclaimed or no longer needed. Further investigation revealed that many of the resources marked for deletion had been created based on arbitrary standards. For example, in some instances the standard was that an environment needed X number of VMs regardless of immediate usage.

After a couple of email broadcasts, all the VMs that were no longer required, along with their corresponding resources (storage disks, NICs, etc.), were purged. This was a huge win and produced a quick downward trend in the monthly bill.

Lessons Learned:

  • Always review new resource requests
  • Assign resource owners
  • Create temporary resources in Azure labs subscription and make sure they are deleted after use

Right Size

Our environments were created at different times and by different teams. Lacking a standard, or an understanding of the correct size in most cases, many VMs were created with differing configurations. There were around 30 different VM sizes in use before our reassessment. Most of them served the same or similar purposes but belonged to different environments.

We did an internal audit and conducted in-person interviews to understand the requirements (memory- vs. compute-optimized, standard vs. SSD disks, etc.) and explored the Azure VM offerings based on those needs. The B-Series (burstable) VMs were truly a blessing for our usage. As the name suggests, these VMs can burst CPU performance (by spending accumulated credits) when needed, and they accumulate credits while usage is low. We also limited SSD disks to certain VMs, such as the SQL and BizTalk servers.
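Once a suitable size is identified, resizing is a small scripted change. As a rough sketch (the VM and resource group names below are hypothetical; check Get-AzureRmVMSize to confirm the target size is available in your region):

```powershell
#Sketch: resize an over-provisioned VM to a burstable B-series size.
#"dev-web-01" and "Dev_Compute" are placeholder names.
$vm = Get-AzureRmVM -ResourceGroupName "Dev_Compute" -Name "dev-web-01"
$vm.HardwareProfile.VmSize = "Standard_B2ms"
Update-AzureRmVM -VM $vm -ResourceGroupName "Dev_Compute"
```

Keep in mind that resizing restarts the VM, so schedule it outside business hours.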

Of course, we were not right the first time, but we eventually right-sized them all. These changes resulted in recurring savings on the bill.

Lessons Learned:

  • Always review new resource requirements and understand their purposes
  • Identify the right size resource
  • Start with the minimal configuration whenever possible. For example, an Azure disk can be grown but never shrunk.
  • Use Microsoft's cost estimation tools, such as the Azure Pricing Calculator, to pick the right configuration.

Shutdown

Non-production environments like Development, QA, and Staging do not need to be up and running all the time, or at least that is the case for us. Regardless of usage, a cloud resource accumulates costs for as long as it is running.

We came up with an elaborate shutdown script to eliminate or reduce costs when the environments are not in use. The script is scheduled to run at the end of each day and performs the following actions.

  • De-allocate VMs that are not in use
  • Change SSD disks to standard while the VM is de-allocated
  • Reduce the price tier of the SQL elastic pool and app service plan, pause the MongoDB cluster, etc.
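The first of these steps can be sketched in a few lines of Azure PowerShell. This is a simplified illustration with a placeholder resource group name; a real nightly script would filter VMs by tag and handle errors:

```powershell
#Sketch: de-allocate every VM in a resource group at the end of the day.
#"Dev_Compute" is a hypothetical resource group name.
#Stop-AzureRmVM de-allocates the VM, so compute charges stop accruing.
$vms = Get-AzureRmVM -ResourceGroupName "Dev_Compute"
foreach ($vm in $vms)
{
    Stop-AzureRmVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name -Force
}
```

A de-allocated VM keeps its disks and configuration, so it can simply be started again the next morning.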

Lessons Learned:

  • A paused MongoDB cluster is 90% cheaper than a running one
  • An app service plan can be set to its lowest tier when its app services are not in use
  • There is no need to pay for SSD disks while the corresponding VM is de-allocated

Apart from these activities, we also decommissioned a couple of environments and reduced the number of tenants in each environment.


The exercise was a huge success and reduced the monthly bill by a couple thousand dollars. But we have not stopped there; more changes were introduced after this exercise.

Tagging: All resource tags were standardized, and each VM is now configured to auto-shutdown by default. One of the tags is "owner", which records the team or individual that owns the resource. This has helped the teams keep tabs on their spending.
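Applying the owner tag can itself be scripted. A minimal sketch (the resource, group, and team names are placeholders):

```powershell
#Sketch: stamp an "owner" tag on a resource so spend can be attributed.
#"Dev_Compute", "dev-web-01", and "platform-team" are hypothetical names.
$res = Get-AzureRmResource -ResourceGroupName "Dev_Compute" -Name "dev-web-01"
$tags = $res.Tags
if ($null -eq $tags) { $tags = @{} }   #resource may have no tags yet
$tags["owner"] = "platform-team"
Set-AzureRmResource -Tag $tags -ResourceId $res.Id -Force
```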

Automation: Any new resource may only be created through automation scripts, which ensures resource standardization. Teams were instructed to create test resources in the Azure Labs subscription, which is configured to delete all of its resources at the end of each day. Since everything is deployed with scripts, the deletions did not bother the teams.

RBAC (Role-Based Access Control): Introducing RBAC brought permissions under control and limited which teams can manipulate which resources.

Review Board: Any change to the infrastructure is now reviewed by the change review board.


Here are a few areas to concentrate on when looking for ways to reduce Azure hosting costs.

App Service Plans: Whether the underlying app services are running or not, app service plan charges continue to accrue. There are a couple of ways to reduce these costs.

  • Each app service plan can host more than one app service. See whether you can tuck more services under the same plan.
  • Scale up: There are multiple pricing tiers available under each plan. See whether you can move to a lower tier and still get the expected application performance. For non-production environments, the Standard (S1–S3) tier may be enough. Evaluate performance metrics for the past couple of months to pick the right size.
  • Scale out: The production tiers allow automatic scale-out of services. For example, the S tier allows up to 10 instances of an app service to run based on load, and it can be configured to automatically scale back in when the load drops.
  • You can always change to a lower price tier when the environment is not in use. For example, the S3 tier costs $0.40/hr vs. $0.10/hr for S1. It may look like a small amount, but with 12 plans per environment across 4 environments, scaling back to S1 while the environments were shut down for 8 hours a day saved us a few thousand dollars per month.
#Change app service plans to a lower tier to save costs when the environment is down
$plans = (Get-AzureRmAppServicePlan -ResourceGroupName "Dev_ApplicationServices1").Where{$_.Sku.Size -eq 'S3'}
$plans += (Get-AzureRmAppServicePlan -ResourceGroupName "Dev_ApplicationServices2").Where{$_.Sku.Size -eq 'S3'}
foreach ($plan in $plans)
{
    #Set the app service plan to Small (S1), $0.10/hour
    Set-AzureRmAppServicePlan -ResourceGroupName $plan.ResourceGroup -Name $plan.Name -WorkerSize Small -NumberofWorkers 1
}

SQL Elastic Pool: A SQL elastic pool can host multiple databases and can be configured per your needs. We consolidated all our non-production environments into one pool with a lower price tier.

  • Price Tier: A standard elastic pool can be configured with up to 3000 DTUs and supports up to 200 databases. Evaluate the eDTU usage metric to determine the correct size.
  • Auto scale: A SQL elastic pool does not support auto scaling, but you can configure alerts and invoke a runbook to dynamically increase or decrease the DTUs based on your needs. Read my article on invoking runbooks to learn more.
#Reduce DTUs to the lowest (50 DTU, 50GB storage) as the environments are shutting down
if((Get-AzureRmSqlElasticPool -ElasticPoolName "Devpool" -ResourceGroupName "Devdatastores" -ServerName "dev01").Dtu -gt 50)
{
    Set-AzureRmSqlElasticPool -ElasticPoolName "Devpool" -ResourceGroupName "Devdatastores" -ServerName "dev01" -Dtu 50 -StorageMB 51200 -DatabaseDtuMax 50 -AsJob
}

VMs and Disks: The size of a VM and its attached disks (OS, data) determine its operating cost. Evaluate past usage to determine the correct size for your needs.

  • Always start with a small size for both the VM and its disks. A VM's size can be changed when it is de-allocated. Similarly, disk size and type can be changed when the disks are not in use. Keep in mind that disk size can only be increased.
  • B-Series: A burstable-series VM accumulates credits while it runs below capacity, and those credits are spent when demand bursts. Credits are retained only while the VM is running and are lost when it is de-allocated.
  • Reserved instances: If you are sure of a VM's size and need it for the foreseeable future, you can reserve VMs at a fixed price. This shields you from price changes and comes with a discount.
#Change the OS and data disks from SSD to standard
#Disk types: Standard_LRS, Premium_LRS, StandardSSD_LRS, UltraSSD_LRS
$disktype = "Standard_LRS"
$TargetDiskTypeName = "Standard"
$DiskConfig = New-AzureRmDiskUpdateConfig -AccountType $disktype
$vm = Get-AzureRmVM -Name $VmName -ResourceGroupName $ResGrpName
$SProfile = $vm.StorageProfile

#Change each data disk to the target type if it differs
$dataDisks = $SProfile.DataDisks
foreach ($disk in $dataDisks)
{
    $ddisk = Get-AzureRmDisk -ResourceGroupName $ResGrpName -DiskName $disk.Name
    $tags = $ddisk.Tags
    if($ddisk.Sku.Tier -ne $TargetDiskTypeName)
    {
        Update-AzureRmDisk -DiskUpdate $DiskConfig -ResourceGroupName $ResGrpName -DiskName $disk.Name
        #Set the tags back
        $res = Get-AzureRmResource -Name $disk.Name -ResourceGroupName $ResGrpName
        Set-AzureRmResource -Tag $tags -ResourceId $res.Id -Force -AsJob
    }
}

#Change the OS disk type to Standard
$OSDisk = $SProfile.OsDisk
$ddisk = Get-AzureRmDisk -ResourceGroupName $ResGrpName -DiskName $OSDisk.Name
$tags = $ddisk.Tags
if($ddisk.Sku.Tier -ne $TargetDiskTypeName)
{
    Update-AzureRmDisk -DiskUpdate $DiskConfig -ResourceGroupName $ResGrpName -DiskName $OSDisk.Name
    #Set the tags back
    $res = Get-AzureRmResource -Name $OSDisk.Name -ResourceGroupName $ResGrpName
    Set-AzureRmResource -Tag $tags -ResourceId $res.Id -Force -AsJob
}
Written by Raja Shekhar Chava

I am a Software Development Manager at TeleTracking Technologies. My technology interests are Azure cloud, DevOps and IaaS.
