A Very Helpful Service

Philip Fulgham
CarMax Engineering Blog
8 min read · May 30, 2024

This is the story of an innovative little product team and our quest for a new way of managing background tasks for a business-critical service. For six years, our service filled this need with Azure Functions, Microsoft’s serverless platform for .NET and other languages. That solution endured through our late-2019 upgrade from .NET Framework to what was then .NET Core, and our migration from in-process to isolated-process Azure Functions in early 2023.

Over the course of 2023, however, we noticed that our maintenance burden for our Azure Function apps was increasing substantially. As our product scaled, some of our functions reached runtimes several minutes long; some occasionally hit the platform’s 10-minute timeout, even though their average runtimes were far shorter. Troubleshooting tasks on our Kanban board languished “in progress” for weeks with no resolution, and each new problem added to our general sense that we should consider a fresh approach to the impacted functionality.

Azure Functions is an elaborate platform that uses a myriad of bindings to enable all sorts of creative input and output scenarios. In our case, however, a close look revealed that we weren’t really leveraging this advanced functionality. All of our functions were timer triggers, HTTP triggers (including an Event Grid subscriber), or Event Hub or Service Bus listeners. Most of that is easy enough to do in a standard ASP.NET Core application. So, we explored what other approaches we could take to handle the one piece that is less straightforward: timer-based, or scheduled, tasks.

It turned out that Azure Functions wasn’t as necessary for this work as we’d previously thought. We were able to develop a new architecture within Azure that met all of our needs, significantly reduced the amount of esoteric code in our solution, and substantially improved the reliability and performance of our background infrastructure. Let’s look at how it fits together, starting with the centerpiece of it all.

App Service

The core of the new system is about the most unassuming thing you can imagine (in Azure): an App Service. We call it Helper, and it’s a satellite service that lives alongside our main product in Azure, sharing Key Vault, App Configuration, and other resources. It’s a .NET 8 application in the same solution and GitHub repository as the main product, enabling a lot of code to be shared and reused between the two as well.

Helper has a range of endpoints that correspond to short-running (generally seconds long) recurring tasks that we use to keep our service happy and healthy. It also sports a reusable pattern for Event Hub listeners using the ASP.NET Core hosted services feature. And it’s set up as an Event Grid subscriber as well; in fact, it’s now the Event Grid subscriber for our service, a role that used to be handled by satellite Azure Functions. The app runs as a pair of instances on App Service’s powerful and cost-effective Premium v3 tier. So far, even the smallest SKUs in this tier have been more than sufficient to handle everything we need, even with a very high volume of incoming events. Our Azure Functions were already running on a dedicated App Service plan due to some scaling challenges we experienced on the Consumption plan, so this wasn’t a significant cost increase for us.
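As an illustration of the hosted-service pattern, here’s a minimal sketch of an Event Hub listener built on the EventProcessorClient from the Azure.Messaging.EventHubs.Processor package. The class name and handler bodies are hypothetical; our reusable pattern wraps this shape in a shared base class.

using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Processor;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class VehicleEventListener : IHostedService
{
    private readonly EventProcessorClient _processor;
    private readonly ILogger<VehicleEventListener> _logger;

    public VehicleEventListener(EventProcessorClient processor, ILogger<VehicleEventListener> logger)
    {
        _processor = processor;
        _logger = logger;

        // Handlers must be attached before processing starts.
        _processor.ProcessEventAsync += OnEventAsync;
        _processor.ProcessErrorAsync += OnErrorAsync;
    }

    public Task StartAsync(CancellationToken cancellationToken) =>
        _processor.StartProcessingAsync(cancellationToken);

    public Task StopAsync(CancellationToken cancellationToken) =>
        _processor.StopProcessingAsync(cancellationToken);

    private async Task OnEventAsync(ProcessEventArgs args)
    {
        // Handle the payload, then checkpoint so a restart resumes from here.
        await args.UpdateCheckpointAsync();
    }

    private Task OnErrorAsync(ProcessErrorEventArgs args)
    {
        // Log and continue; the processor keeps running after handler errors.
        _logger.LogError(args.Exception, "Event Hub listener error");
        return Task.CompletedTask;
    }
}

Registering the listener with builder.Services.AddHostedService<VehicleEventListener>() ties its lifetime to the app’s startup and shutdown.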

Two other pieces of infrastructure interact with Helper to handle scheduled tasks and longer-running tasks.

Logic Apps

We explored a number of different approaches to handling scheduled tasks without Azure Functions, including a rather advanced product called Azure Batch that proved overly sophisticated for our needs, but the one that ultimately stood out to us was Azure Logic Apps. With Logic Apps, it was relatively simple to use the Azure Portal to design a workflow that would run on a schedule, retrieve an API key from Key Vault, and then use it to make an HTTP call to Helper. Once crafted, this flow was reusable for any HTTP call we wanted to make, thanks to infrastructure-as-code.
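On the receiving side, the Helper endpoints these workflows call are ordinary ASP.NET Core endpoints. A minimal sketch, with a hypothetical route, header name, and task service (the real service validates keys more carefully):

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<CacheRefresher>(); // hypothetical task service
var app = builder.Build();

// A short-running recurring task, invoked by a scheduled Logic App that
// presents the API key it retrieved from Key Vault.
app.MapPost("/tasks/refresh-cache", async (HttpRequest request, CacheRefresher refresher, IConfiguration config) =>
{
    if (!request.Headers.TryGetValue("X-Api-Key", out var key) || key != config["Helper:ApiKey"])
    {
        return Results.Unauthorized();
    }

    await refresher.RefreshAsync();
    return Results.Ok();
});

app.Run();

public sealed class CacheRefresher
{
    public Task RefreshAsync() => Task.CompletedTask; // stand-in for real work
}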

The Bicep definition for a Logic App is somewhat arcane, but after we’d figured out the proper workflow once using the Azure Portal designer, we were able to export the template and create a Bicep module that abstracted away the details in favor of a simpler pattern we could repeat for each scheduled task. A reusable Key Vault connection and user-assigned managed identity for all our Logic Apps completed the package. Creating and using a Key Vault connection in the Portal designer gives you what you need for the workflows in the exported template. The Bicep for the Key Vault connection resource itself, however, is a little tricky to figure out, so I’ll share a reduced example here:

resource keyVaultConnection 'Microsoft.Web/connections@2016-06-01' = {
  name: 'logicapp-kv-connect'
  location: keyVaultLocation // like 'eastus'; must be the same region as your Key Vault
  properties: {
    api: {
      id: subscriptionResourceId('Microsoft.Web/locations/managedApis', keyVaultLocation, 'keyvault')
    }
    displayName: 'logicapp-kv-connect'
    // these next ones aren't documented (hence you'll see warnings), but they work fine
    parameterValueType: 'Alternative'
    alternativeParameterValues: {
      vaultName: 'your-keyvault'
    }
  }
}

Logic Apps have a robust interface in the Azure Portal, with a complete log of runs that can be explored, as well as the ability to trigger an off-schedule run manually at any time. Having access to this history without being affected by Application Insights sampling has proven very useful. What’s more, we’ve been using this approach in production for months, and the scheduled executions have been very dependable. We now have 55 Logic Apps and counting in our Azure subscription (including sandbox, QA, and production environments). Costs scale with the frequency of running the apps, ranging from less than a dollar per month for workflows that run five times an hour or less, to several dollars for tasks that run every minute, making it a particularly affordable solution for the hourly and daily tasks that comprise most of our needs.

Container Instances

Our service requires plenty of scheduled tasks that take only a matter of seconds to complete, but there are also several that take longer: anywhere from a couple of minutes to a few hours. There are multiple reasons why it isn’t practical to run these directly in Helper, including the potential for deployments to interfere with them. For these, we rely on a system we call Batch Jobs, originally created for previous task architectures and since substantially iterated upon and expanded to manage longer-running scheduled and on-demand background tasks.

Batch Jobs are container-based workloads that run on the Azure Container Instances platform. Our Batch Jobs worker is a console app that lives in our main solution and prepares and executes a single Batch Job command using ASP.NET Core-style configuration and dependency injection. Each available Batch Job is a class that follows a uniform contract (defined by an abstract base class), providing a consistent API to the worker while supporting virtually any operation we can do within our codebase. The worker and all class library code needed for the Batch Jobs are deployed as an image to an Azure Container Registry. Because containers can live as long as required to complete their tasks, there is no need to “fan out” large-batch operations using Service Bus or Storage queues; a single container can run an entire process from beginning to end without that extra overhead.
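To make that contract concrete, here’s a minimal sketch of what the base class and worker entry point can look like. The names here (BatchJob, RebuildSearchIndexJob) are hypothetical illustrations, not our actual types, and the real worker carries considerably more configuration and telemetry.

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Worker entry point: configuration and DI are set up the ASP.NET Core way,
// then the requested job is resolved by name and run to completion.
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddTransient<BatchJob, RebuildSearchIndexJob>();
using var host = builder.Build();

var jobName = args.FirstOrDefault()
    ?? throw new ArgumentException("A job name is required.");
var job = host.Services.GetServices<BatchJob>().Single(j => j.Name == jobName);
await job.RunAsync(CancellationToken.None);

// Hypothetical contract; each job implements one named, awaitable command.
public abstract class BatchJob
{
    // The command name the worker matches against its launch arguments.
    public abstract string Name { get; }

    // The single entry point; the container lives until this completes.
    public abstract Task RunAsync(CancellationToken cancellationToken);
}

public sealed class RebuildSearchIndexJob : BatchJob
{
    public override string Name => "rebuild-search-index";

    public override async Task RunAsync(CancellationToken cancellationToken)
    {
        // Long-running work runs straight through here; no fan-out queues needed.
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken); // stand-in
    }
}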

Helper has endpoints that are used to create Batch Job containers using the images from ACR, monitor their statuses, and clean them up after completion. We partnered with CarMax’s cloud engineers to create a custom Azure role, which gives Helper the ability to administrate these short-lived containers without requiring an unnecessary level of access to more durable or sensitive Azure resources. The most recent iteration of Helper can also programmatically determine and provide the names and options of available Batch Jobs, to power dynamic developer experiences such as admin portals, as well as alert developers to unexpected behavior in the system. Today this piece uses runtime reflection; we’re comfortable with this due to the limited volume and relaxed performance requirements of the feature, but we hope to eventually leverage a compile-time approach like source generation instead.
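For the curious, the reflection step can be as simple as scanning the assembly that holds the job types. A hedged sketch, assuming the hypothetical BatchJob contract above and, for simplicity, parameterless constructors (our real jobs are built through DI):

// Find every concrete job type and report the names the jobs expose.
public static class BatchJobCatalog
{
    public static IReadOnlyList<string> GetAvailableJobNames() =>
        typeof(BatchJob).Assembly
            .GetTypes()
            .Where(t => !t.IsAbstract && typeof(BatchJob).IsAssignableFrom(t))
            .Select(t => ((BatchJob)Activator.CreateInstance(t)!).Name)
            .ToList();
}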

We think the Batch Jobs system works best with Helper and Logic Apps in the driver’s seat, but it can also be managed by other means. We’ve done some work to make key elements of our Batch Jobs pattern available internally as a NuGet package to be reused across several services of our own, and to be used by other CarMax product teams that expressed interest in using our pattern. Another team in our organization is now using our package to power Batch Jobs for their newest service, but with Azure Functions as the management layer.

Caveats

The Azure Functions product exists for a reason. As discussed in the intro, our use cases for it were somewhat limited, which made it relatively easy for us to move to a different solution.

Migration might not be as obvious a choice for your own team. You may be using complicated patterns of input and output bindings that would require a lot of code to replicate in a standard ASP.NET Core app. For us, the developer time reclaimed from Azure Functions maintenance was more than enough to offset the cost increase of moving from a Consumption plan to a Premium v3 App Service plan, but that may not be the case for every team.

We chose to create Helper as a separate App Service because our product is a high-volume, high-criticality service, and we wanted to keep dedicated compute resources for it rather than share them with background tasks, which include their own high-volume event processing. Other teams may find it more cost-effective to have Logic Apps call endpoints built directly into their main service.

Our takeaway for you is not that all teams should immediately migrate from Azure Functions to an architecture like ours, but rather that some teams in a position similar to ours pre-migration might benefit from doing so. As always, consider pros and cons and decide what is best for your product and its specific needs and requirements.

Conclusion

The outcome of this architectural change has been very positive for us, in terms of both reliability and reduced maintenance effort. Since the migration, we’ve been able to continually add new functionality to Helper using the ASP.NET Core patterns we’re most familiar with, shifting a lot of time and effort from troubleshooting back to innovation. We’re excited to share this story now in the hope that others may benefit from our findings and be inspired to develop their own creative solutions!


Philip is a Principal Software Engineer at CarMax, working on backend search capabilities with .NET and Azure.