The path to the cloud
Data science allows us to answer questions from the broadest possible perspective without sacrificing the fidelity of the individual customer experience.
For example, we might want to know how customers get started in the cloud. How do they go from running everything on their own infrastructure to their first successful cloud deployments? The answer is complicated by the fact that everyone’s journey is unique. Mine will have been different from yours, and Microsoft’s will have been different from your company’s. But patterns emerge when we apply a few basic analytical techniques.
The setup
I have a couple of goals for this article: First, to identify the path that most data science teams take to get started with Azure, and second — since this is a data science article — to show how we find that answer. The analysis itself is a bit geeky, so if you’re more interested in the results, feel free to jump to the end, where I draw a few conclusions.
Let’s start by looking at Azure Machine Learning (AML), not Azure’s classic drag-and-drop Machine Learning Studio, but our more recent API-based service that makes it easier to develop, deploy, and manage machine learning (ML) models. How do data science teams go from being new to the cloud to doing machine learning at scale? Is AML among the first services they deploy, or is it something that only shows up later in their lifecycles? We can expect to see both patterns, of course, but which is more typical?
Combinations
We know that ML models require compute, and they require data, so we can expect that AML will be used in combination with both of those services. We don’t know, however, what other services customers might also have used, so we need to consider all of Azure’s services.
To do that, we create a matrix of customer-service combinations: Each row is a customer, each column is an Azure service, and we use Boolean values to indicate whether or not customers used individual Azure services, giving us a convenient numerical representation for our analysis.
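To make that concrete, here’s a minimal sketch of how such a matrix could be built with pandas, assuming a hypothetical long-format log of deployments (the column names and customer IDs are illustrative, not the actual dataset):

```python
import pandas as pd

# Hypothetical long-format deployment log: one row per (customer, service) deployment.
deployments = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B", "C"],
    "service": ["Virtual Machines", "Storage", "Virtual Machines",
                "App Service", "SQL Database", "Azure Machine Learning"],
})

# One row per customer, one column per service, True/False for "ever used".
usage_matrix = pd.crosstab(deployments["customer_id"], deployments["service"]) > 0
print(usage_matrix)
```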
Next, we simply look at the data.
In Chart 1 below, each dot represents a minimum combination of services: The x-axis is the number of services in a combination, and the y-axis is the percentage of customers who had deployed that combination. The shading represents the number of times that combination of services appeared in the data. The more instances, the darker the dot. (I’ll explain the red dot in a moment.)
The distribution is fairly skewed. Most combinations contain fewer than 10 services, the knee in our curve. In some cases, though, customers used scores of services before AML, indicating very complex cloud infrastructures. Those customers are hardly typical, though.
One thing that Chart 1 doesn’t show is how unique these combinations are: 92% occur only once. Because Azure has so many services, there are lots of ways to combine them!
Chart 2 shows us that diversity. At just 7 services, customers deployed nearly 400 unique combinations of services. Across the entire sample, there are thousands of unique combinations. Given so much diversity, we might naturally assume there are no “typical” patterns.
However, within those multiple unique combinations, there are common subsets of services.
Iterating through every combination, looking for common subsets, is definitely a brute-force exercise, but we can speed things up by evaluating only subsets that occurred naturally within our sample. We’re only interested in the steps that customers actually take, not in the ones they could theoretically take.
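As a rough sketch of that idea (with made-up customer data, and a cap on subset size to keep the enumeration tractable), we can generate candidate subsets only from the combinations customers actually deployed, then count how many customers’ deployments contain each one:

```python
from itertools import combinations

# Hypothetical per-customer sets of services deployed before AML.
customers = {
    "A": {"Virtual Machines", "Storage", "Virtual Network", "SQL Database"},
    "B": {"Virtual Machines", "Storage", "App Service", "SQL Database"},
    "C": {"Storage", "App Service", "SQL Database", "Virtual Network"},
}

# Candidate subsets: only combinations that actually occurred in someone's deployment.
candidates = set()
for services in customers.values():
    for size in range(2, 5):  # cap subset size to keep the sketch small
        candidates.update(combinations(sorted(services), size))

# For each candidate, the share of customers whose deployments contain it.
coverage = {
    subset: sum(set(subset) <= services for services in customers.values()) / len(customers)
    for subset in candidates
}

# Most common subsets first.
for subset, share in sorted(coverage.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{share:.0%}  {subset}")
```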
Turns out, before using AML, 28% of customers (the red dot in Chart 1) deployed a combination of services that contained Virtual Machines (along with the requisite bandwidth and storage), Virtual Network, Azure App Service, and SQL. I chose that point because it has the greatest point-to-point change in the percentage of customers (beyond just bandwidth and storage, which are part of nearly every Azure deployment).
Permutations
Of course, we also want to know when customers deployed these services, because that will give us a better sense of their stepwise journey.
To make things a bit easier, let’s simplify our perspective by looking at the deployment sequence of just the seven services (our subset plus AML) at a monthly grain. Ideally, we could look at the entire sequence of services — first to last — for every customer, but even for just seven services there are nearly 50,000 possible orderings. (Because services deployed in the same month tie, the count is the Fubini number: 47,293.) Indeed, over half of customers (53%) deployed our subset of services in an order that was unique to them.
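As a quick sanity check on that count: with ties allowed, we’re counting weak orderings of seven items, and the standard recurrence for the Fubini (ordered Bell) numbers reproduces the 47,293 figure:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def fubini(n: int) -> int:
    """Number of ways to order n services when ties (same-month deployments) are allowed."""
    if n == 0:
        return 1
    # Choose which k services share the first position, then order the remaining n - k.
    return sum(comb(n, k) * fubini(n - k) for k in range(1, n + 1))

print(fubini(7))  # 47293
```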
One way to reduce that complexity is by clustering. Because ordinals are discrete — first, second, third, etc. — we can quantify the similarity/difference between customers’ permutations by measuring their Manhattan distance. Consider the example in the table below: Customer A deployed App Service first, and customer B deployed App Service third, so the absolute value of the distance between A and B for that service is 2: |1–3| = 2. Across all services, the total distance between A and B is 6: 2 + 1 + 0 + 1 + 0 + 1 + 1 = 6.
The smaller the Manhattan distance between customers, the more similar their deployment sequences. A distance of 0 means that customers A and B deployed their services in the exact same sequence.
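In code, that calculation is just the element-wise absolute difference of the two customers’ rank vectors. The vectors below are hypothetical, chosen only to match the worked example (App Service is first for A, third for B, and the total distance is 6):

```python
import numpy as np

# Hypothetical deployment ranks (1 = first), one entry per service; ties are allowed.
customer_a = np.array([1, 2, 3, 4, 5, 6, 7])
customer_b = np.array([3, 1, 3, 5, 5, 7, 6])

manhattan = np.abs(customer_a - customer_b).sum()
print(manhattan)  # 2 + 1 + 0 + 1 + 0 + 1 + 1 = 6
```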
After calculating the Manhattan distance between every pair of customers, we can look at the overall distribution to get a sense of how far apart they are. The max distance is 21, but such large differences are rare. The distance between most customers is in the 2–5 range, and half are ≤4.
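With the rank vectors stacked into a matrix, SciPy’s pdist computes the full set of pair-wise Manhattan distances in a single call, and the distribution falls out directly (again with simulated data standing in for the real sample):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical rank matrix: one row per customer, one column per service.
rng = np.random.default_rng(0)
ranks = rng.integers(1, 8, size=(1000, 7))

# Condensed vector of Manhattan ("cityblock") distances for every customer pair.
distances = pdist(ranks, metric="cityblock")
print(distances.max(), np.median(distances))
```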
So while customers’ permutations are different, most aren’t that different.
We have to remember, though, that we’re looking at the pair-wise distance between every customer and every other customer. Each one will be near its neighbors and far from others. And, taken as a whole, there will be clusters of neighbors. Ultimately, we want to find the centers of those neighborhoods, because they’re the representative examples — the medoids — of typical deployment sequences. To find them, we can use an ML algorithm, partitioning around medoids (PAM).
Clustering is an unsupervised technique, so one important decision is the optimal number of clusters. (We can’t find the centers of our neighborhoods until we know how many neighborhoods there are!) In Finding Groups in Data, Leonard Kaufman and Peter Rousseeuw (who developed the PAM algorithm) recommend testing all possible cluster counts and selecting the one that returns the largest average silhouette width (ASW). To minimize the number of clusters, I opted for a slightly different approach, striking a balance between ASW and the point-to-point increase in ASW.
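A sketch of that workflow, using scikit-learn-extra’s KMedoids as one readily available PAM implementation (not necessarily the exact tooling used here) and the textbook largest-ASW rule, might look like this, again with simulated rank data:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

# Hypothetical rank matrix: one row per customer, one column per service.
rng = np.random.default_rng(0)
ranks = rng.integers(1, 8, size=(500, 7))

# Fit PAM for a range of cluster counts and record the average silhouette width (ASW).
asw = {}
for k in range(2, 11):
    labels = KMedoids(n_clusters=k, metric="manhattan", method="pam",
                      random_state=0).fit_predict(ranks)
    asw[k] = silhouette_score(ranks, labels, metric="manhattan")

best_k = max(asw, key=asw.get)  # Kaufman & Rousseeuw's rule: take the largest ASW
print(best_k, round(asw[best_k], 3))

# The medoids of the chosen solution are the representative deployment sequences.
final = KMedoids(n_clusters=best_k, metric="manhattan", method="pam", random_state=0).fit(ranks)
print(final.cluster_centers_)
```

The loop above simply takes the maximum ASW; the analysis in this article instead balances ASW against its point-to-point increase to keep the cluster count small.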
According to Kaufman and Rousseeuw, an ASW between 0.51 and 0.70 represents a reasonable structure (Finding Groups in Data, pp. 87–88), so at six clusters, our primary cluster is stable.
As you can see, the resulting within-cluster distances are much smaller than they were when we looked across the entire population:
Clusters 1 and 5 are both stable, and most of their members have a within-cluster distance of 0–2, so their medoids can’t be far from the clusters’ members. Clusters 1 and 5 are also the largest, containing 25% and 21% of the all-up population, respectively.
Initially, we wanted to identify the deployment sequence of just our primary cluster (cluster 1), shown here. Cluster 5 is interesting, too, because it’s nearly as popular. In it, the sequence number of every service in our subset is 1, meaning they’re all deployed together in the same month.
Conclusions
One of the nice things about clustering algorithms is that they allow us to identify the patterns that occur naturally in the data. With this analysis, we had hoped to get a better understanding of data science teams’ path to the cloud. Specifically, we wanted to know if AML is one of the first services they deploy or one that shows up only later. And while we suspected both might be true, it’s a bit surprising how equal they are.
The phased approach is slightly more popular. Customers start with a foundation of core services, then incrementally add services that support more sophisticated solutions. In fact, when we apply this same analysis to other Azure services, we find that the early stages are almost always identical. In a lot of cases, this is because customers are migrating existing solutions: They package their on-prem servers into VMs and redeploy them to the cloud. Later, to better manage those workloads in this new environment, customers add services like App Service, Active Directory, Key Vault, Security Center, and others.
Because migration is such a popular path, Microsoft offers tools and resources at the Azure Migration Center, step-by-step assistance with the Azure Migration Program, and a service, Azure Migrate, each of which can help customers through the migration process.
As we’ve learned, though, it’s nearly equally common for customers to go directly to AML. One reason is that AML makes it remarkably easy to build cloud-native solutions. Everything you need — compute instances, datasets, models, pipelines, etc. — is managed through an AML workspace. To get started, you can create a workspace in a number of ways:
· The Azure Portal for a point-and-click interface
· The SDK for Python or SDK for R for use with your language of choice (a minimal Python example follows this list)
· An Azure Resource Manager (ARM) template or the AML Command Line Interface Extension to automate or customize workspace creation
· Visual Studio Code, using the AML VS Code extension
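For example, with the Python SDK (azureml-core), creating a workspace is a single call. The workspace name, subscription ID, resource group, and region below are placeholders to substitute with your own values:

```python
# pip install azureml-core
from azureml.core import Workspace

ws = Workspace.create(
    name="my-aml-workspace",              # placeholder workspace name
    subscription_id="<subscription-id>",  # your Azure subscription
    resource_group="my-resource-group",   # created if it doesn't already exist
    create_resource_group=True,
    location="eastus",
)

# Save a local config file so later scripts can reconnect with Workspace.from_config().
ws.write_config()
```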
Once your workspace is set up, you can create a compute instance, which is a fully managed, cloud-based workstation for data scientists, pre-installed with the most common tools, including Jupyter, JupyterLab, and RStudio. You can build and deploy models by choosing from a selection of CPU or GPU instances. And for larger tasks, you can scale up from single- to multi-node compute clusters.
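As a rough illustration with the same Python SDK (resource names and VM sizes are placeholders), provisioning a single compute instance and a small autoscaling cluster looks like this:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeInstance, ComputeTarget

ws = Workspace.from_config()  # assumes the config file written at workspace creation

# A single-node compute instance: a managed workstation with Jupyter, JupyterLab, and RStudio.
instance_config = ComputeInstance.provisioning_configuration(vm_size="STANDARD_DS3_V2")
instance = ComputeTarget.create(ws, "my-instance", instance_config)

# A compute cluster that scales from zero to four nodes for larger training jobs.
cluster_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2", min_nodes=0, max_nodes=4
)
cluster = ComputeTarget.create(ws, "my-cluster", cluster_config)
cluster.wait_for_completion(show_output=True)
```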
For people interested in learning more about AML, Microsoft has designed a learning path, “Build AI solutions with Azure Machine Learning service”, a series of freely available training modules that provides a hands-on introduction to the service and its capabilities.
And, finally, for people interested in developing deeper expertise in cloud-based ML solutions, Microsoft offers certification for data scientists: See Microsoft Certified: Azure Data Scientist Associate on Microsoft Docs.
Ron Sielinski is on LinkedIn.