Microscaling on Ice — For Now

Anne Currie
Microscaling Systems

--

When Liz, Ross and I started working together on Microscaling Systems in 2015, we wanted to explore what was possible with containers and orchestration. Back then, Docker containers were still a new concept for packaging and deploying applications. We realised that beyond this use case, containers also offered many of the operational benefits of virtualisation, but with much faster (100+ times) instantiation and shutdown speeds, lower overheads and programmable control. We still believe the adoption of containers in data centres will change infrastructure and operations everywhere.

Sadly, at the start of 2017 we feel the timing isn't right for us to build an enterprise product based on either SLA-driven container scheduling (Microscaling) or container metadata (Microbadger).

So, the three of us are going to freeze things for a while to pursue other fascinating projects. However, we’ll still talk and write about what we’ve learned on Microscaling and Microbadger and about what we learn on our new projects.

With 10K badges in use, we know Microbadger and the Microbadger API are valuable resources for the community, so we intend to keep them running for as long as possible, and we're in discussions with lots of folk to find a home for Microbadger.

We've also been writing up our thinking and publishing it so nothing is lost. We're open-sourcing more code at microscaling/microscaling, and we've delivered one last real-world demo that pulls everything together and demonstrates what containers+orchestrators+metadata+custom schedulers can do.

From Microscaling Systems, ‘bye for now. We’ll be back.

In the meantime, here’s the story of how we got here…

The Story of Microscaling Systems

Back in 2015, our plan was to investigate ways in which container technology could be used to cut the cost of operating data centres and reduce their environmental impact. We wanted to try real-time, SLA-aware over-subscription to increase resource utilisation. We called this process Microscaling.

Microscaling lets enterprises “hot bunk” their data centre resources at super-speed, enabling the mix of applications running within a data centre to dynamically change, in real time, in response to current user demand. For example, you could ramp up the number of web servers at times of high user demand and then replace some of those processes with data-analysis batch processes when demand fell.

However, while we were researching and prototyping this technology, we saw a potential problem: getting hold of good enough metadata to guide the decisions of our scheduling agent. We needed metadata that answered questions like:

  • Which applications are tied to which metrics?
  • Which applications are most important?

Why Did We Care About All This?

Traditionally, if additional processing capacity is required by an application in a data centre, it's provided by adding new machines. That is, the size and power requirements of the whole data centre are increased.

Note: sometimes older machines are swapped out for more efficient new ones, which can provide more processing for the same power and running cost. Great! However, this is still the same philosophical approach of resource expansion.

Because increasing the size of a physical data centre is a relatively slow process, enterprises often buy and operate more on-prem resources than they need “just in case”. This results in over-provisioning (the acquisition and operation of machines that are not productive) and considerable inefficiency. To add to this inefficiency, applications often only use a fraction of the resources of the machine they’re running on, so servers are powered on but not fully used. In 2012, Gartner estimated that worldwide data centre efficiency was just ~10–15%. AWS have recently come up with similar on-prem estimates. That is very poor.

We wanted to investigate an alternative provisioning approach, where existing infrastructure was dynamically re-purposed by switching off less critical applications, at any given time, in order to provide additional space for applications that were more important to the enterprise at that moment.

Environmental Impact of DCs

In 2016, The Economist estimated that data centres use ~2% of the world's electricity, and demand for data centres is predicted to rise by between 50% and 400% by 2020 (The Economist again).

Projected growth in data centre demand outstrips the rate at which the world is expanding its renewable energy production. The additional electricity requirements will therefore have to be met by fossil fuel burning. We believe it doesn’t have to be this way.

Could it Work?

We reckoned our theoretical “hot bunking” approach required container instantiation speeds of <5 seconds in real world environments. Although we had read that results of <1 second had been achieved in labs, it was unclear whether these speeds could be reproduced in common field conditions such as a cloud data centre. Our first job was therefore to attempt to replicate sub-5 second container instantiation times on a common enterprise infrastructure.

Note: containers themselves take only milliseconds to start; they are only processes, after all. It's wrapping them in engines and orchestrators and service discovery and load balancers and all that other good stuff that adds the delays. Lambda is based on container technology, and AWS seem to have done some clever stuff, leveraging strict behavioural constraints, to get instantiation times closer to the raw milliseconds.

We chose to use Amazon Web Services (AWS) and Microsoft Azure as our test environments, as they were the two most popular enterprise cloud hosting environments. We also realised that we'd need to use an orchestrator as part of our set-up. Any results we achieved would have to be reproduced with at least one, and preferably several, orchestrators, which would probably slow down the speed at which we could instantiate or destroy containers.

We therefore built a software prototype and base algorithm that would test our sub-5 second hypothesis on several orchestrators with the Docker container engine. This required us to create an orchestrator abstraction layer (a common interface) that would work with multiple orchestrators. We open sourced this work and talked about it at several conferences and meetups. We also built the code into “Microscaling-in-a-box”, a self-contained demo so folk could play with a custom scheduler out-of-the-box and Microscale their own images.
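To give a flavour of what an orchestrator abstraction layer looks like, here is a minimal sketch in Python (the actual open-sourced microscaling code is written in Go, and the names below are illustrative, not the ones in that codebase):

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Common interface over orchestrators. A microscaling agent talks
    only to this interface, so the same scaling logic can drive
    several different orchestrators."""

    @abstractmethod
    def running_count(self, task: str) -> int:
        """How many containers of this task are currently running?"""

    @abstractmethod
    def start_task(self, task: str) -> None:
        """Ask the orchestrator to start one more container of this task."""

    @abstractmethod
    def stop_task(self, task: str) -> None:
        """Ask the orchestrator to stop one container of this task."""


class InMemoryScheduler(Scheduler):
    """Toy implementation for exercising scaling logic locally."""

    def __init__(self) -> None:
        self._counts: dict = {}

    def running_count(self, task: str) -> int:
        return self._counts.get(task, 0)

    def start_task(self, task: str) -> None:
        self._counts[task] = self.running_count(task) + 1

    def stop_task(self, task: str) -> None:
        self._counts[task] = max(0, self.running_count(task) - 1)
```

A real implementation of the interface would wrap the API of Kubernetes, Marathon or similar; the point is that the scaling agent never needs to know which orchestrator is underneath.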

We found we could achieve the target sub-5-second speed (4 seconds, in fact) on AWS and Azure with several orchestrators plus the Docker container engine. To do it, Ross had to alter the default configuration of the orchestrators, which was fiddly but doable.

Stability and Control Theory

We also wanted to experiment with various scenarios that simulated real-world demand to ensure our basic algorithms would not cause system instability. We chose to simulate web traffic and load on message queues.

During testing with several message queue applications (Amazon's SQS and the open-source NSQ) we found our original base algorithms led to system instability and over-correction. Liz speculated that control theory, typically used in physical systems rather than in software, could be useful in this situation. She therefore added simple control theory to our algorithms, which stabilised the system significantly. Again, we open sourced this code, Liz talked about it at several conferences and we included this in the Microscaling-in-a-box demo.
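To give a flavour of what "simple control theory" means here, a proportional controller responds to only a fraction of the error between target and observed state, which damps over-correction. This is an illustrative sketch, not the actual algorithm from the microscaling code:

```python
def desired_delta(target_len: int, observed_len: int, gain: float = 0.2) -> int:
    """Proportional control: scale by a fraction (the gain) of the error
    between observed and target queue length, rather than correcting the
    whole error at once. A gain below 1 damps the over-correction that
    causes oscillation."""
    error = observed_len - target_len
    return int(round(gain * error))
```

With a gain of 1 the agent chases every fluctuation in the queue and oscillates; a fractional gain trades a little responsiveness for stability.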

Metadata

As previously mentioned, though, we soon realised our tools needed metadata about the desired behaviour of any cohort of data centre applications (which were the most important, which SLAs they needed to meet, restrictions on their runtime environment, location, etc.).

A mechanism already existed for specifying build-time metadata for a containerised application (Docker Labels, which were drawn to our attention by the excellent Gareth Rushgrove of Puppet), but we were unsure how widely this mechanism was used. So, we started investigating the current state of provision of metadata for public images.
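For reference, Docker Labels are key-value pairs set at build time in the Dockerfile. A hypothetical example (the image and values are made up; the label names follow the label-schema.org convention we helped create, described below):

```dockerfile
FROM alpine:3.4

# Build-time metadata baked into the image as Docker Labels
LABEL org.label-schema.name="my-service" \
      org.label-schema.description="Example time-sensitive mini-service" \
      org.label-schema.vcs-url="https://github.com/example/my-service" \
      org.label-schema.schema-version="1.0"
```

Anything that can read the image (a registry tool, a scheduler, or a service like Microbadger) can then inspect these labels without running the container.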

We decided to prototype a SaaS tool that would programmatically extract and display all the metadata associated with any public container image on Docker Hub to see if they commonly included the basic information required for us to Microscale. We called this SaaS Microbadger. Extracting the metadata required us to reverse-engineer some of the contents of an image registry, which was somewhat fiddly, but we did it.

We looked at public images on Docker Hub, the most popular registry, and discovered something pretty shocking: fewer than 4% of public containerised applications included even basic metadata, like what the application was called! Another 4% included incorrect metadata they had accidentally inherited from the CentOS base image!

At this point, we set about trying to raise the profile of accurate metadata and highlight to the container community the advantages of creating and sharing metadata for images. We helped create the community project label-schema.org and we gave out free GitHub and Docker Hub badges to promote awareness of accurate metadata. Then we talked, and talked, and talked about it at conferences.

Full Circle: Microbadger is now Microscaled

Finally, we tied everything together. Our production Microbadger system uses two containerised, asynchronous mini-services. One handles incoming user requests and is time-sensitive (if people ask on our website for information about an image, we want to serve at least the basics very quickly). The second handles the longer-running tasks of talking to Docker Hub and getting the latest full data on any specific image (this service runs on demand, triggered by a hook, and hourly). The second task is important but less time-sensitive.

We hooked up our Microscaling scheduler to check the length of the queue to our time-sensitive mini-service (our SLA) and to start and stop instances of our two services to maintain our target queue length (i.e. to ensure the time-sensitive service was responding promptly) whilst minimising our use of paid-for AWS resources.
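The decision logic can be sketched like this. It is a simplified illustration of the idea, not the production Microbadger scaler: a fixed pool of container slots is shared between the urgent (time-sensitive) service and the background service, and capacity shifts toward whichever the queue length says is needed:

```python
def rebalance(queue_len: int, target_len: int,
              urgent: int, background: int, capacity: int):
    """Shift a fixed pool of container slots between the time-sensitive
    (urgent) service and the background service, driven by the queue
    length SLA. Returns the new (urgent, background) counts."""
    if queue_len > target_len and urgent < capacity:
        urgent += 1                       # falling behind: add an urgent worker
        if urgent + background > capacity:
            background -= 1               # make room by stopping a background one
    elif queue_len < target_len and urgent > 1:
        urgent -= 1                       # demand fell: release an urgent worker
        if urgent + background < capacity:
            background += 1               # hand the freed slot to background work
    return urgent, background
```

Run on a timer, a loop like this keeps the urgent service meeting its SLA while soaking up the remaining capacity with background work instead of paying for idle machines.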

To fully complete the circle, our Microscaling scheduler uses the Microbadger API to get the key metadata on the two containerised services it controls (basically, which is the urgent service and which is the less urgent one).

What Next?

Did metadata get any better from our yelling about it? We hope so. There are now over 10,000 images on Docker Hub and GitHub with our metadata-awareness badge and industry leaders like Kelsey Hightower are talking about metadata.

We still believe that containers+orchestrators+metadata will be the foundation of intelligent applications that sit on top of orchestrators like Kubernetes and improve the security, efficiency and resilience of production infrastructure. This should start to happen in 2017–2018.

On Microscaling, you can try it yourself, using our open-source Apache2 microscaling code as a base, or write your own custom schedulers from scratch:

  • Start with a good orchestrator like Kubernetes or Nomad and you should be able to improve your resource utilisation significantly using their default deploy-time bin-packing.
  • You can then write custom schedulers that will microscale for you. If you want to learn more about this, you can read about how we use microscaling to control the Microbadger service. You can also play with the code of our scheduling agent on GitHub, which should teach you about basic control theory as well as about controlling orchestrators like Kubernetes or Mesos.

Good luck from us all and see you soon,

Anne Currie, Liz Rice & Ross Fairbanks
