Route to TMCloud

Published in TOPMONKS · 3 min read · Jul 16, 2018

TopMonks has a long history with Docker. We adopted Tutum (later known as Docker Cloud) for our infrastructure exactly three years ago. Tutum allowed agile development and deployment of containers to our servers hosted on Digital Ocean. Most of the time it worked well.

One year ago we changed how we use Docker Cloud, moving from dedicated machines per stack to a more cluster-oriented architecture in which we can share resources and improve resilience. We were inspired by Docker Swarm. Unfortunately, it was not supported in Docker Cloud at the time, so we had to build our own infrastructure to support autodiscovery and load balancing of our services.

Fast forward to Mar 20, when Docker announced the closure of Docker Cloud management services. We had 2 months to implement our own solution, because there is no other service like Docker Cloud. After some research, we made a decision: we would use Docker Swarm and our very own Swarmpit to replace the Docker Cloud management UI, so that we could run our containers in almost the same way as before. We also wanted to keep running our servers on the familiar Digital Ocean.

So the work started. I created some Terraform templates to provision a complete environment with Swarm workers, managers, a Digital Ocean load balancer and convenient DNS entries. The templates also created the Swarm and deployed Swarmpit into it.

But it did not work…

Problem № 1 — Unreliable provisioning of Ubuntu machines

I was unable to reliably provision Ubuntu machines with the latest Docker and security updates. It failed often. By my count, every third or fourth machine failed to update. When you are provisioning a 12-node cluster, you get a mess.
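To put a rough number on that mess, here is a minimal sketch of the arithmetic. It assumes each machine fails independently at the observed rate of one in three to one in four, which is a simplification (a shared cause, such as the network, would make failures correlated). The function names are made up for this illustration.

```python
def all_succeed_probability(failure_rate: float, nodes: int) -> float:
    """Chance that every node provisions cleanly, assuming independent failures."""
    return (1.0 - failure_rate) ** nodes


def expected_failures(failure_rate: float, nodes: int) -> float:
    """Average number of nodes that fail in one provisioning run."""
    return failure_rate * nodes


if __name__ == "__main__":
    nodes = 12  # cluster size mentioned in the text
    for failure_rate in (1 / 3, 1 / 4):  # "every third or fourth machine"
        clean = all_succeed_probability(failure_rate, nodes)
        broken = expected_failures(failure_rate, nodes)
        print(f"failure rate {failure_rate:.2f}: "
              f"clean run {clean:.1%}, expected failed nodes {broken:.0f}")
```

Under these assumptions a 12-node run comes up clean only about 1% to 3% of the time, with three or four broken nodes on average, so nearly every run needs manual repair.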

Problem № 2 — The lack of persistent volumes

Even if your services are stateless, not all infrastructure things can work without persistence. In Docker Swarm you have two options:

Restrict service to always deploy to the same node.
Install Docker Volume plugin with the persistent backend.

The first option means giving up on reliability, so the second is the way to go if you want a more robust system. Unfortunately, there is no official plugin for Digital Ocean. The RexRay S3FS plugin looked promising, but we were unable to make it work.
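The difference between the two options shows up in how a service is created. The sketch below only builds the `docker service create` argument list for each case and does not run anything. The helper name, service name, image, node hostname and driver name are all hypothetical placeholders, not our actual setup.

```python
from typing import List, Optional


def service_create_args(
    name: str,
    image: str,
    volume: str,
    target: str,
    pin_to_host: Optional[str] = None,
    volume_driver: Optional[str] = None,
) -> List[str]:
    """Build a `docker service create` command for a service that needs a volume."""
    args = ["docker", "service", "create", "--name", name]
    if pin_to_host is not None:
        # Option 1: a placement constraint keeps the service on one node,
        # so its local volume is always there (and so is the single point of failure).
        args += ["--constraint", f"node.hostname=={pin_to_host}"]
    mount = f"type=volume,source={volume},target={target}"
    if volume_driver is not None:
        # Option 2: a volume plugin backs the volume with shared storage,
        # so the service can be scheduled on any node.
        mount += f",volume-driver={volume_driver}"
    args += ["--mount", mount, image]
    return args


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    pinned = service_create_args("db", "example/db:latest", "db-data", "/data",
                                 pin_to_host="worker-1")
    shared = service_create_args("db", "example/db:latest", "db-data", "/data",
                                 volume_driver="example/volume-driver")
    print(" ".join(pinned))
    print(" ".join(shared))
```

With the pinned variant, losing that one node takes the service down with it. The plugin variant avoids this, but only if a working volume driver exists for your provider.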

Problem № 3 — Networking issues

During our tests of the infrastructure, we saw long latencies and frequent HTTP 502 error codes from the Digital Ocean load balancer. It looks like the Digital Ocean infrastructure has some serious networking issues, and that these are also the root cause of the first problem.

New direction

After some time-consuming attempts to fix the issues, we gave up and started looking for alternatives. Fargate is not available in our region yet, so we tried the official Docker CE implementations for Azure and AWS. We were able to provision Swarm clusters with all the desired infrastructure in both environments successfully. In the end, we decided to go with Docker for AWS because we use many AWS serverless services and wanted to stay as close to them as possible. I created a new Terraform template from the PoC solution, and now we are able to provision Swarm environments with Swarmpit and all the infrastructure on demand.

Conclusion

What looked like a straightforward migration was actually a difficult journey full of obstacles. The only thing that we planned to use and that worked well was Swarmpit. A convenient and easy-to-use UI for Swarm management is a crucial part of our daily workflow. We learned that not all clouds are equal: some are hiding thunderstorms.