dotScale: When Scalability Matters

Fabrizio Scandola · Elements blog · Jun 11, 2015

TL;DR

  1. Industry leaders see containers as the most promising technology for pursuing scalability. They are hard to manage, but better and better tooling is emerging (mainly from Docker and Google).
  2. Distributed systems can improve resource management, but the process needs to be tuned accurately and finely. We are getting there.
  3. Choosing a good database to support growing applications depends on the specific needs. Don’t think that NoSQL is the only way to go.
  4. Accept failure when you are dealing with a huge number of interconnected machines sharing computing resources, and tune your software to react when these failures occur. Basically this requires you to decide between Consistency (when atomicity is important and your data always needs to be in a safe state) and Availability (when you have a lot of processes that need to proceed no matter what).
  5. Have fun!

Paris, Capitale de l’Amour

In the heart of Paris, the European capital of love, a few steps away from one of the world’s best known cabarets, the Moulin Rouge, the beautiful Théâtre de Paris hosted the third edition of dotScale, one of the most important European conferences on scalability, DevOps and distributed systems. Elements, of course, could not miss the call, and a group of five of us had the pleasure of attending the conference.

What is Scalability?

“Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.”

– Wikipedia

The above definition of scalability, taken from Wikipedia, describes a pretty straightforward and appealing concept. It is a great goal for anyone building a piece of software and wanting lots of people to use it. It sounds like something everyone should try to achieve and then live happily ever after. Too bad that what is so beautiful and promising in words turns into a very complex task when it comes to actually implementing an architecture that can support it.

It’s not just me saying so: a big group of experts from all over the world is still trying to figure out the optimal way to do it (if there is an optimal way), and a lot of interesting discussions are still open. Luckily enough, I had the pleasure of hearing many of them share their knowledge on this hot topic.

The Industry trend

I want this blog post to be useful for the reader, so what I’d like to do is give you an idea of what the leading industry really thinks when it comes to scalability. It must therefore be immediately clear that we are not talking about immutable truths, because there simply is no unique vision on this difficult matter (speakers also had slightly different opinions on the same subjects).

What I can do, instead, is speak in terms of trends, providing you with a good starting point from which to deepen your knowledge on your own.

That said, as an outcome of the whole conference, I can definitely tell you that when it comes to scalability, the industry is moving towards extreme virtualization. Cloud services, if it was not clear already, appear to be the way to go. But they are only the starting point of an even bigger push: the leading trail seems to go in the direction of containers, a technology that has been around for some time now and that keeps being confirmed as the right path to follow. We are moving past both virtual machines and the idea of having to deal with an entire operating system, in favor of isolated sandboxes where it is easy to spawn processes without having to care about the whole server architecture.

Containers, worth knowing

To briefly sum up the main idea behind containers, I will start with a question: why would we want to virtualize an entire machine when we could just virtualize a small part of it?

This is what some important departments of well-known companies around the world asked themselves, and their answers brought platforms like Docker to life.

It is no surprise that most of the talks focused on this. As Ben Firshman (product manager at Docker) suggested, the advantages of pursuing this road could potentially be enormous. I say could because this is still not a drawback-free path. Containers allow simplified deployment, letting the developer focus on packaging an application into a single distributable component without having to care about the configuration of the whole environment at run time. The theory is awesome but, again, we should think in terms of real-world scenarios. Real software demands elasticity. Spawning a new node, that is, getting more computing capacity or storage, can take a very limited amount of time (which will probably keep dropping year after year), but we should also consider the spin-down time: the time between no longer requiring computing power and no longer paying for it. That is what we care about. For now this is a concrete bottleneck in the process, because even on advanced clouds like EC2 it can take more than half an hour to spin down. What will really change the way we develop is the moment when adding or reducing computing capacity takes just seconds or, to use Ben’s words, when it becomes possible to have a machine on a per-process basis. Great things to imagine.
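To make that per-process idea a bit more tangible, here is a minimal sketch using the Docker SDK for Python (the `docker` package). It assumes a running local Docker daemon and an already pulled `python:3` image, and is only meant to show how lightweight spinning a container up and down feels compared to provisioning a whole virtual machine.

```python
import docker

# Connect to the local Docker daemon (assumed to be running).
client = docker.from_env()

# Spin up a throwaway container for a single process.
container = client.containers.run(
    "python:3",  # assumed to be pulled already
    ["python", "-c", "print('hello from an isolated process')"],
    detach=True,
)

# Wait for it to finish, read its output, and tear it down.
container.wait()
print(container.logs().decode())
container.remove()
```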

When Google takes the field

As if this were not enough, John Wilkes, researcher at Google, added even more to this concept of virtualizing basically everything. He told us how at Google they asked themselves: how can containers be organized and managed as resources are consumed in the underlying environment? How do you efficiently allocate resources in a totally decentralized service made of containers? The attempt to answer these questions laid the building blocks for the release of Kubernetes, a cool service for well-managed containers hosted directly on Google Cloud Platform that I highly encourage you to check out if you have never heard of it.

The big heads at Mountain View decided to test in depth the most efficient ways of allocating computing capacity when it is not required (even within the same container!). They defined different levels of resources (in terms of memory and CPU) that they wanted to keep free for each process of their applications (to tackle unexpected workloads) and shared what was left. Studying these results, which should have been quite interesting given the amount of data you can retrieve just from the hosting architecture of something like Gmail, led them to refine this technology and finally create a service that was so good it was worth sharing. What they are doing with Kubernetes is basically building a new abstraction layer on the fundamental principle that system administrators should only care about the behavior and performance of the desired services, not about the containers where those services are hosted (or the infrastructure resources they run on).
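I am not claiming this is how Google’s scheduler works internally; the toy sketch below just illustrates the idea of reserving headroom per process and pooling whatever is left for lower-priority work. All the names and numbers are made up for illustration.

```python
# Toy illustration: reserve headroom per process, share the leftover.
NODE_CPU_MILLICORES = 4000  # hypothetical node capacity

processes = {
    # process name: (expected usage, reserved headroom for spikes)
    "frontend":    (1200, 400),
    "indexer":     (800, 200),
    "log-shipper": (300, 100),
}

reserved = sum(usage + headroom for usage, headroom in processes.values())
best_effort_pool = NODE_CPU_MILLICORES - reserved

print(f"Reserved for guaranteed workloads: {reserved} millicores")
print(f"Left over for best-effort batch jobs: {best_effort_pool} millicores")
```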

[Image: a good summary of the industry’s trend in terms of scalability]

Databases, they always matter

So far I have only talked about machines and their interconnections, but there is another fundamental topic when thinking in terms of large applications: databases. They are the pieces of software holding up almost every service we use in our everyday lives, so heavily used that they are one of the main aspects of any growth strategy. No surprise, then, that a lot of talks focused on this subject. Let me ask you: if you were planning to build the next big application, say the new Facebook, what database would you choose?

Chances are many hipsters would answer something like MongoDB or Cassandra because, of course, NoSQL is meant for these kinds of things. Do you want to know what people like Paul Dix, creator of InfluxDB, think about it? Of course, the answer is that there is no answer. It really depends on which key characteristics you want from your database. Let’s dive into his talk a bit to make clear some of the aspects you might want to consider when choosing a database to sustain the growth needs of your application.

He talked about the specific case of software built to record time series data such as log files, sensor reports and so on. These, for example, are measurements that can be very frequent and also very irregular. A system like that is peculiar and therefore requires peculiar solutions. It becomes immediately clear when you realize it needs a lot of WRITE operations with basically no UPDATEs (who ever updates a log entry, or a temperature measurement?). Moreover, it would be nice to be able to DELETE old data in big batches, removing useless entries that nobody needs anymore (would you still care about your application’s logs from last year?). But this is just scratching the surface: what if you need to partition your data, which can also be useful for that same deletion task? What partitioning criteria would you use? Timestamp? Another unique general field? A combination of them?
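To make the partitioning point concrete, here is a small hypothetical sketch of the pattern Paul described: write-heavy, no updates, and old data dropped in whole partitions rather than row by row. It is not InfluxDB’s implementation, just the shape of the idea.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Partition measurements into daily buckets keyed by date.
partitions = defaultdict(list)

def write(timestamp: datetime, value: float) -> None:
    """Append-only write: log entries and measurements never get updated."""
    partitions[timestamp.date()].append((timestamp, value))

def drop_older_than(days: int, now: datetime) -> None:
    """Delete old data in big batches by dropping whole partitions."""
    cutoff = (now - timedelta(days=days)).date()
    for day in [d for d in partitions if d < cutoff]:
        del partitions[day]

# Example usage
now = datetime(2015, 6, 11)
write(now, 21.5)
write(now - timedelta(days=400), 19.0)  # last year's measurement
drop_older_than(365, now)               # nobody needs it anymore
print(sorted(partitions.keys()))
```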

As you can see, there are lots of things to consider. This case is interesting not because of the specific case study, which might not even remotely concern your own needs, but because it forces you to think in terms of what you want to achieve and what your specific needs are. At this point it is quite obvious that even asking “which is the best database for scalability” is quite a stupid question.

Just for the sake of completeness, I will tell you that Paul ended up creating his own system which, again, is worth checking out for everybody.

NoSQL? YeSQL!

Just to reinforce my thesis that everything is relative, Simon Riggs, core developer at PostgreSQL, brought some valid arguments to support the thesis that NoSQL is not the only database suitable for sustaining big applications. And coming from a guy who has spent the last 30 years of his career working with databases (10 of them with PostgreSQL), his opinion might be of some interest. Of course his talk sounded a lot like a huge advertising campaign for his product, but he illustrated some very nice features that can make PostgreSQL a valid alternative in a distributed context as well. Speaking of scalable ecosystems where everything is so fragmented, you cannot expect the database to be entirely hosted on one single machine.

One of the core aspects to consider when growing is how easily you can face the problem of storing and serving increasing amounts of data. This goal can be achieved both vertically, by increasing the CPU and/or the RAM of a single machine, and horizontally, by splitting the database across multiple servers. The latter, which is also by far the preferred solution, is known as sharding and is widely supported by PostgreSQL, which also offers very interesting solutions for assigning unique identifiers to data sets in order to keep them highly available even in such a partitioned environment. If you are interested in better understanding how a relational database can serve the purpose of scalability through sharding, here you can find an interesting blog entry that I discovered googling around, just to ease the assimilation of these vague and theoretical concepts.
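If sharding sounds abstract, the core mechanic is just deciding which server owns which rows. The sketch below is plain application-level hash routing, not PostgreSQL’s own tooling (in practice you would look at extensions, foreign data wrappers or middleware), and the connection strings are invented for illustration.

```python
import hashlib

# Hypothetical shard connection strings; in reality these would point to
# separate PostgreSQL servers.
SHARDS = [
    "postgres://db-shard-0.example.com/app",
    "postgres://db-shard-1.example.com/app",
    "postgres://db-shard-2.example.com/app",
]

def shard_for(user_id: str) -> str:
    """Route a record to a shard by hashing its unique identifier."""
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))    # always lands on the same shard
print(shard_for("user-1337"))
```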

On top of this, Simon also illustrated other interesting features that PostgreSQL embeds to be flexible enough to suit very big architectures (e.g. massively parallel queries), but going into the details would take me out of the scope of this article. What matters is that you consider the incredible number of interesting possibilities for sustaining a growing application, so that you can start getting an idea of the right approach when dealing with such complex themes.

Keep calm and accept failure

To tie everything together, no matter whether you think in terms of storage solutions or computing capacity, I must tell you something bad: if and when you decide to build an architecture made of virtualized containers that share huge amounts of information hosted on big databases… something will fail. And I am not saying it can fail, it just will. In a distributed system this is simply something you must accept. It’s basically everybody at dotScale telling you this.

There are assumptions that we take for granted when building applications that share memory, which break down as soon as nodes are split across space and time.

– Robert Greiner in this blog post

This is quite easy to picture: the more nodes you add, the bigger the probability that some of them will fail. That is why partitions are something you must simply tolerate. And this leads you to consider even more deeply which approach to adopt, because there are a lot of variables to deal with.
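Here is a hedged sketch of what that trade-off can look like in code: when a remote node does not answer, you either give up (favouring consistency) or serve a possibly stale cached value (favouring availability). The `fetch_from_replica` function is a made-up stand-in for whatever remote call your system makes.

```python
cache = {}  # last known good values, possibly stale

def fetch_from_replica(key: str) -> str:
    """Stand-in for a remote read that may fail during a partition."""
    raise TimeoutError("replica unreachable")

def read(key: str, prefer_availability: bool) -> str:
    try:
        value = fetch_from_replica(key)
        cache[key] = value
        return value
    except TimeoutError:
        if prefer_availability and key in cache:
            return cache[key]  # availability: answer, possibly stale
        raise                  # consistency: refuse rather than lie

# Example usage
cache["greeting"] = "hello (cached yesterday)"
print(read("greeting", prefer_availability=True))
```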

In the end, what is the lesson learned? Well, it might seem that I am leaving you with more problems than solutions, but I would not see it like that. I personally think this is simply a fascinating subject, and we should shift our focus from the problems towards all the amazing things we will be able to build pursuing this hard road. Of course it will be hard; it will be hard as hell. But what would a developer’s life be without hard problems to solve?

Just boring.

Don’t forget to follow us on Facebook and Twitter!

Originally published at www.elements.nl on June 11, 2015.
