Advocating YAML as DSL

Pavel Potapenkov
Feb 23, 2017 · 7 min read

Domain-specific languages are everywhere. A scripting language embedded in a larger system is a DSL. The formatting syntax of an online text editor is a DSL. SQL is a DSL. Even the syntax of config files is a DSL (a rather limited one most of the time).

Everything is quite simple when we talk about simple config files: the ones where you configure which port to listen on for new connections and which host some useful API lives on. Generally you may use whatever syntax you want; it won’t make a great difference, especially if you don’t have an enormous number of properties. Things are a bit trickier with embedded scripting languages. If you need something Turing-complete, you’ll have to choose wisely. But that’s not the case I want to discuss here.

A tiny bit of context

To describe what kind of DSL I’m talking about, I’ll give two examples of systems that are nearly useless without a good language for describing their configuration. But first, a couple of words about the environment. Imagine a rather big infrastructure. Let’s say you have several thousand servers. And this is not one monolithic cluster: all these servers are divided into a couple of hundred different groups, each doing its own work. Some of them are frontends of service A, some are backends of service B, and some are part of computing cluster C. And so on, and so on.

So, the first example is a torrent-based system for deploying static data. For example, you may need to generate something like a search index and distribute it to all the servers that will serve users’ search requests. But you don’t want to allow just anybody to upload new data to the torrent tracker. So you have at least two requirements: to describe by whom (or from which clusters) a particular dataset may be uploaded, and to describe who should download it. There may also be additional options, like a TTL or how many older generations to keep (the ability to roll back really quickly is priceless). And since we have a rather big infrastructure, let’s assume there are several hundred different datasets that are generated in different places and downloaded by different clusters. It looks like we will have a pretty huge configuration file. And every few days new datasets, generator clusters, or client clusters appear.

The second example is a graph plotting system. When you have such a broad infrastructure, you definitely want to plot lots of graphs. And you likely want not just to collect some generic data, but to create reasonably complex rules. For example, you may want a graph that shows the rate of successful requests to example.com with a URL that starts with /some_handler or /blah, but excluding /some_handler_internal, from clients whose user agent contains Firefox, that came via IPv6, and whose serving duration was at least 300 milliseconds. Such requirements may sound insane to some people, but believe me, that’s not the most complicated rule you may want to use. And my experience says that on average there will be much more than one new graph per day. So the configuration format should be a convenient tool that allows you to describe such graphs without titanic effort.

The pain of choice

So I hope these two examples help to show what kind of DSLs I’m talking about. There are lots of possible choices, and when deciding what language to use (or create from scratch) for describing such configuration, one should consider several factors:

  1. Target audience. A strict and easily parseable format is good from a technical point of view. XML or JSON (or something else with similar functionality) are very easy to parse, so using them will save development time. And they are relatively convenient if you understand them, so we may assume that for tech people such formats will be a good solution, as in the case of the data-deployment system. But in the case of graphs, not only tech people will want to edit them. Some data may be crucial from a business point of view, but not from a technical one. Have you ever tried to explain to your business-development team what XML is? It’s insanely hard, and a much easier solution is to create some intermediate tool, or to make the configuration syntax less strict and closer to natural language.
  2. Complexity of the language itself. The less complex the language is, the more freedom you have in choosing your DSL. In data deployment you basically have only four possible expressions: where data is generated, where it is deployed, how long its TTL is, and how many copies to store as backup. From this point of view you may choose any language. And what about graphs? You’ll need at least several comparison operators (some of which may use regular expressions) and logical operators like “and” and “or”. And if you have some monitoring or analytics based on data from graphs (Graphite, for example, has an API that will give you raw time-series data), then you will likely want to store some additional parameters, like the limits of “normal” request serving duration, or the maximum number of allowed errors, or something like that.
  3. Complexity of possible expressions. Once again: the less complex your language’s expressions, the more freedom you have. In the data deployer, rules are mostly simple and it may even be tempting to use a natural-like language: it may save lots of keystrokes if all you have to write is something as simple as dataset ds-0001 deploy to group_1, group_2 in your configuration. Similar syntax seems to be a good idea for graphs at first glance. But let’s write our previous graph description in a syntax similar to that of a programming language: $hostname == "example.com" && ($url ^= "/some_handler" || $url ^= "/blah") && !($url ^= "/some_handler_internal") && $user_agent ~= "Firefox" && $ip ^= "::ffff:" && $request_time >= 300ms. Not very readable, is it? And if you replace special characters like “&&” with their textual analogues like “and”, it will still not be pleasant to read and write. The main source of errors here will be operator precedence and order of evaluation. Just remove the parentheses from that expression and try to figure out what it will actually match. And that’s only the “query”, without any additional info like the name of the graph and possible alerting limits! So, obviously, we need something a bit more strict.
  4. Number of expressions. A single expression (though a rather complex one) that describes a single graph is a total mess. Need I say that if you have several thousand of them, they will be completely unmanageable? And the simplistic syntax of the data-deployment system may get messy too. If you have long lists of groups that a particular dataset should be deployed to, and/or you wish to tune the TTL and the number of older versions on a per-group basis, then a configuration written on one line will be too inconvenient to read: it may become wider than your screen. So you’ll have to use some sort of multiline syntax, and that will ruin your ability to quickly find all the datasets that are deployed to a particular group. Or, if you choose to describe generators and deployments in separate expressions, you may end up in a situation where they eventually find themselves at opposite ends of the configuration file.
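
The precedence trap from point 3 is easy to reproduce. What follows is a minimal Python sketch, not the DSL itself: the request record and its field names are made up for illustration, startswith stands in for the hypothetical ^= operator, and Python’s and/or/not stand in for &&, || and !. As in most C-like languages, "and" binds tighter than "or", so dropping the parentheses around the URL alternatives quietly changes what the rule matches.

```python
# Hypothetical request record; all field names are assumptions for illustration.
request = {
    "hostname": "other.example.org",  # deliberately NOT example.com
    "url": "/blah/page",
    "user_agent": "Mozilla/5.0 Firefox/51.0",
    "ip": "::ffff:10.0.0.1",
    "request_time_ms": 450,
}


def with_parens(r):
    # Intended rule: the two URL prefixes are grouped explicitly.
    return (
        r["hostname"] == "example.com"
        and (r["url"].startswith("/some_handler") or r["url"].startswith("/blah"))
        and not r["url"].startswith("/some_handler_internal")
        and "Firefox" in r["user_agent"]
        and r["ip"].startswith("::ffff:")
        and r["request_time_ms"] >= 300
    )


def without_parens(r):
    # Same text with the inner grouping removed. Because "and" binds tighter
    # than "or", this is parsed as (host and handler) or (blah and the rest).
    # The outer parentheses only allow the expression to span several lines.
    return (
        r["hostname"] == "example.com"
        and r["url"].startswith("/some_handler")
        or r["url"].startswith("/blah")
        and not r["url"].startswith("/some_handler_internal")
        and "Firefox" in r["user_agent"]
        and r["ip"].startswith("::ffff:")
        and r["request_time_ms"] >= 300
    )


print(with_parens(request), without_parens(request))
```

The second function accepts a request for the wrong hostname, because the hostname check now guards only the /some_handler branch.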

And the silver bullet is…

Actually, there’s no silver bullet when deciding what DSL to choose. But as for me, I really think that YAML is as close to being that bullet as possible. First of all, it is strict: it describes data structures in a manner similar to JSON. Its syntax is actually a superset of JSON syntax (as of YAML 1.2), but it lets you write far fewer special characters where there’s no ambiguity. It also allows you to write some expressions on a single line and stretch others across several lines. And if you use its syntax well, you’ll get a really readable configuration. And there are open-source libraries that will parse YAML for you, which is definitely a huge advantage unless you want to write your own parser for your own DSL.

YAML is definitely good for tech people. It will even be okay for some of your non-tech colleagues, especially if the syntax is not very complex. And for all the other contributors to our configuration files we can write tools that are convenient for casual users, using YAML as an intermediate representation. You invoke something like update-config.py "deploy dataset-1234 to group_8" and it loads the configuration, parses it, changes it, and saves it back. It’s trivial with YAML, but may be insanely tricky with a home-brewed syntax.
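
The core of such a tool could look roughly like the sketch below. Everything specific in it is an assumption: the command grammar, the datasets/deploy_to layout of the configuration, and the function names are invented for illustration. To keep the sketch runnable with the standard library only, loading and saving go through json; with PyYAML installed, yaml.safe_load and yaml.safe_dump would take their place.

```python
import json
import re

# Hypothetical command grammar: "deploy <dataset> to <group>[, <group>...]".
COMMAND_RE = re.compile(r"^deploy\s+(?P<dataset>[\w-]+)\s+to\s+(?P<groups>.+)$")


def apply_command(config, command):
    """Return a copy of config with the deploy command applied."""
    match = COMMAND_RE.match(command.strip())
    if match is None:
        raise ValueError(f"cannot parse command: {command!r}")
    groups = [g.strip() for g in match.group("groups").split(",") if g.strip()]

    # Cheap deep copy; good enough for plain dicts, lists, strings and numbers.
    updated = json.loads(json.dumps(config))

    # Assumed layout: {"datasets": {<name>: {"deploy_to": [<group>, ...]}}}.
    datasets = updated.setdefault("datasets", {})
    dataset = datasets.setdefault(match.group("dataset"), {})
    deploy_to = dataset.setdefault("deploy_to", [])
    for group in groups:
        if group not in deploy_to:  # keep the list free of duplicates
            deploy_to.append(group)
    return updated


def update_config_text(text, command):
    """Load the configuration text, change it, and serialize it back."""
    # Stand-ins for yaml.safe_load / yaml.safe_dump (see the note above).
    config = json.loads(text)
    return json.dumps(apply_command(config, command), indent=2, sort_keys=True)


if __name__ == "__main__":
    original = '{"datasets": {"dataset-1234": {"deploy_to": ["group_1"]}}}'
    print(update_config_text(original, "deploy dataset-1234 to group_8"))
```

The point is that the tool only ever manipulates ordinary dictionaries and lists; the parser and serializer come from a library rather than from you.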

And let’s look at our graph example, written in one of the possible YAML representations:

And:
- hostname: { equals: "example.com" }
- Or: [{url: {starts: "/some_handler"}}, {url: {starts: "/blah"}}]
- Not: {url: {starts: "/some_handler_internal"}}
- user_agent: { contains: "Firefox" }
- ip: { starts: "::ffff:" }
- request_time: { gte: 300 }

Or the same configuration without most of the special characters:

And:
- hostname:
    equals: example.com
- Or:
  - url:
      starts: /some_handler
  - url:
      starts: /blah
- Not:
    url:
      starts: /some_handler_internal
- user_agent:
    contains: Firefox
- ip:
    starts: "::ffff:"
- request_time:
    gte: 300

Oddly enough, it’s very readable and even looks pretty natural if you don’t pay attention to the various brackets and colons. The second version is sparser and looks more convenient to some people. As for me, I prefer the more compact notation: it fits better when you have lots of graphs in one file. Well, editing such a configuration is not exactly the easiest thing in the world, but it’s not hard, and since modern text editors highlight syntax, it’s not so easy to forget a bracket or miscalculate indentation.
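
Consuming such a rule is not much work either. Here is a minimal Python sketch of an evaluator, assuming the rule has already been parsed into nested dictionaries and lists (which is what a YAML library would hand you). The meaning of equals, starts, contains and gte is my guess from their names, and the flat request record is hypothetical.

```python
# The graph rule as a YAML parser would return it: nested dicts and lists.
RULE = {
    "And": [
        {"hostname": {"equals": "example.com"}},
        {"Or": [{"url": {"starts": "/some_handler"}}, {"url": {"starts": "/blah"}}]},
        {"Not": {"url": {"starts": "/some_handler_internal"}}},
        {"user_agent": {"contains": "Firefox"}},
        {"ip": {"starts": "::ffff:"}},
        {"request_time": {"gte": 300}},
    ]
}

# Comparison names come from the example; their semantics are assumed.
COMPARISONS = {
    "equals": lambda value, arg: value == arg,
    "starts": lambda value, arg: str(value).startswith(arg),
    "contains": lambda value, arg: arg in str(value),
    "gte": lambda value, arg: value >= arg,
}


def matches(rule, record):
    """Evaluate a single-key rule node against a flat record dict."""
    if len(rule) != 1:
        raise ValueError(f"expected exactly one key per rule node, got {list(rule)}")
    key, body = next(iter(rule.items()))
    if key == "And":
        return all(matches(sub, record) for sub in body)
    if key == "Or":
        return any(matches(sub, record) for sub in body)
    if key == "Not":
        return not matches(body, record)
    # Any other key is a field name; body maps comparison name -> argument.
    return all(COMPARISONS[op](record[key], arg) for op, arg in body.items())


if __name__ == "__main__":
    record = {
        "hostname": "example.com",
        "url": "/blah/page",
        "user_agent": "Mozilla/5.0 Firefox/51.0",
        "ip": "::ffff:10.0.0.1",
        "request_time": 450,
    }
    print(matches(RULE, record))
```

Because the grouping lives in the nesting of the data structure, there is no operator precedence left to get wrong.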

Conclusion

So, YAML is a very powerful tool for designing DSLs of this particular type. It can be highly readable when formatted well, and writing new configurations (or editing existing ones) is fairly easy too, which lets you create rather complex configurations that still look neat. And since it is a widely adopted standard, it is easily parsed by third-party libraries. That saves lots of time during development and allows other people to easily create tools that either use your configuration as an input for some purpose, or use it as an intermediate representation while implementing alternative front-ends for configuring your systems.
