Automating NoSQL Database Builds

A “Python to the Rescue” Story That Never Gets Old

Steven F. Lott
Capital One Tech
11 min read · Jul 20, 2016


That Feeling when your application is so big — and so important — that you just know there’s going to be a dramatic expansion in the storage needs sometime soon. “The ‘Cyclopean’ analysis tool is about to be walloped by another line of business, an acquisition, and a spirited level of organic growth all at the same time. Oh, and the doubling time was just cut in half.” Looks like someone needs to build the new database server farm.

Viewed from a distance, the idea of provisioning a NoSQL database server farm in a situation like this seems pretty simple. Allocate servers. Allocate storage. Build the software. Do all the other database configurations required to meet enterprise standards, InfoSec standards, and production support standards. Isn’t this what DevOps is all about? It’s just “Hey! Presto! Database!” Right?

When we dig into the details, though, it isn’t quite so simple. And yes, this is going to turn into a “Python to the Rescue” story.

What’s Involved? No, Really?

The basic outline of provisioning a database server farm requires several resources. The most important resource is patience. These tend to be large servers with A LOT of disk storage. (30GB to 60GB of RAM.) For some databases — like Cassandra — it seems most sensible to build the nodes serially. In this scenario, the first and last nodes are treated differently from the others, with the first node containing seed information the others need. Therefore, it seems simplest to avoid building database details until all the nodes are built and can start sharing roles, users, and other definitions. This drags out the time to build the next release of the “Cyclopean Database.”

Viewed abstractly, we’re going to do the following things to create each individual node of the server farm:

1. Run a Chef recipe from a provisioning server to build each node.

2. Update the domain name.

3. Create entries in any configuration management database that’s used to track assets.

4. Schedule backups (if relevant).

Most of this is just API calls; they’re pretty straightforward, especially in Python 3. (Hold that thought for later.) The Chef recipe problem is where the real work shows up. We need to strike an appropriate balance between using simple Chef scripting and maintaining flexibility. Also, the recipes need to have enough parameters so we can avoid tweaking them every time there’s a change in what we’re building.

The problem is that the pace of change to the environmental setup is brisk because these aren’t generic installations: they’re customized for our enterprise needs. Yesterday’s enterprise best practice is today’s “barely-good-enough” practice. We don’t want constant tweaking of the Chef recipe. One alternative solution is to gather data dynamically. But a recipe that gathers data dynamically means we might have trouble recovering a piece of dynamic data in order to rebuild an existing server.

Where’s the middle ground?

Finding Flexibility

Building the Chef recipes forced us to see that there are a large number of parameters that drive Chef. So many that we needed external tools to collect the data and build something useful with it. The parameters fall into a number of buckets:

- Things the user must provide. Application names (“Cyclopean”). Estimated sizes (“1TB”). Their line of business, and their cost center information. Clearly, this drives the database building request.

- Things which the enterprise needs to provide. Cloud configuration details, subnets, other corporate-level details. Some of these configuration details change frequently. There are APIs to retrieve some of them. Doing these lookups in a Chef recipe seemed wrong because the recipe becomes dynamic and loses its idempotency.

- Things which database engineers need to define. Naming conventions, storage preferences, sizing calculations, and other environmental details. This could be part of the Chef recipe itself.

- Things which are unique to each database product. For example, the unique way to define users, roles, privileges, connections to an enterprise LDAP. We need the flexibility to let this vary by line of business. Rather than create many variant recipes, we’d prefer to drive this as part of the overall collection of parameters for the build.

- Things which are simple literal values, but will change more often than the recipe will need to change. Version numbers of the distribution kits, for example, will change but need to be parameterized. Chef attributes already offer pleasant ways to handle this. The 15-step attribute precedence table shows how a recipe can provide default attributes, an attribute file can provide values, and an environment can override the values.

Our goal is to minimize tweaking the Chef recipe, and experience shows this involves a lot of parameters. For some of our NoSQL database installation recipes, this can be as many as 200 distinct values.

Python gives us a handy way to gather data from a variety of sources and build the necessary attributes and configuration options so that a — relatively — stable Chef recipe can be used. (Remember where we said in the intro this would become a “Python to the Rescue” story?) The idea is to cache the parameters with the Chef recipe, letting us rebuild any node at any time. Having a static template file gives us the needed idempotency in our provisioning.

The next question to address is designing the Python app so it can support repeatable but flexible builds.

Naïve Design with Python Dictionaries

Let’s talk Python. The parameters for the recipe can be serialized as a big JSON (or YAML) document. Python makes this really easy: if we create a dictionary-of-dictionaries structure, it can be serialized as a JSON/YAML file trivially. (It’s a “json.dump(object, file)” level of trivial.)
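For instance, serializing a nested parameter dictionary really is that short. (The values and file name below are made up for illustration.)

import json

params = {
    'application': 'Cyclopean',
    'storage': {'volume_size': 1024, 'volume_type': 'gp2'},
}

# Write the cached parameters next to the Chef recipe.
with open('chef_attributes.json', 'w') as target:
    json.dump(params, target, indent=2)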

How do we build this dictionary-of-dictionaries?

Let’s look at a storage definition as an example. We have some parameters that need to go into our Chef recipe. The details involve some calculations and some literals. We can try this:

storage = {
    'devices': [
        {
            'device_name': '/dev/xvdz',
            'configuration': {
                'volume_type': get_volume_type(),
                'iops': get_iops(),
                'delete_on_termination': True,
                'volume_size': get_volume_size(),
                'snapshot_id': get_snapshot_id(get_line_of_business()),
            },
        },
    ]
}

Emphasis on the word “try”.

This will build a tidy dictionary-of-dictionaries data structure. The details are filled in by functions that acquire and compute values. For consistency, we could even wrap literal values as functions to make all the parameters more uniform:

def device_name():
    return '/dev/xvdz'

The problem is that the functions either have a number of parameters or they’re forced to use global variables. It turns out that there are many external sources of configuration information. Passing them all as parameters is unwieldy; a configuration namespace object would be required on every function.

Some of the computations are stateful. For a concrete example, think of a round-robin algorithm to allocate database nodes among data centers and racks: each node’s assignment leads to updating global variables. A function with a side effect like this is a design headache and a unit testing nightmare.
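To make the pain concrete, here is a sketch of what such a round-robin allocation looks like when it leans on module-level globals. (The names are invented for illustration.)

# Hypothetical illustration of the problem: module-level state.
DATA_CENTERS = ['dc1', 'dc2', 'dc3']
next_dc = 0

def get_data_center():
    """Round-robin assignment that silently updates a global."""
    global next_dc
    dc = DATA_CENTERS[next_dc % len(DATA_CENTERS)]
    next_dc += 1
    return dc

Every unit test has to remember to reset next_dc, and any extra call shifts the answer for everything built afterwards.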

Declarative Python

How can we provide a better approach to using parameters instead of globals? And how can we have stateful objects that fill in our template?

Our answer is to use a declarative style of programming. We can — without doing any substantial work — create a kind of domain-specific language using Python class definitions. The idea is to build lazy objects which will emit values when required.

Sticking with the storage example, the approach would look like this:

class Storage(Template):
    device_name = Literal("/dev/xvdz")
    configuration = Document(
        volume_size = VolumeSizeItem("/dev/xvdz", Request('size'),
                                     "volume_size", conversion=int),
        snapshot_id = ResourceProfileItem(Request('lob'),
                                          Request('env'), Request('dc'), "Snapshot"),
        delete_on_termination = Literal(True),
        volume_type = ResourceProfileItem(Request('lob'),
                                          Request('env'), Request('dc'), "VolumeType"),
        iops = ResourceProfileItem(Request('lob'),
                                   Request('env'), Request('dc'), "IOPS", conversion=int)
    )

For this, the details are created by instances of classes that help build the JSON configuration object, and instances of classes that fill in the items within a configuration object. There’s a hierarchy of these classes that provide different kinds of values and calculations. All of them are extensions to a base Item class.
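The Item hierarchy itself isn’t reproduced here, but a minimal sketch of the base class and its simplest subclass — consistent with the get() protocol the Template uses below, not the production code — might look like this:

class Item:
    """Base class for a lazy parameter value (sketch)."""
    def get(self, source, request):
        """Return the final value, given the configuration sources
        and the original build request."""
        raise NotImplementedError


class Literal(Item):
    """An Item wrapping a fixed value."""
    def __init__(self, value):
        self.value = value

    def get(self, source, request):
        return self.value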

The idea is to build an instance of the Template class that contains all of the complex data that needs to be assembled and exported as a big JSON document for the Chef recipes to consume. The subtlety is that we’d like to preserve the order in which the attributes are presented. It’s not a requirement, but it makes it much easier to read the JSON if it matches the Python trivially.

Further, we need to extend Python’s inheritance model slightly so that each subclass of Template has a concrete list of its own attributes, plus the parent attributes. This, too, makes it easier to debug the output.

We’re going to tweak the metaclass definition for Template to provide these additional features. It looks like this:

from bson import SON   # SON preserves key order, like OrderedDict


class TemplateMeta(type):
    @classmethod
    def __prepare__(metaclass, name, bases):
        """Changes the internal dictionary to a :class:`bson.SON` object."""
        return SON()

    def __new__(cls, name, bases, kwds):
        """Create a new class by merging attribute names.
        Sets the ``_attr_order`` to be parent attributes + child attributes.
        """
        local_attr_list = [a_name
                           for a_name in kwds
                           if isinstance(kwds[a_name], Item)]
        parent_attr_list = []
        for b in bases:
            # Not every base is required to provide _attr_order.
            parent_attr_list.extend(getattr(b, '_attr_order', []))
        for a_name in local_attr_list:
            if a_name not in parent_attr_list:
                parent_attr_list.append(a_name)
        kwds['_attr_order'] = parent_attr_list
        return super(TemplateMeta, cls).__new__(cls, name, bases, kwds)

The metaclass replaces the class-level __dict__ object with a bson.SON object. (Yes, we use Mongo a lot.) The SON object preserves key entry order information, much like Python’s native OrderedDict.

The metaclass definition also builds an additional class-level attribute, _attr_order, which provides the complete list of attributes of this subclass of Template and all of its parent classes. The order will always start with the parent attributes first. Note that we don’t depend on the parents all providing an _attr_order attribute; we actually search each parent class to be sure we’ve found everything.
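As a hypothetical illustration (the EncryptedStorage class is ours, not from the real recipes), a subclass inherits its parent’s attribute order and appends its own:

class BaseStorage(Template):
    device_name = Literal("/dev/xvdz")

class EncryptedStorage(BaseStorage):
    encrypted = Literal(True)

# EncryptedStorage._attr_order == ['device_name', 'encrypted']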

The substitute() method of a Template collects all of the required data. We could produce the JSON data here, but we prefer to wait until the output is requested.

def substitute(self, sourceContainer, request, **kw):
    self._source = sourceContainer
    self._request = request.copy()
    self._request.update(kw)
    return self

The parameters for building out the data come from three places: a sourceContainer which has all of the various configuration files, the initial request that specifies details such as how many nodes are needed for the next release of “Cyclopean”, and any keyword overrides that might show up.

The output comes when the template is emitted as a SON object that serializes in JSON notation.

def to_dict(self):
    result = SON()
    for key in self._attr_order:
        item = getattr(self.__class__, key)
        value = item.get(self._source, self._request)
        if value is not None:
            result[key] = value
    return result

All of the attribute-filling Item instances have a common get() method that does any calculation. This can also update any internal state for the item. The Template iterates through all of the Items, evaluating get().

The get() method of each Item object is given the configuration details. Instead of free-floating globals, the Template has a short list of tightly-defined configuration details; these are provided to each individual Item.

This avoids relying on a (possibly vague) collection of global variables. Bonus! Since they’re objects, stateful calculations do not include the terrifying technique of updating global variables. State can be encapsulated in the Item instance. Unit testing works out nicely because each Item can be tested in isolation.
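Returning to the round-robin allocation example, a sketch of a stateful Item — the class name and test below are our illustration, not the production code — keeps the state in the instance, where it can be tested in isolation:

class RoundRobinItem(Item):
    """Cycles through a fixed list of choices; state lives in the instance."""
    def __init__(self, *choices):
        self.choices = choices
        self.index = 0

    def get(self, source, request):
        value = self.choices[self.index % len(self.choices)]
        self.index += 1
        return value


def test_round_robin_item():
    item = RoundRobinItem('dc1', 'dc2')
    assert item.get({}, {}) == 'dc1'
    assert item.get({}, {}) == 'dc2'
    assert item.get({}, {}) == 'dc1'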

This gives us something that’s highly testable and not significantly more complex than the naïve design. We can have a stable, simple Chef recipe. All of the lookups and calculations to prepare values for Chef are in our Python application. Specifically, they’re isolated in the definition of the Item subclasses and the Templates.
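Putting the pieces together, the build script’s use of a template boils down to something like this. (The request keys match the Storage example above; the load_configuration_files() helper is a simplified assumption.)

import json

request = {'size': 1024, 'lob': 'cyclopean', 'env': 'prod', 'dc': 'us-east'}
sources = load_configuration_files()   # hypothetical source container

document = Storage().substitute(sources, request).to_dict()
attributes_json = json.dumps(document, indent=2)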

The Value of Python

There are two reasons why Python works for us:

1. Flexibility.

2. And flexibility.

Firstly, we have the flexibility to modify the JSON documents that are used for Chef provisioning. The Chef scripts are tiresome to debug because each time we test one it takes — seemingly — forever to provision a node. The documents which are the input to the Chef recipe can be defined via unit tests and it takes under a second to run the test suite after making a change. Each new idea can be examined quickly.

For example, consider a change to the way the enterprise allocated subnets. Yesterday, “Cyclopean” was on one subnet and life was good. Now that it’s becoming huge, it has to be moved and the databases split away from the web servers. The specifications for the subnet went from a simple Literal subclass of Item to a complex lookup based on environment and server purpose.

We used to have this:

class SubnetTemplate(Template):
    subnet_id = Literal('name of net')

Now we have this:

class Expanded_SubnetTemplate(Template):
    subnet_id = ResourceProfileField(Request('env'),
                                     Request('purpose'), 'Subnet')

And yet, this change had no impact on the Chef recipe. It added a bunch of details in configuration files, and some additional lookups into those files. Those are changes we can design and unit test quickly.
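A hedged example of such a test — assuming ResourceProfileField resolves the (env, purpose, name) triple against the source container, and with a fake source of our own invention:

def test_expanded_subnet_template():
    # Fake configuration source: (env, purpose, name) -> value.
    fake_sources = {('prod', 'database', 'Subnet'): 'subnet-0abc123'}
    request = {'env': 'prod', 'purpose': 'database'}
    document = Expanded_SubnetTemplate().substitute(fake_sources, request).to_dict()
    assert document['subnet_id'] == 'subnet-0abc123'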

Secondly, we have the flexibility to integrate all of the provisioning steps into a single, unified framework. Much of the work is done through RESTful APIs. Using Python 3 and its reorganized urllib package makes this relatively simple. Additional libraries for different cloud vendors fit the Python world-view of extension modules to solve unique problems.
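For example, one of those RESTful calls — say, registering a freshly built node in a configuration management database — needs nothing beyond the standard library. (The URL and payload shape here are placeholders.)

import json
import urllib.request

def register_node(cmdb_url, node):
    """POST a node record to a (hypothetical) CMDB REST endpoint."""
    data = json.dumps(node).encode('utf-8')
    request = urllib.request.Request(
        cmdb_url,
        data=data,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))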

We use a Command design pattern for this. Each step in the build process is a subclass of NodeCommand.

import logging


class NodeCommand:
    """Abstract superclass for all commands related to building a node.
    """
    def __init__(self):
        self.logger = logging.getLogger(self.__class__.__name__)

    def __repr__(self):
        return self.__class__.__name__

    def execute(self, configuration, build, node_number):
        """Executes the command, returns a dictionary with 'status',
        'log'.

        Sources for some parameters::

            build_id = build['_id']
            node = build['nodes'][node_number]

        :param configuration: Global configuration
        :param build: overall :class:`dbbuilder.model.Build`
            document
        :param node_number: number for the node within the sequence
            of nodes
        :returns: dictionary with final status information plus any
            additional details created by this command.
        """
        raise NotImplementedError

One of the most important subclasses of NodeCommand is ChefCommand, which executes the Chef provisioning script with all of the right parameters.
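A sketch of what such a subclass might look like — the chef-client invocation and the attributes_path key are illustrative, not the production command line:

import subprocess

class ChefCommand(NodeCommand):
    """Runs Chef provisioning for one node (illustrative sketch)."""
    def execute(self, configuration, build, node_number):
        node = build['nodes'][node_number]
        # The generated JSON attributes were cached next to the recipe.
        command = ['chef-client', '--json-attributes', node['attributes_path']]
        self.logger.info("running %s", command)
        completed = subprocess.run(command, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        return {
            'status': 'success' if completed.returncode == 0 else 'failure',
            'log': completed.stdout.decode('utf-8', errors='replace'),
        }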

Using multiple Command instances means that we can — via simple imports — wrap a lot of features into a high-level Python script. And the integration doesn’t stop there. The provisioning automation engine is made available via a Flask container. The import statement lets a Flask container provide the complex batch script capabilities to any internal customer who can write a curl request or a short Python script.
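A sketch of that Flask layer — the endpoint path and the run_build() helper are invented for illustration:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/builds', methods=['POST'])
def start_build():
    """Accept a build request and hand it to the command framework."""
    build_request = request.get_json()
    # run_build() is a hypothetical wrapper around the NodeCommand sequence.
    result = run_build(build_request)
    return jsonify(result), 202

An internal customer can then kick off a build with a one-line curl POST or a short Python script.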

Python To The Rescue

That Feeling when your application is so big — and so important — that you just know there’s going to be a dramatic expansion in the storage needs sometime soon…

We think we’ve found a way to provide advanced TechOps services directly to the lines of business as code packages, as well as RESTful web services. And we think that Python is an integral part of meeting the business need for building NoSQL databases at scale.

We tried to use Chef directly, but we wanted more flexibility than seemed appropriate for the tool. The idea of constantly tweaking the recipes didn’t feel like the best way to have a tool that would reliably recreate any server.

We tried to create Chef parameters using relatively naïve Python, but this led to too many global variables and too many explicit parameters to functions. It became a unit testing nightmare.

After catching our breath, we realized that a declarative style of programming was what we needed. We didn’t invent a new DSL; we merely adapted Python’s existing syntax to our needs. A simple class structure and a metaclass definition gave us everything required to build the configuration parameter files.

Now, we can create vast server farms for the next release of “Cyclopean.”

For more on APIs, open source, community events, and developer culture at Capital One, visit DevExchange, our one-stop developer portal. https://developer.capitalone.com/
