Demystifying Configuration in Python

Tips and tricks to seamlessly handle configuration and settings in your Python project

Enrico Deusebio
cgnal-tech
10 min read · Dec 28, 2020


Introduction

Any data science project (from a simple analysis pipeline to a more complex application involving AI engines that support data-driven decisions) with any ambition to go beyond a one-off analysis and/or a notebook implementation requires some sort of configuration and settings management.

In particular, in any data-science project, configuration settings can be very useful to

  1. Specify the data-bindings for the application, pointing to the right database, tables, and files (UC1), which may very well differ between dev, certification, and production environments.
  2. Specify, change, and update the parameters that control the training of AI models, as well as where new models should be stored (UC2).
  3. Specify and select the models to be used in production or in any prediction pipeline (UC3).

Handling configurations properly is a need that clearly arises in all the use-cases we deal with. Finding a common, suitable, structured, and efficient solution for managing them is therefore extremely important, as it provides a single standard backbone for all projects.

In particular, all the projects we develop tend to share a common standard structure:

  • data
    A folder containing static files to be used by the application. Generally, this folder is ignored by git.
  • pkg
    Python package to be installed with all classes, functions, and abstractions needed in the scripts/notebooks.
  • scripts
    Minimal Python scripts that use the functionality developed in the Python package pkg
  • notebooks
    Data exploration and reporting notebooks that use the functionality developed in the Python package pkg

Requirements

First, we define the set of requirements that we would like our configuration system to be able to handle.

  1. Handle defaults and application configuration (R1). Those coming from a Java/Scala background are very familiar with the concept of having a defaults.conf and an application.conf file, i.e. a central set of default configurations, integrated into the code base, that can be overwritten by application-specific configuration. For instance, as far as data-binding needs (UC1) are concerned, when taking the very same application from development to certification to production environments, most of the settings might stay the same (the names of the tables, for instance), except for the address/name of the database or the user that needs to access it and run the application. Thus, the default values can be overwritten in the different environments by an application-specific file, whereas most values stay unchanged. This also becomes very important when dealing with permissions and credentials, which should be kept secret and specified only at the application level.
  2. Hierarchical in structure (R2). Settings can be numerous, and a hierarchical structure definitely helps in providing some sort of order and better readability of the code. Having, for instance, a section that provides all data bindings (UC1), a section that provides the parameters to be used in training (UC2), and a section that specifies which model ought to be used in production (UC3) can greatly simplify the code and make each parameter far more understandable than very long names flattened into a single-level structure.
  3. Package and Python Integration (R3). The configurations should integrate easily with the Python package being developed. One of Python's great features is its unbeatable ability to allow fast data exploration/analysis and to try things out on the fly. Therefore, the configuration system should allow seamless integration of the data-sources (UC1) with the Python package. Ideally, reading files/data should be as easy as importing a dataset with sklearn
    `from pkg.dataset.images import my_data`
    or
    `from pkg.models.rnn import my_neural_net_classifier`.
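Requirement R3 can be sketched with a tiny, hypothetical module (the names `pkg/dataset/images.py`, `_load_dataset`, and `my_data` are illustrative, not part of any real package): a dataset is resolved once at import time and exposed as a module-level attribute, so callers can simply write `from pkg.dataset.images import my_data`.

```python
# Hypothetical contents of pkg/dataset/images.py, sketching the
# import-style access of R3. In a real package, _load_dataset would
# read the path from the data-bindings section of the configuration
# (UC1); here we fake a payload to keep the sketch self-contained.

def _load_dataset(path: str) -> dict:
    # Stand-in for actual file/database access.
    return {"source": path, "rows": [[0, 1], [1, 0]]}

# Module-level binding: resolved once, at first import of the module.
my_data = _load_dataset("this/is/my/root/folder/input")

print(my_data["source"])  # this/is/my/root/folder/input
```

The design choice here is simply that module import triggers the configured data loading, which is exactly what makes `from pkg.dataset.images import my_data` work.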

Available solutions

In the following we dive into the solutions we have reviewed and the final pattern we have adopted and implemented for our Python projects.

Having in mind the requirements above, we reviewed different solutions for handling and managing configurations. A good review of all the solutions, with a code snippet showing how to apply them, is available here (https://martin-thoma.com/configuration-files-in-python/). In the following, we rather focus on the pros and cons of the different approaches, concerning the requirements outlined above.

  1. Python files (.py extension). This consists of using plain Python files within the package to provide all settings for the application. The settings are therefore directly installed and integrated (R3). Hierarchy (R2) can be easily achieved via classes and subclasses, and different configurations for dev, certification, and production environments can be provided via different classes. However, these must be defined up-front, before installing the package, with little ability to modify them afterward. Changing the model (UC3) or some parameters for the training (UC2) means re-installing the package. Not very practical indeed. Defaults and application configuration (R1) are poorly handled with this choice. Besides, this way of handling configurations is somewhat discouraged by the Python community.
  2. INI files (.ini extension). INI is a file extension for an initialization file format, as the name implies, used historically by Microsoft Windows. INI files are plain text (ASCII) and are used to set parameters for the operating system and some programs. INI files are becoming increasingly standard in the Python community for handling configurations, and the main Python 3.x library for handling configuration, configparser, uses INI files by default. configparser allows for the specification of multiple defaults and application INI files (R1) and provides a one-level-deep hierarchical structure, with sections containing key-value pairs (R2). Deeper nested structures are however not allowed and, more importantly, although standard, the INI format does not support values other than scalars, booleans, and strings. More complex structures, such as lists or arrays, cannot be used as the value of key-value pairs. Specifying lists would thus require some sort of workaround, such as splitting on a character (e.g. ",") or using multiple configuration keys; either way, not very elegant or general.
  3. JSON files (.json extension). JSON files, in the form of nested dictionaries, resolve both issues of INI files by allowing nested structures (with no limitation on the depth) (R2) and arrays/lists as values of the key-value pairs. A library that allows reading default and application configuration files does exist (R1), so JSON would have been a good candidate for handling configuration… if YAML did not exist.
  4. YAML files (.yaml or .yml extension). YAML is in fact a superset of JSON (for an example of a YAML file please go to https://yaml.org/) which builds on top of all the advantages of JSON's hierarchical structure and flexibility. Moreover, it also provides a set of very convenient extra features, among which:
    (a) it is visually easier to read;
    (b) it can reference other items within the file using "anchors";
    (c) it can be fully integrated with Python code by forcing the types of values, using user-defined functions, and allowing direct class instantiation within the file.
    Today, several libraries allow you to parse YAML files effortlessly; in the following we use https://github.com/MartinThoma/cfg_load.
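The list workaround mentioned for INI files in point 2 can be made concrete with a minimal configparser sketch (the section and key names here are invented for illustration): lists must be encoded as delimited strings and split, and type-converted, by hand.

```python
import configparser

# Illustrative INI content: the hidden_layers "list" is just a
# comma-separated string, since INI values can only be scalars.
raw = """
[training]
learning_rate = 0.01
hidden_layers = 64,32,16
"""

parser = configparser.ConfigParser()
parser.read_string(raw)

# configparser returns everything as strings; splitting the list and
# converting the element types is entirely up to the caller.
layers = [int(x) for x in parser["training"]["hidden_layers"].split(",")]
print(layers)  # [64, 32, 16]
```

This is precisely the kind of manual plumbing that YAML's native lists make unnecessary.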

Implementing our configuration system

Once the standard file format has been chosen, we then proceed with the implementation of the configuration system.

First of all, using the API provided by cfg_load (https://github.com/MartinThoma/cfg_load), we create some utility functions to read and merge defaults and application configuration files:

import os
import sys
import yaml
import cfg_load
from functools import reduce
from typing import List


def load(filename: str):
    '''
    Load file
    '''
    return cfg_load.load(filename, safe_load=False, Loader=yaml.Loader)


def get_all_configuration_file(application_file: str = 'application.yml') -> List[str]:
    '''
    Get all configuration files available to the application
    '''
    confs = [os.path.join(path, application_file)
             for path in sys.path
             if os.path.exists(os.path.join(path, application_file))]
    env = [] if 'CONFIG_FILE' not in os.environ else [os.environ['CONFIG_FILE']]
    print(f"Using Configuration files: {', '.join(confs + env)}")
    return confs + env


def merge_confs(filenames: List[str], default: str = "defaults.yml"):
    '''
    Merge the configuration files
    '''
    print(f"Using Default Configuration file: {default}")
    return reduce(lambda agg, filename: agg.update(load(filename)),
                  filenames, load(default))

Basically, the code above merges the provided defaults.yml with all files named application.yml available in the application paths, and with an optional CONFIG_FILE specified via an environment variable, with the following rule of precedence:

defaults.yml < application.yml < CONFIG_FILE
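This precedence rule can be illustrated with plain dictionaries (a sketch only: cfg_load's Configuration.update implements its own merge semantics, and the keys below are invented). Later files in the reduce chain win on conflicting keys, while untouched defaults survive.

```python
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"mongo": {"host": "localhost", "db_name": "dev_db"}}
application = {"mongo": {"db_name": "prod_db"}}

# Later dictionaries take precedence, mirroring
# defaults.yml < application.yml < CONFIG_FILE.
conf = reduce(deep_merge, [defaults, application])
print(conf)  # {'mongo': {'host': 'localhost', 'db_name': 'prod_db'}}
```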

As mentioned above, besides its flexibility and readability, YAML also allows powerful integration with Python code through:

  1. Use of “anchors” to be referenced within the same configuration file
  2. Type of values
  3. User-defined Functions
  4. Direct Instantiation of User-defined Classes

The usage of these 4 features can be seen in this simple example. Stripping use cases UC1 and UC2 down to the bone, imagine you have created a model and you want to apply it to a specific dataset, binding it to certain input and output files.

The configuration file, to the very essentials, should therefore allow the specification of the model and the data bindings to be used.

Therefore, we create two classes: one for the model specifications and one for the data bindings by extending the yaml.YAMLObject class:

from yaml import YAMLObject


class DataBindings(YAMLObject):
    def __init__(self, input: str, output: str):
        self.input = input
        self.output = output

    def __repr__(self):
        return "%s(input=%s, output=%s)" % (
            self.__class__.__name__, self.input, self.output
        )


class Model(YAMLObject):
    def __init__(self, model_path: str):
        self.model_path = model_path

    def __repr__(self):
        return "%s(model_path=%s)" % (
            self.__class__.__name__, self.model_path
        )

Since we will need to specify many paths (for the input/output files and the names), it would be good for our configuration to adapt to the different platforms it may run on. Therefore, for our convenience, we define a function (called a tag-handler in the YAML specs) that, by invoking the os.path.join function of the Python standard library, builds within the YAML file paths compatible with the platform where it runs:

import os


def joinPath(loader, node):
    seq = loader.construct_sequence(node)
    return os.path.join(*seq)

All tag-handlers must be registered with the YAML library before parsing the YAML files, via

yaml.add_constructor('!joinPath', joinPath)

We now have all the ingredients to build a platform-independent configuration file, defaults.yml, that instantiates classes directly:

cgnal:
  root: !!str &root this/is/my/root/folder
  data: !!python/object:config.DataBindings
    input: !joinPath [*root, input]
    output: !joinPath [*root, output]
  model: !!python/object:config.Model
    model_path: !joinPath [*root, model]

It is worth noting two extra powerful features in this file:

  1. !!str forces the following value to be a string; otherwise an exception is raised.
  2. &root defines an "anchor" that can later be referenced via *root. This is very useful to define the root path only once, and then reference it when building, through the joinPath tag-handler (which makes them platform-independent), the full paths for the input/output files and the model.
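How an anchor resolves at parse time can be seen in a standalone sketch, using PyYAML directly (assumed installed; the keys below are invented for illustration): the scalar anchored by &root is substituted wherever the *root alias appears.

```python
import yaml  # PyYAML, assumed available

# &root anchors the scalar once; every *root alias below resolves to
# the same value when the document is parsed.
doc = """
root: &root this/is/my/root/folder
input: *root
paths:
  - *root
  - other/folder
"""

conf = yaml.safe_load(doc)
print(conf["input"])     # this/is/my/root/folder
print(conf["paths"][0])  # this/is/my/root/folder
```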

For the sake of simplicity, all the Python code has been placed in a single file, named config.py, which might represent the configuration module available to all projects.

Great! We now have our configuration system in place, ready to be used!

Configuration System in Action

The code above can be seen in action by opening an IPython shell and parsing the file above using the utility functions we have defined:

Python 3.6.5 (default, May 14 2018, 18:42:25)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.7.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from config import *

In [2]: conf = load("defaults.yml")

In [3]: conf
Out[3]: Configuration(cfg_dict={'cgnal': {'root': 'this/is/my/root/folder', 'data': DataBindings(input=this/is/my/root/folder/input, output=this/is/my/root/folder/output), 'model': Model(model_path=this/is/my/root/folder/model)}}, meta={'filepath': '/Users/deusebio/work/tutorials/yaml/defaults.yml', 'creation_datetime': datetime.datetime(2020, 12, 8, 23, 3, 15, 661681), 'last_access_datetime': datetime.datetime(2020, 8, 8, 23, 3, 45, 659043, tzinfo=<DstTzInfo 'Europe/Rome' LMT+0:50:00 STD>), 'modification_datetime': datetime.datetime(2020, 12, 8, 23, 3, 15, 662018, tzinfo=<DstTzInfo 'Europe/Rome' LMT+0:50:00 STD>), 'parse_datetime': datetime.datetime(2020, 12, 8, 21, 3, 48, 338088, tzinfo=<UTC>), 'load_remote': True}, load_remote=True)

In [4]: conf["cgnal"]["data"]
Out[4]: DataBindings(input=this/is/my/root/folder/input, output=this/is/my/root/folder/output)

In [5]: conf["cgnal"]["model"]
Out[5]: Model(model_path=this/is/my/root/folder/model)

where you can see that our configurations have been directly parsed as objects into our code.

…towards a fully object-oriented configuration handling

Already with this at hand, one can achieve a very flexible and handy configuration system. However, with little effort, one can go a step further and create an object-oriented implementation for managing configuration which can be used across projects.

To do so, we create, within the config.py module, a BaseConfig class with some utility methods for sub-leveling the configuration object:

from cfg_load import Configuration


class BaseConfig(object):
    def __init__(self, config: Configuration):
        self.config = config

    def sublevel(self, name: str) -> Configuration:
        return Configuration(self.config[name], self.config.meta,
                             self.config.meta["load_remote"])

    def getValue(self, name: str):
        return self.config[name]

This object can then be used to derive some sub-configuration classes. For instance, we may imagine having a particular structure for defining file-system configuration or MongoDB configuration, as persistence layers:

class FileSystemConfig(BaseConfig):
    @property
    def root(self):
        return self.getValue("root")

    def getFolder(self, path):
        return self.config["folders"][path]

    def getFile(self, file):
        return self.config["files"][file]


class MongoConfig(BaseConfig):
    @property
    def host(self):
        return self.getValue("host")

    @property
    def port(self):
        return self.getValue("port")

    @property
    def db_name(self):
        return self.getValue("db_name")

    def getCollection(self, name):
        return self.config["collections"][name]

These objects can then be used for reading persistence-layer configurations across all projects in a standard way.
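To see the pattern in isolation, here is a self-contained sketch that swaps cfg_load's Configuration object for a plain dict (so it runs without any dependencies; the values are invented): the sub-configuration class turns raw key lookups into readable attribute access.

```python
# Minimal, dependency-free variant of the classes above, using a plain
# dict in place of cfg_load's Configuration.
class BaseConfig:
    def __init__(self, config: dict):
        self.config = config

    def getValue(self, name: str):
        return self.config[name]


class MongoConfig(BaseConfig):
    @property
    def host(self):
        return self.getValue("host")

    def getCollection(self, name: str):
        return self.config["collections"][name]


mongo = MongoConfig({"host": "localhost",
                     "collections": {"input": "input_collection"}})
print(mongo.host)                    # localhost
print(mongo.getCollection("input"))  # input_collection
```

Any project that stores its Mongo settings under the same schema can read them through the same class, which is what makes the pattern reusable across projects.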

Once all of this is in place, integrating the configuration in your Python package becomes a piece of cake. Expanding the example above, we add a storage specification to our configuration file and reference the relevant "anchors":

cgnal:
  storage:
    fs:
      root: !!str &root "this/is/my/root/folder"
      folders:
        models: &models_path !joinPath [*root, models]
      files:
        output_file: &output !joinPath [*root, output]
    mongo:
      host: localhost
      port: !!int 27017
      db_name: "ingested_data"
      collections:
        input: &input "input_collection"
  data: !!python/object:config.DataBindings
    input: *input
    output: *output
  model: !!python/object:config.Model
    model_path: !joinPath [*models_path, model]

The structure of the configuration file is application-specific and can be mirrored within the specific Python package, in a module we will refer to as app.py.

Here we can then create dedicated objects to mirror the hierarchical structure of the configuration file:

class StorageConfig(BaseConfig):
    @property
    def mongo(self):
        return MongoConfig(self.sublevel("mongo"))

    @property
    def fs(self):
        return FileSystemConfig(self.sublevel("fs"))


class CGnalConfig(BaseConfig):
    @property
    def storage(self):
        return StorageConfig(self.sublevel("storage"))

    @property
    def data(self):
        return self.getValue("data")

    @property
    def model(self):
        return self.getValue("model")


configuration = CGnalConfig(
    BaseConfig(merge_confs(get_all_configuration_file()))
    .sublevel("cgnal")
)

And we can now use the configuration system in an easy, structured fashion. To spice things up a bit, we also add another file, my_app.yml, provided via the CONFIG_FILE environment variable, where we override the MongoDB database name.

cgnal:
  storage:
    mongo:
      db_name: "production_database"

The merged configuration (defaults + application-specific configuration) can finally be loaded and used as follows.

deusebio$ export CONFIG_FILE="my_app.yml"
deusebio$ ipython
Python 3.6.5 (default, May 14 2018, 18:42:25)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.7.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from app import configuration
Using Configuration files: my_app.yml
Using Default Configuration file: defaults.yml

In [2]: configuration.storage.mongo.db_name
Out[2]: 'production_database'

In [3]: configuration.storage.mongo.getCollection("input")
Out[3]: 'input_collection'

In [4]: configuration.data
Out[4]: DataBindings(input=input_collection, output=this/is/my/root/folder/output)

In [5]: configuration.model
Out[5]: Model(model_path=this/is/my/root/folder/models/model)

WOW! Wasn’t this magic?

Conclusion

Using YAML, its features, and some object-oriented programming, we have implemented a robust and flexible configuration system that can be used across different projects. It abstracts a standard schema to be re-used, fully satisfies requirements R1, R2, and R3 listed above, and seamlessly addresses use-cases UC1, UC2, and UC3.

That’s all Folks! Hope you have enjoyed the article and you will start using YAML straight away!

If you liked this post and you’re a smart data scientist or data engineer have a look at our open positions in CGnal.

