Python Template for Data Processing

A template for Python services and CLI tools useful for DS teams

Oleksandr
DevOops World … and the Universe
31 min read · Mar 12, 2020


This article is an overview of a project template useful for creating Python console apps and utilities. The primary target field of its application is data processing. It is generic enough to be a good fit for other areas, though.

The template is called a simple Python CLI app template and it is available at GitHub as oblalex/simple-python-cli-app-template.

Why?

One may ask: “Why does anybody need yet another template at all?” The answer is: to decrease the operational and maintenance burden in the realm of modern applications of Python, especially when it comes to cloud data processing.

Recent years have seen a surge of projects in the field of data processing. This is due to the spread of cloud computing and to the hype around machine learning, deep learning, artificial cognition, and other areas people believe will somehow conjure magic solutions to their poorly defined problems.

This “somehow” implies there’s no need to write imperative code, to think through structure and logic, to predict impact and consequences, or to do many other things that are part of the everyday job and culture of software engineers.

Instead, you need to do a lot of calculus, math optimization, data thrashing and guessing: an area called “Data Science” these days. A good rule of thumb to keep in mind is that anything that calls itself a science probably isn’t, so I’d call this area “Data Scientology”. (Don’t get me wrong here: people doing hard “Data Science” are really competent in math. It would take me a couple of years to learn at least the basics of the math domains they are involved in. But the way “Data Science” solves problems is not scientific at all. Check out the “What Is Science?” chapter from “Philosophy of Computer Science” by William J. Rapaport for details. Also find time to read the rest of the book as well.)

Data scientology commonly deals with tons of data that have to be collected, validated, cleansed, joined, analyzed, stored, fed into model training and model serving, transferred to visualization dashboards, served to end-users, and included into many other activities. As a result, data scientology is tightly coupled with another area called “Big Data”.

In turn, big data offers technologies able to process petabytes of data in a reasonably short time: from tens of minutes to several hours, depending on how much data you have and how bad your infrastructure, code, and solution are. Of course, to be able to use such technologies one may need a pretty bulky infrastructure spreading far beyond a single machine.

This is where all those cloud providers come in, and now somebody has to cope with all the operational burden that follows, be it a public, private, or even hybrid cloud. Yesterday everyone simply called the folks dealing with operations “admins”, but today they are rebranded as “DevOps”, even though hardly any of them are even remotely related to the “dev” part.

As a result, data storages are spread across multiple machines and disks now: they are split into shards, duplicated as replicas, clustered by fields and so on. Somebody has to analyze data, figure out access patterns, define data transformations needed, select proper storages, devise data schemas for optimal storage usage and querying time, select geographical locations for storage, ensure data retention complies with local governmental policies, security best practices and so forth. This is a job description of “data architects”.

Finally, someone has to implement collection, serving, and all the transformations applied to data in transit and at rest. This is what “data engineers” do. These are just software engineers sandwiched between data scientologists, data architects, and admins. And this is important: usually software engineers are the only beings in those diverse gangs skilled, hardened, and paid to implement components that will be robust, maintainable, consistent across a whole system, and able to outlive the initial business goals without further human intervention.

It becomes clear pretty fast that all these people will bring a zoo of their technologies into a project. TensorFlow, Theano, Keras, PyTorch, pandas, numpy, matplotlib, AutoML, CloudML, Translate, Transcribe, Rekognition, Cloud Functions, Lambdas, Hadoop, EMR, DataProc, Spark, Beam, Dataflow, Kinesis, Splunk, Airflow, Cloud Composer, Glue, Argo Workflows, Docker, Kubernetes, Kubeflow, Container Registry, Artifactory, Jenkins, CloudWatch, CloudTrail, StackDriver, DataStore, FireStore, MongoDB, MemoryStore, Redis, ElastiCache, Persistent Disk, EBS, FileStore, EFS, S3, Cloud Storage, CloudSQL, Spanner, RDS, DynamoDB, Redshift, BigQuery, BigTable just to name a few… I’d also mention BigCSV and LittleJSON (just kidding, these two do not exist yet).

And all of these people with all of their technologies have to co-exist in the same ecosystem, understand each other, adapt their solutions to new situations, and be able to onboard new people.

Now it’s time for Python to show up. Data scientology and data processing are flooded with Python. And there are reasons for that.

Python is a pretty nice programming language: it helps one to focus on a problem and allows one to express solutions in a quite human-comprehensible manner. This is why people like it. And this is achieved by Python’s flexibility and expressiveness.

Many prefer it to other languages when performance, multithreading and type checking are not an issue. And usually they are not, because problem solving requires an ability to quickly explore data, to glue components and services together, to plan and schedule pipelines, to shape out prototypes, and so on. When performance is an issue, one implements core components via a compiled language and provides Python bindings for them (like in case of numpy or TensorFlow).

Being an easy programming language, Python has a low barrier to entry. However, software engineering is not defined by a programming language alone. Programming is just a small part of system design, implementation, product delivery, maintenance, and retirement. These require a solid level of organization, vision, and discipline, and those are not easy at all.

Unfortunately, in Pythonland there’s no way to tell what is a “standard”, “good” or “right” project organization and what components are needed for a specific solution because of Python’s flexibility: nothing is true, everything is permitted.

Oftentimes this is a struggle, unless guides and policies already exist in a team or are established company-wide. In everyday reality they do not exist. Almost no project even has a glossary for other people to understand what the team is talking about. In the last 50 years, people haven’t progressed much in using computers beyond treating them as electronic typewriters, but that is a separate story.

So, having too much freedom can be a struggle. And this is definitely a struggle for people who are familiar with software engineering but are new to Python. (The story is the same with modern JavaScript and front-end development using React, by the way.) And it’s much worse for others like data scientologists and admins, whose primary focus lies outside the software development domain. One might think it’s even worse for those newcomers invading the industry today, but it’s not: they don’t even know about the existence of things they should be worried about, so they are blessed and simply happy.

Usually all those “data science”, “machine learning” and other “big data” projects deal just with a bunch of messy CSVs of several GBs in size. They can be processed on a single bare-metal machine or a cloud VM instead of a zoo of machines and cutting-/bleeding-edge technologies. Nevertheless, hype is hype, budget burning is budget burning, and many people are pretty happy in their suffering and survival on caffeine.

And usually this is how such projects are started: a couple of people shape out a scientological core using local Jupyter notebooks or 72 tabs of an infinitely long mess in Notepad++ stored in Google Drive, and then they try to build a system around all that. Obviously, no one except them understands what’s happening. So they have to take on roles unusual for themselves, like software and data architects, engineers, and admins, to build at least a prototype.

Subsequently, many projects, and components within a single project, tend to end up with their own innovative structure and organization. This leads whole projects and infrastructures into a messy bazaar and an integration hell.

It would be nice to step aside and let them all enjoy their endeavors, but almost every such project comes to a point in time where it faces significant performance and maintenance issues it can no longer cope with.

Afterward, somebody gets hired to join “an enthusiastic team” and to “just optimize” or to “simply fix” the results of their efforts at applying “amazing technologies”.

Paraphrasing Sun Tzu, a bigger mess can be avoided if a lesser mess is dealt with systematically. And the elimination of a small software mess starts with a conscious and reasonable selection and organization of components.

If you or your team is going to start a new Python project, application, utility, or component, or if you are going to refactor one, then this article is for you.

The template proposed here can be used as-is or as a reference. It is pretty handy in the data processing world and it is generic enough to be applied in other areas successfully.

The Template

Features

The template provides essentials for creating, packaging, and managing command-line interface (CLI) applications and utilities. Among them:

  • Passing config and parameters ⚙️, overriding and validating them 🧪.
  • Setting up logging 📄.
  • Definition of a distributable package 📦 with executable commands 🕹.
  • Helpful utilities 🛠.

Usage

Once again, the template is available as oblalex/simple-python-cli-app-template. It is a template for the cookiecutter template engine, so cookiecutter has to be installed first:
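
For example, it can be installed with pip (any equivalent installation method will do):

    pip install cookiecutter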

The usage is straightforward, as with any cookiecutter template:
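
A typical invocation just passes the template's repository URL, for example:

    cookiecutter https://github.com/oblalex/simple-python-cli-app-template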

Alternatively, a gh shortcut can be used for GitHub-hosted repositories:
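
    cookiecutter gh:oblalex/simple-python-cli-app-template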

During the invocation, cookiecutter will prompt for template parameters, which are discussed in the subsection below.

Refer to cookiecutter's documentation regarding the .cookiecutterrc file and the --no-input and --replay arguments in case the user prompt is unwelcome.

Params

The template accepts several parameters for configuring a project being created. Parameters have types and default values. Refer to the template’s “Params” docs section for full reference. A brief overview of the params list includes the following:

  • project_name — a name of the project.
  • package_name — a name of a Python package to create.
  • executable_name — a name of an “eggsecutable” file to create during package installation.
  • class_name_prefix — a prefix used for package’s class names.
  • env_var_name_prefix — a prefix used for names of env vars.
  • project_short_description — a short description of the package being created.
  • project_url — a URL to the project’s repo or home page.
  • version — the project’s version.
  • author_name — a name of the package’s author or owning team.
  • author_email — an email of the package’s author or owning team.
  • author_username — a user name of the package’s author or a group name of an owning team.
  • create_author_file — a flag specifying whether to create or not an AUTHORS.rst file.
  • command_line_interface — a variant of a command-line parser to use.
  • config_file_format — a variant of a config file format to use if needed.
  • logging_time_zone — a variant of a time zone to use in timestamps for log records.
  • logging_include_hostname — a flag specifying whether to include the hostname into log records or not.
  • logging_format_json — a flag specifying whether to format log records in json or not.
  • use_pytest — a flag specifying whether to use pytest as tests runner or not.

Example Invocation

Let’s create a Python project and walk through the questions asked by the template.

Invoke the cookiecutter in a directory where a new project directory will be created, say in a projects directory:
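
Assuming such a projects directory already exists, the invocation might look like this:

    cd projects
    cookiecutter gh:oblalex/simple-python-cli-app-template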

1) The first thing being asked is the project’s name. As is the case of any cookiecutter param, a default value is specified in brackets.

The project’s name is used for documentation purposes mainly. Also, default values of several other params are derived from the project’s name. Let’s specify the project name as Example Python App and see what default values other params will get:

2) Next, a name of the target Python package should be specified. Notice how the default value is calculated from the project’s name specified previously:

This name also will be used as the project’s directory name.

The template creates only a single top-level package. If you need to create a namespaced package, say company.project.component, this should be done manually.

3) After that, a name of an executable should be set:

The application will be accessible by that name after the package installation.

4) Class name prefix specifies a prefix for names of classes created during instantiation of the template (e.g., for the name of the root exception class):

5) As for the env var name prefix, it defines a prefix for names of environment variables used to override config params:

6) Short description is… well, a brief description of the project.

This is used in the documentation, description of the project’s repository and package. It is usually displayed in package repositories such as PyPI.

7) Usually projects live in source repositories. It’s useful to specify an address where the package’s source code resides; this is optional, though:

8) All deliverable artifacts should have a version:

See PEP-440 for details. And please, don’t start version numbers with 0: your goal is to create a final working solution, so aim at the result from the very beginning. If not sure, use pre-release or developmental version suffixes.

9) Name of the package’s owner, project’s team lead, or owning group:

10) Contact address of the owner or the owning group:

11) The owner’s user name or group name:

12) Create or not an AUTHORS.rst file:

13) Select a parser of command-line arguments: built-in argparse or external click:

14) Select the format for a config file if needed:

15) Specify which time zone to use in log records:

16) Include or not the host name into log records:

17) Output or not log records in json format:

18) Configure the package to use or not to use pytest as tests runner:

Watch the whole template invocation process:
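
The original post embeds a recording of the full session. For orientation, a condensed sketch of how such a session might look with the answers discussed above (prompt texts and default values are illustrative, not the template's exact ones):

    project_name [Project Name]: Example Python App
    package_name [example_python_app]:
    executable_name [example-python-app]:
    class_name_prefix [ExamplePythonApp]:
    env_var_name_prefix [EXAMPLE_PYTHON_APP]:
    version [1.0.0]:
    Select command_line_interface:
    1 - argparse
    2 - click
    Choose from 1, 2 [1]: 1
    ...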

The figure below shows a file structure of the project just created:
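
Reconstructed from the files discussed throughout this article, the layout is roughly the following (a sketch, not an exact listing):

    example_python_app/
    ├── example_python_app/
    │   ├── __init__.py
    │   ├── args.py
    │   ├── config.py
    │   ├── exceptions.py
    │   ├── logging.py
    │   ├── logic.py
    │   ├── main.py
    │   ├── utils.py
    │   └── version.py
    ├── examples/
    │   ├── config.default.full.yaml
    │   └── config.default.minimal.yaml
    ├── requirements/
    │   ├── dev.txt
    │   ├── dist.txt
    │   ├── setup.txt
    │   └── test.txt
    ├── tests/
    ├── AUTHORS.rst
    ├── CHANGELOG.rst
    ├── MANIFEST.in
    ├── Makefile
    ├── README.rst
    ├── setup.cfg
    └── setup.py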

The Package

Let’s examine the package example_python_app created inside the project:

main.py

Intuitively, the entry point for examination is the main.py file containing the main() function. It describes how the application is being constructed and run:
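
A minimal sketch of what the generated main.py may look like for the params used here; helper names other than run() and try_to_load_config() are assumptions, not the template's literal code:

    import logging
    import sys

    from example_python_app.args import make_args_parser     # assumed factory name
    from example_python_app.config import try_to_load_config
    from example_python_app.logging import setup_logging     # assumed helper name
    from example_python_app.logic import run


    LOG = logging.getLogger(__name__)


    def main() -> int:
        args = make_args_parser().parse_args()

        config = try_to_load_config(
            config_file_path=args.config_file_path,
            config_section_name=args.config_section_name,
        )
        setup_logging(config["logging"])

        try:
            LOG.info("starting")
            run()
        except Exception:
            LOG.exception("execution failed")
            exit_code = -1
        else:
            LOG.info("finished")
            exit_code = 0

        return exit_code


    if __name__ == "__main__":
        sys.exit(main())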

The contents listed above were generated for the params used previously. The result may vary for other param values, but the plot remains the same:

  • Read config and apply overrides if needed.
  • Set up logging.
  • Run logic, exit on success, fail otherwise. Log execution steps.

Detailed steps for the example above are the following:

  1. Declare imports of system and local objects.
  2. Declare a module-level logger.
  3. Declare the main() function taking zero arguments and returning an int exit code.
  4. Build a parser of command-line arguments and try to parse them.
  5. Get a config file path and a name of a config section to read config values from.
  6. Set up logging.
  7. Try to run(…) the application logic and indicate the beginning of execution in the log. If the logic requires arguments, this is the place to pass them.
  8. Log an exception with a traceback if any error occurs. There’s no ability to handle or to recover from errors at this level or at any higher level, so just set the exit code to -1. Handle more specific errors if needed.
  9. Indicate successful end of execution in the log and set the exit code to 0.
  10. Return the exit code from the function.
  11. Allow the module to be run directly.

logic.py

The next obvious point of interest is what is being run. This is defined in the logic.py module containing the run() function:
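
A minimal sketch of such a logic.py, assuming the package and executable names from the example:

    import logging

    from example_python_app.version import __version__


    LOG = logging.getLogger(__name__)


    def run() -> None:
        # "old style" %-formatting: arguments are stringified lazily, only if the record is emitted
        LOG.info("running %s v%s", "example-python-app", __version__)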

It’s pretty self-explanatory what’s going on here, but let’s walk through:

  1. Declare imports of system and local objects.
  2. Declare a module-level logger.
  3. Declare the run(…) function returning None. If the logic requires arguments, this is the place to declare them.
  4. Log a single message containing the name and version of the executable.

    Always prefer to define log messages using the “old style” % format: this allows lazy evaluation and stringification of arguments, meaning the work is done only if the record actually gets emitted. For all other string-formatting cases, use the “new style” formatting and string interpolation for performance and readability reasons.
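
For illustration (variable names are arbitrary):

    LOG.debug("processed %s rows in %.3f s", total, elapsed)  # lazy: formatted only if the record is emitted
    message = f"processed {total} rows in {elapsed:.3f} s"    # eager: fine for non-logging string formatting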

There’s no requirement to keep the whole logic in a single module: if the module grows too big, split the logic into separate modules and group them into a package. Or extract auxiliary parts into utility modules. It’s Python, you decide.

version.py

The logic.py module from above imports the __version__ object from the local version.py module. It’s hard to recommend a generic way to track the project’s version: one may use the semantic-version package or configure the package to extract the version from git info. All of that depends on how one is going to use this information. In our case the version is just a string for simplicity:
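
So version.py boils down to a single assignment (the value shown is illustrative):

    __version__ = "1.0.0"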

exceptions.py

For the sake of decency, every distributable package should have its own subsystem of exceptions. And every subsystem of exceptions starts with a base package-specific exception class inherited from the built-in Exception:
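
With the ExamplePythonApp class name prefix from the example, that base class might look like this (the exact class name is an assumption):

    class ExamplePythonAppError(Exception):
        """Base class for all exceptions raised by the example_python_app package."""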

All other package-defined exceptions should use this class as a base. This allows handling or filtering out exceptions of a specific package when needed.

args.py

From a high-level perspective, any application can be treated like any other function. Usually, functions accept arguments that might be required or have default values.

Even if an application accepts no arguments, it’s still good to support at least the -h and --help arguments to give a minimal description and usage examples.

Arguments in the current example are parsed via the built-in argparse module. Construction of a parser is defined in the local args module:
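
A sketch in the spirit of the generated args.py; the factory function name, option names, and defaults are assumptions:

    import argparse


    def make_args_parser() -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser(
            description="Example Python App",
            formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        )
        parser.add_argument(
            "--config",
            dest="config_file_path",
            default=None,
            help="path to a config file",
        )
        parser.add_argument(
            "--config-section",
            dest="config_section_name",
            default=None,
            help="name of a config file section to use",
        )
        return parser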

This piece is pretty standard and straightforward. However, note the usage of ArgumentDefaultsHelpFormatter: it automatically adds information about default values to the help messages for each argument.

The resulting help message for the example application is listed below:
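
For the illustrative parser sketched above, it would look roughly like this:

    usage: example-python-app [-h] [--config CONFIG_FILE_PATH]
                              [--config-section CONFIG_SECTION_NAME]

    Example Python App

    optional arguments:
      -h, --help            show this help message and exit
      --config CONFIG_FILE_PATH
                            path to a config file (default: None)
      --config-section CONFIG_SECTION_NAME
                            name of a config file section to use (default: None)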

The args.py file will not be created in case click is selected as the argument parser: arguments will be defined via decorators applied to the main() function. For example:
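
A sketch of how that might look; the option names mirror the argparse example above and are not the template's exact ones:

    import click


    @click.command()
    @click.option(
        "--config", "config_file_path",
        default=None,
        show_default=True,
        help="path to a config file",
    )
    @click.option(
        "--config-section", "config_section_name",
        default=None,
        show_default=True,
        help="name of a config file section to use",
    )
    def main(config_file_path, config_section_name):
        ...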

Whether to use click or not is a matter of choice. However, having the args.py file can be handy no matter which parser is used, for example, to keep argument validation logic separated.

utils.py

Oftentimes, there is a need to keep lonesome auxiliary objects or functions, which are loosely coupled with any specific business logic and belong to abstract categories with no concise names or no names at all.

However, most of the modern knowledge bases still use dendritic data structures forcing artifacts to be put into at least some category. And software sources — representing only a single perspective of knowledge — are stored in dendritic file systems which force us to keep all entities in files. And files must have names. And there are file naming constraints set by file systems. So, this can become one of those two really hard problems.

Although it is possible to keep every auxiliary entity in a separate file, this can clutter the project’s directory pretty fast. To prevent this from happening, it is recommended to group auxiliary entities into a single file. Hence, such a file will represent an abstract superset of abstract categories.

Interestingly, it might be pretty reasonable to represent such a concept with ⊞ — a cross encapsulated by a square: the most abstract signs of a concept and an object. But that would be practically inconvenient and technically not possible.

So, proposed candidate names are utils, helpers, aids, tools, misc, etc, extras, and so on. It’s hard to tell which one to pick for a specific project to avoid collisions with the project’s domain terms, but utils is a good and pretty common choice.

In our example, utils.py contains a single function update_nested_dict() used to override values of dictionaries by mutating them:
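
A sketch of such a function; the template's actual implementation may differ in details:

    def update_nested_dict(original: dict, updates: dict) -> None:
        # Recursively merge ``updates`` into ``original``, mutating it in place.
        for key, value in updates.items():
            if isinstance(value, dict) and isinstance(original.get(key), dict):
                update_nested_dict(original[key], value)
            else:
                original[key] = value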

This function is used in config loading, which is discussed in the following subsection. You might not need it if you prefer to use alternative solutions.

Another helpful utility function, which is not included in the template, is format_f, a formatter for real numbers:
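
One plausible implementation (the signature and precision handling are assumptions):

    def format_f(value: float, precision: int = 2) -> str:
        # Render a real number with a fixed number of decimal places.
        return f"{value:.{precision}f}"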

It is optional, but it can become useful in logging, e.g.:
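
For instance:

    LOG.info("accuracy=%s", format_f(0.98765))  # -> "accuracy=0.99"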

Again, if the utils module grows, it can be converted into a package or related entities can be simply extracted into a separate module.

config.py

Configs are static inputs for applications. They can be treated as application-specific programming languages: changes in config values can affect the behavior of the same application tremendously. Achieving a good and easy-to-use configuration process can be tricky.

We’ve already faced a way to pass inputs to an application in one of the previous subsections: command-line arguments. However, they are only a part of the configuration pipeline.

Another way to pass static params to an application is to use environment variables (“env vars”). One should keep in mind that env vars are global variables and those are always frowned upon. Nevertheless, env vars can be the only way to configure an application or a function sometimes. Usually they are used to tune up user’s environment, to pass values into badly designed software, containers, lambda functions, Java applications using Spark, and so on.

Finally, it’s common to use config files to pass params to services and restartable applications. As was mentioned earlier, the Template allows setting up reading of configs from YAML or JSON files if needed.

Not surprisingly, a single application may use all three input methods at once. Obviously, method precedence should be established to avoid clashes and debugging nightmares. The precedence proposed by the template is based on the volatility of values each method allows:

  1. config file — this method has the lowest priority, as configs are not expected to be changed frequently; they can be big and slow to change. Additionally, changing config files can be impossible due to permissions or because they are burned into a Docker image.
  2. env vars — can be used to override values set by config files. Env vars themselves can be stored in files, like .env, ~/.bashrc, ~/.zshrc, /etc/profile, /etc/environment, /etc/default/service-name, and so on.
  3. CLI arguments — these have the highest priority and they override values set by config files and env vars.

While giving the lowest priority to config files might be intuitive, priorities of env vars vs CLI args are debatable and depend on how an application is going to be configured and used.

For example, env vars can be set during an application invocation, just like CLI arguments:
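
For example (the variable name is illustrative, built from the env_var_name_prefix chosen earlier):

    EXAMPLE_PYTHON_APP_LOGGING_LEVEL=DEBUG example-python-app --config config.yaml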

At the same time, CLI arguments also can be defined in files, be it a Dockerfile, a Kubernetes manifest, a systemd service definition, exec() system call or whatsoever.

The precedence of config methods proposed by the Template is based on the observation that CLI args are the easiest to change. Hence, they are supposed to change most often during development and have the highest priority.

This is not always true and the precedence might need tuning up for a specific case. To do so, one will need to change the try_to_load_config() function:
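
A sketch of the function's shape, following the steps described below; helper and constant names here are assumptions, not the template's literal code:

    import copy
    import os

    import jsonschema

    from example_python_app.utils import update_nested_dict

    # CONFIG_DEFAULTS, CONFIG_SCHEMA, and CONFIG_ENV_VAR_NAMES are module-level
    # objects described in the rest of this subsection.


    def try_to_load_config(config_file_path=None, config_section_name=None, **overrides):
        config = copy.deepcopy(CONFIG_DEFAULTS)

        if config_file_path:
            file_config = load_config_file(config_file_path)  # hypothetical YAML/JSON loader
            if config_section_name:
                file_config = file_config.get(config_section_name, {})
            update_nested_dict(config, file_config)

        env_overrides = {
            param_name: os.environ[var_name]
            for param_name, var_name in CONFIG_ENV_VAR_NAMES.items()
            if var_name in os.environ
        }
        update_nested_dict(config, env_overrides)

        cli_overrides = {key: value for key, value in overrides.items() if value is not None}
        update_nested_dict(config, cli_overrides)

        jsonschema.validate(instance=config, schema=CONFIG_SCHEMA)
        return config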

This function was already mentioned previously in the description of the main() function. It loads configs and applies overrides or throws an exception in case of errors:

  1. Take an optional path to config file.
  2. Take an optional section name. This might be useful if a single file contains parameters for several applications, like in a case of a pipeline.
  3. Take any other parameters passed from the main() function. Usually, these are CLI args. They are passed separately because they are already parsed by this moment.
  4. Take a copy of the default config values and use it as the config instance.
  5. Try to read the whole config file or a single section of it to override the existing config.
  6. Try to read env vars to override existing config.
  7. Override existing config with CLI args if provided.
  8. Validate resulting config against predefined config schema.
  9. Return resulting config object if no errors occurred before.

Two things were not mentioned before: config schema and config defaults.

Config schema is a data structure used to define all config params of an application. This provides any developer with a quick and clear understanding of how the application’s behavior can be affected.

Schemas are defined using the JSON Schema specification, which is enforced by the jsonschema library.

The template defines only a single config parameter used to configure logging:
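
In JSON Schema terms that boils down to something like the following (the constant name is an assumption):

    CONFIG_SCHEMA = {
        "type": "object",
        "properties": {
            "logging": {"type": "object"},
        },
        "required": ["logging"],
    }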

The parameter simply defines a required logging dictionary. The details of this dictionary are not specified, as the standard logging dictConfig schema is implied. Python will validate it anyway, and its structure can be understood from the exhaustive example provided by the config defaults:
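
A condensed sketch of what such defaults can look like; the real template's defaults are more elaborate, and the class and key names referencing the logging extensions below are assumptions:

    CONFIG_DEFAULTS = {
        "logging": {
            "version": 1,
            "disable_existing_loggers": False,
            "filters": {
                "hostname": {
                    # the host name injecting filter is described in the logging.py subsection
                    "()": "example_python_app.logging.HostNameInjectingFilter",
                },
            },
            "formatters": {
                "default": {
                    "format": "%(asctime)s %(levelname)-8s %(hostname)s %(process)d %(threadName)s %(message)s",
                },
            },
            "handlers": {
                "stdout": {
                    "class": "logging.StreamHandler",
                    "stream": "ext://sys.stdout",
                    "formatter": "default",
                    "filters": ["hostname"],
                },
            },
            "root": {
                "level": "INFO",
                "handlers": ["stdout"],
            },
        },
    }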

Furthermore, there is an object specifying a mapping between config parameters, which can be overridden via env vars, and the names of those vars:
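
A hypothetical shape of such a mapping (the parameter path and variable name are purely illustrative):

    CONFIG_ENV_VAR_NAMES = {
        # config parameter path -> environment variable name
        "logging.root.level": "EXAMPLE_PYTHON_APP_LOGGING_LEVEL",
    }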

Finally, config objects are validated using the jsonschema.validate() function. This provides simple validation: only the config structure and the types of values are checked. Usually, even this can prevent many issues early on. Custom validation logic can be added, of course.

logging.py

Logging is very important. And doing logging right can be very difficult. If the look of the logging defaults from the previous subsection is not convincing, check out Python’s logging flow:

Python’s Logging Flow — https://docs.python.org/3/howto/logging.html#logging-flow

What makes logging hard is its conceptual overload: it is a bus that multiplexes work output, debugging, tracing, error reporting, auditing, monitoring of progress, health, performance, resources utilization, and custom metrics. All these activities can trigger myriads of events coming even from a single application or service. Furthermore, all these events can have different levels of importance. And authors of different libraries, frameworks, applications, and services can have different reasoning behind the way they do logging if they have any reasoning at all.

Multiplexing of all that happens because many developers are still ideologically stuck in the times of teletypes. It is not a surprise there are legions of technologies trying to help people dredge anything out of the resulting mess: Python logging, logrotate, log4j, rsyslog, syslog-ng, statsd, fluentd, logstash, beats, ELK, logplex, zipkin, opentracing, GCP logging, AWS logging, sentry, datadog, graphite, and riemann to name a few.

Not coincidentally, the 12-factor methodology requires developers to throw all log messages at stdout: fire events and forget, letting admins do their stream capturing, filtering, parsing, and service plumbing. In our example stdout is also used as the output stream by default.

Besides the selection of an output stream, another important logging decision is choosing a format for log records. Should it be plain text? Which fields, in what order, and in what formats should be represented? Maybe the HTTPd format? Or maybe CSV? JSON? Protobuf? Something else?

Answering the question “who is your direct logs consumer?” will help to make such a decision. If it’s a human — go with a plain text. If it’s a machine — like a stream processing pipeline or a centralized logging service allowing fancy queries — go with JSON: humans will be able to understand messages as-is and there will be no need to write a gazillion of log parsers. The decision must be made early, as it will affect how logging is done in code.

Our template uses the plain text format by default. Log records are formatted to include the following information:

  1. Timestamp in UTC with microseconds included.
  2. Logging level.
  3. Hostname, process ID and thread name.
  4. A text message.

This can be exemplified as the following:
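
An illustrative record in that format (the values are made up):

    2020-03-12 10:15:30.123456 INFO     myhost 12345 MainThread running example-python-app v1.0.0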

Such a format is not possible to achieve with Python “out of the box”. Hence, a little trickery defined in the logging.py module is needed.

Developers used to Python 2 might be worried by giving a local module such a name, as it clashes with Python’s standard logging module. The solution used to be the special absolute_import feature. This is not needed in Python 3 anymore, as imports are absolute by default. And Python 2 is dead already, by the way.

The first trick to mention is an injector of host names: this is a simple log record filter adding a hostname string attribute to each log record:
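
A sketch of such a filter (the class name is an assumption):

    import logging
    import socket


    class HostNameInjectingFilter(logging.Filter):
        # Attaches the machine's host name to every log record passing through.
        def __init__(self, name: str = ""):
            super().__init__(name)
            self._hostname = socket.gethostname()

        def filter(self, record: logging.LogRecord) -> bool:
            record.hostname = self._hostname
            return True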

Another trick is to allow inclusion of microseconds into the date format and to allow timezone selection for output timestamps:
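
One way to achieve that is a formatter overriding formatTime(); a sketch assuming UTC as the selected time zone:

    import datetime
    import logging


    class TimeZoneAwareFormatter(logging.Formatter):
        # Renders timestamps with microseconds in the configured time zone (UTC here).
        timezone = datetime.timezone.utc

        def formatTime(self, record: logging.LogRecord, datefmt: str = None) -> str:
            timestamp = datetime.datetime.fromtimestamp(record.created, tz=self.timezone)
            return timestamp.strftime(datefmt or "%Y-%m-%d %H:%M:%S.%f")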

Alternatively, the JSON format can be used for log records instead of plain text. This is covered by the python-json-logger library with one extra little trick, which simply combines the facilities of the library with our previously defined extensions:
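
A sketch of that combination, reusing the formatter class from the previous example:

    from pythonjsonlogger import jsonlogger


    class JSONFormatter(jsonlogger.JsonFormatter, TimeZoneAwareFormatter):
        # JsonFormatter handles serialization; the time zone aware base handles timestamps.
        pass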

This formatter will output log records as JSON messages:
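
For example (field set and values are illustrative):

    {"asctime": "2020-03-12 10:15:30.123456", "levelname": "INFO", "hostname": "myhost", "process": 12345, "threadName": "MainThread", "message": "running example-python-app v1.0.0"}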

Messages in such a format may work nicely with handlers other than stdout. However, it’s important to keep in mind that JSON serialization has its price. One may need to tune performance by making a custom logging handler work in the background and by using optimized serialization libraries.

Config Examples

All possible ways to configure an application should be explorable and understandable. However, having a config schema, defaults, args, and env vars mapping defined is not the end of the configuration story.

Every application and every utility is created to be used or configured by some humans. And people shouldn’t be forced to understand programming languages, frameworks, libraries and specific philosophies like this to be able to use applications and tools.

Give them config examples for clarity. Create at least two config files:

  • one with every possible config param listed along with default values (e.g., config.default.full.yaml);
  • and one with a minimal set of params required (e.g., config.default.minimal.yaml).

Put those and others needed into the examples directory.
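
For illustration, assuming the YAML format and the logging-only schema described earlier, a minimal example might be as small as:

    # config.default.minimal.yaml (illustrative)
    logging:
      root:
        level: INFO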

Documentation

While code in Python can look pretty self-documenting, it is still only one “shade” of the whole solution and providing extra clear information can be invaluable.

Fortunately, there are many ways in Pythonland to get through Docuwoods: docstrings, doctests, Sphinx, and Read the Docs.

Notably, the de facto documentation format for Python is reStructuredText (rst), and all docs in the Template are written using it. The format has a certain learning curve, but once mastered it will serve almost all of your needs.

Check out httpie repository, as it has exceptionally good documentation. Also feel free to check out several of my repositories if you need other examples: il2fb-ds-airbridge, isotopic-logging, il2fb-mission-parser, nvidia-gpu-monitoring, python-object-extractor, verboselib, il2fb-ds-config, il2fb-heightmap-creator, il2fb-game-log-parser.

It is a good practice to keep documentation inside the docs directory. This is especially important when using documentation tools like Sphinx (example). However, the template being discussed doesn’t force anyone to use any of such tools, so the directory is not created automatically. Only stubs for essential pieces are provided: README.rst, AUTHORS.rst, and CHANGELOG.rst.

README.rst

This one is mandatory. At least describe what the project’s deliverables are, and how to install and use them.

AUTHORS.rst

A good one to have. Not for boastfulness, but to let people know whom they can contact in case of questions.

CHANGELOG.rst

Obviously, all changes can be described via commit messages in a version control system like git. But in practice, not so many people are organized enough to write concise and meaningful commit messages and to avoid merge commits when they are not needed.

Ever seen commit messages like Fix stupid errors, le comma., Merge branch 'staging', Fix for dev deployment, Fix for dev deployment, again, Increment image version, Merge branch 'staging', Yes commas hate me., fix, Merge branch 'staging', Merge branch 'staging', Clean up, roll back, test, upd, quick fix? I bet you’ve seen.

So, the first thing to do is to collect hand-written and hand-signed acknowledgments from your teammates stating they have read and understood the commit messages guidance. Well, it doesn’t have to go that far, but you got the point.

The second thing to do is to add a changelog file with a brief and useful description of important changes. Don’t be too Googley and avoid nonsense like:

  • “Fixed what needed fixing and squished some bugs.”
  • “Bug fixes, stability improvements, repairs to time-space continuum, etc. etc.”
  • “We added shinier bells and whistles. Because who doesn’t like a good upgrade??”
  • “Fixed bugs, improved performance, took out the garbage, mowed the front lawn, and now we need a little nap.”

Be specific. Again, take a look at CHANGELOG.rst from httpie. Pay attention to how version numbers are linked to intertag comparison pages. This is brilliant.

Distribution and Delivery

No matter who is going to use or configure your applications and tools, all they need are finished reusable artifacts, which they can treat as black boxes. And this is how they should treat the software you and your team create. You don’t want to know or watch how sausages are being made, do you?

The quintessence of any application development is packaging it as a finished good, which one will not be ashamed to deliver into production or to share with a community.

There is an extensive official Python Packaging User Guide already, so there’s no need to explain this activity here. Alternatively, it is possible to freeze sources or to create a system package. This article covers only the key points related to the template proposed.

LICENSE

It is important to choose a license for open-source projects and to provide the repository with a LICENSE file. Even if you do not care, stating that explicitly is good. However, this is barely applicable to internal projects. As data processing is usually a purely internal activity, the Template does not include a LICENSE file.

External dependencies

For a quite long time, setuptools with its setup() function was the de facto packaging tool in Pythonland. It provides several options to declare dependencies, for example, install_requires, setup_requires, tests_require, dependency_links, and extras_require.

Additionally, pip was Python’s de facto package manager. It uses one or several requirements.txt files to define external dependencies for an application. It’s not uncommon to see virtualenv or venv hanging around as well.

Both pip and setuptools use an interchangeable dependency format, but this does not make life easier when one tries to define dependencies via setuptools for the first time.

Nowadays, the focus is shifting from requirements.txt to Pipfile and Pipfile.lock generated and managed by Pipenv. Other alternatives to setuptools are Flit, hatch, and Poetry.

However, people say there is a difference between requirements.txt/Pipfile and the options of setuptools: the former two are preferred for applications and the latter is preferred for libraries.

As for the Template, it does not care what you are going to distribute and defines its dependencies in several pip-friendly files inside the requirements directory:

  • requirements/dev.txt — dependencies needed or useful during development only.
  • requirements/dist.txt — dependencies required for runtime.
  • requirements/setup.txt — dependencies needed during installation.
  • requirements/test.txt — testing dependencies.

These files can be used by pip directly. Also dependencies defined within them are parsed and passed to setuptools during package creation.

setup.py

setup.py is the build script for setuptools: it declares the package and the metadata needed for the package distribution. The declaration is done by an invocation of the setup() function, and here we are going to explore how this is done in the Template. (A similar setup() function is also available in the standard library’s distutils, but setuptools is the de facto choice.)

Before describing the steps listed below, let’s look at auxiliary functions and variables used.

__version__ — package’s version simply imported from version.py:

README and CHANGELOG — contents of the respective files:
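
Roughly, these can be obtained like this (a sketch; the template may read the files with extra encoding handling):

    from example_python_app.version import __version__

    with open("README.rst") as f:
        README = f.read()

    with open("CHANGELOG.rst") as f:
        CHANGELOG = f.read()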

find_packages() — a function imported from setuptools to avoid manual listing of packages and subpackages:

itertools.chain — a standard function used to combine several sequences into a single one:

INSTALL_DEPENDENCIES, SETUP_DEPENDENCIES, TEST_DEPENDENCIES — lists of non-package dependencies extracted from requirements/*.txt files. Usually those are links to repositories.

INSTALL_REQUIREMENTS, SETUP_REQUIREMENTS, TEST_REQUIREMENTS — lists of package dependencies (package name & version) extracted from requirements/*.txt files.

BUILD_TAG — contains a short hash name of the latest commit in the current branch (if available) and if the current branch is not a stable branch. Useful for developmental builds.

Finally, it’s time to look at the invocation of setup():
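
To make the numbered notes below concrete, here is a rough sketch of the call's shape; the values are placeholders and the exact keyword set is an assumption, not the template's literal setup.py:

    from itertools import chain

    from setuptools import find_packages, setup

    setup(
        # 1. definition of the package
        name="example-python-app",
        version=__version__,
        description="An example Python application",
        long_description=README + "\n\n" + CHANGELOG,
        url="https://github.com/example/example-python-app",

        # 2. namespaces, packages, and "eggsecutables"
        packages=find_packages(exclude=["tests", "tests.*"]),
        entry_points={
            "console_scripts": [
                "example-python-app = example_python_app.main:main",
            ],
        },

        # 3. dependencies
        install_requires=INSTALL_REQUIREMENTS,
        setup_requires=SETUP_REQUIREMENTS,
        tests_require=TEST_REQUIREMENTS,
        dependency_links=list(chain(
            INSTALL_DEPENDENCIES, SETUP_DEPENDENCIES, TEST_DEPENDENCIES,
        )),

        # 4. trove classifiers
        classifiers=[
            "Programming Language :: Python :: 3",
            "Topic :: System :: Logging",
        ],

        # 5. author info
        author="Author Name",
        author_email="author@example.com",

        # 6. release tagging options (one possible way BUILD_TAG might be applied)
        options={"egg_info": {"tag_build": BUILD_TAG}},

        # 7. zip safety
        zip_safe=True,
    )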

  1. Definition of the package. Pretty self-explanatory.
    Pay attention to the name — this can be any name constructed from letters, numbers, _ , and -.
  2. Listing of Python namespaces, packages, and “eggsecutables” provided by the distribution package.
    Pay attention to console_scripts — this defines a mapping between locations of functions and names of executables used to invoke them.
  3. Definition of dependencies.
  4. Listing of trove classifiers. Can be used to filter packages in a packages repository. For example, by Topic :: System :: Logging. The full list of classifiers is available here.
  5. Information about the package’s author. Can be a human or a team.
  6. Release tagging options.
  7. Tells whether the package is zip-safe or not. In most cases, packages are zip-safe, meaning they can be installed and loaded by Python as a single ZIP file. This can help with utilization of the file system. But there can be issues if a package expects to be able to access its source or data files as normal files.

setup.cfg

Static options and metadata passed to the setup() function can be configured via a setup.cfg configuration file. This can reduce boilerplate code in certain cases.

MANIFEST.in

If your package needs to include more than just sources, you have to specify a MANIFEST.in file. A common checklist is:

  • README.rst
  • CHANGELOG.rst
  • AUTHORS.rst
  • requirements files (requirements/*.txt)
  • localization files (.po, .mo)

Additionally, one might want to include docs and tests directories.

Auxiliary stuff

Try to do your best to prevent any non-developer from gaining access to your sources. It can be tricky in case of Python, so at least ensure non-developers don’t have write access to your software repositories.

And by non-developers I usually mean admins. Don’t allow them to dig in the application’s guts, explore and change whatever they like just because that will make a job easier for them. You need no merge commits from totally unrelated infrastructure automation and configuration activities. You need nobody shooting your sources with sed: not in a repository nor elsewhere beyond your field of responsibility (like after packaging an application but before releasing it into production). You need nobody treating your docs and examples as production configs. You need nobody pushing secrets, keys, and access tokens into source repositories. You need no staging and production configs, artifacts, and dependencies in your development environment. You need no Terraform, Ansible, Chef, Puppet, Jenkins, Docker, Kubernetes, GitLab, Airflow, start.sh and other stuff flickering around. You need no dependency on a Spark cluster on top of Kubernetes spinning up for 10 minutes just to develop and test processing of a 100 MiB CSV. No way.

If anybody needs to define configuration and provisioning steps for your applications, let them do this in their own repos. They can always include yours as dependencies or subrepos.

As for your repositories, define config examples. Define meaningful args descriptions. Define documentation and usage examples. Define and bake a package. Release it to a public or a private artifacts repository. And seal all that with a welder.

Development Commands

The last thing worth mentioning is the Makefile provided by the Template. It includes common and pretty straightforward commands useful during development:

clean — remove all build, test, coverage, and Python artifacts.
clean-build — remove build artifacts.
clean-pyc — remove Python file artifacts.
clean-test — remove test and coverage artifacts.
lint — check style with flake8.
test — run tests quickly with the default Python.
dist — builds source and wheel package.
install — install the package to the active Python’s site-packages.
install-e — install the package into the active Python’s site-packages in editable mode via pip (useful during development).
dev-deps — install development dependencies via pip.

Application Invocation

Finally, it’s time to see how the application we created can be invoked. To be able to do this, the application needs to be installed, for example, via the previously mentioned make install or make install-e commands. As a result, an “eggsecutable” will be created with the name provided earlier:

This “eggsecutable” is just a normal Python script with the executable permission set. All it does is load and execute the function mapped to its name:
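
Depending on the pip/setuptools version, the generated wrapper looks roughly like this:

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import re
    import sys

    from example_python_app.main import main

    if __name__ == "__main__":
        sys.argv[0] = re.sub(r"(-script\.pyw?|\.exe)?$", "", sys.argv[0])
        sys.exit(main())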

Calling this executable by name will run the application:
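
For instance (the arguments shown are the illustrative ones from the argparse sketch):

    example-python-app --help
    example-python-app --config examples/config.default.minimal.yaml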

The Bottom Line

This was a truly long read covering best practices and pieces of philosophy behind the creation and distribution of Python command-line applications, condensed into the simple-python-cli-app-template.

It is not a “silver bullet”, however: one may use it as-is or adapt it for specific environments and release processes. For example, one may not need all that setuptools stuff and may go with Docker images cooked one’s own way. This is Python and you have freedom. You are free to break the rules if you know what they are.

The Python environment can look like a pretty messy place, and there’s still much room for it to improve. I hope this article can help those of you stepping into Pythonland find landmarks and hints to ease the routing of your own path.
