Documenting your Hive schemas

In my current job as Data Engineer I encounter one typical issue in software development, documentation.

I seen this in many other projects, as developers we should write documentation of our outcome, but as you know it is usually boring or super easy to forget.

Problem

After building an ETL and store the data in some new table we need to create a new Hive schema to allow people to query those tables via Hive or Impala.

Currently, to share that knowledge with other teams, like analysts or data scientist, we need to manually update the Wiki where we hold the documentation.

That is hard to maintain because:

  • There is not way to enforce documentation, unless, someone periodically checks the wiki and compares with the actual data schema in the whole DWH.
  • It’s easy to update an existing schema and then to forget about new changes and do not update the docs.
  • It’s hard to write a well documented schema, we are currently using a Wiki tool and it’s super painful and slow to write everything in the format we want. We have a template, even though, it takes time to have everything in place.

Solution

Taking PHPDocs, JavaDocs or AnyOtherLanguageDocs as inspiration I built Documentr, a tool that helps you to write docs in a simple way, helping you to have the docs up to date, always.

Those are some advantages I found using this tool:

  • Easy way to write all the relevant info for that schema in the schema itself.
  • It enforces documentation in the Code Review process, if you forgot about the docs, the Pull Request is not going to be merged.
  • This tool can be automated in a way that after each commit in the repo where you store your schemas a pipeline can be executed and update the docs.
  • This tool displays which schemas are not properly documented, so again, you can enforce people to take care of it.
  • It generates graphs of your table so you can easily understand the table structure and its relationships.

How it works

With this tool you can easily transform this schema:

Schema without documentation

Into a documented schema, just like this:

Documented schema

As you can see we extend the regular comment in the SQL language adding some methods, at the moment we support those:

Table level:

  1. @author : Defines who created the schema.
  2. @version : Current version of the schema, everyone can be in the same page.
  3. @description : Small text describing the purpose of this table.

Field level:

  1. @reference : Defines to which other database/table/field this field is related to.
  2. @example : Defines an example value for this field.

After the table has been documented, we can execute the documentr.py script and generate a new documentation release.

At the moment we use some funky regular expressions to parse your schema, but I am working in a new release using Antlr, to give a scalable and elegant solution.

Some funky regular expressions

I am still working to make this work with all the possible combination of fields and types.

When it parses all the tables the script stores the results in JSON files as metadata files, to be used later on to build the website/graphs. In this way we have a simple way to handle all the information stored in your schemas.

Example of generated metadata

Last step generates the actual documentation website. At the moment we have a single default template, but you can build your own just following the example of the existing one. It uses Jinja2 as template engine.

Autogenerated website with all the docs

During that process if you defined the parameter --with-graphs it will generate the graphs displaying the relationships of that table with others. For this purpose I decided to use the PyDot library.

Graph showing the relationships of fact_tables.active_customers with other tables.

This project it’s currently in an alpha phase, we expect to have some stable version with final features in few weeks.

How to execute

To execute this process you basically need a folder with all your schemas. You can use as many subfolders as you want, the script is going to find all the sql files in the given directory.

So, those are the possible parameters you can use:

  • --path : Defines were your schemas are.
  • --engine : The database engine you are using, at the moment it only supports Hive.
  • --metadata-destination : Where we are going to store the intermediate metadata.
  • --docs-output : Where the script is going to store the final docs webpage.
  • --template : Which template you want to use, by default it uses the default template.
  • --with-graphs : Boolean value, if True it generates the graph for each table.

Generating the graphs

As I stated previously, I decided to use PyDot to generate the graphs. It is an easy tool to build an image for a given graph.

In order to make PyDot works you need to install a couple of dependencies, one is given in the requirements.txt but the other one should be installed in your computer.

We have to install GraphViz , in MacOs with Brew you can execute this command in your terminal: brew install GraphViz .

Basically we need to build a Graph given the nodes and the edges (relationships of each node). Here you can find an example of how it works:

The result looks like this:

The result of the previous script

Conclusion

In my opinion documentation it is a very important topic and we need to think seriously about how to improve it.

In my experience documentation it is super important when you are new in the company, or if you are working in a new department.

If you are able to share some docs and give some background about all your systems, how to find the data, how it looks, etc… that person is going to be productive much faster.

I will appreciate to receive some feedback or contributions in the GitHub repo.