How we improved our data import process for database updates.
The new release of Trase has brought about many visual improvements to the homepage and better entry points into the tools offered by the platform. Much more has changed under the hood though, as the platform received a deep overhaul of the database and back-end focussed on a single goal: to facilitate a better way of importing new datasets into the platform.
In the beginning was the database. That is certainly true of Trase, which started off with a unique dataset describing worldwide supply chains for Brazilian soy. These data were visualised using a Sankey diagram, which gave unprecedented insight into how commodities travel through the nodes of a supply chain. New data would follow, and new ways of presenting them would be needed, but at the time the priority was to get the data out in the open; the practical matters of expanding the dataset would have to come later. Now that time has come.
Better data management practices were the aim of an ambitious refactoring of the Rails back-end and PostgreSQL database which serve data to the website. Our starting point was a database that required a lot of manual intervention after every update to the underlying dataset: updating consisted of replacing one version of the database with another and checking whether the website still worked. It mostly didn’t.
That was largely because increasingly sophisticated visualisations required expanding the data model to include the configuration the front-end needs to display the data correctly. Yet the core dataset, which remained the source of data updates, never fully incorporated the visualisation aspect, and key information was often missing or incorrect.
To illustrate the kind of problems we were having, let’s use a vaguely relevant example of a website about cats. The original dataset contained just basic properties of cats:
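As a stand-in for that starting dataset, here is a minimal sketch in Python using an in-memory SQLite database (the table and column names are illustrative assumptions, not the real schema):

```python
import sqlite3

# In-memory database standing in for the real relational store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cats (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        breed TEXT,
        age   INTEGER
    )
""")
conn.executemany(
    "INSERT INTO cats (name, breed, age) VALUES (?, ?, ?)",
    [("Robin", "Tabby", 3), ("Maru", "Scottish Fold", 9)],
)
rows = conn.execute("SELECT name, breed FROM cats ORDER BY name").fetchall()
print(rows)
```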
That data allowed us to render a very attractive, if arguably basic, website:
Obviously, sooner or later you’ll want to present your data in new ways. How about putting the cats on a map? Easy: all we need to do is store the cats’ primary geographical locations. Let’s add that information to the “cats” table.
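In the naive version, that is just a couple of extra columns bolted onto the existing table; a hedged sketch of the change, again using SQLite for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO cats (name) VALUES ('Robin')")

# Extend the model in place with the cats' primary geographical locations.
conn.execute("ALTER TABLE cats ADD COLUMN latitude REAL")
conn.execute("ALTER TABLE cats ADD COLUMN longitude REAL")
conn.execute(
    "UPDATE cats SET latitude = 51.5, longitude = -0.1 WHERE name = 'Robin'"
)
row = conn.execute(
    "SELECT latitude, longitude FROM cats WHERE name = 'Robin'"
).fetchone()
print(row)  # (51.5, -0.1)
```

Note that nothing forces new cats to come with coordinates, which is exactly the trap described next.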
Now the cats are on the map, but the data model has changed, and those changes need to be taken into account when adding new cats to the system. Yet if the idea of geolocating cats was never fully incorporated into the source of data updates, some cats might end up stranded off the map.
This could be a big issue for the front-end, which expects every cat to have a location. The back-end does nothing to shield the front-end from data which potentially breaks the functionality of the website. Our data update process has been marred by issues ranging from “something is not displaying” to “the website crashed”, all due to mismatched data and configuration.
The root cause we identified is that our data model grew to incorporate both core data about supply chains and visualisation-related configuration, along with the ever more complex links between them, but the rules governing this new makeup of the data were never formalised or captured in any form of automated validation. That made the dataset update process extremely brittle, as well as frustrating for the managers of the source dataset. We needed a separation of concerns and a decisive overhaul of the data update process.
We started by analysing the relational database structure; I described that process here. While enforcing constraints in the database, we also took a hard look at separating the concerns present in it. The goal was to allow the configuration to be updated separately from the core data. Back to the cats database: the “blue” sections constitute core data, while the “yellow” sections are configuration.
We actually took a printout of the Trase database schema and coloured it in with felt tips to better see where our problems lay. The principle we adopted is that blue and yellow data are not allowed to live in the same table. Each such table had to be split in two or more, dividing our database model into “blue” and “yellow” tables with referential integrity constraints between them.
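Applied to the cats example, the split might look like the sketch below: core data in one table, visualisation configuration in another, joined by a foreign key that rejects dangling configuration. This is an illustrative SQLite version, not the actual Trase schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs FK enforcement enabled explicitly

# "Blue" core data: comes straight from the source dataset.
conn.execute("CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# "Yellow" configuration: visualisation details in their own table,
# tied to core data by a referential integrity constraint.
conn.execute("""
    CREATE TABLE cat_locations (
        cat_id    INTEGER NOT NULL REFERENCES cats(id) ON DELETE CASCADE,
        latitude  REAL NOT NULL,
        longitude REAL NOT NULL
    )
""")

conn.execute("INSERT INTO cats (id, name) VALUES (1, 'Robin')")
conn.execute("INSERT INTO cat_locations VALUES (1, 51.5, -0.1)")

# Configuration pointing at a cat that doesn't exist is rejected outright.
try:
    conn.execute("INSERT INTO cat_locations VALUES (99, 0.0, 0.0)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True
```

The `ON DELETE CASCADE` choice is one plausible policy: when a cat leaves the core data, its configuration goes with it.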
As a result, visualisation-related configuration no longer gets in the way of updating the core data directly from the source dataset, which makes the whole process a lot simpler conceptually and easier to automate.
The days of replacing the entire database with a new version are gone; only the “blue” section of the database is fully replaced via an importer script, while the “yellow” section is preserved between updates where relevant. The integrity constraints we’ve placed on the database do a good job of aborting an update outright upon detecting violations. This way we avoid issues in the core data that previously could creep in unnoticed.
Once new core data is in place, the configuration typically requires an update (such as geolocating any new cats). We have developed a consistency check to detect areas that require attention; for now it is meant to guide a human in amending the data.
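In the cats world, such a check could be a simple anti-join: find core records that lack their configuration counterpart. A minimal sketch, with assumed table names matching the earlier example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE cat_locations (cat_id INTEGER NOT NULL, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO cats (id, name) VALUES (?, ?)", [(1, "Robin"), (2, "Maru")]
)
conn.execute("INSERT INTO cat_locations VALUES (2, 35.6, 139.7)")

# Consistency check: core records missing their configuration.
missing = conn.execute("""
    SELECT c.name
    FROM cats c
    LEFT JOIN cat_locations l ON l.cat_id = c.id
    WHERE l.cat_id IS NULL
""").fetchall()

for (name,) in missing:
    print(f"Cat {name} needs to have a location record")
```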
In our example the data manager would get a message like “Cat Robin needs to have a location record”. We’re already thinking of ways to further automate this step where possible in our dataset, but for now we have concentrated on making manual updates easy through an admin interface, rather than ever having to touch the database directly.
The data update process is now much more controlled and reliable. It lends itself to further automation and performance improvements, but as we test this solution, which keeps unearthing previously undetected data errors, we realise we have never before had so much insight into the quality of our data, or so much confidence in it. We have overhauled the entire back-end to support this, and even though none of it is visible to visitors of the Trase website, it gives us the amazingly satisfying feeling of knowing there are no skeletons lurking in our closet.