Where do you start on an architecture?

It depends on where you begin.

You can take a bottom-up or a top-down approach. In the past, I have focused on defining relational database entities. This means defining what each table is, what columns and types it contains, and the relationships between tables along with their 1:1, 1:N and N:N cardinalities.

I always thought that relational database normalization was one of the most cut-and-dried processes of software development. There’s a lot of mathematical rigor behind the set theory that drives normalization, though most analysts internalize the normalization rules pretty quickly and rarely think consciously about them.

However, there is one subjective part of data analysis: defining the entities. How you carve them up depends on the business domain.

Take, for example, addresses. If I’m a goods manufacturer that ships products, my database contains a table for addresses that will have 2–4 address lines, a city, a region, a country, and perhaps a postal code. Each address can be thought of as atomic: I deliver a package to said address, and that address is represented in one row of a table.

If I’m designing a car navigation system, however, a single table with one row per address is a very inefficient way to store data. In such a system I have millions of addresses, many of which share the same streets, towns, regions, etc. So, each country has regions, each region has towns, each town has streets and each street has house numbers. That means one table for countries, one for regions, one for towns, one for streets and one for house numbers. And no postal codes.
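To make the contrast concrete, here is a minimal sketch of both designs using Python's built-in `sqlite3` module. The table and column names are my own illustrative assumptions, not a schema from any real system, and a production navigation database would certainly need more than this.

```python
import sqlite3

# Hypothetical schema for the shipping case: one row holds a whole address.
SHIPPING_SCHEMA = """
CREATE TABLE address (
    id          INTEGER PRIMARY KEY,
    line1       TEXT NOT NULL,
    line2       TEXT,
    city        TEXT NOT NULL,
    region      TEXT,
    country     TEXT NOT NULL,
    postal_code TEXT
);
"""

# Hypothetical schema for the navigation case: each level of the hierarchy
# gets its own table and points at its parent, so shared streets, towns and
# regions are stored only once.
NAVIGATION_SCHEMA = """
CREATE TABLE country (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE region (
    id         INTEGER PRIMARY KEY,
    country_id INTEGER NOT NULL REFERENCES country(id),
    name       TEXT NOT NULL
);
CREATE TABLE town (
    id        INTEGER PRIMARY KEY,
    region_id INTEGER NOT NULL REFERENCES region(id),
    name      TEXT NOT NULL
);
CREATE TABLE street (
    id      INTEGER PRIMARY KEY,
    town_id INTEGER NOT NULL REFERENCES town(id),
    name    TEXT NOT NULL
);
CREATE TABLE house_number (
    id        INTEGER PRIMARY KEY,
    street_id INTEGER NOT NULL REFERENCES street(id),
    number    TEXT NOT NULL
);
"""


def build(schema: str) -> sqlite3.Connection:
    """Create an in-memory database from the given schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(schema)
    return conn


def table_names(conn: sqlite3.Connection) -> list:
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```

Calling `table_names(build(SHIPPING_SCHEMA))` gives a single table, while the navigation schema gives five. The data is the same kind of data in both cases; only the business domain changed.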

Determining how to divvy up related data among tables and columns, as well as defining relationships (not to mention keys and indexes), is what takes the most time in data analysis. Once you have that settled, and have described the processes that handle it (another longish endeavor), the code practically writes itself.

One can take an object-oriented approach as well, which is kind of going from the middle out. In OO analysis, you define a hierarchy of atomic data components, but instead of tables and columns you have classes and properties. The breakdown of the data is generally one of containment rather than relation (association).

Once the object model is defined, you can string the objects together as parameters in method signatures, group those methods into component interfaces, and so on up to the top level. Conversely, you can annotate or otherwise describe how the data of each object is to be persisted in tables and columns (or as documents in a NoSQL database).
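As a rough sketch of what that looks like, assuming a hypothetical slice of the address model: the classes below contain one another rather than pointing at each other through keys, and a small interface takes them as parameters. All class and method names here are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Street:
    name: str
    house_numbers: List[str] = field(default_factory=list)


@dataclass
class Town:
    name: str
    # Containment: a town owns its streets; there is no foreign key.
    streets: List[Street] = field(default_factory=list)


class AddressLookup(Protocol):
    """Hypothetical component interface built from the object model."""

    def house_numbers(self, town: Town, street_name: str) -> List[str]:
        ...


class InMemoryLookup:
    """One possible implementation that walks the contained objects."""

    def house_numbers(self, town: Town, street_name: str) -> List[str]:
        for street in town.streets:
            if street.name == street_name:
                return street.house_numbers
        return []
```

For example, with `town = Town("Springfield", [Street("Main St", ["1", "3"])])`, calling `InMemoryLookup().house_numbers(town, "Main St")` returns the list held by that street, and an unknown street name returns an empty list.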

The third approach is to define your top-level interfaces and the resources (data objects) they take and return, then build the controllers to handle the interface operations, which in turn talk to the business logic, and so on down to the persistence layer. There’s some transformation and repackaging of the data along the way, but the evolution is traceable from top to bottom and from bottom to top.

This is, in fact, the approach I am taking now: I’ve elected to document a RESTful API using Swagger (specifically, swagger-node). The OpenAPI Specification gives me the vocabulary to define not only the URIs that identify resources to be accessed through HTTP, but also the parameters and response data.

This is all written in YAML (originally “Yet Another Markup Language”, now officially “YAML Ain’t Markup Language”), which can be parsed into all sorts of useful things, like a colorful interactive visualization of the API and model, or a basis for scaffolding an application in which the router, controllers and mocks are either stubbed out or at least given an organized folder structure. In my case, swagger-node builds or connects to such components, and my document becomes my interface contract to clients using my API (and between internal components as well).
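To give a feel for what such a document contains, here is a minimal sketch of a Swagger 2.0 definition. The real thing would live in a YAML file; I have written it as a Python dictionary only so the example runs with the standard library alone. The path, operation and model names are hypothetical, and the `x-swagger-router-controller` extension is shown the way I understand swagger-node to use it for routing to a controller, so treat that detail as an assumption.

```python
# Hypothetical API definition; every name below is illustrative only.
SPEC = {
    "swagger": "2.0",
    "info": {"title": "Address API (hypothetical)", "version": "0.1.0"},
    "paths": {
        "/towns/{townId}/streets": {
            # Vendor extension naming the controller module (assumption).
            "x-swagger-router-controller": "streets",
            "get": {
                "operationId": "listStreets",
                "parameters": [
                    {
                        "name": "townId",
                        "in": "path",
                        "required": True,
                        "type": "integer",
                    }
                ],
                "responses": {
                    "200": {
                        "description": "Streets in the town",
                        "schema": {
                            "type": "array",
                            "items": {"$ref": "#/definitions/Street"},
                        },
                    }
                },
            },
        }
    },
    "definitions": {
        "Street": {
            "type": "object",
            "required": ["id", "name"],
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        }
    },
}

# Keys of a path item that denote HTTP operations in Swagger 2.0.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def list_operations(spec: dict) -> list:
    """Return sorted (METHOD, path, operationId) tuples found in the spec."""
    operations = []
    for path, item in spec["paths"].items():
        for method, operation in item.items():
            if method in HTTP_METHODS:
                operations.append(
                    (method.upper(), path, operation["operationId"])
                )
    return sorted(operations)
```

Walking the document like this is roughly what the tooling does on a larger scale: the paths become routes, the operation IDs become controller functions, and the definitions become the resource models shared by client and server.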

Why did I take this approach? As I first said, “It depends on where you begin”, and I am beginning from a legacy implementation that meets all the necessary requirements to be defined as a Big Ball of Mud. There really is no data model, even though there is a database, there is business logic, and there is a web site sitting on top of all this. But there are no formal API definitions, no well-defined objects, not even a real database schema (no foreign keys, sorry). Separation of concerns is not to be found.

By using REST, documentation, and generated scaffolding, I am hoping to push a good design all the way down to the components, methods, and database. It will be painful, but I think it is the only approach that can be taken with little risk of failure.

Wish me luck.