Requirements of Modern Data Architecture

By Stephen Simpson, Senior Principal - Technical Architecture at QuantumBlack

The concept of Big Data announced itself at the beginning of the decade and soon became fashionable. It was no big surprise that organisations quickly began to realise its business potential — a 2013 report from McKinsey Global Institute estimated that Big Data could generate an additional $3 trillion in value every year in just seven industries.

For those working with Big Data, initial results were generally modest and technically harder to realise than expected. The industry soon focused on bringing Machine Learning into the process. In theory this involved ingesting closely related but disparate data sources into a Data Lake, analysing the data to uncover fresh ideas that delivered significant business value, and then providing simple, easy-to-use production tools to enable people to act upon these new insights.

However, there are fundamental challenges involved in working with data that remain stubbornly unresolved today. These include, but are not limited to, incomplete data sets poorly captured across time; inadequate data capture and storage mechanisms; the complex interleaving and cleaning of disparate, inconsistent, patchy and time-based data sources; regular delivery of new or improved services and tools into production; sharing common information consistently both within the organisation and across cross-industry supply chains; and establishing and maintaining trust across the entire infrastructure.

On top of all this, those working with data must accept that the ambition for a “single view of the truth” is unrealistic — data is simply a proxy for reality and different business problems require different proxies. When we analyse data, we’re interpreting someone else’s recollections.

Consider the advice of John Tukey, regarded as the father of Exploratory Data Analysis:

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

Tukey’s wise words — written over half a century ago — remain an apt lesson for today’s data architects. The building blocks of data projects must be dynamic and gradually evolving, rather than being completely defined before any business impact can be delivered.

While it is accepted that today’s approach to data warehouses is unnecessarily limiting, there are practical ways to improve the data architecture, which can enable the earlier and more dynamic delivery of innovative solutions. Here are seven examples of how to go about this:

1. Minimise the complexity caused by building upon, co-existing with or rationalising current EDW (enterprise data warehouse), MDM (master data management) and ODS (operational data store) tools and environments. While they do present significant limitations to the target data architecture we require, they are the starting point for modernisation initiatives: that is where knowledge and expertise are concentrated, and they have valuable tool sets.

2. Business-led initiatives should flow from right to left, from business value back to the appropriate data sources, which is not usually the case today. The data architecture needs to facilitate this process of gaining an ever-improving understanding as the business initiative definition becomes clearer. Additionally, the same data must be applicable to a wide range of use cases, each of which may have a significantly different tolerance for inaccuracy and error.

3. Knowledge capture is likely to take a “highest business-value, horizontal slice”, use case-led approach. We need the notation and tools to provide natural, quantitative and consistent recording, linking and re-use of these discoveries in order to build worthwhile business ontologies. While the notation and tooling are arriving, it’s a question of building up the expertise to understand, combine, extend and take advantage of the additional understanding they provide.

4. The data architecture needs to support continuous evolution of the system by providing improved automation capabilities. There are three key areas. Firstly, the data ingestion, preparation and storage phases need to be well layered so that downstream changes don’t always require upstream processes to be re-run. Secondly, the implementation of the machine learning systems should include a complete rebuild capability, including model serialisation and the provision of model servers at scale. Lastly, components need to be meticulously crafted and loosely coupled, so that they can be tested and deployed completely independently of each other.

5. Capturing data with an event-based architecture may be a better way of doing things than a traditional relational database-based, record-centric style. This allows natural, consistent acquisition of data value changes over time, and facilitates the interleaving of data sets, something that is important to most machine learning models. This style of architecture may well become the new system of record for transactional business systems, because it naturally standardises and simplifies increasingly onerous compliance requirements.

6. Algorithm and hardware efficiency have improved enormously in recent years, and this trend will only accelerate. The data architecture must put us in a position to support future ideas and innovations with (hopefully) the minimum of compromise and re-work.

7. We need to ensure the safety of what is produced. This includes deep, standardised, diagnostic statistical capabilities, fortified by strong governance. We need to guard against data imperceptibly changing over time, leading to stale or harmful models. Improved explainability tooling is required to understand how models produce their insights, particularly with the latest Deep Learning techniques. Our systems need to recognise and warn against potential privacy and bias issues.
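
To make the last point a little more concrete, here is a minimal sketch of one way to watch for data imperceptibly changing over time. It uses the Population Stability Index (PSI), a commonly used drift statistic; the choice of PSI, the function name `psi`, the ten-bin default and the 0.25 alert threshold are all illustrative assumptions rather than a recommended standard, and a production check would need to be tuned to each data source.

```python
import math
from bisect import bisect_right


def psi(baseline, recent, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Illustrative only: bin edges are taken from baseline quantiles, and
    empty bins are floored at `eps` so the logarithm stays defined.
    """
    ordered = sorted(baseline)
    # Interior bin edges at evenly spaced positions in the sorted baseline.
    edges = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # bisect_right returns an index in 0..bins-1 for any x.
            counts[bisect_right(edges, x)] += 1
        return [max(c / len(sample), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(recent)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: the values a model was trained on versus what it sees now.
training_values = list(range(100))
live_values = [x + 50 for x in range(100)]

DRIFT_THRESHOLD = 0.25  # assumed rule-of-thumb alert level, not a standard

drifted = psi(training_values, live_values) > DRIFT_THRESHOLD
```

A check like this would typically run on a schedule for each model input, with alerts feeding the governance process rather than silently retraining the model.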

The data formats and metadata management to underpin the above need to be determined up-front and subsequently committed to. And the loose-coupling mechanisms necessary to support regular and reliable releases need to be engineered with care. Otherwise, these matters can be addressed largely independently of each other, at a pace and with an emphasis that supports the client’s specific priorities. This iterative approach seeks to avoid the traditional huge up-front cost that takes too long to realise, offers an unclear payback, and is difficult to measure.
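
As an illustration of the event-based style described in point 5, and of the kind of data format that would need agreeing up-front, the sketch below records each change as an immutable event and rebuilds state by replaying them. The `Event` fields, the function names `state_as_of` and `interleave`, and the pump example are hypothetical and chosen only for illustration; a real system would more likely sit on a dedicated event store or log.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One immutable fact: an attribute of an entity took a value at a time."""
    entity_id: str
    attribute: str
    value: Any
    occurred_at: datetime


def state_as_of(events: Iterable[Event], entity_id: str, as_of: datetime) -> Dict[str, Any]:
    """Replay events in time order to rebuild an entity's state at `as_of`."""
    state: Dict[str, Any] = {}
    for e in sorted(events, key=lambda e: e.occurred_at):
        if e.entity_id == entity_id and e.occurred_at <= as_of:
            state[e.attribute] = e.value  # later events overwrite earlier ones
    return state


def interleave(*logs: Iterable[Event]) -> List[Event]:
    """Merge several logs into one timeline; each log must already be time-sorted."""
    return list(heapq.merge(*logs, key=lambda e: e.occurred_at))


# Hypothetical logs from two separate source systems.
operations_log = [
    Event("pump-1", "status", "running", datetime(2020, 1, 1)),
    Event("pump-1", "status", "stopped", datetime(2020, 1, 3)),
]
sensor_log = [
    Event("pump-1", "pressure", 4.2, datetime(2020, 1, 2)),
]

timeline = interleave(operations_log, sensor_log)
snapshot = state_as_of(timeline, "pump-1", datetime(2020, 1, 2))
```

Because nothing is overwritten, the same log can answer both "what is the state now?" and "what did we believe on a given date?", which is what makes this style attractive for time-aware model features and for compliance.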

We’ll be covering these topics in more detail in future blog posts, but the advice above should provide a useful starting point for any organisation exploring machine learning strategies — and asking how data architecture can be designed to deliver dynamic, innovative results.