ABN AMRO’s Data Integration Architecture

ABN AMRO Developer Blog
March 13, 2018

By Piethein Strengholt

With this blog post it's my pleasure to offer you a look at the cool initiatives we are working on at ABN AMRO. I want to emphasize that our Data Integration Architecture addresses a common problem area: the ideas discussed in this article are not restricted to ABN AMRO, nor are they industry-specific. The audience is anyone with an interest in data: data architects, data engineers or solution designers.

ABN AMRO’s Digital Integration & Access Layer Architecture

The architecture that we have developed helps our architects, engineers and solution designers to pick the right building blocks to deliver value for our business and customers. With this architecture we strongly believe we can improve our agility while keeping control over our data integration and distribution. We also believe that this pioneering work will advance the practice of our enterprise architecture. Interested? Please keep on reading!

The concept of connecting Ecosystems

Before I start talking about our future-state architecture and initiatives, it's good to go back to the point where our journey started. Almost two years ago my colleagues and I were challenged to transform the current Data Warehousing Architecture so that it meets current and future business requirements and supports the data-driven culture within ABN AMRO in an agile way. Besides the business requirements for analytics and business intelligence, the team also needed to respect data management principles (including governance), legal obligations and our internal policies. The requirements also included constraints around streaming data, usage of unstructured and external data, embedding of analytics in our operational processes and API usage.

The modern architecture must meet all these different requirements, but at the same time fit into an ecosystem with FinTechs, external data providers and consumers. This led us to the belief that an “architecture for ecosystems” is key to achieving success in the area of data integration. We strongly believe that data will become much more distributed, especially as we move to the Cloud and interact and collaborate more and more with external parties. Most likely we’ll reach a point in the future where most of the data we use is external rather than internal. To keep control of our data distribution and usage in this large space, we need an “architecture for ecosystems”.

Current situation
When we started, we followed the good practice of identifying and examining the existing situation. Our existing architecture is similar to what most enterprises have and is best illustrated by the conceptual diagram below:

Besides the opportunity to leverage the latest Big Data and analytics trends, we also must keep control from a Data Management perspective. Data lineage is an important objective: whenever the data schema changes, we need to be able to track the data and the changes. Judging the truthfulness and quality of the data is also very important. The typical Data Warehouse design is based on the 1990s principle of bringing all the data together and integrating everything first. Taking huge amounts of external and unstructured data into account, this approach is outdated and no longer viable. From a governance perspective it is also important to have clear insight into ownership. Allowing data providers to have control over distribution and consumption is an important new aspect. And of course, we must not forget agility.

Our ‘every application has a database’ hypothesis

Before we move on, I would like to share our hypothesis. Our first hypothesis is that every application (at least in the context of a banking application) that creates data has a ‘database’ (an organized collection of application data). In our view, even stateless applications that create data have ‘databases’. In this scenario the database typically sits in RAM or in a temporary file, rather than being persisted on disk in a ‘traditional’ fashion.

Consequently, when we have two applications, we hypothesize that each application has its own ‘database’. When interoperability between these two applications takes place, we expect data to be moved from one application to the other.

Common understanding of integration
Another crucial aspect when it comes to data movement is that data integration is always right around the corner. Whether you do ETL (Extract, Transform and Load) or ELT (Extract, Load and Transform), virtual or physical, batch or real-time, there’s no escape from the data integration dilemma. This data interoperability and integration aspect frames our architecture. The reason is that an application’s database schema is designed to meet that application’s specific requirements. Since the requirements differ from application to application, the schemas are expected to differ as well, and data integration is always required when moving data around.
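
As a minimal, hypothetical illustration (none of these field names come from actual ABN AMRO systems): two applications that both hold customer data will typically model it differently, so moving a record from one to the other already implies a transformation step.

```python
# Hypothetical provider and consumer schemas for the same customer data.
# Field names and the mapping are illustrative only.

provider_record = {          # schema of application A (the data provider)
    "cust_id": "C-10293",
    "name_first": "Jan",
    "name_last": "Jansen",
    "dob": "1985-07-14",     # ISO date as a string
}

def to_consumer_format(record: dict) -> dict:
    """Transform a provider record into the consumer's own schema."""
    return {                                  # schema of application B (the data consumer)
        "customerId": record["cust_id"],
        "fullName": f'{record["name_first"]} {record["name_last"]}',
        "birthDate": record["dob"],
    }

print(to_consumer_format(provider_record))
```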

Data Provider and Data Consumer
For our integration architecture we have adopted and adapted the philosophy of ‘Service Orientation’ and TOGAF’s Integrated Information Infrastructure Reference Model (III-RM). An application or a system is either a data provider (producer of data) or a data consumer (consumer of data). Since we’re part of a larger ecosystem, we expect that data providers and consumers can also be external parties.

With this fundamental concept of data providers and consumers, we defined a set of principles (a sketch of how an agreement between them could be captured follows the list):

  • Clear ownership
  • Data Quality is maintained at the source, which is the provider’s responsibility
  • Understandable data, which means definitions, labels and proper metadata
  • No data consumption without a purpose
  • Don’t do integration when it’s not required
  • If no data integration with other sources is required, solve the issue on your own side
  • Data consumers can become data providers when distributing data; if so, they must adhere to the same principles
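
To make these principles tangible, a Data Delivery Agreement (a concept that returns later in this post) could be captured as a metadata record along the following lines. This is a hypothetical sketch, not our actual metadata model; all names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DataDeliveryAgreement:
    """Hypothetical sketch of a Data Delivery Agreement record."""
    dataset: str                    # the data being delivered, e.g. a table or topic
    provider: str                   # owning application: clear ownership
    provider_owner: str             # accountable team at the source
    consumer: str                   # consuming application
    purpose: str                    # no data consumption without a purpose
    quality_checked_at_source: bool = True            # quality is the provider's responsibility
    definitions: dict = field(default_factory=dict)   # labels/definitions -> understandable data

agreement = DataDeliveryAgreement(
    dataset="customer_addresses",
    provider="CRM-App",
    provider_owner="crm-data-owner@example.org",
    consumer="Mortgage-Analytics",
    purpose="Address verification for mortgage applications",
    definitions={"postal_code": "Dutch postal code, format 1234AB"},
)
print(agreement)
```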

First view of our ‘Digital Integration and Access Layer’ Architecture

With the concept of data providers and consumers in place, how do integration and interoperability work? In the middle we have our ‘solution’, which we call our Digital Integration and Access Layer (DIAL). Let’s walk through some of its aspects.

Access: From a data consumer’s perspective we ideally want to create a single place, or layer, where data consumers can ‘explore’, access and query the data in a consistent manner, at any time and at any speed. ‘Make data available’ is the motto. Data consumers ideally shouldn’t have to worry about availability. Whether the data must be physically present in this layer depends on the non-functional requirements.

Integration: A crucial part of the architecture is transformation/integration. In our DIAL architecture we set a hard requirement that the data transformation between data providers and consumers is done only once. So, no initial transformation to an enterprise model: no ‘IBM Information FrameWork’ or canonical languages that are often called ‘Esperanto’. In our approach, data is either in a provider format or in a consumer format, and consumers set the requirements. We accept that harmonisation of data and an additional step might be needed if data heavily overlaps, but in our architecture this is only allowed at a consumer or domain level. By doing so, we strongly believe agility will increase significantly. By letting the enterprise model go, data providers and consumers can change at their own speed, and everything is loosely coupled.

Metadata: How does a data consumer understand what the data means if no enterprise model is used? This is where our metadata comes into the picture. Metadata is essential. For the ‘understandable data’ principle we use business metadata; for integration and transformation we require lineage metadata. By making our architecture metadata-driven, we support the ‘reuse’ of data by other consuming applications, because the metadata provides the necessary insight. As a result, all data consumers have access to the enterprise metadata catalogue, from which they can see the available data, schemas, definitions, lineage, ownership, quality, list of sources, etc.

Security: The final aspect is security. The DIAL architecture also takes care of the data delivery agreements between Data Providers and Data Consumers. The routing of the data is based on the metadata agreements and labels. This allows data providers to be in control of the distribution, because whenever the metadata changes, the routing changes.
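
A hedged sketch of what metadata-driven routing could boil down to: data is only distributed to consumers that have a matching agreement in the metadata, so changing the metadata changes the routing. The record layout and function below are hypothetical.

```python
# Hypothetical routing check: data only flows to consumers that have a
# registered agreement in the metadata repository for that dataset.
agreements = [
    {"dataset": "customer_addresses", "provider": "CRM-App", "consumer": "Mortgage-Analytics"},
    {"dataset": "payments", "provider": "Payments-App", "consumer": "Fraud-Detection"},
]

def allowed_consumers(dataset: str) -> list:
    """Derive the routing for a dataset purely from the metadata agreements."""
    return [a["consumer"] for a in agreements if a["dataset"] == dataset]

# Adding or removing an agreement in the metadata immediately changes the routing.
print(allowed_consumers("customer_addresses"))  # ['Mortgage-Analytics']
print(allowed_consumers("payments"))            # ['Fraud-Detection']
```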

Digital: We also want to apply machine learning/AI to our metadata. Eventually it should become a self-learning semantic layer that allows all parties to interact with it directly. This layer will also include a taxonomic service that relates the data to our business capability models, technical models and so forth, so we can better see its true potential and value.

Engineering the Architecture with Architecture building blocks

From an engineering perspective we have distinguished the following capabilities and patterns to distribute & transform/integrate the data:

Raw Data Store (RDS)

The Raw Data Store acts as a ‘read-only cache’ from which data can be taken. Think of it as a ‘Command Query Responsibility Segregation’ (CQRS) extension of the operational system. Because we don’t do any upfront transformation, the format is ‘raw’, which means that the data structure is inherited (based on architecture guidelines) from the source. With ‘raw’ we also imply that no new data can be created and that the context of the data cannot be changed. If either of these happens, we expect consumers to extract the data, and new ownership will be required.

The benefit of the RDS is that developers can refactor the operational system without being required to change the RDS, and queries against the RDS don’t add load to the operational system. The RDS must be kept up to date, which can be done via ingestion techniques like batch, Change Data Capture (CDC), micro-batches, replication or synchronization.
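
The post doesn’t prescribe specific CDC or replication tooling, but as a hedged sketch, keeping an RDS up to date via change data capture amounts to replaying a stream of change events onto a read-only copy. The event format and in-memory store below are hypothetical stand-ins.

```python
# Hypothetical CDC apply loop: change events from the operational system are
# replayed onto the Raw Data Store, which consumers then query read-only.
raw_data_store = {}   # stand-in for an RDS table keyed by primary key

def apply_change_event(event: dict) -> None:
    """Apply a single change event (insert/update/delete) to the RDS copy."""
    key = event["key"]
    if event["op"] in ("insert", "update"):
        raw_data_store[key] = event["row"]   # structure is inherited 'raw' from the source
    elif event["op"] == "delete":
        raw_data_store.pop(key, None)

# A micro-batch of captured changes, e.g. delivered by a CDC or replication tool.
for change in [
    {"op": "insert", "key": "C-1", "row": {"cust_id": "C-1", "city": "Amsterdam"}},
    {"op": "update", "key": "C-1", "row": {"cust_id": "C-1", "city": "Utrecht"}},
]:
    apply_change_event(change)

print(raw_data_store)   # {'C-1': {'cust_id': 'C-1', 'city': 'Utrecht'}}
```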

From a Data Governance perspective the Data Provider owns and controls the data in the RDS, which also means controlling access: determining which people or applications have access. Data Delivery Agreements are also part of Data Governance. If a contract has been made on a ‘table’, the data provider knows that there is a dependency. The provider can still engineer or refactor the operational system, but some backwards compatibility must be provided.

We envision multiple RDSs for different use cases, and different RDSs can also share the same technology platform. The Raw Data Store is technology agnostic, which means that the RDS can be a relational system or a document store, depending on the data structure. Cassandra, Hadoop (Hive 2), MongoDB and Redshift, to mention a few, are all valid platforms. A data provider can extend itself to one or multiple of these environments. This makes the RDS an ideal place for consumers, because it allows differentiation in the way data can be consumed (large volumes versus small volumes at higher speed). In a scenario where multiple RDSs sit in the same environment, it is logical to also bring the master and reference data to that shared environment. The metadata is responsible for distinguishing the transactional, master and reference data. An environment with multiple RDSs is also the place where coherence and integrity checks across systems can be performed, which we expect to have a positive effect on data quality.

Service Orientation

For real-time application-to-application communication, or for volumes that can be taken directly out of the application or system, we use lightweight integration. We use the term ‘service orientation’. Request/response is the most well-known pattern. Decoupling and integration happen by putting a communication bus or an integration component in between; examples are Enterprise Service Buses and API Gateways. Regarding the metadata in this architecture, we expect all services to be registered in a service registry and owned by a data provider. Consumers should be listed as well, following the same Data Delivery Agreement philosophy. The transformations/translations should go into the metadata store, so for Service Orientation we follow the same thinking as for the Raw Data Stores.
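
A minimal sketch of registering a service with its owning provider and its listed consumers. The registry API below is hypothetical; it only illustrates that the same Data Delivery Agreement thinking applies to Service Orientation.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceRegistration:
    """Hypothetical entry in the service registry."""
    service: str                    # e.g. an API exposed via the gateway
    owner: str                      # the data provider owning the service
    consumers: list = field(default_factory=list)   # registered consumers (agreements)

registry = {}

def register_service(service: str, owner: str) -> None:
    registry[service] = ServiceRegistration(service=service, owner=owner)

def register_consumer(service: str, consumer: str, purpose: str) -> None:
    # Consumption is only registered together with its purpose, as per the principles.
    registry[service].consumers.append({"consumer": consumer, "purpose": purpose})

register_service("GET /customers/{id}", owner="CRM-App")
register_consumer("GET /customers/{id}", consumer="Mobile-Banking", purpose="Show profile")
print(registry)
```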

Streaming

The final pattern is the streaming data pattern. Whenever a change of state happens, an event or trigger is created. The event/trigger is forwarded to a streaming platform, from which distribution towards data consumers takes place. Data can also be persisted in a message queue. In the DIAL architecture this ‘queue or key/value store’ is also called a ‘Raw Data Store’ whenever it is used to facilitate multiple consumers. In this architecture, the topics, subscriptions and distribution are always controlled via the metadata.
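
The post doesn’t name a specific streaming platform; as a hedged sketch assuming a Kafka-compatible broker and the confluent-kafka Python client, publishing a state-change event to a topic resolved from metadata could look roughly like this. The topic names, broker address and metadata lookup are hypothetical.

```python
import json
from confluent_kafka import Producer   # assumes a Kafka-compatible platform; not prescribed by the post

# Hypothetical metadata lookup: which topic a dataset's events are routed to.
topic_for_dataset = {"customer_addresses": "crm.customer_addresses.v1"}

producer = Producer({"bootstrap.servers": "localhost:9092"})   # placeholder broker address

def publish_change(dataset: str, key: str, payload: dict) -> None:
    """Publish a state-change event; the topic comes from the metadata, not from code."""
    topic = topic_for_dataset[dataset]
    producer.produce(topic, key=key, value=json.dumps(payload).encode("utf-8"))

publish_change("customer_addresses", "C-1", {"cust_id": "C-1", "city": "Utrecht"})
producer.flush()   # block until the event has been delivered
```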

Combining the different patterns

We acknowledge there may be a slight overlap between the patterns, but when the different patterns are combined, the Integration Architecture looks as follows:

Patterns in DIAL are also complementary. Events or requests on the API Gateway can be input for Streaming and can be ingested into the Raw Data Store. Every pattern has a set of consuming patterns. For the Raw Data Store, as an example, consuming patterns can be ETL, ODBC access, pull after poll, push subscription on files or direct access via Business Intelligence solutions.
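
As one hedged example of a consuming pattern, ODBC access to a relational RDS could look as follows. The DSN, credentials, table and column names are placeholders, not actual ABN AMRO objects.

```python
import pyodbc   # ODBC access as one of the consuming patterns

# Placeholder connection string and table; adjust to the actual RDS platform.
connection = pyodbc.connect("DSN=raw_data_store;UID=consumer_app;PWD=secret")
cursor = connection.cursor()

# The consumer queries the RDS copy, so no load is put on the operational system.
cursor.execute("SELECT cust_id, city FROM crm_customer_addresses WHERE city = ?", "Utrecht")
for cust_id, city in cursor.fetchall():
    print(cust_id, city)

connection.close()
```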

Integrated Data Store
On the right side of the DIAL Architecture you see the data consumers’ solutions and the concept of an ‘Integrated Data Store’ (IDS). In our future architecture we want all future applications to rely on a single application database only. Databases that store data for multiple applications should be avoided at all times.

The IDS stands for a large variety of use cases, e.g. Business Intelligence, Analytics, operational applications, etc. The application database is set up specifically to satisfy and address the specific business requirements. Consequently, the data model is expected to be very specific and can vary from highly normalized to dimensional. The integration steps can also vary, from a single integration step to situations where additional steps are required (data cleansing, additional harmonisation, etc.).

All these new types of applications, which directly consume, transform and store the data, we call Integrated Data Stores. Since the data schema is changed, we expect new ownership and consider the new data a new ‘golden’ source. All these new applications are required to be registered in our metadata repository, which we call the List of Golden Sources (LoGS). We have also set this principle of registration for our transactional/operational systems.
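
A hedged sketch of registering an Integrated Data Store in the List of Golden Sources; the record layout is hypothetical and only illustrates the registration principle (new ownership, upstream providers, purpose).

```python
from dataclasses import dataclass

@dataclass
class GoldenSourceEntry:
    """Hypothetical entry in the List of Golden Sources (LoGS)."""
    application: str        # the IDS (or operational system) being registered
    owner: str              # new ownership, because the schema/context changed
    derived_from: list      # upstream providers, feeding the lineage picture
    purpose: str            # the business requirement the IDS addresses

list_of_golden_sources = []

list_of_golden_sources.append(GoldenSourceEntry(
    application="Mortgage-Analytics-IDS",
    owner="mortgage-analytics-team@example.org",
    derived_from=["CRM-App", "Payments-App"],
    purpose="Risk reporting for the mortgage portfolio",
))
print(list_of_golden_sources)
```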

We acknowledge that the DIAL Architecture abandons the Enterprise Data Model in favor of higher agility. This is also in line with our ‘connecting ecosystems’ thinking. The Enterprise Data Model in our architecture has made way for disciplines like metadata management, data governance and data quality. By having Data Management in place, we supervise the data distribution and foster data reusability.

Data movement across chains
Going back to our DIAL reference architecture, we also see an arrow on the right side which goes all the way back to the Digital Integration and Access Layer. This is because a data consumer can also become a data provider when data is distributed again. So, when applications want to share or distribute data with other applications, the patterns of the DIAL architecture must be reused. To illustrate the data interoperability between applications, see the picture below:

When data distribution takes place, it must always use the principles and patterns of the DIAL architecture. Since the DIAL architecture relies on metadata, we can keep track of the data no matter what pattern is used. The architecture guidelines make an exception for data distribution within the ‘bounded context’. The ‘bounded context’ roughly sets the responsibilities and boundaries of the domain or business area. When the context or responsibilities change, decoupling using DIAL is required.

Metadata components
To flesh out the details of the metadata stream, please have a look at the picture below.

Each RDS is expected to sync all its schema information with the metadata repository. The DIAL Architecture has ‘centrally managed’ ETL capabilities, which write the lineage to the metadata repository automatically, but a data consumer can also pick its own pattern. In non-automated situations, the metadata lineage will have to be provided manually. Although only the RDSs have been visualized in the image, we’ll use the same approach for Service Orientation and Streaming.
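
A hedged sketch of the lineage record that a ‘centrally managed’ ETL step could write to the metadata repository automatically (and that a consumer using its own pattern would have to provide manually); the structure and names are illustrative only.

```python
from datetime import datetime, timezone

def lineage_record(source: str, target: str, transformation: str) -> dict:
    """Build a lineage entry linking a source dataset to a target dataset."""
    return {
        "source": source,                   # e.g. an RDS table
        "target": target,                   # e.g. an IDS table
        "transformation": transformation,   # description or reference to the ETL job
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

metadata_repository = []   # stand-in for the central metadata repository

# Written automatically by the centrally managed ETL, or supplied manually otherwise.
metadata_repository.append(
    lineage_record(
        source="rds.crm_customer_addresses",
        target="ids.mortgage_analytics.customer_dim",
        transformation="join with reference postal codes; deduplicate on cust_id",
    )
)
print(metadata_repository)
```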

What is the relation to MicroServices?

You might wonder: what is the relation between microservices and the DIAL architecture? The microservice architecture is an application architecture, while our DIAL architecture is an integration architecture between applications. A microservice is an independently deployable unit, which is part of an application; many microservices together form an application. The application level (bounded context) is exactly where we draw the line. Within the boundaries of the application our developers have a certain freedom (so they can use gRPC, Thrift, etc.), while across applications we require the DIAL patterns to be used for decoupling. In certain scenarios we also expect that the patterns used in our DIAL architecture and the microservice architecture can overlap. When a shared infrastructure, like Kubernetes, is used, we expect the metadata to be there to assign which microservices belong to which application.

Wrap up

For ABN AMRO, the Digital Integration & Access Layer is the architecture that guides our architects, developers, engineers and solution designers, so they know how to deliver the highest value for the business. To summarize, the main advantages are:

  • Clear insight into the data supply chain
  • Insight for both data providers and consumers into data consumption and into consumers’ requirements and responsibilities
  • Much higher agility, since we cut out the additional integration step needed in the current architecture and remove dependencies on other domains
  • Much easier access to and discovery of data
  • Insight into the meaning, quality and ownership of data
  • Much better security, because with labels at attribute level we can enforce attribute-based access
  • The opportunity to leverage the latest trends and developments much more quickly

About the author(s):
Piethein Strengholt, Technology Architect
Piethein is a Technology Architect for ABN AMRO. He is part of a high-performing team of technology enthusiasts with a passion for the latest developments and trends. Are you interested in joining us on this journey? Please let us know!

Many thanks to Bernard Faber, Henk Vinke, Dave van Wingerde, Fabian Dekker and Reijer Klopman for their cooperation.

Want to help to build the bank of the future?
Take a look at our vacancies within IT! What will you do?
