Hazy: Synthetic Data to fuel Rapid Innovation

Dr Alexander Mikhalev
Nationwide Technology
Nov 19, 2020

The banking industry is rapidly changing and the need to innovate to remain competitive has never been more critical. Transformation is being driven by factors such as new entrants, ease of switching, digital banking services as well as technological developments in other sectors which are leading to increased expectations from people about how they manage their money. However, the way banks and building societies are innovating is also changing: in-house innovation teams are increasingly working with third parties to innovate faster by capitalising on their specialised expertise and applications. This approach offers huge potential for banks and building societies to tap into additional capability faster and more cost-effectively but faces a significant challenge: the ability to share data.

Sharing data for analysis is an operational requirement for financial institutions enabling them to gain insights that directly support business imperatives including innovation, fraud detection and credit risk amongst many others. Such insights enable in-house teams and third parties to build, shape and deliver propositions derived from understanding customer behaviours based on their transactions.

Data sharing is quite rightly subject to strict governance, security, regulation and legislation, such as the Europe-wide General Data Protection Regulation (GDPR), the Payment Card Industry Data Security Standard (PCI-DSS), the California Consumer Privacy Act (CCPA), the Data Protection Act (UK) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA, US). Third-party integrations are also mandated in the UK by Open Banking and by open Application Programming Interface (API) regulations such as the second Payment Services Directive (PSD2). In some cases, these regulations make it impossible to share data across regional borders or between organisations, even though doing so would allow even greater insights. Pandemic health analysis, for example, depends on exactly this kind of cross-border, cross-organisation sharing.

Synthetic data is a new paradigm for sharing information safely and responsibly for innovation in financial services. Hazy’s software uses Artificial Intelligence and Machine Learning (AI / ML) to create synthetic data from securely held customer data, which does not leave its protected environment. The software extracts the statistical information and relationships within the data, but the synthetic output contains none of the original data, so it cannot be traced back in any way to the source, meaning that customers are 100% protected. Synthetic data can therefore be used freely and safely by internal teams and third parties to analyse and validate commercial innovations quickly.

Nationwide Building Society and Hazy have worked together to address these challenges head-on, meeting three objectives that remove major barriers to sharing transactions safely and quickly with third-party partners:
1. To create synthetic data that preserves the complexities of the original data sufficiently for behavioural analysis of current account transactions
2. To substantially reduce the time and cost of creating safe data from months to days
3. To share such data without risk via the cloud.

This is the first time that synthetic data has been shown to preserve the time-sensitive nuances of customer banking transactions in a form that can be shared safely with external parties in a production environment. It is also transformational for Nationwide: it proves that synthetic data is sufficiently representative of real data to increase the speed of innovation, and it sets a benchmark for driving data agility and eliminating security concerns about sharing data.

The Challenge

Companies face multiple challenges to sharing data. Key amongst these is the ability to transfer the patterns of consumer behaviour in the data needed to feed the analytics they want to run, but without the need to transfer the real data. Another key challenge is that they want to do this without forcing their analytical partners to ingest an entirely new type of data structure, eg aggregate data. In other words, they want a drop-in replacement for the real data that has the same schema and properties.

Techniques such as masking and anonymisation, which are typically used to protect the privacy of customers’ data, have known weaknesses including:

  • Not preserving key statistical relationships in the original data and referential integrity
  • Being a slow and resource-intensive process
  • The risk of being ‘unmasked’ to reveal original data (eg, through linkage attacks)

The inherent weaknesses of masked data are one of the barriers to sharing data more freely, and as such Nationwide currently only approves its use in certain (highly secure) circumstances.

A further challenge is that the time taken to create masked data depends on complexity and size whilst there are also limits to the quality and utility of the output compared to production data. This is an industry challenge as the whole process can take six months or more limiting the number of such data sets per year which significantly reduces the capacity to innovate and collaborate.

The final challenge is sharing the data with third parties safely. Synthetic data can be freely shared because it contains no real customer data and cannot be reverse-engineered. However, the most effective method for sharing is via the Cloud and a new secure method has been defined which we are about to implement for working with a third party specialising in transaction analysis.

Addressing these challenges has enabled Nationwide to obtain representative, re-usable customer transaction data which contains no personally identifiable information that can be shared with third parties for validation of their capabilities and innovation. Such capability enables Nationwide to generate a proper assessment of their technology without exposing the society to risk or requiring a lengthy governance process to obtain data.

The Solution

The solution to these challenges is synthetic data that is sufficiently representative of the real data to preserve the signal (ie the statistical properties of the original data) and be used as a drop-in replacement for real data. This is because it preserves the statistical properties and patterns of consumer behaviour without any of the privacy concerns.

This signal is required to analyse how customers manage their money which, in most cases, is very similar: making sure bills are paid and their account stays in credit each month. However, transaction behaviours that fall outside of the norm may indicate fraud or pivotal events such as unemployment; to identify these behaviours requires a high fidelity within that signal and is critically dependent on the time when transactions occur. These are known as signatures — characteristics of behaviour that lead to specific outcomes.

Hazy trained its software on a data set of 30 million customer transactions from an 18-month period representing more than 8,800 customers with nearly 10,000 accounts between them. Once trained, the resulting synthetic data model can be used to generate synthetic datasets of arbitrary size on demand.

Within such a large dataset there is a rich variety of characteristics and patterns of behaviour that need to be learned in order to produce a synthetic dataset that is fit for purpose. Here are a few key examples for illustration:

  • Different types of customer with varying numbers of accounts
  • Different types of account, e.g. credit card and current accounts, which exhibit different behaviours
  • Transactions with a wide range of values, ranging from buying a coffee to paying for a family holiday
  • Transactions with a wide variety of merchants and other recipients, such as convenience stores or local government authorities (for council tax payments)
  • Sequential behaviour: transactions which recur every month such as rent or salary, transactions which follow each other in quick succession, etc.
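
As a rough illustration of the monthly recurrence described above, the sketch below flags counterparties that are paid exactly once a month on a similar day. It is not Hazy’s or Nationwide’s method: the function name `monthly_recurring`, the `(date, counterparty, amount)` tuple layout, the thresholds and the toy data are all assumptions made for the example. Running the same check on real and synthetic data and comparing the results is one simple way to see whether such patterns survive.

```python
from collections import defaultdict
from datetime import date


def monthly_recurring(transactions, min_months=3, tolerance_days=3):
    """Return counterparties paid exactly once a month, on a similar day,
    in at least `min_months` distinct months.

    `transactions` is an iterable of (date, counterparty, amount) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    # counterparty -> {(year, month): [days of month on which a payment occurred]}
    days_by_party = defaultdict(lambda: defaultdict(list))
    for when, party, _amount in transactions:
        days_by_party[party][(when.year, when.month)].append(when.day)

    recurring = []
    for party, by_month in days_by_party.items():
        if len(by_month) < min_months:
            continue  # not seen in enough distinct months
        if any(len(days) != 1 for days in by_month.values()):
            continue  # several payments in one month: not a simple monthly pattern
        days = [d[0] for d in by_month.values()]
        # Simplification: ignores month wrap-around (e.g. the 31st vs the 1st).
        if max(days) - min(days) <= tolerance_days:
            recurring.append(party)
    return sorted(recurring)


# Invented toy data: rent paid on the 1st of each month, plus ad hoc coffee purchases.
toy_transactions = [
    (date(2020, 1, 1), "Landlord", -750.00),
    (date(2020, 1, 3), "Coffee shop", -2.80),
    (date(2020, 1, 17), "Coffee shop", -2.80),
    (date(2020, 2, 1), "Landlord", -750.00),
    (date(2020, 2, 9), "Coffee shop", -3.10),
    (date(2020, 3, 1), "Landlord", -750.00),
]

recurring_real = monthly_recurring(toy_transactions)
```

In this toy data only the landlord qualifies, because the coffee shop appears in just two months and more than once in January.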

To verify that the synthetic data successfully preserved these characteristics, a battery of metrics was built. Some of the metrics were already part of the Hazy tool kit; others were created specifically to evaluate performance on the sequential characteristics. We computed each of the measures below on both the source and synthetic datasets and compared the results:

  • Probability distributions over key attributes, such as transaction amounts and counts, running balance and initial and final balance
  • Co-dependencies (mutual information to be precise) between pairs of attributes
  • Quality of classification tasks, such as classifying behavioural patterns into life events
  • The importance given to various features when using ML to predict attributes such as merchant category from other attributes
  • Autocorrelation of transaction time series (how an entire time series correlates with itself as it is progressively shifted in time)
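
To make three of these measures concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the Hazy tool kit: the function names are invented for this example, the distribution comparison uses total variation distance as one possible choice of distance, and the attributes are assumed to be discrete (continuous values such as amounts would first need to be binned). Each function would be run on the real data and on the synthetic data, and the two results compared.

```python
import math
from collections import Counter


def total_variation(real, synth):
    """Total variation distance between two empirical distributions.

    0.0 means the category frequencies are identical, 1.0 means they are disjoint.
    """
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete attributes."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    if lag >= n:
        return 0.0
    mean = sum(series) / n
    variance_sum = sum((v - mean) ** 2 for v in series)
    if variance_sum == 0:
        return 0.0  # a constant series has no meaningful autocorrelation
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Invented toy attributes standing in for real and synthetic columns.
real_category = ["food", "food", "rent", "rent"]
synth_category = ["food", "rent", "rent", "rent"]
real_account = ["card", "card", "current", "current"]

distribution_gap = total_variation(real_category, synth_category)
dependency_real = mutual_information(real_account, real_category)
```

A small `distribution_gap`, similar mutual information values for each attribute pair, and similar autocorrelation at each lag would all indicate that the synthetic data tracks the source on these measures.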

Having set out the performance metrics for success, the team fine-tuned the synthetic data model and built an interactive tool to enable direct A / B comparison of real and synthetic data, with the ability to query different subsets. The results were outstanding and indicate, we believe for the first time, a new level of signal preservation in synthetic data. The graph below shows the levels of similarity achieved:

Example of the account balance for a synthetic customer. The synthetic data is indistinguishable from the real data.

The second proof point was demonstrating how the time to create anonymous data could be reduced from months to days. A detailed business analysis was undertaken to compare the current process with the new process, in which reducing time was a key element. However, Nationwide went further and evaluated the complete end-to-end process, from identifying a use case and onboarding third parties through to the third proof point of sharing synthetic data with them for analysis. This puts synthetic data at the heart of the workflow as shown below:

The Benefits

The headline benefit from our collaboration is that Nationwide can now innovate faster. Using Hazy’s synthetic data enables Nationwide to preserve the behavioural and temporal characteristics of production data and to rapidly create and provision representative synthetic data sets for third parties to analyse. The full set of measurable benefits that using synthetic data has created includes:

  • Reducing the time to create and share safe data from months to days
  • Increasing the throughput of innovation projects per year
  • Reducing the people time required to prepare data
  • Reducing Nationwide’s risk of data leakage
  • Making the process of sharing data with external parties faster, safer and trackable
  • Building Hazy synthetic data generation into the end-to-end workflow of third-party onboarding through the standardisation of contractual processes and governance

Combined, these represent a significant step-change in the delivery of value to Nationwide in terms of speed, security and costs.

Conclusion

Nationwide and Hazy have solved the historic challenge of creating safe, representative, GDPR-compliant data to share with third parties, who can then analyse it to generate meaningful, actionable insights into account behaviours. This directly enhances the speed of innovation.

This is a vital driver to remain competitive in an evolving landscape that is being transformed by unforeseen changes to the economy, aggressive new entrants, new regulations such as Open Banking and challenger FinTech platform strategies. Hazy’s synthetic data solution has shown it can meet these challenges head-on.
