Privacy-preserving synthetic data for the financial industry

Elise Devaux
Published in Statice
Oct 27, 2020

Strict data regulations and cumbersome data governance processes are causing innovation inertia in banks and financial institutions. Where data should drive product development and fuel analysis, we see slow and tedious processes preventing teams from accessing, sharing, and leveraging data. This post explores these challenges, as well as how to regain the ability to work with data safely and efficiently.

The digital transformation presupposes access to data

Data is central for financial institutions on the path to digital transformation. It fuels operational efficiency, helps enterprises build personalized customer experiences, and allows them to develop competitive products. To stay in the race, leading financial institutions must continuously think about optimizing their data assets.

The recent paradigm and technological shifts emphasize the organizational ability to aggregate, analyze, and use data:

  • For example, the ability to access data to power machine learning applications gives financial companies an advantage in their ability to fight fraud, predict and prevent churn, and provide personalized experiences, to name just a few key use cases.
  • Reliable data governance processes also allow businesses to migrate to cost-efficient infrastructures, such as public clouds. And these trends are growing.

“By 2022, public cloud services will be essential for 90% of data and analytics innovation.” Gartner, Top Data & Analytics trends in 2020

But there are many blockers on the road to data access for financial organizations.

The roadblocks on the path of data-driven transformation

Fast-evolving regulations and the associated compliance risks are among the blockers that companies are facing. In Europe, personal data protection laws reinforced the legislative frameworks that already regulated data processing in the financial sector. Because of the financial risks of non-compliance, many enterprises adopt a cautious approach to data strategy.

Overall, the finance sector has received more EU General Data Protection Regulation (GDPR) fines than any other industry. Recently, the Dutch Credit Registration Bureau (BKR) received an 830,000€ penalty from the Dutch Data Protection Authority (DPA) for non-compliance.

The costs of non-compliance include not only fines and settlements but also business disruption and losses in productivity and revenue. The Italian Garante (Data Protection Authority), for example, fined UniCredit bank 600,000€ for non-compliance dating from before the GDPR. The resulting security-first approach has left its mark on the sector.

To this, we must add legacy systems with proprietary formats and siloed IT infrastructures. They subject data teams to prolonged and tedious access processes and prevent them from getting to the data quickly.

In cases where data is accessible, the quality might not suffice for cutting-edge use cases. Maintaining both the privacy of data and its usefulness is not an easy task. With redacted data, it’s common that the data quality no longer allows for some forms of analysis.

In the end, these roadblocks not only prevent companies from fully leveraging their data, they also come at a cost.

The costs of data inertia for financial enterprises

The lack of agility to innovate can cost companies a competitive advantage.

“For companies that are arming their workers with data today, 32 percent see a ‘significant increase’ in product or service quality, while 28 percent see an increase in productivity or efficiency.” Harvard Business Review report, Meet the New Decision Makers

On the other side of the coin, companies processing sensitive data without proper protection mechanisms expose themselves to financial, legal, and corporate risks. The re-identification and data leak risks of poor privacy mechanisms should not be ignored as they can lead to severe damage for a company.

For instance, the loss of customer trust can represent a high cost for companies, although it’s hard to quantify. Customers are less inclined to trust businesses with their money and confidential information after a breach.

Companies that want to use their data for business intelligence and data science applications have the option to use privacy-preserving synthetic data.

Why financial organizations should use privacy-preserving synthetic data

In a general sense, synthetic data is artificially generated information rather than data collected from the real world. Financial synthetic data mimics the statistical characteristics of the original dataset it’s derived from.

One of the advantages of this method is that the data utility can still be well preserved. Indeed, the synthetic data can still retain many of the properties and statistical information of the original data. With these underlying statistical patterns still present, it’s possible to power almost any application intended for the original data.
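
To make the idea of "retaining statistical properties" concrete, here is a minimal sketch, not the method used by any particular vendor. It assumes a hypothetical two-column dataset (account balance and monthly spend), fits a simple bivariate Gaussian to it, and samples new records from that fit. The function names `fit_gaussian` and `sample_synthetic` and all the numbers are illustrative assumptions. A naive model like this carries no privacy guarantee on its own; it only shows how sampled records can reproduce the means, spreads, and correlation of the source.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "rho": cov / (std_x * std_y)}


def sample_synthetic(model, count, rng):
    """Draw new records from the fitted bivariate Gaussian (no original row is reused)."""
    mean_x, mean_y = model["mean"]
    std_x, std_y = model["std"]
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    records = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        y = mean_y + std_y * (rho * z1 + residual * z2)  # keeps the correlation
        records.append((x, y))
    return records


# Hypothetical "original" data: (balance, monthly_spend), made up for illustration.
rng = random.Random(0)
original = []
for _ in range(500):
    balance = rng.gauss(5000, 1500)
    original.append((balance, 0.2 * balance + rng.gauss(0, 200)))

model = fit_gaussian(original)
synthetic = sample_synthetic(model, 5000, rng)
synthetic_model = fit_gaussian(synthetic)

print("original correlation: ", round(model["rho"], 3))
print("synthetic correlation:", round(synthetic_model["rho"], 3))
```

An analysis that depends only on these aggregate patterns, such as a regression of spend on balance, would give similar results on either dataset. Real generators use far richer models to capture non-Gaussian and categorical structure.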

This synthetic financial data is also, in effect, an anonymization method. It safeguards the privacy of any personal data from the original dataset. If generated correctly, it won’t have a one-to-one relationship with the original data, protecting the privacy of customers. It should not be possible to learn information about a particular individual from privacy-preserving synthetic data. This is what removes such data from the scope of personal data processing regulations.
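
One basic sanity check on the "no one-to-one relationship" property can be sketched as follows. It flags synthetic records that sit on top of, or very close to, an original record. The helper names, the sample records, and the tolerance are hypothetical. Passing this check is necessary but far from sufficient: it does not by itself show that a dataset is anonymous.

```python
import math


def nearest_distance(record, candidates):
    """Euclidean distance from one record to its closest candidate record."""
    return min(math.dist(record, candidate) for candidate in candidates)


def copied_records(synthetic, original, tolerance=0.0):
    """Return synthetic records lying within `tolerance` of some original record."""
    return [s for s in synthetic if nearest_distance(s, original) <= tolerance]


# Hypothetical (balance, monthly_spend) records, made up for illustration.
original = [(5200.0, 1040.0), (3100.0, 650.0), (7800.0, 1500.0)]
synthetic = [(5150.3, 1012.7), (3100.0, 650.0), (8020.9, 1580.2)]

leaks = copied_records(synthetic, original)
print("synthetic records that duplicate an original:", leaks)
```

In this made-up example the second synthetic record is an exact copy of a customer record, so a generator producing it would need to be revisited before the data is shared.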

A compliant and easy-to-access data asset

Privacy-preserving synthetic financial data is private by design. It offers enterprises a way to remain compliant with personal data processing regulations, which makes it a crucial asset. For example, to comply with the GDPR data retention period, a bank would need to delete all personal and financial information after a customer contract ends, preventing any long-term data analysis. With privacy-preserving synthetic data, the enterprise could run such long-term analysis on synthetic data generated during the contract period, and delete the customer information as required by any relevant regulation.

Compared to traditional privacy protection mechanisms, properly implemented synthetic data offers stronger privacy guarantees. Other methods, such as tokenization or pseudonymization, present re-identification risks that the use of synthetic data doesn’t. Solely removing PII from the data is not a safe data protection mechanism: it exposes businesses and individuals to privacy breaches through linkage attacks and other privacy-compromising exploits.
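
A linkage attack can be illustrated with a short sketch. It assumes a hypothetical "de-identified" table from which names were removed but where ZIP code, birth year, and gender remain, plus a hypothetical public register containing the same attributes alongside names. All records and the function name `link` are invented for illustration.

```python
# Quasi-identifiers: attributes that are not names but can single a person out.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

# Hypothetical bank extract with direct identifiers (names) removed.
deidentified = [
    {"zip": "10115", "birth_year": 1984, "gender": "F", "balance": 18250},
    {"zip": "10117", "birth_year": 1990, "gender": "M", "balance": 420},
    {"zip": "10119", "birth_year": 1975, "gender": "M", "balance": 96300},
]

# Hypothetical public register an attacker could obtain.
public_register = [
    {"name": "Person A", "zip": "10115", "birth_year": 1984, "gender": "F"},
    {"name": "Person B", "zip": "10117", "birth_year": 1990, "gender": "M"},
    {"name": "Person C", "zip": "10117", "birth_year": 1990, "gender": "M"},
    {"name": "Person D", "zip": "10119", "birth_year": 1975, "gender": "M"},
]


def link(deidentified_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Join both tables on the quasi-identifiers and keep unambiguous matches."""
    names_by_key = {}
    for person in public_rows:
        key = tuple(person[k] for k in keys)
        names_by_key.setdefault(key, []).append(person["name"])

    reidentified = []
    for row in deidentified_rows:
        names = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate: the row is re-identified
            reidentified.append((names[0], row["balance"]))
    return reidentified


reidentified = link(deidentified, public_register)
print(reidentified)
```

Two of the three "anonymous" rows map back to a single named person here, even though no name was ever stored in the extract. Only the row shared by two people with identical attributes stays ambiguous.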

Financial enterprises using synthetic data are also able to grow more agile with their data operations. Without the tedious governance and security processes that often prevent data from flowing within the organization, they can reduce their time-to-data.

Supporting innovation with synthetic data

The changes in enterprises are not only a matter of applications but also infrastructures. Data architecture is evolving, and synthetic data might be one of the keys to building an infrastructure that supports innovation.

The use of cloud storage, compute, and other tools is, for instance, an important infrastructural shift. But how do you upgrade from on-premise to cloud when heavy governance and security processes regulate the transfer of customer data off-premise? Synthetic data offers an alternative to moving sensitive data out of your premises. You can make it available to your teams anywhere in the world without increasing compliance overhead or risking security breaches.

Besides greater flexibility in data architecture, synthetic data opens the door to many risk-free applications, from machine learning training to BI or external data sharing with partners. Successful projects in the financial sector already attest to the fact that synthetic data holds great potential for financial organizations.

The largest companies in the world are starting to work with synthetic data. Amazon is already using this technology to improve customer purchase prediction. American Express is also exploring the topic: its data teams are researching synthetic data to train machine learning models and improve their fraud detection algorithms. We recently saw the Financial Conduct Authority launch a digital sandbox project to foster innovation among financial institutions. In this project, synthetic data offers an opportunity to build alternative payment datasets to improve scam and fraud detection models.

Originally published at https://www.statice.ai.
