Across industries, data is recognised as an organisation’s most valuable asset. From data comes knowledge and new insights that can be used to improve every function of a business, from new and better products and services for customers, to operational efficiencies.
As data strategies mature, firms are turning an increasingly expectant eye toward the possibilities enabled by advanced technologies such as AI, machine learning and data science.
AI and ML models have delivered unprecedented value in many industries. In financial services, they have unlocked incredible efficiencies by automating decision-making processes and risk calculations, and have enabled the creation of ever more intelligent solutions.
In each case, the quality of the models is determined by two major factors: the quantity and the quality of the data used to train them.
Data is the lifeblood of artificial intelligence: the more data an organisation has to feed its models, the better those models will be. But there is a significant and eternal problem when it comes to data in financial services: there's never enough. To create a model, data scientists spend a great deal of time locating the right data to feed and train it, and even more time cleaning and preparing that data.
So, what happens when there just isn’t enough data to advance models? We can’t just create data, right? Well, actually, we can.
The problems with real data
The main issue data science teams face in financial services is that data is often highly regulated, sensitive or, by its very nature, sparse.
Many datasets contain personally identifiable information or other sensitive details, such as a person's full name or Social Security number. This makes the data difficult to share with third parties, and even internally, for the purposes of data analysis or model building. Gaining permission to access and use certain datasets, as well as the process of anonymising sensitive data, can take weeks, even months, and even the datasets that can be used may not be large enough to produce a truly effective model.
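To make the anonymisation step concrete, here is a minimal sketch of one common approach: pseudonymising PII fields with a salted hash so records can still be joined without exposing identities. The field names and records are hypothetical, and a real programme would also consider tokenisation services, key management and re-identification risk.

```python
import hashlib

# Hypothetical records; field names are illustrative, not from any real system.
records = [
    {"full_name": "Jane Doe", "ssn": "123-45-6789", "balance": 2500.0},
    {"full_name": "John Roe", "ssn": "987-65-4321", "balance": 410.5},
]

PII_FIELDS = {"full_name", "ssn"}
SALT = "replace-with-a-secret-salt"  # in practice, keep out of source control


def pseudonymise(record):
    """Replace PII fields with truncated salted SHA-256 digests; keep the rest."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((SALT + str(value)).encode()).hexdigest()
            out[key] = digest[:12]  # stable token, still usable as a join key
        else:
            out[key] = value
    return out


anonymised = [pseudonymise(r) for r in records]
```

Because the hash is deterministic, the same customer always maps to the same token, so analysts can link records across tables without ever seeing the underlying name or number.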
If we need more usable data to train models but can’t acquire it naturally, we need to create some, which is where synthetic data comes in.
So, you just make it up?
Synthetic data is artificially generated rather than collected from real-world events. Its purpose is to resemble real data, despite being entirely fake in nature. But it's not a simple process of replicating the values of a real dataset. Data has a distribution, a shape that defines the way it looks. Think of a dataset laid out in tabular format: there are columns that interact with other columns, and data with inherent correlations and patterns when parsed. Synthesising this kind of data is a hugely difficult task.
To synthesise data, we need to build a machine learning model that understands the way the data looks, interacts and behaves. Once we have this model, we can generate millions of synthetic records that closely mimic the statistical properties of real datasets, overcoming data limitations.
Banks, for example, can only get more customer checking account data when customers open checking accounts. They then have to wait for a period of time for those customers to start using the product, creating further engagement data. With synthetic data drawn from the current customer base, it is possible to synthesise new checking accounts along with their associated usage. The bank now has a much larger dataset from which it can draw better insights and improve products and services, with the customers being the ultimate beneficiaries.
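A crude stand-in for that process, with entirely hypothetical account records, is to resample existing accounts and perturb their numeric fields, as sketched below. A learned generative model would capture far more structure, but even this shows how a small customer base can seed a much larger training set.

```python
import random

random.seed(0)

# Hypothetical existing checking accounts: (account_type, avg_monthly_txns).
existing = [("basic", 12), ("basic", 9), ("premium", 30),
            ("basic", 14), ("premium", 25), ("basic", 11)]


def synthesise_accounts(n, jitter=2.0):
    """Draw new accounts by resampling real ones and perturbing numeric
    fields -- a crude stand-in for a learned generative model."""
    out = []
    for _ in range(n):
        acct_type, txns = random.choice(existing)
        out.append((acct_type, max(0, round(txns + random.gauss(0, jitter)))))
    return out


new_accounts = synthesise_accounts(1000)
```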
The key to creating high-quality synthetic data is to start with a dataset that is both large and high quality. With that foundation, it is possible to expand the current dataset with high-quality synthetic data points. It's the classic input/output quality equation.
On the cusp of something
Increasingly, synthetic data will be sought as data strategies evolve. Third-party firms that specialise in creating synthetic data from real datasets are already working with large financial institutions today.
I believe this trend will continue and precipitate a new data boom that will positively affect the maturity of models across the financial services ecosystem, improving data-driven decision-making and accelerating innovation.