Data is produced and collected at an unprecedented scale by users and companies. However, regulations such as the GDPR impose strict restrictions on internal and external data sharing for public and private organizations.
This is problematic in many ways. Consider healthcare, for instance, where clinicians would like to understand the possible outcomes of available treatments for a specific patient with a rare disease. If hospitals could share diagnostic and treatment data about patients with the same disease, the benefits could be huge: powerful machine learning models could be trained on the pooled data to generate reliable what-if scenarios for the outcome of each possible treatment.
Machine learning models generally become more accurate when they are trained on larger and more representative datasets. A wider pool of data would therefore allow more precise models to be trained, which is especially crucial for rare cases.
In the financial sector, data sharing is also very problematic, as institutions are prohibited from sharing clients' data by strict financial regulations.
Imagine a bank trying to detect complex money laundering activities. At present, banks are restricted to using their own data and struggle to detect novel, sophisticated strategies that attackers have already used against other banks. However, if these banks could share a distilled version of the data related to the attacks, machine learning models could pick up these illicit transactions much better. This data doesn't need to include any personal elements; sometimes daily or weekly aggregated transactions are enough.
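To make the idea of aggregated transactions concrete, here is a minimal sketch of weekly aggregation that drops account identifiers. The record layout `(account_id, day, amount)`, the sample values and the function name `aggregate_weekly` are assumptions made for illustration only; they are not taken from any real banking system. Aggregation on its own is not a privacy guarantee, it only shows what a "distilled" view might look like.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (account_id, day, amount).
transactions = [
    ("acc-001", date(2023, 1, 2), 120.0),
    ("acc-002", date(2023, 1, 2), 80.0),
    ("acc-001", date(2023, 1, 3), 40.0),
    ("acc-003", date(2023, 1, 9), 500.0),
]


def aggregate_weekly(records):
    """Count transactions and sum amounts per ISO week, discarding account IDs."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for _account_id, day, amount in records:
        iso = day.isocalendar()      # (ISO year, ISO week number, ISO weekday)
        key = (iso[0], iso[1])       # group by year and week only
        totals[key]["count"] += 1
        totals[key]["total"] += amount
    return dict(totals)


weekly = aggregate_weekly(transactions)
for (year, week), stats in sorted(weekly.items()):
    print(year, week, stats["count"], stats["total"])
```

The output contains one row per week with a transaction count and a total amount, and no field that identifies an individual account.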
Synthetic data can solve this problem by creating a completely “fake” version of the original data without any personal identifiers. This data replicates the statistical properties of the original data (correlations, time dependencies, etc.) but contains no identifiable information about the users. Note that this is not the same as data anonymization, which substantially degrades the quality of the data and does not guarantee privacy.
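As a toy illustration of "replicating the correlations", the sketch below fits a two-variable Gaussian to some made-up "real" rows and then samples brand-new rows from it. This is a deliberately simple stand-in, not the method behind any particular synthetic data product: real generators are far more sophisticated, and a plain Gaussian fit offers no formal privacy guarantee. The columns (age, months on treatment), the data itself and the function names are all assumptions.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted bivariate Gaussian."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mix z1 into y so that x and y have correlation rho.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)

# Hypothetical "real" data: (age, months on treatment), correlated by construction.
real = []
for _ in range(200):
    age = rng.uniform(30, 80)
    real.append((age, 0.3 * age + rng.gauss(0, 3)))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 5000, rng)

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(fit_gaussian(synthetic)[4], 3))
```

The synthetic rows are freshly sampled rather than copied, yet the correlation between the two columns stays close to the one measured on the original rows.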
Let's consider a user named Anthony, 39 years old, living in Southampton, with a chronic heart disease and on treatment with an experimental drug X for 1 year. In the synthetic world, Anthony would appear as a 35-year-old nameless male living in Reading with the same chronic heart disease, on treatment with the same drug but for 8 months. This synthetic patient would retain the relevant information related to the treatment, but the details would not match any real record. In other words, it's safe, realistic synthetic data.
With data collected from hundreds or thousands of synthetic patients like Anthony, with the same heart disease but different clinical histories, treated with other drugs and with different outcomes and side effects, clinicians could train an ML model to study possible scenarios for their real patients. This would give clinicians a much higher level of confidence, allowing them to generate evidence-based outcomes for all possible scenarios rather than guessing (see this example for medical data use).
The next figure illustrates how synthetic data can be used to overcome the problem of data silos. I call this vision the Synthetic Ecosystem.
Figure: Synthetic Ecosystem
Sharing sensitive synthetic data, like medical records or bank transactions, will require a proper technological and legal framework. The rollout of this vision will probably require the creation of several tiers of data for different levels of sensitivity and privacy. More sensitive synthetic data will require more stringent privacy protection, while less sensitive data can be shared under less restrictive privacy requirements.
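One way such tiers might be expressed in software is sketched below. The three tier names, the `Dataset` type and the `can_access` rule are purely hypothetical, invented here to illustrate the idea; an actual framework would be shaped by the legal and technological requirements mentioned above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical sensitivity tiers; higher means more sensitive."""
    LOW = 1      # e.g. coarse aggregated statistics
    MEDIUM = 2
    HIGH = 3     # e.g. synthetic medical records or bank transactions


@dataclass(frozen=True)
class Dataset:
    name: str
    tier: Tier


def can_access(clearance: Tier, dataset: Dataset) -> bool:
    """Allow access only if the recipient's clearance meets the dataset's tier."""
    return clearance >= dataset.tier


records = Dataset("synthetic_medical_records", Tier.HIGH)
weekly_totals = Dataset("weekly_transaction_totals", Tier.LOW)

print(can_access(Tier.MEDIUM, records))        # clearance below the dataset tier
print(can_access(Tier.MEDIUM, weekly_totals))  # clearance above the dataset tier
```

Using an ordered enumeration keeps the rule to a single comparison, so adding a tier does not require changing the access check.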
Breaking data silos
This framework would first be implemented inside organizations to break data silos and speed up innovation.
The first step is to allow internal data sharing between different data silos in large organizations. A data silo, also known as an information silo, is a collection of data that is conceptually connected to other data in the organization but incapable of operating with it. Few organizations choose to store their data in silos; rather, silos arise naturally during the life cycle of data. The problem with data silos is that they slow down innovation and create compliance complications.
Synthetic data can help this integration by enabling teams to safely experiment with and deploy synthetic data twins without risk. For instance, by sharing sales data from the CRM with claims and underwriting, insurers can easily obtain a permissionless 360-degree perspective of the client.
Decision makers don’t just want more data, they want insights. Since synthetic data preserves the utility of the original data, the Synthetic Ecosystem will allow us to extract these insights from the data, whether through Business Intelligence or Predictive Analytics.
As in the cases mentioned earlier, accessing external data sources gives data scientists extra leverage in creating more accurate predictive models.