Today, businesses rely on predictive models for a multitude of reasons. At the very least, they enable novel insights that can streamline internal processes; at the most, they are essential drivers of growth and critical to business operations. Therefore,
it is no surprise that data scientists spend a lot of time building these models.
To keep models from degrading, we need to keep a close eye on new streams of incoming data, monitor potential drift, and kick off retraining sessions that version models and keep them fresh. This sustains performance, allowing us to continue delivering value for users.
However, if restrictions are introduced around portions of a data set or an individual data point, removing this data without completely retraining the model from scratch is extremely difficult. One potential solution is the emerging concept of machine unlearning.
Machine unlearning is a nascent sub-field, with research and development delivering some compelling results. It offers a potential solution to many of the problems faced across industries, from the costly re-work needed in the face of new data laws and regulations to spotting and mitigating bias.
Making data disappear
For data science teams, taking a high-performing model out of production due to legal or regulatory changes is not an uncommon problem. The process of retraining a large model, however, is extensive and costly.
Take the example of a hypothetical lending approval model in the US. Across the States it’s likely we will have tens to hundreds of millions of data points, from which we have created hundreds of features that we are using to train a potentially large-scale neural network. Training this model can take considerable time and money, since we might be using expensive hardware such as multiple GPUs. Imagine that this model has been in production for a year, delivering significant value for customers, but that new privacy laws are then introduced in California that prohibit the use of a particular portion of the data set.
This puts us in a difficult position: we must now remove the California data from our training and validation sets and retrain our model. But what if there were a way to make the model forget this data without explicit retraining on the reduced dataset? This is essentially what machine unlearning could do, with significant benefits for organizations as well as individuals.
Privacy is a key concern for us all. In financial services and other heavily regulated industries, such as healthcare, falling foul of privacy laws can present a mission-critical problem, so seamlessly removing data that’s no longer permissible by law would
be a significant win. For an individual, especially one in Europe, whose right to be forgotten is enshrined in GDPR, machine unlearning could also be the means by which they preserve this right.
Model degradation defense
Models that have been in production for a long time will inevitably have been trained on data that becomes less relevant. From a model monitoring perspective, machine unlearning could help safeguard against this degradation. A key example of this is the behavioral paradigm shift made by customers during the pandemic. In banking, for example, customers quickly moved from in-person interactions with tellers to using digital channels. This change in customer behavior made it necessary to retrain many models.
Another use case could be removing data introduced through feedback loops that might enable an adversarial attack, or speeding up remediation when bad data is introduced through, say, a system failure that causes a model to deliver harmful outcomes. Again, the essential driver for this use case is to reduce re-work, but also to make models and data science at large more secure.
Creating true data impartiality
Another way machine unlearning could deliver value is the removal of biased data points that are identified after model training. Despite laws that prohibit the use of sensitive data in decision-making algorithms, there are a multitude of ways bias can find
its way in through the back door, leading to unfair outcomes for minority groups and individuals. There are also similar risks in other industries, such as healthcare.
When a decision can be life-changing and, in some cases, life-saving, algorithmic fairness becomes a social responsibility. For this reason, financial inclusion is rightly a key focus for financial institutions, and not just for the sake of social responsibility. Challengers and fintechs continue to develop innovative solutions that are making financial services more accessible, which will continue to bear fruit across the ecosystem.
Where to start
The concept of machine unlearning is not a novel one, but it has seen less pioneering research than other sub-fields. Researchers working on how to deliver machine unlearning have proposed a framework called Sharded, Isolated, Sliced, and Aggregated (SISA) training. This approach divides the training data into disjoint subsets called shards, each of which is used to train its own smaller model in isolation; the predictions of these smaller models are then aggregated to form the output of the overall model. If data within a shard needs to be removed, only the model trained on that shard needs to be retrained, which can happen in isolation. Retraining is still needed in small portions with SISA, but alternative research around data removal-enabled (DaRE) forests caches statistics at the nodes of each tree in an attempt to forget data while avoiding explicit retraining wherever possible.
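To make the sharding idea concrete, here is a minimal Python sketch of shard-level unlearning. It is an illustration under stated assumptions, not the SISA authors' implementation: it omits the slicing and checkpointing steps entirely, the constituent "model" is a toy nearest-class-mean classifier on a single numeric feature, records are assigned to shards by `record_id % num_shards`, and every name in it (`ShardedEnsemble`, `train_constituent`, `unlearn`) is hypothetical.

```python
from collections import Counter


def train_constituent(records):
    """Toy constituent model (an assumption): the mean feature value per class."""
    sums, counts = {}, Counter()
    for x, label in records:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


def predict_constituent(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


class ShardedEnsemble:
    """Shard-and-aggregate sketch; slicing and checkpointing are left out."""

    def __init__(self, data, num_shards):
        # data: iterable of (record_id, x, label) with integer record ids.
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        for record_id, x, label in data:
            self.shards[record_id % num_shards][record_id] = (x, label)
        # Each shard trains its own model, isolated from the other shards.
        self.models = [train_constituent(s.values()) for s in self.shards]

    def predict(self, x):
        # Aggregate by majority vote; models from empty shards are skipped.
        votes = Counter(predict_constituent(m, x) for m in self.models if m)
        return votes.most_common(1)[0][0]

    def unlearn(self, record_id):
        """Remove one record and retrain only the shard that held it."""
        idx = record_id % self.num_shards
        del self.shards[idx][record_id]
        self.models[idx] = train_constituent(self.shards[idx].values())
        return idx


if __name__ == "__main__":
    data = [
        (0, 1.0, "deny"), (1, 1.2, "deny"), (2, 0.8, "deny"),
        (3, 5.0, "approve"), (4, 5.2, "approve"), (5, 4.8, "approve"),
    ]
    ensemble = ShardedEnsemble(data, num_shards=3)
    retrained = ensemble.unlearn(3)  # forget record 3
    print("retrained shard:", retrained)
    print("prediction for 4.9:", ensemble.predict(4.9))
```

In this sketch, forgetting one record touches a single shard's model while the others are left exactly as they were, which is where the saving over full retraining comes from. The trade-off is that each constituent model sees less data, so more shards means cheaper unlearning but potentially weaker individual models.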
Models deliver a significant portion of business value, so the use cases of machine unlearning outlined above are promising for the data science community. Yet there is still a need for data removal in a dynamic and changing environment.
Having discussed where I believe machine unlearning can deliver the most value, this is where I put the question to the data science community.