On Wednesday 30th October, many of you encountered a problem accessing the Revolut app.
Even though the outage lasted just a few hours, we’re sorry that it happened. This is not the level of service we expect to deliver. In this blog, we’re going to explain exactly what went wrong, and let you know what we’re doing to help avoid it happening again.
Please note, this blog gets pretty technical, but we’ll try to explain terms along the way.
When did this happen?
30th October, 2019 13:32 - 16:00
Here’s what happened
It started with a number of customers being logged out, followed by a significant reduction in performance, which resulted in the Revolut app becoming slow. By slow, we mean that things took seconds rather than the usual milliseconds.
For those who were able to log in, numerous features were affected. Card payments and ATM withdrawals were not affected and continued to work as normal.
Some context surrounding the problem
We release our backend applications — the systems that power the Revolut mobile app — many times a day. Before going to production, every change goes through a phase of automated testing (we follow Test Driven Development as an important part of our change management process).
Successful builds are automatically deployed to what’s known as a ‘staging environment’, where the consistency of the application and the deployment configuration is verified (i.e. we check that it works the way we want it to). Following this, the build can go through a phase of manual verification.
Once we are happy that the proposed changes can be released into production, we trigger what’s called a ‘green-blue deployment’ (more commonly known as a blue-green deployment). The new version of the application is started alongside the old one, and requests are only sent to the new version once it is ready to process them.
If the new version of the application fails to deploy, then the old version of the application will continue to run. The same procedure is performed in the staging environment.
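For the engineers out there, the switch-over rule described above can be sketched in a few lines. This is a minimal illustration, not our deployment tooling: the `Router` class, the `green_blue_deploy` function and the `is_ready` check are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Hypothetical traffic router: `live` names the version receiving requests."""
    live: str


def green_blue_deploy(router: Router, new_version: str,
                      is_ready: Callable[[str], bool]) -> bool:
    """Switch traffic to `new_version` only if it reports that it is ready.

    Returns True if traffic was switched, False if the old version stays live.
    """
    if not is_ready(new_version):
        # The new version failed to deploy: the old version keeps serving requests.
        return False
    router.live = new_version
    return True


router = Router(live="v1")
green_blue_deploy(router, "v2", is_ready=lambda version: False)
print(router.live)  # v1 (failed deployment, nothing changes)
green_blue_deploy(router, "v2", is_ready=lambda version: True)
print(router.live)  # v2 (new version is ready, traffic is switched)
```

Note that this sketch only covers the application itself. Any database change made as part of the deployment is shared by both versions, which is why it has to be handled separately.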
One of the deployment steps is to change the structure of the database, if required. For a very short period during deployment, the old version of the application uses the updated database. To ensure a smooth release, we therefore always have to make database changes in a backward-compatible manner (meaning that the old version of the application keeps working with the updated database).
Unfortunately, due to human error, a change was made that was not backward compatible. It affected the database behind the authentication service, which authenticates every request from the mobile app. When the authentication service isn’t working properly, it leads to problems accessing the app.
Deeper into the issue
For the engineers out there: an unused, empty column was removed from a table. However, it was still referenced in some queries to this table in the previous version of the authentication service, which was running in production at the time. This resulted in authentication errors and automatic logouts in the app.
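To make this failure mode concrete, here is a small, self-contained reproduction using an in-memory SQLite database. It is only a sketch: the `sessions` table, the `legacy_flag` column and the query are invented for this example, and our real schema and database engine are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (token TEXT, user_id INTEGER, legacy_flag TEXT)")
conn.execute("INSERT INTO sessions (token, user_id) VALUES ('abc', 1)")

# Hypothetical query issued by the OLD application version.
# It still selects the unused column.
OLD_QUERY = "SELECT user_id, legacy_flag FROM sessions WHERE token = ?"


def authenticate(token: str) -> str:
    """Run the old version's query and report what happened."""
    try:
        row = conn.execute(OLD_QUERY, (token,)).fetchone()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"
    return "ok" if row else "unknown token"


before = authenticate("abc")

# The migration: remove the unused column by rebuilding the table without it.
conn.executescript("""
    CREATE TABLE sessions_new (token TEXT, user_id INTEGER);
    INSERT INTO sessions_new SELECT token, user_id FROM sessions;
    DROP TABLE sessions;
    ALTER TABLE sessions_new RENAME TO sessions;
""")

after = authenticate("abc")

print(before)  # ok
print(after)   # an error that names the missing column
```

The column was empty and nothing depended on its contents, yet every query that merely mentioned it started to fail as soon as it was gone.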
For all non-engineers, a change was made to the database that the running version of our login system did not expect, which resulted in you being logged out of the app.
This lasted for a period of a few minutes, but due to the spike in authentication errors (people being logged out and trying to log back in again), an alert was triggered. One of our engineers saw that the errors spiked during deployment, and performed a rollback to the previous version of the service, so that customers would be able to log in again.
Unfortunately, our rollback procedure does not roll back automatic database changes; this step has to be checked manually. As a result, the rollback deployed the previous version of the application, which was not compatible with the updated database structure*. This led to more authentication errors and more delays.
*This is what engineers refer to as a database schema.
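One way to turn that manual check into an automatic one is to compare the live schema with the columns the rollback target needs before switching versions. The sketch below uses SQLite for illustration only. The `table_columns` and `missing_columns` helpers and the `previous_version_needs` manifest are hypothetical, and this is not a description of our actual rollback tooling.

```python
import sqlite3


def table_columns(conn: sqlite3.Connection, table: str) -> set:
    """Return the names of the columns that currently exist in `table`."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name is interpolated directly, so it must come from trusted config.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def missing_columns(conn: sqlite3.Connection, required: dict) -> dict:
    """Map each table to the required columns that are absent from the live schema."""
    missing = {}
    for table, columns in required.items():
        absent = set(columns) - table_columns(conn, table)
        if absent:
            missing[table] = absent
    return missing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (token TEXT, user_id INTEGER)")

# Hypothetical manifest: the columns the previous application version queries.
previous_version_needs = {"sessions": {"token", "user_id", "legacy_flag"}}

problems = missing_columns(conn, previous_version_needs)
if problems:
    print("Do not roll back yet, schema is missing:", problems)
else:
    print("Schema is compatible, rollback can proceed")
```

With a guard like this, a rollback onto an incompatible schema would be stopped before any traffic reaches the old version.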
Our engineers realised the mistake and brought the database schema back in sync with the application about 20 minutes later. During this time, a significant number of customers who had been logged out tried to log back in at the same time. Once the authentication service was back up and running, the users who managed to log in again started loading account data all at once, causing a spike in load on the system and slow response times.
Over the next 20 minutes, our backend applications were gradually scaled out (more servers were added). This allowed the system to handle more requests, but at the same time put a lot of pressure on the database. We therefore decided to scale out the database as well (again by adding more servers), which took a further 30 minutes. From that point, response times started gradually normalising, and by 16:00 everything was back to normal.
Still with us?
Here’s what we learned
We generally focus on automating as much of the change process and system runtime as possible, so that our systems can be more resilient without human intervention. We learned that we have to remove another step in our process that is vulnerable to human error, and to prioritise automating the validation that database structure changes are backward compatible.
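As an illustration of what such automated validation could look like, the sketch below flags statements in a migration script that can break an application version still using the old structure. It is deliberately naive and not our actual validator: the list of patterns is an assumption, the function name is made up, and splitting on semicolons would not cope with every SQL script.

```python
import re

# Statements that can break an application version that still expects
# the old database structure. This list is illustrative, not exhaustive.
BREAKING_PATTERNS = [
    re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.IGNORECASE),
]


def find_breaking_statements(migration_sql: str) -> list:
    """Return the statements in a migration that match a breaking pattern."""
    # Naive split on ';' is enough for a sketch; a real tool would parse the SQL.
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [s for s in statements
            if any(pattern.search(s) for pattern in BREAKING_PATTERNS)]


migration = """
    ALTER TABLE sessions ADD COLUMN device TEXT;
    ALTER TABLE sessions DROP COLUMN legacy_flag;
"""

flagged = find_breaking_statements(migration)
for statement in flagged:
    print("Needs review before release:", statement)
```

Adding a column passes, because the old version simply ignores it. Dropping one is flagged, because the old version may still refer to it.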
We’ve made our authentication logic more robust, so that intermittent errors such as the ones we experienced here no longer log you out of the app.
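The idea can be sketched as a simple rule: only end a session when the credentials were actually rejected, and treat everything else as a temporary problem. The example below assumes HTTP-style status codes, where 401 means the credentials were rejected. That is an assumption made for illustration, as are the `SessionAction` and `action_for_status` names; it does not describe how our app and authentication service actually communicate.

```python
from enum import Enum


class SessionAction(Enum):
    KEEP = "keep the session"
    RETRY = "keep the session and retry later"
    LOG_OUT = "log the user out"


def action_for_status(status_code: int) -> SessionAction:
    """Decide what the app should do with the session after an auth response."""
    if 200 <= status_code < 300:
        return SessionAction.KEEP
    if status_code == 401:
        # The service explicitly rejected the credentials.
        return SessionAction.LOG_OUT
    # Server errors, timeouts mapped to error codes, and similar failures
    # are treated as temporary: the session is kept.
    return SessionAction.RETRY


print(action_for_status(200).value)  # keep the session
print(action_for_status(401).value)  # log the user out
print(action_for_status(500).value)  # keep the session and retry later
```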
Clearly, scaling of our databases needs to be faster. Ironically, this was a known deficiency and our engineers are currently in the final stages of implementing a solution that addresses this issue.
Even though we took this outage as an opportunity to learn, any interruption to your service is unacceptable, and we wholeheartedly apologise for that. Sometimes systems fail, but it’s by working together that we’re able to fix them.
We’d like to extend our thanks to all of you for your patience, and we hope that you found this debrief informative.