
Under the Hood of Screen Scraping

Financial institutions preparing to meet the Regulatory Technical Standards (RTS) of Open Banking are currently facing a shortage of live Open APIs. For any bank that wishes to get ready for the Open Banking era, the only viable choice is to use account aggregators, many of whom provide solutions based on “screen scraping”.

What is screen scraping? 

Today a lot of companies use so-called “Scrapers” to do “Screen Scraping”, and most of us know that it’s a way to get to the information on a web page. But what is it, and how does it happen? Well ... that is a bit more complicated. 

To be able to understand it, we need to know a bit about how the information in a system can be viewed through your browser as a user-friendly and useful page, and to do that we need to know a bit about how the layers of that page work. This isn’t very complicated, but it is important to understand, so please stay with us!

The layers of a web page

This may be stating the obvious a bit, but when you query a database and ask for information, the result is usually very boring: just rows and columns of information. To turn that into a page we need to put a couple of layers on top of it, and when you look at the data through those layers, it turns into a wonderful page. These layers are Data, Structure and Presentation. It looks something like the illustration below.

[Illustration: web layers]

Now, we already understand what the data is: the rows and columns of information that are very boring and not very readable. The structure comes from the HTML code, which tells the browser what data is to be put where on the page. Here we can also add extra bits of information like a title, some text and other things. Finally, we put on the presentation layer, which is the CSS code that tells the browser where each image or element is to be placed and which colors and fonts you are to see. No layer is very beautiful on its own, but when you combine them by looking at them through a web browser, you get the wonderful web page you want to see.
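
To make the three layers a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the account rows, the CSS rule and the `render_page` function are assumptions, not how any particular bank builds its pages. It only shows the idea of boring data being wrapped in structure (HTML) and presentation (CSS).

```python
from html import escape

# Data layer: plain rows and columns, as a database query might return them.
# (Hypothetical sample data, for illustration only.)
rows = [
    {"account": "Savings", "balance": "1200.50"},
    {"account": "Checking", "balance": "310.00"},
]

# Presentation layer: CSS that says how things should look.
CSS = "td.balance { color: green; text-align: right; }"


def render_page(rows):
    """Structure layer: wrap the data in HTML that says what goes where."""
    body = "\n".join(
        f'<tr><td class="account">{escape(r["account"])}</td>'
        f'<td class="balance">{escape(r["balance"])}</td></tr>'
        for r in rows
    )
    return (
        "<html><head><title>My accounts</title>"
        f"<style>{CSS}</style></head>"
        f"<body><table>\n{body}\n</table></body></html>"
    )


page = render_page(rows)
print(page)
```

The printed text is what the web server sends. Only when a browser interprets it do you see the finished page.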

This is not very extraordinary in and of itself. The picture on your TV screen adds red, green and blue together to create the moving images. Looking at the red, green and blue channels separately does not make a pretty picture. You probably also played around with paint in kindergarten, mixing colors in a similar way. In pictures it works like this:

[Illustration: Colors]

How, then, do you do screen scraping?

Knowing what we now know, we understand that to get to the data, we need to filter out the presentation and structure. To do this filtering we create something called a “Scraper”. Scrapers can contain a lot of very complicated mechanisms, using tools such as regular expressions, substring searches and other ways to match patterns or get to the specific parts of the page where the data is placed. When you apply that filter to the page it was made for, it takes out everything that is not data and leaves you with only the relevant pieces of information on that page.
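
As a rough sketch of such a filter, the Python snippet below uses a regular expression to pull the data back out of a small page. The page, the class names `account` and `balance`, and the `scrape` function are all hypothetical. A real scraper would also have to fetch the page, log in and cope with far messier HTML.

```python
import re

# Hypothetical page, as the scraper would receive it from the web server.
PAGE = """
<html><head><title>My accounts</title>
<style>td.balance { color: green; }</style></head>
<body><table>
<tr><td class="account">Savings</td><td class="balance">1200.50</td></tr>
<tr><td class="account">Checking</td><td class="balance">310.00</td></tr>
</table></body></html>
"""

# The "filter": a pattern that matches only the cells that carry data.
ROW_PATTERN = re.compile(
    r'<td class="account">(?P<account>[^<]*)</td>\s*'
    r'<td class="balance">(?P<balance>[^<]*)</td>'
)


def scrape(page):
    """Strip structure and presentation and return the data as rows."""
    return [
        {"account": m.group("account"), "balance": float(m.group("balance"))}
        for m in ROW_PATTERN.finditer(page)
    ]


data = scrape(PAGE)
print(data)
```

Note how tightly the pattern is tied to this one page. It knows the exact tags and class names, which is precisely why a scraper only works on the page it was written for.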

Once you have created the filter, the only thing left to do is to apply it. You can either go to the web page it is written for and apply it yourself, or you can have a user of your platform ask you to fetch the web page, apply the filter on behalf of the user, and present the result as your own page with your own colors, logo, structure and everything. You can even combine the data from the page with other sources of data, which will allow you to improve on it. Once you have filtered out all the things that a computer does not care about and have only the pieces of information it will work with, there is virtually no limit to what you can do with it!

What are the challenges of screen scraping?

The main challenge is, of course, the maintenance of the filter/scraper. Since it needs to filter out so much information and still be careful not to remove any of the information you need, it must be very carefully crafted. And if those who provide the page you are filtering change something in their presentation layer or structure, the filter might break, and your use of the information with it! You must then recalibrate your scraper to the new presentation and structure so that you can get your solution up and running again quickly. Remember, when your scraper does not work it is your solution that will have downtime, not the page you are scraping. Their solution works as intended.
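
The breakage can be illustrated with a small Python sketch. In this made-up example the provider merely renames a CSS class from `balance` to `amount`. The data is exactly the same, but the filter no longer finds it. The `ScraperBroken` exception and the `scrape_balances` function are hypothetical names. The point is only that a scraper should notice when it comes back empty-handed rather than fail silently.

```python
import re

# The filter, written for the page as it looked on the day we built it.
BALANCE_PATTERN = re.compile(r'<td class="balance">([^<]*)</td>')


class ScraperBroken(Exception):
    """Raised when the page no longer matches what the scraper expects."""


def scrape_balances(page):
    balances = BALANCE_PATTERN.findall(page)
    if not balances:
        # Finding nothing at all more likely means the layout changed
        # than that the data is gone, so fail loudly.
        raise ScraperBroken("no balance cells found; page layout may have changed")
    return [float(b) for b in balances]


old_page = '<tr><td class="balance">310.00</td></tr>'
# The provider renames a CSS class; the data itself is unchanged.
new_page = '<tr><td class="amount">310.00</td></tr>'

print(scrape_balances(old_page))
try:
    scrape_balances(new_page)
except ScraperBroken as err:
    print("scraper needs recalibrating:", err)
```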

A second challenge is cost. Scrapers are run by servers to provide data, sometimes because a user clicked a link, but more often because the information must be updated at a regular interval, which can range from a few times a day to a few times a minute. The web pages that provide the information were developed to serve human users. They are not built or scaled for the kind of attention that a computer can give them. This makes running a web page that is being scraped by one or many other computers a lot more expensive than originally intended, and it motivates some companies to discourage scraping by changing their presentation and structure layers often, breaking the filters as much as they can.
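
A back-of-the-envelope calculation shows why. The numbers below are pure assumptions picked for illustration (10,000 customers, two visits a day by a human, one poll a minute by a scraper), not measurements from any real service.

```python
def requests_per_day(users, polls_per_hour):
    """Page loads per day when every user's page is polled at a fixed rate."""
    return users * polls_per_hour * 24


USERS = 10_000             # assumed number of customers
HUMAN_VISITS_PER_DAY = 2   # assumed: each customer logs in twice a day

human_load = USERS * HUMAN_VISITS_PER_DAY
# Assumed: a scraper refreshes every customer's page once a minute.
scraper_load = requests_per_day(USERS, polls_per_hour=60)

print("human page loads per day:  ", human_load)
print("scraper page loads per day:", scraper_load)
print("scraper load is", scraper_load // human_load, "times the human load")
```

Under these assumptions the same group of customers generates several hundred times more page loads once a scraper starts polling on their behalf.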

This challenge becomes even more important when the information is behind a login. For the providers of the service that is being scraped, this is a security loophole as well. Suddenly the information they have so carefully protected is also stored in a separate system: that of the one who scraped it. And what happens when that system is hacked?

Conclusion

All in all, scraping has been and continues to be a method some systems use. The reason they use it is that the information they require is not available through APIs, or they have invested in a platform that depends on it.

In the European banking industry, a new directive called PSD2 demands that banks provide APIs to their systems. With that also comes a prohibition on screen scraping, and “Open Banking” has been a buzzword ever since the EU first mandated it years ago. Europe is not alone in seeing this as a huge leap forward in terms of market competition, security and quality of service. Similar legal initiatives exist all over the world; to list a few countries working on it, we have Japan, Australia, Canada, South Africa and Singapore.

The world is moving more and more towards being API driven, both to ensure data quality and to ensure equality. But we are not fully there yet, and that is why you should know what screen scraping is. It can happen to your services, in which case you face the challenges we mentioned earlier, and very likely someone is doing it on your behalf as an individual when you use one of the apps on your phone!

This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.

Comments: (5)

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 23 May, 2019, 11:46

Kudos for a great article on the mechanism of screen scraping. When I read the first paragraph, I was wondering how scraping can be used because I thought the only difference between Open Banking and PFM apps like MINT launched 10 years ago was a difference in data access mechanism. The last paragraph confirms my belief when it says PSD2 bans screen scraping. I understand there's lack of API but, still, how are banks allowed to use screen scraping when it's banned?

A Finextra member, 23 May, 2019, 12:30

Great question. I wrote it this way for clarity, but as with all things banking there are some nuances. While screen scraping is not allowed, there are a couple of temporary exceptions.

The first and most important exception is that screen scraping is only prohibited once the bank has an accessible production-level PSD2 API. Even then there is a six-month adaptation period in which the third-party app can move from screen scraping to actual PSD2 usage, so within the legal framework there will still be a window where you could perform screen scraping legally. This window has a hard close though, so relying on it would not be advisable.
The second exception is that banks that have trouble meeting the deadline with a proper PSD2 API can allow screen scraping as a fallback, assuming the legal entity they have their license under allows them to do so. They have to apply for this. Even in this case they still need to manage the SCA element for consent, so screen scraping cannot be performed exactly as before ... but many of the mechanisms can still be used until the bank has a proper PSD2 interface, at which point the window mentioned in the first exception starts.
Finally, it is worth noting that there are other banking services outside of the PSD2 scope, and there is an ongoing discussion in many forums on whether a TPP should be allowed to scrape these. I would argue that it is not a good idea for security reasons, but I have failed to find a final and formal confirmation of this.
Hope these clarifications help!

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 23 May, 2019, 15:44

TY for the clarification.

Key is, who is responsible for having the said API in your line "once the bank has an accessible production level PSD2 API"? Is it the said bank itself or some regulator like (gasp!) EU Parliament / ECB that has mandated PSD2?

A Finextra member, 24 May, 2019, 09:08

This one is happily much simpler to answer clearly! The ones responsible for having the API in question are the banks themselves, and the ones who are there to ensure that they indeed create them are their national regulatory authorities, who in turn answer to their national governments.

There is a significant incentive to do this. The national authorities have it within their power to "motivate" the banks through several means, all the way from fines to taking their license away. At the beginning of the PSD2 live period, which starts in September, I expect the authorities to be somewhat lenient, as this has been a huge hurdle for most banks ... but as time goes by, I expect them to continuously tighten the reins, thereby forcing the banks to fulfil their legal obligation.

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 24 May, 2019, 09:37

Got it, thanks...