Under the Hood of Screen Scraping

Published in Neonomics

6 min read · May 23, 2019

Financial institutions preparing to meet the Regulatory Technical Standards (RTS) of Open Banking currently face a shortage of live Open APIs. For a bank that wants to get ready for the Open Banking era, the only viable choice is to use account aggregators, many of which provide solutions based on “screen scraping”.

What is screen scraping?

Today, many companies use so-called “scrapers” to do “screen scraping”, and most of us know that it is a way to get at the information on a web page. But what is it, and how does it work? Well … that is a bit more complicated.

To understand it, we need to know a bit about how the information in a system ends up in your browser as a user-friendly and useful page, and for that we need to know how the layers of that page work. This isn’t very complicated, but it is important to understand, so please stay with me!

The layers of a web page

This may be stating the obvious, but when you query a database for information, the result is usually very boring: just rows and columns. To get a page, we need to put a couple of layers on top of that data, and when you look at the data through those layers, it turns into a wonderful page. The layers are Data, Structure and Presentation. It looks something like the illustration below.

The data we already understand: it is the rows and columns of information, which are boring and not very readable. The structure comes from the HTML code, which says what data goes where on the page. Here we can also add extra bits of information such as a title, some text and other things. Finally, we put on the presentation layer, which is the CSS code that says how each image or element should be positioned and which colors and fonts you will see. No layer is very beautiful on its own, but when you combine them by looking at them through a web browser, you get the wonderful web page you want to see.
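To make the three layers a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the account names, the balances, the CSS rule and the `render_page` helper are assumptions, not something taken from a real bank page.

```python
from html import escape

# Data layer: plain rows and columns, as they might come out of a database.
# The account names and balances are invented for this example.
rows = [
    ("Savings", "1200.50"),
    ("Checking", "310.00"),
]

# Presentation layer: CSS that only says how things should look.
css = "td.balance { color: green; text-align: right; }"


def render_page(rows, css):
    """Structure layer: wrap each value in HTML that says where it goes."""
    body = "".join(
        f'<tr><td class="name">{escape(name)}</td>'
        f'<td class="balance">{escape(balance)}</td></tr>'
        for name, balance in rows
    )
    return (
        f"<html><head><title>Accounts</title><style>{css}</style></head>"
        f"<body><h1>Your accounts</h1><table>{body}</table></body></html>"
    )


# The browser receives all three layers combined into one document.
page = render_page(rows, css)
```

The string in `page` is roughly what a browser would receive: the data is still in there, but it is now wrapped in structure and presentation.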

This is not extraordinary in and of itself. The pictures on your TV screen combine red, green and blue to create the moving images. Looking at the red, green and blue channels separately does not make a pretty picture. You probably also played around with paint in kindergarten, mixing colors in a similar way. In pictures it works like this:

How then, do you do screen scraping?

Knowing what we now know, we understand that to get to the data, we need to filter out the presentation and structure. To do this filtering we create something called a “scraper”. Scrapers can contain very complicated mechanisms, using tools such as regular expressions, substrings and other ways to match patterns or reach the specific parts of the page where the data is placed. When you apply that filter to the page it was made for, it takes out everything that is not data and leaves you with only the relevant pieces of information on that page.
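As a rough illustration of such a filter, the sketch below uses a regular expression to pull account names and balances out of a small page. The page content, the class names `name` and `balance`, and the function `scrape_accounts` are hypothetical. A real scraper would be written against the actual markup of the page it targets and would usually be a good deal more involved.

```python
import re

# A hypothetical page, as a scraper would receive it from the server.
page = (
    "<html><head><style>td.balance { color: green; }</style></head>"
    "<body><h1>Your accounts</h1><table>"
    '<tr><td class="name">Savings</td><td class="balance">1200.50</td></tr>'
    '<tr><td class="name">Checking</td><td class="balance">310.00</td></tr>'
    "</table></body></html>"
)

# The "filter": a pattern that matches only the cells holding data and
# ignores the styling, the heading and the rest of the structure.
ROW_PATTERN = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>'
    r'<td class="balance">(?P<balance>[^<]+)</td>'
)


def scrape_accounts(html):
    """Return (name, balance) pairs found in the page, in page order."""
    return [
        (match.group("name"), float(match.group("balance")))
        for match in ROW_PATTERN.finditer(html)
    ]


accounts = scrape_accounts(page)
```

After this runs, `accounts` holds only the rows and columns again, with the structure and presentation stripped away.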

Once you have created the filter, you have two options. You can go to the web page it was written for and apply it yourself, or you can let a user ask your platform to visit the page on their behalf, apply the filter for them, and present the result as your own page with your own colors, logo, structure and everything. You can even combine the data from the page with other sources of data to improve on it. Once you have filtered out everything a computer does not care about and are left with only the pieces of information it will work with, there is virtually no limit to what you can do with them!

What are the challenges of screen scraping?

The main challenge is, of course, the maintenance of the filter/scraper. Since it needs to filter out so much while not removing any of the information you need, it must be very carefully crafted. And if those who provide the page you are filtering change something in their presentation layer or structure, the filter might break, and your use of the information with it! You must then recalibrate your scraper to the new presentation and structure quickly to get your solution up and running again. Remember, when your scraper does not work, it is your solution that has downtime, not the page you are scraping. Their solution works as intended.
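Here is a small sketch of how that breakage can look in practice, and of one way a scraper might notice it instead of silently returning nothing. The two page snippets, the `ScraperBroken` exception and the helper functions are all hypothetical, and the check shown is just one possible safeguard.

```python
import re

# The filter was written for cells marked with class="balance".
BALANCE_PATTERN = re.compile(r'<td class="balance">([^<]+)</td>')


class ScraperBroken(Exception):
    """Raised when the page no longer matches what the scraper expects."""


def scrape_balances(html):
    """Extract balances, failing loudly if the expected markup is missing."""
    balances = BALANCE_PATTERN.findall(html)
    if not balances:
        raise ScraperBroken("no balances found; the page layout may have changed")
    return [float(balance) for balance in balances]


def is_broken(html):
    """Report whether the scraper fails on the given page."""
    try:
        scrape_balances(html)
    except ScraperBroken:
        return True
    return False


# The page as it looked when the scraper was written.
old_page = '<tr><td class="balance">1200.50</td></tr>'

# The same data after a redesign that merely renamed the CSS class.
new_page = '<tr><td class="amount">1200.50</td></tr>'
```

With these definitions, `is_broken(old_page)` is false and `is_broken(new_page)` is true: the data on the page is unchanged, but a one-word change in the structure is enough to break the filter.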

A second challenge is cost. Scrapers are used by servers to provide data, sometimes because a user clicked a link, but more often because the server must update the information at a regular interval, which can range from a few times a day to a few times a minute. The web pages that provide the information were developed to serve human users. They are not built or scaled for the kind of attention that a computer can give them. This makes running a web page that is being scraped by one or many other computers a lot more expensive than originally intended. It also motivates some companies to discourage scraping by changing their presentation and structure layers often, breaking the filters as much as they can.
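A back-of-the-envelope calculation shows why the difference matters. The numbers below (10,000 accounts, three human visits a day, a scraper polling twice a minute) are assumptions picked only to illustrate the scale, not measurements from any real service.

```python
MINUTES_PER_DAY = 24 * 60


def requests_per_day(accounts, polls_per_minute):
    """Page requests per day when every account is polled at a fixed rate."""
    return accounts * polls_per_minute * MINUTES_PER_DAY


# Assumed: 10,000 human users who each look at their page three times a day.
human_requests = 10_000 * 3

# Assumed: a scraper refreshing the same 10,000 accounts twice a minute.
scraper_requests = requests_per_day(accounts=10_000, polls_per_minute=2)

# How many times more load the scraper generates than the human users.
load_factor = scraper_requests // human_requests
```

Under these assumptions the scraper generates 28,800,000 requests a day against 30,000 from the human users, which is 960 times the load the page was built for.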

This challenge becomes even more important when the information is behind a login. For the providers of the service being scraped, this is a security loophole as well. Suddenly the information they have so carefully protected is also stored in a separate system, that of whoever scraped it. What happens when that system is hacked?

Conclusion

All in all, scraping has been and continues to be a method some systems use. The reason they use it is that the information they require is not available through APIs, or they have invested in a platform that depends on it.

In the European banking industry, a directive called PSD2 demands that banks provide APIs to their systems. With that also comes a prohibition on screen scraping, and “Open Banking” has been a buzzword ever since the EU first mandated it years ago. Europe is not alone in seeing this as a huge leap forward in terms of market competition, security and quality of service. Similar legal initiatives exist all over the world; Japan, Australia, Canada, South Africa and Singapore are a few of the countries working on them.

The world is moving more and more towards being API-driven, to ensure both data quality and equality. We are not fully there yet, though, and that is why you should know what screen scraping is. It can happen to your services, in which case you face the challenges mentioned earlier, and very likely someone is doing it on your behalf as an individual when you use one of the apps on your phone!

Read the original article at: https://www.finextra.com/blogposting/17269/under-the-hood-of-screen-scraping

Written by Yujin Jo