When we think automation, we tend to think more of physical work being replaced and automated by machines. However, a lot of office jobs have components and aspects that it would benefit being automated, because it would free workers up for more high level and time sensitive tasks.
In this article I will show an example of how a small script can automate aspects of finance, and the process of generating invoices to send to clients, with no need to have administrator privileges to access a system back-end or mess around with an API, just simply using the web interface you or I would look at and copy the information out of, by using web scraping.
Before we begin, all code can be found here on my Github.
To Start — The Sample Data.
I used 5 sample ‘companies’, that each have a name, an amount due, a VAT rate, and a due date.
The Extraction — A Puppeteer.
Now that the sample data is constructed, I began developing the script to pull the data directly out of the browser. To do this, I have used a library called Puppeteer that is made by the Google Chrome team.
Before I could extract anything though, I first had to determine what elements within the website’s code contain the data. To find these elements, a simple query selector was used that finds the table nested within the relatively complex Vue structure. This method shows that regardless of how complex the website holding the data seems, it can only be holding it in HTML code (whether static of generated) and the data you’re after can be targeted within that HTML.
With the relevant data found within the HTML structure, and because all data is contained within the one table, a map function can be used to extract the information into a tidy array.
In the above image, we see Puppeteer being used to pull all elements out within the table that we have identified, and storing them into an array. I have made this ever so slightly more complicated by storing everything in the one array, so all records are together. This could be present a problem, so …
A Problem — The Array Splitter.
The data that we can see in the screenshot above is an example of how the data would appear in the real world — as a large list not neatly split up for you into each record. But if we look at the array, we can see the data is structured.
Every fifth entry represents a new entry in the table. With the first being the name, the second being the amount, third the date due, fourth the VAT rate and the fifth marking the start of the next entry.
Now that we know there is actually an order to the array, we can split it up since we know at each index what result to expect. I have split it like so:
With each index within the array denoting what value it is. I have then added all values to their own arrays. So for example for each client, their amount and date and VAT are found at the same index in each corresponding array, making recall easier. There are many ways and more efficient ways to do this operation, but I have found this to be an effective method. Of course in a real world example, much more error catching should be done!
Building It — The PDF Generator.
In this stage, I now have the value of each client stored in individual arrays representing the data they hold. I can therefore loop through each client, and begin generating an invoice based on their data.
To do this I have used the pdfkit library to quickly generate PDFs. Each client has their own PDF, as they would in the real world, which can be named anything but I have opted for the client’s name, and that PDF has the information that is relevant to them. PDFKit makes building PDFs very easy, and images can be embedded as well as fonts and stylings used.
As a note, the VAT percentage as it appears in the tables has a percent sign after. This was deliberate to show that the data may not always appear as you want it to. In an ideal world, the VAT would just be a number but instead has extra characters attached so it cannot be manipulated immediately and directly.
As part of the PDF generation, the total amount is calculated by adding the VAT on to the amount due, using the below function. This is quick and simple and just uses regular expressions to strip all non alphanumeric values out of the VAT number.
Testing — The Output PDF.
With the PDFs now being generated and stored in a directory, we can look at a sample PDF that has been output.
It might not be the prettiest invoice, but it could be with some simple styling and branding added to it. As we can see it is built using the information that was available to us in the tables at the very beginning, and is now ready to be sent out to a client!
It is easy to see how this process can make so many jobs so much easier; if we imagine an employee who gathers information from many systems and collates them into one consolidated place of information, then this process of web scraping allows them to do it automatically and tremendously quicker, and frees the employee up for much more complex tasks.
We can imagine a situation in which more complex algorithms are utilised, such as Bayesian Classification used to change the wording of a generated output, or a Neural Network used to inform and convey the information in better and previously unconsidered ways.
The complete sample code for the script can be found here on my Github.