Facing the tax season like a programmer

Anurag Ramdasan
Mar 11, 2018 · 7 min read

Phone bill about two G’s flat
No need to worry, my accountant handles that

- Notorious B.I.G

So it's tax season and the taxman came knocking for bills. This is one of my least favorite times of the year. After all, who wants to spend time chasing old invoices and, worst of all, organizing them?

[Image: :(]

Unfortunately for me, I take a cab twice a day and order takeout at least once a day. These still form only a small subset of my expenses, and I completely lack the self-discipline to manage all those invoices or keep tabs on how much I've spent!

However, the taxman is the taxman, and I have been told you can't avoid the taxman. So, being true to the lazy, shortcut-taking stereotypical programmer, I started hunting for libraries on GitHub that would let me get away without organizing anything.

A few seconds of searching and I came across Uber Data Extractor. Thanks, Mr. Umm Jackson.

[Image: Good artists copy, great artists fork]

So that's the first step solved right there. It goes through your entire Uber account, scrapes the pages, extracts all the trip information and puts it into a CSV file. Remember that we need all the invoices and not just the trip information, but this seems like a good start.

Most of the magic of scraping seems to happen with this piece of code that defines the scraper.

declaring basics selectors for scrape

This is possible because of an interesting library called Artoo.js. I had never heard of it before. It's a small library which provides minimal but powerful functions to scrape a webpage from your browser console. I liked this because it meant none of the session management B.S. that you would otherwise have to deal with.

More on Artoo later; let's focus on Uber for now. This is the kind of output that uber-data-extractor gives you. The most important thing here for us is the trip_id column.

[Image: I had to remove the address for privacy reasons]

The reason we care so much about the trip_id column is the trip page below. This is the page where you can actually download your invoice. The trip_id value is the URL parameter for each trip page. As long as you have the trip_id values, you can get a piece of JavaScript to open up all the trip pages and get the invoices.

[Image: almost left the address out here]

However, Artoo is just a scraper. It can only get you the HTML and cannot perform any DOM activity. This means that we can in theory use Artoo to find all the trip_ids and from there use a piece of JavaScript to open all the invoice pages, but at the end of it all we would still need to download all the invoices ourselves. Unacceptable!

Back to googling. A little while later, I discovered a Chrome extension I had never heard of before.

[Image: be very careful using this though]

Enter Custom JavaScript for Websites. This add-on basically injects JavaScript into a page once the page is fully loaded.

This is amazing, because now the injected script can click the download invoice button for us and get all the invoices auto-downloaded. Zero work needed! All the JS that you ever need is this:

Now that we have the basic setup to auto download invoices from the trips page, let us go back to making it all work together.

We still have all our trip_ids in the CSV file, so let's put together a small Python script to extract all of them, turn them into a JavaScript array declaration and write that to a text file. We're doing this now so that we can feed it to a piece of JS code later to download all the invoices. Here's the script for that.
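A minimal sketch of what such a script could look like, assuming the CSV has a header row with a trip_id column. The function name, the JavaScript variable name tripIds and the file names in the usage note are placeholders for illustration, not names taken from the original script.

```python
import csv
import json


def trip_ids_to_js(csv_path, out_path, column="trip_id", var_name="tripIds"):
    """Read one CSV column and write it out as a JavaScript array declaration."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader keys each row by the header line; rows with a blank
        # or missing value in the chosen column are skipped.
        ids = [
            row[column].strip()
            for row in csv.DictReader(f)
            if (row.get(column) or "").strip()
        ]

    # json.dumps emits a list of quoted strings, which is also a valid
    # JavaScript array literal.
    declaration = f"var {var_name} = {json.dumps(ids)};"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(declaration + "\n")
    return declaration
```

Calling `trip_ids_to_js("trips.csv", "trip_ids.txt")` (both file names hypothetical) writes a single line of JavaScript that can be pasted into the console ahead of the download code. The `column` argument is there so the same function can point at a differently named ID column.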

Now the text file has something like

Now we can simply copy and paste the code below into the console. It opens two trip pages at a time, downloads their invoices, and repeats with the next two every 9 seconds.

Why 2 every 9 seconds? I had to add a setTimeout because without it my code opened a few hundred tabs at once and brought my entire system down. With the setTimeout and w.close() on line 16, we are assured that only two tabs are open at once, and that the invoices are downloaded and the tabs closed before the next two tabs are opened.
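To put numbers on that pacing, here is a small Python sketch (not the JavaScript from the post) of the same schedule: batches of two, one batch every 9 seconds. The function names and the 1,000-trip figure in the note are illustrative assumptions.

```python
import math


def batch_schedule(trip_ids, batch_size=2, interval_s=9):
    """Return (start_offset_seconds, batch) pairs, one batch per interval."""
    return [
        (n * interval_s, trip_ids[i:i + batch_size])
        for n, i in enumerate(range(0, len(trip_ids), batch_size))
    ]


def total_minutes(n_trips, batch_size=2, interval_s=9):
    """Rough run time: number of batches times the interval, in minutes."""
    return math.ceil(n_trips / batch_size) * interval_s / 60
```

For five trips this yields batches starting at 0, 9 and 18 seconds, the last one holding a single trip. At the same rate, 1,000 trips come to 500 batches, or roughly 75 minutes of unattended downloading.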

It is very important to show respect to your computer when you are dealing with Chrome!

Btw, this script downloads the invoices in PDF format. If you're like me and have "Always open files of this type on download" enabled for PDF files, you may want to disable it before running the script.

Okay so this was great and now I have over a thousand Uber invoices in a folder along with a CSV file which summarizes all the trips and pricing info.

However, Ubering isn't all that I do excessively. I have a ridiculous number of Swiggy bills because it would seem that I am too lazy to cook for myself. Too bad you can't code cooking. Yet.

Anyway, keeping that in mind, I started looking at the Swiggy dashboard hoping to find a way to download my 300+ invoices for that.

Good news: unlike Uber, Swiggy doesn't make you open a new page for your invoice; you can find the button on the same page. Bad news: Swiggy doesn't paginate with page tabs; it uses an expand button on the same page.

[Image: Chai point is genuinely good]

However, Artoo seems to have an autoExpand feature. This was the biggest sigh of relief for me in this entire exercise. I was fairly certain I would’ve abandoned the entire project if that autoExpand feature didn't exist.

Now, for Uber I had a default extractor script which was part of the Chrome extension. The same doesn't exist for Swiggy, so I had to write it from scratch myself. However, it was a fairly simple task.

Now, Swiggy does some obfuscation, so we have to run with these weird class names for now. Given that Swiggy also doesn't present data in the cleanest format, we have to do some pre-processing (lines 14 and 15) below.

We use autoExpand on line 26, but everything else needed to scrape Swiggy is fairly basic.

Once this is done, we end up with a CSV file very similar to the one we had for Uber.

For some reason, I put the order ID in a different column, so I had to tweak my Python extractor script accordingly.

The piece of JavaScript is also more or less the same except for the URL construction. The 9000 ms is not really a limit; you can tweak it according to what works best for you and how adventurous you feel that day.

But there it is. Between the pagination handling, autoExpand and Custom JavaScript for Websites, most cases for most billing pages are very well covered, and you should be able to find all the code here.

All of this took barely a couple of hours to put together and test, and it must've saved me a few days of work, if not weeks.

Taxman: 0 Programmer: 1 #lazinessFTW

Also, it is important to realize that this is not just a taxman vs. programmer effort. I have a programming background and really enjoy it, so I constantly try to see how I can bring my programming skills into pretty much everything that I do. This also helps me tackle a lot of procrastination on things that I don't want to do, by bringing things I enjoy doing into them. You can read some more of the related stuff here. So keep coding and don't let boring basic work ever get to you!

If you liked this article or are just wondering if your click feature still works, you can repeatedly tap on the clap button to find out, or reach out to me on Twitter. I would love to hear what else can be done to make the above approaches more efficient.

3one4 Capital

Transformative Capital to help you #RaiseTheBar

Anurag Ramdasan

Written by

I enjoy learning about human behavior and how we interact with our surroundings and ourselves, and especially with technology.

3one4 Capital

3one4 Capital is an early-stage venture capital fund based in Bangalore, India. Our investments include Licious, Betterplace, DarwinBox, Open, Pocket Aces, Jupiter, Yourstory, Faircent, and Tracxn.

