Use Your Data: Automating Email Attachments Downloads

Part 1 — Problem Statement formulation and collecting the data

oranyeli samuel
Published in Analytics Vidhya
Oct 14, 2019 (5 min read)

This is the first of two posts that show how to use your data to solve problems. I believe that you do not need a grand problem before you can apply your data science skills. Some problems are right in front of us, and with careful attention and automation we can solve them. The first part covers how to automate downloads of email attachments, while the second part covers how to scrape PDFs and extract relevant data. Please note that no data is shared, as the information is financially sensitive.

Problem Overview: Recently the proprietor at the family day care my kids attend informed me that there was an outstanding amount in payments. Outstanding? Wow! This much? How come? She explained that the child care subsidies for my kids fluctuated and she just recently completed her checks. She gave me a list of what was owed so far, and I assured her I would check my records and sort it out.

Alright, let’s get to work.

Problem definition: How much do I owe in back payments for child care? And how do I stop these payment discrepancies from recurring?

Course of Action: Download the statements of entitlement from the email (each gives a breakdown of the number of hours used, the child care subsidy paid by Centrelink and the outstanding amount to be paid by the parents), and do some calculations. I could do this by hand, adding and subtracting the amounts manually. But what happens three months from now? Six months? Will I have to repeat this very manual process? Absolutely inconvenient! Surely this has to be automated.

I employed the first three parts of the OSEMN model for data science (Obtain Data, Scrubbing Data, Exploring Data) — http://www.dataists.com/2010/09/a-taxonomy-of-data-science. In this article, we will focus on the first part — Obtaining Data.

Step 1 — Get the Data: The statements of entitlement are sent to my wife’s inbox as PDF files. How convenient. I have to somehow scrape these PDF files, combine them into one spreadsheet, and run computations to find out how much I owe in back payments due to the child care subsidy fluctuations. But first I had to get the PDFs. How? My first thought was to download each of them by hand; there were not many (yet), but it would still be inconvenient. There had to be a way to simplify things.

What if I found a way to automate the download process, where somehow the program would go into the mailbox, sift through for mails that had a subject titled ‘Statement of Entitlement’, find the attachments in those specific mails, and download them to my computer?

Python is one of my tools for data science. It is simple, easy and effective. The script, linked in full at the end of this post, handles the entire automation process.

The imaplib module connects to the email server over IMAP and handles retrieval. The email package parses and manages email messages. ConfigParser, in this case, retrieves the mail username and password stored in a separate file; you do not want your email details in your scripts, which is definitely unsafe! pathlib helps with file access and is, in my opinion, much easier to use than the os module.

Alright, modules loaded. Let’s log in to our mailbox. First we create a file that will contain our username and password and save it as ‘something.ini’. We then reference this file using configparser and log in to our Gmail account. Again, the whole idea of keeping our details in a separate file is security: no one can read our login details from the script.
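As a rough sketch of that step, and not necessarily the exact code in the original script: the section name `credentials`, the helper names `read_credentials` and `login`, and the placeholder values below are all assumptions made for illustration. The host `imap.gmail.com` is Gmail's IMAP server.

```python
import imaplib
import tempfile
from configparser import ConfigParser
from pathlib import Path


def read_credentials(ini_path):
    """Return (username, password) from the [credentials] section of an .ini file."""
    config = ConfigParser()
    # ConfigParser.read() silently skips files it cannot open, so check the result.
    if not config.read(ini_path):
        raise FileNotFoundError(f"could not read {ini_path}")
    section = config["credentials"]  # hypothetical section name
    return section["username"], section["password"]


def login(username, password, host="imap.gmail.com"):
    """Open an SSL connection to the IMAP server and log in."""
    mail = imaplib.IMAP4_SSL(host)
    # Raises imaplib.IMAP4.error if authentication fails.
    # Gmail may require an app password here rather than the account password.
    mail.login(username, password)
    return mail


# Demonstration with a throwaway file; the values are placeholders, not real details.
sample_ini = Path(tempfile.mkdtemp()) / "something.ini"
sample_ini.write_text(
    "[credentials]\n"
    "username = someone@example.com\n"
    "password = not-a-real-password\n"
)
username, password = read_credentials(sample_ini)
```

With a real file in place, `mail = login(username, password)` would open the connection. Keeping `login` as a separate function means nothing touches the network until you call it.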

At this point you should see an ‘OK’ response confirming a successful login. If it fails and you receive an ‘Authentication Failure’ error, you could change your Gmail settings to allow access for less secure apps. You can always turn it off once you have executed the script. The mailbox defaults to your inbox. Of course, you can access other parts of your mail, but here I will focus on our problem scope.

Okay, so we are in, let’s get those attachments. I mean, that’s what we are here for.

Here we search for our mail. We can be specific because we know the subject title. The search is done by UID (unique identifier), which every message has. Searching by UID is advisable; this is copied directly from Python’s imaplib documentation: “Note that IMAP4 message numbers change as the mailbox changes; in particular, after an EXPUNGE command performs deletions the remaining messages are renumbered. So it is highly advisable to use UIDs instead, with the UID command.” https://docs.python.org/3.7/library/imaplib.html
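A minimal sketch of the UID search might look like this. The helper names `search_criterion` and `find_uids` are made up for illustration, and `mail` is assumed to be an already logged-in `imaplib.IMAP4_SSL` connection.

```python
def search_criterion(subject):
    """Build an IMAP SEARCH criterion that matches messages by subject."""
    return f'(SUBJECT "{subject}")'


def find_uids(mail, subject):
    """Return the UIDs (as bytes) of inbox messages whose subject matches."""
    mail.select()  # with no argument, imaplib selects INBOX
    # The None argument means no charset is specified for the search.
    status, data = mail.uid("search", None, search_criterion(subject))
    if status != "OK":
        raise RuntimeError(f"search failed with status {status}")
    # data is a one-element list such as [b"101 102 105"]; split it into UIDs.
    return data[0].split()
```

Calling `find_uids(mail, "Statement of Entitlement")` would then give one UID per matching message.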

The output of the search is a tuple. We are only interested in the second element, which holds the identifiers of the matching messages (in bytes). We still have some way to go to get our attachments. Come on, let’s go, almost there.
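Each UID still has to be turned into an actual message before we can look for attachments. One way to do that, sketched here with a hypothetical helper named `fetch_message` and an already logged-in `mail` connection, is to fetch the raw bytes and hand them to the email package:

```python
import email


def fetch_message(mail, uid):
    """Fetch one message by UID and parse it into an email.message.Message."""
    status, data = mail.uid("fetch", uid, "(RFC822)")
    if status != "OK" or not data or data[0] is None:
        raise RuntimeError(f"could not fetch message with UID {uid!r}")
    # data[0] is a (response_header, raw_message_bytes) pair.
    raw_bytes = data[0][1]
    return email.message_from_bytes(raw_bytes)
```

The parsed message gives access to headers such as `message["Subject"]` and to the individual parts that may carry attachments.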

Basically, the script walks through our messages, looks only for the sub-parts of each message that carry attachments, and downloads them to a specific location on my computer.
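The walk-and-save step could be sketched as follows. The function name `save_attachments` and the demo message are assumptions for illustration; the linked script may differ in its details.

```python
import tempfile
from email.message import EmailMessage
from pathlib import Path


def save_attachments(message, folder):
    """Write every attachment in `message` to `folder` and return the saved paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for part in message.walk():
        # Multipart containers only hold other parts; they carry no file data.
        if part.get_content_maintype() == "multipart":
            continue
        # Skip parts that are not marked as attachments (for example the body text).
        if part.get_content_disposition() != "attachment":
            continue
        filename = part.get_filename()
        if not filename:
            continue
        # Use only the final path component of the name supplied by the sender.
        target = folder / Path(filename).name
        target.write_bytes(part.get_payload(decode=True))
        saved.append(target)
    return saved


# Demonstration with a locally built message; the PDF bytes are a placeholder.
demo = EmailMessage()
demo["Subject"] = "Statement of Entitlement"
demo.set_content("Statement attached.")
demo.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="statement.pdf",
)
saved_paths = save_attachments(demo, tempfile.mkdtemp())
```

In the real workflow, `message` would come from the mailbox rather than being built by hand, and `folder` would be wherever you want the PDFs to land.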

And there you have it. Email attachments can be automatically downloaded to my PC. Here is the link to the complete script in my GitHub repository: https://github.com/samukweku/PDF_Extraction/blob/master/attachment_downloads.py

Have a go at it, tweak it, break it, configure it to your needs and let me know what you think. Feedback and suggestions are welcome.

Let’s head to Part 2 — Extracting the data.