Automating data extraction from email attachments with Databricks

Otto Li
4 min read · Mar 20, 2023


The untapped goldmine for richer and deeper insights.

Despite today’s explosion of AI innovation, I continue to witness an immense dependency on email attachments. As both a consultant and solutions architect in the data and machine learning space, I believe they are one of the largest untapped reserves of rich data that businesses, organisations, vendors, and customers could use to streamline their decision making.

When ‘electronic mail’, or email, was conceived more than 50 years ago, no one was ready for the dependence and volumes we see today. Over the years, individuals and businesses large and small have used email for communication and data transfer; by 2020, over 306 billion emails were being sent and received each day [Statista]. The data trapped in these reserves (think of all the CSVs, Excel files, and PDFs) is more precious and useful than ever as the world looks towards the power of data science and machine learning.

Whilst we need to tackle this email dependency epidemic at its root cause (a topic for another time), the operational pain of depending on external parties and other organisations has kept this paradigm alive far past its expiry date. As a result, organisations that want to shift away from email, however frustrating it is, find it difficult to build workflows that automate the extraction and processing of email attachments for consumption by analytics and machine learning use cases.

Think of all the emails at:

  1. Insurance companies with Excel and PDF claims forms
  2. Banks with financial ledgers in Excel and approvals in PDFs
  3. Retail stores with their budgets and sales ledgers in Excel

Nevertheless, organisations that truly wish to capitalise on machine learning and digital transformation can now move on from manual email processing, regardless of their partners, suppliers, vendors or customers. Automate your workflows to convert your attachments into Delta Lake tables for analytics, reporting and machine learning with Databricks.

If you or your organisation wish to start extracting all the rich data from your email attachments, the Databricks Lakehouse Platform serves as the perfect unified data environment to extract those attachments, ingest all the valuable data, and serve the resulting analytics and machine learning outputs.

A typical pattern for processing emails into your data stack.

Let’s put it into practice…

Emails have become a blessing and a curse over the years, and now, more than ever, we see email automation boosting productivity across business teams.

Some ideas where email automation can save time, costs and errors:

  1. Internal business reporting, such as budgeting and daily store sales summaries
  2. Third-party data ingestion, such as invoices, order forms and claims
  3. Customer queries and forms

Step 1: Connect

Connect to and set up your email provider, and download all relevant attachments into one of your storage buckets (S3, ADLS Gen2, GCS).

The code below shows an example of how you can use a notebook to ingest your email attachments to your storage for processing.

# Set up a connection via IMAP and download attachments to storage
import os
from imap_tools import MailBox, AND

imapHost = '[IMAP HOST]'
imapUser = '[USERNAME]'
imapPass = '[PASSWORD]'

# Local file APIs such as open() reach DBFS mounts through the /dbfs prefix
target_folder = '/dbfs/mnt/mount_path/folder'

log = []

with MailBox(imapHost).login(imapUser, imapPass, initial_folder='INBOX') as mailbox:
    for message in mailbox.fetch(AND(subject='[filter]')):
        uid = message.uid
        subject = message.subject
        sender = message.from_
        date = message.date

        for att in message.attachments:
            # Only download the attachment types we can parse downstream
            if att.filename.lower().endswith(('.pdf', '.xlsx')):
                obj_name = uid + '_' + att.filename
                path = os.path.join(target_folder, obj_name)

                # Write the raw attachment payload to the mounted bucket
                with open(path, 'wb') as fp:
                    fp.write(att.payload)

                # Record what was ingested (0 = not yet processed)
                log.append([uid, subject, sender, date, path, 0])

Alternatively, ingestion tools such as Fivetran are available via Databricks Partner Connect.

Partner Connect: Fivetran (Fivetran Website | Fivetran Partner Connect | Databricks Data Ingestion Partners)

Step 2: Parse

Based on your business requirements, parse these documents by reading them, cleaning them and saving them to a master file. This process can be automated with Databricks Workflows or Delta Live Tables, as shown in the sketch below.

This is the ‘bronze’ stage.
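As an illustration, here is a minimal sketch of that bronze parse for Excel attachments, assuming they landed in the mount path used in Step 1; the table name email_ingest.bronze_attachments and the use of pandas (which needs openpyxl installed) are my assumptions, not a prescribed pattern.

import os
import pandas as pd
from pyspark.sql import functions as F

raw_path = '/dbfs/mnt/mount_path/folder'  # where Step 1 saved the attachments

# Read every Excel attachment, keeping a pointer back to its source file
frames = []
for name in os.listdir(raw_path):
    if name.lower().endswith('.xlsx'):
        pdf = pd.read_excel(os.path.join(raw_path, name))  # needs openpyxl
        pdf['source_file'] = name
        frames.append(pdf)

# 'spark' is the SparkSession Databricks provides in every notebook
bronze_df = spark.createDataFrame(pd.concat(frames, ignore_index=True))

# Append to the bronze Delta table (table name is illustrative)
(bronze_df
    .withColumn('ingested_at', F.current_timestamp())
    .write.format('delta')
    .mode('append')
    .saveAsTable('email_ingest.bronze_attachments'))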

This can be built upon with any other data cleaning or engineering and orchestrated through Databricks Workflows, Delta Live Tables, or with our Partners such as dbt via Partner Connect.

Delta: What is Delta Lake?

Databricks workflows: Introduction to Databricks workflows

DLT: Introduction to Delta Live Tables

Partner Connect: dbt (dbt Website | dbt Partner Connect | Data Preparation and Transformation Partners)

Step 3: Aggregate or Model

Data, once in Delta, can then be efficiently cleaned, joined with other tables or dimensions, and made ready for machine learning or further processing. This is the ‘silver’ stage.
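Here is a minimal sketch of that silver step, assuming the illustrative bronze table above and a hypothetical store dimension; every table and column name here is an assumption.

from pyspark.sql import functions as F

bronze = spark.table('email_ingest.bronze_attachments')
stores = spark.table('reference.store_dim')  # hypothetical dimension table

silver = (bronze
    .dropDuplicates(['source_file'])                  # guard against re-ingested attachments
    .withColumn('sale_date', F.to_date('sale_date'))  # normalise types
    .filter(F.col('amount').isNotNull())              # drop unusable rows
    .join(stores, on='store_id', how='left'))         # enrich with dimensions

silver.write.format('delta').mode('overwrite').saveAsTable('email_ingest.silver_sales')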

Learn more about how you can use your newly provisioned data source to feed machine learning models on Databricks via MLflow.

The Big Book of MLOps

MLOps on the Lakehouse

MLflow 2.0
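To make the MLflow hand-off concrete, here is a minimal sketch that trains a baseline model against the illustrative silver table from Step 3, with autologging capturing parameters, metrics and the model; the feature and label columns are assumptions.

import mlflow
from sklearn.ensemble import RandomForestRegressor

mlflow.sklearn.autolog()  # log params, metrics and the model automatically

# 'spark' is provided by the Databricks notebook; table and columns are illustrative
pdf = spark.table('email_ingest.silver_sales').toPandas()
X, y = pdf[['store_id', 'units']], pdf['amount']  # hypothetical feature/label columns

with mlflow.start_run(run_name='sales_forecast_baseline'):
    model = RandomForestRegressor(n_estimators=100).fit(X, y)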

Step 4: Serve

Finally, clean data can now be aggregated to business levels for reporting or served to applications or downstream operations. This is the ‘gold’ stage.
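A minimal sketch of that gold aggregation, rolling the illustrative silver table up to monthly store-level totals ready for a dashboard; the table and column names remain assumptions.

from pyspark.sql import functions as F

# Aggregate silver rows to a business-level reporting table
gold = (spark.table('email_ingest.silver_sales')
    .groupBy('store_id', F.date_trunc('month', 'sale_date').alias('month'))
    .agg(F.sum('amount').alias('total_sales'),
         F.countDistinct('source_file').alias('attachments_processed')))

gold.write.format('delta').mode('overwrite').saveAsTable('email_ingest.gold_monthly_sales')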

Partner Connect: Tableau (Tableau Partner Connect | BI and Visualization Partners)

Partner Connect: Power BI (PowerBI Partner Connect | BI and Visualization Partners)

Databricks SQL: What is Databricks SQL? | Databricks SQL Dashboards

Go forth and conquer — with automation on the Databricks Lakehouse

With support for unstructured, semi-structured and structured data, the Databricks Lakehouse is perfect for managing your data from raw to gold. Delta Lake’s ability to unify a lake and a warehouse lets you pipe attachment data straight into models, direct querying and BI, all without the complexity of multiple platforms, scattered storage locations, varied storage formats and silos. Since all your attachments are ingested into one unified data platform, even photos and scanned documents are covered: it’s as easy as ‘bronze, silver, gold’ to bring automation and AI/ML capabilities into your organisation’s workflows.

These steps only touch on the early stages of what is possible; learn more about the benefits of the Databricks Lakehouse.

If this is something you want to explore, please reach out to your Databricks representative/account team.
