<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Vinayak Mehta on Medium]]></title>
        <description><![CDATA[Stories by Vinayak Mehta on Medium]]></description>
        <link>https://medium.com/@vinayakmehta?source=rss-5c1aac1323f4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*bQogU_YxSVJHDgKTIv3OKA.jpeg</url>
            <title>Stories by Vinayak Mehta on Medium</title>
            <link>https://medium.com/@vinayakmehta?source=rss-5c1aac1323f4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 19:16:09 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@vinayakmehta/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Community Update: Announcing Grofers Tech Talks!]]></title>
            <link>https://lambda.blinkit.com/community-update-announcing-grofers-tech-talks-1d9ecde244e0?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/1d9ecde244e0</guid>
            <category><![CDATA[meetup]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[community]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[culture]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Mon, 27 May 2019 10:24:46 GMT</pubDate>
            <atom:updated>2019-05-27T10:50:24.892Z</atom:updated>
            <content:encoded><![CDATA[<p>We’re excited to announce Grofers Tech Talks! This is an initiative towards our bid to foster an open exchange of ideas around technology. We’ve decided to keep the scope of this meetup a bit broad, to promote discussions on topics ranging from product to devops and from design to engineering.</p><p>We plan to do this every month. Each meetup will have four talks, with one or two talks by Grofers and the rest by the community. To propose talks, you can simply open an issue on this GitHub repo: <a href="https://github.com/grofers/talks">https://github.com/grofers/talks</a>. Pretty neat, right?</p><p>On April 20, we hosted the first Grofers Tech Talks meetup (or GT-squared as we’ve started to call it) in our Bangalore office. The meetup had two tracks, Engineering and Data Science, each comprising two talks.</p><h3>Introduction to Kubernetes by Adit Biswas</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*lQ5476lo-VltQHU6QrZ6Lg.png" /></figure><p>The meetup was kickstarted with the Engineering track, where Adit Biswas gave an introductory talk on Kubernetes. He taught us some basic concepts about containers and explained the Kubernetes architecture, which he followed up with a short demo. He also talked about how we use Kubernetes at Grofers scale.</p><h3>Rulette: A simple and versatile rule engine by Kislay Verma</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*Phqy89XZ-iJvT6sjDQEvuQ.png" /></figure><p>Adit’s talk was followed by a talk on <a href="http://rulette.org/">Rulette</a> by Kislay Verma. He talked about how you can simplify business rule management using Rulette, an open-source rule engine developed by him!</p><h3>Extracting tabular data from PDFs using Camelot and Excalibur by Vinayak Mehta</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/948/1*HfCTa7KJ-tq6YAO0azCofw.png" /></figure><p>After a short break, we started with the Data Science track. The third talk (by me) focused on how you can use <a href="http://camelot-py.readthedocs.io/">Camelot</a> and <a href="https://excalibur-py.readthedocs.io/">Excalibur</a>, a Python library and a web interface I developed, to extract tabular data from PDF files very easily.</p><h3>Introduction to Tweets Analysis in R by Abdul Majed RS</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/947/1*3plQuMxLr5wA-5XnRGxcdA.png" /></figure><p>In the final talk, Abdul Majed RS showed us how we can easily analyze tweets using R. He analyzed tweets about the Indian elections episode of Patriot Act with Hasan Minhaj and showed us some interesting insights.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/949/1*ONUPsG4h1d46IT08I7IDrQ.png" /></figure><p>In the end, we opened the space for lightning talks before initiating a coordinated attack on pizzas. Devjyoti Patra talked about <a href="https://github.com/devjyotipatra/tenalisorcerer">tenalisorcerer</a>, an open-source SQL parsing and analysis framework that he’s working on.</p><p>Sounds fun, right? You can find links to talk videos and slides from <a href="https://www.meetup.com/Grofers-Tech-Talks-Bangalore/events/260569166">Meetup #1</a> in the README on this GitHub repo: <a href="https://github.com/grofers/talks">https://github.com/grofers/talks</a>.</p><h3>Call for Proposals</h3><p>We’re organizing <a href="https://www.meetup.com/Grofers-Tech-Talks-Bangalore/events/261607636">Meetup #2</a> on 1st June (this Saturday) in our Bangalore office. 
You can propose talks about anything interesting that you’re working on by simply opening an issue on the GitHub repo mentioned above. We would love to hear from you.</p><p>If you’re interested in solving the kind of problems we write about on Lambda, check out our open positions <a href="https://grofers.recruiterbox.com/">here</a>. You can follow Grofers Engineering on <a href="https://twitter.com/groferseng">Twitter</a> and <a href="https://www.instagram.com/groferseng">Instagram</a> to always stay updated on what we’re doing. Hope to see you at our next meetup!</p><hr><p><a href="https://lambda.blinkit.com/community-update-announcing-grofers-tech-talks-1d9ecde244e0">Community Update: Announcing Grofers Tech Talks!</a> was originally published in <a href="https://lambda.blinkit.com">Lambda by Blinkit</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Community Update — March 2019 (Bangalore)]]></title>
            <link>https://lambda.blinkit.com/community-update-march-2019-bangalore-7c8e198b5a7a?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c8e198b5a7a</guid>
            <category><![CDATA[meetup]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[culture]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[bangalore]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Tue, 19 Mar 2019 05:55:25 GMT</pubDate>
            <atom:updated>2019-03-19T06:35:14.595Z</atom:updated>
            <content:encoded><![CDATA[<h3>Community Update — March 2019 (Bangalore)</h3><p>Meetups are an important platform to learn new things, meet like-minded people and get helpful advice. They promote an open exchange of ideas and expose you to perspectives other than your own. Some even go on to become “<a href="https://en.wikipedia.org/wiki/Homebrew_Computer_Club">the crucible for an entire industry</a>”. With that in mind, we were super excited to host the first-ever meetup at our Bangalore office!</p><p>Last weekend we hosted <a href="https://www.meetup.com/BangPypers">BangPypers — Bangalore Python Users Group</a>, a monthly meetup where developers meet and discuss topics related to the Python programming language.</p><p>On Saturday, our office was filled with 40 young students and professionals from across Bangalore who were there to engage in talks on how they can make their Python code more robust by writing tests!</p><h3>Continuous Quality by Sanket Saurav</h3><p>The meetup was kickstarted with a talk on <a href="https://deepsource.io/blog/6-benefits-continuous-quality/">Continuous Quality</a> (CQ), where Sanket emphasized how CQ forms an integral part of the development process and how it can help you ship more reliable, secure and maintainable software.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Z2X1IzsFJ93ml0Xs.jpg" /><figcaption>Picture credits: <a href="https://twitter.com/DeepSourceHQ/status/1106787803688394752">DeepSource</a></figcaption></figure><p>You can reach out to Sanket on <a href="https://twitter.com/sanketsaurav">Twitter</a>. His GitHub profile is <a href="https://github.com/sanketsaurav">https://github.com/sanketsaurav</a>.</p><h3>Using dependency injection to overcome your testing woes by Joydeep Bhattacharjee</h3><p>In the second talk, Joydeep spoke about how dependency injection can enable you to write generic tests that don’t have to change with your Python dependencies. He also spoke about how you can “mock” complex dependencies, for example: APIs or databases, so that your tests don’t have to rely on their availability. You can find the slides and code for his talk <a href="https://github.com/infinite-Joy/programming-languages/tree/master/python-projects/python-testing">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/899/1*fC23BW8dQESgWO5hCw0bkg.jpeg" /><figcaption>Picture credits: <a href="https://twitter.com/__bangpypers__/status/1106797834576236544">BangPypers</a></figcaption></figure><p>You can reach out to Joydeep on <a href="https://twitter.com/alt227Joydeep">Twitter</a>. His GitHub profile is <a href="https://github.com/infinite-Joy">https://github.com/infinite-Joy</a>.</p><h3>Data Validation by Ankur Gupta</h3><p>The next talk by Ankur focused on how you can use <a href="http://docs.python-cerberus.org/en/stable/">Cerberus</a> to validate key-value data. He shared various tips and tricks on how you can offload the data validation logic from your code to a simple Cerberus configuration. You can find the Jupyter notebook used in his talk <a href="https://github.com/originalankur/data_validation/blob/master/data_validation.ipynb">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/899/1*3-mOvXiKbmycG-1hLtkgNg.jpeg" /><figcaption>Picture credits: <a href="https://twitter.com/__bangpypers__/status/1106811964498628608">BangPypers</a></figcaption></figure><p>You can reach out to Ankur on <a href="https://twitter.com/originalankur">Twitter</a>. 
His GitHub profile is <a href="https://github.com/originalankur">https://github.com/originalankur</a>.</p><h3>Why build an automation library? by Praveen G Shirali</h3><p>The final talk by Praveen focused on how you can create an automation library (an API to automate your product) which can then be used for testing, among other things. He also talked about how you can create utilities on top of this library to promote exploratory testing and save engineering time. You can find the slides for his talk <a href="https://pshirali.github.io/why_build_autolib">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LQGGXLy8-KbKFDFW.jpg" /><figcaption>Picture credits: <a href="https://twitter.com/__bangpypers__/status/1106823687842529280">BangPypers</a></figcaption></figure><p>You can reach out to Praveen <a href="https://pshirali.github.io/">here</a>. His GitHub profile is <a href="https://github.com/pshirali">https://github.com/pshirali</a>.</p><p>Thanks to everyone who came to the meetup! It was a great day filled with fun and learning. We’re looking to host a lot more meetups in the future. If you’re a community organizer and need a venue for your meetups, you can reach out to <a href="https://twitter.com/kasisnu">Kasisnu</a> (Gurgaon) and <a href="https://twitter.com/vortex_ape">Vinayak</a> (Bangalore).</p><p>If you’re interested in solving the kind of problems we write about on Lambda, check out our open positions <a href="https://grofers.recruiterbox.com/">here</a>. You can follow Grofers Engineering on <a href="https://twitter.com/groferseng">Twitter</a> and <a href="https://www.instagram.com/groferseng/">Instagram</a> to always stay updated on what we’re doing. Hope to see you at our next meetup!</p><hr><p><a href="https://lambda.blinkit.com/community-update-march-2019-bangalore-7c8e198b5a7a">Community Update — March 2019 (Bangalore)</a> was originally published in <a href="https://lambda.blinkit.com">Lambda by Blinkit</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Open-Source Tool to Extract Tables from PDFs into CSVs]]></title>
            <link>https://medium.com/hackernoon/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ed3cc7f22e1</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[open-science]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Mon, 26 Nov 2018 01:09:16 GMT</pubDate>
            <atom:updated>2018-12-11T10:06:49.014Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Excalibur is a free and open-source tool that can help you to easily extract tabular data from PDFs. I originally wrote this post for my <a href="https://www.vinayakmehta.com/2018/11/26/open-source-tool-extract-tables-pdfs-excels/">website</a>.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8YsOjqB-FQPkCAlY.png" /><figcaption>Photo by <a href="https://unsplash.com/photos/Oaqk7qqNh_c?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Patrick Tomasso</a> on <a href="https://unsplash.com/search/photos/books?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p><em>Borrowing the first three paragraphs from my </em><a href="https://www.vinayakmehta.com/2018/10/03/camelot-python-library-extract-tables-pdf/"><em>previous blog post</em></a><em> since they perfectly explain why extracting tables from PDFs is hard.</em></p><p>The PDF (<a href="https://en.wikipedia.org/wiki/PDF">Portable Document Format</a>) was born out of <a href="http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf">The Camelot Project</a> to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of <a href="https://en.wikipedia.org/wiki/PostScript">PostScript</a> (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.</p><p>A PDF file defines instructions to place characters (and other components) at precise <strong>x,y</strong> coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.</p><p>The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was not designed for tabular data in the first place!</p><h3>Excalibur: Extract tables from PDFs into CSVs</h3><p>Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by <a href="https://camelot-py.readthedocs.io/">Camelot</a>. You can check out the fantastic documentation at <a href="https://excalibur-py.readthedocs.io/">Read the Docs</a> and follow the development on <a href="https://github.com/camelot-dev/excalibur/">GitHub</a>.</p><p><strong>Note</strong>: Excalibur only works with text-based PDFs and not scanned documents. 
(As Tabula <a href="https://github.com/tabulapdf/tabula#why-tabula">explains</a>, “If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based”.)</p><h3>How to install Excalibur</h3><p>After installing <a href="https://www.ghostscript.com/">Ghostscript</a> (see <a href="https://camelot-py.readthedocs.io/en/master/user/install-deps.html">install instructions</a>), you can simply use pip to install Excalibur:</p><pre>$ pip install excalibur-py</pre><p><strong>Note</strong>: You can also download executables for Windows and Linux from the <a href="https://github.com/camelot-dev/excalibur/releases">releases page</a> and run them directly!</p><h3>How to use Excalibur</h3><p>After installation with pip, you can initialize the metadata database using:</p><pre>$ excalibur initdb</pre><p>And then start the webserver using:</p><pre>$ excalibur webserver</pre><p>That’s it! Now you can go to <a href="http://localhost:5000/">http://localhost:5000</a> and start extracting tabular data from your PDFs.</p><ol><li><strong>Upload</strong> a PDF and enter the page numbers you want to extract tables from.</li><li>Go to each page and select the table by drawing a box around it. (You can choose to skip this step since Excalibur can automatically detect tables on its own. Click on “<strong>Autodetect tables</strong>” to see what Excalibur sees.)</li><li>Choose a flavor from “<strong>Advanced</strong>”: <strong>Lattice</strong>, for tables formed with lines, or <strong>Stream</strong>, for tables formed with whitespaces.</li><li>Click on “<strong>View and download data</strong>” to see the extracted tables.</li><li>Select your favorite format (CSV/Excel/JSON/HTML) and click on “<strong>Download</strong>”!</li></ol><h3>A table detection upgrade</h3><p>Camelot, the Python library that powers Excalibur, implements two methods to extract tables from two different types of table structures: <strong>Lattice</strong>, for tables formed with lines, and <strong>Stream</strong>, for tables formed with whitespaces. Lattice gave nice results from v0.1.0 since it was able to detect different tables on a single PDF page, in contrast to Stream which treated the whole page as a table.</p><p>But last week, Camelot v0.4.0 was released to fix that problem. <a href="https://github.com/socialcopsdev/camelot/pull/206">#206</a> adds an implementation of the table detection algorithm described in Anssi Nurminen’s <a href="https://github.com/socialcopsdev/camelot/pull/206">master’s thesis</a> that is able to detect multiple <em>Stream</em>-type tables on a single PDF page (most of the time)! You can see the difference in the following images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6S3pPeb750_L8hwC.png" /><figcaption>Both <em>Stream</em>-type tables detected in v0.4.0</figcaption></figure><p>as compared to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NLS8HL9MD9mzt8Tn.png" /><figcaption>Whole page being treated as a table in v0.3.0</figcaption></figure><h3>Voted #1 on LabWorm</h3><p>Excalibur was voted #1 on <a href="https://labworm.com/">LabWorm</a> in the second week of November! LabWorm is a platform that guides scientists to the best online resources for their research and helps mediate knowledge exchange by promoting open science.</p><h3>LabWorm on Twitter</h3><p>Votes are in! In 1st place: Excalibur, a web interface to extract tabular data from PDFs. 
See &amp; Vote TOP #research tools at https://t.co/50tYJLLZqc</p><h3>Why another PDF table extraction tool?</h3><p>There are both open-source (<a href="https://tabula.technology/">Tabula</a>, <a href="https://github.com/jsvine/pdfplumber">pdfplumber</a>) and closed-source (<a href="https://smallpdf.com/">Smallpdf</a>, <a href="https://docparser.com/">Docparser</a>) tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.</p><p>Excalibur uses Camelot under the hood, which was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak the “<strong>Advanced</strong>” settings and get the job done!</p><p>For a more detailed account of why Camelot was created, you should also check out “The longer read” section of my <a href="https://www.vinayakmehta.com/2018/10/03/camelot-python-library-extract-tables-pdf/">previous blog post</a>. Use <strong>Ctrl + F</strong>.</p><h3>The road ahead</h3><p>Reiterating from “The longer read” section I talked about above, it was a pain to see open-source tools not give a nice table extraction output every time. And it was frustrating to see paywalls on closed-source tools. I think that paywalls should not block the way to <a href="https://opensource.com/resources/open-science">open science</a>. I believe that Camelot was a successful attempt by us, at SocialCops, to address the problem of extracting tables from text-based PDFs accurately. Excalibur has made it even easier for anyone to access Camelot’s goodness with a nice web interface.</p><p>But there’s still a lot of open data trapped inside images and image-based PDFs. And state-of-the-art <a href="https://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a> software is locked behind paywalls.</p><blockquote>‘At this time, proprietary OCR software drastically outperforms free and open source OCR software and as such could be worth a public agency’s investment depending on the amount and type of OCR jobs the public agency is needing to perform.’ — <a href="https://how-to.usopendata.org/en/latest/The-Basics-of-Open-Data/Working-with-PDFs/">How to Open Data — Working with PDFs</a></blockquote><p>So the next step is to make it easy for anyone to extract tables (or any other type of data for that matter) from images or image-based PDFs by adding OCR support to Camelot and Excalibur. If you would like to contribute your ideas towards this, do add your comments on <a href="https://github.com/socialcopsdev/camelot/issues/101">#101</a>. You can also check out the <a href="https://excalibur-py.readthedocs.io/en/master/dev/contributing.html">Contributor’s Guide</a> for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements.</p><p>If Excalibur has helped you extract tables from PDFs, please consider supporting its development by <a href="https://opencollective.com/excalibur">becoming a backer or a sponsor on OpenCollective</a>!</p><p>Also, stop publishing open data as PDFs and keep looking up! 
:)</p><p><strong>Thanks</strong> to Christine Garcia for providing feedback and suggesting edits.</p><hr><p><a href="https://medium.com/hackernoon/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1">An Open-Source Tool to Extract Tables from PDFs into CSVs</a> was originally published in <a href="https://medium.com/hackernoon">HackerNoon.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing Camelot, a Python Library to Extract Tabular Data from PDFs]]></title>
            <link>https://medium.com/hackernoon/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/605f8e63c2d5</guid>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[data-extraction]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Wed, 03 Oct 2018 06:28:02 GMT</pubDate>
            <atom:updated>2019-07-04T22:12:11.783Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3_s33_p0TJ5zTeSLVRAh2A.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/05ci_hxKWr4?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Carles Rabada</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>The PDF (<a href="https://en.wikipedia.org/wiki/PDF">Portable Document Format</a>) was born out of <a href="http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf">The Camelot Project</a> to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of <a href="https://en.wikipedia.org/wiki/PostScript">PostScript</a> (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.</p><p>A PDF file defines instructions to place characters (and other components) at precise <em>x,y</em> coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.</p><p>The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was not designed for tabular data in the first place!</p><h3>Camelot: PDF table extraction for humans</h3><p>Today, we’re pleased to announce the release of Camelot, a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files! You can check out the documentation at <a href="https://camelot-py.readthedocs.io/">Read the Docs</a> and follow the development on <a href="https://github.com/camelot-dev/camelot">GitHub</a>.</p><h3>How to install Camelot</h3><p>Installation is easy! After <a href="https://camelot-py.readthedocs.io/en/latest/user/install.html#install">installing the dependencies</a>, you can install Camelot using pip (the recommended tool for installing Python packages):</p><pre>$ pip install camelot-py</pre><h3>How to use Camelot</h3><p>Extracting tables from a PDF using Camelot is very simple. Here’s how you do it. 
(<a href="https://camelot-py.readthedocs.io/en/latest/_static/pdf/foo.pdf">Here’s the PDF</a> used in the following example.)</p><pre>&gt;&gt;&gt; import camelot<br>&gt;&gt;&gt; tables = camelot.read_pdf(&#39;foo.pdf&#39;)<br>&gt;&gt;&gt; tables<br>&lt;TableList n=1&gt;<br>&gt;&gt;&gt; tables.export(&#39;foo.csv&#39;, f=&#39;csv&#39;, compress=True) # json, excel, html<br>&gt;&gt;&gt; tables[0]<br>&lt;Table shape=(7, 7)&gt;<br>&gt;&gt;&gt; tables[0].parsing_report<br>{<br>    &#39;accuracy&#39;: 99.02,<br>    &#39;whitespace&#39;: 12.24,<br>    &#39;order&#39;: 1,<br>    &#39;page&#39;: 1<br>}<br>&gt;&gt;&gt; tables[0].to_csv(&#39;foo.csv&#39;) # to_json, to_excel, to_html<br>&gt;&gt;&gt; tables[0].df # get a pandas DataFrame!</pre><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/40652afca4b940a4a249a1c9696c1f01/href">https://medium.com/media/40652afca4b940a4a249a1c9696c1f01/href</a></iframe><p>You can also check out the <a href="https://camelot-py.readthedocs.io/en/latest/user/cli.html">command-line interface</a>.</p><h3>Why use Camelot?</h3><ul><li>Camelot gives you complete control over table extraction by letting you tweak its settings.</li><li>Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.</li><li>Each table is a <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html">pandas DataFrame</a>, which seamlessly integrates into <a href="https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873">ETL and data analysis workflows</a>.</li><li>You can export tables to multiple formats, including CSV, JSON, Excel and HTML.</li></ul><h3>Okay, but why another PDF table extraction library?</h3><h3>TL;DR: Total control for better table extraction</h3><p>Many people use open (<a href="http://tabula.technology/">Tabula</a>, <a href="https://github.com/ashima/pdf-table-extract">pdf-table-extract</a>) and closed-source (<a href="https://smallpdf.com/">smallpdf</a>, <a href="https://pdftables.com/">pdftables</a>) tools to extract tables from PDFs. But they either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.</p><p>We created Camelot to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!</p><p>You can check out a <a href="https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools">comparison of Camelot’s output with other open-source PDF table extraction libraries</a>.</p><h3>The longer read</h3><p>We’ve often needed to extract data trapped inside PDFs.</p><p>The first tool that we tried was <a href="http://tabula.technology/">Tabula</a>, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.</p><p>We also tried closed-source tools like <a href="https://smallpdf.com/">smallpdf</a> and <a href="https://pdftables.com/">pdftables</a>, which worked slightly better than Tabula. But then again, they also didn’t allow tweaking and cost money. 
(We wrote a blog post about how we went about extracting tables from PDFs back in 2015, titled <a href="https://blog.socialcops.com/technology/engineering/pdf-evil-extracting-tabular-data-pdfs/">“PDF is evil”</a>.)</p><p>When these full-blown PDF table extraction tools didn’t work, we tried <a href="https://en.wikipedia.org/wiki/Pdftotext">pdftotext</a> (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (<a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>) to convert the text into tables. This wasn’t scalable, since we had to change the regexes for each new table layout.</p><p>We clearly needed a tweakable PDF table extraction tool, so we started developing one in December 2015. We started with the idea of giving the tool back to the community, which had given us so many open-source tools to work with.</p><p>We knew that Tabula classifies PDF tables into two classes. It has two methods to extract these different classes: Lattice (to extract tables with clearly defined lines between cells) and Stream (to extract tables with spaces between cells). We named Camelot’s table extraction flavors, Lattice and Stream, after Tabula’s methods.</p><p>For Lattice, <a href="https://www.propublica.org/nerds/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult">Tabula uses Hough Transform</a>, an image processing technique to detect lines. Since we wanted to use Python, <a href="https://en.wikipedia.org/wiki/OpenCV">OpenCV</a> was the obvious choice to do image processing. However, OpenCV’s <a href="https://docs.opencv.org/2.4/doc/tutorials/imgproc/imgtrans/hough_lines/hough_lines.html">Hough Line Transform</a> returned only line equations. After more exploration, we settled on <a href="https://docs.opencv.org/3.4/d9/d61/tutorial_py_morphological_ops.html">morphological transformations</a>, which gave the exact line segments. From here, representing the table trapped inside a PDF was straightforward.</p><p>To get more information on how Lattice and Stream work in Camelot, check out the <a href="https://camelot-py.readthedocs.io/en/latest/user/how-it-works.html">“How It Works”</a> section of the documentation.</p><h3>How we use Camelot</h3><p>We’ve battle-tested Camelot by using it in a variety of projects, both for one-off and automated table extraction.</p><p>Earlier this year, we developed our <a href="https://socialcops.com/solutions/sdg-tracking/">UN SDG Solution</a> to help organizations track and measure their contribution to <a href="https://en.wikipedia.org/wiki/Sustainable_Development_Goals">Agenda 2030</a>. For India, we identified open data sources (primarily PDF reports) for each of the 17 Sustainable Development Goals. For example, one of our sources for Goal 3 (“Good Health and Well-Being for People”) is the <a href="http://rchiips.org/NFHS/NFHS-4Report.shtml">National Family Health Survey (NFHS) report</a> released by <a href="http://iipsindia.org/">IIPS</a>. 
To get data from these PDF sources, we created an internal web interface built on top of Camelot, where our data analysts could upload PDF reports and extract tables in their preferred format.</p><p><em>Note: </em><a href="https://blog.socialcops.com/inside-sc/press/un-sdg-action-awards-finalist/"><em>We became finalists for the UN SDG Action Awards in February 2018</em></a><em>.</em></p><p>We also set up an <a href="https://blog.socialcops.com/technology/data-science/apache-airflow-disease-outbreaks-india/">ETL workflow using Apache Airflow to track disease outbreaks in India</a>. The workflow scrapes the <a href="http://www.idsp.nic.in/index4.php?lang=1&amp;level=0&amp;linkid=406&amp;lid=3689">Integrated Disease Surveillance Programme (IDSP)</a> website for weekly PDFs of disease outbreak data, and then it extracts tables from the PDFs using Camelot, sends alerts to our team, and loads the data into a data warehouse.</p><h3>To infinity and beyond!</h3><p>Camelot has some limitations. (We’re developing solutions!) Here are a couple of them:</p><ul><li>When using Stream, tables aren’t autodetected. Stream treats the whole page as a single table, which gives bad output when there are multiple tables on the page.</li><li>Camelot only works with text-based PDFs and not scanned documents. (As Tabula <a href="https://github.com/tabulapdf/tabula#why-tabula">explains</a>, “If you can click-and-drag to select text in your table in a PDF viewer… then your PDF is text-based”.)</li></ul><p>You can check out the <a href="https://github.com/camelot-dev/camelot">GitHub repository</a> for more information.</p><p>You can help too — every contribution counts! Check out the <a href="https://camelot-py.readthedocs.io/en/latest/dev/contributing.html">Contributor’s Guide</a> for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements. You can also head to the <a href="https://github.com/camelot-dev/camelot/issues">issue tracker</a> and look for issues labeled “help wanted” and “good first issue”.</p><p>We urge organizations to release open data in a “data-friendly” format like <a href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a>. But while tables are trapped inside PDF files, there’s Camelot :)</p><hr><p><a href="https://medium.com/hackernoon/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5">Announcing Camelot, a Python Library to Extract Tabular Data from PDFs</a> was originally published in <a href="https://medium.com/hackernoon">HackerNoon.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy]]></title>
            <link>https://medium.com/hackernoon/airflow-meta-data-engineering-and-a-data-platform-for-the-worlds-largest-democracy-3b49a3efd5e8?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/3b49a3efd5e8</guid>
            <category><![CDATA[india]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[public-sector]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[airflow]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Sat, 25 Aug 2018 08:53:06 GMT</pubDate>
            <atom:updated>2018-10-03T06:48:20.771Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>I originally wrote this post for the <a href="https://blog.socialcops.com/technology/engineering/airflow-meta-data-engineering-disha/">SocialCops engineering blog</a>.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0uIC1_An9TbTrK_o6VBwGg.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/a9KHeyRyFJU?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">arihant daga</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>In our <a href="https://blog.socialcops.com/technology/data-science/apache-airflow-disease-outbreaks-india/">last post on Apache Airflow</a>, we mentioned how it has taken the data engineering ecosystem by storm. We also talked about how we’ve been using it to move data across our internal systems and explained the steps we took to create an internal workflow. The ETL workflow (e)xtracted PDFs from a website, (t)ransformed them into CSVs and (l)oaded the CSVs into a store. We also touched briefly on the breadth of ETL use cases you can solve for, using the Airflow platform.</p><p>In this post, we will talk about how one of Airflow’s principles, being ‘Dynamic’, offers <a href="https://confluence.atlassian.com/bamboo/what-is-configuration-as-code-894743909.html">configuration-as-code</a> as a powerful construct to automate workflow generation. We’ll also talk about how that helped us use Airflow to power DISHA, a national data platform where Indian <a href="https://en.wikipedia.org/wiki/Member_of_Parliament,_Lok_Sabha">MPs</a> and <a href="https://en.wikipedia.org/wiki/Member_of_the_Legislative_Assembly_(India)">MLAs</a> monitor the progress of 42 national-level schemes. Finally, we will briefly discuss some of our reflections from the project on today’s public data technology.</p><h3>Why Airflow?</h3><p>To recap from the <a href="https://blog.socialcops.com/technology/data-science/apache-airflow-disease-outbreaks-india/">previous post</a>, Airflow is a workflow management platform created by Maxime Beauchemin at Airbnb. We have been using Airflow to set up batching data workflows in production for more than a year, during which we have found the following points, some of which are also its core principles, to be very useful.</p><ul><li><strong>Dynamic</strong>: A workflow can be defined as a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed Acyclic Graph</a> (DAG) in a Python file (the DAG file), making dynamic generation of complex workflows possible.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/573/1*B5KZ7SlkP7bYyPWXknlphw.png" /><figcaption>An <a href="https://airflow.apache.org/concepts.html">Airflow DAG</a></figcaption></figure><ul><li><strong>Extensible</strong>: There are a lot of operators right out of the box! An operator is a building block for your workflow and each one performs a certain function. For example, the PythonOperator lets you define the logic that runs inside each of the tasks in your workflow, using Python!</li><li><strong>Scalable</strong>: The tasks in your workflow can be executed in parallel by multiple Celery workers, using the CeleryExecutor.</li><li><strong>Open Source</strong>: The project is under incubation at the Apache Software Foundation and is being actively maintained. 
It also has an active <a href="https://gitter.im/apache/incubator-airflow">Gitter room</a>.</li></ul><p>Furthermore, Airflow comes with a web interface that gives you all the context you need about your workflow’s execution, from each task’s state (running, success, failed, etc.) to the logs that the task generated!</p><h3>The problem with static code</h3><p>Here at SocialCops, we’ve observed a recurring use case of extracting data from various systems using web services, as a component of our ETL workflows. One of the ways to go forward with this task is to write Python code, which can be used with the PythonOperator to integrate the data into a workflow. Let’s look at a very rudimentary DAG file that illustrates this.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1066f1495e914c519dc2472b88574189/href">https://medium.com/media/1066f1495e914c519dc2472b88574189/href</a></iframe><p>As you can observe, the PythonOperator can be instantiated by specifying the name of the function containing your Python code using the python_callable keyword argument. Multiple instantiated operators can then be linked using Airflow API’s set_downstream and set_upstream methods.</p><p>In the DAG file above, the extract function makes a GET request to <a href="http://httpbin.org/">httpbin.org</a>, with a query parameter. Web services can vary in their request limit (if they support multiple requests at the same time), query parameters, response format and so on. Since writing custom Python code for each web service would be a nightmare for anyone maintaining the code, we decided to build a Python library (we call it Magneton, since it is a magnet for data), which takes in the JSON configuration describing a particular web service as input and fetches the data using a set of pre-defined queries. But that solved only half of our problem.</p><p>In our <a href="https://blog.socialcops.com/technology/data-science/apache-airflow-disease-outbreaks-india/">last post</a> and in the example DAG file above, we could link operators together by writing static calls to the set_downstream and set_upstream methods since the workflows were pretty basic. But imagine a DAG file’s readability with 1,000 operators defined in it. You would have to be a savant to infer the relationships between operators. Moreover, not everyone in your team (including people who don’t work with Python as their primary language) would have the know-how to write a DAG file, and writing them manually would be repetitive and inefficient.</p><h3>Meta data engineering</h3><p>In his talk “<a href="https://www.youtube.com/watch?v=23_1WlxGGM4">Advanced Data Engineering Patterns with Apache Airflow</a>”, Maxime Beauchemin, the author of Airflow, explains how data engineers should find common patterns (ways to build workflows dynamically) in their work and build frameworks and services around them. He gives some examples of such patterns, one of which is AutoDAG. It is an Airflow plugin that Airbnb developed internally to automate DAG creation, allowing users who just need to run a scheduled SQL query (and don’t want to author a workflow) to create the query on the web interface itself.</p><p>Finding patterns involves identifying the building blocks of a workflow and chaining them based on a static configuration. Look at the DAG file that we showed in the section above and try to identify the building blocks. 
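For reference, here is a minimal sketch of what that rudimentary DAG file could look like, in case the gist embeds above don’t render in your reader. The DAG name, task names and query parameter here are illustrative, not the exact code from the original gist.</p><pre>from datetime import datetime<br><br>import requests<br>from airflow import DAG<br>from airflow.operators.python_operator import PythonOperator<br><br># Illustrative (E)xtract task: a GET request to httpbin.org with a query parameter<br>def extract(**context):<br>    response = requests.get(&#39;http://httpbin.org/get&#39;, params={&#39;foo&#39;: &#39;bar&#39;})<br>    print(response.json())<br><br># Illustrative downstream task<br>def notify(**context):<br>    print(&#39;extraction finished&#39;)<br><br>dag = DAG(&#39;httpbin_v1&#39;, start_date=datetime(2018, 8, 1), schedule_interval=&#39;@daily&#39;)<br><br>extract_task = PythonOperator(task_id=&#39;extract&#39;, python_callable=extract,<br>                              provide_context=True, dag=dag)<br>notify_task = PythonOperator(task_id=&#39;notify&#39;, python_callable=notify,<br>                             provide_context=True, dag=dag)<br><br># Link the operators: notify runs after extract<br>extract_task.set_downstream(notify_task)</pre><p>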
It has just three components, which can be modeled into a <a href="https://en.wikipedia.org/wiki/YAML">YAML</a> configuration.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b6a582d72a50ff7640bd2795eba5c31a/href">https://medium.com/media/b6a582d72a50ff7640bd2795eba5c31a/href</a></iframe><p>This is a very basic example. For a more detailed one, you should check out how <a href="http://docs.pachyderm.io/en/latest/reference/pipeline_spec.html">Pachyderm</a> and <a href="http://docs.digdag.io/workflow_definition.html">Digdag</a> have modeled their workflow specifications, which they use to dynamically generate workflows.</p><p>This makes it easy for us to now write a <em>single</em> DAG file that can take in a bunch of these YAML configurations and build DAGs dynamically, by linking operators which have the same identifiers (in this example, we have used a number, 1, for the sake of simplicity). Moreover, anyone in your team who wants to create a workflow can just write a YAML, which makes it easy for a human to define a configuration that is machine-readable. Once you’ve figured out a way to create DAGs based on configurations, you can build an interface to let users build a DAG without writing configurations, making it easy for anyone looking to create a workflow!</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/23215c76a914e4fdfdf4511e53ae0830/href">https://medium.com/media/23215c76a914e4fdfdf4511e53ae0830/href</a></iframe><p>For DISHA, we needed to (E)xtract scheme data from source systems via web services and then follow that with the T and L. At an atomic level, our workflows could be broken down into:</p><ul><li>a PythonOperator to (E)xtract data from the source system, by using Magneton (the Python library we had developed).</li><li>a BashOperator to run R or Python-based (T)ransformations on the extracted data, like cleaning, reshaping, and standardizing geographies across data sets.</li><li>a PythonOperator to (L)oad the transformed data into our data warehouse, on which <a href="https://socialcops.com/visualize/">Visualize</a> can run analytical queries.</li><li>Additionally, we added Slack and email alerts using the PythonOperator.</li></ul><p>Identifying this pattern let us automate DAG creation. Using a web interface, anyone could now add the configuration needed for the three basic tasks outlined above. This helped us distribute the work of setting up workflows within our small team, most of whom were comfortable only in R. Soon, everyone was writing R scripts and building intricate workflows, like the one below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I2321H2U0PLzWECh3UdYgA.jpeg" /></figure><p>The Airflow web interface lets the project stakeholders manage complex workflows (like the one shown above) with ease, since they can check the workflow’s state and pinpoint the exact step where something failed, look at the logs for the failed task, resolve the issue and then resume the workflow by retrying the failed task. Making tasks <a href="https://en.wikipedia.org/wiki/Idempotence">idempotent</a> is a good practice to deal with retries. (Note: retries can be automated within Airflow too.)</p><h3>An overview of DISHA</h3><blockquote><strong>‘DISHA is a crucial step towards good governance through which we will be able to monitor everything centrally. 
It will enable us to effectively monitor every village of the country.’ — </strong>Narendra Modi, Prime Minister of India</blockquote><p>The District Development Coordination and Monitoring Committee (DISHA) was <a href="http://pib.nic.in/newsite/mbErel.aspx?relid=147922">formed in 2016</a>. The goal was to coordinate between the Central, State and Local Panchayat Governments for successful and timely implementation of key schemes (such as the National Rural Livelihoods Mission, Pradhan Mantri Awaas Yojana and Swachh Bharat Mission). To monitor the schemes and make data-driven implementation decisions, stakeholders needed to get meaningful insights about the schemes. This required integrating the different systems containing the scheme data.</p><p>Last year, we partnered with the Ministry of Rural Development (MoRD) and National Informatics Centre (NIC) to create the DISHA dashboard, which was <a href="https://blog.socialcops.com/inside-sc/announcements/pm-modi-launches-disha-data-platform/">launched by the Prime Minister in October</a>. The DISHA Dashboard helps Members of Parliament (MPs), Members of Legislative Assembly (MLAs) and District Officials track the performance of flagship schemes of different central ministries in their respective districts and constituencies.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F6hvPEhRXm_s%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6hvPEhRXm_s&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F6hvPEhRXm_s%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/16c02890f2c2399c48eba5c0d8f6f6fc/href">https://medium.com/media/16c02890f2c2399c48eba5c0d8f6f6fc/href</a></iframe><p>Back in October 2017, the dashboard had data for 6 schemes, and it was updated in August 2018 to show data for a total of 22 schemes. In its final phase, the dashboard will unify data from 42 flagship schemes to help stakeholders find the answer to life, universe and everything. For the first time, data from 20 ministries will break silos to come together in one place, bringing accountability to a government budget of over Rs. 2 lakh crores!</p><p>The DISHA meetings are held <a href="http://pib.nic.in/newsite/mbErel.aspx?relid=147922">once every quarter</a>, where the committee members meet to ensure that all schemes are being implemented in accordance with the guidelines, look into irregularities with respect to implementation and closely review the flow of allocated funds. Workflows, like the one showed above, have automated the flow of data from scheme databases to the DISHA Dashboard, updating the dashboard regularly with the most recent data for a scheme. 
This is useful for the committee members since they can plan the meeting agenda by checking each scheme’s performance and identifying priorities and gap areas.</p><p><em>Watch the Prime Minister speak about how he uses the DISHA dashboard to monitor the progress of Pradhan Mantri Awas Yojna </em><a href="https://blog.socialcops.com/inside-sc/announcements/pm-narendra-modi-monitoring-pmay-disha-dashboard/"><em>here</em></a><em>.</em></p><h3>Going towards better public technology</h3><p>Data extraction is a piece of a larger puzzle called <a href="https://en.wikipedia.org/wiki/Data_integration">data integration</a> (getting the data you want to a single place, from different systems, the way you want it), which people have been working on since the <a href="https://dl.acm.org/citation.cfm?id=1500483">early 1980s</a>. Integrating different data systems can be quite complex due to the systems being heterogeneous; this means they can differ in the way they are implemented, how data is stored within them and how they interact with other systems, making them silos of information.</p><p>To successfully extract data from another system, people on both ends of the transaction first need to agree upon a schema for how the data will be shared. As we found, this can be the most time-consuming part of a project, given how heterogeneous different data systems can be. Even though we have successfully integrated 22 data sources together so far, the time we spent on getting the data in the right format, with the variables we needed for each scheme, would’ve been saved if there was a standard for storing and sharing data across all ministries.</p><p>In his article “<a href="https://medium.com/a-r-g-o/bringing-wall-street-technology-to-bear-on-public-service-delivery-26f8d794c1d5">Bringing Wall Street and the US Army to bear on public service delivery</a>”, Varun Adibhatla of <a href="http://www.argolabs.org/">ARGO Labs</a> talks about ‘Information Exchange protocols’. He calls it jargon for being able to share standardized data, or speak a common language, at some considerable scale. He further explains <a href="https://www.astronomer.io/blog/using-apache-airflow-to-create-data-infrastructure/">here</a> that, in the 1990s, Wall Street financial institutions got together and agreed to speak in a common transactional language. This led to the creation of the FIX protocol, which let them share data quickly. He mentions that data standards like the FIX protocol are not unique to Wall Street, but exist in almost every sphere of trade and commerce.</p><blockquote><strong>‘A small team of purpose-driven public technologists, leveraging advances in low-cost device, data and decision-making and the right kind of support is all it takes to build and maintain public, digital infrastructures.’ — </strong>Varun Adibhatla in <a href="https://medium.com/a-r-g-o/bringing-wall-street-technology-to-bear-on-public-service-delivery-26f8d794c1d5">“Bringing Wall Street and the US Army to bear on public service delivery.”</a></blockquote><p>As we move towards a Digital India, we need a fundamental shift away from the Excel-for-everything mindset and from how today’s public technology is set up. We need a standardized data infrastructure across public services that will help ministries and departments share data with each other quickly, and with the public. 
A <a href="https://civichall.org/civicist/manifesto-for-public-technology/">new generation of public servants</a> with Silicon Valley–grade technical chops need to be trained and hired. There’s already a Chief Economic Advisor to the government and a Chief Financial Officer for the RBI. It’s high time a Chief Data Officer is appointed for India!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3b49a3efd5e8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/hackernoon/airflow-meta-data-engineering-and-a-data-platform-for-the-worlds-largest-democracy-3b49a3efd5e8">Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy</a> was originally published in <a href="https://medium.com/hackernoon">HackerNoon.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Create a Workflow in Apache Airflow to Track Disease Outbreaks in India]]></title>
            <link>https://medium.com/hackernoon/how-to-create-a-workflow-in-apache-airflow-to-track-disease-outbreaks-in-india-fd145575efa4?source=rss-5c1aac1323f4------2</link>
            <guid isPermaLink="false">https://medium.com/p/fd145575efa4</guid>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[disease]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[airflow]]></category>
            <dc:creator><![CDATA[Vinayak Mehta]]></dc:creator>
            <pubDate>Wed, 27 Jun 2018 09:30:42 GMT</pubDate>
            <atom:updated>2018-11-26T22:50:11.232Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>I originally wrote this post for the <a href="https://blog.socialcops.com/technology/data-science/apache-airflow-disease-outbreaks-india/">SocialCops engineering blog</a>.</blockquote><p>What is the first thing that comes to your mind upon hearing the word ‘Airflow’? Data engineering, right? For good reason, I suppose. You are likely to find Airflow mentioned in every other blog post that talks about data engineering.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SJYEdGnpKTF9R6C2LB-zjw.jpeg" /></figure><p><a href="https://airflow.apache.org/">Apache Airflow</a> is a workflow management platform. To oversimplify, you can think of it as cron, but on steroids! It was started in October 2014 by Maxime Beauchemin at Airbnb. From the very first commit, Airflow was open source. Less than a year later, it was moved into the Airbnb GitHub. Since then, it has become a vital part of the data engineering ecosystem.</p><p>We have been using Airflow to move data across our internal systems for more than a year, over the course of which we have created a lot of ETL (Extract-Transform-Load) pipelines. In this post, we’ll talk about one of these pipelines in detail and show you the set-up steps.</p><p><em>Note: We will not be going through how to set up Airflow. You can check out </em><a href="http://site.clairvoyantsoft.com/installing-and-configuring-apache-airflow/"><em>a great blog from Clairvoyant</em></a><em> for that.</em></p><h3>Why use Airflow?</h3><ul><li><strong>Dependency Management: </strong>A workflow can be defined as a Directed Acyclic Graph (DAG). Airflow will make sure that the defined tasks are executed one after the other, managing the dependencies between tasks.</li><li><strong>Extensible</strong>: Airflow offers a variety of Operators, which are the building blocks of a workflow. One example is the PythonOperator, which you can use to write custom Python code that will run as a part of your workflow.</li><li><strong>Scalable</strong>: Celery, which is a distributed task queue, can be used as an Executor to scale your workflow’s execution.</li><li><strong>Open Source</strong>: It is under incubation at the Apache Software Foundation, which means it is being actively maintained.</li></ul><h3>IDSP: The disease data source</h3><p>Even though open data portals are cropping up across multiple domains, working with the datasets they provide is difficult. In our bid to identify and help prevent disease outbreaks at <a href="https://socialcops.com">SocialCops</a>, we came across one such difficult data source.</p><p>The <a href="https://mohfw.gov.in/">Ministry of Health and Family Welfare</a> (MoHFW) runs the <a href="http://www.idsp.nic.in/">Integrated Disease Surveillance Programme</a> (IDSP) scheme, which identifies disease outbreaks at the sub-district &amp; village level across India. Under this scheme, the MoHFW releases weekly outbreak data as a PDF document.</p><p>PDFs are notorious for being hard to scrape and incorporate in data science workflows, but just wait till you see the IDSP PDFs. It may look like the data in them is in a nice table format, but they’ve changed the table formatting over the years and may continue to do so. We’ve also encountered glitches in the document like different tables being joined together, tables flowing out of the page and even tables within tables!</p><h3>Setting up the ETL pipeline</h3><p>No brownie points for figuring out the steps involved in our pipeline. 
<h3>IDSP: The disease data source</h3><p>Even though open data portals are cropping up across multiple domains, working with the datasets they provide is difficult. In our bid to identify and help prevent disease outbreaks at <a href="https://socialcops.com">SocialCops</a>, we came across one such difficult data source.</p><p>The <a href="https://mohfw.gov.in/">Ministry of Health and Family Welfare</a> (MoHFW) runs the <a href="http://www.idsp.nic.in/">Integrated Disease Surveillance Programme</a> (IDSP) scheme, which identifies disease outbreaks at the sub-district &amp; village level across India. Under this scheme, the MoHFW releases weekly outbreak data as a PDF document.</p><p>PDFs are notorious for being hard to scrape and incorporate in data science workflows, but just wait till you see the IDSP PDFs. It may look like the data in them is in a nice table format, but the table formatting has changed over the years and may continue to do so. We’ve also encountered glitches like different tables being joined together, tables flowing out of the page and even tables within tables!</p><h3>Setting up the ETL pipeline</h3><p>No brownie points for figuring out the steps involved in our pipeline. We (<strong>E</strong>)xtract the PDFs from the IDSP website, (<strong>T</strong>)ransform the PDFs into CSVs and (<strong>L</strong>)oad this CSV data into a store.</p><h3>Conventions</h3><p>Let us set up some conventions now, because without order, anarchy would ensue! Each Directed Acyclic Graph should have a unique identifier. We can use an ID that describes what our DAG is doing, plus a version number. Let us name our DAG idsp_v1.</p><p><em>Note: We borrowed this naming convention from the </em><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls"><em>Airflow “Common Pitfalls” documentation</em></a><em>. It comes in handy when you have to change the start date and schedule interval of a DAG while preserving the scheduling history of the old version. Make sure you check out this link for other common pitfalls.</em></p><p>We will also define a base directory where data from all the DagRuns will be kept. What is a DagRun, you ask? It is just an instance of your DAG in time. We will create a new directory inside the base directory for each DagRun.</p><p>Here’s a <a href="https://medium.com/media/e7ae54a0f6c215cce6ef324ad8e7071d/href">requirements.txt</a> file which you can use to install the dependencies.</p><h3>How to DAG</h3><p>In Airflow, DAGs are defined as Python files. They have to be placed inside the dag_folder, which you can define in the Airflow configuration file. Based on the ETL steps we defined above, let’s create our DAG.</p><p>We will define three tasks using the Airflow PythonOperator. You need to pass the Python function containing the task logic to each Operator using the python_callable keyword argument. Define these as dummy functions in a <strong>utils.py</strong> file for now. We’ll look at each one later.</p><p>We will also link the tasks together using the set_downstream method. This defines the order in which our tasks get executed. Observe how we haven’t defined the logic that will run inside the tasks, but our DAG is ready to run!</p><p>Have a look at <a href="https://medium.com/media/9924141031afad674910270942d5d981/href">the DAG file</a>. We have set the schedule_interval to 0 0 * * 2. Yes, you guessed it correctly — it’s a cron string. This means that our DAG will run every Tuesday at 12 AM. Airflow scheduling can be a bit confusing, so we suggest you check out the <a href="https://airflow.apache.org/scheduler.html#scheduling-triggers">Airflow docs</a> to understand how it works.</p><p>We have also set provide_context to True, since we want Airflow to pass the DagRun’s context (think metadata, like the dag_id, execution_date, etc.) into our task functions as keyword arguments.</p><p><em>Note: We’ll use execution_date (a Python datetime object) from the context Airflow passes into our function to create a new directory, like we discussed above, to store the DagRun’s data.</em></p>
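<p>The linked gist has the full file. As a rough sketch of the same structure (the exact task ids and function names here are illustrative; the real ones live in the linked code), the DAG might look like this:</p><pre>from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import utils  # our dummy task functions live here for now

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 1, 1),
}

# 0 0 * * 2: every Tuesday at 12 AM
dag = DAG("idsp_v1", default_args=default_args, schedule_interval="0 0 * * 2")

extract = PythonOperator(
    task_id="scrape_website",
    python_callable=utils.scrape_website,
    provide_context=True,  # pass the DagRun's context into the function
    dag=dag,
)
transform = PythonOperator(
    task_id="scrape_pdf",
    python_callable=utils.scrape_pdf,
    provide_context=True,
    dag=dag,
)
load = PythonOperator(
    task_id="curate_data",
    python_callable=utils.curate_data,
    provide_context=True,
    dag=dag,
)

extract.set_downstream(transform)
transform.set_downstream(load)</pre><p>With provide_context set to True, Airflow calls each of these functions with keyword arguments like execution_date, which we’ll use to build the run directory.</p>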
<p>At this point, you can go ahead and create a DagRun by executing airflow trigger_dag idsp_v1 on the command line. Make sure you go to the Airflow UI and unpause the DAG before triggering it. The DagRun should be a success, since our tasks are just dummy functions.</p><p>Now that we have our DAG file ready, let’s look at the logic that will run inside our tasks.</p><p><em>Note: Everything you print to standard output</em> <em>inside the function passed to the PythonOperator will be viewable on the Airflow UI. Just click on View Log in the respective operator’s DAG node.</em></p><h3>Scraping the IDSP website</h3><p>A new PDF document is released almost every week (with some lag) on the IDSP website. We can’t keep scraping all the PDFs every time a new one is released. Instead, we will have to save the week number of the PDF we last scraped somewhere.</p><p>We can store this state in a CSV file in our base directory at the end of each DagRun and refer to it at the start of the next. Take a look at <a href="https://medium.com/media/a35d302c549b857c5ec7aa86b922f915/href">the scraping code</a>. There’s nothing fancy here, just your run-of-the-mill web scraping, using <em>requests</em> and <em>lxml</em>.</p>
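<p>To give you an idea of the state-keeping described above, here’s a condensed sketch. The IDSP page structure is simplified: the URL, the XPath and the extract_week_number helper below are placeholders, not the real selectors from the gist:</p><pre>import csv
import os

import requests
from lxml import html

BASE_DIR = "/data/idsp"  # base directory for all DagRuns (assumed path)
STATE_FILE = os.path.join(BASE_DIR, "state.csv")
IDSP_URL = "http://www.idsp.nic.in/"  # placeholder: the page listing weekly outbreak PDFs


def last_scraped_week():
    # Read the week number we stored at the end of the previous DagRun
    if not os.path.exists(STATE_FILE):
        return 0
    with open(STATE_FILE) as f:
        return int(next(csv.reader(f))[0])


def scrape_website(**context):
    # provide_context=True gives us execution_date to name the run directory
    run_dir = os.path.join(BASE_DIR, context["execution_date"].strftime("%Y-%m-%d"))
    os.makedirs(run_dir, exist_ok=True)

    tree = html.fromstring(requests.get(IDSP_URL).content)
    # Placeholder XPath: grab links to the weekly PDFs on the page
    for href in tree.xpath("//a[contains(@href, '.pdf')]/@href"):
        week = extract_week_number(href)  # hypothetical helper: parse the week no. from the URL
        if week > last_scraped_week():
            with open(os.path.join(run_dir, "week_%s.pdf" % week), "wb") as f:
                f.write(requests.get(href).content)
    # ...then write the newest week number back to STATE_FILE for the next run</pre>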
<p><em>Note: In production, we don’t run the scraping code inside Airflow. It runs on a separate service, which can connect to REST/SOAP APIs to extract data, in addition to running these scrapers. This gives us a central place to schedule and track how data is pulled into our platform. The task logic is replaced with a call to the data export service.</em></p><h3>Scraping the PDFs</h3><p>Yay! Now that we have new PDFs, we can go about scraping them. We will use <em>pdfminer</em> to accomplish this.</p><p>But first, let me just point out that PDF is the worst format for tabular data. A PDF contains instructions for PDF viewers to place text in the desired font at specific X,Y coordinates on a 2D plane. That doesn’t matter if we just need to get text out of a PDF, but if we need tabular data with the table structure intact, things get difficult. We use a simple heuristic here to get this data out.</p><p>First, we extract all text objects using <em>pdfminer</em>. A text object contains a string and a float coordinate tuple, which describes the string’s position. We then sort all these objects in increasing Y (and, within a row, increasing X) order, which leaves them in row-major format, starting from the last row. We can then group them into columns based on their x-axis projections, discarding any objects that span multiple columns. We can drop the first and last columns, since we don’t have any use for them in this post. Here’s <a href="https://medium.com/media/0db1d266e36da7c89089a166b4868999/href">the PDF scraping code</a>.</p>
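<p>Here’s a rough sketch of that heuristic (not the exact gist code; the row tolerance is an assumption you’d tune per document):</p><pre>from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_objects(path):
    # Return (text, x0, y0) tuples for every text line on every page
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    objects = []
    with open(path, "rb") as f:
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
            for box in device.get_result():
                if isinstance(box, LTTextBoxHorizontal):
                    for line in box:
                        objects.append((line.get_text().strip(), line.x0, line.y0))
    return objects


def group_into_rows(objects, tolerance=5):
    # Sort by increasing y, then x, and bucket lines whose y0 is within
    # `tolerance` points of the previous line into the same row
    objects = sorted(objects, key=lambda o: (o[2], o[1]))
    rows, current, last_y = [], [], None
    for text, x, y in objects:
        if last_y is not None and abs(y - last_y) > tolerance:
            rows.append(current)
            current = []
        current.append((text, x))
        last_y = y
    if current:
        rows.append(current)
    return rows</pre><p>Column assignment then boils down to checking which column’s x-range each object’s projection falls into, and discarding objects that overlap more than one.</p>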
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eGYiG5CwNqCZ8Lcc.png" /></figure><p>Voila, we have our table! It is still not usable though, and needs some minor cleaning. We are leaving this as an exercise for you. It is easy to define some rules in code to convert the above CSV into something cleaner, like the following.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SDRHOyMm8uBXvl8B.png" /></figure><p>You can add this cleaning code as another PythonOperator or within the same scrape_pdf operator. If you are not comfortable with Python and want to use R instead, you can use the BashOperator to call your R script. Extensibility FTW!</p><p><em>Note: It is very difficult to get 100% table scraping accuracy on all types of PDFs with a single tool. We can only throw various heuristics at the problem and hope for the best result. A cleaning step is usually required.</em></p><p><em>When we were preparing the IDSP dataset using all the previous years’ PDFs, we couldn’t find any tool or library that could solve this problem. We tried many open source tools like </em><a href="https://tabula.technology/"><em>Tabula</em></a><em>, as well as closed source tools like </em><a href="https://pdftables.com/"><em>PDFTables</em></a><em>, without any success.</em></p><p><em>This led us to develop our own library, which uses image recognition with a bunch of heuristics to try and solve the PDF table scraping problem. It gave us an acceptable scraping accuracy on a lot of PDF types, including the IDSP ones. Once we plugged this into our data cleaning product, </em><a href="https://socialcops.com/transform/"><em>Transform</em></a><em>, we could finally convert PDF data into a fully clean CSV.</em></p><p><strong><em>Update (5th October 2018):</em></strong><em> We released </em><a href="https://blog.socialcops.com/technology/engineering/camelot-python-library-pdf-data/"><em>Camelot, a Python library that helps anyone extract tabular data from PDFs</em></a><em>. You can find a version of the code from this blog post that uses Camelot in this </em><a href="https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873"><em>Jupyter notebook</em></a><em>.</em></p><h3>Curating the scraped data</h3><p>Now that we have a clean CSV, we can add it to our master IDSP dataset. <a href="https://medium.com/media/cc0df6edca8ed3ac43822aec1c74715b/href">The operator</a> contains just a for loop, which appends the page-wise CSVs to our master CSV dataset. We could’ve used <a href="https://pandas.pydata.org/">pandas</a> here, but we didn’t want to add another requirement just for this append.</p><p>Internally, our ETL pipeline doesn’t stop here though. We pass the text in the ‘Comments’ column that we dropped earlier through our entity recognition system, which gives us a list of geographies where the outbreaks happened. This is then used to send alerts to our team and clients.</p><h3>Where can you go from here?</h3><p>Congrats! You have a regularly updating disease outbreaks dataset! Now it’s up to you to figure out how you’re gonna use it. <em>cough</em> predictive analytics <em>cough.</em> You can replace the scraping code to scrape data from any other website, write it to the run directory, plug in the PDF scraping operator (if the data you scraped is in PDF format), or plug in a bunch of your own operators to do anything.</p><p>You can find the complete code repo for this exercise <a href="https://github.com/socialcopsdev/airflow_blog">here</a>.</p><p>If you do extend this DAG, do <a href="https://twitter.com/social_cops">tweet at us</a>. We’d love to hear what you did!</p><p>Seize the data!</p><hr><p><a href="https://medium.com/hackernoon/how-to-create-a-workflow-in-apache-airflow-to-track-disease-outbreaks-in-india-fd145575efa4">How to Create a Workflow in Apache Airflow to Track Disease Outbreaks in India</a> was originally published in <a href="https://medium.com/hackernoon">HackerNoon.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>