It is a commonly known fact that technology, especially the Internet, has progressively made the life of stalkers much easier. However, you know things are really getting out of control when the Government of India decides to make it easier for them.
Recently, I was given the idea to build an Android app that, taking your Voter ID as input, will lead you to your polling station. I never did end up building this application, but while investigating if it was possible to build it, I found some rather interesting things.
What I have found so far is only for Delhi, but the methods used can be applied to every single state and union territory within India.
A Short Summary of the Findings
- It is not only possible, but extremely easy to retrieve the PDF electoral rolls for every state and union territory in India, which contain the personal information of every registered voter.
- These PDFs can then be processed in a matter of minutes to produce the address, name, father’s name, gender, age and Voter ID number of every single registered voter in India.
- Nearly 24% of the new-format Voter IDs assigned within Delhi alone (about 22% of all the IDs in its rolls) fail to conform to the government format: they fail the Luhn checksum test (explained in Part 2) used to validate them. It is likely that other states are in a similar, if not worse, condition.
Part 1 — Availability of personal information
The ease with which one is able to retrieve the electoral rolls for all of Delhi, and the fact that this can be adapted in a few minutes to any other part of the country, is worrisome.
The application would have required some form of lookup service that takes a voter ID and returns the corresponding constituency and polling booth, so that I could then lead the user to it.
Such a service already exists. While this made my life very simple, it also did the same for anyone trying to locate a person, especially in bulk. Since the form isn’t protected by anything that can prevent automation, such as a CAPTCHA, a script can easily be written to look up any number of people by name and discover their approximate address, as well as their age, gender and father’s name.
While this service would have served my immediate purpose, I was trying to build something that could be scaled to all of India. Keeping that in mind, I decided to build my own database of Voter ID numbers and their corresponding polling booths and constituencies.
Every state and union territory in India offers PDF Electoral Rolls for each polling booth. These PDF rolls contain the name, father’s name, age, Voter ID, and enough information to figure out a more or less exact address for any person, down to the house number. Some rolls also contain photos of the people.
The first step to using these to create an index would be to obtain all of the PDFs in question. These are stored as individual files (though Arunachal Pradesh offers ZIP files of areas), and downloading them manually is unfeasible to say the least.
Since manually downloading everything is out of the question, I turned to scripting. The files are stored in a very structured manner, which makes automating their retrieval a trivial task. The URL for any file fits the following format:
http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency/AC<AC NUMBER>/A<THREE DIGIT AC NUMBER><FOUR DIGIT BOOTH NUMBER>.pdf
An example URL (constituency 22, booth 161) hence looks like: http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency/AC22/A0220161.pdf
Delhi has 70 assembly constituencies, each with a different number of polling booths. I wrote a Python script to retrieve all the electoral rolls.
The script is available on GitHub here. Running it is as simple as changing the path in directory to point to somewhere on your system, and then executing it.
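The script itself is not reproduced in this text. As a rough illustration of the approach, here is a minimal sketch of a sequential downloader built on the URL pattern above. The function names, the `max_booths` cap, the one-second delay and the rule of moving on to the next constituency at the first HTTP error are all assumptions made for this example, as is the guess that single-digit constituency numbers appear unpadded in the `AC<AC NUMBER>` folder name.

```python
import os
import time
import urllib.error
import urllib.request

BASE_URL = "http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency"
NUM_CONSTITUENCIES = 70  # Delhi has 70 assembly constituencies


def roll_url(ac, booth):
    """Build the roll URL for a constituency number and a booth number."""
    # Assumption: the folder uses the unpadded AC number (as in AC22),
    # while the file name pads it to three digits and the booth to four.
    return f"{BASE_URL}/AC{ac}/A{ac:03d}{booth:04d}.pdf"


def download_rolls(directory, max_booths=500, delay=1.0):
    """Fetch rolls one booth at a time into per-constituency folders.

    Assumption: a booth number that does not exist produces an HTTP
    error, which is treated as the end of that constituency.
    """
    for ac in range(1, NUM_CONSTITUENCIES + 1):
        ac_dir = os.path.join(directory, f"AC{ac:03d}")
        os.makedirs(ac_dir, exist_ok=True)
        for booth in range(1, max_booths + 1):
            target = os.path.join(ac_dir, f"A{ac:03d}{booth:04d}.pdf")
            try:
                urllib.request.urlretrieve(roll_url(ac, booth), target)
            except urllib.error.HTTPError:
                break  # move on to the next constituency
            time.sleep(delay)  # pause between requests
```

Calling `download_rolls("/path/to/rolls")` would walk constituencies 1 to 70 in order, one file at a time.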
This is a single-threaded script, so it could be sped up by downloading more than one file at once. However, it served the immediate requirement, and within a few hours I had downloaded every single PDF for Delhi, all 11,832 of them. Before running the script, keep in mind that this is about 5.47 GB of data in all, and we haven’t even processed it yet.
PDFs aren’t an edit-friendly format, and building an index directly from the files would be quite a task. Instead, I converted them to text files, which are far simpler to handle, with the help of Xpdf and another Python script.
Once again, the code is available as a gist, and you can run the script by putting in the same directory you used earlier and executing it (you must have Xpdf installed).
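The conversion script is likewise not reproduced here. A minimal sketch of the idea, assuming Xpdf's `pdftotext` command-line tool is on the PATH, might look like the following. The function names and the mirrored directory layout are assumptions for illustration, and the original script may well have used different options.

```python
import os
import subprocess


def text_path(pdf_path, pdf_root, text_root):
    """Mirror a PDF's relative path under text_root with a .txt extension."""
    rel = os.path.relpath(pdf_path, pdf_root)
    return os.path.join(text_root, os.path.splitext(rel)[0] + ".txt")


def convert_all(pdf_root, text_root):
    """Run pdftotext on every PDF found below pdf_root."""
    for folder, _, files in os.walk(pdf_root):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            src = os.path.join(folder, name)
            dst = text_path(src, pdf_root, text_root)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # -layout asks pdftotext to preserve the page's column layout
            subprocess.run(["pdftotext", "-layout", src, dst], check=True)
```

Keeping the output tree parallel to the input tree means the constituency and booth can still be read off each text file's path.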
This gave me a text version of every PDF file, sorted by constituency and polling booth.
The next step in the indexing process is to extract all the voter IDs from each file and store them in a database, along with metadata such as the polling booth they belong to.
I accomplished this with yet another Python script, which iterated through the text files and wrote out another set of files containing all the voter IDs.
As always, the code can be found in a gist, and you can run it yourself.
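As a hedged illustration of the extraction step (not the gist itself), voter IDs can be pulled out of the converted text with a regular expression. The two patterns below are assumptions based on the formats described in Part 2, and the function name is made up for this example; real rolls may contain IDs that match neither shape.

```python
import re

# Assumed shapes: three letters plus seven digits (newer format), or
# XX/00/000/000000 (older format). Both are guesses from the format
# descriptions, not patterns taken from the original script.
VOTER_ID = re.compile(r"\b(?:[A-Z]{3}\d{7}|[A-Z]{2}/\d{2}/\d{3}/\d{6})\b")


def extract_ids(text):
    """Return every voter-ID-shaped token in the text, in order of appearance."""
    return VOTER_ID.findall(text)


sample = "12 ABC1234567 Ram Kumar\n13 DL/01/002/123456 Sita Devi"
print(extract_ids(sample))
```

Running the extractor once per text file, and recording which file each ID came from, gives the constituency and booth for every ID.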
The final step in the indexing process is to add all this information to a database. I opted for some more Python, with SQLite as the database.
This code too is available in a gist.
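For readers who want a feel for this step, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names and helper functions are hypothetical; the schema in the original gist may differ.

```python
import sqlite3


def open_index(path):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS voters ("
        " voter_id TEXT NOT NULL,"
        " constituency INTEGER NOT NULL,"
        " booth INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_voter_id ON voters (voter_id)")
    return conn


def add_booth(conn, constituency, booth, voter_ids):
    """Insert all IDs found in one booth's roll in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO voters (voter_id, constituency, booth) VALUES (?, ?, ?)",
            [(vid, constituency, booth) for vid in voter_ids],
        )


def lookup(conn, voter_id):
    """Return the (constituency, booth) pairs recorded for a voter ID."""
    return conn.execute(
        "SELECT constituency, booth FROM voters WHERE voter_id = ?", (voter_id,)
    ).fetchall()
```

Batching each booth's IDs into one `executemany` call keeps the number of transactions close to the number of files rather than the number of IDs.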
Having done all this, I was left with a database of every Voter ID in those files, along with which constituency and polling booth it belonged to. In all, I had 13,066,244 such IDs with corresponding data.
Once I had all this data, I decided to check just how many of the voter IDs match the format and guidelines laid down by the ECI in 2000. Which leads us to…
Part 2 — 22% of the voter IDs do NOT match the government format
There are currently two voter ID formats in use. The first was laid down in 1993 and modified in 1994 as per this document, which specifies the details of both formats. The second was introduced in May 2000.
Format 1
- A 13-character alphanumeric sequence of the form XX/00/000/000000
- The parts are separated by an oblique (/)
- The first part consists of two letters denoting the state
- The second part consists of two digits denoting the parliamentary constituency
- The third part consists of three digits denoting the assembly constituency
- Finally, the fourth and last part consists of six digits which form a running serial
Format 2
- A 10-character alphanumeric sequence of the form XXX000000C
- The first three characters are letters, and represent the area in which the ID was first assigned
- The following six characters are a running serial
- The 10th character is a checksum for the six serial digits, calculated using the Luhn checksum algorithm, the same algorithm used to validate credit card numbers.
A checksum algorithm is an algorithm that generally adds a character, most often a digit, to the end of a number. This digit is calculated by using all the other digits in that number, and can detect errors in entering the number. If while entering the number you hit the wrong digit somewhere, the checksum will fail, and you will instantly realise that this is not the correct number.
Armed with the formats and checksum algorithm, I decided to check if all Format 2 IDs passed the Luhn check.
In the database, there are 12,052,087 IDs of format 2, out of a total of 13,066,244 IDs.
I used a Python script to determine how many of those 12,052,087 IDs fail the Luhn checksum.
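That script is not reproduced in this text, so the following is only a sketch of how such a check could work. The Luhn routine is the standard one; applying it to the last seven digits of an ID (six serial digits plus the check digit) is an assumption based on the format description, and the regular expression and function names are made up for this example.

```python
import re

# Assumed shape of a Format 2 ID: three letters followed by seven digits.
FORMAT_2 = re.compile(r"[A-Z]{3}\d{7}")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn check."""
    total = 0
    # Walk from the rightmost digit (the check digit) leftwards, doubling
    # every second digit and subtracting 9 from any two-digit result.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def count_failures(voter_ids):
    """Return (number of Format 2 IDs seen, number that fail Luhn)."""
    checked = failed = 0
    for vid in voter_ids:
        if FORMAT_2.fullmatch(vid):
            checked += 1
            if not luhn_valid(vid[3:]):
                failed += 1
    return checked, failed
```

For instance, under these assumptions the made-up ID `ABC1234566` passes, while `ABC1234567` does not.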
Of the 12,052,087 Format 2 IDs, 2,884,941 (about 24%) failed the checksum test. This means that either the software being used by the government in these elections is buggy, or that there has been human error in entering these numbers somewhere, which was not caught.
In either case, the implications of this are massive. If voter IDs that do not conform to the format set by the government can slip unnoticed through the system, then it opens the door to a whole host of other problems, such as fake voter IDs that may be injected into the system and not noticed.
I’m making a GitHub repository with the necessary scripts to create the database containing the IDs and failed IDs, which can be found here. I will update it with a few other states’ scripts, if time permits.
This short exercise, which started with the creation of an app, has led to the discovery of some huge, gaping security and privacy holes within the government. Most of the techniques I used can be easily blocked, and are blocked by every major privately run website. However, the government’s systems make no effort to prevent me from downloading the information of every registered voter in the country, aside from having absurdly slow servers for some states.
Here are a few fixes the government could implement:
- Rate limit requests to their servers: While downloading all the files, I was making several requests a second to their servers, all directly hitting PDF files. This can easily be rate limited so that nobody can download at such speed.
- Add a CAPTCHA to their lookup form: Even without the PDFs, one can find a large amount of information by using the lookup form. As there is no CAPTCHA, anyone can make automated requests to the page and retrieve as much information as they like.
- Actually follow their own guidelines: The fact that nearly 25% of the new voter IDs fail their own test is very worrisome. If so many invalid IDs can slip through the system unnoticed, it is entirely possible for fake IDs to be added and never caught, or for people to vote incorrectly through some other system error.
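The first fix can be illustrated with a token bucket, one common way to rate limit requests. This is only a sketch of the idea, not a description of any government system: the class name, its parameters and the injectable `clock` are assumptions, and a real deployment would typically keep one bucket per client address at the web server or proxy layer.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at the bucket size.
        refill = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A client making several requests a second against a bucket with `rate=1` would see most of them refused once its initial burst is spent.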
This article is written in the hope that the people can see the state of their election system, and in the hope that it will attract enough attention to have the government implement some fixes and security measures.