Protecting Personal Identifiable Information with LexNLP

Suryakanta Mohapatra
Published in cisco-fpie
Mar 26, 2021

It is no wonder companies are taking stringent measures to ensure they are fully compliant with the EU's General Data Protection Regulation (GDPR), which protects the privacy of Personally Identifiable Information (PII) of EU residents. Even stricter regulations are coming sooner rather than later, such as the California Privacy Rights Act in 2023 and the SAFE DATA Act by the end of 2021. Hence, protecting PII is becoming a matter of paramount importance to businesses.

On the other hand, it is quite burdensome for humans to verify each and every public medium at their disposal to check whether it contains PII. Hence, in this article we will delve deeper into how to retrieve textual information from an image with Optical Character Recognition (OCR) and how to use a Natural Language Processing library called LexNLP to process the extracted text and check whether it contains any PII. This is the output of a recent hackathon I participated in.

So, to give a high-level overview, first we will walk through OCR and its application, and at the end we will see how LexNLP helps detect the presence of PII. To tie everything together, we'll write a simple Python script that extracts text and checks it for PII. All of this is exposed via a Flask web application for the convenience of user interactivity.

Optical Character Recognition

Optical Character Recognition, or OCR, is a technology that detects the presence of textual information in an image and extracts machine-encoded text that the computer understands. Detecting text through automated processes is not as trivial as it appears to humans. Behind the scenes is a series of complicated steps involving image processing and other complex algorithms that finally extract the text. To the computer, the processed image is just a matrix of black and white dots. Extraction involves multiple phases such as despeckling, binarisation, line removal, layout analysis, script recognition, segmentation and normalization, matrix matching, and post-processing. We will keep this jargon out of the scope of this article for simplicity.

In this article, we will stick to the PyTesseract library to retrieve text from an image. It is a wrapper for Google's Tesseract-OCR engine. Other libraries, such as PyOCR, can also be used.

Natural Language Processing

Although this is a vast field in itself, primarily concerned with the interaction between computers and human language in order to process and analyze large amounts of natural language data, here we will use it via an open-source library called LexNLP to extract PII from textual content.

Technology stack we will use

For this project, we are going to use the following:

  • PyTesseract for Optical Character Recognition
  • Flask web framework for OCR server
  • Pillow library for image manipulation
  • LexNLP for extracting PII

Getting our hands dirty

Enough explanation; now let's build the real thing, which in layman's terms is a web page that allows a user to upload an image, snapshot, or scanned photo to the server and get back the information from the image along with any PII that may be present.

First we are going to install all the prerequisites for this project. Then we will develop the Flask application needed for ease of user interaction.

We will use pip to install software packages. Install pip by following the steps below:

Manually download get-pip.py from bootstrap.pypa.io, or use the curl command below to do so:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Then navigate to the folder where get-pip.py was downloaded and run the following command to install pip:

python get-pip.py

Now we are ready to install pipenv using the following command

pip install pipenv

pipenv creates and manages a virtual environment for our project. Now that we have pipenv, let's create a directory and kick-start the project with the following command. The --three flag tells pipenv to use the python3 interpreter.

mkdir lexnlp-extraction && cd lexnlp-extraction && pipenv install --three

Activate the virtual environment with the following command:

pipenv shell

Now we can install all packages/dependencies with the pipenv install command. Since we depend on pytesseract for the OCR functions and Pillow for image manipulation, let's install both of them now:

pipenv install pytesseract Pillow

The most important dependency to install is LexNLP, which provides the core functionality to fetch PII from the supplied text. Follow the steps below to install LexNLP:

Clone the LexNLP git repo to a local folder and install it with the pipenv install command:

git clone https://github.com/LexPredict/lexpredict-lexnlp.git
cd lexpredict-lexnlp
pipenv install

The above steps install the LexNLP library, which provides an array of features, but we will focus on its PII extraction feature.

Last but not least, install the Flask framework with the following command:

pipenv install flask

Now that we have installed all the prerequisites, it's high time we created the necessary scripts. Should you need to learn Flask before moving ahead, feel free to visit the Flask documentation.

Before writing the scripts, let us see what the framework layout looks like:

Framework Layout
  • app.py : Kick-starts the Flask server and contains the necessary routes
  • ocr_extraction.py : Handles the extraction of text from the image
  • lexnlp_extraction.py : Handles the extraction of PII from the text
  • templates/ : Folder containing all the HTML files
  • static/uploads/ : Folder containing all uploaded images
  • index.html : Landing page with which the application starts
  • upload.html : Lets the user upload an image file and shows the result
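Laid out on disk, the project could look something like this (note that index.html and upload.html live inside the templates/ folder):

lexnlp-extraction/
├── app.py
├── ocr_extraction.py
├── lexnlp_extraction.py
├── static/
│   └── uploads/
└── templates/
    ├── index.html
    └── upload.html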

Let’s define the content of ocr_extraction.py

ocr_extraction.py
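A minimal sketch of what this script can look like is shown below; the function name extract_text is an assumption used for illustration:

# ocr_extraction.py -- a minimal sketch; the function name extract_text is an assumption
from PIL import Image
import pytesseract


def extract_text(image_path):
    """Open the image with Pillow and run Tesseract OCR on it."""
    image = Image.open(image_path)
    # image_to_string() returns the machine-encoded text found in the image
    return pytesseract.image_to_string(image)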

The above script takes charge of opening the image using the Image class of the Pillow library and then extracts the text using the image_to_string() function of pytesseract.

lexnlp_extraction.py is another file, which defines a method to extract the list of PII from the supplied text.

lexnlp_extraction.py
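A minimal sketch is shown below; it assumes the get_pii() generator documented in LexNLP's lexnlp.extract.en.pii module (which covers items such as US SSNs and phone numbers), so adjust the import if your LexNLP version exposes a different interface:

# lexnlp_extraction.py -- a minimal sketch, assuming the get_pii() generator
# from lexnlp.extract.en.pii, which yields the PII items found in the text
from lexnlp.extract.en.pii import get_pii


def extract_pii(text):
    """Return a list of PII items detected in the supplied text."""
    return list(get_pii(text))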

app.py is the file that actually starts the Flask application. Here is the code.

app.py
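Below is a sketch of what app.py can look like; the route names, the form field name "file", and the template variables extracted_text and pii_list are assumptions used for illustration:

# app.py -- a minimal sketch; the form field name "file" and the template
# variables extracted_text / pii_list are assumptions
import os

from flask import Flask, render_template, request

from ocr_extraction import extract_text
from lexnlp_extraction import extract_pii

UPLOAD_FOLDER = os.path.join("static", "uploads")

app = Flask(__name__)


@app.route("/")
def home_page():
    # Basic route: serves the landing page with a link to the upload page
    return render_template("index.html")


@app.route("/upload", methods=["GET", "POST"])
def upload_page():
    if request.method == "POST":
        image = request.files.get("file")
        if image and image.filename:
            # Save the uploaded image under static/uploads/
            saved_path = os.path.join(UPLOAD_FOLDER, image.filename)
            image.save(saved_path)
            # Run OCR first, then scan the extracted text for PII
            text = extract_text(saved_path)
            pii_list = extract_pii(text)
            return render_template("upload.html",
                                   extracted_text=text,
                                   pii_list=pii_list)
    return render_template("upload.html")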

The upload_page() function is called when an image is uploaded from the HTML page, and the uploaded image is stored in the static/uploads/ folder. Similarly, the HTML files are stored in the templates/ folder. So we have to manually create both of these folders and keep the HTML files (shown below) in the templates/ folder.
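Both folders can be created up front with a single command, for example:

mkdir -p templates static/uploads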

The index.html file is served by the default (basic) route as the home page. index.html has a hyperlink to upload.html. Below is the code.

index.html

And finally, upload.html is responsible for submitting the image via the POST method and rendering the result/response from app.py. Below is the code:

upload.html

Now we are ready to run the app. All we have to do is activate the virtual environment in the same directory and start the server by running the following commands:

pipenv shell 
flask run

The server should start with the following message:

 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Go to the browser and load the URL displayed above (127.0.0.1:5000), and we should see the following page:

127.0.0.1:5000

When the upload link is clicked, it should show the following page:

127.0.0.1:5000/upload

On the upload page, we choose the image from which we intend to fetch the text and PII by clicking the Choose File button and selecting the file from a local folder, and then we click the Upload button, which shows the extracted text and PII.

Below are the results when tried with different images: a screenshot of a digitally written document, a black-and-white scanned image of a handwritten document, and a color scanned image of a handwritten document.

Result for Screenshot Image of digitally written document

The above result is for a screenshot image of a digitally written document, where the result (both the text and PII such as the SSN and phone number) is fetched accurately.

Result for Black and White scanned Image of handwritten document

The above result is for a black-and-white scanned image of a handwritten document, where the SSN was fetched poorly: a 1 was misinterpreted as '(' and hence no PII was detected.

Result for Color scanned Image of handwritten document

The above result is for a color scanned image of a handwritten document, which is quite similar to the previous one, the difference being "Live en" vs "Live tn". Moreover, it would be worth the effort to contribute to LexNLP so that information such as medical records, tax records, etc. falls within its PII scope.

The primary purpose of this article is to share what I learned at a hackathon, which may help beginners trying to get into this field. That said, the accuracy of the extracted text depends heavily on how much contrast the image has, and this area definitely needs further analysis and testing.
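One simple experiment along those lines is to convert the image to grayscale and boost its contrast with Pillow before handing it to Tesseract. Below is a rough sketch; the helper name and the enhancement factor are arbitrary and not part of the project code:

# A possible preprocessing step (not part of the project code): convert to
# grayscale and boost contrast before running OCR, which can help Tesseract
# on low-contrast scans
from PIL import Image, ImageEnhance, ImageOps
import pytesseract


def extract_text_enhanced(image_path, factor=2.0):
    image = ImageOps.grayscale(Image.open(image_path))
    image = ImageEnhance.Contrast(image).enhance(factor)
    return pytesseract.image_to_string(image)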

Conclusion:

Although we have achieved a lot in this project, we could also have included PDF files in the text extraction process, for which we could use the PyPDF2 library while the rest of the process stays the same. Pytesseract and LexNLP are great open-source libraries for OCR and PII detection, and they can be very useful in many use cases to make sure PII privacy is well complied with.
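As a rough sketch of that idea, and assuming a recent PyPDF2 release that exposes the PdfReader API, the PDF branch could look like this:

# A possible PDF branch (not implemented in this project): pull text straight
# out of a PDF with PyPDF2 and feed it to the same PII extraction step.
# Assumes a recent PyPDF2 release that exposes the PdfReader API.
from PyPDF2 import PdfReader

from lexnlp_extraction import extract_pii


def extract_pii_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return extract_pii(text)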

The source code of the project can be accessed on GitHub.

Reference and Credits

[1] Robley Gori, "PyTesseract: Simple Python Optical Character Recognition", https://stackabuse.com/pytesseract-simple-python-optical-character-recognition

[2] "Extracting Personally-Identifiable Information (PII)", LexNLP documentation, https://lexpredict-lexnlp.readthedocs.io/en/docs-0.1.6/modules/extract_en_pii.html

[3] Tesseract on GitHub, https://github.com/tesseract-ocr/tesseract

[4] pytesseract on PyPI, https://pypi.org/project/pytesseract/

[5] Word cloud from https://www.wordclouds.com/

[6] Icons in the cover image from https://icons8.com/
