Text Parsing in Python with US-Patent Data

Mohit Sharma
Incedge & Co.
10 min read · Nov 6, 2019

In today’s world, there is far more unstructured data than structured data. Unstructured data (for instance, metadata, images, videos, etc.) makes up 80% or more of enterprise data and is growing at a rate of 55% to 65% per year (according to Datamotion). So, in this article, I will work with semi-structured data, the U.S. Patents data, which comes in XML format. You will learn how to do text parsing in Python and how to deal with semi-structured data. At the end, we will convert the semi-structured data into structured data, which is easier to read and understand for both machines and humans.

Introduction

As we know, data wrangling (also called data munging or data cleaning) is the step in the data science process that takes 80% of a data scientist’s time, leaving only 20% for exploration and modelling (source: “What Data Scientists Really Do” via HBR (Harvard Business Review), and IBM on data scientists’ productivity).

For today’s article, I am using data from the USPTO (the United States Patent and Trademark Office), the federal agency for granting U.S. patents and registering trademarks. They also provide open data about U.S. patents, so today we will work with one open dataset available on their website. (Source: download the dataset from here — Patent grant full-text data/XML)

Dataset Description — Patent grant full-text data (no images) (JAN 1976 – present). The dataset contains the full text of each patent grant issued weekly (on Tuesdays) from January 1, 1976, to the present, excluding images/drawings and reexaminations. Grants from 1976–2001 are in the Automated Patent System (APS) “Green Book” format; from January 2002 onwards, the file format is eXtensible Markup Language (XML) in accordance with the Patent Grant International Common Element (ICE) Document Type Definition (DTD).

Note: the dataset is quite large. If you try to open it in a text editor (for instance, Notepad++), your computer may get stuck if it has too little RAM; at least 8 GB of RAM is needed to read all the data. My suggestion is to use a free or paid online platform such as Kaggle, GCP, or AWS.

For everyone’s convenience, I have uploaded the file to Kaggle. (Source: download the dataset from here — U.S.-Patents-Data)

Kaggle.com/imoisharma/uspatentsdata

For ease of use and understanding, I first copied all the XML data into a text editor (I used Notepad++).

Our dataset initially looks like this:

U.S. Patents Dataset

Now, making sense of the markup above is quite an arduous task, especially if you are a novice in this field. So the question is: how do we deal with a file like this one?

Don’t worry, you don’t need to know everything about XML; a few basics are quite enough. For a better understanding, let me explain with a very simple XML file.

Suppose this is your XML file (see the code below). You will notice there is one tag at the start that says something like <?xml version="1.0" …>, then another tag called <contact-info>, inside which three more tags (name, company, phone) are present. So what are these tags, and what do they mean?

In the example, encoding="UTF-8" specifies that the document is encoded in UTF-8, a variable-width encoding built from 8-bit units. To store characters in 16-bit units, UTF-16 encoding can be used instead.

<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
<contact-info>
<name>Mohit Sharma</name>
<company>Medium.com/@imoisharma</company>
<phone>0426-XXX-XXX</phone>
</contact-info>

Before explaining the above code, you need to know some syntax rules of XML.

XML Declaration

The XML document can optionally have an XML declaration. It is written as follows −

<?xml version = "1.0" encoding = "UTF-8"?>

Where version is the XML version and encoding specifies the character encoding used in the document (as I said earlier).

Syntax Rules for XML Declaration

  • The XML declaration is case sensitive and must begin with "<?xml", where "xml" is written in lower-case.
  • If the document contains an XML declaration, it strictly needs to be the first statement of the XML document.
  • An HTTP protocol can override the value of encoding that you put in the XML declaration. (Source: to learn more, refer to TutorialsPoint)

From the above code, we can see there is one main tag (one table, so to speak) called contact-info, inside which three more tags/columns called name, company, and phone are present. You could also describe them as three columns named contact-info/name, contact-info/company, and contact-info/phone; it depends on how you look at the problem. Below, I have converted the XML file into a CSV file and opened it in MS Excel.
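The conversion itself can be sketched with the Python standard library. This is a minimal illustration for the small contact-info file above, not the patent pipeline; the in-memory buffer stands in for an actual .csv file on disk:

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_data = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<contact-info>
<name>Mohit Sharma</name>
<company>Medium.com/@imoisharma</company>
<phone>0426-XXX-XXX</phone>
</contact-info>"""

# Parse the XML string into an element tree
root = ET.fromstring(xml_data)

# Each child tag of <contact-info> becomes one CSV column
header = [child.tag for child in root]
row = [child.text for child in root]

# Write a header row and one data row to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Opening the resulting file in Excel would show one column per tag (name, company, phone) and one row of values.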

So now you have a basic understanding of what an XML document is and what it looks like. Let’s get started.

Just to make it a little more interesting, I turned it into a case study that you, as a data scientist, have to deal with.

Here is the background information on your task

XYZ XXX Pty Ltd, a small legal company, has approached your company, ABC XXX Pty Ltd. XYZ XXX Pty Ltd is keen to learn more about U.S. patents from January 1976 to the present. Primarily, XYZ XXX Pty Ltd needs help with its customers and legal trademarks. The company has a large dataset related to patents (which they provided to your company, and your company assigned the task to you, its data scientist), but their team is unsure how to analyse it effectively to help optimise their marketing strategy.

You decide to start the preliminary data extraction and identify ways to do text parsing and improve the quality of XYZ XXX Pty Ltd’s data.

After exploration, you identified some tags that are very important for building their business strategy.

  1. grant_id: a unique ID provided for each patent grant, consisting of alphanumeric characters
  2. patent_kind: the category to which the patent grant belongs
  3. patent_title: the title the inventor gave to the patent claim
  4. number_of_claims: the number of claims for a given grant (an integer)
  5. citations_examiner_count: the number of citations made by the examiner for a given patent grant (an integer)
  6. citations_applicant_count: the number of citations made by the applicant for a given patent grant (an integer)
  7. inventors: a tag containing a list of the patent inventors’ names
  8. claims_text: a list of claim texts for the different patent claims
  9. abstract: the patent abstract text

Software and tools I used:

Environment: Python 3.7.1 and Anaconda 5.7.4 (64-bit)

Libraries used:

pandas 0.23.4 (for data frame, included in Anaconda Python 3.6)

re 2.2.1 (for regular expression, included in Anaconda Python 3.6)

Table of Contents

  1. Import the libraries
  2. Reading a File
  3. Compiling a Regular Expression pattern
  4. Data Cleaning & Creating lists
  5. Creating a DataFrame using pd.DataFrame()
  6. Checking Missing Value
  7. Saving DataFrame into CSV
  8. Summary

1. Import the libraries

Import the libraries
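The import cell itself is only two lines, matching the libraries listed above:

```python
import re            # regular expressions, used to extract tags from the raw XML text
import pandas as pd  # data frames, used to assemble the final table
```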

2. Reading a File

As I said earlier, I copied all the data into a text file named “U.S. Patents”; you can also download the same file from Kaggle. So, we start by opening the file in reading mode, then reading its content and storing it in the file_content_raw variable. The next step is to find the number of patents in the file. I created a regular expression pattern, which returns a pattern object called text1. If you feel you are not really understanding a regular expression, a good way is to look at its visualization. (Source: a regular-expression visualizer using railroad diagrams.)

Regexper.com

At last, we split on the pattern object and, iterating over the pieces in a loop, find the number of patents in the file.

Finding the number of patents
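The original code is shown only as a screenshot, so here is a hedged sketch of the splitting idea; the two-document sample string is a stand-in assumption for the real multi-gigabyte file, and the exact pattern the article compiles may differ:

```python
import re

# Stand-in for file_content_raw = open("U.S. Patents").read();
# the concatenated USPTO file repeats one XML declaration per patent.
file_content_raw = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant>...</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant>...</us-patent-grant>\n'
)

# Splitting on the declaration separates the individual patent documents
text1 = re.compile(r'<\?xml version="1\.0" encoding="UTF-8"\?>')
file_content = text1.split(file_content_raw)

# The first split element is the empty text before the first declaration
patents = [chunk for chunk in file_content if chunk.strip()]
print("Number of patents:", len(patents))
```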

3. Compiling a Regular Expression pattern

A regular expression is, I feel, one of the best tools for finding a particular pattern in text. Below, I have created a regular expression for each column we are interested in.

Writing Regular Expressions for desired columns

I will explain just one of them, grant_id. Once you explore the file, you will see that line 3 of each patent document gives us the grant_id.

Extracting grant_id

We see that a US patent grant_id starts with a capital U and S, followed by some alphanumeric characters and then some digits. In the figure below, \d{6} means exactly 6 occurrences of a digit (d for digit). (Note/tip: writing the right regular expression, one that extracts the right text from each file, is not a straightforward task, and sometimes you have to do a lot of validation to retrieve each text pattern.)

Pythex.org is a real-time regular expression editor for Python and a quick way to test your regular expressions.

Similarly, I wrote a regular expression for each pattern of interest and stored it in a corresponding pattern object.
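The article’s compiled patterns appear only as screenshots, so the sketch below is one plausible version of the grant_id pattern, following the description above; the attribute line is a hypothetical example modelled on the <us-patent-grant> tag, not taken from the dataset:

```python
import re

# Assumption: "US", then optional capital letters, then 6-8 digits
grant_id = re.compile(r'US[A-Z]*\d{6,8}')

# Hypothetical line, modelled on the file attribute of <us-patent-grant>
line = '<us-patent-grant lang="EN" file="USD0864516-20191029.XML">'
print(grant_id.findall(line))
```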

4. Data Cleaning & Creating lists

Now comes the crucial step.

1.) First, I created 9 empty lists, keeping in mind that I have to build a pandas data frame later on, and with lists it is quite easy to make a data frame.

2.) After that, I iterate over each line from file_content using re.findall (which returns a list of all non-overlapping matches of a pattern in a string). So grant_id below is the pattern object we created above, and using findall I can find all the non-overlapping matches of the pattern in the string (a line, in our case). I did the same for the rest of the columns.

gid_list, title_list, kind_list, no_of_claim_list, name_list, applicant_list, examiners_list, claim_list, abstract_list = ([] for i in range(9))

for line in file_content:
    gid = grant_id.findall(line)              # to find grant_id
    title = patent_title.findall(line)        # to find patent_title
    kinds = kind.findall(line)                # to find kind
    sclaim = number_of_claim.findall(line)    # to find number_of_claims

    # to find inventors ("NA" when the patent lists none)
    names = "NA"
    inventors = re.findall(r"<inventor.*?>[\s\S]*</inventor>", line)
    for person in inventors:
        first = first_name.findall(person)
        last = last_name.findall(person)
        name = [firstName + " " + lastName for firstName, lastName in zip(first, last)]
        if len(name) != 0:
            names = name

    # count citations made by the applicant (0 when there are none)
    citation_by_applicants = len(citation_by_applicant.findall(line))

    # count citations made by the examiner (0 when there are none)
    citation_by_examiners = len(citation_by_examiner.findall(line))

    # claim text ("NA" when absent)
    claim_text = re.findall(r"<claim-text>[\s\S<]*</claim-text>", line)
    if len(claim_text) == 0:
        claim_text = ["NA"]

    # abstract ("NA" when absent)
    abstracts = abstract.findall(line)
    if len(abstracts) == 0:
        abstracts = ["NA"]

    # append to all the lists only when a grant_id was found on this line
    if len(gid) != 0:
        gid_list.append(gid[0])
        title_list.append(title[0])
        kind_list.append(kinds[0])
        no_of_claim_list.append(sclaim[0])
        name_list.append(names)
        applicant_list.append(citation_by_applicants)
        examiners_list.append(citation_by_examiners)
        claim_list.append(claim_text[0])
        abstract_list.append(abstracts[0])

# cleaning claim text (cleaner ... cleaner6 are pattern objects compiled in step 3)
for element in range(len(claim_list)):
    claim_list[element] = re.sub(cleaner, '', claim_list[element])
    claim_list[element] = re.sub(cleaner2, ',', claim_list[element])
    claim_list[element] = re.sub(cleaner3, ',', claim_list[element])
    claim_list[element] = re.sub(cleaner4, '.,', claim_list[element])
    claim_list[element] = re.sub(cleaner5, ',', claim_list[element])
    claim_list[element] = re.sub(cleaner6, '; ', claim_list[element])

# map the raw kind codes to their descriptions
Kind1 = [w.replace('P2', 'Plant Patent Grant (with a published application) issued on or after January 2, 2001.') for w in kind_list]
Kind2 = [w.replace('B2', 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.') for w in Kind1]
Kind3 = [w.replace('S1', 'Design Patent') for w in Kind2]
Kind4 = [w.replace('B1', 'Utility Patent Grant (no published application) issued on or after January 2, 2001.') for w in Kind3]

5. Creating a DataFrame using pd.DataFrame()

The iteration part is done, and we have the lists as results. Now it is time to create the data frame using pd.DataFrame().

data_frame = pd.DataFrame({
    'grant_id': gid_list,
    'patent_title': title_list,
    'kind': Kind4,
    'number_of_claims': no_of_claim_list,
    'inventors': name_list,
    'citations_applicant_count': applicant_list,
    'citations_examiner_count': examiners_list,
    'claims_text': claim_list,
    'abstract': abstract_list,
})

6. Checking Missing Value

In [8]:

data_frame.isnull().sum()

Out[8]:

grant_id                     0
patent_title                 0
kind                         0
number_of_claims             0
inventors                    0
citations_applicant_count    0
citations_examiner_count     0
claims_text                  0
abstract                     0
dtype: int64

7. Dataframe

As you can see, this structured data frame is easy to read and understand for both machines and humans.

Just to be on the safe side, I extracted the details of the inventors who made the “Thin Food Cluster” claim in row 0. Then, using the same claim_text, I searched on Google to see what I would find.

Google recommended this website (FPO, Free Patents Online). Opening it shows the same document with the same inventors’ names, patent title, U.S. grant_id, etc., along with more information. Hence, it validates that the results of the extraction are correct. (Note: I showed you only one check here; however, I did several other tests on this dataset.)

Check it out for info — http://www.freepatentsonline.com/D864516.pdf
Source-Check it out for info of patents info
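Step 7 in the table of contents, saving the data frame into a CSV, is a one-liner with pandas. Here is a sketch with a tiny stand-in frame written to an in-memory buffer; in practice you would pass data_frame itself and a filename of your choice:

```python
import io
import pandas as pd

# Tiny stand-in for the real data_frame built in step 5
data_frame = pd.DataFrame({"grant_id": ["USD0864516"], "number_of_claims": [15]})

# index=False drops the row index from the output;
# replace the buffer with a path such as "us_patents.csv" to write a file
buffer = io.StringIO()
data_frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```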

Summary

I hope you learned something new, especially how to deal with semi-structured data such as XML and JSON. Note that we used only two main libraries: pandas for creating the data frame and the re module for regular expressions.

We could probably extract more tags with regular expressions, but I leave that to you.

Source code can be found at Github. I look forward to hearing any feedback or questions. Also, any new topic or concept you want me to discuss. Comment down below. Thank you.

You can also connect with — Linkedin | Kaggle | Facebook | Instagram | Twitter
