Programmatic Patent Searches Using Google’s BigQuery & Public Patent Data

Characterizing Google’s Public Patent Data

Most data science projects begin with an analysis of the problem or issue to be addressed and follow that with the preparatory data collecting, formatting and cleaning, all before any insightful analysis begins. But with Google’s BigQuery and the public patent datasets we can skip all that initial data preparation. However, that still doesn’t mean we can jump directly into insightful analysis. An understanding of the data that’s available is required.

Query #1#standardSQL-- PublishedPatentApps_PerYear_PerCountrySELECTMIN(publication_date) AS Earliest_Patent_Publication_Date,MAX(publication_date) AS MostRecent_Patent_Publication_DateFROM`patents-public-data.patents.publications` AS patentsdbWHEREpublication_date > 0;
Figure 1
Query #2#standardSQL-- Applications_Per_CountrySELECT country_code AS Country_Code, COUNT(*) AS Number_of_Patent_AppsFROM (SELECT ANY_VALUE(country_code) AS Country_CodeFROM `patents-public-data.patents.publications` AS patentsdbGROUP BY application_number)GROUP BY Country_CodeORDER BY Number_of_Patent_Apps DESC;
Figure 2
Query #3#standardSQL-- NotNULLAbstracts_PerCountrySELECT COUNT(*) AS Number_of_Patents, country_code AS Country_CodeFROM (SELECT ANY_VALUE(country_code) AS Country_CodeFROM `patents-public-data.patents.publications` AS patentsdb,UNNEST(abstract_localized) AS abstract_infoWHEREabstract_info.text IS NOT NULLANDabstract_info.language = 'en'ANDCHARACTER_LENGTH(abstract_info.text) > 10GROUP BY application_number)GROUP BY Country_CodeORDER BY Number_of_Patents DESC;
Figure 3

Example Keyword Phrase Search Query

Now armed with a better understanding of the patents-public-data.patents.publications dataset, the next objective is to work with some keyword phrase queries to derive some intelligence from the dataset. From a keyword phrase perspective, the abstract is the only text field that spans the international patent applications in the dataset, so that will be the focus in order to provide an international perspective to the results. As noted above, there are ~46 million English abstracts spanning the patent applications from the various countries as listed in the right-hand table of Figure 2. The query chosen to exemplify a keyword phrase search is one that simply produces time-series data representing the number of patent applications that use a specified keyword phrase.

Query #4#standardSQL-- This counts the number of U.S. patents matching the phrase on a monthly basis.WITHPatent_Matches AS(SELECTPARSE_DATE('%Y%m%d', SAFE_CAST(ANY_VALUE(patentsdb.filing_date) AS STRING)) AS Patent_Filing_Date,patentsdb.application_number AS Patent_Application_Number,ANY_VALUE(abstract_info.text) AS Patent_Title,ANY_VALUE(abstract_info.language) AS Patent_Title_LanguageFROM`patents-public-data.patents.publications` AS patentsdb,UNNEST(abstract_localized) AS abstract_infoWHERELOWER(abstract_info.text) LIKE '%internet of things%'AND patentsdb.country_code = 'US'GROUP BYPatent_Application_Number),Date_Series_Table AS(SELECT day, 0 AS Number_of_PatentsFROM UNNEST (GENERATE_DATE_ARRAY((SELECT MIN(Patent_Filing_Date) FROM Patent_Matches),(SELECT MAX(Patent_Filing_Date) FROM Patent_Matches))) AS day)SELECT SAFE_CAST(FORMAT_DATE('%Y-%m',Date_Series_Table.day) AS STRING) AS Patent_Date_YearMonth, COUNT(Patent_Matches.Patent_Application_Number) AS Number_of_Patent_ApplicationsFROM Patent_MatchesRIGHT JOIN Date_Series_TableON Patent_Matches.Patent_Filing_Date = Date_Series_Table.dayGROUP BY Patent_Date_YearMonthORDER BY Patent_Date_YearMonth;
Figure 4
Figure 5
Figure 6
Figure 7

The Take-Aways

  • Overall, the combination of Google’s BigQuery data warehouse and public patent data can be a very powerful tool to quickly implement efficient patent searches via SQL.
  • Google offers a wide variety of public patent datasets that can be utilized for different types of search projects. Each dataset will likely require some level of characterization work to understand the dataset prior to any insightful analysis.
  • The combination of BigQuery and the patents-public-data.patents.publications dataset, creates a platform that excels at the capability to quickly and inexpensively query information from a large number of patents and applications. The BigQuery patent.publications dataset contains bibliographic information from over 73 million patents and applications worldwide and full-text information from over 12 million U.S. patents and applications.
  • The timeliness of Google’s public patent datasets can sometimes be an issue, as the datasets are not currently updated very frequently. That said, as a newer offering, it’s reasonable to expect the timeliness of the data to improve.
  • Keyword phrase queries can be utilized very effectively to find patents and patent applications of interest. The keyword phrase, time-series data query exemplified in this report can be modified to search for different keyword phrases and different countries and can be used as a basis for more complex patent searches.
  • The specific keyword phrase analysis in this report focused on the phrase “internet of things” and shows the initial use of the phrase in patent literature in December of 2007, the increase in usage midyear 2010, and the peak usage midyear 2016. In addition, the analysis shows the heavy usage of the phrase in Chinese patent applications and conversely the lack of usage in Japanese patent applications. More research is needed to determine if these potential abnormalities are due to language translation issues.

Further Reading:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Jim Moeller

Jim Moeller

Intellectual Property and Competitive Intelligence Research Consultant / Registered U.S. Patent Agent