


The Bioeconomy of Emerging Cancer Therapy

Patent Data Analysis and Concept Graph Mining As Means to Learn About Commercial R&D

“Human Biology technology on abstract digital background” by ipopba

Patents are essential to gain a foothold in competitive markets of biotechnology and to bring a product to customers. A significant amount of research and development normally precedes the creation of a biological product, and for the financial accounts to come out positive, patents have become indispensable to applied biotechnology.

By their very legal definition, patents are public. We can therefore use patent data to learn about the R&D that is pursued with the intent of commercial application now or in the near future.

The interests that prompt patent filings therefore differ from the interests that motivate the publication of research papers in academic journals. Taken as a whole, patent data can reveal present beliefs among economic actors about the horizon of the bioeconomy. So I ask:

What biotech R&D are people with cash investing in presently under the belief that the investments have positive expected cash return in the next decade or two?

We can discover part of the answer by pursuing a data-driven analysis of recent US patent data.

The bioeconomy spans a great many technical domains. In order not to overwhelm the reader with sections and subsections of analysis, I limit my analysis to cancer therapy. Nowadays biology features prominently in cancer therapy because it turns out that synthetic proteins, nucleic acids, and even entire cells can be engineered to enact targeted functions within a patient. It was only in the late 1990s that these types of therapies made an impact in the clinic. The global market size for cancer therapy is in the ballpark of $150 billion and growing fast. In other words, this is a big slice of the bioeconomy where advanced biological engineering and R&D meet a market and translate into money.

I avoid describing my novel data analytical method in detail: my focus is on results, not on serving up another data science method dump. Some details are needed for context, though, and the code I have created is available in my code repository.

By the end, I hope to provide the reader with a partial answer to the question posed above as far as biological cancer therapy is concerned, as well as an outline of a method that can guide our never-ending learning journey in applied biotechnology.

Cartoon representation of fragments of two influential therapeutic antibodies (trastuzumab and pertuzumab), featured in numerous patents, bound to their target ERBB2; from structure 6OGE.

What Are Patents, How To Get One, And Why They Matter?

Patents are one component of intellectual property (IP). The other components are copyrights, trademarks and trade secrets. IP is defined in law and creates (or codifies, depending on philosophical priors) a class of abstract things as property, which can then be owned, traded, defended, etc.

For patents specifically the following apply:

  • A patent grants its owner (or assignee) a time-limited (typically 20 years) right within a certain jurisdiction (like USA, Germany, Taiwan) to exclude others from making, using, selling or importing the innovation which the patent claims as property. This right is not automatically enforced; the owner must actively exercise it, if needed through litigation, as when one pharmaceutical company had to pay another $1.2 billion over an infringement of IP.
  • A patent cannot claim what has already been claimed by other patents, what is already in the public domain, or what belongs to certain special classes of things (like a law of nature or a living creature). It must claim something novel and non-obvious.
  • The process of being granted a patent involves examination by government agencies, and can take years. In practice the examination mostly sorts out the question of novelty, and patent applications are often narrowed in scope so as not to claim things too broadly. An application can be denied or rendered practically worthless. The examination is not, however, directly concerned with the question of whether the innovation works.
  • A requirement of being granted a patent is that the innovation is adequately disclosed such that a person having ordinary skill in the art can understand it. The exchange is: the innovator enables others to build further on the innovator’s insights and intellectual labour, and in return the innovator can more efficiently commercialize the innovation. The hope is that the required transparency promotes more useful innovation in the community.

Obtaining a patent costs money and, as with all things legal, law firms and lawyers do a great deal of the work. Apart from paying for the well-decorated interiors of law firms, this cost implies that a rational actor will only file a patent application if they have some minimum expectation that it will become useful. Therefore in aggregate, though not in every case, patents reflect beliefs about the cutting edge of soon-to-be commercially useful innovation.

A lot more can be said about patent strategy, tactics of prosecuting a patent, variations between countries, as well as the normative debate about what a reformed system for useful innovation ought to be. But that is for another time and place.

So what makes patents especially relevant in biotechnology, far more than in many other industries?

The typical product of biotechnology costs a great deal to figure out. Be it a novel protein sequence, organic molecule, seed, nucleic acid sequence, or cell, its utility is a non-trivial function of its precise structural composition — one among a combinatorially large set of possibilities. And the useful function is rarely evident, but demands costly testing.

Once the function and structural composition are established, however, the production, distribution, and sales of the physical manifestation of the innovation are in most cases not nearly as costly or limiting. The ability to govern the terms of the sales of physical manifestations of intellectual work is therefore central to making the intellectual work economically rational, at least within present laws and politics. So biotechnology R&D organizations file a large number of patents.

Illustration of what a patent can look like; they consist of several pages and images and densely packed text describing the innovation — hundreds of pages are not unheard of.

Biotechnology Patents Dataset

I look into a set of patents that meet all three of the criteria below:

  1. Patent applications to the US Patent and Trademark Office (USPTO) published sometime between January 2015 and April 2021.
  2. Patent applications that are classified as A61P, which is defined as innovations related to “specific therapeutic activity of chemical compounds or medicinal preparations”.
  3. Patent applications that match a keyword search for “cancer” or “oncology”.

This returns in total 68,354 patents. All data for this collection is retrieved with Google Patents. Any reference to patents in the analysis refers to this dataset.
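The three selection criteria can be read as a simple filter over patent application records. The sketch below is only an illustration under stated assumptions: the record layout (the keys `publication_date`, `classifications` and `text`), the exact window boundaries, and the example records are all hypothetical, and the actual retrieval was done through Google Patents rather than with this code.

```python
from datetime import date

# Assumed window: full months from January 2015 through April 2021.
WINDOW_START = date(2015, 1, 1)
WINDOW_END = date(2021, 4, 30)
KEYWORDS = ("cancer", "oncology")


def matches_criteria(record):
    """Return True if a patent application record meets all three criteria.

    `record` is a hypothetical dict with the keys 'publication_date' (date),
    'classifications' (list of classification codes) and 'text' (str).
    """
    in_window = WINDOW_START <= record["publication_date"] <= WINDOW_END
    is_a61p = any(code.startswith("A61P") for code in record["classifications"])
    text = record["text"].lower()
    has_keyword = any(word in text for word in KEYWORDS)
    return in_window and is_a61p and has_keyword


# Made-up records, for illustration only.
records = [
    {"id": "app-1", "publication_date": date(2018, 6, 7),
     "classifications": ["A61P35/00", "C07K16/28"],
     "text": "An antibody for the treatment of cancer."},
    {"id": "app-2", "publication_date": date(2014, 3, 13),
     "classifications": ["A61P35/00"],
     "text": "A compound for use in oncology."},
    {"id": "app-3", "publication_date": date(2019, 1, 3),
     "classifications": ["A61K31/00"],
     "text": "A cancer vaccine."},
]

selected = [r["id"] for r in records if matches_criteria(r)]
print(selected)
```

Only the first record passes: the second is published before the window opens, and the third lacks an A61P classification.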

As will become evident further below, this dataset includes a host of cancer drug patents, such as biologics, but also smaller organic molecules.

Note that the dataset is made up of patent applications. That means the text describes a broader conception of the innovation than what eventually will become legally significant upon granting. For the purpose of what I seek to do here — mapping current and emerging beliefs about a key part of the bioeconomy — this is a benefit.

Anything deemed sufficiently valuable in the healthcare market, even if invented outside the United States, is very likely to be patented in the USA (the US makes up about 40% of the global market value). That is why limiting the analysis to patents at the USPTO is the least restrictive choice. Appending European, Japanese, or Chinese patents would make the dataset bulkier and complicate matters due to language differences, while only modestly increasing the ability to pick up trends or relations in innovation. Therefore, the above selection parameters are a good compromise between accuracy and effort.

Onwards With Analysis…

To start with I retrieve a few high-level trends. Afterward, I dive into the content of the patents.

Note that the analysis is guided by the data and semantic relations that emerge from it — deep domain knowledge is not an input. So the approach can also serve as a way to guide learning of advanced concepts in technical domains in commercial R&D. For select items and trends I will summarize the technical content, which may inspire the curious reader to further study these especially relevant subdomains.

Around 900 New Patent Applications Per Month

The next two charts show how nearly 70,000 patents are divided over the years and months.

Patents published per month. The month-to-month variability is amplified by the fact that the USPTO publishes patents in bulk once a week, so some months contain more publication dates than others.
Average published patents per month by year; 2021 is only partial.

The rate of patent applications is roughly constant, with at most a hint of annual growth.
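The figure of around 900 applications per month follows from the dataset size and the length of the publication window. A quick check, assuming both January 2015 and April 2021 count as full months:

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months in a window, counting both end points."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


TOTAL_PATENTS = 68354  # dataset size stated in the text

n_months = months_inclusive(2015, 1, 2021, 4)
average_per_month = TOTAL_PATENTS / n_months

print(n_months)                  # 76
print(round(average_per_month))  # 899
```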

Top Patent Assignees: Big & Medium Pharma Plus US Universities and Research Hospitals

Each patent has at least one assignee, that is, the organization that owns the patent. The assignee can change over time, and through licensing, organizations other than the assignee may be commercializing the innovation.

However, the organization or institution where the innovators were employed while conducting their work will in most cases be the assignee, or at least one of the assignees.

The forty assignees associated with the greatest number of patents are shown below.

Top-40 assignees of patents

Several of the big name-brand pharma companies (e.g. Novartis, Roche, Merck, Pfizer) are featured, but also some smaller and newer biotech companies. Also very prominent in the list are US universities (e.g. University of California, University of Pennsylvania) and research hospitals (e.g. Johns Hopkins, the General Hospital Corporation).

Assignees are legal entities. Especially for large, multinational, and diversified companies, there can be multiple assignees for what we may view as one company. Case in point: Bayer. They are a very large company doing a great deal more than drug development. There are at least seven legal entities related to Bayer represented (Bayer Pharma Aktiengesellschaft, Bayer Intellectual Property GmbH, Bayer Healthcare LLC, etc.). A detailed comparative analysis between research organizations would have to sort out the many-to-one relationships.
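One way to sort out these many-to-one relationships is a lookup table from legal entity to parent company. The sketch below is an assumption about how that could be done, not something the analysis here performs: the table is hand-written, the parent label "Bayer" is my own grouping choice, and the list of assignees is made up.

```python
from collections import Counter

# Hypothetical alias table. The keys are legal entities named in the text;
# in practice such a table would have to be curated by hand.
PARENT_OF = {
    "Bayer Pharma Aktiengesellschaft": "Bayer",
    "Bayer Intellectual Property GmbH": "Bayer",
    "Bayer Healthcare LLC": "Bayer",
}


def parent_company(assignee):
    """Map a legal entity to its parent; unknown entities map to themselves."""
    return PARENT_OF.get(assignee, assignee)


def count_by_parent(assignee_per_patent):
    """Count patents per parent company from one assignee name per patent."""
    return Counter(parent_company(name) for name in assignee_per_patent)


# Made-up input: one assignee name per patent.
assignees = [
    "Bayer Pharma Aktiengesellschaft",
    "Bayer Healthcare LLC",
    "Novartis AG",
    "Bayer Intellectual Property GmbH",
]
counts = count_by_parent(assignees)
print(counts["Bayer"], counts["Novartis AG"])  # 3 1
```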

The Up-and-Coming Assignees in Recent Years

Because the number of patents filed is relatively small for most organizations and institutions, any estimate of relative growth requires generous error bars, since small absolute numbers easily lead to noisy relative ones.
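To make the caveat concrete, here is a minimal sketch of how noisy a growth ratio becomes at low counts. It assumes filing counts behave like Poisson counts and uses the standard first-order approximation for the relative error of a ratio; both the model and the example counts are my assumptions, not part of the analysis.

```python
import math


def growth_ratio_with_error(earlier, later):
    """Growth ratio later/earlier with an approximate standard error.

    Assumes independent Poisson counts, for which the relative error of the
    ratio is roughly sqrt(1/earlier + 1/later). Both counts must be positive.
    """
    ratio = later / earlier
    relative_error = math.sqrt(1 / earlier + 1 / later)
    return ratio, ratio * relative_error


# Made-up counts: the same doubling at two very different scales.
small = growth_ratio_with_error(5, 10)
large = growth_ratio_with_error(500, 1000)
print(small)
print(large)
```

Both cases show a ratio of 2, but the error on the small counts is ten times larger, which is why a doubling from 5 to 10 filings says far less than a doubling from 500 to 1,000.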

With that caveat in mind, I describe three illustrative examples of recent high-growth organizations.

Immatics Biotechnologies GmbH, founded in Germany, has been steadily increasing its US patent filings. Their tagline includes “delivering the power of T cells”, and so far they have no approved drugs, but clearly they have ramped up efforts to legally protect their pre-clinical research.

Patents published per quarter for Immatics

City of Hope, a private not-for-profit research hospital in California, USA, appears to have increased its patent filings in the previous two years. The hospital's research is not limited to cancer.

This is also a good illustration that the assignee of a patent is not necessarily the organization that commercially applies the innovation. For example, in 2021 City of Hope exclusively licensed several of their patents to a spinoff company, CytoImmune Therapeutics. These IP relationships are not always readily apparent as parts of the agreements are confidential.

Patents published per quarter for City of Hope

Jiangsu Hengrui Medicine (恒瑞医药), founded in China, starts from a low baseline and has yet to appear among the top-40 patent filing organizations. But in the last two years they show a noticeable shift upwards in the rate of filings. This is the largest listed pharmaceutical company in China with a handful of approved drugs, and in recent years they have expanded operations a great deal outside China, including a new focus on biologics.

Patents published per quarter for 恒瑞医药

What About the Content of the Patents?

To figure out exactly what any single patent claims and the innovative step or novel composition of matter it deals with, deep reading is often called for. In some cases, the best and the brightest on the bench of the Supreme Court of the United States have to weigh in on the matter.

However, my goal here is not to analyze every single patent; rather, I aim to unearth trends and relations in a particular domain of innovation. As is often the case in data analysis and statistics, saying something with precision about an aggregate property is simpler than doing so for any individual item.

It is with this in mind that machine-parsed technical concepts prove very useful.

From the text of the various sections (claims, title, abstract, description) of a patent, technical concepts (or concepts for short) are detected and retrieved. This machine effort matches terms in the text against a dictionary of keywords. The matching is sophisticated in that it can account for synonyms and hierarchical relations.

For example, HER2, ERBB2, and CD340 are technical terms that refer to the same, or nearly the same, concept: a plasma membrane bound protein (or its gene) that has taken on a great deal of practical significance in cancer treatment, breast cancer especially. Two patents that use different terms for this concept are still matched to the same concept. Any of these terms is also matched to the broader concept of protein.

These many-to-many relationships can be encoded in so-called ontologies.
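A toy version of such a lookup could look as follows. This is only a sketch of the idea and not how Google Patents implements concept matching: the choice of ERBB2 as the canonical label is mine, and each concept has a single parent here, whereas a real ontology allows many-to-many relations.

```python
# Synonym table: every alias points to one canonical concept label.
SYNONYMS = {
    "HER2": "ERBB2",
    "ERBB2": "ERBB2",
    "CD340": "ERBB2",
}

# Hierarchy table: each concept points to its broader concept.
BROADER = {
    "ERBB2": "protein",
}


def concepts_for_term(term):
    """Return the canonical concept for a term, followed by broader concepts."""
    canonical = SYNONYMS.get(term.upper())
    if canonical is None:
        return []
    chain = [canonical]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


print(concepts_for_term("HER2"))   # ['ERBB2', 'protein']
print(concepts_for_term("CD340"))  # ['ERBB2', 'protein']
```

Two patents that use different aliases end up with the same concept list, which is what makes aggregate counting across patents possible.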

Graphical illustration of hierarchical relations between concepts with synonyms; pink arrows illustrate the many omitted subclasses or other type of relationships.

Though the machine process works very well in most cases, it is not perfect: not every alias for a technical concept is matched correctly, and some redundant terminology remains. The process is implemented as part of Google Patents and its outcome is available for retrieval.

The most common concepts in the dataset include: hydrogen, antibodies, sodium chloride, compounds, mixture, salts, cancer, antigens, proteins and genes, pharmaceutical composition, nucleic acids. In other words, very general concepts from the domain of molecular biology and medicine. Further down the list, more specific concepts appear. For example, the 315th most common concept is ERBB2, the 1998th most common concept is methylparaben, and the 3225th most common concept is T-cell leukemia.

A bit over 4,100 concepts each appear in the claims (and possibly other sections) of at least 0.25% of the patents under consideration. I characterize each patent in terms of the absence or presence of each of these 4,100 concepts. Additional concepts can be added, at the risk of greater statistical noise due to their lower occurrence in the dataset.

Graphical illustration of a patent-concept matrix, with each element a binary yes/no value to signify that a concept appears at least in the claims of the patent; the matrix is sparse with mostly “No” values.
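With 68,354 patents, the 0.25% threshold corresponds to a concept appearing in the claims of roughly 171 patents or more. The selection of frequent concepts and the binary matrix can be sketched as below. The input layout and the toy data are assumptions for illustration, and because the matrix is sparse, each row is stored as the set of concepts that are present rather than as a full row of yes/no values.

```python
from collections import Counter


def frequent_concepts(claims_concepts, min_fraction=0.0025):
    """Concepts found in the claims of at least `min_fraction` of the patents."""
    counts = Counter()
    for concepts in claims_concepts.values():
        counts.update(set(concepts))  # count each concept once per patent
    threshold = min_fraction * len(claims_concepts)
    return sorted(c for c, n in counts.items() if n >= threshold)


def binary_rows(claims_concepts, vocabulary):
    """Sparse binary matrix: patent id -> set of vocabulary concepts present."""
    vocab = set(vocabulary)
    return {pid: set(concepts) & vocab
            for pid, concepts in claims_concepts.items()}


# Made-up data: patent id -> concepts detected in its claims.
claims_concepts = {
    "p1": ["antibodies", "ERBB2", "cancer"],
    "p2": ["cancer", "guide RNA"],
    "p3": ["antibodies", "cancer"],
    "p4": ["cancer", "methylparaben"],
}

# A high threshold is used here only because the toy dataset is tiny.
vocabulary = frequent_concepts(claims_concepts, min_fraction=0.5)
rows = binary_rows(claims_concepts, vocabulary)
print(vocabulary)  # ['antibodies', 'cancer']
```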

Compute Concept Graphs, Derive Concept Groupings

The aggregate co-occurrence (or co-variance) of any pair of concepts in the collection of patents defines the weighted concept graph. This graph can be subject to different forms of graph analysis.
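In its simplest form, the edge weight between two concepts is the number of patents in which both appear. The sketch below counts such pairs from a sparse patent-concept matrix. It is a simplified stand-in under my own assumptions (raw counts rather than a co-variance, and made-up example rows), not the exact weighting used in my repository.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(rows):
    """Count, for every concept pair, the number of patents containing both.

    `rows` maps a patent id to the set of concepts present in its claims.
    Pairs are stored with the two concept names in sorted order.
    """
    pair_counts = Counter()
    for concepts in rows.values():
        for a, b in combinations(sorted(concepts), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Made-up rows, for illustration only.
rows = {
    "p1": {"CAR", "CD28", "HCST"},
    "p2": {"CAR", "CD28"},
    "p3": {"CD28", "nivolumab"},
}
edges = cooccurrence_counts(rows)
print(edges[("CAR", "CD28")])  # 2
```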

It is not my intent to get into the technical details of the analysis. In short, I quantify concepts for topological similarity within the concept graph and derive concept groupings using in part the t-SNE method for dimensionality reduction. The interested reader can check out my code with comments.
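To give a flavour of what topological similarity can mean, the sketch below compares two concepts by the cosine similarity of their weighted neighbourhoods in the graph. This particular measure is an assumption chosen for its simplicity, not necessarily the one in my code, and the toy edge weights are made up. The resulting pairwise distances are the kind of input a dimensionality reduction method such as t-SNE can work from.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def distance(graph, a, b):
    """Distance between two concepts: one minus neighbourhood similarity."""
    return 1.0 - cosine_similarity(graph.get(a, {}), graph.get(b, {}))


# Made-up weighted graph: concept -> {neighbouring concept: edge weight}.
graph = {
    "PD-1": {"nivolumab": 3, "PD-L1": 4},
    "TIGIT": {"nivolumab": 2, "PD-L1": 3},
    "methylparaben": {"sodium chloride": 5},
}

print(distance(graph, "PD-1", "TIGIT"))
print(distance(graph, "PD-1", "methylparaben"))
```

Concepts that connect to the same neighbours with similar weights end up close together, while concepts with no shared neighbours sit at the maximum distance of 1.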

I posit from theory and — as will be shown — practical outcomes, that semantically meaningful groupings follow. A map of more or less related concepts in a slice of the patent landscape emerges.

Two-dimensional reduction from t-SNE of concept distances in the concept graph for the patents.

The scattered points above correspond to concepts, and their relative placement relates to their derived topological similarity. More or less crisp groups of points are topologically similar to each other and to some degree topologically distinct from all the rest.

Now it is time to zoom in and take a guided tour of the smorgasbord of concepts. It turns out topological similarity is informative of the meaning of concepts and their place within the innovation landscape.

Immuno-Oncology Grouping: mAbs and Targets, Old and New

The grouping of concepts at the middle-left portion of the map is first explored. Zooming in…

Next, a GIF animation shows how I move interactively across the points to view the concept labels.

GIF illustration of interactively moving the mouse cursor over the points in the concept map.

We see clusters of some engineered monoclonal antibodies (mAbs), such as durvalumab and tremelimumab, and some other concepts like IDO1, CD274, TIGIT.

Their common denominator: immuno-oncology, also called cancer immunotherapy.

Among the 30 nearest concepts to this grouping we find:

This grouping of concepts makes sense. They all relate to immuno-oncology research, especially the so-called approach of immune-checkpoint blockade.

The immune-checkpoint blockade approach is built on the following base fact: Our immune systems can recognize some cancer cells as abnormal, and trigger an immune response that has the capacity to destroy the cancerous cells. That is the endogenous immune response to cancer. But, cancer cells in their turn have the capacity to dial back the immune cells — pacify them, so to speak. As an example, if the PD-1 receptor on T-cells is bound to its ligand PD-L1, the T-cell is inhibited. That is an interaction that helps regulate our own immune system such that it does not destroy our own healthy cells through so-called auto-immunity. Through its own fitness selection, the cancer cell exploits this native mechanism by expressing abundant PD-L1 on its surface making its microenvironment a zone of T-cell inhibition.

Therefore, an antibody that attaches to PD-1 (like nivolumab) and blocks the binding to PD-L1 will in turn inhibit the inhibition. The T-cell, or immune cell broadly speaking, proceeds to hack away at the cancerous cell.

As the reader understands, there is a great deal more to this story. Otherwise, why would this area of the bioeconomy be as active as it currently is? Among other things, there is buzz around the other proteins that the map revealed alongside PD-1, like TIGIT, IDO1, and LAG-3. They are considered as combination targets that may further refine the immune system response near tumours. As an illustration, this patent application claims a particular antibody sequence that targets TIGIT.

And the Trending Concepts Are…

Each patent is associated with a publication date and a set of concepts. Hence, the concepts that are becoming more common in recent years can be ascertained from the data.

Next I characterize what is trending in the three to five years leading up to the present and its place in the concept graph.
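One simple way to rank concepts by trend is to compute, per year, the share of patents whose claims contain the concept and then fit a straight line through those shares. The sketch below does exactly that. Using the share rather than the raw count, and a least-squares slope as the trend measure, are my assumptions for this illustration, and the data are made up.

```python
from collections import defaultdict


def yearly_share(patents, concept):
    """Share of patents per year whose claims contain `concept`.

    `patents` is an iterable of (publication_year, set_of_concepts) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, concepts in patents:
        totals[year] += 1
        if concept in concepts:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def least_squares_slope(shares):
    """Slope of the best-fit line through (year, share) points."""
    xs = list(shares.keys())
    ys = list(shares.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up data: four patents per year, with "A" becoming more common.
patents = (
    [(2015, {"A"})] + [(2015, set())] * 3
    + [(2016, {"A"})] * 2 + [(2016, set())] * 2
    + [(2017, {"A"})] * 3 + [(2017, set())]
)
shares = yearly_share(patents, "A")
trend = least_squares_slope(shares)
print(shares)
print(trend)
```

Sorting all concepts by this slope gives a list of candidates, which still needs the noise caveat for concepts with few occurrences.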

Immuno-Oncology Everywhere: Variations and Combinations

A majority of the top-30 growing concepts during the span of the dataset are ones from the previous list, and thus relate to immuno-oncology.

Durvalumab is at the top. It has become a benchmark that is listed in many of the patent claims of novel immuno-oncology treatments as a potential unit to be combined with. As an illustrative example, this patent claims a non-antibody fusion protein, for which the inventors particularly call out the possible combination therapy with durvalumab (claim 36).

Increasing count of concept durvalumab in patent claims.

The growth of durvalumab is in part a reflection of the belief that it, and mAb immune-checkpoint blockers in general, are likely to be a major part of cancer treatments in the coming decades by themselves or in a combination with other molecules.

Early Days of CRISPR and Cancer

CRISPR is a method in biological engineering that has created many new possibilities in genetic engineering and has been recognized by the Nobel Committee.

Cancer, however, has not been the first application of CRISPR as a therapeutic agent. It shows up in this dataset through the concept guide RNA (gRNA). By designing the gRNA, a particular section of a gene can be modified, which hopefully modulates the condition of a disease.

Increasing count of concept guide RNA in patent claims.

As an illustration, this patent relates to a gRNA that, once delivered to a cancerous tumour, messes with a particular cell cycle gene, like CDK1, such that the tumour cells divide and proliferate at a lower rate, making the tumour less capable of outgrowing whatever complementary treatment is deployed. The idea of inhibiting the cell division of tumorous cells is at the foundation of a number of already approved drugs, though all of them are small organic molecule drugs. Might a CRISPR therapeutic offer additional benefits? The patent filing trends suggest there is a belief it may, though it is early days still.

Occurring More Often Jointly: Cannabidiol and Cancer

Another concept that has experienced recent growth is cannabidiol, a naturally occurring organic molecule. The most recent changes this molecule has experienced are not scientific but regulatory. Some US states and Canada especially have in the last few years liberalized laws with respect to the natural plant source of cannabidiol, the cannabis plant or Cannabis sativa.

Increasing count of concept cannabidiol in patent claims.

An illustrative example is this patent, which describes possible signalling pathways in tumour growth that are inhibited by cannabidiol and related compounds.

The growth also illustrates the point that patents foremost reflect beliefs and expectations about commercial viability. It is not for me to say if the growth in this concept reflects a true untapped potential in the fight against cancer, to date thwarted by state authorities, nor is it the job of the patent examiner to verify that an innovation is working. What can be said confidently, however, is that following the change in regulations people with capital are more willing to make bigger bets on the cannabidiol slice of the bioeconomy, including as a treatment of cancer.

Chimeric Antigen Receptors and The Concept Connections to Costimulatory Concepts

Sorting on top trending concepts further reveals a handful of human proteins that are not in the immune blockade grouping, and which connect strongly to the concept chimeric antigen receptor (CAR) — another relatively recent addition to the repertoire of cancer treatments.

The concept graph quantifies, among other things, conditional co-occurrences. The image below is a snippet of the concept graph illustrating the six strongest incoming conditional connections to the CAR concept. Among patents whose claims include, say, HCST (also known as DAP10), 83% also contain the CAR concept in the claims. No other concepts have as strong conditional connections to CAR. Note however that other concepts co-occur more frequently with CAR (like CD28 and CD8), but not as selectively.

Concept graph snippet, showing inbound conditional relations to CAR. Dashed arrows to illustrate the numerous other inbound and outbound connections that have been omitted.
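The conditional connection is the fraction of patents containing one concept that also contain another. A minimal sketch, using made-up rows that are not drawn from the dataset, shows how a concept can be more selective for CAR while co-occurring with it less often:

```python
def conditional_cooccurrence(rows, given, target):
    """Fraction of patents containing `given` that also contain `target`."""
    with_given = [concepts for concepts in rows.values() if given in concepts]
    if not with_given:
        return None
    return sum(target in concepts for concepts in with_given) / len(with_given)


def joint_count(rows, a, b):
    """Number of patents whose claims contain both concepts."""
    return sum(a in concepts and b in concepts for concepts in rows.values())


# Made-up rows: patent id -> concepts in the claims.
rows = {}
for i in range(3):
    rows[f"h{i}"] = {"HCST", "CAR"}
rows["h3"] = {"HCST"}
for i in range(5):
    rows[f"c{i}"] = {"CD28", "CAR"}
for i in range(5):
    rows[f"d{i}"] = {"CD28"}

print(conditional_cooccurrence(rows, "HCST", "CAR"))  # 0.75
print(conditional_cooccurrence(rows, "CD28", "CAR"))  # 0.5
print(joint_count(rows, "HCST", "CAR"))               # 3
print(joint_count(rows, "CD28", "CAR"))               # 5
```

In this toy data CD28 appears together with CAR in more patents, yet HCST is the stronger conditional signal, which mirrors the distinction between frequency and selectivity made above.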

There are nowadays approved therapies based on so-called CAR-T cells. In simple terms, the therapy consists of re-engineering an immune cell, such as a T-cell or an NK-cell, taken from the blood of the cancer patient. The immune cell is genetically modified to express a protein domain on the exterior of the cell (the ectodomain) that very efficiently targets the known tumorous cells, and a protein domain inside the cell (the endodomain) that stimulates the immune cell into the desired action once the ectodomain binds to its target. It turns out the hurdle that had to be overcome before this idea became a successful therapy was the tuning of the intra-cellular signal. A host of different costimulatory subdomains, borrowed from the range of human proteins, are being explored as parts of the endodomain. These human protein subdomains therefore appear in the claims of several novel CAR designs that are in the process of patent prosecution.

An illustrative and admirably brief patent application is this one, which claims an engineered NK cell with particular combinations of domains as endodomain, including HCST (referred to as DAP10).

Three growing concepts related to CAR therapy; note the different scales of the vertical axes

…And So Much More

Clearly there is a lot of the concept graph and the derived map that I have not described. A few summary remarks:

  • The diffuse cloud of points near the centre of the map relates to broad general concepts (e.g. antibodies, cell, disease, cancer) that in themselves offer little insight and which are not semantically distinct enough to become groups.
  • Some groups relate to diseases that are distinct and only tangentially associated with cancer, like cognitive and neurological disease. They can show up in the map due to associations like chemo-induced peripheral neuropathy (a condition caused by cancer treatment with chemotherapy), or through catch-all enumerations of diseases or disorders to which a patented compound might, in a narrow slice of possible futures, become relevant (claim 52 of this patent application is an excellent illustration of the idiom to cover one’s bases).
  • A few larger groups relate to various functional chemical groups that are part of claims of organic molecules, or natural and non-natural amino acids. For an analysis of trends and relations in molecular motifs of drugs, this grouping could be informative.

But some concept groups I have ignored no doubt carry intriguing relationships to a person interested in the emerging slices of the bioeconomy. A good map can guide a manifold of paths at many levels of granularity for multitudes of purposes.

What I hope to have illustrated is that from a data-analytical approach applied to the large volumes of text in patents, it is possible to derive a semantically helpful structure without knowing a great deal about the particular domain to start with. From this structure, that is, the graph and the map, we can choose to further explore, understand, or invest in concepts that are presently considered of great potential value to the bioeconomy.

Anders Öhrn blends atoms and bits. In a pioneering startup in Vancouver, he helped make protein structure data, statistics, AI, and grit materialize into an antibody-drug scaffold nowadays used across Big Pharma. Thereafter he has around the world developed consumer electronics with embedded software and explored deep learning, where the forces of atoms and bits propel each other into creation.

Follow Bioeconomy.XYZ to learn more about all the ways biotech is shaping the world around us.




Anders Ohrn

Quantitative if possible, towards first principles, pragmatic always. Innovation, biology, computation & complexity.