Advanced Model # 2

Published in

NLP Capstone Blog

9 min read · May 16, 2018

As explained in our previous blog post, our current challenge involves constructing the dataset efficiently and effectively. This post details the progress we made over the past week on data construction and the challenges we faced. We also briefly outline our plan for the next week.

Training

We’ve attempted training on a small subset of the data in the format the model should expect, and ran into an issue with batching.

Until now, we backpropagated once per batch. By that point, however, the latent representations of every sentence and of the document have already been computed, meaning the RNNs have encoded every word in the document before any parameters are updated. This, in conjunction with the series of affine transformations applied to each sentence, will likely produce a computation graph too large to backpropagate through with the memory we have.

We’ve now switched to backpropagation once per sentence, which will hopefully lead to faster learning.

New Heuristics

Last week we described a method of BIO tagging sentences using ROUGE. While this method might produce good tags, we found ROUGE incredibly slow to run. Since we would run ROUGE once for every sentence in every document, we chose to develop a different heuristic that works like ROUGE but is much faster. This led us to experiment with two new approaches, detailed below.

Skip-bigrams

While learning about ROUGE, we came across a variant called ROUGE-SU, where SU stands for skip-bigrams and unigrams. A skip-bigram is any pair of words taken from a sentence in sentence order, with any number of words allowed in between. In other words, the skip-bigrams of a sentence are all the ordered word pairs it contains; for example, “the cat sat” yields (the, cat), (the, sat), and (cat, sat). For our first approach, we use the same greedy algorithm described in previous blog posts, except that we try to maximize the overlap between the reference’s skip-bigrams and those of the set of sentences we choose to extract. We implemented this functionality from scratch.
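The skip-bigram overlap and greedy selection described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the names `skip_bigrams`, `greedy_select`, and `max_sentences` are hypothetical, tokens are assumed to be pre-split words, skip-bigrams are treated as a set (no counts), and no limit is placed on the skip distance.

```python
from itertools import combinations


def skip_bigrams(tokens):
    """Return the set of ordered word pairs (w_i, w_j) with i < j."""
    # combinations() preserves input order, so each pair follows sentence order.
    return set(combinations(tokens, 2))


def greedy_select(sentences, reference, max_sentences=3):
    """Greedily pick sentences that add the most new overlap with the reference.

    `sentences` is a list of token lists and `reference` is a token list.
    Returns the indices of the chosen sentences in selection order.
    """
    ref_pairs = skip_bigrams(reference)
    covered = set()  # reference skip-bigrams matched so far
    chosen = []
    while len(chosen) < max_sentences:
        best_idx, best_gain = None, 0
        for idx, sent in enumerate(sentences):
            if idx in chosen:
                continue
            # Count only reference pairs that are not already covered.
            gain = len((skip_bigrams(sent) & ref_pairs) - covered)
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx is None:  # no remaining sentence improves the overlap
            break
        chosen.append(best_idx)
        covered |= skip_bigrams(sentences[best_idx]) & ref_pairs
    return chosen


reference = "nitric oxide regulates blood flow".split()
sentences = [
    "carbon nanotubes are strong".split(),
    "nitric oxide regulates vascular tone".split(),
    "blood flow is restricted".split(),
]
print(greedy_select(sentences, reference))
```

Because every sentence of n words produces n(n-1)/2 pairs, this is cheap for single sentences but grows quickly for long references.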

Cosine Similarity

Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. To apply it to text, each sequence is first represented as a vector. A value tending toward 1 means that the two pieces of text are more similar, while smaller values mean they have less in common. In NLP, this metric is commonly used as a bag-of-words comparison: the words of both sequences are combined into a master set, each sequence is mapped to a frequency vector whose dimensionality equals the size of that set, and the cosine similarity is computed between the two frequency vectors. Weighting the counts by tf-idf gives the common tf-idf variant of the same measure.

Currently, we’re using spaCy’s implementation of cosine similarity.
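For reference, the bag-of-words definition above can be written in a few lines of plain Python. This is only a sketch of that definition, assuming pre-split tokens and raw counts; the function name is hypothetical, and it is not spaCy's implementation, which by default compares averaged word vectors rather than word counts and will therefore give different scores.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sequences contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sequence has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("blood flow".split(), "blood pressure".split()))
```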

Example Data

When we are sufficiently strict (e.g., requiring a skip-bigram intersection of at least 10 together with a cosine similarity of at least 0.94), we can get promising matches:

PAPER: Mechanisms of NO/cGMP-Dependent Vasorelaxation
TERMs: {‘omega-Nitro-L-Arginine, N’, ‘NO2Arg’, ‘N omega Nitro L Arginine’, ‘L-NNA’, ‘Nitroarginine [Chemical/Ingredient]’, ‘NG-Nitro-L-Arginine’, ‘omega-Nitroarginine’, ‘NG-nitro-L-arginine’, ‘N(omega)-Nitroarginine’, ‘NOLA’, ‘N omega-Nitro-L-Arginine’, ‘N OMEGA NITROARGININE L’, ‘omega Nitroarginine’, ‘NOARG’, ‘NG-Nitroarginine’, ‘N(G)-Nitroarginine’, ‘NG NITROARGININE L’, ‘NG Nitro L Arginine’, ‘Nitroarginine’, ‘NG Nitroarginine’}
TERM FOUND: True
REFERENCE: An amino acid derivative and nitric oxide synthase (NOS) inhibitor with potential antineoplastic and antiangiogenic activities. Upon administration, NG-nitro-L-arginine inhibits the enzyme nitric oxide synthase, thereby preventing the formation of nitric oxide (NO). By preventing NO generation, the vasodilatory effects of NO are abrogated leading to vasoconstriction, reduction in vascular permeability and an inhibition of angiogenesis. As blood flow to tumors is restricted, this may result in an inhibition of tumor cell proliferation. NO plays an important role in tumor blood flow and stimulation of angiogenesis, tumor progression, survival, migration and invasiveness.
CHOSEN: [‘NO coordinates the blood-flow distribution between arterioles and the microvasculature by regulating the diameter of small arteries.7 The importance of NO and cGMP for the regulation of vascular tone and blood pressure has been recently strengthened by the observation that mice deficient in eNOS, ANP, the ANP receptor guanylyl cyclase A, or cGKI develop hypertension.2–6,17’]
MAX SKIPGRAM MATCHES: 38
MAX COSINE SIMILARITY: 0.9430605549205237

The ‘CHOSEN’ array contains a sentence that was extracted with our heuristics. Like the reference, it mentions nitric oxide, blood flow, and vascular regulation. The goal is to train the model on instances such as this one so that it learns to extract sentences of similar relevance.

We should also note that a single definition can cover multiple terms (in other words, a term may have multiple synonyms). The extracted sentence ‘NO coordinates the blood-flow distribution between arterioles …’ contains none of the listed terms explicitly, and yet it aligns well with the definition and is indicative of it. Robustness in recognizing a given term and its synonyms, along with the ability to extract sentences about that term even when the term itself is absent, is extremely desirable for us. A model that can recognize sentences about a particular technical term without the term being present would be an especially helpful research tool.

Unfortunately, the heuristic is not perfect and can be led astray:
“title”: “On the influence of various physicochemical properties of the CNTs based implantable devices on the fibroblasts’ reaction in vitro”
},
“e_gold”: “ A record of something that is being done, has been done, can be done, or is intended or requested to be done. Examples: The kinds of acts that are common in health care are (1) a clinical observation, (2) an assessment of health condition (such as problems and diagnoses), (3) healthcare goals, (4) treatment services (such as medication, surgery, physical and psychological therapy), (5) assisting, monitoring or attending, (6) training and education services to patients and their next of kin, (7) and notary services (such as advanced directives or living will), (8) editing and maintaining documents, and many others. Discussion and Rationale: Acts are the pivot of the RIM; all domain information and processes are represented primarily in Acts. Any profession or business, including healthcare, is primarily constituted of intentional and occasionally non-intentional actions, performed and recorded by responsible actors. An Act-instance is a record of such an action. Acts connect to Entities in their Roles through Participations and connect to other Acts through ActRelationships. Participations are the authors, performers and other responsible parties as well as subjects and beneficiaries (which includes tools and material used in the performance of the act, which are also subjects). The moodCode distinguishes between Acts that are meant as factual records, vs. records of intended or ordered services, and the other modalities in which act can appear. One of the Participations that all acts have (at least implicitly) is a primary author, who is responsible of the Act and who \”owns\” the act. Responsibility for the act means responsibility for what is being stated in the Act and as what it is stated. Ownership of the act is assumed in the sense of who may operationally modify the same act. Ownership and responsibility of the Act is not the same as ownership or responsibility of what the Act-object refers to in the real world. 
The same real world activity can be described by two people, each being the author of their Act, describing the same real world activity. Yet one can be a witness while the other can be a principal performer. The performer has responsibilities for the physical actions; the witness only has responsibility for making a true statement to the best of his or her ability. The two Act-instances may even disagree, but because each is properly attributed to its author, such disagreements can exist side by side and left to arbitration by a recipient of these Act-instances. In this sense, an Act-instance represents a \”statement\” according to Rector and Nowlan (1991) [Foundations for an electronic medical record. Methods Inf Med. 30.] Rector and Nowlan have emphasized the importance of understanding the medical record not as a collection of facts, but \”a faithful record of what clinicians have heard, seen, thought, and done.\” Rector and Nowlan go on saying that \”the other requirements for a medical record, e.g., that it be attributable and permanent, follow naturally from this view.\” Indeed the Act class is this attributable statement, and the rules of updating acts (discussed in the state-transition model, see Act.statusCode) versus generating new Act-instances are designed according to this principle of permanent attributable statements. Rector and Nolan focus on the electronic medical record as a collection of statements, while attributed statements, these are still mostly factual statements. However, the Act class goes beyond this limitation to attributed factual statements, representing what is known as \”speech-acts\” in linguistics and philosophy. The notion of speech-act includes that there is pragmatic meaning in language utterances, aside from just factual statements; and that these utterances interact with the real world to change the state of affairs, even directly cause physical activities to happen. 
For example, an order is a speech act that (provided it is issued adequately) will cause the ordered action to be physically performed. The speech act theory has culminated in the seminal work by Austin (1962) [How to do things with words. Oxford University Press]. An activity in the real world may progress from defined, through planned and ordered to executed, which is represented as the mood of the Act. Even though one might think of a single activity as progressing from planned to executed, this progression is reflected by multiple Act-instances, each having one and only one mood that will not change along the Act-instance life cycle. This is because the attribution and content of speech acts along this progression of an activity may be different, and it is often critical that a permanent and faithful record be maintained of this progression. The specification of orders or promises or plans must not be overwritten by the specification of what was actually done, so as to allow comparing actions with their earlier specifications. Act-instances that describe this progression of the same real world activity are linked through the ActRelationships (of the relationship category \”sequel\”). Act as statements or speech-acts are the only representation of real world facts or processes in the HL7 RIM. The truth about the real world is constructed through a combination (and arbitration) of such attributed statements only, and there is no class in the RIM whose objects represent \”objective state of affairs\” or \”real processes\” independent from attributed statements. As such, there is no distinction between an activity and its documentation. Every Act includes both to varying degrees. For example, a factual statement made about recent (but past) activities, authored (and signed) by the performer of such activities, is commonly known as a procedure report or original documentation (e.g., surgical procedure report, clinic note etc.). 
Conversely, a status update on an activity that is presently in progress, authored by the performer (or a close observer) is considered to capture that activity (and is later superceded by a full procedure report). However, both status update and procedure report are acts of the same kind, only distinguished by mood and state (see statusCode) and completeness of the information. “,
“entity”: “act”,
“extracted”: [
“Since their discovery in 1952, carbon nanotubes (CNTs) have been attracting increasing attention in being applied in various areas of materials science due to their outstanding mechanical properties, high chemical and thermal stability and, in some cases, very good conductivity via an electron transfer.”,
“Thus, at that time point, differences in fibroblasts’ proliferation rate may have been governed by different chemical composition of the samples and an increased amount of COOH species in the CNT_ox [28].”
],

Note that because the script selected this example, the cosine similarity between the extracted sentences and the reference must have been above 0.93. Not only is cosine similarity too generous as a heuristic, but the reference is also so long that almost any sentence will share more than enough skip-bigrams with it to reach our skip-bigram threshold.
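To get a feel for why a long reference clears a fixed skip-bigram threshold so easily, it helps to count how many skip-bigrams a text of a given length can contain. The snippet below is a back-of-the-envelope sketch: the function name and the token counts are hypothetical and were not measured on our data, and the formula is an upper bound, since repeated words reduce the number of distinct pairs.

```python
def num_skip_bigrams(n_tokens):
    """Number of ordered word pairs (i < j) in a sequence of n_tokens words."""
    return n_tokens * (n_tokens - 1) // 2


# Hypothetical lengths: a single sentence, a short definition, a very long one.
for n in (20, 100, 1000):
    print(n, num_skip_bigrams(n))
# 20 190
# 100 4950
# 1000 499500
```

The pool of reference pairs grows quadratically with length, so a threshold of 10 matches is a far weaker requirement for a long reference than for a one-sentence definition.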

Examples like this one are concerning, but the hope is that helpful examples like the NO example outnumber the noise that makes it past our heuristics.

Going Forward

Our goals for the next week include fine-tuning our data collection thresholds to maximize the quality of our dataset, training the model, and hopefully producing results. As we mentioned in our previous blog post, the model is ready for training, but the real challenge may lie in how we produce our dataset. In the upcoming week, we expect to experiment with new heuristics for sentence similarity and to tweak the existing ones in order to produce the best dataset we can.