Question Answering with PyTorch Transformers: Part 2

A simple vector index with scikit-learn

Paton Wongviboonsin
Jan 2, 2020
Photo by Robynne Hu on Unsplash

In Part 1 we briefly examined the problem of question answering in machine learning and how recent breakthroughs have greatly improved the quality of answers produced by computer systems.

Using the Transformers library’s pipeline API, we were able to run a pre-trained model in a few lines of code. In this article, we’ll prototype an information retrieval system around it. In later articles, we’ll turn that into web services that can be queried by browsers and mobile apps.

If you’re already familiar with vectorization and TF-IDF, feel free to skip ahead to the code.

Runnable notebook in Paperspace or static in GitHub

We can break this problem into two phases: first, finding the contexts relevant to the question; then, extracting answers from each context. We’ve already seen how to use Transformers’ pipeline interface to do the latter. A common way of tackling the first part is to build an inverted index from a forward index and then rank the results using a frequency-based algorithm. There are many robust solutions available as commercial or open-source products. Rather than leveraging any of them, in this post we’re going to focus on ranking and build a really simple system for didactic purposes. Later in the series we will address more practical options.

A naive approach might be to take each word in the query and search every document for exactly those words, perhaps dropping interrogative words like “who”, “what”, “when”, etc. However, relevant documents might use different phrasing or synonyms of those words, causing the algorithm to dismiss them prematurely. We could relax the criterion and require only a threshold of query words to match, or rank results by the number of matching words. But there are more issues with this approach.
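To make the baseline concrete, here is a rough sketch of such a naive word-overlap ranker (the stop-word list and documents are invented for the example):

# Naive ranking: score each document by how many non-stop query words it contains
STOP = {"who", "what", "when", "where", "the", "was", "that"}

def naive_rank(query, documents):
    terms = {w for w in query.lower().replace("?", "").split() if w not in STOP}
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    return sorted(scored, reverse=True)

docs = ["Edison invented the light bulb",
        "Light sails were proposed for spacecraft propulsion"]
naive_rank("Who invented the light bulb?", docs)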

First, how important is each word in the question and how much will it affect the results? There are exceptionally common words (referred to as “stop words”) that can be filtered out without affecting the meaning of the question. For instance, “Who was the person that invented the light bulb?” could be trimmed down to “person invented light bulb?”. While not grammatical, it still conveys the full meaning of the question.

How then would we treat two different documents, one missing “invented” and the other missing “bulb” while both contain “light”? One might be an article on pre-commercial versions of the light bulb while the other might be an article on the invention of light sails. In our query “bulb” is more important than “invented”.

An alternative approach is to transform each document into a mathematical representation that allows us to compute relatedness. Most of us are familiar with points in two and three dimensional spaces. Two points near each other are more “similar” than two points that are far apart. We can measure the distance between two points and use that as a metric of their closeness. By mapping a question to a point in space, we can fetch relevant documents by finding the points nearest to it. This concept works the same way in spaces with more than three dimensions. Using more dimensions gives us more options when separating groups that are related in some ways but differ in others.

At one extreme end of the spectrum, we can have a dimension for each word in our vocabulary. The value along an axis represents how important the word is to the meaning of the document. One useful way to quantify that importance is called “term frequency-inverse document frequency” (TF-IDF). I’m going to avoid going into it in much detail, since I won’t do as good a job of it as Wikipedia. The key takeaways are that how often a document mentions each term (TF) gives you an idea of what the subject is, and that terms which are uncommon across all documents (IDF) are more informative, since they are more “surprising” according to information theory.
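As a rough sketch of the idea (scikit-learn’s actual formula adds smoothing and normalization, so its numbers won’t match this exactly), the TF-IDF weight of a term in a document looks something like this:

import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                        # how often this document uses the term
    n_docs = sum(term in d for d in corpus)     # how many documents use it at all
    idf = math.log(len(corpus) / (1 + n_docs))  # rare terms get a larger boost
    return tf * idf

corpus = [["light", "bulb", "invented"],
          ["light", "sail", "spacecraft"],
          ["quick", "brown", "fox"]]
# "bulb" scores higher than the more common "light"
tf_idf("bulb", corpus[0], corpus), tf_idf("light", corpus[0], corpus)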

After we obtain vector representations for a document and a question, we can calculate a relevancy score by using a distance metric. There are many options with different properties, but we’re going to use the dot product for simplicity.
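In code, that relevancy score is just a dot product between the question vector and each document vector. The vectors below are toy values rather than real TF-IDF output:

import numpy as np

# Toy vectors over a 4-term vocabulary
question = np.array([0.0, 0.7, 0.0, 0.7])
doc_a = np.array([0.1, 0.6, 0.0, 0.5])  # shares two weighted terms with the question
doc_b = np.array([0.9, 0.0, 0.4, 0.0])  # shares none

question @ doc_a, question @ doc_b      # doc_a scores ~0.77, doc_b scores 0.0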

This is pretty easy to implement with scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ...
vectorizer = TfidfVectorizer(
    stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
tfidf = vectorizer.fit_transform(corpus)

TfidfVectorizer takes a list of paragraphs and turns each into a sparse vector. The indices of the vector correspond to words in a vocabulary that it builds from the distinct words in its input. Each distinct word corresponds to a dimension in the vector.

Let’s break it down with a small example before moving on to SQuAD.

%precision 1
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "how now brown cow",
    "the quick brown fox jumps over the lazy brown dog",
    "over the lazy river the cows jumped"
]
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(corpus)
vectorizer.vocabulary_, tfidf.todense()

output:

({'brown': 0,
'cow': 1,
'quick': 8,
'fox': 4,
'jumps': 6,
'lazy': 7,
'dog': 3,
'river': 9,
'cows': 2,
'jumped': 5},
matrix([[0.6, 0.8, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.6, 0. , 0. , 0.4, 0.4, 0. , 0.4, 0.3, 0.4, 0. ],
[0. , 0. , 0.5, 0. , 0. , 0.5, 0. , 0.4, 0. , 0.5]]))

TfidfVectorizer found ten distinct words across our three phrases. What happened to “the”, “how”, “now” and “over”? These words are so common that their IDF values are close to 0. They provide little information about the topic of the context, so there’s little point in counting them. We can tell TfidfVectorizer to ignore them with the stop_words argument. There were three other options omitted in this example: min_df, max_df and ngram_range. These will be useful on larger datasets, as we will see later.

Our corpus has been transformed into 3 sparse vectors of 10 elements each. We can invert the vectorization to see what TfidfVectorizer pays attention to:

[vectorizer.inverse_transform(vec)[0].tolist() for vec in tfidf]

Output:

[['brown', 'cow'],
['brown', 'quick', 'fox', 'jumps', 'lazy', 'dog'],
['lazy', 'river', 'cows', 'jumped']]

Note how word order is ignored, and each word is listed only once regardless of how many times it appeared.

Did you also notice how “cow”/“cows” and “jumps”/“jumped” are counted separately? These words differ only by their suffixes and impart roughly the same meaning on the phrase. Ideally we could ignore the suffix when it doesn’t matter. Some toolkits like NLTK will let you map a word to its stem, the part that doesn’t change across different usages. While this improves the situation, the vocabulary ends up with a lot of awkward entries like “improv”, “entr”, “produc”, etc.
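For example, NLTK’s Porter stemmer (assuming you have nltk installed) produces exactly that kind of truncated stem:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Different surface forms collapse to the same stem, but the stems aren't real words,
# e.g. ['improv', 'improv', 'improv', 'produc', 'product']
[stemmer.stem(w) for w in ["improved", "improving", "improves", "produces", "production"]]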

Lemmatization on the other hand, replaces words with their base forms or lemmas. These are the words you would see at the beginning of a dictionary entry. “Improved”, “improving” and “improves” would all get replaced by “improve”. However, “improvement” and “improvements” are nouns rather than verbs, so they would end up with a separate vocabulary entry.

The spaCy library provides an API for easy lemmatization.

import spacy

spacy.prefer_gpu()
sp = spacy.load('en')
[tok.lemma_ for tok in sp("over the lazy river the cows jumped")]

Output:

['over', 'the', 'lazy', 'river', 'the', 'cow', 'jump']

Later in the series we’ll see other ways that spaCy can help us.

Let’s explore the SQuAD 2.0 dataset in a little more detail than in the previous article.

import json
import random
with open("train-v2.0.json") as f:
    doc = json.load(f)

doc.keys(), type(doc["data"]), len(doc["data"])

Output:

(dict_keys(['version', 'data']), list, 442)

At the root level, we have a JSON object with a data array of topics. Each topic has multiple paragraphs which in turn have multiple question/answer pairs. The text of each paragraph is in the field “context”. Let’s extract all the paragraph texts and questions, into flat lists.

paragraphs = []
questions = []
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append(qa["question"])

len(paragraphs), len(questions), random.sample(paragraphs, 2), random.sample(questions, 5)

Output:

(19035,
86821,
["If the highest echelons of the governments also take advantage from corruption or embezzlement from the state's treasury, it is sometimes referred with the neologism kleptocracy. Members of the government can take advantage of the natural resources (e.g., diamonds and oil in a few prominent cases) or state-owned productive industries. A number of corrupt governments have enriched themselves via foreign aid, which is often spent on showy buildings and armaments.",
'The base of the stupa has 108 small depictions of the Dhyani Buddha Amitabha. It is surrounded with a brick wall with 147 niches, each with four or five prayer wheels engraved with the mantra, om mani padme hum. At the northern entrance where visitors must pass is a shrine dedicated to Ajima, the goddess of smallpox. Every year the stupa attracts many Tibetan Buddhist pilgrims who perform full body prostrations in the inner lower enclosure, walk around the stupa with prayer wheels, chant, and pray. Thousands of prayer flags are hoisted up from the top of the stupa downwards and dot the perimeter of the complex. The influx of many Tibetan refugees from China has seen the construction of over 50 Tibetan gompas (monasteries) around Boudhanath.'],
['Were Laserdiscs initially cheaper or more costly to produce than their VHS counterparts?',
'Who wrote "Baseball\'s Sad Lexicon"?',
'What happened to the shawn and the wooden cornet during the Baroque period?',
'Where did the state place on population chart?',
'When was the UNSCOP formed?'])

Since we’re randomly sampling questions from the global pool, they are not related to the contexts displayed. Unfortunately, some questions just don’t make sense without a context, but many of them are specific enough to be useful.

Before vectorizing, let’s lemmatize each context. This takes a while, so I suggest caching the result to a file.

%%time
import pandas as pd

def lemmatize(phrase):
    return " ".join([word.lemma_ for word in sp(phrase)])

lemmas = [lemmatize(par) for par in paragraphs]
df = pd.DataFrame(data={'context': paragraphs, 'lemmas': lemmas})
df.to_feather("squad_context.feather")

This took about 5 minutes on my desktop, which has a decently fast CPU and SSD. This isn’t the most efficient way to work with spaCy, but we’ll use it for now for the sake of simplicity.
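If the wait bothers you, one likely speed-up (a sketch, not something I’ve benchmarked here) is to stream the paragraphs through spaCy’s nlp.pipe in batches and disable the pipeline components that aren’t needed for lemmas:

# Batch the documents through spaCy instead of calling sp() one phrase at a time
lemmas = [
    " ".join(tok.lemma_ for tok in doc)
    for doc in sp.pipe(paragraphs, batch_size=64, disable=["parser", "ner"])
]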

Load the cache instead, if resuming:

df = pd.read_feather("squad_context.feather")
paragraphs = df.context
lemmas = df.lemmas

Now for vectorization:

vectorizer = TfidfVectorizer(
    stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
tfidf = vectorizer.fit_transform(lemmas)
len(vectorizer.vocabulary_)

Output: 37943

This only takes about 10 seconds on my computer — the same order of magnitude as loading the pre-trained BERT model, so caching the vectors to disk is less of a win. You can pickle tfidf and vectorizer if you wish.
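If you do want to cache them, a minimal pickle sketch looks like this (the filename is arbitrary):

import pickle

# Save the fitted vectorizer and the document-term matrix together
with open("tfidf_index.pkl", "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "tfidf": tfidf}, f)

# ...and restore them when resuming
with open("tfidf_index.pkl", "rb") as f:
    cached = pickle.load(f)
vectorizer, tfidf = cached["vectorizer"], cached["tfidf"]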

There are about 38 thousand terms in our vocabulary. That’s quite a bit, but it’s a common order of magnitude in this situation. Remember that each term in our vocabulary becomes a dimension in our vector space, so we’re dealing with 37943-dimensional vectors. Luckily, most of those values are 0, and sklearn is smart enough to only encode the non-zero values in each vector using a sparse matrix. The cost of computing over this structure is determined by how many non-zero elements we have rather than the full 38K dimensions.
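You can check just how sparse the matrix is; the exact figures depend on your parameters, but the fraction of non-zero entries should be tiny:

# Fraction of cells in the document-term matrix that are actually non-zero
density = tfidf.nnz / (tfidf.shape[0] * tfidf.shape[1])
tfidf.shape, tfidf.nnz, density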

Now, about those function parameters we left out of the smaller example. Aside from removing stop words, we want to remove words that show up in either very few or too many articles. We can use the min_df and max_df options respectively to set limits on the document frequency of each word. You can provide either an integer for an absolute limit or a float to specify a fraction of the corpus. Here, we’re saying we want words that appear in at least five different documents but in no more than 50% of all documents. You can play with these parameters to trade off between performance and accuracy.

Also, we don’t want to examine only words in isolation. There might be common sequences like “city hall”, “light bulb”, or “United States” that occur often enough to be relevant. By using ngram_range, we can tell TfidfVectorizer to also consider adjacent pairs or triplets of words. Most will occur too few or too many times to satisfy the frequency filters. However, we may catch things like “the United Nations” or “the late 1900's”. This is a poor stand-in for named-entity recognition, but it is significantly faster to index and should be sufficient with moderate to large sets of documents.
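You can peek at which multi-word terms survived the frequency filters; the exact entries will depend on the corpus and the parameters above:

# Vocabulary entries containing a space are the n-grams kept by ngram_range
ngrams = [term for term in vectorizer.vocabulary_ if " " in term]
len(ngrams), ngrams[:10]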

Time to ask some questions. First, we need to vectorize the question.

question = "Who is a notable exponent of pluralistic idealism?"
query = vectorizer.transform([lemmatize(question)])
(query > 0).sum(), vectorizer.inverse_transform(query)

Output:

(4, [array(['exponent', 'idealism', 'notable', 'pluralistic'], dtype='<U42')])

Then we can compare the vectorized query against all paragraphs in the corpus.

%%time
import numpy as np

scores = (tfidf * query.T).toarray()
results = np.flip(np.argsort(scores, axis=0))
[lemmas[i] for i in results[:3, 0]]

Output:

CPU times: user 5.93 ms, sys: 281 µs, total: 6.21 ms
Wall time: 5.41 ms
['pluralistic idealism such as that of Gottfried Leibniz take the view that there be many individual mind that together underlie the existence of the observed world and make possible the existence of the physical universe . unlike absolute idealism , pluralistic idealism do not assume the existence of a single ultimate mental reality or " Absolute " . Leibniz \' form of idealism , know as Panpsychism , view " monad " as the true atom of the universe and as entity have perception . the monad be " substantial form of being",elemental , individual , subject to -PRON- own law , non - interacting , each reflect the entire universe . monad be center of force , which be substance while space , matter and motion be phenomenal and -PRON- form and existence be dependent on the simple and immaterial monad . there be a pre - establ...

Considering that we searched ~20 thousand articles, 6 ms is reasonably fast. To be fair, 20k articles is a drop in the bucket compared to the 6 million articles currently on Wikipedia. At that rate, fetching candidates would take about 2 seconds, which might be acceptable in most circumstances. Indexing, however, would take quite a bit longer, especially since articles on Wikipedia tend to be longer than context blurbs in SQuAD. Still, there are ways to do better with the help of other libraries, but those are problems for another day.

To round it out and bring this article to a close, let’s run the results through a Transformers pipeline. Using a dataframe is quite unnecessary, but while prototyping in a Jupyter notebook, it helps keep related bits together.

from transformers import pipeline

qapipe = pipeline('question-answering',
                  model='distilbert-base-uncased-distilled-squad',
                  tokenizer='bert-base-uncased')

THRESH = 0.01
candidate_idxs = [ (i, scores[i]) for i in results[0:10, 0] ]
contexts = [ (paragraphs[i], s)
             for (i, s) in candidate_idxs if s > THRESH ]
question_df = pd.DataFrame.from_records([ {
    'question': question,
    'context': ctx
} for (ctx, s) in contexts ])
question_df.to_feather("question_context.feather")

Now for some answers.

%%time
preds = qapipe(question_df.to_dict(orient="records"))
answer_df = pd.DataFrame.from_records(preds) \
.sort_values(by="score", ascending=False)
answer_df.head()

Output (reformatted as ascii art):

+---+--------------------+-------+-----+-------------------+
| | score | start | end | answer |
+---+--------------------+-------+-----+-------------------+
| 0 | 0.9867653277212511 | 37 | 54 | Gottfried Leibniz |
| 4 | 0.8895841622191369 | 0 | 4 | Kant |
| 5 | 0.8794209005614064 | 200 | 216 | Woodrow Wilson's |
| 1 | 0.6990175464584958 | 106 | 117 | Descartes's |
| 7 | 0.6884047208304622 | 342 | 359 | Platonic idealism |
+---+--------------------+-------+-----+-------------------+

CPU times: user 16.7 s, sys: 1.28 s, total: 18 s
Wall time: 2.35 s

Clearly, only one of these is the correct answer. Why did we get the other results? Only the first article is about “pluralistic idealism”; the remaining documents concern other types of idealism. They were partial matches, but since each had a “notable exponent”, the model did its best to extract an answer from the texts it was given.

We can see from the score that Transformers was most confident about the first answer, but what does this score really mean and can we actually use it to rank answers?

Those are questions for another day…

Continue to Part 3
