Answer scoring

Paton Wongviboonsin
11 min read · Jan 3, 2020

Welcome back! This is the third part of an ongoing series about building a question answering service using the Transformers library. The prior article looked at using scikit-learn to build an indexing service for fetching relevant articles to feed into Transformers.

This time we’ll start working with the library in more depth. In this article we’re going to peel back a layer to examine the inner workings of the Transformers question answering pipeline. Then we’ll use the model API to build our own pipeline. Finally, we’ll wrap it all up in a simple Flask service that can be accessed over a network. A little familiarity with PyTorch is helpful but not strictly necessary to follow along.

Runnable notebook on Paperspace, or a static version on GitHub

In Part 2, we used the pipeline API in the Transformers library as a black box that took our questions and spat out answers ranked by scores.

When did the last country to adopt the Gregorian calendar start using it?

+----+-----------+--------+------+--------------------------+
| | score | start | end | answer |
+----+-----------+--------+------+--------------------------+
| 1 | 0.973023 | 93 | 98 | 1923, |
| 8 | 0.920178 | 473 | 489 | 15 October 1582. |
| 3 | 0.809257 | 324 | 334 | 1 January, |
| 0 | 0.641907 | 441 | 450 | 1 January |
| 2 | 0.547818 | 566 | 590 | Friday, 15 October 1582, |
+----+-----------+--------+------+--------------------------+

What does the score mean? Is “15 October 1582” three times as wrong as “1923” or is it just 95% as right? How is it calculated?

To answer that we’re going to have to understand the pieces of the pipeline and to do that, let’s first look at a simple example from the huggingface documentation.

# imports needed for this example
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# 1
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering \
    .from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
# 2
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
input_text = "[CLS] " + question + " [SEP] " + text + " [SEP]"
# 3
input_ids = tokenizer.encode(input_text)
token_type_ids = [0 if i <= input_ids.index(102) else 1
                  for i in range(len(input_ids))]
# 4
start_scores, end_scores = model(torch.tensor([input_ids]),
                                 token_type_ids=torch.tensor([token_type_ids]))
# 5
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]))

Output: a nice puppet

  1. Loads a pre-trained tokenizer and a pre-trained question answering model.
  2. Embeds the question and context into a single string delimited by “[SEP]”.
  3. Encodes the query and generates a 0/1 array (the token type IDs) which marks which tokens are part of the question and which are part of the context.
  4. Runs the model to get two arrays with the same length as input_ids. Each element in start_scores is an activation value indicating how likely it is that the token is the start of the answer. Similarly, end_scores marks the end of the answer.
  5. We can extract the answer by slicing between the strongest activations in start_scores and end_scores.

I’m referring to the outputs of the neural network as “activation values” and “activations” as shorthand, and to distinguish them from other common forms of output in computing like printouts and graphs.

Tokenization is the process of breaking the phrase into smaller chunks. Traditionally, this has meant splitting a sentence into words. However, with these models it is more helpful to split words into pieces. BERT tokenizers usually split words into stems and suffixes, but they will try to treat unknown words as compound words, breaking them into smaller known pieces.

BERT-based architectures use different tokenization strategies, so when using pre-trained models, it is important to use the same process that the model was trained with. Moreover, encoding adds an additional step of converting each token to a numeric value based on the tokenizer’s vocabulary. This is so we can feed fixed size numeric tensors into the model rather than variable length strings for each token.
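To make this concrete, here’s a quick peek at what the bert-base-uncased tokenizer does to a sentence, using the tokenizer loaded above. The exact pieces and IDs depend on the vocabulary, so your output may differ slightly.

tokens = tokenizer.tokenize("Jim Henson was a puppeteer")
print(tokens)
# common words stay whole, rarer words are split into pieces marked with "##",
# e.g. something like ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
print(tokenizer.convert_tokens_to_ids(tokens))
# the integer IDs from the vocabulary that the model actually consumes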

Let’s adapt this example to evaluate contexts we fetched out of Part 2. In Part 2, we combined the questions and contexts into a dataframe and cached it to disk. Let’s open it up in a new notebook and work with it from there.

import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForQuestionAnswering

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering \
    .from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad') \
    .to(device)
question_df = pd.read_feather("question_context.feather")
question_df

Output:

<Sorry, too ugly for a blog. Go check out the notebook...>

I’m opting to use GPU acceleration if it’s available by moving the model with .to(device), but this isn’t strictly necessary.

Fetch just one context to evaluate.

question, context = question_df[["question", "context"]].iloc[1]
question, context

Output:

('When did the last country to adopt the Gregorian calendar start using it?',
'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 Febru...

Combine the question and context into a single string and encode it:

input_text = "[CLS] " + question + " [SEP] " + context + " [SEP]"
input_ids = tokenizer.encode(input_text, add_special_tokens=False)
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
input_ids[:10], token_type_ids[:20]

I’m passing add_special_tokens=False to the tokenizer, otherwise it will insert duplicate [CLS] and [SEP] tokens.

A small annoyance with PyTorch: we need to make sure we place our model and our data on the same device, then move the results back to CPU memory. We can instantiate the input tensors directly on the target device rather than creating them in main memory and then moving them.
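As a quick illustration of that last point (reusing input_ids and device from above):

# allocates on the CPU first, then copies the tensor over to the GPU
x = torch.tensor([input_ids]).to(device)
# allocates directly on the target device, skipping the extra copy
x = torch.tensor([input_ids], device=device)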

with torch.no_grad():
    start_scores, end_scores = model(
        torch.tensor([input_ids], device=device),
        token_type_ids=torch.tensor([token_type_ids], device=device))

all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(' '.join(all_tokens[torch.argmax(start_scores):
                          torch.argmax(end_scores) + 1]))
print(f'score: {torch.max(start_scores)}')

Output:

1923
score: 8.123082160949707

Well, this is different. This is the same answer we got from the pipeline API, but back then the score was a number between 0.0 and 1.0.

sns.distplot(start_scores.cpu())

In this histogram, we can see that the bulk of the start_scores lie between -10 and -5. This is a result of the loss function used during training, which applies the softmax function and calculates the cross entropy against the true start position.

The softmax squishes the range of values to be between 0 and 1 and normalizes the outputs so that they sum to 1. Put another way, it turns the activation values into probability values. Negative activations become close to 0 while strong positive values are closer to 1. It also inflates the relative distance between the largest points non-linearly. Compare the previous chart with 4 points in the interval (5.0, 10.0) to the one below which has 2 points in the same region.
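To see that squishing in isolation, here’s a tiny example on made-up activation values (not actual model outputs), using the torch.nn.functional import (F) from the setup cell:

activations = torch.tensor([8.1, 6.1, 4.8, -2.0, -7.5])
print(F.softmax(activations, dim=0))
# roughly [0.85, 0.12, 0.03, 0.0000, 0.0000]: a gap of only 2 activation units
# between the top two values becomes a ~7x gap in probability, while the
# negative activations are squashed to essentially zero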

When our network suffers from a crisis of confidence, we are unlikely to see multiple large positive values. Instead most will be close to zero or below, like this:

Let’s demonstrate this with another example.

question, context = question_df[["question", "context"]].iloc[0]
question, context

Output:

('When did the last country to adopt the Gregorian calendar start using it?',
'"Old Style" (OS) and "New Style" (NS) are sometimes added to dates to identify which system is used in the British Empire and other countries that did not immediately change. Because the Calendar Act of 1750 altered the start of the year, and also aligned the British calendar with the Gregorian cale...

This passage is about adoption of the Gregorian calendar, but there is no mention of Britain being the last country. The model detects that the question is asking about a date or year so it focuses on “1750”.

# re-encode for the new question/context pair before running the model
input_ids = tokenizer.encode("[CLS] " + question + " [SEP] " + context + " [SEP]",
                             add_special_tokens=False)
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
with torch.no_grad():
    start_scores, end_scores = model(
        torch.tensor([input_ids], device=device),
        token_type_ids=torch.tensor([token_type_ids], device=device))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
sns.distplot(start_scores.cpu(), kde=False, rug=True)

However, it’s not so confident that this answers the question. All the values are below 0, but there’s still a maximum. If you were forced to assume that the context contained the right answer, you would still assign a non-zero probability to some position inside it.

Back to the pipeline API. The QuestionAnsweringPipeline class takes these scores and applies the softmax function to them. It also does some fancy matrix manipulation to calculate the indices of the answer. The takeaway, however, is that the scores for all excerpts from the passage sum to 1. If we think of this in terms of probability, this is essentially saying that the answer must be one of those spans. To put it another way, the pipeline assumes the answer is in the passage. (Side note: I need to look into how the training loop handles questions marked “impossible”.)
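As a rough sketch of that span-scoring idea (a simplification, not the pipeline’s actual code), you can turn the two distributions into per-span probabilities like this:

# softmax each distribution, then score every (start, end) pair by the
# product of probabilities, keeping only spans where end >= start
p_start = F.softmax(start_scores[0], dim=0)              # shape: (seq_len,)
p_end = F.softmax(end_scores[0], dim=0)                  # shape: (seq_len,)
span_scores = p_start.unsqueeze(1) * p_end.unsqueeze(0)  # (seq_len, seq_len)
span_scores = torch.triu(span_scores)                    # zero out end < start
best = torch.argmax(span_scores).item()                  # flattened index of best span
start_idx, end_idx = divmod(best, span_scores.shape[1])

The real pipeline also masks out the question tokens and caps the answer length, but the idea is the same.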

While the training objective is not designed to produce activation values that can be meaningfully compared across different question/context pairs, we’re going to do exactly that. After all, we’re being pragmatists here instead of purists. Also, we’re going to use softmax on the start_scores values across all the contexts fetched by our indexer. This assumes that the answer is in one of them, which isn’t guaranteed, especially with our simplistic algorithm. It’s not “correct” but it’s good enough for now.

First, let’s encode the question/context pairs.

question_df["encoded"] = question_df.apply(
lambda row: tokenizer.encode("[CLS] " + row["question"]
+ " [SEP] " + row["context"] + " [SEP]",
add_special_tokens=False), axis=1)
question_df["tok_type"] = question_df.apply(
lambda row: [0 if i <= row["encoded"].index(102) else 1
for i in range(len(row["encoded"]))], axis=1)

Then run a batch through the model:

%%time
with torch.no_grad():
    X = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(row) for row in question_df["encoded"]],
        batch_first=True).to(device)
    T = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(row) for row in question_df["tok_type"]],
        batch_first=True).to(device)
    start_scores, end_scores = model(X, token_type_ids=T)
    max_score, max_start = torch.max(start_scores, axis=1)
    soft_max = F.softmax(max_score, dim=0)

We’re using the pad_sequence utility function, which was originally intended for working with recurrent networks. It takes a list of vectors of unequal size and packs them into a matrix. The new dimension is sized according to the longest vector. When shorter vectors are packed into the matrix, the excess space is filled with a padding value. (Caution: the default padding value for pad_sequence is 0, but not all BERT models accept 0 as padding. Check the documentation for the corresponding tokenizer.)
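If your tokenizer’s padding token isn’t 0, you can pass it explicitly. A minimal tweak to the batching cell above, using the pad_token_id attribute the Transformers tokenizers expose:

X = torch.nn.utils.rnn.pad_sequence(
    [torch.tensor(row) for row in question_df["encoded"]],
    batch_first=True,
    padding_value=tokenizer.pad_token_id).to(device)  # 0 for bert-base-uncased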

While that matrix does have a lot of wasted space, passing everything as one chunk of memory to the GPU speeds up the run time overall. (Another caution: if you’re targeting CPU, it might make more sense to run each example sequentially and stream results to the user).

With real data, we’ll need to be cautious of really long documents. The tokenizer can truncate them if we pass a max_length, but otherwise we’re limited by the underlying model, and BERT can only handle 512 tokens. One method is to cut the document into overlapping chunks and add each as a new row, as sketched below. This can be done at the database level or when generating a batch to pass to the model.
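Here’s a minimal sketch of that chunking idea, operating on already-encoded token IDs with hypothetical window and overlap sizes; a real version would also need to re-attach the question tokens and map answer offsets back to the original text.

def chunk_token_ids(token_ids, window=384, overlap=128):
    """Split a long list of token IDs into overlapping windows."""
    chunks = []
    step = window - overlap
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks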

Let’s pack everything into another dataframe to keep things organized.

answer_df = question_df[["context", "encoded"]].copy()
answer_df["answer_score"] = max_score.cpu().numpy()
answer_df["answer_start"] = max_start.cpu().numpy()
answer_df["answer_softmax"] = soft_max.cpu().numpy()

max_len = torch.zeros_like(max_start)
for i in range(max_start.shape[0]):
    max_len[i] = torch.argmax(end_scores[i, max_start[i]:]) + 1

answer_df["answer_length"] = max_len.cpu().numpy()
answer_df = answer_df[answer_df.answer_score > 1.0] \
    .sort_values(by="answer_score", ascending=False)
answer_df.head()

Now we can decode the answer for each row:

def decode_answer(row):
    input_ids = row.encoded
    offset = row.answer_start
    length = np.clip(row.answer_length, 0, 20)
    return tokenizer.decode(input_ids[offset:][:length])

answer_df["answer"] = answer_df.apply(decode_answer, axis=1)
answer_df[["answer_softmax", "answer_score", "answer"]].head()

Output:

+----+-----------------+---------------+-----------------+
| | answer_softmax | answer_score | answer |
+----+-----------------+---------------+-----------------+
| 1 | 0.834725 | 8.141849 | 1923 |
| 8 | 0.109714 | 6.112623 | 15 october 1582 |
| 2 | 0.029141 | 4.786901 | 4 october 1582 |
| 6 | 0.013237 | 3.997773 | 1 january 1926 |
| 3 | 0.010656 | 3.780912 | 15 october 1582 |
+----+-----------------+---------------+-----------------+

We’ve managed to make use of the start_scores values in a way they weren’t intended to be used. This is good enough for a first pass, but the right way to handle this is to train another BERT model to tell us how much an answer makes sense given the context (a special form of entailment). This is something I might cover later in the series, but at the moment, I don’t see any pretrained models that will do this for us.

Thanks to the huggingface team, we’ve come a pretty long way without having to get too far into the weeds in terms of defining custom layers in PyTorch or writing training loops.

It’s time we wrapped everything into a simple web service, which you can find at questionable.py. After wrapping the code cells above in functions, we can create a Flask server with a simple handler:

import os

from flask import Flask, request

app = Flask(__name__)

@app.route('/answer')
def answer():
    question = request.args.get("q")
    if not question:
        return "Query parameter 'q' is required\n", 400

    contexts, query = fetch_contexts(question, debug=True)
    if len(query) < 1 or query[0].size < 1:
        return "Ask a better question\n", 400

    if len(contexts) < 1:
        return "No relevant information found\n", 404

    question_df = assemble_contexts(question, contexts)
    answer_df = map_answers(question_df).drop(columns=["encoded"])
    return answer_df.to_json(orient="records")

if __name__ == '__main__':
    port = int(os.getenv('HTTP_PORT', 8765))
    app.run(port=port)

Then invoke it with the requests library

import requests

resp = requests.get("http://localhost:8765/answer", params=dict(
    q="When did the last country to adopt the Gregorian calendar start using it?"))
resp.json()

… or curl from the command line.

curl -G http://localhost:8765/answer \
--data-urlencode "q=When did the last country to adopt the Gregorian calendar start using it?" \
| json_pp

Output:

[
   {
      "answer_length" : 1,
      "context" : "During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, \"10/21 February 1750/51\", where the dual year accounts for some countries already beginning their numbered year on 1 January while others were still using some other date. Even before 1582, the year sometimes had to be double dated because of the different beginnings of the year in various countries. Woolley, writing in his biography of John Dee (1527–1608/9), notes that immediately after 1582 English letter writers \"customarily\" used \"two dates\" on their letters, one OS and one NS.",
      "answer_score" : 8.0448360443,
      "answer" : "1923",
      "answer_softmax" : 0.809645772,
      "answer_start" : 34
      ...

We’re halfway through the series now, so this is a good time for a break (and some proofreading). The rest of the series will look at improving scalability and accuracy.
