Creating Your Own Science-Question-Answering AI with Aristo Mini

Hi! I’m a research engineer working on Aristo, a project to teach a computer to answer standardized-test-like science questions.

Recently we open-sourced a slimmed-down version, aristo-mini, so that people who aren’t us can play around with creating science question solvers. However, the barrier to entry is slightly non-trivial, so I thought I’d write a blog post to help you get started.

What is Aristo-Mini?

Conceptually, aristo-mini has two pieces. It contains solvers, which are HTTP servers that expose an /answer endpoint that accepts multiple-choice questions and returns their answers. And it contains an evaluator that can feed a set of questions to a solver and tabulate the results. (It also contains a web interface for the evaluator.)
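To make that concrete, here’s a sketch of what one request/response cycle might look like. The JSON field names below are my assumptions for illustration, not necessarily the repo’s exact schema; check the aristo-mini README for the real one.

```python
import json

# Hypothetical payload POSTed to a solver's /answer endpoint
# (field names are illustrative, not the exact aristo-mini schema).
request = {
    "stem": "What is your favorite programming language?",
    "choices": [{"label": "A", "text": "Python"},
                {"label": "B", "text": "Scala"}],
}

# Hypothetical response: the solver identifies itself and returns
# one confidence score per choice.
response = {
    "solverInfo": "example solver v0.0.1",
    "confidences": [
        {"choice": {"label": "A", "text": "Python"}, "confidence": 0.9},
        {"choice": {"label": "B", "text": "Scala"}, "confidence": 0.1},
    ],
}

# both sides travel as plain JSON over HTTP
wire = json.dumps(request)
```

Because the contract is just JSON over HTTP, any language that can run a web server can play.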

The Technology

AI2 is mostly a Scala shop, and so the evaluator is written in Scala, which means that you’ll need Scala 2.11 and sbt in order to run it.

A solver is just an HTTP server with some specified endpoints, which means it can be written in any language. Aristo-mini comes with example solvers in both Scala and Python, but you could choose Ruby or Haskell or JavaScript or whatever. We’ll be using Python, as I have a notion that a lot of the people who’d be interested in playing at creating solvers are Python people, and also I’m the one who wrote the example Python solvers.

What’s more, I wrote them using Python 3.5 features, which means that if you want to follow along, you’ll need to be using Python 3.5 or later. (Which you should be doing anyway!) If you were so inclined, I’m sure you could modify the code to work with 2.7, but this blog post isn’t going to help you do that.

So go ahead and follow the instructions to get aristo-mini up and running; I’ll wait. Also make sure that you’re able to run the Python randomguesser solver.

Anatomy of a Solver

If you look in python/aristomini/common/ you'll find a SolverBase class that contains all of the boring server parts. It has two unimplemented methods (each just raises NotImplementedError) that you'll need to override in your solver.

The first is really simple:

def solver_info(self) -> str:
    """info about the solver"""
    raise NotImplementedError()

That just needs to return some info about the solver to identify it to the evaluation UI. So if your solver was called “blogpost solver v.0.0.1” you could just return that.

The other is the interesting one:

def answer_question(
        self, question: MultipleChoiceQuestion
) -> MultipleChoiceAnswer:
    """answer the question"""
    raise NotImplementedError()

The Python-3.5-style type hints show that this function needs to take a MultipleChoiceQuestion and return a MultipleChoiceAnswer. These data types are defined using a hierarchy of NamedTuple types you can find in common/. They're mostly self-explanatory.

A question has a “stem” (e.g. “What is your favorite programming language?”) and a bunch of “choices”, each of which has a label (e.g. “A”) and some text (e.g. “Python”). And an answer is just a list of ChoiceConfidence objects, each of which contains a choice and a number indicating how confident we are that it's the correct choice.
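As a rough sketch of that hierarchy (the real definitions live in common/ and may differ in naming details, so treat these as illustrative):

```python
from typing import List, NamedTuple

# Illustrative, Python-3.5-style versions of the data types; the real
# definitions are in python/aristomini/common/ and may differ in detail.
Choice = NamedTuple("Choice", [("label", str), ("text", str)])
MultipleChoiceQuestion = NamedTuple(
    "MultipleChoiceQuestion", [("stem", str), ("choices", List[Choice])])
ChoiceConfidence = NamedTuple(
    "ChoiceConfidence", [("choice", Choice), ("confidence", float)])
MultipleChoiceAnswer = NamedTuple(
    "MultipleChoiceAnswer", [("choiceConfidences", List[ChoiceConfidence])])

question = MultipleChoiceQuestion(
    stem="What is your favorite programming language?",
    choices=[Choice("A", "Python"), Choice("B", "Scala")])
answer = MultipleChoiceAnswer(
    [ChoiceConfidence(choice, 0.5) for choice in question.choices])
```

(I’m using the functional NamedTuple syntax here since it works on Python 3.5; the class-based syntax requires 3.6.)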

The interesting part, of course, is how we come up with those confidences.


I thought hard about what was a good solver to write for this blog post. My criteria were

  • easy to explain
  • does something interesting
  • doesn’t require a ton of extra machinery
  • I can build it in less than a day

In the end I decided to use word2vec, a technique that learns to embed words in a high-dimensional vector space in such a way that the embeddings of “similar” words are “close” to one another.

One such notion of closeness is cosine similarity, which measures the “angle” between two such vectors.
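As a quick illustration of what cosine similarity measures (pure Python, no gensim required): vectors pointing the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.

```python
import math

def cosine(u, v):
    """cosine of the angle between two equal-length vectors"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [2, 0]))  # parallel vectors -> 1.0
print(cosine([1, 0], [0, 3]))  # orthogonal vectors -> 0.0
```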

We’ll use the gensim library to train our word2vec model, and we’ll train it on the collection of science sentences that we’ve also provided for the textsearch solver.

(All the code for our example solver is in the aristo-mini repo.)

Training the model

The script to train the model is in scripts/. Mostly it involves a wrapper class that yields one line at a time from the input file so we don't have to load the whole thing into memory. The only real design decision here is how to tokenize each sentence into words. In the end I settled on the following simple approach:

def tokenizer(sentence: str) -> List[str]:
    """use gensim's `simple_preprocess` and `STOPWORDS` list"""
    return [stem(token)
            for token in simple_preprocess(sentence)
            if token not in STOPWORDS]

Given a sentence, we use gensim’s simple_preprocess (which lowercases and splits into tokens), filter out all the STOPWORDS (again, from a gensim list), and then stem each word to remove some variation. (If I had orders of magnitude more sentences I wouldn't bother stemming, but here words appear infrequently enough that I'd rather not treat "question" and "questions" differently.)
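To see what that pipeline does end-to-end, here’s a toy standard-library stand-in — a hypothetical tiny stopword list and a crude suffix-stripping “stemmer” instead of gensim’s, but the same lowercase / split / filter / stem shape:

```python
import re

TOY_STOPWORDS = {"the", "of", "a", "is", "and"}  # tiny stand-in list

def toy_stem(token: str) -> str:
    """crude stand-in for a real stemmer: strip a trailing 's'"""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def toy_tokenizer(sentence: str):
    # lowercase and split into alphabetic tokens, drop stopwords, stem
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [toy_stem(t) for t in tokens if t not in TOY_STOPWORDS]

print(toy_tokenizer("The questions of dinosaurs"))
# -> ['question', 'dinosaur']
```

The real tokenizer does the same thing with gensim’s much larger stopword list and a proper Porter stemmer.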

To generate the model, run the following command:

python python/aristomini/scripts/ /path/to/sentences /path/to/save/model

(There are several parameters you can tweak; the defaults are the ones I used.)

Exploring the model

A word2vec model is no fun if you don’t play around with it, so let’s load it into IPython:

In [1]: from gensim.models import Word2Vec
In [2]: model = Word2Vec.load("/path/to/saved/model")

Because we stemmed the words we used to learn the word2vec model, we also need to stem any words we plug into it:

In [3]: from gensim.parsing.porter import PorterStemmer
In [4]: stemmer = PorterStemmer()
In [5]: def stem(word): return stemmer.stem(word)
In [6]: model.most_similar(stem("dinosaur"))
[('pterosaur', 0.9011425375938416),
('theropod', 0.8855368494987488),
('dromaeosaur', 0.8718942403793335),
('ichthyosaur', 0.8578550815582275),
('archaeopteryx', 0.8556511998176575),
('tetrapod', 0.8287086486816406),
('dromaeosaurid', 0.8278105854988098),
('reptil', 0.8232177495956421),
('sauropod', 0.8070803880691528),
('maniraptoran', 0.8061385750770569)]

I don’t know what most of those things are, but they sound like dinosaurs, so that’s probably good. (Because of randomness in training, your model and results will not be exactly the same as these, although I hope they’re pretty similar.)

The real fun happens when you start adding and subtracting vectors. For instance, we could ask the model which it thinks are the “easy” sciences:

In [7]: model.similar_by_vector(model[stem("science")] - model[stem("hard")] + model[stem("easy")])
[('scienc', 0.8785766959190369),
('philosophi', 0.7075982093811035),
('knowledg', 0.705856442451477),
('disciplin', 0.687310516834259),
('anthropolog', 0.6556973457336426),
('scientif', 0.6551451683044434),
('technolog', 0.6524710059165955),
('astronomi', 0.6519479155540466),
('textbook', 0.6511344909667969),
('perspect', 0.6447985172271729)]

and which are the “hard” sciences:

In [8]: model.similar_by_vector(model[stem("science")] + model[stem("hard")] - model[stem("easy")])
[('scienc', 0.8738738298416138),
('laboratori', 0.6229091286659241),
('geologi', 0.616244912147522),
('physic', 0.6124218106269836),
('forens', 0.6025989055633545),
('nsb', 0.6010695099830627),
('anthropolog', 0.5982218980789185),
('geoscienc', 0.5882915258407593),
('institut', 0.5821720957756042),
('disciplin', 0.5820775032043457)]

No judgment here, that’s just what the model thinks!

You could waste all day playing around with this sort of thing. I encourage you to do so, but that’s as far as we’ll go here.

Using the model to answer questions

As I mentioned earlier, we’ll want to embed each question and each answer into a high-dimensional vector and use cosine similarity as a proxy for “goodness” of the answer:

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    num =, v2)
    d1 =, v1)
    d2 =, v2)
    if d1 > 0.0 and d2 > 0.0:
        return num / math.sqrt(d1 * d2)
    return 0.0

Now, we’ll create a class that wraps the word2vec model we just trained and that has the functionality our solver needs:

class WordTwoVec:
    """
    a wrapper for gensim.Word2Vec with added functionality to embed
    phrases and compute the "goodness" of a question-answer pair
    based on embedding-vector similarity
    """
    def __init__(self, model_file: str) -> None:
        self.model = Word2Vec.load(model_file)

To start with, we want to measure the “goodness” of a question-choice pair; that is, how confident we are that the given choice is a correct answer to the question.

We’ll start by tokenizing the question and answer, deduplicating the tokens, and then throwing out the answer tokens that appear in the question. (This helps avoid spurious high scores that occur when an answer repeats part of the question, although it obviously has a corresponding downside.)

def goodness(self, question_stem: str, choice_text: str) -> float:
    """how good is the choice for this question?"""
    question_words = {word for word in tokenizer(question_stem)}
    choice_words = {word
                    for word in tokenizer(choice_text)
                    if word not in question_words}
    return cosine_similarity(self.embed(question_words),
                             self.embed(choice_words))

The final piece is to write the function that embeds a list of words into our high-dimensional vector space. We'll do something very simple: embed each word to get a vector, and then take the (element-wise) mean of all the vectors. (If none of the words are in the model's vocabulary, we'll just return a vector of zeros.)

def embed(self, words: Iterable[str]) -> np.ndarray:
    """
    given a list of words, find their vector embeddings
    and return the vector mean
    """
    # first find the vector embedding for each word
    vectors = [self.model[word]
               for word in words
               if word in self.model]
    if vectors:
        # if there are vector embeddings, take the vector average
        return np.average(vectors, axis=0)
    # otherwise just return a zero vector
    return np.zeros(self.model.vector_size)

For example, we can look at the “goodness” of a simple question and its choices:

In [1]: from aristomini.common.wordtwovec import WordTwoVec
In [2]: w2v = WordTwoVec("/path/to/saved/model")
In [3]: w2v.goodness("during which season of the year would a rabbit's fur be thickest", "winter")
Out[3]: 0.57185169384361234
In [4]: w2v.goodness("during which season of the year would a rabbit's fur be thickest", "summer")
Out[4]: 0.51732586776964762
In [5]: w2v.goodness("during which season of the year would a rabbit's fur be thickest", "spring")
Out[5]: 0.36186933496267454
In [6]: w2v.goodness("during which season of the year would a rabbit's fur be thickest", "fall")
Out[6]: 0.30608466617051361

In this case it gives the highest score to the correct answer!

Building our solver

Now all we need to do is subclass SolverBase, give our solver class a WordTwoVec instance, and use it to implement answer_question:

class WordVectorSimilaritySolver(SolverBase):
    """uses word2vec to score questions"""
    def __init__(self, model_file: str) -> None:
        self.word_two_vec = WordTwoVec(model_file)

    def solver_info(self) -> str:
        return "word_vector_similarity"

    def answer_question(
            self, question: MultipleChoiceQuestion
    ) -> MultipleChoiceAnswer:
        mca = MultipleChoiceAnswer(
            [ChoiceConfidence(choice,
                              self.word_two_vec.goodness(question.stem, choice.text))
             for choice in question.choices]
        )
        return mca

For each choice, we compute the goodness of question.stem and choice.text and return that as the confidence.

You can start the solver by running

python python/aristomini/solvers/ /path/to/saved/model

Evaluating the solver

Start the evaluator (as described in the instructions) and navigate to http://localhost:9000:

Click on “Evaluate an exam” and then choose one (say, the AI2-Elementary-NDMC-Feb2016-Dev exam).

It gets 33% correct, which is better than guessing at random, but not that much better. Suffice it to say that we haven’t solved the artificial intelligence problem today.

Next steps

Go write your own solver! Or just tweak this one. Some ideas:

  • there’s a fair amount of junk in those “science sentences”; clean them up and/or add to them
  • instead of training your own model, use pretrained word vectors
  • try different tokenization logic (e.g. no stemming)
  • try different parameters for the word2vec model (e.g. number of dimensions)
  • instead of equally averaging the word vectors, use some kind of weighted average that assigns more weight to more “informative” words (e.g. tfidf)
  • use part-of-speech tagging and then compare nouns with nouns, verbs with verbs, and so on
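To give a taste of the weighted-average idea, here’s a hypothetical standard-library-only sketch: weight each word by a smoothed inverse document frequency over a (made-up) mini-corpus, so that common words contribute less to the mean. The corpus and the 2-d “embeddings” are invented just to show the mechanics.

```python
import math
from collections import Counter

# hypothetical mini-corpus of tokenized sentences (not the real data)
corpus = [["rabbit", "fur", "winter"],
          ["winter", "snow", "cold"],
          ["winter", "ears", "long"]]

# smoothed idf(w) = log(N / df(w)) + 1: rarer words get bigger weights
df = Counter(word for doc in corpus for word in set(doc))
N = len(corpus)
idf = {word: math.log(N / count) + 1 for word, count in df.items()}

def weighted_mean(vectors, weights):
    """element-wise weighted average of equal-length vectors"""
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]

# toy 2-d "embeddings", just to show the mechanics
embedding = {"rabbit": [1.0, 0.0], "winter": [0.0, 1.0]}
words = ["rabbit", "winter"]
blended = weighted_mean([embedding[w] for w in words],
                        [idf[w] for w in words])
# the rarer word ("rabbit") pulls the blended vector toward its embedding
```

In the real solver you’d compute idf over the science sentences and feed the weights into the `embed` average instead of the plain `np.average`.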

If you come up with anything good, let me know. And have fun!

(Thanks to Peter Turney for helpful discussions about how to design the solver and for contributing most of the ideas for next steps.)

Joel Grus is an engineer on the Aristo team and the author of Data Science from Scratch. He blogs occasionally at and tweets all the time at @joelgrus.
