Testing Smart Reply

Ivan Gavran
4 min read · Jan 21, 2016


It sounded like the future when, two months ago, Google announced it was introducing Smart Reply to its Inbox app. The promise was that the new system would be able to determine which incoming emails are suitable for a short reply and suggest up to three possible (short!) replies. This would benefit the end user by letting him answer an email with one tap (yes, him; she would never be so reckless as to let a computer answer her emails).

Here is how it looks:

A quick test

Even though I was immediately skeptical about its practical usability, I was delighted by the mastery of the people who created a system able to understand natural language and form a (sensible) response.

To see how well it works, I immediately sent seven emails that could be answered in short sentences. Here are my emails and the answers from the Smart Reply system:

  • Sorry, I am running late. Can we meet 15 minutes later?

Sure, see you later | Sure, no problem. | I am not available at that time.

  • I checked my schedule, and cannot make it on Sunday. How about Monday? Or Tuesday? Does it work for you?

Monday should work for me. | - | -

  • When would you prefer to meet next week?

Next week. | I’m pretty open. | I’m not sure yet.

  • I am organizing a party next Saturday at 8 pm. RVSP, please.

I’ll be there! | I’m in! | Sorry, I can’t make it.

  • I’ll be visiting Paris next weekend. Who could meet me and show me around?

I would be happy to meet you. | I can come to you. | Where are you staying?

  • A quick survey: who is a better songwriter, Bob Dylan or Leonard Cohen? No need to elaborate…

I like it. | Good one! | I have no idea!

Wow! I am pretty impressed. Except for the last one, all emails were offered a meaningful and complete set of answers.

What happened in the following days was even more astonishing: I found more than one situation where an answer suggested by Smart Reply perfectly covered what I wanted to say.

How it works

Black magic :) But this is how it is described on the Google Research blog: it is based on a pair of recurrent neural networks, one capturing the meaning of the incoming email (the encoder), and the other (the decoder) creating a reply, word by word. The encoder creates a vector based on the words appearing in the incoming email; the vectors for the sentences Are you free tomorrow? and Does tomorrow work for you? are similar (with respect to some metric), thus capturing the meaning, and not only the wording. Furthermore, all of the system's knowledge comes from the large corpus of emails (and their answers) it digested and learned from. Smart Reply is mentioned as one of the use cases for the recently open-sourced TensorFlow, Google's machine learning library. (I will write more about how it works in a future post.)
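The idea of comparing sentence vectors under some similarity metric can be illustrated with a toy sketch. To be clear, this is not Google's encoder (a real one is a recurrent network producing dense learned vectors); it is just a bag-of-words stand-in with cosine similarity, enough to show why overlapping vocabulary makes two phrasings of the same question land close together:

```python
from collections import Counter
import math

def sentence_vector(sentence):
    # Toy stand-in for the encoder: a bag-of-words count vector.
    # (The real encoder learns dense vectors that can also match
    # sentences with no words in common.)
    cleaned = sentence.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def cosine(a, b):
    # Standard cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

v1 = sentence_vector("Are you free tomorrow?")
v2 = sentence_vector("Does tomorrow work for you?")
v3 = sentence_vector("The quarterly report is attached.")

print(cosine(v1, v2))  # ~0.45: shared words "tomorrow" and "you"
print(cosine(v1, v3))  # 0.0: no overlap at all
```

Even this crude metric puts the two scheduling questions far closer to each other than to an unrelated email; the learned encoder does the same, but also for paraphrases that share no words.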

Tweaking love

An interesting remark from the research team: the system, trained on actual users' emails, used to answer a large number of emails with a simple I love you, as this was a very common answer to a wide variety of emails, so it needed to be tweaked a bit :) (That tweaking might be the reason the seemingly simple email from the first illustration was answered so awkwardly.)

A more detailed test

I realized that Smart Reply is not just super interesting, but could be useful as well. Leaving aside answering typical emails, an obvious next step would be integrating a calendar and then suggesting answers based on its availability data, especially given the recent news about Google building a smart messaging interface for browsing the web.

As most of my email conversations are not in English (but in Croatian), my positive impression was based on very few examples, so I had to do some real testing. A bunch of emails from sampleemails.org were scraped for the test. However, as the grading was performed by me, I had to cut the number of test samples down to approximately 50. Sure, the emails from that site are not always suitable for quick answers, but the idea was to see how often Smart Reply would offer sensible (not necessarily complete) answers. At least one useful answer out of three was enough to mark a sample as successful; samples with no answers offered at all were marked unsuccessful.
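To make the grading criterion concrete, here is a minimal sketch of it in code. The samples below are made up for illustration (they are not my actual test data); the logic is the rule described above: a sample is successful if at least one of its up-to-three suggestions is useful, and a sample with no suggestions at all counts as unsuccessful:

```python
# Hypothetical graded samples: each has the suggested replies and
# my per-reply usefulness judgments (True = useful).
samples = [
    {"suggestions": ["I'll be there!", "I'm in!", "Sorry, I can't make it."],
     "useful": [True, True, True]},
    {"suggestions": [],              # no answers offered -> unsuccessful
     "useful": []},
    {"suggestions": ["I like it.", "Good one!", "I have no idea!"],
     "useful": [False, False, False]},
]

# A sample succeeds if any of its offered replies was judged useful;
# any([]) is False, so "no suggestions" is automatically a failure.
successful = sum(1 for s in samples if any(s["useful"]))
rate = successful / len(samples)
print(f"{successful}/{len(samples)} successful ({rate:.0%})")  # -> 1/3 successful (33%)
```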

It turned out that 58% of the samples were offered a successful set of answers, which I find to be a very good result given the form of the test emails. (Check the emails and suggested answers here.)

What I noticed while grading the answers was that the system didn't catch the sarcasm in one email, and that most of the emails without any suggested answer were promotional emails (where no answer might actually be the best answer).

After performing the test, Smart Reply still sounds like the future; let's wait to see and learn more about how it works and what will be built on top of it.
