The Full-Stack Guide to Actions for Google Assistant

Plus: How We Taught Google Assistant to Teach You Spanish

If you’re currently building or thinking about developing a custom Google Assistant Action, this post is for you!

My friend Daniel Gwerzman and I recently released Spanish Lesson, an Assistant that will help you learn Spanish by teaching you a number of new words every day, and reading you sample sentences in Spanish for you to translate into English.

Since we had so much fun building the app, we thought we would show you how we did it, including some interesting code snippets. Our goal is to help you save some precious time when developing your own Actions, and to show you how fun they are to create!

The Spanish Lesson Logo :)

So how did this whole thing get started?

Some months ago, my life partner Ariella got a new Google Home device. She was very excited about it and tried all sorts of things. At some point, I heard her ask the device: “Hey Google, Teach me Spanish”, to which the device responded: “I’m sorry, I don’t know how to help with that yet.”

Coincidentally, earlier that same day, I read Daniel Gwerzman’s article in which he explains why now is a good time to build actions on Google. He had some very good points — people are lazy and would rather talk than type, and since it’s still the early days of the platform, there are many opportunities to have a big impact in the technology space.

So when the Assistant responded to Ariella that it can’t teach her Spanish “yet,” it struck me: why don’t I make that happen?!

The next day, I pinged Daniel, and asked him if he was ready to embark on a new adventure. Daniel was very excited about the idea, and we started collaborating. We sat together and designed a persona, which was the basis for creating the texts and possible dialog flows. We learned a lot throughout this process, and we will probably publish another post from the product / UX point of view in a few weeks.

However, the focus of this post is the technical part of creating an Assistant Action: the challenges we faced, the stack we chose, and the architecture of a complete solution for a real-life, complex Action on Google Assistant.

We are going to cover the tech decisions that we made, and share our experience with the outcomes of our choices.

Template, Dialogflow, or Actions SDK?

There are currently three approaches to building Assistant actions: ready-made templates, Dialogflow, and Actions SDK.

The ready-made templates are great for use cases such as creating a Trivia Game or Flash Cards app. When you use a template, you don’t need to write a single line of code: you just fill in a spreadsheet, and the action is created for you based on the information you provide. This can be very useful for school teachers, who can easily create games for their students.

In our case, however, we needed more power: we wanted to be able to keep track of the user’s progress, and actually mixing Spanish and English in a single app is quite a challenge, as you will see in a minute. So we had to choose between Dialogflow and Actions SDK.

Dialogflow gives you a nice user interface for building conversation flows (in some cases you can even get away without writing any code), and also incorporates some AI to help you figure out the user intent.

Actions SDK gives you “bare-bones” access to user input, and it is up to you to provide a backend that parses that input and generates the appropriate responses.

We decided to go with Dialogflow, as it could handle some of the flows for us (e.g. asking the user a yes or no question, and understanding the user response), and it would also let us prototype quickly.

Dialogflow’s built-in capabilities proved very useful to us. For instance, if a user didn’t know how to translate the sentence they were given, they could say, “I don’t know” to get the answer and skip to the next one.

Soon after we published the app, we realized users had many ways to say they didn’t know the sentence: “I have no idea,” “I forgot,” “what is it,” and even the good old, “idk.”

“I don’t know” — Dialogflow in Action

Adding these alternatives into the app was easy, since Dialogflow allows us to quickly add new variations as we go. This means that we can easily improve our app as we learn from what users actually say to the app. Since it is so simple to iterate quickly, we do it all the time!

The Back-end and Continuous Integration

As a Web developer, JavaScript is my bread and butter, so in choosing a technology for the backend, I went with Node.js, a decision that was also backed by the fact there is an official Node SDK for developing Actions on Google using Dialogflow (the actions-on-google package). Daniel already had prior experience with it, and wholeheartedly recommended it.

To complement the development experience, I also set up TypeScript, which is now my go-to solution for any project larger than a few lines of code. Conveniently, the SDK comes with complete type definitions, so I got autocomplete and type checking. Finally, I set up unit tests with Jest, a testing framework from Facebook. I use it in conjunction with Wallaby.js, a solution that continuously runs your tests and shows you the execution results and code coverage as you type your code. I am thankful to Artem for creating it.

Excerpt from our answer verification logic. The green bullets on the left are how Wallaby indicates our coverage

In addition to testing the application logic, we also used Unit Tests to verify the consistency of our Spanish word curriculum — for example, here is an excerpt from a test verifying that each word in the database has at least 3 sentences it appears in:
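A minimal sketch of what such a consistency check could look like. The `Sentence` shape, the sample data, and the helper names `wordsOf` and `underusedWords` are all hypothetical and only serve to illustrate the idea:

```typescript
// Hypothetical shape: the real curriculum format is not shown in this post.
interface Sentence {
  spanish: string;
  english: string;
}

// Lowercase the sentence and split it into words, dropping punctuation.
function wordsOf(sentence: string): string[] {
  return sentence
    .toLowerCase()
    .split(/[^a-záéíóúüñ]+/)
    .filter((word) => word.length > 0);
}

// Return every curriculum word that appears in fewer than `minSentences` sentences.
function underusedWords(
  words: string[],
  sentences: Sentence[],
  minSentences: number
): string[] {
  return words.filter((word) => {
    const count = sentences.filter(
      (s) => wordsOf(s.spanish).indexOf(word.toLowerCase()) !== -1
    ).length;
    return count < minSentences;
  });
}

// Made-up sample data: "gato" has three sentences, "perro" only one.
const sampleWords = ["gato", "perro"];
const sampleSentences: Sentence[] = [
  { spanish: "El gato duerme.", english: "The cat sleeps." },
  { spanish: "¿Dónde está el gato?", english: "Where is the cat?" },
  { spanish: "Mi gato es negro.", english: "My cat is black." },
  { spanish: "El perro come.", english: "The dog eats." },
];

console.log(underusedWords(sampleWords, sampleSentences, 3)); // logs [ 'perro' ]
```

Inside a Jest test, the same helper could be wrapped in an assertion such as `expect(underusedWords(words, sentences, 3)).toEqual([])`, so that the build fails as soon as a word is added without enough sentences.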

Having a solid testing foundation is a key factor in Continuous Integration, which we employ in our action. We use CircleCI, and set it up so whenever I push a change to the master branch, the test suite is executed, the code is checked with tslint, and if everything goes well, the new version is immediately deployed and goes to production.

We use Docker to create a container with our transpiled application, which we then push to the Google Container Registry, and then deploy to a small machine we set up for that (and also for Metabase, see below). We had three alternatives for deploying the code: Google Cloud Functions, Google App Engine, or Google Kubernetes Engine. For now, however, we decided to simply run Docker on our own.

Here is our Dockerfile, which uses the multi-stage builds feature introduced in Docker 17.05. First we build a container where we run our tests and transpile the source code to JavaScript. Once this has completed successfully, we move on to the second stage where we build the actual container for production, based on the leaner Node-Alpine image:

Finally, we use a service called Sentry for tracking errors. It collects information about any exception that happens in our production environment and sends us an email with the details. You can also see in line 32 of the Dockerfile that we define an environment variable with the tag of the current build, which is then used by Sentry to track whether an issue still occurs after we deployed a new version. We populate the ${BUILD} variable with our CircleCI build number, by passing the --build-arg argument to the docker build command when building the image.

Example of an error we had, tracked by Sentry down to the exact source code line

Overall, our current setup includes both the tests and Sentry as safety nets that we trust, so we can make big changes and deploy them to production without extensive manual testing and without fearing that we will break the app. I love working this way.

Analytics

If there is one thing life taught me as an entrepreneur, it’s that whenever you start a new product, every single user that you have is an enormous opportunity to learn something new about your product.

While prototyping, I sat next to Ariella and observed her using the app, and we got very valuable feedback from just doing this. This is how, for instance, we decided to add the “I don’t know” scenario mentioned above.

Given this, it was clear to us that we wanted to track all the user interactions with our app, and also wanted to have an easy way to query these interactions, analyze them and visualize them to help us see the patterns. We wanted to collect Analytics from day-0, and looked for a simple solution that would not require too much coding and would be able to grow along with the app. We went with BigQuery, a data warehouse solution from Google, which can easily scale to massive amounts of data, and supports the SQL query language we are familiar with.

The way we use BigQuery is by directly reporting any user interaction to it using a feature called “Streaming Insert,” which lets you insert new rows one-by-one, whenever they occur. You can see the code snippet that reports the users’ interactions below:

Basically, you call this function with the DialogflowApp instance, and the response you sent to the user (if any), and you are all set. This is how we defined our BigQuery table schema:
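As a rough sketch of the idea, here is one way a reporting function and a matching row shape could look. Only the `input` and `response` columns come from this post; `userId`, `timestamp`, the `RowSink` interface, and the function names are assumptions made for illustration. The sink is injected so the example runs without cloud credentials; a table handle from a BigQuery client library is expected to fit this shape, but that should be checked against the client version you use:

```typescript
// One row per user interaction. `userId` and `timestamp` are illustrative guesses.
interface InteractionRow {
  userId: string;
  timestamp: string; // ISO 8601
  input: string;
  response: string | null;
}

// Anything with an `insert(rows)` method that returns a promise (assumed shape).
interface RowSink {
  insert(rows: InteractionRow[]): Promise<unknown>;
}

function buildInteractionRow(
  userId: string,
  input: string,
  response: string | null,
  now: Date
): InteractionRow {
  return { userId, timestamp: now.toISOString(), input, response };
}

// Stream a single row. Failures are logged and swallowed, so an analytics
// outage never breaks the conversation with the user.
function reportInteraction(sink: RowSink, row: InteractionRow): Promise<void> {
  return sink.insert([row]).then(
    () => undefined,
    (err) => {
      console.error("Failed to report interaction", err);
    }
  );
}

// Usage with an in-memory sink standing in for the real table.
const recorded: InteractionRow[] = [];
const fakeSink: RowSink = {
  insert(rows) {
    rows.forEach((r) => recorded.push(r));
    return Promise.resolve();
  },
};

reportInteraction(
  fakeSink,
  buildInteractionRow("user-1", "the cat sleeps", "Correct!", new Date(0))
);
```

Swallowing the error is a deliberate choice in this sketch: losing one analytics row is cheaper than failing a user's lesson.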

Initially, we only tracked the input and the response, but when we started seeing users interacting with the app, we saw many cases where it seemed like the app did not understand the user correctly, and so we wanted to understand how users interact with the app, and which part of the world they were from.

This is where we started collecting the locale, inputType and surfaceCapabilities parts. The latter two were not very straightforward, as they required some pre-processing before we could store them in the database:
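A sketch of the kind of pre-processing this involves. It assumes capabilities arrive as objects with names such as `actions.capability.AUDIO_OUTPUT`, and that the input type may arrive either as a string or as a numeric enum value. The numeric ordering used here is an assumption and should be verified against the SDK version in use; the function names are hypothetical:

```typescript
// Assumed shape of one surface capability entry.
interface Capability {
  name: string;
}

// Assumed numeric ordering of input types; verify against your SDK version.
const INPUT_TYPE_NAMES = ["UNSPECIFIED", "TOUCH", "VOICE", "KEYBOARD"];

// Normalize the input type to a single uppercase string column.
function normalizeInputType(raw: string | number | undefined): string {
  if (typeof raw === "number") {
    return INPUT_TYPE_NAMES[raw] || "UNSPECIFIED";
  }
  return raw ? raw.toUpperCase() : "UNSPECIFIED";
}

// Flatten the capability list into one sorted, comma-separated string, so
// identical devices always produce identical values that are easy to group by.
function flattenCapabilities(capabilities: Capability[]): string {
  const prefix = "actions.capability.";
  return capabilities
    .map((c) =>
      c.name.indexOf(prefix) === 0 ? c.name.slice(prefix.length) : c.name
    )
    .sort()
    .join(",");
}

console.log(normalizeInputType(2));
console.log(
  flattenCapabilities([
    { name: "actions.capability.MEDIA_RESPONSE_AUDIO" },
    { name: "actions.capability.AUDIO_OUTPUT" },
  ])
);
```

Sorting before joining matters for the analytics: without it, the same device could report its capabilities in a different order and show up as two separate groups in a chart.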

We learned that the vast majority of the users interact with our app using voice:

By looking at the surface capabilities, we could also see that many of the users interacted with us using a device with just Audio Output and Media Response Audio capabilities, which we believe to be Google Home (in contrast with a Screen Output-capable device, such as a smart phone):

Overall, setting up BigQuery was a matter of about an hour of coding and testing, and then generating a service account file with the credentials and including it in our code. With luck, this is a part that we will not have to revisit any time soon, as the whole database is totally managed for us. As you can see, we are already producing some very useful insights that help us shape and improve our product.

You might be wondering where the graphs above come from. BigQuery doesn’t come with data visualizations built-in, and while you can export the query results to Excel, this is not very convenient. We use a free product called Metabase, which connects to BigQuery and allows you to easily query and visualize the data. For instance, this is how I set up the query for the first graph:

You can also opt to write your queries using SQL, in case you need more sophisticated aggregation, or to join data from several tables. Finally, Metabase allows you to create a nice dashboard which you can set to display the metrics that you care about:

Our Current Metabase Dashboard

Well, based on the graph above, you can probably guess at which point in time we launched the app :-)

Google also provides some insights into your app in a dedicated Analytics section in the Actions on Google console, where you can learn about your app health, latency, discovery, and more.

Our latency stats from the Actions on Google console. How do they compare with yours?

The takeaway is that setting up an analytics solution is pretty simple, and there is so much you can learn from your data, so this is definitely something that you want to do before you launch your app.

Talking Spanish

This part is somewhat specific to our use case, so I will keep it short. We wanted the app to be able to speak Spanish in addition to English. Google Assistant allows you to customize the responses using a language called SSML, which can specify voice parameters such as speed and pitch, and also give hints on how to read specific portions of the text, such as adding an emphasis.

Unfortunately, Google’s SSML implementation currently does not support switching between different languages. It does, however, allow you to insert external audio files, so we could theoretically generate the Spanish speech using a different method.

Funnily enough, however, we found the solution in a quite unexpected place: the Amazon Cloud. It has a cloud TTS service, called Amazon Polly, with a generous free tier: 5 million characters per month. It also supports SSML, which means we can ask it to speak slower when the user asks to repeat a sentence.
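To make the two SSML roles concrete, here is a small sketch that only builds the markup strings. It assumes the Spanish audio has already been synthesized and hosted somewhere; the function names and the example URL are hypothetical. The `<prosody rate="slow">` and `<audio src="...">` elements are standard SSML:

```typescript
// Escape characters that are special in XML so a sentence cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// SSML sent to the TTS service. `slow` is meant for when the user asks for a repeat.
function spanishSsml(sentence: string, slow: boolean): string {
  const body = escapeXml(sentence);
  return slow
    ? `<speak><prosody rate="slow">${body}</prosody></speak>`
    : `<speak>${body}</speak>`;
}

// SSML returned to the Assistant: an English prompt followed by the
// pre-rendered Spanish audio file.
function promptSsml(englishPrompt: string, audioUrl: string): string {
  return `<speak>${escapeXml(englishPrompt)} <audio src="${escapeXml(audioUrl)}"/></speak>`;
}

console.log(spanishSsml("¿Cómo estás?", true));
console.log(promptSsml("Translate this sentence:", "https://example.com/audio/1.mp3"));
```

Since the same sentence is read to many users, the generated audio can be cached by sentence and speed, which also helps stay within the free tier.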

The moral of the story is that sometimes, you can get better results if you mix and match services from multiple cloud providers to meet your specific needs.

A Database for User Data

The first prototype of the app simply gave users a random sentence in Spanish to translate into English. While watching our #1 beta tester Ariella use it, I realized that this was not the right approach, as there were many words she was not familiar with yet; having the Action spit out a random sentence without taking into account her current vocabulary didn’t make much sense.

At that point, I realized we would need to use some kind of database to store the progress for each individual user.
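To illustrate why per-user progress matters, here is a minimal sketch of picking only sentences that fit a user's current vocabulary. The `UserProgress` and `PracticeSentence` shapes, the function name, and the sample data are hypothetical, not the actual data model:

```typescript
// Hypothetical shape of the per-user progress stored in the database.
interface UserProgress {
  knownWords: string[];
}

interface PracticeSentence {
  spanish: string;
  words: string[]; // the vocabulary words this sentence relies on
}

// Keep only the sentences in which every word has already been taught to the user.
function eligibleSentences(
  progress: UserProgress,
  sentences: PracticeSentence[]
): PracticeSentence[] {
  return sentences.filter((s) =>
    s.words.every((w) => progress.knownWords.indexOf(w) !== -1)
  );
}

const progress: UserProgress = { knownWords: ["el", "gato", "come"] };
const practice: PracticeSentence[] = [
  { spanish: "El gato come.", words: ["el", "gato", "come"] },
  { spanish: "El perro come.", words: ["el", "perro", "come"] },
];

// Only the first sentence qualifies, because "perro" has not been taught yet.
console.log(eligibleSentences(progress, practice).map((s) => s.spanish));
```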

We chose the new Cloud Firestore database, as it is completely hosted and managed, and it is a schema-less document database, which makes it perfect for fast prototyping. I’m still pretty new to Firestore, but I really enjoy using Firebase (actually, I have been a user since their public beta in 2013, and I even wrote my own firebase-server implementation), so I decided we’d give it a go.

So far I am happy: interfacing with it from Node.js was a breeze, as it was from the small front-end app we built. Their console is a very convenient way to explore and manage your documents while prototyping:

Using the Firestore console to check on our users’ progress

At some point, Daniel asked me why we needed both BigQuery and Firestore. The answer lies in the fundamental difference between the two: BigQuery is great for complex queries on vast amounts of data. You get an answer within seconds, no matter how much data you have. This is also where its weakness lies: even if you want just one record, the engine will still have to scan the entire table, and you will still wait a few seconds for the answer. While waiting 5 seconds to get a result for a complex query makes sense, having our users wait 5 or more seconds for each interaction with the app is not the kind of experience we are aiming for.

In addition, BigQuery is pretty limited when it comes to modifying existing data. So Firestore is great for quick read/insert/update operations, while BigQuery excels at storing and querying aggregate user data. So just like how we ended up using multiple services to make our app speak Spanish, it’s all about finding the right tool for the job.

The Front-end

Now you might be wondering: what does a front-end have to do with an assistant action? At some point, we will probably build a web page for our users so they can see their own progress, but for now, we decided to build a small tool for ourselves in order to explore how our users are interacting with the app. True, we have all the data in BigQuery, but reading a conversation off a table is not very convenient.

I decided to take this opportunity to learn a new technology, Vue.js. I started from fuse-box-vue-seed, which also includes Vuetify, a Material Design component library for Vue, and fuse-box, a super-fast module bundler, which is a nice alternative to Webpack.

I then added Google Single Sign-On using Firebase Authentication, deployed the app to the web with Firebase Hosting, and used the firebase-admin SDK to verify user identity on our backend.
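A sketch of how the backend side of such a check could be structured. The token verifier is injected, so the example runs without Firebase; in a real setup it would presumably wrap the admin SDK's `verifyIdToken` call. The email allowlist and all names below are assumptions for illustration:

```typescript
// The subset of a decoded ID token this sketch relies on.
interface DecodedToken {
  uid: string;
  email?: string;
}

// Injected verifier (hypothetical type); rejects when the token is invalid.
type TokenVerifier = (idToken: string) => Promise<DecodedToken>;

// Pull the token out of an "Authorization: Bearer <token>" header value.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) {
    return null;
  }
  const match = /^Bearer\s+(\S+)$/.exec(header.trim());
  return (match && match[1]) || null;
}

// Resolve to the caller's email if the token is valid and the email is on the
// allowlist, or to null otherwise.
function authorize(
  header: string | undefined,
  verify: TokenVerifier,
  allowedEmails: string[]
): Promise<string | null> {
  const token = extractBearerToken(header);
  if (!token) {
    return Promise.resolve(null);
  }
  return verify(token).then(
    (decoded) =>
      decoded.email && allowedEmails.indexOf(decoded.email) !== -1
        ? decoded.email
        : null,
    () => null // an invalid or expired token is treated as "not authorized"
  );
}

// Usage with a fake verifier that accepts a single hard-coded token.
const fakeVerify: TokenVerifier = (idToken) =>
  idToken === "good-token"
    ? Promise.resolve({ uid: "u1", email: "me@example.com" })
    : Promise.reject(new Error("invalid token"));

authorize("Bearer good-token", fakeVerify, ["me@example.com"]).then((email) =>
  console.log(email)
);
```

Treating every failure as "not authorized" keeps the handler simple and avoids leaking why a token was rejected.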

One of my testing sessions, visualized by our Vue.js frontend

Overall, I am very happy with Vue — the learning curve was gentle, working with Vuetify was quite easy, and I managed to quickly build a tool we now use daily and add new features to.

Conclusion

We already have some mileage with our app, but it is still the very beginning. Building on existing technologies and leveraging cloud services such as Dialogflow, BigQuery, Firestore, Sentry, and Amazon Polly, while using a Continuous Deployment setup with Jest, Docker, CircleCI, and Google Container Registry, allowed us to get up and running really fast, and to keep the pace going even once the app went live.

It’s been two weeks since our app launched, and we already served hundreds of users and have learned a bunch about what we could improve. Some of the smaller improvements already made it into the app, and we are now working on an overhaul of the lesson structure, to make it more effective. You will probably see another post from us in the coming weeks covering things more from the product perspective.

It brings us great satisfaction to see users returning to our app on a daily basis to learn new words, and Ariella has had fun learning new Spanish words, too!

Sound Fun? Give Our Action a Try!

Oh, and don’t forget to try out our Action! You can check it out on Google Home, Android 6.0+ TVs, Android 6.0+ phones, iOS 9.0+ phones, or on your Chromebook. Just install Google Assistant (if you haven’t already), and say:

“Talk to Spanish Lesson”

¡Diviértete! (“Have fun,” and don’t forget to leave us any feedback in the comments section!)