Developing for Voice. The multi-dev problem
If you’re a native voice developer (whether for Google, Alexa, or any other vendor for that matter), then chances are you’re either the only one working on the skill or, if you do have colleagues, you take turns coding.
That’s all because of this (Alexa has a similar screen, rest assured):
which lets us configure a single webhook URL for externally fulfilling voice requests. That’s fine if your development environment looks like this:
But whenever a new person tries to join the effort, this occurs:
How can both devs target their own local voice-fulfiller instances to check their own changes?
To try and crack this, we would need some sort of router that DialogFlow calls first, which then somehow decides which dev the Google Assistant request should target. But how can we tell the devs apart based on the Google request we get?
It turns out that when you open a new conversation session, the JSON payload that Google delivers to the fulfiller has the following fields attached:
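The original screenshot isn’t reproduced here, but the relevant part of the body looks roughly like this (field names follow the Actions on Google conversation webhook format; the values are made up for illustration):

```python
import json

# A trimmed, illustrative sketch of the JSON body Google delivers
# when a new conversation session starts.
raw_body = """
{
  "conversation": {
    "conversationId": "ABwppHH-example-id",
    "type": "NEW"
  },
  "inputs": [
    {
      "intent": "actions.intent.MAIN",
      "rawInputs": [
        { "inputType": "KEYBOARD", "query": "Talk to my skill" }
      ]
    }
  ]
}
"""

body = json.loads(raw_body)
session_id = body["conversation"]["conversationId"]  # unique per conversation
is_new = body["conversation"]["type"] == "NEW"       # True on session start
query = body["inputs"][0]["rawInputs"][0]["query"]   # full textual query
```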
So, as you might have expected, there is a way to uniquely tag a conversation session via the conversationId field. Bonus: Google is kind enough to set conversation.type to “NEW” when the current request is starting a new conversation session.
Even more useful than that is the fact that, unlike Alexa, we also get the full textual query used to fire up the conversation inside the query field. For our example, that would be “Talk to my skill”. Sweet!
Now let’s see what happens if we try out a different query. If we type in (or say) “Talk to my skill on dev 1”, this is what we get back:
It worked! We got the full query text as part of the request body, and a new conversation session has been instantiated. Great! Now we can go on and implement our voice-router solution.
The voice-proxy (or router, if you wish) algorithm will thus be:
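In rough Python terms, the routing logic could be sketched like this (a hedged sketch, not the post’s actual implementation; the payload shape, the regex, and the fallback URL are illustrative assumptions):

```python
import re

# In-memory map of conversationId -> developer fulfiller URL.
sessions = {}

DEFAULT_TARGET = "https://stable-staging-runtime"  # assumed staging fallback


def resolve(dev_key):
    # Placeholder: how X maps to a developer URL is addressed below
    # (a static dictionary, or an ngrok-subdomain convention).
    return f"https://{dev_key}-address"


def route(body):
    """Return the fulfiller URL this request should be forwarded to."""
    conv = body["conversation"]
    if conv["type"] == "NEW":
        query = body["inputs"][0]["rawInputs"][0]["query"]
        # "Talk to my skill on X" -> capture the trailing X, if present
        match = re.search(r"\bon\s+(\S+)\s*$", query)
        target = resolve(match.group(1)) if match else DEFAULT_TARGET
        sessions[conv["conversationId"]] = target
    # Subsequent requests in the session reuse the remembered target.
    return sessions.get(conv["conversationId"], DEFAULT_TARGET)
```

A new session saying “Talk to my skill on dev1” would get pinned to dev1’s runtime, and every follow-up request carrying the same conversationId would be forwarded there too.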
To make this work, we would need 3 things to happen:
- The voice-proxy code needs to be publicly accessible on the internet so that
- DialogFlow’s Webhook URL can be configured to point to it and
- There is a way to map X (from our “Talk to my skill on X” placeholder) to a publicly accessible URL for each developer runtime
The first 2 points are easy enough to deal with: we host the voice-proxy code on some VM, make it accessible on the internet, and use that URL in DialogFlow’s webhook setting. DialogFlow does require the HTTP endpoint to be secure, but nowadays this can be easily (and freely) accomplished via Letsencrypt.
Point 3 is a little bit more difficult, since it would require the voice-proxy to keep a mapping between all developers and their publicly accessible runtimes. This could be a dictionary data structure like so:
- “dev1” : “https://dev1-address”
- “dev2” : “https://dev2-address”
- etc …
This would work when someone says “Talk to my skill on dev1”, since the voice-proxy would know, from then onward, that the conversation session needs to be forwarded to “https://dev1-address”.
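In Python, the dictionary approach might amount to something as simple as this (names and URLs are illustrative):

```python
# Hypothetical static mapping from developer keys to their public runtimes.
dev_runtimes = {
    "dev1": "https://dev1-address",
    "dev2": "https://dev2-address",
}


def resolve(dev_key):
    # "Talk to my skill on dev1" -> pin the session to dev1's runtime
    return dev_runtimes[dev_key]
```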
The problem here is that these key-value pairs would need to be updated every time a developer joins, changes, or leaves the project.
A somewhat better approach would be to make some assumptions about how each developer can be reached. For instance, if we knew that all developers used a reverse-TCP tunnel service like ngrok (which is, by the way, free), then we could have X be the actual subdomain assigned to that particular developer.
E.g. if Dev 1 starts a local dev runtime and makes it publicly accessible with ngrok at https://2e3ac9d.ngrok.com, and Dev 2 does the same for their local code, with ngrok assigning them https://4fd8320c.ngrok.io, then we could code the voice-proxy so that when someone (it doesn’t matter who) says “Talk to my skill on 2e3ac9d”, the request, along with all future ones in that session, gets routed to Dev 1. Likewise, “Talk to my skill on 4fd8320c” would do the same for Dev 2.
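Under that convention, the resolution step collapses to a one-liner (a sketch assuming every developer tunnels through an `.ngrok.io` subdomain):

```python
def resolve(dev_key):
    # Convention-based resolution: X in "Talk to my skill on X" is assumed
    # to be the developer's ngrok subdomain, so no dictionary upkeep needed.
    return f"https://{dev_key}.ngrok.io"
```

So “Talk to my skill on 2e3ac9d” resolves straight to that developer’s tunnel without the proxy ever being told who the developers are.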
So we don’t need to update any dictionary and, better yet, we don’t need to manually secure the communication link (via HTTPS), since ngrok already provides this for free, out of the box.
Bonus! You could have the voice-proxy default to a staged, stable runtime when no developer is specified (e.g. just “Talk to my skill”). That way, the testers/QAs can be kept happy.
The resulting configuration map would look something like this:
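Since the original screenshot isn’t reproduced here, the proxy’s live session map might look something like this at runtime (session IDs and URLs are illustrative):

```python
# conversationId -> forwarding target, as accumulated by the voice-proxy:
sessions = {
    "ABwppHH-session-1": "https://2e3ac9d.ngrok.io",        # Dev 1's session
    "ABwppHH-session-2": "https://4fd8320c.ngrok.io",       # Dev 2's session
    "ABwppHH-session-3": "https://stable-staging-runtime",  # a QA session
}
```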
And there you have it: we can now do native, multi-dev work on DialogFlow skills. Easy, right?
One thing left to mention here, though, is that this won’t work for Alexa Skills, because Amazon does not send the textual query to the external fulfiller. This is by design. Oh well … better luck using something else there. 😇