Chat room translation using Watson, MQTT, OpenWhisk, and Twilio
The motivation behind this project comes from playing one of my favorite Android games, “Rusted Warfare”, one of the few real-time strategy games on the mobile marketplace. It’s very similar to “Age of Empires” and “Command and Conquer”. It’s a lot of fun, but multiplayer games require effective communication for strategizing and teamwork. That tends to be difficult because many of the players speak different languages, as we can see in the gameplay screenshot below.
Our proposed solution is to let each player define their preferred language, and then send and receive all messages in that language. In the scenario in the gameplay screenshot, I’d see each of the incoming messages in English, Gabriel would see theirs in Spanish, etc.
This can be accomplished by reimplementing the game chat clients to send and receive messages via an MQTT messaging broker. As messages are received by the broker, they’re transcribed and/or translated by a series of serverless functions, and the results are then broadcast to the subscribed clients. The structure of the MQTT channel each client publishes to describes which “event” has occurred, and therefore which serverless functions will be called to handle the event.
So for example, an audio message from a device used by an English-speaking client can be published to an MQTT channel whose topic specifies the message type and language.
A serverless action bound to the fromClient/voice/+ channel (+ representing a wildcard variable) will process each incoming message by identifying the input language if it’s not specified in the topic, transcribing it, and then translating it to the other languages. Once translated, each result is published to the toClients/<format>/<language> channels, which broadcast the translated result to the subscribed clients in the format/language of their choice.
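To make the topic convention concrete, here is a minimal sketch of how a subscription filter such as fromClient/voice/+ could be matched against a published topic, and how the outbound toClients/<format>/<language> topics could be built. The function names and the language codes ("en", "es") are assumptions made for illustration; in a real deployment the broker performs the wildcard matching itself.

```python
from typing import List


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    Handles the single-level wildcard '+' and the multi-level wildcard '#'.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' matches everything from this level down.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # Without '#', both sides must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


def outbound_topics(fmt: str, languages: List[str]) -> List[str]:
    """Build one toClients/<format>/<language> topic per target language."""
    return [f"toClients/{fmt}/{lang}" for lang in languages]


if __name__ == "__main__":
    # Hypothetical topic for a voice message from an English-speaking client.
    print(topic_matches("fromClient/voice/+", "fromClient/voice/en"))
    print(outbound_topics("text", ["en", "es"]))
```

Because the event type lives in the topic itself, adding a new message format only requires binding another action to a new filter, without touching the existing ones.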
- Message received from a client, which can be a web browser, CLI, OpenWhisk action, SMS text, etc.
- If message payload contains an audio file, it is transcribed to text.
- Transcribed text is translated to other supported languages.
- If the message was sent via SMS, the sender’s phone number is added to an etcd key/value store. etcd is used here to maintain a list of subscribers’ phone numbers, as well as their respective languages. An adjustable TTL value removes a number from the store if the subscriber does not participate in the conversation for 300 seconds.
- Translated messages/audio streams are published to various channels on the MQTT broker, which then distributes the messages amongst subscribing clients.
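The flow above can be sketched as a single handler. This is only an illustration under stated assumptions: the transcription, language identification, translation, and publish steps are passed in as plain callables (hypothetical stand-ins for the Watson services and the MQTT client), and the list of supported languages is made up for the example.

```python
from typing import Callable, Dict, Optional

# Assumption: the target languages; the real list depends on the deployment.
SUPPORTED_LANGUAGES = ("en", "es", "fr")


def handle_message(
    topic: str,
    payload: bytes,
    transcribe: Callable[[bytes], str],
    identify_language: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    publish: Callable[[str, str], None],
) -> Dict[str, str]:
    """Process one fromClient/<format>/<language> message.

    Returns a mapping of outbound topic -> published text.
    """
    levels = topic.split("/")
    fmt = levels[1]
    source: Optional[str] = levels[2] if len(levels) > 2 and levels[2] else None

    # Audio payloads are transcribed; text payloads are used as-is.
    text = transcribe(payload) if fmt == "voice" else payload.decode("utf-8")

    # Fall back to language identification when the topic does not say.
    if source is None:
        source = identify_language(text)

    published: Dict[str, str] = {}
    for target in SUPPORTED_LANGUAGES:
        # No translation is needed for clients using the sender's language.
        result = text if target == source else translate(text, source, target)
        out_topic = f"toClients/text/{target}"
        publish(out_topic, result)
        published[out_topic] = result
    return published
```

Passing the services in as arguments keeps the handler easy to exercise locally with fake functions before it is wired to real APIs.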
By using the MQTT broker as a mediator, we are able to decouple our logical components, which lets us move away from a more traditional “call stack” and towards a truly “event driven” architecture. This decoupling makes our architecture’s logic more modular and independent. This is great for agile development: since components are independent and isolated, each can be updated in place without affecting the others or having to restage/push the entire system.
We also chose to extend the system to support SMS clients by using Twilio. Twilio lets us trigger webhooks whenever a call or text is made to a registered phone number. So in this case, we have a Twilio “messaging” number that waits for incoming SMS messages and forwards the message information (sender number, city, body, etc.) to the OpenWhisk sequence.
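As a rough sketch of the SMS path, the snippet below parses a form-encoded webhook body and records the sender in a subscriber store with a 300 second expiry. The field names From, Body, and FromCity are the ones Twilio sends with incoming-message webhooks; everything else is an assumption for illustration, including the in-memory SubscriberStore class, which merely stands in for the etcd store and its TTL.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import parse_qs


def parse_sms_webhook(body: str) -> Dict[str, str]:
    """Extract sender, text, and city from a form-encoded webhook body."""
    fields = parse_qs(body, keep_blank_values=True)
    return {
        "sender": fields["From"][0],
        "text": fields["Body"][0],
        "city": fields.get("FromCity", [""])[0],
    }


class SubscriberStore:
    """In-memory stand-in for the etcd store: phone number -> language, with a TTL."""

    def __init__(self, ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def touch(self, number: str, language: str) -> None:
        """Add a subscriber, or refresh their expiry when they send a message."""
        self._entries[number] = (language, self._clock() + self._ttl)

    def active(self) -> Dict[str, str]:
        """Drop expired subscribers and return the remaining number -> language map."""
        now = self._clock()
        self._entries = {n: e for n, e in self._entries.items() if e[1] > now}
        return {n: lang for n, (lang, _) in self._entries.items()}
```

Each incoming text would call touch() for its sender, so a subscriber stays in the conversation as long as they keep participating and silently drops out once the TTL lapses.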
We’ve deployed a UI that can be accessed at https://translation-mqtt.mybluemix.net. Since the logic is maintained by serverless actions, the UI is not required for the services to work; it simply provides an accessible MQTT client and a way to capture/send/receive audio via websocket. Since we’re utilizing websockets, voice input can be captured continuously, which allows for a more natural, free-flowing conversation.
We have also registered a Twilio number to demonstrate the SMS integration. Send a text to (310)340–2202 in any language, and your phone number will be added as a subscriber for 5 minutes.
Use Cases / Next Steps
In addition to handling game/messaging chat clients, this system could be useful for live streaming scenarios, such as sporting broadcasts, political hearings, university classes, podcasts, etc.
We’re looking into adding VoIP (Voice over IP) capabilities from Twilio, which would let the system handle multilingual conference calls.
We’ll also investigate Lyrebird, which is an API that has the ability to “learn” and mimic speech patterns by recording and analyzing the user’s voice. If Lyrebird can be used to mimic a user’s speech and tone in different languages, that’ll make the conversation flow much more naturally.
If that’s possible, it’d be interesting to experiment with music, possibly feeding in an artist’s discography, acoustic recordings, and interviews to train Lyrebird to better differentiate between their voice and the melody. This might open a world of possibilities, such as use at concerts or silent parties, where the same song could be processed in real time and broadcast in different languages. As an alternative to Lyrebird, we may also look into modifying the pitch of the speaker’s voice using the following project.