Make Audio Recordings with Actions on Google

Did you know that you can let your users record their own audio for a Google Assistant Action? In this article, we’ll show you how to create a simple prototype that allows users to make an audio recording that can then be played back on the Google Assistant.

We need a way for users to record audio. However, we didn't want to make users install another app, so we decided to use a web page instead.

Actions on Google supports playback of audio with either SSML, which is limited to 180 seconds, or with a media player for longer tracks. So, all we need to do is to create a recording in the supported media file format and then play it back as part of a response in an Action.
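As a sketch, playing a recording back via SSML looks like this (the URL here is a placeholder for wherever the file ends up):

```xml
<speak>
  Here is the recording:
  <audio src="https://example.com/recording.ogg"/>
</speak>
```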

The design

The design for our prototype consists of two parts:

  • A web client for recording and uploading the audio files to Google Cloud Storage, a massively scalable object storage service.
  • An Action that plays the audio file from Cloud Storage.

Since the Actions on Google SSML audio tag requires an audio file in OGG format for playback, the recorded audio has to be converted to that format before it is uploaded to Cloud Storage.

The UI for the web app provides a button that the user presses to start the recording and then presses again to stop the recording.

Since we want to control which users can upload recordings, we will also provide a way for users to sign in before they’re allowed to make a recording.

Progressive web app

The web app is implemented as a single page Progressive Web App (PWA). The main advantage for our use case is that users can install our web app on their mobile device home screens with its own icon. When the user clicks on the icon, the web app is launched full-screen and behaves like a native app.

We provide an app manifest for the PWA to specify the icon and app metadata:

{
  "short_name": "Audio Demo",
  "name": "Audio Demo",
  "icons": [
    {
      "src": "images/android-desktop.png",
      "sizes": "192x192",
      "type": "image/png"
    }
  ],
  "start_url": "/index.html",
  "display": "fullscreen",
  "orientation": "portrait"
}
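For the browser to pick the manifest up, the page's HTML has to reference it (the filename manifest.json is an assumption about how we named the file):

```html
<link rel="manifest" href="/manifest.json">
```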

User authentication

Since we want to track which users are uploading content, and to enforce Cloud Storage security rules for the uploaded data, we need to ask the user to sign in before uploading any recordings. To authenticate the user, we use FirebaseUI for Web, which provides a drop-in sign-in UI for web pages that use Firebase Auth. The following scripts and styles are required:

<script src="https://www.gstatic.com/firebasejs/ui/3.5.2/firebase-ui-auth__en.js"></script>
<link type="text/css" rel="stylesheet" href="https://www.gstatic.com/firebasejs/ui/3.5.2/firebase-ui-auth.css" />
<script src="https://cdn.firebase.com/libs/firebaseui/3.5.2/firebaseui.js"></script>
<link type="text/css" rel="stylesheet" href="https://cdn.firebase.com/libs/firebaseui/3.5.2/firebaseui.css" />

The FirebaseUI widget then needs to be initialized:

const ui = new firebaseui.auth.AuthUI(firebase.auth());
ui.start('#firebaseui-auth-container', getUiConfig());

Firebase Auth supports a wide range of sign-in providers, but for this app we’re only going to use Google Sign-In:

const getUiConfig = () => {
  return {
    signInSuccessUrl: '/',
    signInOptions: [
      {
        provider: firebase.auth.GoogleAuthProvider.PROVIDER_ID,
        authMethod: 'https://accounts.google.com',
        clientId: null
      },
    ],
    tosUrl: '/tos.html',
    privacyPolicyUrl: '/privacy.html'
  };
};

We keep track of the signed-in user by listening to the authentication state events. A state change callback is triggered when a user signs in and out of the app:

firebase.auth().onAuthStateChanged((authUser) => {
  user = authUser;
  authUser ? handleSignedInUser(authUser) : handleSignedOutUser();
});

Once the user has signed in, we display two buttons that let the user sign out or delete their account for the app.
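As a sketch of how those buttons can be wired up (the function name and button parameters are our own; signOut() and currentUser.delete() are the actual Firebase Auth calls):

```javascript
// Sketch: wire up the sign-out and delete-account buttons.
// `auth` is the firebase.auth() instance; the button elements come from
// our page's markup (the names here are assumptions, not Firebase APIs).
function wireAccountButtons(auth, signOutButton, deleteButton) {
  signOutButton.addEventListener('click', () => auth.signOut());
  deleteButton.addEventListener('click', () => {
    // delete() can fail with 'auth/requires-recent-login'; in that case
    // we simply sign the user out so they can re-authenticate.
    auth.currentUser.delete().catch(() => auth.signOut());
  });
}
```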

Recording audio

To get access to the microphone from a web page, we need to use the MediaDevices getUserMedia() method, which prompts the user for permission to use a media input device to record audio:

navigator.mediaDevices.getUserMedia({audio: true, video: false})
  .then(function(stream) {
    ...
  }).catch(function(err) {
    ...
  });

Most modern browsers support the Web Audio API, which provides advanced audio playback, recording, and visualization support. WebAudioRecorder.js is a JavaScript library that uses the Web Audio API to record audio input and supports encoding to several audio file formats, including OGG.

To use WebAudioRecorder.js, the following script needs to be loaded in the HTML:

<script src="javascripts/WebAudioRecorder.min.js"></script>

In the getUserMedia callback, invoked after the user has given permission to access the microphone, an audio recorder object is created:

var audioContext = new AudioContext();
var audioSource = audioContext.createMediaStreamSource(stream);
webAudioRecorder = new WebAudioRecorder(audioSource, {
  workerDir: 'javascripts/',
  encoding: 'ogg',
  onEncoderLoading: (recorder, encoding) => {
    console.log('onEncoderLoading');
  },
  onEncoderLoaded: (recorder, encoding) => {
    console.log('onEncoderLoaded');
  },
  onEncodingProgress: (recorder, progress) => {
    console.log('onEncodingProgress: ' + progress);
  },
  onComplete: (recorder, blob) => {
    console.log('onComplete');
    persistFile(blob);
  }
});

The audio recorder object uses a web worker to load the following WebAudioRecorder.js encoder files from the javascripts/ directory:

  • WebAudioRecorderOgg.min.js
  • OggVorbisEncoder.min.js.mem

The audio recorder object is then configured to meet the OGG requirements for the Actions on Google media player:

webAudioRecorder.setOptions({
  timeLimit: 180, // max number of seconds for recording
  encodeAfterRecord: true, // encode the audio data after recording
  ogg: {
    bitRate: 160 // 160 kbps bitrate
  }
});
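The recorder object exposes startRecording() and finishRecording() (real WebAudioRecorder.js methods); a minimal sketch of the press-to-toggle button logic could look like this:

```javascript
// Sketch: one button toggles recording on and off.
// `webAudioRecorder` is the recorder created earlier; because
// encodeAfterRecord is set, finishRecording() triggers encoding and
// eventually the onComplete callback with the OGG blob.
let isRecording = false;

function toggleRecording() {
  if (!isRecording) {
    webAudioRecorder.startRecording();
  } else {
    webAudioRecorder.finishRecording();
  }
  isRecording = !isRecording;
  return isRecording; // true while a recording is in progress
}
```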

Cloud Storage

Once WebAudioRecorder.js has completed the encoding, its onComplete event handler is invoked with the blob data for the encoded file. We will use Cloud Storage for Firebase to persist the file and make it publicly available over HTTP. Firebase provides client SDKs for web and mobile apps, built on top of products like Google Cloud Storage.

The following scripts are required to use Cloud Storage:

<script src="https://www.gstatic.com/firebasejs/5.8.1/firebase-app.js"></script>
<script src="https://www.gstatic.com/firebasejs/5.8.1/firebase-storage.js"></script>

We then initialize Cloud Storage access using Firebase:

var storageService = firebase.storage();
var storageRef = storageService.ref();
var metadata = {
  contentType: 'audio/ogg' // OGG MIME type
};

We then use an uploadTask to track the progress of the file being uploaded to Cloud Storage:

var uploadTask = storageRef.child('files/' +
    (new Date().toISOString()) + '.ogg').put(blob, metadata);
uploadTask.on(firebase.storage.TaskEvent.STATE_CHANGED,
  (snapshot) => {
    var progress =
        (snapshot.bytesTransferred / snapshot.totalBytes) * 100;
    console.log('Upload is ' + progress + '% done');
    switch (snapshot.state) {
      case firebase.storage.TaskState.PAUSED:
        console.log('Upload is paused');
        break;
      case firebase.storage.TaskState.RUNNING:
        console.log('Upload is running');
        break;
    }
  }, (error) => {
    ...
  }, () => {
    // Upload completed successfully
    ...
  });

Cloud Storage security rules are used to require Firebase Authentication for any read or write operations on the uploaded files. Once a file has been uploaded, we need to retrieve the HTTP URL for accessing it and persist that in a database to keep track of all the recordings. For this we use Cloud Firestore, a cloud-hosted, NoSQL, realtime database: each upload gets a document in a collection called “files” that contains the metadata for the audio stored in Cloud Storage. The following script is required to use Cloud Firestore:

<script src=
"https://www.gstatic.com/firebasejs/5.8.1/firebase-firestore.js">
</script>

For each recording we track the user ID, a durable HTTPS URL that anyone can use to download the contents of the file, and a timestamp:

const db = firebase.firestore();
uploadTask.snapshot.ref.getDownloadURL().then((downloadURL) => {
  console.log('File available at: ' + downloadURL);
  db.collection("files").add({
    user: user.uid,
    url: downloadURL,
    timestamp: firebase.firestore.FieldValue.serverTimestamp()
  })
  .then((docRef) => {
    console.log("Document written with ID: ", docRef.id);
  })
  .catch((error) => {
    console.error("Error adding document: ", error.message);
  });
});

Notice that we’re adding the authenticated user’s UID as a field in the new document. This lets the Firestore security rules for this project determine who is allowed to later modify or delete the document.
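As a sketch (the article doesn't show the project's actual rules, so treat this as an illustration rather than the deployed configuration), matching Firestore security rules could look like this:

```
service cloud.firestore {
  match /databases/{database}/documents {
    match /files/{fileId} {
      // Anyone may read recording metadata; only the signed-in owner
      // may create, modify, or delete a document.
      allow read;
      allow create: if request.auth != null
                    && request.resource.data.user == request.auth.uid;
      allow update, delete: if request.auth != null
                            && resource.data.user == request.auth.uid;
    }
  }
}
```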

So, that covers the main features of our web app. Now we can move on to the design of our Action.

Assistant Action

Next, we’ll implement a conversational Action using Dialogflow. When the user invokes the Action, the latest recording is played back to the user using the Actions on Google media player. So, the Dialogflow agent for our Action is very simple in that it mostly just needs a main welcome intent. The welcome intent handler responds with an SSML audio tag.

For fulfillment, we use the Dialogflow inline editor, which automatically provisions a Cloud Function for your agent. The Node.js code for the function needs to use the Firebase Admin SDK, which provides access to Firebase and Google Cloud resources in server-side code. We will use it to read the latest recording data from Cloud Firestore:

const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

The intent handler then reads the data from Firestore and generates a response that uses the SSML audio tag to play the latest recording. For this prototype, we use a simple Cloud Firestore query to determine the latest recording, but a production-quality app would need to track uploads and their status more granularly:

app.intent('Default Welcome Intent', (conv) => {
  return db.collection('files')
    .orderBy('timestamp', 'desc').limit(1).get()
    .then(snapshot => {
      if (snapshot.size > 0) {
        snapshot.forEach(doc => {
          conv.close(`<speak>
            <par>
              <media xml:id="intro">
                <speak>Welcome to the Audio Demo.
                  Here's the latest recording:</speak>
              </media>
              <media xml:id="introSound" begin="intro.end+0.5s"
                  soundLevel="5dB" fadeOutDur="1.0s">
                <audio src="${INTRO_SOUND_URL}"/>
              </media>
              <media xml:id="recording" begin="introSound.end+0.5s">
                <audio src="${doc.data().url.replace(/&/g, '&#38;')}"/>
              </media>
              <media xml:id="endSound" begin="recording.end+0.5s">
                <audio src="${OUTRO_SOUND_URL}"/>
              </media>
              <media xml:id="bye" begin="endSound.end+1.0s">
                <speak>Bye for now. Hope to see you soon.</speak>
              </media>
            </par>
          </speak>`);
        });
      } else {
        conv.close('There are currently no recordings. ' +
            'Please try again later.');
      }
    })
    .catch(err => {
      console.log('Error getting documents', err);
      conv.close('Oops! Something went wrong. ' +
          'Please try again later.');
    });
});

Note that the HTTP URL for the audio recording contains ‘&’ characters, which will clash with the SSML syntax and need to be encoded to ‘&#38;’.
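For illustration, that escaping can be pulled into a small helper (the function name is our own; the replacement itself is exactly what the fulfillment code above does inline):

```javascript
// Escape characters that would clash with SSML (XML) syntax when a URL
// is embedded in an attribute value: '&' becomes the entity '&#38;'.
function escapeSsmlUrl(url) {
  return url.replace(/&/g, '&#38;');
}
```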

If you want to know more about the powerful capabilities of SSML, then read our previous post “Advanced SSML for Actions on Google”.

Next steps

Our web app and Action are quite simple, but they could be expanded into various other use cases — maybe a CMS for podcasters, a voice social network, or even some kind of collaborative voice game?

We’ve shown how easy it is to let users record audio that can be used in Actions. If you want to give users more control over the playback, you can update the fulfillment code to use the media player instead. Now it’s up to you to take this code and turn it into something more interesting.

The code has been open sourced on GitHub for you to customize. We can’t wait to see what you come up with!

Want more? Head over to the Actions on Google community to discuss Actions with other developers. Join the Actions on Google developer community program and you could earn a $200 monthly Google Cloud credit and an Assistant t-shirt when you publish your first app.