Box Skills + IBM Watson Speech to Text Tutorial

Published in Box Developer Blog, Feb 8, 2023

If you have been following Box Platform’s content the past few months, read the many other amazing technical posts on the subject, or even just gone to the movie theatre, you’ll know machine learning and artificial intelligence have been at the forefront of everyone’s mind.

As I stated in my Box Skills overview blog back in November, Box Skills were introduced in 2017, and since that time, over 1000 skills have been used across various industries to bring intelligence to Box content.

More recently, I put out a tutorial using Box Skills + OpenAI. If you haven’t seen that, make sure to check it out. Continuing on the train of knowledge from Terminator to M3GAN, I’m taking us back to the beginning of Box Skills with a tutorial using IBM Watson.

IBM and Box are strategic partners, and when used together, they provide powerful insights across the Content Cloud landscape, not least of which can come from a Box Skill + IBM Watson integration. That’s why IBM was an original partner when Skills launched all those years ago!

Let’s get this train rolling!

IBM Watson Overview

Watson was originally released back in 2010 as a question-answering service, and IBM has drastically expanded its capabilities since then. Back in October of 2022, IBM released new embeddable libraries that developers can use directly in their applications, without having to call a separate cloud service.

Watson has three main services: speech to text (STT), text to speech (TTS), and natural language processing (NLP). STT can be used to transcribe audio into written text, as well as identify keywords spoken. TTS can speak written text, and NLP can analyze language for sentiment.

For the tutorial today, I will be using the speech to text service.

Solution Diagram

Below you will see the diagram for what we are going to build, as well as a screenshot showing the final result we are going for.

This solution can be altered slightly for many use cases across industry verticals, but today, I will show an example for a fictional financial management company, FMC.

At FMC, there is a ton of content being created across the organization, including regular soundbites used in podcasts and social media that discuss hot financial topics. Upper leadership wants to use Box Skills and IBM Watson to make sure certain topics aren’t being talked about in these soundbites in order to legally protect the company from providing bad financial advice.

Once a soundbite is uploaded, moved, or copied into the folder configured for the Box Skill, Box calls a serverless HTTP function in Google Cloud Platform. The function downloads the audio and sends it to the IBM STT service, along with a list of keywords to search for. STT processes the audio and sends back the transcript and keyword data. After the serverless function parses the response, it writes the transcript and keyword information back to the audio file in Box as Box Skills cards.

Once complete, anyone who looks at the Skills cards on the file can see a time-stamped transcript, with keywords identified. If you click on a statement in the transcript or a specific keyword, the audio player jumps to that timestamp, so you don’t have to scroll manually or listen to the entire audio file.

Setup Box Skill

Create a Box developer account (optional but recommended)

If you don’t have a Box enterprise account or developer account already, you can sign up for a free developer account here. I recommend using the developer account for the tutorial instead of using your production environment.

Please note that you cannot sign up with an email address that is already tied to an existing Box account, because email addresses must be unique across all of Box.

Create the Box Skill

Navigate to the Developer Console, and click Create New App.

Select Box Custom Skill.

Developer console application selection screen

Give the application a name, and click Create App.

After creating the application, you will see the below screen. The red box is where you will put the URL where you would like the Box Skills payload to go. We will add this URL later on.

In the security keys tab, you will find two keys that can be used to verify that Box is the service that called the serverless function.

Enable/Authorize the Box Skill (completed by admin)

Just like other application types, an administrator of your Box instance will need to enable and authorize the Box Skill in the skills section of the admin console. You will need to provide the admin with the Client ID of the application, which is found in the Box Skill configuration screen. If you set up a developer account at the beginning of this tutorial, the user account created will be the administrator of the instance.

You will also need to provide the name(s) and owner of the folder(s) whose content should trigger the Skill. If you haven’t set up a folder for the Skill to monitor yet, you will want to do that prior to requesting authorization from your admin.

On the Skills Admin Console screen, click Add Skill.

Enter the Client ID of the Skill, and click Next.

Select whether the skill should run for all content or a subset of folders.

For (a) specific folder(s), filter the pop-up by user and folder name. Check the folder(s) for which the Skill should be triggered.

Confirm selections, and click Enable.

Setup IBM STT

Create an IBM account

If you don’t already have one, head on over to the IBM website to create an account for their cloud services. An account is free, and you won’t be charged unless you use their services in excess of the free tier.

Create STT service

Once at the dashboard, search for the speech to text service in the top bar and click it.

If you don’t already have a service created, you will be able to create a new one as shown in the screenshot below.

After creation, you’ll be able to see your API key and URL. Save these to use in the serverless.yml configuration file later on.

Setup GCP Account

This tutorial will be using the awesome Serverless Framework to deploy our code.

Before continuing, you will need to set up a GCP account with a billing method attached. I won’t be going over all of those steps here, but you can find them on the Serverless website. Make sure to complete all the steps, including creating a project, enabling the APIs, creating a service account, and downloading a JSON key file.

Save the JSON key file for the next step.

Deploy GCP Function

Download or fork the code for the GCP function to your computer wherever you place your typical code projects.

Open the project in a code editor like Visual Studio Code.

Drag the JSON key file that was downloaded from the step above into the .gcloud folder and rename it serverless.json.

Update the serverless.yml file to have the configuration and naming information for your GCP account, Box Security keys, and IBM STT service created earlier.

This is also where you update the keywords to find in the audio files sent to the service. It needs to be a comma-delimited list. For example, you could put “dollars, guarantee, return, investment” in the environment variable to flag any instances where those words are said in the audio file.
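To make the comma-delimited format concrete, here is a minimal sketch of how such a value could be turned into an array before it is handed to the STT service. The helper name parseKeywords and the variable name STT_KEYWORDS are hypothetical, chosen for illustration; check the serverless.yml in the repository for the real variable name.

```javascript
// Hypothetical helper (not taken from the tutorial's index.js): turn a
// comma-delimited environment value into a clean array of keywords.
function parseKeywords(raw) {
  return (raw || "")
    .split(",")
    .map((word) => word.trim())          // tolerate "a, b" as well as "a,b"
    .filter((word) => word.length > 0);  // drop empties from stray commas
}

// STT_KEYWORDS is an assumed name; the fallback is the example list from the text.
const keywords = parseKeywords(
  process.env.STT_KEYWORDS || "dollars, guarantee, return, investment"
);
console.log(keywords);
```

Trimming each entry means it doesn’t matter whether you put spaces after the commas in the configuration file.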

The Box primary and secondary keys come from the developer console’s security keys section I mentioned a few sections up. It is important to use these keys to make sure only Box can run the serverless function’s code.

In the terminal, run npm install, then sls deploy. Deployment can take several minutes, especially the first time. When it completes, you will get back an invocation URL. Copy and paste it into the URL field on the configuration tab of the Box Skill application in the Developer Console (the field highlighted earlier).

Visit the GCP console to see that your serverless function is active. You also need to add an additional permission to the function so that Box can call it. Click permissions > add. Type “allUsers” in the new principals box and select the Cloud Functions Invoker role. Click Save.

Upload an audio file in which you say some of the words from your keyword list to the Box folder configured earlier. If you don’t have one, you can easily make one using an online voice recorder. Open the file in Box to see the transcript and keywords added to Box Skills cards.

You can also check the logs in the serverless function for verification that the process completed successfully.

Appendix

In the above tutorial, I just show deploying the code as written, but if you are interested in the technical details, feel free to keep reading below about how the GCP function works.

For reference, this is the index.js file.

index.js file

Security & Configuration

On line 17 in the index file, I use the validateWebhookMessage method and the security keys I mentioned earlier to verify that Box is the entity calling the serverless function. This is vital to any Skills code you run; otherwise, anyone who knows the invocation URL could trigger the function.
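If you are curious what that kind of check involves, here is a rough, hand-rolled sketch using only Node’s crypto module. It assumes Box’s documented signature scheme (an HMAC-SHA256 over the raw body followed by the delivery timestamp, base64-encoded, sent in the box-signature-primary and box-signature-secondary headers) and a ten-minute freshness window. The function names are hypothetical; in real code, prefer the SDK’s validateWebhookMessage.

```javascript
const crypto = require("crypto");

// Assumed scheme: HMAC-SHA256 over the raw body bytes followed by the
// delivery timestamp string, base64-encoded.
function signPayload(rawBody, timestamp, key) {
  return crypto
    .createHmac("sha256", key)
    .update(rawBody)
    .update(timestamp)
    .digest("base64");
}

// Constant-time comparison that tolerates missing or mismatched-length values.
function safeEqual(a, b) {
  const bufA = Buffer.from(a || "");
  const bufB = Buffer.from(b || "");
  return bufA.length === bufB.length && crypto.timingSafeEqual(bufA, bufB);
}

// Hypothetical stand-in for the SDK helper. Header names are lower-case,
// which is how Node exposes incoming HTTP headers.
function isFromBox(rawBody, headers, primaryKey, secondaryKey, maxAgeSeconds = 600, now = Date.now()) {
  const timestamp = headers["box-delivery-timestamp"];
  if (!timestamp) return false;

  // Reject stale or unparseable timestamps to limit replay attacks.
  const ageMs = now - Date.parse(timestamp);
  if (Number.isNaN(ageMs) || ageMs > maxAgeSeconds * 1000) return false;

  // Either key matching is enough, which allows key rotation.
  const primaryOk = primaryKey &&
    safeEqual(signPayload(rawBody, timestamp, primaryKey), headers["box-signature-primary"]);
  const secondaryOk = secondaryKey &&
    safeEqual(signPayload(rawBody, timestamp, secondaryKey), headers["box-signature-secondary"]);
  return Boolean(primaryOk || secondaryOk);
}
```

Note that the signature is computed over the raw request body, so the check has to run on the exact bytes Box sent, not on a re-serialized JSON object.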

On lines 30 and 31 in the index file, I do something different than I have in previous tutorials. Here, I’m checking that the submitted file is in one of the formats listed in the serverless.yml file (mp3, wav, etc.) and that it is under a size limit that is also set in the serverless.yml file. If either check fails, the Skills Kit automatically throws an error and writes the information back to the Skills card for you, so no extra if statement is needed. It also means you don’t have to use the file extension limiter in the developer console configuration page.
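The idea behind those two checks can be sketched in plain JavaScript as follows. This is not the Skills Kit’s own implementation; checkFile and its parameters are hypothetical names used only to show the logic of "throw if the format or size is not acceptable".

```javascript
// Hypothetical helper mirroring the two checks described above.
// Throws on failure so the caller can report the problem in one place.
function checkFile(fileName, fileSizeBytes, allowedFormats, maxMegabytes) {
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();

  if (!allowedFormats.includes(extension)) {
    throw new Error(`Unsupported file format: "${extension}"`);
  }
  if (fileSizeBytes > maxMegabytes * 1024 * 1024) {
    throw new Error(`File is larger than ${maxMegabytes} MB`);
  }
  return extension;
}

// Example: an mp3 of about 2 MB against an assumed 100 MB limit.
try {
  console.log(checkFile("soundbite.mp3", 2 * 1024 * 1024, ["mp3", "wav"], 100));
} catch (err) {
  console.error(err.message);
}
```

Throwing instead of returning false keeps the happy path free of nested conditionals, which is the same benefit the Skills Kit approach gives you.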

Using the IBM Service

If you want to customize the parameters sent to the STT service, you would do so in the recognizeParams object starting on line 42. For example, I pass in a profanityFilter of true. If you want to keep the bad words in, set it to false rather than just removing the line: according to the IBM API docs, the service filters profanity by default.

You could also change the default model being used by passing in a model parameter; however, some models don’t have all of the features I’m using in the code, such as timestamps. This example uses the default US English en-US_BroadbandModel.

Find out about all of the options in the IBM API docs. If you do change things, be aware that some of the other results processing code that happens later will probably break.
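As an illustration, a parameters object along these lines could be assembled as shown below. This is a sketch rather than the code from index.js: buildRecognizeParams is a hypothetical helper, the 0.5 keyword threshold is an assumed value, and the property names follow the camelCase options of the ibm-watson Node SDK’s recognize call as I understand them.

```javascript
// Hypothetical helper that assembles options for the STT recognize call.
function buildRecognizeParams(audio, contentType, keywords, overrides = {}) {
  const params = {
    audio,                 // Buffer or readable stream with the file contents
    contentType,           // e.g. "audio/mp3" or "audio/wav"
    timestamps: true,      // word timings, needed for the transcript card
    profanityFilter: true, // matches the setting used in the tutorial
    ...overrides,          // e.g. { model: "en-US_NarrowbandModel" }
  };

  // Keyword spotting needs both a keyword list and a confidence threshold,
  // so only add the pair when there is something to look for.
  if (keywords.length > 0) {
    params.keywords = keywords;
    params.keywordsThreshold = overrides.keywordsThreshold ?? 0.5; // assumed value
  }
  return params;
}

const params = buildRecognizeParams(
  Buffer.from("fake audio bytes"),
  "audio/mp3",
  ["dollars", "guarantee"],
  { profanityFilter: false }
);
console.log(Object.keys(params));
```

Keeping the defaults in one place and passing changes as overrides makes it easier to see which options differ from the tutorial’s setup when something in the results processing breaks.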

Processing Results

Once results are returned from the STT service, they are processed in the addSkillsData method on line 78. The transcription and keyword information from IBM lives in several nested objects, hence the many for loops that you see in the code.

Every statement is its own results object with further nested keyword results and time stamps. All of these are combined into a map that is transformed into what Box expects for Skills card data starting on line 117.
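Here is a simplified sketch of that kind of flattening. It assumes the documented shape of an STT response (a results array whose items carry alternatives with a transcript and word timestamps, plus a keywords_result object) and produces entries with text and appears fields, which is my understanding of what Skills cards expect. The function name toCardEntries is hypothetical, and the real addSkillsData method may differ.

```javascript
// Hypothetical flattening of an STT response into card-style entries.
function toCardEntries(sttResult) {
  const transcript = [];
  const keywordMap = new Map(); // keyword -> list of { start, end }

  for (const result of sttResult.results || []) {
    // Use the first (best) alternative for each statement.
    const best = (result.alternatives || [])[0];
    const words = (best && best.timestamps) || []; // [word, start, end] triples
    if (best && best.transcript && words.length > 0) {
      transcript.push({
        text: best.transcript.trim(),
        appears: [{ start: words[0][1], end: words[words.length - 1][2] }],
      });
    } // statements without word timings are skipped in this sketch

    // Merge keyword hits across statements so each keyword appears once.
    for (const [keyword, hits] of Object.entries(result.keywords_result || {})) {
      const appears = keywordMap.get(keyword) || [];
      for (const hit of hits) {
        appears.push({ start: hit.start_time, end: hit.end_time });
      }
      keywordMap.set(keyword, appears);
    }
  }

  const keywords = [...keywordMap].map(([text, appears]) => ({ text, appears }));
  return { transcript, keywords };
}
```

Collecting keyword hits in a Map first is what lets a single keyword entry on the card list every place the word was spoken, instead of one entry per statement.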

Production Readiness

This tutorial was made to show the art of the possible, and as such should be used as a jumping-off point for using Box Skills and IBM Watson, not as production-ready code. It has not been tested at scale nor with the many use cases and audio files/sizes that could exist out there. Please keep that in mind if you want to implement this for production.

Also, it uses the IBM STT synchronous service, which means audio files sent in need to be under 100MB. They do have an asynchronous service meant for files over 100MB and under 1GB. That option requires a job to be created, but can still be done using a Box Skill.
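If you wanted to support both paths, the routing decision could look roughly like this. The helper chooseSttMode is hypothetical, the limits are the ones quoted above, and treating 1 MB as 1024 × 1024 bytes is an assumption on my part.

```javascript
const MB = 1024 * 1024; // assumed definition of a megabyte

// Hypothetical helper: pick an STT path based on the size limits above.
function chooseSttMode(fileSizeBytes) {
  if (fileSizeBytes < 100 * MB) return "sync";    // under 100MB
  if (fileSizeBytes < 1024 * MB) return "async";  // 100MB up to 1GB, needs a job
  return "too-large";                             // beyond what either path accepts
}

console.log(chooseSttMode(5 * MB));
console.log(chooseSttMode(500 * MB));
```

A check like this would sit naturally next to the file size validation, so oversized files are rejected or rerouted before any audio is downloaded.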

Thanks for checking out my tutorial using Box Skills with IBM Watson. Stay tuned for more Skills content coming soon!

Additional Resources

Developer guide on Box Skills

IBM STT documentation

GCP Functions

…And as always, feel free to post questions on our developer forum.