Analysing a Speech in VR using Speech-to-Text and Motion Tracking Technology

Dom Barnard
May 22, 2018 · 6 min read

It can be difficult to get feedback on a speech and even harder to quantify your performance. In this article, we use the VirtualSpeech app to review a speech on a range of specific criteria, from eye contact to pace of voice. The speech was just over 6 minutes long, giving the app a good amount of data (over 800 words) from which to provide feedback to the user.

The topic of the speech was how automotive companies can use virtual reality to reduce prototyping costs. The actual topic is not important, as the speech analysis can be applied to any speech or presentation for real-time feedback.

The virtual meeting room the user gave their speech in. The user's presentation was loaded onto the left wall, where the Welcome placeholder image is. The user pressed the ‘Start Analyse’ button to begin the speech analysis.

The virtual reality app served two purposes:

  1. Immerse the user in a realistic meeting room
  2. Provide real-time feedback to the user

The first point is important as it goes some way to recreating the fear and excitement you might experience when presenting in front of a real audience. In virtual reality, we can simulate distractions such as lighting changes, mobile phones going off and audience members talking to each other, along with a wide range of other scenarios you wouldn't encounter without practising in VR.

The second point is covered in depth in this article. The app gives feedback on these areas of the user's speech:

  • Pace of voice (how quickly the user is speaking)
  • Number of hesitation words
  • Volume (loudness) of the user's voice
  • Eye contact performance
  • Speech insights (not used in this speech)
  • Speech concepts (not used in this speech)

The user's performance is reviewed by the app according to the criteria above. We also discuss other useful features available to the user, such as saving the speech to listen back to later, and uploading the speech to the VirtualSpeech team for detailed feedback.

  • Speech title: How automotive companies can use virtual reality to reduce prototyping costs
  • Speech length: 6 minutes
  • Virtual environment: Meeting room with 11 audience members

Uploading your own presentation slides

Before starting the speech, the user uploaded their presentation slides into the VR app. This allowed the user to present with their own slides and use them as visual cues for the speech.

The method for uploading your own slides is fairly straightforward: create a presentation using PowerPoint, Keynote or similar software and save your slides as a PDF document. You can then upload the PDF to the app by emailing it to yourself, or by transferring it through iTunes (iPhone) or file transfer (Android).

Screenshot showing the user's presentation slides inside the meeting room.

Having your own slides in the virtual room with you helps you better prepare for an upcoming event, as you can work on getting the correct timing and pauses. To change the slides inside the app, you can either use the buttons to move to the next or previous slides, or press your VR headset trigger while looking at the slides to get them to change.

Speech analysis using speech-to-text technology

Receiving feedback is essential for improving public speaking skills and ensuring that each time you practice, you're becoming a better presenter. To do this, the app uses speech-to-text and other vocal technology to analyse the speech and provide feedback for the user. If you are planning on using your own speech-to-text software, we recommend the following:

Each API offers unique benefits and issues for analysing a speech and converting it to text.

Once the app has converted the speech to a body of text, several algorithms we created analyse it and provide meaningful, understandable results to the user. This allows the user to quantify their performance and improve areas of their speech each time.

The first thing you will notice when listening back to the audio (in the Appendix below) is that the user speaks very quickly, in particular towards the end. The user averages 141 words per minute, which is a little too fast: around 120 words per minute is preferable for a presentation.

The user used around 19 filler words, such as ‘um’ and ‘ah’, out of a total of around 860 words. This is not bad, as filler words are not always a negative in a speech (particularly in a conversation). When listening back to the audio, you'll notice quite a few of the filler words occurred when the user was thinking of what to say next. A silent pause would be preferable in this case.
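To illustrate how metrics like these can be derived from a transcript, here is a minimal sketch. The filler-word list and the function itself are illustrative assumptions, not the app's actual algorithm:

```python
# Illustrative sketch of pace and filler-word metrics from a transcript.
# The filler-word list is an assumption, not the app's actual list.
FILLER_WORDS = {"um", "uh", "ah", "er"}

def speech_metrics(transcript: str, duration_seconds: float) -> dict:
    words = transcript.lower().split()
    wpm = len(words) / (duration_seconds / 60)
    fillers = sum(1 for w in words if w.strip(".,!?") in FILLER_WORDS)
    return {
        "word_count": len(words),
        "words_per_minute": round(wpm),
        "filler_count": fillers,
    }

# Roughly 860 words over a 366-second (just over 6-minute) speech
# works out at the 141 words per minute reported above.
```

A real implementation would also need the speech-to-text engine to preserve hesitation sounds, since many APIs strip them out by default.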

Speech analysis results from the VirtualSpeech VR app.

Eye contact analysis and process

For eye contact analysis, the app assumes the eyes are looking directly forward from the head. In this way, when the user moves their head to look at something, the app assumes the eyes move as the head moves. If you watch presentations, you’ll notice this mostly holds true and is a fair assumption to make.

The app records the user's eye contact throughout the speech and then provides a heatmap of where the user was looking while speaking. This allows the user to easily see any areas they have neglected or focussed too much on.

The data for the heatmap is collected in two ways:

  • Head movement is extracted from the Google VR or Oculus VR software, which is used to build the VR experience.
  • The room is broken up into a grid made of hundreds of squares. When the user looks at one of these squares, the app increments the corresponding value in a matrix. After the speech has been completed, the weighted average of the matrix is calculated.

Combining these two sets of data gives an accurate record of where the user was looking throughout the speech or presentation.
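The grid-and-matrix step above can be sketched in a few lines. The grid dimensions and function names here are illustrative assumptions; the app's actual implementation is not published:

```python
import numpy as np

# Illustrative grid dimensions; "hundreds of squares" covering the room.
GRID_ROWS, GRID_COLS = 20, 40

def build_heatmap(gaze_samples):
    """gaze_samples: iterable of (row, col) grid cells that the head
    direction intersected, one sample per tracking frame."""
    heatmap = np.zeros((GRID_ROWS, GRID_COLS))
    for row, col in gaze_samples:
        heatmap[row, col] += 1  # count frames spent looking at this cell
    return heatmap

def gaze_share(heatmap, region):
    """Fraction of total gaze time spent inside a region, e.g. the
    cells covering one audience member."""
    rows, cols = zip(*region)
    return heatmap[rows, cols].sum() / heatmap.sum()
```

Comparing `gaze_share` across the regions covering each audience member is one plausible way to produce an evenness score like the 8/10 discussed below.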

The user performs well on eye contact with the audience, scoring 8/10. An audience is more likely to engage with a speaker who maintains eye contact, and to understand the message. From the heatmap analysis, we can see that the user spends time with each audience member and spreads the eye contact evenly amongst them over time.

A point to note is that we are not able to determine how long the user maintained eye contact with each audience member in a single stretch, only that the total eye contact was well distributed. For example, the user might have spent 45 seconds with the first audience member, then 45 seconds with the second, and so on, rather than in the 3–5 second periods recommended for an audience of this size.
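If a per-frame log of which audience member was being looked at were kept, measuring contiguous dwell lengths would close this gap. This is a hypothetical extension, not a current app feature, and the frame rate is an assumption:

```python
from itertools import groupby

FRAME_RATE = 60  # assumed head-tracking samples per second

def dwell_durations(frame_targets):
    """Given a per-frame log of the audience member being looked at
    (or None for elsewhere), return (target, seconds) for each
    contiguous run of frames on the same target."""
    return [
        (target, sum(1 for _ in run) / FRAME_RATE)
        for target, run in groupby(frame_targets)
        if target is not None
    ]
```

Flagging any dwell much longer than 5 seconds would then distinguish a 45-second stare from the recommended short glances.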

Saving and uploading the speech for detailed feedback

After the user has reviewed their performance, they have the option of saving their speech to listen back to later, or uploading the speech to the VirtualSpeech team for further, detailed analysis.

The saved speeches are located in the Progress room, found from the VR main menu. Up to 5 speeches can be saved at any time within the app.

To upload the speech for additional feedback, the user needs to enter the email address to which the VirtualSpeech team will send the feedback. The additional feedback can be used to gain insight into areas of your speech the app currently cannot assess, such as:

  • Were any literary techniques used?
  • What is the tone of the speech?
  • How persuasive or influential is the speech?
  • Is the user emphasising the key message?
  • Is the key message clear?

Track progress within the app

All the feedback you receive from the speech analytics feature is stored on your mobile device and displayed in a section of the app. This makes it easy to measure how you are progressing over time in areas such as eye contact, hesitation words and pace of speaking.

Track progress for eye contact, hesitation words and volume of your voice.

In conclusion

The VirtualSpeech app provides a powerful way for people to analyse their speech or presentation. The feedback allows users to identify weaker areas of their speech and work to improve those parts. In addition, with the realistic environments, audience and personalisation (load in your own slides), the app takes you close to being fully immersed in the environment.
