Can AI improve the way you speak?

Whether you are a politician speaking to millions, a professor giving a lecture to a hundred students, or just someone talking with friends in a casual context, the way you speak matters. And modern speech analytics can help you improve it.

What makes a good speaker?

There are certain things that you can do in order to win over your audience, according to public speaking coaches:

  • use pauses and silence when necessary: there are two types of pause, and you should work on both. First, the pause you make when talking with someone else: always pause for a moment after the other person finishes a sentence, to let them know you have been listening. Second, the pauses you make between your own sentences: these add authority and draw the audience's attention
  • try to eliminate filled pauses: as noted above, silent pauses are good. Filled pauses (“uum”, “liiike”, “soooo”), on the other hand, ruin your speech: they signal uncertainty and hesitation, they distract your audience, and they should be eliminated as far as possible
  • keep a normal speech strength: how loudly you speak is also important. Low energy can be read as weakness, but loudness is not good either, and it is not necessarily interpreted as confidence
  • keep a normal speaking rate (not too fast): we tend to speak fast, especially when we are anxious, and a fast speaking rate signals nervousness. Speaking relatively slowly, on the other hand, signals confidence and gives us time to think and “plan” our next sentence
  • be expressive (a relatively high pitch, but not too high!): a monotonous delivery is clearly one of the biggest contributors to a bad speech
  • smile and show positive emotion: smiling not only shows that you are friendly and warm; research has also shown that people who smile when speaking seem more intelligent to their audience
  • use posture, facial expressions and body language properly: a good presence also depends on the gestures and facial expressions that accompany your speech, and using a variety of them indicates passion for what you say, as well as confidence

All of the points listed above, excluding the last one, can actually be condensed into a single sentence: “how we speak matters”. Certainly, what you say is the most obvious factor of a successful speech, but how you say it is in many cases equally or even more important. Several of these “how” attributes can, in fact, be estimated directly from the audio signal, as sketched below.
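
As a minimal, hypothetical sketch (not any particular product's code), the snippet below assumes a 16-bit mono WAV recording of your own speech and estimates two of these attributes, overall loudness and the ratio of silent pauses, from short-term energy:

    # Illustrative sketch only: estimate overall loudness and the ratio of silent
    # pauses from a 16-bit mono WAV file, using numpy and the standard library.
    import wave
    import numpy as np

    def read_mono_wav(path):
        """Load a 16-bit mono WAV file as floats in [-1, 1] plus its sampling rate."""
        with wave.open(path, "rb") as w:
            fs = w.getframerate()
            samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        return samples.astype(np.float32) / 32768.0, fs

    def frame_rms(signal, fs, win=0.050, step=0.025):
        """Short-term RMS energy over 50 ms windows with a 25 ms hop."""
        win_n, step_n = int(win * fs), int(step * fs)
        frames = [signal[i:i + win_n] for i in range(0, len(signal) - win_n, step_n)]
        return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

    def prosodic_summary(path, silence_factor=0.25):
        """Return average loudness and the fraction of frames treated as silent pauses."""
        signal, fs = read_mono_wav(path)
        rms = frame_rms(signal, fs)
        threshold = silence_factor * rms.mean()   # crude adaptive silence threshold
        return {
            "mean_loudness": float(rms.mean()),
            "pause_ratio": float((rms < threshold).mean()),
        }

    # print(prosodic_summary("my_talk.wav"))  # "my_talk.wav" is a hypothetical recording

The 50 ms window, 25 ms hop and energy-based threshold are arbitrary but reasonable defaults; a fuller system would also estimate pitch variability, speaking rate and filled pauses.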

Non-verbal speech analytics

With the latest advances in Deep Learning and increasingly accurate Automatic Speech Recognition (ASR) systems, we can safely say that, after decades of research effort, modern ASR can understand what we humans say. In addition, in recent years attention has also turned to understanding how we talk to machines and how we interact with other humans: modern non-verbal speech analytics take prosodic attributes into account and therefore extract information that is independent of language and culture.

For example, Speech Emotion Recognition (SER) methods can detect the underlying emotion of a speech utterance or segment, without necessarily taking into account the respective semantics (what has been said). Emotions can be recognised either as discrete classes, such as anger, happiness and sadness, or as continuous representations, such as positiveness (negative vs positive emotion) and emotional strength (weak vs strong emotion). These continuous dimensions are known in the literature as valence and arousal. As an example, anger is usually characterised as negative and strong, sadness as negative and weak, and happiness as positive and strong.
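
To make the discrete-versus-continuous distinction concrete, here is a toy Python mapping of the three example emotions onto the valence/arousal plane; the coordinates are indicative placements chosen for illustration, not the output of any model:

    # Toy illustration of discrete emotion classes vs the continuous valence/arousal view.
    # Valence in [-1, 1] (negative -> positive), arousal in [0, 1] (weak -> strong).
    EMOTION_COORDS = {
        "anger":     (-0.8, 0.9),   # negative and strong
        "sadness":   (-0.7, 0.2),   # negative and weak
        "happiness": ( 0.8, 0.8),   # positive and strong
    }

    def describe(emotion):
        valence, arousal = EMOTION_COORDS[emotion]
        polarity = "positive" if valence > 0 else "negative"
        strength = "strong" if arousal > 0.5 else "weak"
        return f"{emotion}: {polarity} and {strength}"

    for e in EMOTION_COORDS:
        print(describe(e))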

Apart from emotions, non-verbal speech analytics can recognise (see the sketch after this list):

  • behaviours such as politeness and engagement in the dialog
  • confidence or hesitation
  • expressiveness (monotonous vs expressive)
  • extraversion (whether the speaker is shy or extraverted)
  • speaking rate (slow vs fast speech)
  • pause patterns and filled pauses (does the speaker pause between phrases and words? does she use a lot of filled pauses?)
  • non-verbal events such as laughter or crying
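
As an illustration of what the output of such an analysis might look like, the sketch below defines a hypothetical per-segment result structure; the field names and value ranges are assumptions for the sake of the example, not an actual API:

    # Hypothetical shape of a per-segment non-verbal analytics result; field names
    # and value ranges are illustrative assumptions, not a real product's API.
    from dataclasses import dataclass

    @dataclass
    class SegmentAnalysis:
        start: float                 # segment start time (seconds)
        end: float                   # segment end time (seconds)
        emotion: str                 # e.g. "anger", "happiness", "sadness"
        valence: float               # -1 (negative) .. 1 (positive)
        arousal: float               # 0 (weak) .. 1 (strong)
        politeness: float            # 0 .. 1
        engagement: float            # 0 .. 1
        expressiveness: float        # 0 (monotonous) .. 1 (expressive)
        speaking_rate_wpm: float     # words per minute
        filled_pause_ratio: float    # fraction of speech time spent on fillers
        events: list                 # non-verbal events, e.g. ["laughter"]

    segment = SegmentAnalysis(0.0, 4.2, "happiness", 0.7, 0.8, 0.9, 0.8, 0.6, 140.0, 0.05, ["laughter"])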

Emotional and behavioral speech analytics technology has lately been used successfully in various real-world applications, such as call center analytics, where this kind of conversational intelligence helps optimise a call center's pipelines. For example, the technology we build at Behavioral Signals uses AI to recognise emotions and behaviours in call center dialogs and then automatically scores the performance of the agents, or even matches agents to customers based on their “behavioral similarity”.
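
One simple way to picture such a matching, and this is only an illustration rather than the actual Behavioral Signals algorithm, is to represent each agent and customer by a vector of behavioral scores and pair each customer with the most similar agent, for example via cosine similarity:

    # Simplified sketch of "behavioral similarity" matching. The profiles and the
    # similarity measure are illustrative assumptions, not the production algorithm.
    import numpy as np

    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical profiles: [positivity, engagement, politeness, expressiveness]
    agents = {"agent_A": [0.9, 0.7, 0.8, 0.6], "agent_B": [0.4, 0.9, 0.6, 0.9]}
    customer = [0.5, 0.9, 0.5, 0.8]

    best_agent = max(agents, key=lambda name: cosine(agents[name], customer))
    print(best_agent)  # the agent whose behavioral profile is closest to the customer's

Real matching systems would of course use richer profiles and learned objectives; cosine similarity is just the simplest way to make the idea concrete.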

A use case on public speakers' data

So non-verbal speech analytics can optimise a call center by modeling emotions and behaviors in dialogs, scoring agents, and matching them with clients based on their “behavioral similarity”. But can it help the agents themselves improve the way they speak? More generally, can emotional AI improve the way anybody speaks?

A small demonstration of the ability of two non-verbal attributes, namely the non-fillers rate (i.e. the percentage of speech time in which no filled pauses occur) and excitement, to describe speech content is presented in the figure below. We have used data from three categories of speeches: TED Talks, scientific conferences (in the area of AI and speech analytics) and politicians (mostly speaking in the EU Parliament). We can see that most of the TED speakers (and a couple of politicians) achieve very high non-fillers rates. In particular, Simon Sinek achieves a non-fillers rate close to 100% (which makes sense if one listens, for example, to his How great leaders inspire action talk, where he rarely adds fillers). Most politicians also achieve remarkable non-fillers rates, but with significantly higher excitement (emotional arousal), which can be useful for populists like Nigel Farage (this is one of his speeches we used for sampling). Scientific conferences show the most frequent fillers, which is expected if we consider that most speakers are postgraduate students, and that even professors are not always well prepared, at least not compared to the speakers of the most-viewed TED talks, who usually do training sessions with a speaking coach before going on stage.

Another important note on the results shown above is that the filler rate is associated with the speaker's familiarity with the language in which they speak. For example, G. Verhofstadt's speech (this is the one we used in our data), illustrated in the diagram above, is not in his native language (Dutch) but in English. Obviously, the same holds for the vast majority of speakers at scientific conferences.

So each speech signal can be “scored” based on the non-fillers rate, the emotional arousal, or any other prosodic metric. In this way, the speaker can know how well she did based on a set of speech analytics metrics. And this is the first step for a system that can automatically help you improve your speaking style: knowing your score. But is that enough?
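
As a concrete (and deliberately simple) illustration of such a score, the snippet below computes the non-fillers rate as defined above, assuming a filler detector has already produced the durations of the filled pauses; the numbers are made up:

    # Non-fillers rate: percentage of speech time that is free of filled pauses.
    # Inputs (filler durations and total duration, in seconds) are made-up examples.
    def non_fillers_rate(filler_durations, total_speech_duration):
        filler_time = sum(filler_durations)
        return 100.0 * (1.0 - filler_time / total_speech_duration)

    # e.g. a 10-minute (600 s) talk with 12 s of "uum"/"like"-style fillers in total:
    print(round(non_fillers_rate([0.4] * 30, 600.0), 1))  # -> 98.0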

Apart from knowing our “score” when speaking, we obviously need to know what to do to improve. This can be accomplished using a listen-and-learn-by-example approach: a system that guides you with feedback like “this is where you could do better” and “this is where you sounded confident” provides the first step towards AI-driven speaking style improvement. As an example, we have scored Bill Gates' interview for The Verge, on “How the World will Change by 2030”, with respect to its percentage of fillers per sentence. Then we collected the most characteristic sentences, sorted from the worst (i.e. too many filled pauses) to the best (i.e. the minimum amount of filled pauses). See the result below. Such an example-based, AI-driven speaking style improvement could also include both “good” and “bad” examples from other speakers' utterances within the same domain: “you have a very low non-fillers score; listen to this example from a similar field, where the speaker achieves a much better score”.
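
A minimal sketch of this example-based feedback step might look as follows; the sentences and filler percentages are invented placeholders rather than the actual Bill Gates data:

    # Rank the sentences of a talk by their filler percentage and surface the worst
    # and best ones to the speaker. The data below is an invented placeholder.
    sentences = [
        {"text": "uum, so, I think, like, the main point is...", "filler_pct": 22.0},
        {"text": "By 2030 the world will look very different.",  "filler_pct": 0.0},
        {"text": "So, uh, what we found was, uum, surprising.",  "filler_pct": 15.0},
    ]

    ranked = sorted(sentences, key=lambda s: s["filler_pct"], reverse=True)  # worst first
    worst, best = ranked[0], ranked[-1]
    print("Work on this one:", worst["text"])
    print("This is how you sound at your best:", best["text"])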

About the author (tyiannak.github.io)

Thodoris is currently the Director of ML at behavioralsignals.com, where his work focuses on building algorithms that recognise emotions and behaviors from audio information. He also teaches multimodal information processing in a Data Science and AI master's program in Athens, Greece.
