Analysis of a Personal Public Talk

I recently gave a talk about my analysis of Fitbit sleep data at the Dublin Quantified Self meetup. Being a Quantified Self meetup, it seemed more than appropriate (if not obligatory) for me to “quantify” and analyze all the data I gathered and generated during the talk.
Here I will explore two kinds of data: heart-rate measurements from my Fitbit and a transcript of my speech.

This article is supplemented with a Jupyter notebook, which explores the code and methods used for obtaining the results I illustrate here. I relied on common Python libraries (Pandas, Sklearn, NLTK, and Seaborn for visualization) and IBM Watson APIs for the speech-to-text task.

Heart Rate

It’s always a fun exercise to monitor your heart rate during uncommon, out-of-the-comfort-zone events. We are often aware of our state in the moment: we feel the stress, the agitation, the palpitations! But we are likely to lose focus on that internal state eventually, because something else requires or shifts our attention.

Tracking your heart rate allows you to partially observe your body’s reactions to a situation even when the situation is long gone, giving you the opportunity to approach your past inner behavior from a neutral, analytical state.

Here is my heart rate for the whole day of Tuesday, 22nd November 2016.

Plot of my Fitbit heart-rate measurements for 2016–11–22. One entry per minute

The average value is 81 beats per minute (bpm), with the lowest and highest measurements at 53 and 134 bpm respectively.
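As an aside, such a daily summary takes only a few lines of Pandas. Here is a minimal sketch, assuming the per-minute measurements were exported to a CSV with hypothetical “timestamp” and “bpm” columns (the actual notebook may load the Fitbit export differently):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical export of the per-minute heart-rate measurements.
hr = pd.read_csv("heart_rate_2016-11-22.csv", parse_dates=["timestamp"])
hr = hr.set_index("timestamp").sort_index()

print(f"Average: {hr['bpm'].mean():.0f} bpm")
print(f"Lowest/highest: {hr['bpm'].min()} / {hr['bpm'].max()} bpm")

# Zoom on the time frame around the talk (times here are illustrative).
hr.between_time("18:30", "21:30")["bpm"].plot(title="Heart rate around the talk")
plt.show()
```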

Here is a zoomed view around the time frame during which I gave the talk.

I marked each relevant point with a letter:

  • A: arrived at the location of the event (preceded by a 15-minute fast walk)
  • B: start of the first speaker’s talk. Here I’m just sitting and listening, most likely also stressing myself further by consciously trying to relax
  • C: on the stage, start of my talk
  • D: start of the Q&A session
  • E: back to the chair

I had already seen a couple of similar graphs for comparable situations: like many others, I was simply not able to avoid that peak. As you might have noticed, it is exactly the highest value I reached for the day, equaled only by a later quick run after a couple of pints. It is also interesting to observe how my heart rate dropped consistently as soon as I started presenting. I remember being highly aware of my words at first, of what I was saying, before simply giving way to “auto-pilot mode”.

Theoretically, the more talks I give and the more I present in public, the “better” that graph should get. This is my first personal dataset of this kind, but I hope to collect much more data in the following years. It is not only practice/experience that I will have to take into consideration, but also context, aging, and God only knows how many more possible explanatory variables.

Speech Analysis

For the speech analysis, I am going to analyze only the actual transcript of my talk. A lot of data is lost because of this decision, and I’m not talking just about possible inaccuracies of the speech-to-text results… here is a list of important aspects of public speaking which are lost when considering only a basic textual representation:

That’s surely a lot, but let’s see what we can do and get from the basic text, and leave all these aspects to a later stage.

Speech To Text

First things first: speech-to-text. I didn’t do it on the spot using tools for real-time generation of text; instead I relied on the video recording setup for the event.
I got my video file, extracted the audio content, cropped the unnecessary parts (including the Q&A session), and went for a speech-to-text solution.

Well-known services for speech-to-text (with room for some free usage) come from big names like Google, Microsoft, and IBM.

Google wanted my credit card number and the Microsoft APIs seemed unresponsive to my efforts. On the other hand, the IBM results are really rich: with various optional settings enabled, each recognized word can be accompanied by a confidence level and start and end times (“beginning and ending time in seconds relative to the start of the audio”).
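Something along these lines can retrieve the word-level details; the endpoint, credentials, and audio format below are placeholders, so check the current IBM Watson Speech to Text documentation for the exact URL and authentication scheme:

```python
import json

import requests

# Placeholder endpoint and credentials for the Watson Speech to Text service.
URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

with open("talk_audio.wav", "rb") as audio:
    response = requests.post(
        URL,
        auth=("WATSON_USERNAME", "WATSON_PASSWORD"),
        headers={"Content-Type": "audio/wav"},
        # Ask for per-word timestamps and confidence levels.
        params={"timestamps": "true", "word_confidence": "true"},
        data=audio,
    )

results = response.json()
with open("watson_results.json", "w") as out:
    json.dump(results, out)
```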

Basic Text Analysis

Considering only the pure textual information, I already ended up with this basic but neat summary of my talk:

“Here a summary of the conversation. Overall 3315 words have been said, of which 951 unique ones, giving a lexical richness of 28.69%. 
With the talk total duration of 21.8 minutes, the speech rate is of 152 Words Per Minute (WPM).”
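As a rough sketch, such a summary can be generated from the list of transcribed words and the talk duration; the function and variable names below are illustrative, not the notebook’s exact code:

```python
def speech_summary(words, duration_min):
    """Basic transcript summary: word counts, lexical richness, and WPM."""
    total = len(words)
    unique = len(set(w.lower() for w in words))
    richness = unique / total
    wpm = total / duration_min
    return (f"Overall {total} words have been said, of which {unique} unique ones, "
            f"giving a lexical richness of {richness:.2%}. With the talk total "
            f"duration of {duration_min} minutes, the speech rate is of "
            f"{wpm:.0f} Words Per Minute (WPM).")

# e.g. print(speech_summary(transcript_words, 21.8))
```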

A basic aspect one might want to explore in this circumstance is word usage: which are the most common words, bigrams, and trigrams used during the talk? The top results will most likely be common, low-informative constituents like articles, adverbs, and pronouns. In my results, the first significant word is data, ranking 65th among the most frequent words, which makes sense given the topic of the talk.
To better explore actually relevant words, you might first of all simply remove stop words, or rely on more specialized statistics like tf-idf (explained in more practical detail in the notebook).
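A sketch of the stop-word removal and frequency counts with NLTK, assuming `words` is the list of lowercased transcript tokens (the placeholder list below just illustrates the expected input):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download("stopwords")  # only needed the first time

words = ["here", "a", "summary", "of", "the", "conversation"]  # placeholder tokens
stop = set(stopwords.words("english"))
content_words = [w for w in words if w.isalpha() and w not in stop]

print(Counter(content_words).most_common(10))      # most frequent content words
print(Counter(ngrams(words, 2)).most_common(10))   # most frequent bigrams
```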

Word Alignment and Speech Rate

Let’s consider again the speech-to-text results from the IBM Watson service; here is what the first five rows (out of 3315, one per word) of the cleaned results look like:
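This table comes from flattening the Watson JSON; a sketch of that step, assuming each result carries a “timestamps” list of [word, start, end] triples and a “word_confidence” list of [word, confidence] pairs (as enabled by the API flags above):

```python
import json

import pandas as pd

with open("watson_results.json") as f:
    raw = json.load(f)

# Flatten the nested Watson results into one row per recognized word.
rows = []
for result in raw["results"]:
    best = result["alternatives"][0]
    for (word, start, end), (_, conf) in zip(best["timestamps"],
                                             best.get("word_confidence", [])):
        rows.append({"word": word, "time_start": start,
                     "time_end": end, "confidence": conf})

df = pd.DataFrame(rows)
print(df.head())
```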

The alignment info can be used to overcome some of the limitations of pure text analysis. The following histogram should provide an approximate view of my speech rate trend: the entire conversation has been split into 10 bins of equal size (in time), and the value of each bar is the count of words which fall in the corresponding bin.
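A sketch of how these binned counts might be computed, reusing the `df` table built above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assign each word to one of 10 equal-width time bins based on its end time.
df["bin"] = pd.cut(df["time_end"], bins=10, labels=False)
words_per_bin = df.groupby("bin")["word"].count()
print(words_per_bin)

# Equivalent view: a 10-bin histogram of the time_end column.
df["time_end"].plot.hist(bins=10)
plt.show()
```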

Histogram for binning on time_end variable. Equivalent to a binned word count

You can notice, for example, a slight but constant decrease in my speech rate. Fitting a regression line through these points, I obtain a coefficient of -5.23. Considering the 10 bins used, each bin has a size of 2.18 minutes. Very roughly, this is equivalent to saying that on average my speech rate decreased by about 5.2 words with each subsequent bin of roughly 2.2 minutes.
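The fit itself is only a couple of lines with Scikit-learn; a sketch, reusing `words_per_bin` from the binning snippet:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regress the per-bin word count on the bin index (0..9).
X = np.arange(len(words_per_bin)).reshape(-1, 1)
y = words_per_bin.values

reg = LinearRegression().fit(X, y)
print(reg.coef_[0])  # slope: average change in word count per bin
```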

Even more generically, one could simply say that the more time passed, the slower I talked. We should then clarify what we mean by “talking slower”. There are two options I can think of:

  • using fewer words, which can be caused by two factors: more spacing/pauses between words, or the usage of longer words, which take more time to pronounce
  • lower word speed (how quickly a word is pronounced, measured as the length of word w divided by the time taken to pronounce w); both measures are sketched in code below

Scatterplots for each derived measure. The x axis is the bin index
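All of these measures can be derived from the alignment table; a sketch, reusing the `df` table and `bin` column from the earlier snippets:

```python
# Derive per-word measures from the alignment info.
df = df.sort_values("time_start")
df["pause"] = df["time_start"] - df["time_end"].shift()   # silence before each word
df["word_length"] = df["word"].str.len()
df["word_speed"] = df["word_length"] / (df["time_end"] - df["time_start"])

# Average each measure per bin, to be plotted against the bin index.
per_bin = df.groupby("bin")[["pause", "word_length", "word_speed"]].mean()
print(per_bin)
```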

Based on the results from linear regression fitting and correlation coefficients, an additional summary for the alignment part would be:

Your average speech rate is 152 Words Per Minute (WPM), but an approximately constant and significant decrease can be observed, bringing you from an initial WPM of 166 to a final value of 142. The primary cause of this is the usage of increasingly longer pauses between words, secondarily reinforced by a combination of using longer words, as well as a tendency to slow down the pronunciation of words, while the talk unfolds.

Finally, another interesting addition in the results from the IBM APIs is the presence of a specific keyword, %HESITATION, which unfortunately doesn’t seem to be well documented, but should represent fillers and speech disfluencies such as “Uh”, “Ah”, “Erm”, “Um”, etc.
Here is a visualization of the occurrences of this keyword.

Violinplot showing the distribution of %HESITATION occurrences
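A sketch of how such a plot could be produced with Seaborn, reusing the `df` table (the original plot may have been built slightly differently):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of the times at which the %HESITATION keyword occurs.
hesitations = df[df["word"] == "%HESITATION"]
sns.violinplot(x=hesitations["time_start"] / 60)  # minutes into the talk
plt.xlabel("Minutes into the talk")
plt.show()
```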

I want to stress again that all the text analysis demonstrated here depends entirely on the quality of the speech-to-text results, which, considering the setup, the audio quality, and a few brief observations, are well below optimal. At the same time, the proposed framework is reusable, and I’m definitely planning to further expand it and put it to use on future data of — hopefully — higher quality. As usual, all feedback, critiques, and corrections in particular, are more than welcome.