The DAP Journey: Talk Like TED

How to be TED Talk Worthy with Analytics

Published in

SMUBIA

6 min read · May 28, 2019

In this Medium series, BIA shares the reflections of our Data Associates as they look back on their academic exploration. This post features an analytics project on TED Talks, directed by Qi Haodi, Fabian Toh & Tan Kin Meng.

Speakers of the day

Qi Haodi, SIS Y2

“I wanted to join DAP because I find it more motivating to learn data analytics together with a group of like-minded people. On top of that, I wanted hands-on experience in data analytics projects so that I can really apply what I learn.”

Fabian Toh, SIS Y2

“Ever since poly, I have discovered my love for analytics and developed a passion for it. Hence, I joined DAP to gain exposure to analytical projects and to understand the algorithms behind machine learning, which will greatly enhance my knowledge.”

Tan Kin Meng, SIS Y1

“I wanted to join DAP because of the programme’s unique value proposition: building a close-knit, like-minded community of learners, along with the opportunity to learn about data science!”

Talking Like TED

Public speaking is one of the most essential yet underdeveloped skills among young adults today. Ideally, a speaker should convey ideas to the audience clearly, persuasively and engagingly. While it is a soft skill that cannot be acquired overnight, we believe the best way to start is by imitating the best TED speakers from around the world as they share their inspirations and ideas on the round red carpet.

Given the 18 TED talk ratings, we realised that not all talks are equally good. While some talks can appear to be more engaging and humorous, others may be more informative and persuasive. Besides that, there are also talks which are deemed longwinded and unconvincing.

So, what sets these talks apart from one another?

We decided to explore the attributes of a successful talk, which is commonly indicated by more positive ratings (on TED.com). To understand what success looks like, we also needed to define failure: talks that come with more negative ratings.

From there, we can apply the good skills to our own presentations and avoid the mistakes that make a speech unpopular.

Crafting the perfect speech

We started with a statistical analysis of TED talks to discover trends, distributions and any correlations among the quantitative data, after which we dived into the transcript analysis.

Stage 1: Statistical analysis on TED talks

During the statistical analysis, also known as exploratory data analysis (EDA), we started off with simple summary statistics. We conducted a quick analysis of our dataset distribution, and then used Tableau to visualize our data as simple bar graphs and correlation matrix plots (see below).

Correlations among Ratings and Other Quantitative Measures

During this phase, we also tried out feature engineering, such as combining several numeric fields into new features, which resulted in an analytical report with deeper insights.

We decided to use other analytical tools and moved on to Python libraries, such as Pandas and Matplotlib, to further process and visualize the data.

Instead of settling for general trends, we looked at the specific ratings of each talk to draw more accurate conclusions. We calculated a correlation matrix using a pandas DataFrame and used Tableau to visualize the video rankings.
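To make the correlation step concrete, here is a minimal sketch of a Pearson correlation matrix across rating columns. It is written in plain Python so it runs without extra libraries; in pandas, `DataFrame.corr()` computes the same kind of matrix. The rating counts and the helper names `pearson` and `correlation_matrix` are made up for illustration and are not taken from our dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up rating counts for five hypothetical talks (not the real dataset).
ratings = {
    "informative": [120, 300, 80, 20, 250],
    "beautiful": [200, 30, 150, 400, 60],
    "longwinded": [10, 40, 15, 5, 35],
}

matrix = correlation_matrix(ratings)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 or -1 point to a strong positive or negative relationship between two ratings, while values near 0 suggest little linear relationship.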

Interestingly, we discovered that the neutral or negative ratings (commonly “longwinded”, “unconvincing”, “obnoxious”, or “OK”) are positively correlated with one another, and these pairs had the highest correlation values.

On the flip side, “informative” is the rating most negatively correlated with “beautiful”. We hypothesized that talks rated as beautiful are often performances, where the audience is entertained rather than presented with meaningful information or knowledge.

To discover more insights and test our hypothesis, we moved on to Stage 2.

Stage 2: Transcript analysis of TED talks

Using the available transcripts of TED Talks, we first tried typical text analysis approaches, such as TF-IDF (term frequency-inverse document frequency) and LDA (Latent Dirichlet Allocation).

Initially, the TF-IDF model ranked “♪” as the top term. Since the symbol cannot be verbalized in a spoken speech, we realized that our data was not yet clean enough.

It turns out that the symbol appears in a transcript whenever music is played during the talk. We filtered our transcripts to remove the symbol, along with irrelevant stop words.
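The cleaning and scoring steps can be sketched as follows. This is a simplified, plain-Python illustration and not our exact pipeline: the stop-word list is deliberately tiny, the transcripts are invented, and the TF-IDF formula used here (term frequency times the log of inverse document frequency) is one common variant among several.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in", "we", "it", "that"}


def clean(transcript):
    """Drop the music symbol, lowercase, tokenise, and remove stop words."""
    text = transcript.replace("♪", " ")
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented transcripts, for illustration only.
transcripts = [
    "♪ We dream of the ocean ♪ and the ocean is deep",
    "Data is the story that we tell",
    "The ocean of data is growing",
]

cleaned = [clean(t) for t in transcripts]
scores = tf_idf(cleaned)
for doc_scores in scores:
    top = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:3]
    print([(term, round(weight, 3)) for term, weight in top])
```

Because the music symbol and the stop words are removed before scoring, they can no longer crowd out the words that actually characterise a talk.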

The results after removing the noise were promising. Having obtained the top 10 most frequently used words, we searched the terms using the official TED Talk search engine. As expected, videos containing the particular word in the transcript appeared on the first page of the search.

Search Results using Top 10 Words of TF-IDF in Official TED Talk Websites

To distinguish TED performances from informative lectures, we realized we could cross-reference the presence of the musical symbol with each video’s ratings. Unsurprisingly, performances tend to have ratings like “beautiful”, “jaw-dropping” or “funny”. Non-performance sessions were associated more with ratings like “informative” or “persuasive”.
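One way to sketch this cross-referencing is shown below. It assumes, for illustration, that a talk counts as a performance whenever “♪” appears in its transcript, and that each rating is expressed as a share of the talk’s total rating votes. The talks, field names and helper functions are hypothetical.

```python
def rating_shares(counts):
    """Convert raw rating counts into shares of the talk's total votes."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def average_shares(talks):
    """Average the rating shares over a list of talks with the same rating names."""
    totals = {}
    for talk in talks:
        for name, share in rating_shares(talk["ratings"]).items():
            totals[name] = totals.get(name, 0.0) + share
    return {name: value / len(talks) for name, value in totals.items()}


# Invented talks, for illustration only.
talks = [
    {"transcript": "♪ La la la ♪", "ratings": {"beautiful": 80, "informative": 20}},
    {"transcript": "♪ Strings ♪ Thank you.", "ratings": {"beautiful": 60, "informative": 40}},
    {"transcript": "Here is the data.", "ratings": {"beautiful": 10, "informative": 90}},
]

# Assumption: the music symbol marks a performance.
performances = [t for t in talks if "♪" in t["transcript"]]
lectures = [t for t in talks if "♪" not in t["transcript"]]

performance_avg = average_shares(performances)
lecture_avg = average_shares(lectures)
print("performances:", performance_avg)
print("lectures:", lecture_avg)
```

Using shares rather than raw counts keeps very popular talks from dominating the averages.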

Average Rating % for Performances and Non-Performances Talks

With that, the results support our hypothesis that a video rated as beautiful is more often a performance act.

(Laughter) = Humorous Talk?

Another thing we found is that the transcripts capture the audience’s laughter in brackets. We wanted to know whether the number of “(Laughter)” cues indicates how humorous a talk was. After a simple test, the correlation between the two was only a little over 0.60, so they are not strongly correlated.
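A simple version of this test might look like the sketch below. It assumes the laughter cue is written exactly as `(Laughter)` in the transcript and that each talk has a “funny” rating share; both the talks and the numbers are invented, so the resulting coefficient says nothing about the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def laughter_count(transcript):
    """Count the bracketed laughter cues in a transcript."""
    return transcript.count("(Laughter)")


# Invented talks; "funny" is a made-up share of each talk's rating votes.
talks = [
    {"transcript": "Hello. (Laughter) So, anyway. (Laughter) Thanks. (Laughter)", "funny": 0.30},
    {"transcript": "Today, the numbers. (Laughter)", "funny": 0.05},
    {"transcript": "No jokes here.", "funny": 0.10},
    {"transcript": "One. (Laughter) Two. (Laughter)", "funny": 0.25},
]

laughs = [laughter_count(t["transcript"]) for t in talks]
funny = [t["funny"] for t in talks]
r = pearson(laughs, funny)
print("laughter counts:", laughs)
print("correlation with funny rating:", round(r, 2))
```

Longer talks naturally leave more room for laughter, so dividing the count by the talk’s duration or word count may be a fairer measure.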

Correlation between Sentence Length and Ratings

Guessing TED ratings from sentence length

We initially guessed that talks with longer sentences might be viewed as long-winded or confusing, as the audience might not be able to follow along. We calculated the average sentence length for each talk and its correlation with every rating. The data did not support this guess: the correlation values were nowhere near -1 or 1.
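As a rough sketch of the sentence-length feature, the snippet below splits a transcript on full stops, exclamation marks and question marks, then averages the number of words per sentence. This splitting rule is a simplifying assumption (abbreviations such as “Dr.” would be miscounted), and the function name is hypothetical.

```python
import re


def average_sentence_length(transcript):
    """Mean number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Invented transcript, for illustration only.
sample = "I have a dream. It is big! Do you agree?"
avg = average_sentence_length(sample)
print(round(avg, 2))
```

The resulting value per talk can then be correlated with each rating in the same way as any other numeric column.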

Topic Modelling?

Lastly, we attempted topic modelling using LDA. Topic modelling is a technique that discovers the abstract “topics” that occur in a collection of documents.

So far, we have not reached a very satisfactory result, because the top keywords for each topic included many stop words, which diluted the essence of the transcripts. We will explore this area further and examine ways to improve our model.

Moving forward, we would like to explore more text analytics algorithms on our transcripts and proceed with building our model based on new features extracted from our transcript analysis.

Thank you for reading!