Things to keep in mind as a TED Talk Presenter [SQL & Python & Tableau]

Using data from Kaggle, I’ve completed a data analysis that provides insights into those prospective TED Talk presenters!

Kuan-Yu (Iris) Chen
4 min readJun 17, 2022

In this article, I would like to share how I started the project by creating a data pipeline of extracting data using web crawling, transforming, cleaning, and then loading the data for visualization.

Project Goal

I was first looking at the data purely for curiosity’s sake, using exploratory data analysis to find out more about the contents of the TED Talk. I soon noticed that since TED is a non-profit organization devoted to spreading ideas, I could take a step further to provide some insights into those prospective presenters. My interest piqued, I decided to see what I could find out about this idea from publicly available data, which contains 6 different features of each talk available on TED’s website for the past 20 years. Since I already have the target audience in mind, I examined the data set first and removed the author under TED-education, TED audio, and TED performance to avoid inefficient results. Then, let’s take a look at the tools and steps I used for the project.

The Data Pipeline

  1. Web Crawling (Beautiful Soup)

Web Crawling is a useful technique that allows the extraction of data from web content. Let’s break down the process into broad steps:

  • Visual inspection: Figure out what to extract
  • Make a request to the webpage
  • Parse the HTML response
  • Utilize the relevant data

Below is how I crawled the length and the first tagged genre of each video, which is usually the most relevant genre.

2. Data cleaning and preprocessing (Pandas, SQL)

After crawling the information I need from the TED website, I wrote data into a CSV file.

Then, I imported the original file and the one I crawled from the website to Microsoft SQL to further analyze. Firstly, it seemed weird to me when noticing some videos are tagged ‘interview’, ‘music’, ‘education’, etc. I looked back to the website. Not surprisingly, those videos are not related to my target audience, that is, the video form of prospective TED Talk presenters should be a speech instead of a performance or interview. I then removed them. Secondly, I merged two all-set tables together. Next, I categorized length into groups by using common table expression (CTE).

Moreover, I came up with a scenario. By using the window function and subquery to show average views over 12 months and 99 genres, the prospective presenters can get estimated views under specific conditions for future use.

3. Data Visualization (Tableau)

After I finished my Python & SQL analysis, I brought my cleaned data set into Tableau and created a long-form data visualization titled “Things to keep in mind as a TED Talk Presenter”. The data visualization allows users to see trends in the popularity of genre/ length/ month from 2011 to 2022, based on yearly average views. I broke it down into three sections. First, the viewer can find out the top genre for each year in the rankings. In another section, they’re able to check out video length changes in popularity over time. Next up, I presented months in popularity for each year.

The reason why I chose to focus on the relative changes year by year instead of directly showing the most popular genre/length/month to present was that the earlier the videos are released, the higher views they may gain, and this phenomenon will affect the result. Seeing the trends year by year makes the analysis more reliable.

“Things to keep in mind as a TED Talk Presenter” on Tableau Public

Challenges & Improvements

Building my first portfolio could be challenging, below are some of the struggles I have faced:

  1. It seems that the webpage content of TED is generated by a dynamic rendering mechanism. The HTML content from some of the pages changed after a few minutes. For those links that I failed to extract length and tagged genres, I did not include them in my dataset -> To get comprehensive data, I need more investigation into dynamic webpages.
  2. Showing rank changes year by year with mark labels on the bar chart could be more straightforward in data visualization. As the types and numbers of genres are not fixed each year, there would be no sense to use the technique. > Select the top 10 genres to better analyze.

Please check out the whole project on my Github and Tableau Public.

--

--

Kuan-Yu (Iris) Chen
0 Followers

Graduate student in MS in Information Systems at University of Washington Foster School of Business | Data Analyst | Product Manager