Visualizing data — Lessons Learned

As part of the Bertelsmann Data Science Scholarship, Udacity and Bertelsmann challenged 15,000 students to learn about Data Science, complete a course about Descriptive Statistics and Advanced Concepts with Python and SQL, and interact with other data enthusiasts from all over the globe.

These are the key concepts and lessons learned I have gathered from Lesson 6, Visualizing Data. For concepts from Lesson 3 and research methods, please see the article here.

#UdacityDataScholars #PoweredByBertelsmann

It’s difficult and tedious to draw conclusions just by looking at a table with data in it. This is where statistics come in handy.


Frequency:

A frequency table counts how many times each distinct value appears in the data being analyzed. It organizes the data in a coherent way, which makes it easier to describe the data, find patterns and draw conclusions.

The relative frequency is the number of times something happens divided by the total number of observations, so it is the proportion of times that it happens. Each relative frequency is a number in the interval [0, 1], and all the relative frequencies in a table sum to 1. Because it is a proportion, it can be converted to a percentage by multiplying by 100.

Frequency table and relative frequency, as seen in the class
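
As a minimal sketch of the two definitions above, the snippet below builds a frequency table and the matching relative frequencies in Python. The list of countries is a made-up sample for illustration only, and the variable names are my own.

```python
from collections import Counter

# Hypothetical sample (made up for illustration): the country of ten students.
countries = ["Brazil", "Germany", "Brazil", "USA", "Germany",
             "Brazil", "India", "USA", "Brazil", "Germany"]

# Frequency: how many times each distinct value appears.
frequency = Counter(countries)

# Relative frequency: each count divided by the total number of observations.
total = sum(frequency.values())
relative_frequency = {value: count / total for value, count in frequency.items()}

# Print the table, most frequent value first, with the proportion as a percentage.
for value, count in frequency.most_common():
    print(f"{value:<8} {count:>3} {relative_frequency[value]:>6.1%}")
```

Each relative frequency lands between 0 and 1, and together they add up to 1 (up to floating-point rounding).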

When creating frequency distribution tables, grouping the data into bigger categories sometimes means trading informativeness for convenience, because the smaller, more specific details are lost. For example, if you gather results from different countries and then build a frequency table per continent, the table shows only the totals for each continent and hides the contribution of each country. The categories the data is grouped into are called intervals, bins or buckets.
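
The country-to-continent example can be sketched in a few lines of Python. Both the per-country counts and the lookup table below are hypothetical values chosen for illustration. They are not the figures from the class poll.

```python
from collections import Counter

# Hypothetical per-country counts (made up for illustration, not the poll data).
country_counts = {"Brazil": 4, "Argentina": 1, "Germany": 3, "France": 2, "India": 2}

# Hypothetical lookup table from country to continent.
continent_of = {
    "Brazil": "South America",
    "Argentina": "South America",
    "Germany": "Europe",
    "France": "Europe",
    "India": "Asia",
}

# Coarser frequency table: add each country's count to its continent's bin.
continent_counts = Counter()
for country, count in country_counts.items():
    continent_counts[continent_of[country]] += count

print(dict(continent_counts))
```

The total number of students is unchanged, but the continent table can no longer tell you whether Europe's count came mostly from Germany or from France.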


Visualizing data

Given a set of messy, unorganized data, we can make it easier to read by building a frequency table and putting the data in bins of a chosen size.

Visualizing data can also be done using histograms. A histogram is a representation of the distribution of numerical data. To build one, first bin the range of values (divide the entire range into a series of intervals, equal in width or not), then count how many values fall into each interval, and finally plot the counts on a graph with the intervals on the x axis and the counts on the y axis.

Example of a histogram of Average SAT Scores, N=173, Bin size=50

As with frequency distribution tables, choosing a bin size for a histogram sometimes means trading off detail for convenience.
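
To make the binning step and the bin-size trade-off concrete, here is a small sketch that counts values per bin for two different bin sizes. The helper `bin_counts` and the scores are my own illustrative assumptions, not the course's code or data, and the helper only handles equal-width bins.

```python
from collections import Counter

def bin_counts(values, bin_size):
    """Count how many values fall into each bin [start, start + bin_size)."""
    # Integer division maps every value to the start of its bin.
    counts = Counter((value // bin_size) * bin_size for value in values)
    # Return the bins in ascending order; empty bins are simply absent.
    return dict(sorted(counts.items()))

# Hypothetical scores, made up for illustration.
scores = [910, 955, 960, 1005, 1010, 1020, 1040, 1060, 1110, 1190]

print(bin_counts(scores, 50))   # finer bins keep more detail
print(bin_counts(scores, 100))  # wider bins are easier to read but hide detail
```

With a bin size of 100, the values 1005 to 1060 all end up in the same bin, so the table no longer shows how they are spread within that range.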


Histograms and bar graphs

histogram ≠ bar graph

The difference between a histogram and a bar graph is that the bar graph relates to 2 variables (x and y), while the histogram relates only to 1 variable and its distribution.

In a bar graph, each value on the x axis represents a distinct category, and the order in which the categories are displayed doesn’t matter, as the variable on the x axis is most often categorical or qualitative. In a histogram, the variable on the x axis is numeric and quantitative (hence we can choose a bin size).

Histogram vs bar graph, as seen in the class

Also, the shape of a histogram is very important, while the shape of a bar graph is arbitrary, depending on the way we choose to sort and display the categories. The histogram above shows a positively skewed distribution: most values are on the left side, with a longer tail stretching to the right.


Data Science class — students’ locations statistics

As a final step, to put into practice what was presented in this lesson, I chose to make statistics about the students attending this class. Before the course started, one of the students created a poll to see which country each of us came from. At the time of writing this article, 2437 students had participated, originating from 94 countries.

Frequency table
Students’ locations from a country POV Bar Graph (zoom in for more details)
Students’ locations from a continent POV Bar Graph (zoom in for more details)

As the frequency table shows, most students attending this challenge come from Brazil (17.44%), Germany (14.24%) and the USA (12.52%). However, if we were to sort all countries into continents, most students come from Europe (~38.9%), followed by South America (~18.28%), Asia (~17.82%), North America (~15.32%), Africa (~9.28%) and Australia (with only ~0.49%).
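
Since relative frequencies should add up to 1 (or 100%), a quick sanity check is to sum the continent shares. The sketch below uses only the approximate percentages quoted above. The dictionary and variable names are my own.

```python
# Approximate continent percentages, as quoted in the paragraph above.
continent_percent = {
    "Europe": 38.9,
    "South America": 18.28,
    "Asia": 17.82,
    "North America": 15.32,
    "Africa": 9.28,
    "Australia": 0.49,
}

# The shares of all categories together should be close to 100%.
total_percent = sum(continent_percent.values())
print(f"Sum of percentages: {total_percent:.2f}%")

# Sort by share, largest first, as in a sorted frequency table.
ranked = sorted(continent_percent.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])
```

The sum comes out slightly above 100%, which is what you would expect when each share is an approximate value.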


Conclusions and final thoughts

A frequency table is good for calculating the total number of observations in the sample/population and its distribution, while a histogram is good for visualizing the shape of a distribution. A histogram can always be created as long as you have the frequency table corresponding to it.
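
To illustrate that last point, the sketch below turns a binned frequency table into a rough text histogram. The table, the bin size and the helper `text_histogram` are all hypothetical and assume integer bin starts. A plotting library would draw real bars from the same counts.

```python
# Hypothetical binned frequency table: bin start -> count (bin size 50).
frequency_table = {900: 1, 950: 2, 1000: 4, 1050: 1}
bin_size = 50

def text_histogram(table, bin_size):
    """Return one line per bin, with one '#' per counted value."""
    lines = []
    for start in sorted(table):
        # Label each bin with its inclusive integer range, e.g. 900-949.
        label = f"{start}-{start + bin_size - 1}"
        lines.append(f"{label:<10} {'#' * table[start]}")
    return lines

for line in text_histogram(frequency_table, bin_size):
    print(line)
```

The longest row of marks corresponds to the bin with the highest count, so the shape of the distribution can be read directly off the table.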

This was an excellent introduction to basic data interpretation and organization, calculating percentages, types of illustrative data rendering and visualization.

I am very much looking forward to the next level and what they teach us next!


Contact

For more inquiries, you can find me here:


If you enjoyed this article please recommend and share.