Basic Statistics: Data and Its Tabular Representation

Mining quick insights from data is a foundational step in statistics. This article offers tips to mine information from data efficiently using tabular representation.

Priyam Banerjee
Analytics Vidhya
11 min read · May 10, 2020


Photo by Christine Sandu on Unsplash

The internet today is abuzz with multiple articles, e-books and online courses claiming to make your journey smoother in the field of AI, Data Science and Machine Learning. I have personally learnt from many such courses and have found them to be immensely helpful in picking up the basics or advancements in many new areas that can solve age old problems with efficiency. While these courses are really useful, there is some preparatory reading absolute beginners would need to do.

My posts have the humble goal of summarising the literature already available on basic concepts in statistics. My hope is that a beginner who wants to succeed in the field of Data Science and Machine Learning can climb the first rung of the ladder after reading this compilation of important basic concepts.

I plan to run a short series of articles on concepts in statistics that are used day to day in any data mining work.

What is Data?

Before we dive into data, let us start with the most basic question about our subject of study: what is Statistics?

My perspective is that it is a branch of mathematical science that helps us collect data and make sense of it by putting it into an analysable format, which enables exploration, interpretation and presentation.

The first thing in the definition is Data. It is a collection of facts: numbers, categories, text, measurements, speech, images, etc. For example, the amount of rainfall received in a city in India every day for a week in July. In the current world, with the COVID-19 pandemic raging on in 2020, there is a deluge of data on the number of infections, recoveries and deaths. Beyond data at the level of people, there is also a lot of data on how the virus behaves.

There are various types of data. At the highest level, there are 2 types:

  • Quantitative: This type of data deals with numbers. For example, the number of recoveries from COVID-19 every day in India, the amount of rain received in the past few days or a company’s daily stock prices over the last 6 months. It can further be broken down into two sub-types: discrete and continuous. A discrete variable can only take certain specified or isolated values. For example, the number of students in a class. A continuous variable, on the other hand, can take any value within a specified interval, including fractions. For example, the height of students in a class. The height can take values like 166 cm or 165.8 cm or even more precise values like 165.7916 cm, depending on how finely we measure it. Quantitative data is also called a variable (as used above) because it can vary based on the situation we are measuring.
  • Qualitative: Qualitative data, also known as an attribute, is a type of data that cannot be expressed as numbers or measured on a numeric scale. However, in certain cases it can be ordered. Nominal attributes are simple categories or classes. They have no measurement scale, nor can they be ordered. Examples would be a person’s gender, like male, female, etc., or country of residence, like India, USA, Germany, Japan, etc. There is a second type of attribute called ordinal. These can be ordered, unlike nominal attributes. For example, your response to a service satisfaction survey at a restaurant or online food delivery can be ordered like very satisfied, satisfied, dissatisfied and extremely dissatisfied. Even though you can’t measure them, you can at least put them in order.
  • There are also some other types of data that are frequently used, like interval and ratio data. Interval data can be ordered and the differences between values have meaning, but there is no true zero. For example, temperature in degrees Celsius: the gap between 25 and 30 degrees is the same as the gap between 17 and 22 degrees, and values can fall below zero, yet 30 degrees is not twice as hot as 15 degrees. Credit scores are often treated as interval data too. Ratio data is an extension of interval data where a true zero is defined. For example, 200 grams is twice as much as 100 grams, and a weight cannot fall below 0.
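To make the distinction concrete, here is a minimal Python sketch of how these types might be represented in a record. The class name SurveyResponse, its field names and the sample values are illustrative assumptions, not something taken from a real data set.

```python
from dataclasses import dataclass

# Ordinal attribute: the categories have a meaningful order, lowest to highest.
SATISFACTION_LEVELS = [
    "extremely dissatisfied",
    "dissatisfied",
    "satisfied",
    "very satisfied",
]


@dataclass
class SurveyResponse:
    students_in_class: int  # quantitative, discrete: only whole counts
    height_cm: float        # quantitative, continuous: any value in a range
    country: str            # qualitative, nominal: categories with no order
    satisfaction: str       # qualitative, ordinal: categories with an order


def satisfaction_rank(level: str) -> int:
    """Position of an ordinal level; only the order is meaningful, not the gap."""
    return SATISFACTION_LEVELS.index(level)


# Hypothetical sample record.
response = SurveyResponse(
    students_in_class=40,
    height_cm=165.8,
    country="India",
    satisfaction="satisfied",
)

# Ordinal values can be compared by rank even though they cannot be measured.
is_happier = satisfaction_rank("very satisfied") > satisfaction_rank(response.satisfaction)
print(is_happier)

# Ratio data has a true zero, so "twice as much" is a meaningful statement.
weight_ratio = 200 / 100
print(weight_ratio)
```

Note that the country field is only ever compared for equality, while the satisfaction field needs an explicit ordering to be compared at all.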

Types of Data Collection

Now that we understand the different types of data, let us look at how data is collected and where it comes from.

  • Primary Data: Primary data is collected first hand. The end user or investigator would go out in the field to collect this data himself or herself. For example, as a kitchen manager at your college hostel, you need to plan the daily meals and the primary information you need for that is the number of vegetarian vs non-vegetarian students in your hostel. You go to every student’s room and ask them their meal preferences, which helps you order the right amount of kitchen supplies. Or, there may be a company doing a market survey prior to launching one of their skin care products in the city of Mumbai. They send some of their employees with a questionnaire out in the city to survey people on what they look for in a skin care product. (Food for thought: do you think they can survey all the people in a densely populated city like Mumbai?)
  • Secondary Data: This data is not collected by the investigator himself/herself. In most cases you purchase it from someone else who has taken the pains of collecting it, or download it for free from public websites. For example, if you want to know the monthly unemployment rate in the USA for the last year, you can download it from the U.S. Bureau of Labor Statistics website. COVID-19 infection and death rates, which are widely searched for in 2020, are also obtained from secondary sources like a country’s ministry of health website, such as the one from India.

There are advantages and disadvantages to both types of data collection, and at the end of the day one needs to weigh the costs against the benefits. Primary data is more accurate, is very specific to the investigation and provides flexibility in collection. But it demands significant cost, planning, time and effort. Secondary data involves fewer hassles as it avoids the elaborate data collection exercise, but it offers limited flexibility and not all of the data may be relevant to the investigation.

Tabulation and Representation of Data

Once you have collected the data, how do you store it and extract basic insights from it? This is a very important concept. You may have done a diligent job of collecting all the information required for your investigation, but if you are not able to present it properly to your superiors, they will be unsure of the credibility of your analysis. That brings us to the concept of tabulation and representation of data. Personally, this is one of the most exciting things I work on with data. For this section, we shall walk through a common day-to-day example.

Let us say you work for an educational content creator. Your manager tells you that the company plans to launch a new undergraduate level statistics and programming online course named ‘StatAdv’ in the state of Maharashtra, India. Now your job is to help her review the marketplace for this content. She asks you to start simple, concentrate on the 4 main urban clusters in the state (Mumbai, Pune, Nagpur and Thane) and collect data on students appearing for school leaving board exams in 2021. You go to all possible schools in these cities and collect data on the students. Every school provides you this data in a different format.

After a few days of hard work collecting the data, you finally get back to the office, get a cup of coffee and think about how to take this data, which is all over the place, to your manager. One sip into the coffee and voila! You have a spreadsheet application on your computer (e.g., Microsoft Excel and its ilk). You open it up and try putting the collected data into a nicer shape, in the form of tables. This is the process of Tabulation. You first create the base that will be used for multiple other purposes. A snapshot is shown below for illustration (complete data not given):

Fig 1: Put the collected data into a table

Your data table is now created (Fig 1). You have an observation for each student in each school in each city across the rows. The columns show you the variables and attributes (remember the earlier section?) you have collected for each observation.
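For readers who prefer code to spreadsheets, a table of this shape can be sketched in plain Python as a list of records. The column names, school names and values below are hypothetical stand-ins, since the actual contents of the table are not reproduced here.

```python
# Hypothetical snapshot: each dict is one observation (one student),
# and each key is a column holding a variable or an attribute.
students = [
    {"city": "Mumbai", "school": "School A", "exam_year": 2021,
     "stream": "Science", "math_marks_out_of_100": 82},
    {"city": "Mumbai", "school": "School B", "exam_year": 2020,
     "stream": "Commerce", "math_marks_out_of_100": 74},
    {"city": "Pune", "school": "School C", "exam_year": 2021,
     "stream": "Arts", "math_marks_out_of_100": 65},
    {"city": "Nagpur", "school": "School D", "exam_year": 2020,
     "stream": "Science", "math_marks_out_of_100": 91},
]

# The column names come from the keys of the first record.
columns = list(students[0].keys())

# Print a header row followed by one row per observation.
print(" | ".join(columns))
for student in students:
    print(" | ".join(str(student[column]) for column in columns))
```

Here "city" and "stream" are attributes, while "math_marks_out_of_100" is a variable whose name states its maximum value.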

If you now take this simple table to your manager, do you think she will be impressed? Probably not. She might be too busy to look through all of it and ask you to highlight the key insights. What do you do now? You need to create something that is more intuitive to look at. Before you go there, it is worth highlighting that even in a simple table like Fig 1, where you just recorded the collected data, it is good practice to make it intuitive. Look at the column headers. In columns like ‘City’, our mind is already conditioned to think about the possible values (Mumbai, Nagpur, etc.). However, in columns like pre-test marks you need a bit more information. While Math is a subject by itself, the science subjects are a combination of, let us say, Physics and Chemistry. So, while Math has a maximum possible mark of 100, Physics and Chemistry combined can have a maximum of 200. It is good practice to call this out in the headers, as in the table above, where parentheses clarify the maximum marks.

So, how do you draw insights now? Let us think about what you can obtain with the least effort that still gives your manager something to chew on. You would first want to know the total number of students and, if possible, the number by city. Since each observation in the table is a student, you simply count the number of observations in each city. Thus, you get the following:

Fig 2: Simple Frequency Table

You have now created something that is known by a couple of names: simple table and frequency table. A simple table is a table that shows a summary of only one characteristic. In Fig 2, you are showing only the count of students in each city. A frequency table simply shows the count of observations for a particular item of concern. In Fig 2, that item is students.
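Counting observations per city is easy to sketch in Python with collections.Counter from the standard library. The list of cities below is a tiny hypothetical sample, with one entry per student, and is not the article's data.

```python
from collections import Counter

# Hypothetical sample: the city recorded for each student observation.
cities = ["Mumbai", "Pune", "Mumbai", "Nagpur", "Mumbai", "Pune"]

# A frequency table maps each distinct value to the number of observations.
frequency = Counter(cities)

# Print the table with the most frequent city first, then a total row.
print(f"{'City':<10}{'Students':>10}")
for city, count in frequency.most_common():
    print(f"{city:<10}{count:>10}")
print(f"{'Total':<10}{sum(frequency.values()):>10}")
```

The total row is a quick sanity check: it should always equal the number of observations in the raw table.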

Your manager now starts talking. It is evident that your company has a much larger base of students to sell to in Mumbai than in Nashik. Now encouraged, you go back to drill down some more. What do you do next? Notice that the schools have not only given you the list of students appearing for the exam in 2021, they have given you the 2020 data as well. You probably did not ask for it, but it might be an interesting addition. Consider that 2020 students will be enrolling into college sooner than 2021 students. So, the former group may have a more urgent need for your content than the latter:

Fig 3: Complex table showing frequency of more than one characteristic
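A table that counts by two characteristics at once can be sketched in Python by counting pairs of values. The observations below are a small hypothetical sample, not the figures in the table.

```python
from collections import Counter

# Hypothetical sample: (city, exam year) recorded for each student.
observations = [
    ("Mumbai", 2020), ("Mumbai", 2021), ("Mumbai", 2021),
    ("Pune", 2020), ("Pune", 2021),
    ("Nagpur", 2020), ("Nagpur", 2020), ("Nagpur", 2021),
]

# Counting pairs gives the frequency for every city and year combination.
counts = Counter(observations)

# Distinct row labels (cities) and column labels (years), in sorted order.
cities = sorted({city for city, _ in observations})
years = sorted({year for _, year in observations})

# Print the years across the top and one row per city.
print(f"{'City':<10}" + "".join(f"{year:>8}" for year in years))
for city in cities:
    print(f"{city:<10}" + "".join(f"{counts[(city, year)]:>8}" for year in years))
```

A Counter returns 0 for a missing key, so a city with no students in a given year would still print cleanly.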

You have successfully created a cut of the frequency table in Fig 3 that shows the count of students by city and year of appearance in school leaving exams. You know that there are 45,000 students in Mumbai who may need the content in the next few weeks vs 55,000 who still have some time and may need it later in the year. This is an example of a complex table, which shows a summary for more than one characteristic. Notice a few things here:

  • Construct of a table: A table has various components. Fig 4 below uses Fig 3 as an example and shows the different possible components. Notice how it becomes much easier to read with the addition of a title, source and footnote:
Fig 4: Different components of a table

The title lays out the description of the table or the purpose it is meant for. The caption describes the columns and sub-columns. The stub describes the characteristics in the rows, in the leftmost column. The body is the main part and contains all the numeric information categorised by the caption and stub elements. The source captures where the data was collected from, and the footnote lists any points worth noting. Here, for example, it gives the reason for not collecting data from 10 schools in Nashik.

  • Colour highlight: Notice how the different rows have been coloured. You can see that in all cities the number of students appearing has gone up between 2020 and 2021, except in Nagpur, where it has gone down. This may have various reasons: enrolment going down, fewer students qualifying or a data collection/tabulation issue. It is easy to call these out using a colouring scheme. Look how an orange cell sticks out in a table body which is mostly green. Also, take a moment to look back at the image at the beginning of the post. It lists COVID-19 statistics by day and country. Notice how the number of deaths is highlighted in red, making it obvious that something needs attention. Thus, using colours helps highlight any information that you want the audience to take note of.
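The logic behind such a colour scheme can be sketched in Python as a flag on the year-over-year change. Only the Mumbai counts below come from the text. The Pune and Nagpur counts and the function name flag_changes are hypothetical.

```python
# Students appearing per city and year. The Mumbai figures are quoted in the
# text; the Pune and Nagpur figures are hypothetical placeholders.
summary = {
    "Mumbai": {2020: 45000, 2021: 55000},
    "Pune": {2020: 20000, 2021: 23000},
    "Nagpur": {2020: 12000, 2021: 11000},
}


def flag_changes(table, start_year=2020, end_year=2021):
    """Label each city by whether its count fell between the two years."""
    flags = {}
    for city, by_year in table.items():
        change = by_year[end_year] - by_year[start_year]
        # A fall is the case worth drawing attention to (orange in a table).
        flags[city] = "down" if change < 0 else "up or flat"
    return flags


flags = flag_changes(summary)
for city, flag in flags.items():
    print(f"{city:<10}{flag}")
```

In a spreadsheet the same rule would typically drive conditional formatting, so the flagged rows are coloured automatically rather than by hand.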

Excellent! You have learnt some basic concepts about tabulation. However, there are more cuts of the data you can make. For example, do you think all students need the StatAdv course content? Reasonably, it is science and commerce stream students who need it more as a part of their curriculum, and maybe a small handful of Arts students would read it purely out of interest. Fig 5 shows a further cut by stream to make the real customer base (science and commerce students) more evident.

Fig 5: Further drill down to find more insights from the tabulation

All these cuts of the raw data in Fig 1 can be done using simple tools like Pivot Table in MS Excel. Microsoft provides a tutorial on Pivot Tables. However, after a point you need to stop making further cuts as they might complicate the representation. So before starting any type of data tabulation, you need to clearly set the objective that you want to achieve. In the example problem in this post, we set out to get a basic estimate of a customer base that is very likely to purchase the course content. By drilling down using tabulation methods, we can now see that we have a good base among science and commerce students appearing for 2020 school leaving board exams in Mumbai! Your manager is impressed with this initial story, even though there are multiple other statistical techniques which can draw more inferences from this data. Those are a notch above this level. So, let us discuss those another day.
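The idea behind a pivot table can also be sketched in a few lines of plain Python. This is not how Excel implements it. The helper pivot_count, the column names and the sample records are all hypothetical.

```python
from collections import defaultdict


def pivot_count(rows, row_key, col_key):
    """Count observations for every combination of two characteristics."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_key]][row[col_key]] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {label: dict(cols) for label, cols in table.items()}


# Hypothetical sample: one record per student.
students = [
    {"city": "Mumbai", "stream": "Science"},
    {"city": "Mumbai", "stream": "Commerce"},
    {"city": "Mumbai", "stream": "Arts"},
    {"city": "Pune", "stream": "Science"},
    {"city": "Pune", "stream": "Arts"},
    {"city": "Nagpur", "stream": "Commerce"},
]

by_city_and_stream = pivot_count(students, "city", "stream")

# Likely customer base: science and commerce students in each city.
target_base = {
    city: streams.get("Science", 0) + streams.get("Commerce", 0)
    for city, streams in by_city_and_stream.items()
}
print(by_city_and_stream)
print(target_base)
```

Changing the two key arguments, for example to an exam year column and "stream", gives a different cut of the same raw data, which is why it pays to fix the objective before cutting further.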

Your manager is a math graduate and understands all of it. But now the two of you need to take it to the marketing head next week and he has a secret enmity with math. Will these tables impress him as much? Or would many more visuals have a better effect? I shall cover this part of pictorial representation of data in my next post.

Note: All data used in this post are for illustrative purposes only and may not be accurate in the real world.
