For Beginners in Data Science — Different Types of Data

Lakshmi Prakash
Design and Development
Aug 13, 2022 (5 min read)

Before learning any theory in data science, you should first familiarize yourself with the different types of data that exist. This would ideally be your lesson 101 in data science. Explaining any theory in statistics or data science, or the different algorithms used in machine learning, is worthwhile only once the basics are well understood.

What are the different forms in which you can find data around you? Before answering this question, ask yourself: what does “data” mean to you? When you read the word or hear someone talking about it, what comes to mind? In today’s world, there’s data everywhere. Do you have an account on an e-commerce platform like Amazon, or do you subscribe to Netflix? They have your data. Do you have a bank account or a government-issued ID or certificate? They have your data. Did you graduate from a school or university? They have your data. Do you visit a hospital? They have your data. Do you work with clients and provide goods or services to them? You have their data. Every time you look something up on Google or YouTube, what you’re seeing is data. And YouTube keeps track of what you, as a member, would enjoy watching. They have your data. Your grocery store owner most likely has your data too. When you email a friend or a colleague, you are sharing data.

All of us contribute to data collection and receive data, many of us record or save lots of data for different purposes, and some of us study the data to make sense of it. This is one of the reasons that data science has become so popular in the last few decades: large volumes of data are available on almost anything humans would be interested in. This wasn’t the case earlier, but now most people who are privileged enough to have access to machines can easily give and get lots of data.

There is data everywhere around us in today’s world

Coming to the different forms of data, you could find data in any of the following forms: text, voice, videos, images, tables, code, documents, and so on.

Structured data and Unstructured data: This is easy to understand once you understand the different forms of data. Structured data is data that is neatly arranged, usually in labelled rows and columns, so that a viewer or reader can easily understand what the data is about and what it tells us. The most common examples would be an accounts notebook that records monthly income and expenses or a school’s database of students who have enrolled in different courses for that year. Unstructured data, on the other hand, is data that does not have any classification, label, or structure, and therefore, understanding the content could be harder; you have to go through every piece of the data to see if you can make sense of it. Common examples could be a collection of random photos and screenshots on a user’s smartphone or a collection of chat messages in a group. You don’t know what it’s all about; it could be about anything.
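To make the difference concrete, here is a minimal sketch in Python. The student names, courses, and chat messages are made up for illustration, and the two helper functions are hypothetical. With structured records you can ask a question by field name; with unstructured text, the best you can do in this simple example is search for a word and hope it means what you think it means.

```python
# Structured: every record has the same labelled fields.
enrollments = [
    {"student": "Asha", "course": "Statistics", "year": 2022},
    {"student": "Ravi", "course": "Biology", "year": 2022},
    {"student": "Meena", "course": "Statistics", "year": 2022},
]

# Unstructured: free-form chat messages with no fields or labels.
chat_messages = [
    "hey are we still meeting at 5?",
    "I signed up for Statistics btw",
    "lol check out this photo",
]


def students_in(course, records):
    """Answer a question directly by looking at a labelled field."""
    return [r["student"] for r in records if r["course"] == course]


def messages_mentioning(word, messages):
    """Crude keyword search; it cannot tell us who enrolled in what."""
    return [m for m in messages if word.lower() in m.lower()]


print(students_in("Statistics", enrollments))
print(messages_mentioning("statistics", chat_messages))
```

The first call returns a clean list of students. The second only returns the messages that happen to contain the word, and someone still has to read them to work out what they mean.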

Static data and Dynamic data: As the names tell us, static data is data that doesn’t change. It remains the same from the time it is recorded. Examples would include the final grades of a batch of students or a table consisting of names of people and their biological parents. Dynamic data is data that keeps changing, either regularly or irregularly. Examples would be the location of a smartphone on maps, the price of a product in the market, the vitals of an individual, etc.
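One practical consequence, sketched below in Python with made-up values: static data can be stored once, while dynamic data is usually stored as a series of readings, each with the time it was taken. The names `final_grades`, `price_history`, and `latest` are assumptions for this example only.

```python
from datetime import date

# Static: recorded once and not expected to change.
final_grades = {"Asha": "A", "Ravi": "B"}

# Dynamic: each reading is only true for a moment,
# so every value is stored together with its date.
price_history = [
    (date(2022, 8, 1), 49.0),
    (date(2022, 8, 8), 52.5),
    (date(2022, 8, 13), 51.0),
]


def latest(readings):
    """Return the value from the most recent (date, value) reading."""
    return max(readings, key=lambda reading: reading[0])[1]


print(final_grades["Asha"])
print(latest(price_history))
```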

In statistics, how is data generally classified? In terms of statistics, data can be broadly classified into two types of variables: categorical variables and numeric variables. These are also called “qualitative variables” and “quantitative variables”.

Categorical variables could be of two types: nominal and ordinal. A nominal variable is something that is merely used to identify or label a value, something like a name or username or a group’s name or a category’s name or a Boolean value. It is just used for identification purposes and there’s nothing more to it. An ordinal variable, on the other hand, is one whose values have a meaningful order. For example, a rating on a scale or a customer satisfaction level can be called an ordinal variable, as its values can be ranked from lowest to highest. Note that identifiers such as students’ ID numbers or batch numbers are usually nominal even though they are written as numbers: you can sort them, but the order generally tells you nothing about the students themselves.
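Here is a small sketch of why the distinction matters, using only Python’s standard library. The city names, the three-level satisfaction scale, and both helper functions are assumptions made up for this example. For a nominal variable, about all you can do is count how often each label appears; for an ordinal variable, you can also ask for a “middle” value, because the labels have an order.

```python
from collections import Counter

# Nominal: labels with no order.
cities = ["Chennai", "Mumbai", "Chennai", "Delhi"]

# Ordinal: labels with an order that we must state explicitly.
SATISFACTION_ORDER = ["low", "medium", "high"]  # hypothetical scale
responses = ["high", "low", "medium", "high", "medium"]


def most_common(values):
    """The most frequent label; valid for nominal and ordinal data."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, order):
    """Middle label after ranking; only meaningful for ordinal data.

    For an even number of values this picks the upper of the two middle ones.
    """
    ranks = sorted(order.index(v) for v in values)
    return order[ranks[len(ranks) // 2]]


print(most_common(cities))
print(ordinal_median(responses, SATISFACTION_ORDER))
```

Asking for the “median city” would make no sense, which is a quick way to check whether a variable is nominal or ordinal.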

A numeric variable, as the name suggests, represents a numeric or quantitative value. Numeric variables are usually described as either discrete or continuous. (Interval and ratio, which are often listed alongside these, are scales of measurement rather than additional kinds of numeric variable.) Imagine a simple number line. If a numeric variable can take only separate, distinct points on that line, with gaps between them, then it is a discrete variable. For example, the number of properties you own, the number of meals you eat per day, how many cars are sold each year, how many people have been affected by Covid-19, all these are discrete variables. A continuous variable, on the other hand, is one which runs continuously along the number line; it can take not only two distinct values but also any value between them. Examples of continuous variables would be height, weight, stock price, the length (time) of a video, blood pressure, etc.

Also, keep in mind, discrete variables are used for the purpose of counting data, while continuous variables are used for the purpose of measuring data.
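As a rough illustration of counting versus measuring, the sketch below uses a simple rule of thumb: if every value in a column is a whole number, the column is probably a count. The function name `looks_discrete` and the sample values are made up for this example, and the rule is only a heuristic. A measurement that happens to be recorded in whole numbers (say, height rounded to the nearest centimetre) would fool it, so the final call is still yours.

```python
def looks_discrete(values):
    """Heuristic: True if every value is a whole number (likely a count)."""
    return all(float(v).is_integer() for v in values)


cars_sold_per_year = [120, 98, 143]   # counted
heights_cm = [162.5, 170.2, 158.0]    # measured

print(looks_discrete(cars_sold_per_year))
print(looks_discrete(heights_cm))
```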

Is all data useful, or is some of it meaningless?

Huge volumes of data are generated and a lot of data is stored. But is every piece of data really worth being stored or studied? Not necessarily. Note that storing and studying huge amounts of data would be very expensive, so it is important to be aware of what is necessary and what is not. Also, in terms of relatively smaller projects and studies, the same fact applies: you don’t have to take into account every variable in a dataset.

To be able to decide which pieces of data you need and which you don’t, you should either be clearly aware of your needs or have a hypothesis about the connection between two or more variables. Consider a dataset that you’d get from a resource, or a set of data that you’d get using an API, or the data that a client might provide to you. None of these is guaranteed to give you all the data that you seek, and at the same time, not all of what they give you will be useful either.

To save time, money, and effort, you should be able to decide which variables or values are necessary for your project and which are not, so that you can get rid of all the unnecessary data before asking a person or machine to study it all.
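A minimal sketch of this idea in Python: suppose your hypothesis is that the number of hours studied is related to the exam score. The records, the column names, and the `keep_columns` helper below are all made up for illustration. Everything that isn’t needed for the hypothesis is dropped before any analysis begins.

```python
# Hypothetical dataset with more columns than the question needs.
records = [
    {"student": "Asha", "hours_studied": 12, "exam_score": 81,
     "shoe_size": 6, "favourite_colour": "blue"},
    {"student": "Ravi", "hours_studied": 7, "exam_score": 64,
     "shoe_size": 9, "favourite_colour": "green"},
]


def keep_columns(rows, wanted):
    """Return new rows containing only the wanted columns."""
    return [{key: row[key] for key in wanted if key in row} for row in rows]


# Keep only the variables the hypothesis is about.
trimmed = keep_columns(records, ["hours_studied", "exam_score"])
print(trimmed)
```

The original `records` list is left unchanged, so you can always go back and pick a different set of variables if your hypothesis changes.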

Lakshmi Prakash
Design and Development

A conversation designer and writer interested in technology, mental health, gender equality, behavioral sciences, and more.