Of Data

Anwita Ghosh
7 min read · Apr 29, 2023


Created on Canva by Anwita G

What is Data?

Data, stated simply, is a collection of facts about something. That is, we gather words, numbers, measurements, observations and/or descriptions relevant to a problem we might be interested in (or its components) and then present them in a form that helps us make sense of them faster. This presentation is typically a table, where the facts are sorted by what they convey about our problem of interest, the types they fall into, etc. — far easier to read than words and numbers jumbled up and thrown onto a page.

Data and Information:

The word ‘data’ is often used interchangeably with ‘information’ when we’re having conversations in everyday life — information of certain kinds, presented in certain ways, etc., but information regardless. However, there is a subtle difference in the meaning of the two words. Data, by itself, is unrefined and raw, and we would need to process it further before we can draw any insights from it.

However, information is data that has been processed, organised and given a context. Information depends on data (i.e. there would be no information without data), and is generally sufficient to make decisions. We can draw actual insights from information and make decisions based on these insights, which would not be possible from data in its rawest forms.

For example, a typical university admin department would have records of their students, the degrees and courses they have enrolled for, their academic history, current grades, expected graduation year, classroom attendance (if they’re keeping track), non-classroom activities like student club memberships, etc. All of these records would collectively be data, i.e. piles and piles of facts about the student body in all their raw, unprocessed glory.

However, if a company wants to hire students in campus placement drives, it would need to know a few things about them, which it would communicate to the university. Now, the university cannot just hand their student data over as it is. They would need to sort the data, and provide only what’s relevant to the placement drive — e.g., a list of students who are expected to graduate that year, their qualifications and job experience (if any), their grades, non-academic performance, etc. The data that goes to the company has been processed, analysed and placed in context, making it information.

The typical university admin department doesn’t often leave data lying around without at least sorting it, processing it, and drawing some insight from it. The university would almost always have some information about its students at hand, ready for immediate use when the need arises. That means the process of sharing information with companies interested in hiring their students is typically faster than the above example might suggest.

Representing Data as Tables:

We often use data for analysis in the form of tables in order to organize the data based on certain characteristics. A table is a rectangular arrangement of data, with rows and columns which carry specific meaning. For example, suppose a bicycle store sells ten bicycles on a given day, and the proprietor records the name of each customer who bought a bike, the model of bike they bought and its price. He would organize his data something like this for easier reference later:

Constructed in Jupyter Notebooks by Anwita G

Each horizontal array in the table is a row. It represents an observation or case in the data — i.e. each sale from the bike store, and all attributes for that particular sale: who bought the bike, its model and price. If ten such people come and buy bikes from this store, there will be ten rows.

Meanwhile, each vertical array in the table is called a column or variable, i.e. it holds the different values that each observation takes for one common attribute — for example, the above table has one column for the customers who bought the bikes, one for the model, and one for the price. If these bikes had more attributes in common, the table would’ve shown them as additional columns.

Tables are also called data frames, or datasets, and can later be used to construct graphs and pictures, which make it easier to understand for someone otherwise inexperienced in reading tables (or even a data scientist who wants to understand what their data looks like in a single glance, before going into further, more complicated analyses of the data).
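That single-glance look at a data frame is easy to sketch with pandas itself. The tiny table below is illustrative (made-up customers and prices), but `head()` and `describe()` are the standard first moves on any real dataset:

```python
import pandas as pd

# A tiny, made-up sales table for illustration
df = pd.DataFrame({
    'Customer': ['A', 'B', 'C'],
    'Model':    ['a', 'b', 'c'],
    'Price':    [2000, 2500, 3000],
})

print(df.head())               # the first few rows, at a glance
print(df['Price'].describe())  # quick numeric summary: count, mean, min, max, ...
```

On a real dataset with thousands of rows, these two calls are usually the first thing a data scientist runs before any deeper analysis.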

Types of Variables:

The variables in a table can be one of two basic types, based on the kind of values they hold:

  • Quantitative Variables, which hold exclusively numerical values such that some values are larger than others, implying quantifiable differences in magnitude.
    Also, observations similar in value are expected to be similar in properties. That is, if two students have similar scores, one might expect them to have spent similar amounts of time studying, etc.
  • Qualitative/Categorical Variables, which expect their values to belong to one of a finite set of categories, like the year in which a student is expected to graduate, or whether they’re residing in the university dormitories or not, and so on.
    Typically, categorical variables do not allow explicit ordering/ranking between the categories. That is, students graduating in 2022 are in no way superior to students graduating in 2023 based on graduation year alone.
However, there is a subtype of categorical variables, called the ordered categorical variable, that does allow ranking between categories. For example, a student may be ranked first, second, or third in their class. However, there is no fixed, quantifiable magnitude between these ranks. That is, the difference between the first and second ranks may not be the same as the difference between the second and third ranks. Also, the difference between the first and second ranks in one class may not be the same as the difference between the first and second ranks in another class.
    It’s just that the student who ranked first performed better in class than the student who ranked second, who, in turn, performed better than the student who ranked third.
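Pandas can represent this distinction directly. The sketch below (with hypothetical class ranks) builds an ordered categorical variable, where the categories can be compared but carry no numeric distance between them:

```python
import pandas as pd

# Hypothetical class ranks as an ordered categorical variable:
# 'third' < 'second' < 'first' is an ordering, not a measurement
ranks = pd.Categorical(
    ['second', 'first', 'third', 'first'],
    categories=['third', 'second', 'first'],
    ordered=True,
)
s = pd.Series(ranks)

print(s.min())  # lowest rank present in the data
print(s.max())  # highest rank present in the data
```

An unordered categorical (e.g. graduation year used purely as a label) would be built the same way but with `ordered=False`, and comparisons like `min()` would then raise an error — which is exactly the point.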

Creating Data Tables and Importing Data in Python

When we want to analyse data in Python, we’d first like to get our data into our working environment (in my case, a Jupyter Notebook). We can do this in two ways:

  1. Manually create the table:
Here we enter the data manually, and then create a table from it. Unlike SQL, where we enter each row separately, we can simply create a dictionary in Python, with the column headers as keys and the columns as lists of values, and convert it into a data frame using the pandas library.
    For example, let’s create a toy sales dataset for a bicycle shop, showing the name of the customer who bought a bike, the model of bike and its price.
import pandas as pd

Data = {'Customer': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
        'Model':    ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'c', 'e', 'a'],
        'Price':    [2000, 2500, 3000, 4000, 1200, 2500, 4000, 3000, 1200, 2000]}
df = pd.DataFrame(Data)
df  # df is now the name of our dataframe

Which gives the following table as output:

Created on Jupyter Notebooks by Anwita G

Here, the column on the extreme left (i.e. the one before the ‘Customer’ column) shows the row indexes for each row in the table. That is, it gives the row number of each observation. Indexing in Python begins at zero instead of one. Thus, index ‘0’ refers to row 1 and so on.
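Those indexes aren’t just decoration — they’re how we pull individual rows and cells back out of the table. A small sketch, reusing a cut-down version of the bicycle data:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['A', 'B', 'C'],
    'Model':    ['a', 'b', 'c'],
    'Price':    [2000, 2500, 3000],
})

first_row = df.iloc[0]          # position-based: the row at position 0
print(first_row['Customer'])    # the first customer in the table
print(df.loc[2, 'Price'])       # label-based: row labelled 2, column 'Price'
```

`iloc` always counts positions from zero, while `loc` looks up the index label — the two happen to coincide here because pandas assigned the default 0, 1, 2, … index.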

Also, note that using lists to create datasets is only one of many ways of doing it. This method is just the one I find easiest to execute.

2. Importing Data Directly:
The above example showed a table with only 10 rows and 3 columns. However, real life data often contains thousands of rows and columns, which would take forever to enter manually, and would cost a fortune in time and money.
Thus, Python has a way of directly importing data files from your computer. As in the case above, importing data directly also uses the pandas library.

df = pd.read_csv('toy_dataset.csv', sep=',',header=0)
df # df is now the name of our dataframe

This gives us the output:

Data downloaded from Kaggle, and table created in Jupyter Notebooks by Anwita G

Note that this dataset has 150,000 rows and 6 columns, which is a lot of data!
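With a table that large, the first sanity checks are `shape` and `dtypes`. Since the Kaggle file itself isn’t included here, the sketch below simulates a small CSV in memory with `io.StringIO` — a real file path passed to `read_csv` works exactly the same way:

```python
import io
import pandas as pd

# Simulate a tiny CSV file in memory (a real path would work the same way)
csv_text = """Customer,Model,Price
A,a,2000
B,b,2500
C,c,3000
"""
df = pd.read_csv(io.StringIO(csv_text), sep=',', header=0)

print(df.shape)   # (rows, columns) -- the quickest check after loading
print(df.dtypes)  # which columns came in as numbers vs. text
```

Note that `sep=','` and `header=0` are pandas’ defaults for `read_csv`, so they can be omitted for a standard CSV — I spell them out only to show where they’d change for, say, a tab-separated file.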

Thus, we have taken our first step in analysing data: finding data to analyse and setting it up in a tool. Now, we must first explore its properties before we move forward. This, I will cover in future posts.

Thanks for stopping by!

P.S. There’s a lot more to data itself than what I’ve written up here. The Internet has vast resources that cover various aspects of data — and I have only tested the waters with my post.


Anwita Ghosh

Data Scientist in FinTech, PGP Data Science, M.Sc. Economics