Unraveling the Magic Carpet of Data Types: A Pythonic Expedition

Reza Shokrzad
8 min read · Jun 6, 2023


1. Introduction

1.1. The Start of My Data Journey

Think back to the time when you first started exploring data. It’s like walking into a party where you don’t know anyone. Everyone is speaking a language that seems familiar but oddly confusing. Numbers, categories, sequences of time — they all merge into a challenging puzzle. Understanding this puzzle was my initial step towards advanced data analysis, and it’s the reason behind this blog.

1.2. Welcome Note

Hello and welcome! This blog is designed for anyone who’s interested in data, whether you’re a beginner data scientist, a student, or someone who’s simply curious about the power of data. You don’t need any prior knowledge — just an open mind and a willingness to learn.

1.3. What This Blog Covers

In this blog, we'll talk about three types of data: categorical, numerical, and time-series data. We'll use Python, a popular programming language, to understand how to work with these types. We'll explore the Titanic dataset, which contains categorical and numerical columns, and build a small synthetic time series from it. Visualization is saved for a follow-up post.

1.4. Importance of Understanding Data Types

Knowing your data types is a crucial first step in any data analysis. Different data types need different methods of analysis and visualization. For example, ‘age’ is numerical data, while ‘gender’ is categorical. Recognizing these differences can help you make better decisions about how to handle your data.

So, let’s start our journey into the world of data. You’ll be surprised by how much you can discover!

2. Casting Spells with Python

Python is our magic wand in this data exploration journey. This versatile programming language has cemented its position as a preferred tool in the data science world, thanks to its simplicity and vast collection of libraries and frameworks.

First, let’s briefly touch on what Python is: Python is a high-level, interpreted programming language that emphasizes readability and reduces the cost of program maintenance. It supports modules and packages, which encourages program modularity and code reuse. Python is also very beginner-friendly, making it an excellent choice for novice programmers and data enthusiasts.

But, why do we choose Python for data exploration? Primarily because of its expansive ecosystem of data-centric libraries, including but not limited to pandas for data manipulation, NumPy for numerical computation, matplotlib and seaborn for visualization, and scikit-learn for machine learning.

In the next section, we will get our hands dirty and start playing with data using Python. But remember, it’s okay if you don’t get it right away. The key is to stay curious and keep practicing. After all, every great data scientist started from the basics, just like we are doing right now.

3. The Tale of the Titanic: Our Dataset

The Titanic dataset is a compelling mixture of numerical and categorical data. It contains information about the passengers on the ill-fated Titanic voyage in 1912: passenger class, sex, and port of embarkation (categorical), age and fare paid (numerical), and survival status. It has no date or time column, so it holds no true time-series data. Any sequence of events has to be assumed rather than read from the data, which is why the time-series example later in this post is synthetic.

This dataset has become a popular choice for budding data scientists to practice their skills, owing to the diverse range of data types and the wealth of learning opportunities it offers.

3.1. How to Access and Load the Dataset in Python

You can download the Titanic dataset from the Titanic competition page on Kaggle. The dataset is split into two parts, train.csv and test.csv. For our purposes, we will use the train.csv file, as it contains both the features and the target (survival status).

To load this dataset into Python, we will use the pandas library, which provides data structures and tools for data manipulation. Here's a simple code snippet to load the Titanic dataset.

import pandas as pd
# Load the dataset
titanic_df = pd.read_csv('train.csv')
# Let's take a peek at the first few rows of the dataframe
print(titanic_df.head())

In this code, pd.read_csv('train.csv') reads the CSV file and converts it into a DataFrame, a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes. The head() method returns the first rows of the DataFrame, 5 by default.
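If you want to try these calls before downloading anything, here is a minimal sketch that reads a few hand-typed rows laid out like train.csv. The rows and passenger names are made up for illustration, and `sample_csv` and `sample_df` are hypothetical names, not part of the dataset.

```python
import io

import pandas as pd

# Made-up rows that mimic the column layout of train.csv (not real passengers)
sample_csv = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B 0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C 0003,7.93,,S
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.head(2))  # head(n) returns the first n rows
```

Empty fields, such as the missing age in the third row, are read as NaN.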

Remember, you need to replace 'train.csv' with the actual path where your downloaded Titanic dataset is stored. For instance, if you've stored it in a directory named 'datasets' in your 'D' drive, the path would look something like 'D:/datasets/train.csv'.

Now, we’re ready to dive into the world of data types with our Titanic dataset in hand. Buckle up, and let’s get started!

4. Embarking on the Data Voyage: Understanding Different Data Types

The Titanic dataset has a total of 12 columns. Among these, 'Survived', 'Pclass', 'Sex', 'Embarked', and 'Cabin' are categorical. 'Pclass' is an ordinal variable, which means it is categorical with a natural ordering (1st, 2nd, and 3rd class). 'Survived' is binary (0 or 1), while 'Sex', 'Embarked', and 'Cabin' are nominal, i.e., categorical without a natural ordering.
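The difference between ordinal and nominal can be made explicit in pandas. The sketch below uses made-up values and assumes we want 1st class to rank highest; `class_order` is a hypothetical name.

```python
import pandas as pd

# Made-up passenger classes for illustration
pclass = pd.Series([3, 1, 3, 2, 1])

# Declare an explicit order: 3rd class < 2nd class < 1st class
class_order = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
pclass_ordinal = pclass.astype(class_order)

print(pclass_ordinal.cat.ordered)  # True
print(pclass_ordinal.max())        # 1, the highest-ranked class

# A nominal variable has categories but no order
embarked = pd.Series(["S", "C", "S", "Q"]).astype("category")
print(embarked.cat.ordered)        # False
```

With an ordered dtype, comparisons and min/max follow the declared ranking rather than the numeric value.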

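'SibSp' and 'Parch' are discrete numerical variables: they count siblings/spouses and parents/children aboard, so they can only take on integer values. 'PassengerId' is also stored as an integer, but it is an identifier rather than a quantity, so statistics such as its mean carry no meaning. 'Age' and 'Fare' are continuous numerical variables, as they can theoretically take on any value within a range.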

The ‘Name’ and ‘Ticket’ features are a bit special — they’re textual data, but can also be considered categorical in some way. For example, you might extract the title (Mr, Mrs, Miss, etc) from the ‘Name’ feature and treat that as a new categorical feature.
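Here is one possible way to pull the title out of a name, sketched on made-up names that follow the "Surname, Title. Given names" pattern. The regular expression is an assumption about that pattern, so check it against the real column before relying on it.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" pattern
names = pd.Series(["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The resulting column has only a handful of distinct values, so it can be treated like any other categorical feature.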

Lastly, the 'Cabin' feature could be treated as a time series only if we assumed that cabin numbers were allocated over time, but that is a stretch and isn't a typical way of handling this kind of data.

4.1. Categorical Data

Categorical data describe which group an observation belongs to. Examples of categorical variables are race, sex, age group, and educational level. The values are often non-numeric. When they are numeric (such as 1, 2, and 3 for passenger class), the numbers act as labels for a fixed set of groups rather than as quantities.

Example from Titanic Dataset

In the Titanic dataset, an example of categorical data is the ‘Sex’ column, which represents the gender of the passengers.

Python Code: Working with Categorical Data

Let’s look at how we can explore the ‘Sex’ column in our Titanic dataset.

# Count the number of each category
print(titanic_df['Sex'].value_counts())

# Output
# male 577
# female 314
# Name: Sex, dtype: int64
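Counts are only the starting point for a categorical column. As a small sketch on made-up values (with the real data you would use titanic_df instead of the hypothetical `sample`), we can also look at proportions and compare another column across categories.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Survived": [0, 0, 1, 1],
})

# Share of each category instead of raw counts
shares = sample["Sex"].value_counts(normalize=True)
print(shares)

# Survival rate within each category
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

Because 'Survived' is coded as 0 and 1, its mean within a group is the fraction of that group that survived.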

4.2. Numerical Data

Numerical data represent measurements or counts. The values are numbers that can be ordered and used in arithmetic. Numerical data can be further divided into continuous data (such as age or temperature) and discrete data (like the number of students in a class).

Example from Titanic Dataset

In the Titanic dataset, ‘Age’ and ‘Fare’ are examples of numerical data.

Python Code: Working with Numerical Data

Let’s calculate the mean age and fare of the Titanic passengers.

# Calculate mean age and fare
print("Mean Age:", titanic_df['Age'].mean())
print("Mean Fare:", titanic_df['Fare'].mean())

# Output
# Mean Age: 29.69911764705882
# Mean Fare: 32.2042079685746
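Real columns often contain missing entries, and it helps to know how pandas treats them when summarizing. The sketch below uses a made-up `ages` Series rather than the real 'Age' column.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([22.0, np.nan, 38.0, 30.0])

print(ages.isna().sum())        # number of missing entries
print(ages.mean())              # NaN is skipped by default
print(ages.mean(skipna=False))  # NaN, because one entry is missing
print(ages.median())
print(ages.describe())          # count, mean, std, min, quartiles, max
```

Note that describe() reports the count of non-missing values, which is a quick way to spot gaps in a numerical column.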

4.3. Time-Series Data

Time-series data are measurements that are collected at different points in time. This is opposed to cross-sectional data which observe individuals at a single point in time. While the Titanic dataset doesn’t directly have time-series data, we could create a simple synthetic time-series dataset based on the Titanic dataset for the purpose of this tutorial.

Example from Titanic Dataset (Implicitly)

While there’s no explicit column for date or time in the Titanic dataset, we can imagine a scenario where we’re tracking the number of survivors over time after the incident.

Python Code: Working with Time-Series Data (Creating synthetic Time-Series data based on the Titanic dataset)

We’ll use the ‘Survived’ column to create a simple time-series data that represents the number of survivors per day after the incident. Please note that this is synthetic data and not derived from real data.

import numpy as np

# Let's assume that the survivors were rescued over a period of 10 days
time = pd.date_range("1912-04-15", periods=10, freq="D")

# Draw 10 random daily counts, each below the total number of survivors
# (the draws are independent, so they do not add up to that total)
np.random.seed(0) # for reproducibility
survivors = np.random.randint(0, titanic_df['Survived'].sum(), size=10)
survivors = np.sort(survivors)[::-1] # sort in descending order

# Create a time-series dataframe
ts_df = pd.DataFrame({"Date": time, "Survivors": survivors})

print(ts_df)

# Output
# Date Survivors
# 0 1912-04-15 338
# 1 1912-04-16 331
# 2 1912-04-17 221
# 3 1912-04-18 165
# 4 1912-04-19 73
# 5 1912-04-20 72
# 6 1912-04-21 56
# 7 1912-04-22 46
# 8 1912-04-23 31
# 9 1912-04-24 15

Each row is a made-up count of survivors rescued on that day. Because the ten counts are drawn independently, they do not add up to the actual number of survivors.

As we proceed, remember that this time-series data is purely synthetic and doesn’t reflect historical facts. However, it gives us a simple example of how time-series data could be represented. The main thing to note is that each data point is associated with a timestamp, which opens up possibilities for additional types of analysis, such as trend detection or forecasting.
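If you would like the daily counts to add up to a fixed total, one option is to split that total with a multinomial draw. This is a sketch under stated assumptions: `total_survivors` is set to 342 as an assumed value (with the real data you would use titanic_df['Survived'].sum()), every day is given the same probability, and the result is still synthetic.

```python
import numpy as np
import pandas as pd

total_survivors = 342  # assumed total for illustration
days = pd.date_range("1912-04-15", periods=10, freq="D")

# Split the total across 10 days with equal probability per day
rng = np.random.default_rng(0)
daily = rng.multinomial(total_survivors, np.full(10, 0.1))

# Index by date so pandas treats the values as a time series
ts = pd.Series(daily, index=days, name="Survivors")

print(ts.sum())              # equals total_survivors
print(ts.cumsum().iloc[-1])  # the running total ends at the same number
print(ts.rolling(3).mean())  # 3-day moving average; the first two entries are NaN
```

Putting the dates in the index is what unlocks time-aware operations such as rolling windows and cumulative sums.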

All Data Types in the Titanic Dataset

To see the data type of every column with pandas, read the DataFrame's dtypes attribute as below:

# Assuming that 'titanic_df' is your DataFrame

print(titanic_df.dtypes)

# Output
# PassengerId int64
# Survived int64
# Pclass int64
# Name object
# Sex object
# Age float64
# SibSp int64
# Parch int64
# Ticket object
# Fare float64
# Cabin object
# Embarked object
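Once you know the dtypes, you can split a DataFrame's columns by type. The following sketch uses a made-up `sample` DataFrame with a few Titanic column names.

```python
import pandas as pd

# Made-up rows with a few of the Titanic column names
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['Pclass', 'Age', 'Fare']
print(other_cols)    # ['Sex']
```

Notice that 'Pclass' lands among the numeric columns even though it is categorical in meaning, so the storage type is a hint and not the final word.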

In pandas, text columns such as 'Sex' and 'Embarked' are typically stored with the generic 'object' data type, which holds strings or a mix of other Python objects. Categorical data often ends up as 'object' because its labels are usually strings. However, pandas also offers a specialized 'category' data type for storing categorical data more efficiently and performing certain operations faster. It's important to note that not every 'object' column is categorical ('Name' is free text) and not every categorical column is 'object' ('Survived' and 'Pclass' are stored as int64). The interpretation depends on the context of the data.
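To see what the 'category' data type does, here is a minimal sketch on a made-up column with two repeated labels. The exact memory figures depend on your pandas version, so treat the comparison as indicative only.

```python
import pandas as pd

# Made-up column with two repeated labels
sex = pd.Series(["male", "female"] * 500)
sex_cat = sex.astype("category")

print(sex_cat.dtype)                      # category
print(list(sex_cat.cat.categories))       # ['female', 'male']
print(sex_cat.cat.codes.head().tolist())  # integer codes behind the labels

# Compare memory use of the two representations (in bytes)
print(sex.memory_usage(deep=True), sex_cat.memory_usage(deep=True))
```

The categorical version stores each distinct label once and keeps a small integer code per row, which is where the savings come from.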

5. Conclusion

We’ve embarked on an exciting journey into the world of data types, casting Python spells along the way. We used the story of the Titanic to dive deep into categorical, numerical, and even synthetic time-series data. We learned how to manipulate and interpret these types of data using Python and the power of the pandas library.

Understanding the nature of data is a crucial first step in any data analysis or machine learning project. As we’ve seen, different types of data require different handling and offer different insights. Numerical data can be summarized with statistics, categorical data can be counted and compared across categories, and time-series data can reveal trends over time.

Remember, the aim here is not to memorize every detail but to get a sense of the data’s language. The more you interact with various datasets, the more you’ll appreciate these subtleties and the more fluent you’ll become.

In our upcoming blogs, we will dive into the art of visualizing these different types of data, a crucial skill for any data explorer. We will also start exploring the various ways we can transform and prepare our data for modeling.

For now, keep practicing, stay curious, and never stop exploring. The world of data is vast and full of interesting insights waiting to be discovered. Happy coding!

Persian article (What is data science and what is it used for?)

Persian article (Where does getting into data science begin?)
