Python for Data Science: Pandas I

Sawsan Yusuf
14 min read · Aug 29, 2023


Photo by Bruce Hong on Unsplash

In the previous articles of this series, we covered Python basics, including Data Types, Expressions, and Data Structures. Then we applied Python programming logic using Branching, Loops, Functions, Objects, and Classes.

Now we’ll begin exploring two of the best Python packages for data manipulation and mathematical computation: pandas and NumPy. We’ll start with pandas. So, what is pandas?

1. Pandas

Pandas is a Python library for working with datasets. Its central structure is the DataFrame, a form of two-dimensional structured data, like a table with columns and rows. But before we can do anything else, we need data, and data often arrives as a CSV file.

2. Importing Data

2.1. CSV Files

CSV stands for Comma Separated Values, a file type that stores data as a table. Data presented in a table is referred to as structured data because the rows and columns stand in a meaningful relationship to one another. A CSV might also hold panel data: observations of the same subjects recorded at various points in time.

If you came across a CSV file and opened it to see what it looked like, you'd see something like this:

Name, Branch, Year, CGPA
Nikhil, COE, 2, 9.0
Sanchit, COE, 2, 9.1
Aditya, IT, 2, 9.3
Sagar, SE, 1, 9.5
Prateek, MCE, 3, 7.8
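
To preview where we’re headed, here is a minimal sketch of reading this sample with pandas. It uses io.StringIO so the snippet is self-contained; passing a file path to read_csv works exactly the same way.

import io
import pandas as pd

# The sample table above, as a CSV string
csv_text = """Name,Branch,Year,CGPA
Nikhil,COE,2,9.0
Sanchit,COE,2,9.1
Aditya,IT,2,9.3
Sagar,SE,1,9.5
Prateek,MCE,3,7.8
"""

# read_csv accepts a file path or any file-like object
students = pd.read_csv(io.StringIO(csv_text))
print(students.head())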

2.2. Dictionaries

We can create a DataFrame from a Python dictionary using the from_dict method.

import pandas as pd

data = {"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]}
pd.DataFrame.from_dict(data)

By default, the DataFrame is created with the dictionary keys as columns. Note that the list of values must have the same length for every key for the code to work. We can also let the keys be the index instead of the columns:

pd.DataFrame.from_dict(data, orient="index")

We can also specify column names:

pd.DataFrame.from_dict(data, orient="index", columns=["A", "B", "C", "D"])

2.3. JSON Files

JSON is short for JavaScript Object Notation. It is another widely used format for storing and transferring data; it is lightweight and very human-readable. In Python, we can use the json library to read JSON data. Here is an example of a JSON string.

info = """
{ "Name": "Nikhil",
"Branch": "COE",
"Year": 2,
"CGPA": 9 }
"""
print(info)
{ "Name": "Nikhil",
"Branch": "COE",
"Year": 2,
"CGPA": 9 }

We use the json library to load the JSON string into a Python dictionary:

import json
data = json.loads(info)
data
{'Name': 'Nikhil', 'Branch': 'COE', 'Year': 2, 'CGPA': 9}

We can load a JSON string or file into a dictionary since they are both structured in the same way: key-value pairs.

data["Name"]

>>> 'Nikhil'
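
json.loads parses a string; its sibling json.load reads straight from a file. A short sketch, assuming the record above has been saved as student.json (a hypothetical filename):

import json

with open("student.json", "r") as f:  # hypothetical file holding the JSON above
    data = json.load(f)

print(data["Branch"])

>>> 'COE'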

A dictionary may not be as convenient as a DataFrame in terms of data manipulation and cleaning. But once we've turned our JSON string into a dictionary, we can transform it into a DataFrame using the from_dict method.

df = pd.DataFrame.from_dict(data, orient="index", columns=["subject"])
df

3. Loading Compressed Files

In the big data era, we’ll almost certainly need to read data from compressed files. Python’s gzip module is one way to decompress it.

To test this library, let's load the bankruptcy-data.json.gz file, which is a dataset about bankruptcy prediction of Polish companies, using the following code:

import gzip
import json

with gzip.open("bankruptcy-data.json.gz", "r") as f:
    data_gz = json.load(f)

data_gz is a dictionary, and we only need the "data" portion of it.

data_gz.keys()

>>> dict_keys(['schema', 'data', 'metadata'])

We can use the from_dict function from pandas to read the data:

df = pd.DataFrame.from_dict(data_gz["data"])
df.head()

3.1. Pickle Files

Pickle in Python is primarily used in serializing and deserializing a Python object structure. Serialization is the process of turning an object in memory into a stream of bytes so you can store it on disk or send it over a network. Deserialization is the reverse process: turning a stream of bytes back into an object in memory.

According to the pickle module documentation, the following types can be pickled:

  • None
  • Booleans
  • Integers, long integers, floating point numbers, complex numbers
  • Normal and Unicode strings
  • Tuples, lists, sets, and dictionaries containing only objects that can be pickled
  • Functions defined at the top level of a module
  • Built-in functions defined at the top level of a module
  • Classes that are defined at the top level of a module

Let’s demonstrate using a Python dictionary as an example:

import pickle

clothes = {"shirt": ["red", "M"], "sweater": ["yellow", "L"], "jacket": ["black", "L"]}

# Open the file in binary write mode; the with block closes it for us
with open("data/clothes.pkl", "wb") as f:
    pickle.dump(clothes, f)

Now in the data directory, there will be a file named clothes.pkl. We can read the pickled file using the following code:

with open("/data/clothes.pkl", "rb") as f:
unpickled = pickle.load(f)
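
A quick sanity check that the round trip preserved our dictionary:

print(unpickled == clothes)

>>> True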

4. Working with DataFrames

The first thing we need to do is import pandas; we’ll use pd as an alias when we include it in our code.

Pandas is just a library; to get anything done, we need a dataset too. We’ll use the read_csv() method to create a DataFrame from a CSV file.

Note: In these pandas articles, we’ll work with a dataset of properties for sale in Colombia, from the real estate website Properati.com.

import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()

5. Working with DataFrame Indices

A DataFrame stores data in a row-and-column format. The index is a special kind of column that helps identify the location of each row. The default Index uses integers starting at zero, but you can also set up customized indices like "name", "location", etc.

For example, in the following real estate dataset, the default index is the integer count. We can access it through the .index attribute:

df.index[:5]

>>> RangeIndex(start=0, stop=5, step=1)

We use the set_index method to change the index column. For example, we can set the department column as the index instead of the integer counts. Note that pandas allows duplicate values in an index, as we’re about to see with department, although a unique index makes label-based lookups unambiguous.

df.set_index("department", inplace=True)
df.head()

Now you can see the index column has changed:

df.index[:5]

>>> Index(['Bogotá D.C', 'Bogotá D.C', 'Quindío', 'Bogotá D.C', 'Atlántico'], dtype='object', name='department')
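
One payoff of a label index is row selection with .loc. A quick sketch: because department contains duplicates, .loc returns every matching row.

df.loc["Quindío"].head()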

Using the reset_index() function, we can reset the index back to default integer counts, and department will become a column again.

df.reset_index(inplace=True)
df.head()

6. Inspecting DataFrames

Once we’ve created a DataFrame, we need to inspect it to see what’s there. Pandas has many ways to inspect a DataFrame, but we’re only going to look at three of them: shape, info, and head.

If we’re interested in understanding the dimensionality of the DataFrame, we can use the df.shape attribute. The code looks like this:

df.shape

>>> (3066, 6)

The shape output tells us that the colombia-real-estate-1 DataFrame (which we called df) has 3066 rows and 6 columns.

If we’re trying to get a general idea of what the DataFrame contains, we can use the info method. The code looks like this:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066 entries, 0 to 3065
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   department     3066 non-null   object
 1   property_type  3066 non-null   object
 2   lat            2967 non-null   float64
 3   lon            2967 non-null   float64
 4   area_m2        3066 non-null   float64
 5   price_usd      3066 non-null   object
dtypes: float64(3), object(3)
memory usage: 143.8+ KB

The info output tells us all sorts of things about the DataFrame: the number of columns, the names of the columns, the data type for each column, and how many non-null rows are contained in the DataFrame.

If we want to see the rows themselves, we could use the print function. Keep in mind that print renders the entire dataset, even though pandas only displays the first and last few rows of a large DataFrame.

That’s not much of a problem with this particular dataset, but once you start working with much bigger datasets, printing the whole thing becomes slow and unwieldy.

Instead of doing that, we’ll look at the first five rows by using the head method. The code looks like this:

df.head()

By default, head returns the first five rows of data, but you can specify as many rows as you like.
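
For instance, a minimal sketch of asking for more rows, along with the complementary tail method:

df.head(10)  # first ten rows
df.tail()    # tail works the same way from the bottom: last five rows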

7. Sorting

Even though a DataFrame behaves like a dict in many ways, it is also ordered, so we can sort the data in it. Pandas provides two sorting methods: sort_values and sort_index.

We can sort the whole DataFrame by the values of a column.

df.sort_values("area_m2").head()

We can also sort the DataFrame by its index.

df.set_index("lat").sort_index().head()

8. Working with Columns

8.1. Adding Columns

Sometimes, it’s handy to duplicate a column of data, perhaps because we’d like to drop some data points or erase empty cells while still preserving the original column. Other times we want to derive a brand-new column from existing ones. Either way, we can do this by placing the name of the new column in square brackets and assigning values to it.

For example, we might want to add a column that shows the price per square meter of each house in US dollars. To do that, we create a new column called "price_m2", provide the formula to populate it, and inspect the first five rows of the dataset to make sure the new column includes the new values:

df["price_m2"] = df["price_usd"] / df["area_m2"]
df.head()

8.2. Dropping Columns

Just like we can add columns, we can also take them away. To do this, we’ll use the drop method. If I wanted to drop the “department” column from colombia-real-estate-1, the code would look like this:

df = df.drop("department", axis="columns")
df.head()

8.3. Dropping Rows

Including rows with empty cells can radically skew the results of our analysis, so we often drop them from the dataset. We can do this with the dropna method. If we wanted to do this with df, the code would look like this:

print("df shape before dropping rows", df.shape)
df.dropna(inplace=True)
print("df shape after dropping rows", df.shape)
df.head()

By default, pandas will keep the original DataFrame and will create a copy that reflects the changes we just made. That’s perfectly fine, but if we want to make sure that copies of the DataFrame aren’t clogging up the memory on our computers, then we need to intervene with the inplace argument. inplace=True means that we want the original DataFrame updated without making a copy. If we don't include inplace=True (or if we do include inplace=False), then pandas will revert to the default.
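
To make the contrast concrete, here is a minimal sketch of the two styles:

df_clean = df.dropna()   # returns a new DataFrame; df itself is unchanged
df.dropna(inplace=True)  # modifies df in place and returns None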

8.4. Splitting Strings

It might be useful to split strings into their constituent parts and create new columns to contain them. To do this, we’ll use the .str.split method, passing the character where we want the data to split apart.

First, we need to import the CSV and inspect the first five rows using the head method, like this:

df2 = pd.read_csv("data/colombia-real-estate-2.csv")
df2.head()

In the colombia-real-estate-2 dataset, we might be interested in breaking the "lat-lon" column into a "lat" column and a "lon" column. We’ll split it at "," with code that looks like this:

df2[["lat", "lon"]] = df2["lat-lon"].str.split(",", expand=True)

Here, expand=True tells str.split to return the pieces as separate columns rather than as a single column of lists, which lets us assign them to the new "lat" and "lon" columns without dropping any of the ones that already exist.
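
One caveat: the pieces that come out of str.split are still strings. A short sketch of converting them to numbers (recasting is covered next, in section 8.5):

# The split columns have object dtype; recast them to floats for numeric work
df2[["lat", "lon"]] = df2[["lat", "lon"]].astype(float)
df2.dtypes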

8.5. Recasting Data

Depending on who formatted our dataset, the types of data assigned to each column might need to be changed. If, for example, a column containing only numbers had been mistaken for a column containing only strings, we’d need to change that through a process called recasting. Using the colombia-real-estate-1 dataset, we could recast the entire dataset as strings by using the astype method, like this:

print(df.info())
newdf = df.astype("str")
print(newdf.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2967 entries, 0 to 3065
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   property_type  2967 non-null   object
 1   lat            2967 non-null   float64
 2   lon            2967 non-null   float64
 3   area_m2        2967 non-null   float64
 4   price_usd      2967 non-null   float64
 5   price_m2       2967 non-null   float64
dtypes: float64(5), object(1)
memory usage: 162.3+ KB

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2967 entries, 0 to 3065
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   property_type  2967 non-null   object
 1   lat            2967 non-null   object
 2   lon            2967 non-null   object
 3   area_m2        2967 non-null   object
 4   price_usd      2967 non-null   object
 5   price_m2       2967 non-null   object
dtypes: object(6)
memory usage: 162.3+ KB

This is a useful approach, but more often than not, we’ll only want to recast individual columns. In the colombia-real-estate-1 dataset, the "area_m2" column is cast as float64. Let's change it to int. We’ll still use the astype method, but this time we'll apply it to a single column. The code looks like this:

df["area_m2"] = df.area_m2.astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2967 entries, 0 to 3065
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   property_type  2967 non-null   object
 1   lat            2967 non-null   float64
 2   lon            2967 non-null   float64
 3   area_m2        2967 non-null   int64
 4   price_usd      2967 non-null   float64
 5   price_m2       2967 non-null   float64
dtypes: float64(4), int64(1), object(1)
memory usage: 162.3+ KB

8.6. Accessing a Substring in a Series

To access a substring from a Series, we use the .str attribute from the Series. Then, we index each string in the Series by providing the start:stop:step. Keep in mind that the start position is inclusive and the stop position is exclusive, meaning the value at the start index is included but the value at the stop index is not included.

Also, Python is a 0-indexed language, so the first element in the substring is at index position 0. For example, using the colombia-real-estate-1 dataset, we could access the characters at index positions 0, 2, and 4 of each value in the department column:

df["department"].str[0:5:2]

0 Bgt
1 Bgt
2 Qid
3 Bgt
4 Aln
...
3061 Bgt
3062 Blv
3063 Cni
3064 Bgt
3065 Bgt
Name: department, Length: 3066, dtype: object

8.7. Replacing String Characters

Another change we might want to make is replacing the characters in a string. To do this, we’ll use the str.replace method, being sure to specify which string should be replaced and what new string should replace it.

For example, if we wanted to replace the string “house” with the string “single_family” in the colombia-real-estate-1 dataset, the code would look like this:

df["property_type"] = df["property_type"].str.replace("house", "single_family")
df.head()

Note that the old value needs to come before the new value inside the parentheses of str.replace.

8.7.1. Renaming a Series

Another change we might want to make is to rename a Series in pandas. To do this, we’ll use the rename method, passing a mapping from old column names to new ones.

For example, if we wanted to rename the property_type column to type_property in the colombia-real-estate-1 dataset, the code would look like this:

df.rename(columns={"property_type": "type_property"})

Note that rename returns a new DataFrame rather than modifying df; to keep the change, assign the result back to df or pass inplace=True. Here we leave df unchanged, so the later examples can keep using property_type.

8.7.2. Determining the Unique Values in a Column

We might be interested in the unique values in a Series using pandas. To do this, we’ll use the unique method. For example, if we wanted to identify the unique values in the column property_type in the colombia-real-estate-1 dataset, the code would look like this:

df["property_type"].unique()

>>> array(['single_family', 'apartment'], dtype=object)

8.8. Replacing Column Values

If we want to replace a column’s values, we can use the replace method. But first, we need to import the second CSV of our dataset and inspect the first five rows using the head method:

df = pd.read_csv("data/colombia-real-estate-2.csv")
df.head()

Then, we can replace a specific value with another:

df["area_m2"].replace(235.0, 0).head()

0 0.0
1 130.0
2 137.0
3 346.0
4 175.0
Name: area_m2, dtype: float64

If we want to replace multiple values at the same time, we can define a dictionary ahead of time, where the keys are the original values and the values are their replacements. Then we pass the dictionary to the replace method.

replace_value = {235: 0, 130: 1, 137: 2}

df["area_m2"].replace(replace_value).head()

0 0.0
1 1.0
2 2.0
3 346.0
4 175.0
Name: area_m2, dtype: float64

Or we can apply an operation to a whole column. In the following example, we change the price_cop unit to millions of pesos.

df["price_cop"] = df["price_cop"] / 1e6
df.head()

9. Concatenating

When we concatenate data, we’re combining two or more separate sets of data into a single large dataset.

9.1. Concatenating DataFrames

If we want to combine two DataFrames, we need to import Pandas and read in our data.

df1 = pd.read_csv("data/colombia-real-estate-1.csv")
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
print("df1 shape:", df1.shape)
print("df2 shape:", df2.shape)

df1 shape: (3066, 6)
df2 shape: (3066, 6)

Next, we’ll use the concat method to put our DataFrames together, passing the DataFrames as a list.

concat_df = pd.concat([df1, df2])
print("concat_df shape:", concat_df.shape)
concat_df.head()
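
One caveat worth knowing: by default, concat keeps each input’s original index, so row labels repeat in the combined DataFrame. Passing ignore_index=True builds a fresh integer index instead. A minimal sketch:

# Without ignore_index, labels 0 to 3065 appear twice in concat_df
concat_df = pd.concat([df1, df2], ignore_index=True)
concat_df.index[-3:]  # a fresh RangeIndex running up to 6131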

9.2. Concatenating Series

We can also concatenate Series using a similar set of commands. First, let’s take one Series each from df1 and df2:

df1 = pd.read_csv("data/colombia-real-estate-1.csv")
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
sr1 = df1["property_type"]
sr2 = df2["property_type"]
print("len sr1:", len(sr1)),
print(sr1.head())
print()
print("len sr2:", len(sr2)),
print(sr2.head())
len sr1: 3066
0 house
1 house
2 house
3 house
4 house
Name: property_type, dtype: object

len sr2: 3066
0 house
1 house
2 house
3 house
4 house
Name: property_type, dtype: object

Now that we have two Series, let’s put them together:

concat_sr = pd.concat([sr1, sr2])
print("len concat_sr:", len(concat_sr))
print(concat_sr.head())
len concat_sr: 6132
0 house
1 house
2 house
3 house
4 house
Name: property_type, dtype: object

10. Saving a DataFrame as a CSV

Once we’ve cleaned all our data and gotten the DataFrame to show everything we want it to show, it’s time to save the DataFrame as a new CSV file using the to_csv method.

For example, maybe we’re only interested in the first five rows of the colombia-real-estate-1 dataset, so let’s save them as their own new CSV file. Note that we set the index argument to False so that the DataFrame index isn’t included in the CSV file.

df = df.head()
df.to_csv("data/small-df.csv", index=False)
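
As a quick check, we can read the new file back in:

pd.read_csv("data/small-df.csv").head()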

Conclusion

In this article, we examined pandas, a crucial Python library for data science. In the following article, we’ll continue our discussion of pandas in greater detail.

Until then, stay well 😊
