Essential Libraries To Have In Your Toolbox For Data Science And ML — Series #2 — Pandas

Kaan Ceylan
8 min read · Mar 13, 2022


In the first article of this series, we took a look at NumPy and learned how it can help us perform calculations on our data and manipulate it according to our needs. In case you missed that one, you can find it here. In the second post of the series, we will learn about the Pandas library and the ways it lets us work with the ndarrays we created using NumPy.

Pandas is one of the most popular Python libraries, especially among data scientists and ML engineers, and 9 times out of 10 (unless you're in the R camp :)) it is the go-to tool for data analysis, cleaning and preparation before the data gets fed to a model.

You can find the source code for the functions that I talk about throughout the post and more in the notebook here. Feel free to copy and run the code yourself alongside the article, mess around and try different values/filters etc. (I actually encourage it because it’ll 100% lead to better understanding.) So let’s go and take a look!

Pandas is a very powerful data manipulation and analysis library built on top of NumPy and we use it to gain insight about the data, clean it and process it so that we can improve the performance of our model and get better results. All of those steps can be summed up under the title of Exploratory Data Analysis or EDA.

Remember, “Garbage In = Garbage Out” .

EDA is crucial if you want to significantly cut down the time you spend training your model and tuning its parameters, so it's really important that you do a thorough job with your EDA and prepare your data well. You might have to do some visualization and decide how you're going to handle outliers. Look at the descriptive statistics of your dataset and check how each feature is distributed: is there any skewness, how many missing rows does each feature have, and should you impute them or discard that feature altogether? But in order to do all of that, we first need to learn some basic Pandas functions so that we can move our data in and out.

Reading And Writing The Data

Let's quickly go over how to get our data into Pandas and how to save our results to one of the many file formats that Pandas supports. I'll highlight the important parts so that you can skip ahead easily if you want.

The Pandas IO API has a common naming pattern for its read and write functions, and it supports a bunch of common file formats and data sources such as CSV, XML, Excel, JSON, pickle, Parquet, SQL and more.

The read functions follow the naming pattern of "read" followed by an underscore and the format name (i.e. "read_csv", "read_sql" etc.).

The most basic parameter of the read functions is the file path, which takes either the path of a local file you need to read in or a URL, like the URL of an S3 bucket that's storing your target data. You can also manually specify the delimiter used in your file with the "sep" argument (or its alias "delimiter"); if you pass sep=None, Pandas will try to sniff the correct delimiter automatically using its Python parsing engine. You can use the "header" parameter to specify which row is storing your feature names so that they can be separated from the rest of the data. If you do not pass a value, the header row will be inferred by Pandas.

You can specify the data type of each column using the dtype parameter by passing the values in the form of a dictionary such as {"feature_1": np.float64, "feature_2": np.int32}.
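Putting those parameters together, here is a minimal sketch (the file name and the feature names are placeholders, mirroring the dictionary above):

```python
import numpy as np
import pandas as pd

# "data.csv" is a placeholder path; a URL would work here too.
df = pd.read_csv(
    "data.csv",
    sep=",",     # delimiter used in the file
    header=0,    # row 0 holds the feature names
    dtype={"feature_1": np.float64, "feature_2": np.int32},
)
```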

There is a lot more you can do, like deciding what should happen when a row is not structured the way it should be, but that's too detailed to cover in this post. The basics are pretty much what you're going to need 9 times out of 10.

The write functions are structured the same way: "to" followed by an underscore and the format name, just as with the read functions. You pass the path of the file to be saved as a string first, and then the write functions accept the same kinds of basic parameters, such as the separator value.
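And the matching write call might look like this (again, the file name is just a placeholder):

```python
# Save the dataframe back to disk; index=False leaves out the row index.
df.to_csv("cleaned_data.csv", sep=",", index=False)
```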

Functions For EDA

Like I said, exploratory data analysis is almost as important as the modelling itself, if not more. It helps you take a good look through the data, see the feature distributions and any outliers, determine whether the data is clean enough to be used, whether it's dense or sparse, whether the features are correlated and so much more. Overall, it helps you gain an understanding of the kind of treatment your data needs and lets you pick out possible models/algorithms that you can use moving forward. Let's go over these functions with the help of the "A Waiter's Tips" dataset from Kaggle.

  • You can use shape to see the row and column count of your dataset.
  • The columns attribute will let you see the column names, i.e. your features.
  • The head() function returns the first 5 rows of your dataset by default, but you can pass in any number of rows to be returned. It lets you examine a few rows of the dataset and take a quick look at your features and data types, although there are more useful functions for that. You can use head to sanity-check whether the datatype in each column looks correct.
  • A better function to use for this purpose is info(). It returns the column names, the datatype of each column and the count of non-null values, so you can see how dense or sparse your dataset is: the higher the non-null counts, the denser the dataset. We have a total of 244 entries, so we can see here that there are no missing values in this dataset.
  • The describe() function returns some basic descriptive statistics about your data, such as the mean and standard deviation of each numerical column and the min, 50% and max values. You can also get the number of unique values and the most frequent value for each feature. If you take a look at the mean and standard deviation of the size feature, it's easy to see that most of the size values in the dataset are between 2 and 4; the 75% value being 3 supports this deduction. You can see that confirmed in the notebook, and in the short sketch after this list.
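Here is a minimal sketch of these functions in action, assuming the tips dataset has been downloaded from Kaggle and saved as tips.csv (a hypothetical file name):

```python
import pandas as pd

df = pd.read_csv("tips.csv")  # hypothetical local copy of the Kaggle dataset

print(df.shape)        # (244, 7): row and column counts
print(df.columns)      # the feature names
print(df.head())       # first 5 rows by default; head(10) returns 10
df.info()              # dtypes and non-null counts per column
print(df.describe())   # count, mean, std, min, quartiles and max
```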

A key part that will give significant insight into how you should process your data moving forward is to determine whether your features are nominal or ordinal, numerical or categorical etc.

Ordinal data has a natural order or rank between different values, whereas nominal data does not. Think of the levels of higher education: Bachelor's, Master's and Doctoral degrees have a natural ranking between them. Numerical (or quantitative) data is the type of data that can be expressed in numbers or quantified, like height, weight etc. Categorical (or qualitative) data is data that is labeled and can be separated into categories/groups. Knowing the type of your data will allow you to decide how to interpret and process it.
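As a small illustration of how this distinction shows up in Pandas, here is a hedged sketch that marks an education column (a made-up example, not part of the tips dataset) as an ordered categorical:

```python
import pandas as pd

# Hypothetical example column, not part of the tips dataset.
education = pd.Series(["Master's", "Bachelor's", "Doctoral", "Bachelor's"])

# An ordered categorical encodes the natural ranking of ordinal data.
levels = pd.CategoricalDtype(
    categories=["Bachelor's", "Master's", "Doctoral"], ordered=True
)
education = education.astype(levels)

# The order makes comparisons meaningful.
print(education > "Bachelor's")  # True where the level outranks Bachelor's
```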

Slicing, Filtering And Manipulating Data

Slicing

The two most common ways of selecting data in Pandas are loc and iloc. They have their differences, but sometimes they can be used interchangeably. You can use loc and iloc to get a specific part of the dataframe and assign it to a separate variable if you want to make calculations on that part without affecting your original dataframe. loc accepts label-based arguments or boolean arrays, whereas iloc works with integer positions (plus plain boolean arrays, as we'll see in the filtering section).

Let’s take the first 10 rows of the ‘sex’ and ‘smoker’ columns. The loc function includes both the start and stop indexes in the results, so :9 will give us the first 10 rows starting from the index 0.
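In code, that selection might look like this (a sketch, assuming the tips dataframe is named df as before):

```python
# loc is label-based and includes the stop label: rows 0 through 9.
subset = df.loc[:9, ["sex", "smoker"]]
print(subset)
```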

To get the same results with iloc, we will need to pass both the row and column indexes as integers, the first slice selecting rows and the second selecting columns. Unlike loc, iloc does not include the stop index, so if you want to get the first 10 rows you will have to pass :10 as the row slice.
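Assuming 'sex' and 'smoker' sit at column positions 2 and 3, as they do in the standard tips dataset:

```python
# iloc is position-based and excludes the stop index on both axes.
subset = df.iloc[:10, 2:4]  # rows 0-9, columns 2 and 3
print(subset)
```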

Filtering

Let's say we want to filter for the rows where the total bill is greater than 24, which is the top 25% of the column values. We pass the condition we want to loc, which gives us a boolean series, and that series masks the rows of our main dataframe. It's not that complicated once you take a look at the code.
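Something like the following sketch, with df as the tips dataframe:

```python
# Each comparison yields a boolean Series that masks the rows.
high_bills = df.loc[df["total_bill"] > 24]
high_bills = df[df["total_bill"] > 24]
```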

Both lines return the same result but the first one is more readable in my opinion.

We can even go one step further and get the tip amount for every row that has a total bill of over 24 and has a group size of 6.
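A sketch of that compound filter; note that each condition needs its own parentheses and the conditions are combined with the & operator:

```python
# Tip amounts for parties of 6 whose total bill exceeds 24.
tips_of_six = df.loc[(df["total_bill"] > 24) & (df["size"] == 6), "tip"]
print(tips_of_six)
```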

When using the iloc function, we need to pass the boolean values resulting from the filtering as a plain list or array of booleans, rather than as a Series.
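For example, converting the boolean Series to a NumPy array first (a sketch):

```python
# iloc won't accept a boolean Series directly, so convert it to an array.
mask = (df["total_bill"] > 24).to_numpy()
print(df.iloc[mask])
```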

Advanced Pandas Functions

There are hundreds of functions in Pandas; I've talked about some of the most basic and popular ones. There are more advanced functions that let you perform SQL-like operations, such as calculating rolling window functions and rolling sums, which are really useful when working with time series data, and you can merge/join your window results back onto your original dataframe. But these are outside the scope of this blog post.
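Just to give a flavor of what a rolling window looks like, here is a minimal sketch with made-up numbers (not the tips data):

```python
import pandas as pd

# Hypothetical daily sales figures, purely for illustration.
sales = pd.Series([5, 3, 8, 6, 2, 7, 4])

# Sum over a sliding 3-day window; the first two values are NaN
# because the window isn't full yet.
print(sales.rolling(window=3).sum())
```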

Wrapping Up

In the second post of the series, we’ve learned about exploratory data analysis and worked on the tips dataset to learn the basics of Pandas. Have a play around with the code in the notebook and try different functions with the help of the Pandas documentation and in the next post we will cover the other half of the data analysis process, data visualization!

If you’ve found this post helpful or spotted a mistake, I’d really appreciate any feedback. You can reach out to me on Twitter or leave a comment on Medium. Thank you for your time and I’ll see you on the next post!
