May I take a “Filtering Data”?

Ahmet Talha Bektaş
6 min readOct 15, 2022

--

Of course, you can take “Filtering Data”, madam/sir. I am preparing a “Filtering Data” in 5 minutes.

Photo by Adam Jaime on Unsplash

Filtering Data is one of the most important topics for Data Scientists. Countless amounts of data are generated every day. On the other hand, not every piece of data is important for your project.

At this point, You should filter your data!

In this article, I will explain:

You can find the notebook which I used for this article on my GitHub.

Let’s start!

Firstly, let’s read our data! If you don’t know how to read data check out my article about reading data.

Importing Library

import pandas as pd

Reading data

df=pd.read_csv("titanic.csv")
df.head()

Output:

Selecting Specific Columns

df[“column_name”]

If you wanna look at just one column, you can use df[“column_name”]

Let’s make an example!

df["Sex"]

Output:

Or we can use df.column_name; however, the column name shouldn’t include any space.

df.Sex

Output:

You can use all EDA functions with these functions.

Let’s make an example!

df["Sex"].head()

Output:

df.Sex.tail()

Output:

df[[“column_name1”, “column_name2”, “column_name3”]]

If you want to look at more than one column, you can use df[[“column_name1”, “column_name2”, “column_name3”]]. Be careful, there are 2 brackets.

df[["Survived","Sex","Age"]]

Output:

df[["Survived","Sex","Age"]].head()

Output:

Selecting specific rows and columns

df.loc[x:y,[“column_name1”,“column_name2”]]

If you want to see specific rows and specific columns, you can use df.loc[x:y,[“column_name1”, “column_name2”]] where “x” are “y” are integers.

Basically, we are saying “Give me rows from x to y (it is not including y ) and give me columns which are column_name1 and column_name2

df.loc[0:10,["Age","Sex"]]

Output:

Other Option

df.iloc[x:y,z:w]

df.iloc is very similar to df.loc. A small difference is that we write indexes in df.iloc; that’s why there is the letter “i” in df.iloc.

For looking for specific rows and columns, you can also use df.iloc[x:y,z:w] where “x”, “y”, “z” and “w” are integers.

df.iloc[0:15,5:7]

Output:

Sorting values of a column

df[“”].sort_values()

If you want to sort values in data from bigger to smaller or smaller to bigger, you can use df[“”].sort_values().

How do we use .sort_values() ?

First, you should write what you want to sort by.

For instance;

df.sort_values("Age").head()

Or

df.sort_values(by="Age").head()

Output:

As you can see they sorted from smaller to bigger.

For sorting from bigger to smaller, we use “ascending=False”.

Let’s make an example!

df.sort_values(by="Age",ascending=False).head(10)

Output:

And now, it is sorted from bigger to smaller.

Counting values in column

df[“”].value_counts()

If you want to see the total counts of values for all rows, you can use df[“”].value_counts().

For example;

df["Sex"].value_counts()

Output:

df["Embarked"].value_counts()

Output:

Finding unique values in column

df[“”].unique()

If you want to see just the unique values which are in the column, you can use df[“”].unique().

An example;

df["Pclass"].unique()

Output:

We understand that the Pclass column includes 3 values which are 1, 2, and 3.

Finding the number of unique values in the column

df[“”].nunique()

If you want to see just the number of unique values, you can use df[“”].nunique().

nunique= number of unique

For instance;

df["PassengerId"].nunique()

Output:

We understand that We have 891 different PassengerId so that we have 891 passengers.

Comparing two different columns

df.groupby(‘’)[‘’].mean()

For comparing different columns, you can use df.groupby(‘’)[‘’].mean(). However, you can also use statistical methods instead of mean().Just like , sum(), std(), min(), max(), median() etc.

Examples

df.groupby('Sex')['Survived'].mean()

Output:

We can say that 74.2% of females survived and 18.89% of males survived.

df.groupby('Sex')['Survived'].sum()

Output:

We can say that 233 females survived and 109 males survived.

df.groupby('Sex')['Survived'].std()

Output:

Std = Standard deviation

df.groupby('Sex')['Fare'].min()

Output:

We see that the minimum fare for females is 6.75 and the minimum fare for males is 0.00.

df.groupby('Sex')['Fare'].max()

Output:

We see that the maximum fare for females is 512.3292 and the maximum fare for males is 512.3292.

df.groupby('Sex')['Age'].median()

Output:

We see that according to Age, the median of females is 27.0 and the median of males is 29.0.

Or you can compare a column with more than one column

df.groupby('Sex')[['Survived','Age']].mean()

Output:

Filtering Data by Values

For example, I want to see rows in which their Sex is male;

df[df.Sex=='male'].tail(10)

Output:

Or let’s look at females who are bigger than 18 ages;

df[(df.Sex=='female')&(df.Age>=18)].head()

Output:

Or the mean of males who is bigger than 18 and survived ;

df.Survived[(df.Sex=='male')&(df.Age>=18)].mean()

Output:

Conclusion!

You have learned how to filter data

Congratulations! 🎉🎈🎁

Now you should learn how is data visualized.

My next story is How can you make up your data? (Part-1)” !

Author :

Ahmet Talha Bektaş

If you want to ask anything to me, you can easily contact me!

📧My email

🔗My LinkedIn

💻My GitHub

👨‍💻My Kaggle

📋 My Medium

--

--