Basics of Pandas — Part 3

Aarish Alam
Published in Analytics Vidhya · 3 min read · Dec 30, 2020

In my previous articles I addressed some of the common queries a beginner faces while working with various datasets. This article is a continuation of that series.

I’ll continue demonstrating further concepts using the same dataset (UFO) used in the first and second parts of this series.

How do I change Categorical Features to Numerical Features?

Categorical features need to be converted to numerical ones before they can be fed into most models. Although scikit-learn’s LabelEncoder is convenient, pandas provides its own method for converting categorical features to numerical ones: get_dummies.

pd.get_dummies(ufo,columns=['City'])
Before applying the get_dummies function
After applying the get_dummies function
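Since the UFO dataset itself isn’t reproduced here, a minimal sketch with a toy stand-in frame shows what get_dummies does: each distinct value of the chosen column becomes its own indicator column, and the original column is dropped.

```python
import pandas as pd

# Toy frame standing in for the UFO data (the real dataset is not shown here)
ufo = pd.DataFrame({'City': ['Ithaca', 'Abilene', 'Ithaca'],
                    'Shape': ['TRIANGLE', 'DISK', 'OVAL']})

# One indicator column per distinct city; the original 'City' column is dropped
dummies = pd.get_dummies(ufo, columns=['City'])
print(dummies.columns.tolist())
# ['Shape', 'City_Abilene', 'City_Ithaca']
```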

How do I apply a function to a pandas Series or DataFrame?

This can be achieved using three methods:

  • applymap - apply a function to every element of a DataFrame
  • apply - apply a function to each element of a Series (on a DataFrame, it applies along an axis)
  • map - map the existing values of a Series to a different set of values

Let’s separate out the year from the given time format in the DataFrame

ufo['Time']=ufo['Time'].apply(lambda x:x.split('/')[2])
#splits string using '/' as a separator
ufo['Time']=ufo['Time'].apply(lambda x:x.split(' ')[0])
#splits string using ' ' as a separator
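The two apply calls above can be run end to end on a toy 'Time' column; the M/D/YYYY HH:MM format is assumed from the article’s split logic, since the real dataset isn’t shown here.

```python
import pandas as pd

# Toy 'Time' column in the assumed M/D/YYYY HH:MM format of the UFO dataset
ufo = pd.DataFrame({'Time': ['6/1/1930 22:00', '4/18/2004 14:30']})

# First split keeps the third '/'-separated piece, e.g. '1930 22:00'
ufo['Time'] = ufo['Time'].apply(lambda x: x.split('/')[2])
# Second split keeps the part before the space, leaving just the year
ufo['Time'] = ufo['Time'].apply(lambda x: x.split(' ')[0])
print(ufo['Time'].tolist())  # ['1930', '2004']
```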

For the sake of demonstrating the map and applymap methods, I have created a new column named ‘New’ (lack of creativity) containing 0s and 1s.

Modified UFO Dataset
ufo['Valid']=ufo.New.map({0:'No',1:'Yes'})
After applying mapping
ufo.loc[:,'Time':'New'].applymap(float)
#applymap is only valid for a DataFrame, not a Series object
After applying applymap method
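Both calls can be sketched on a small stand-in frame. Note that map returns NaN for values missing from the dict, and that recent pandas versions (2.1+) deprecate applymap in favour of DataFrame.map, though applymap still works.

```python
import pandas as pd

# Toy frame with a numeric 0/1 column like the article's 'New' column
ufo = pd.DataFrame({'Time': ['1930', '2004'], 'New': [0, 1]})

# map: translate each value of a Series via a dict (unmatched values become NaN)
ufo['Valid'] = ufo.New.map({0: 'No', 1: 'Yes'})
print(ufo['Valid'].tolist())  # ['No', 'Yes']

# applymap: apply a function to every element of a DataFrame slice
floats = ufo.loc[:, 'Time':'New'].applymap(float)
# Both columns now hold float64 values
```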

How do I find and remove duplicate rows in pandas?

You can find duplicate rows by calling the .duplicated() method on the whole DataFrame. You can also check for repeated values in a single column by calling the same method on a Series object.

Logic for duplicated:

  • keep='first' (default): Mark duplicates as True except for the first occurrence.
  • keep='last': Mark duplicates as True except for the last occurrence.
  • keep=False: Mark all duplicates as True.
ufo.duplicated().sum()
#counts the number of rows that are identical to an earlier row
ufo.drop_duplicates(keep='first',inplace=True)
#dropping duplicate entries keeping the very first of each
Duplicated Entries
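The keep logic and the drop can be verified on a toy frame with one repeated row; the counts below follow directly from the three keep options listed above.

```python
import pandas as pd

# Toy frame in which row 1 duplicates row 0
ufo = pd.DataFrame({'City': ['Ithaca', 'Ithaca', 'Abilene'],
                    'Shape': ['DISK', 'DISK', 'OVAL']})

print(ufo.duplicated().sum())             # 1 -> row 1 flagged (keep='first' is the default)
print(ufo.duplicated(keep='last').sum())  # 1 -> row 0 flagged instead
print(ufo.duplicated(keep=False).sum())   # 2 -> every copy flagged

# Drop the repeats, keeping the first occurrence of each row
ufo.drop_duplicates(keep='first', inplace=True)
print(len(ufo))  # 2
```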

Certainly there remain many more techniques that one eventually discovers while playing with a dataset, but this series of articles highlights some, if not all, of the queries I faced while working with datasets. I hope you enjoyed reading my articles.

This marks the end of the series “Basics of Pandas”. I hope you enjoyed reading it. Check out the other two articles in this series here:

Basics of Pandas — Part 1

Basics of Pandas — Part 2

Thanks 😉
