Pandas: First Step Towards Data Science (Part 2)

InfiniX
Python in Plain English
4 min read · Sep 5, 2020

If you haven’t read the first article, it is best to go through it before continuing with this one. You can find that article here. So far we have learned how to access data in different ways. Now we will learn how to analyze data to understand it better, and then how to manipulate it.

As a quick overview, in this article we are going to learn:

  1. How to summarize data
  2. How to manipulate data

Summarizing Data

We have been using different methods to view data, which is helpful when we want to look at specific rows or columns. However, pandas provides simpler methods to get a quick summary of the data.

If we want to see a few rows to understand what kind of data is present in the dataset, pandas provides methods like head() and tail(). head() returns rows from the top of the dataset (the first five by default), and tail(), as you might have guessed, returns rows from the bottom. You can also specify how many rows you want to display with head(n) or tail(n).

>> print(titanic_data.head())
output :
PassengerId Survived Pclass .......
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
[5 rows x 12 columns]

>> print(titanic_data.tail())
output :
PassengerId Survived Pcl Name .........
886 887 0 2 Montvila, Rev. Juozas
887 888 1 1 Graham, Miss. Margaret Edith
888 889 0 3 Johnston, Miss. Catherine Hele..
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
[5 rows x 12 columns]

>> print(titanic_data.tail(3))
output :
PassengerId Survived Pcl Name .........
888 889 0 3 Johnston, Miss. Catherine Hele..
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
[3 rows x 12 columns]
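
If you do not have the Titanic file at hand, the same calls can be tried on a small frame built in code. The sketch below is only an illustration: the `sample` frame and its values are made up here as a stand-in for titanic_data and are not part of the original dataset walkthrough.

```python
import pandas as pd

# Hypothetical stand-in for titanic_data: ten illustrative rows.
sample = pd.DataFrame(
    {
        "PassengerId": range(1, 11),
        "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
        "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    }
)

first_five = sample.head()    # no argument: the first five rows
last_three = sample.tail(3)   # explicit n: the last three rows

print(first_five)
print(last_three)

# Both calls return new DataFrames that keep the original index labels.
print(list(last_three.index))
```

Note that tail(3) keeps the labels of the rows it returns instead of renumbering them from zero, which matches the 888, 889, 890 labels in the output above.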

We can also display summary statistics for our dataset. The describe() method returns statistics for every numeric column, and we can also call it on a specific column.

>> print(titanic_data.describe())
output :
PassengerId Survived Pclass Age SibSp ...
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000

>> print(titanic_data.Fare.describe())
output :
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64

Remember, by default describe() only returns statistics for numerical columns. It displays statistics like count (the number of non-missing data points in that column, which is why Age shows 714 rather than 891), the mean, the standard deviation and so on. If you do not want to see all of these statistics, you can also call each of them individually.

>> print(titanic_data.Fare.mean())
output :
32.204208
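
To see these points side by side, here is a minimal sketch on a tiny made-up frame. The `sample` frame and its values are assumptions for illustration only. It shows that describe() on the whole frame skips the text column, that describe() on a text column gives a different kind of summary, and that individual statistics can be requested one at a time.

```python
import pandas as pd

# Hypothetical mini dataset; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
        "Age": [22.0, 38.0, 26.0, 35.0, None],  # one missing value
        "Sex": ["male", "female", "female", "female", "male"],
    }
)

# On a DataFrame, describe() summarizes the numeric columns by default.
numeric_summary = sample.describe()
print(numeric_summary)

# On a text column, describe() reports count, unique, top and freq instead.
sex_summary = sample["Sex"].describe()
print(sex_summary)

# Individual statistics can be called directly on a column.
fare_mean = sample["Fare"].mean()
fare_median = sample["Fare"].median()
age_count = sample["Age"].count()  # counts only non-missing values
print(fare_mean, fare_median, age_count)
```

Because count() ignores missing values, the Age count here is one less than the number of rows, the same effect that makes Age show 714 in the Titanic summary.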

Manipulating Data

  1. map(): It is used to manipulate data in a Series. We call map() on a column of the dataset. map() takes a function as its parameter, and that function takes a single data point from the column as its parameter. map() iterates over all data points of the column and returns a new, updated Series.
  2. apply(): It is used to manipulate data in a DataFrame. It behaves almost the same as map(), but it passes a whole Series (a row or a column) to the given function, which returns an updated Series. After iterating over every Series, apply() returns a new DataFrame.

# Here we define a function which will be used as parameter to map()
>> def updateUsingMap(data_point):
    '''
    This function makes data more readable by changing
    Survived column values to Yes if 1
    and No if 0

    Parameters
    ----------
    data_point : int

    Returns
    -------
    updated_data : string
    '''
    updated_data = ''
    if data_point == 0:
        updated_data = "No"
    else:
        updated_data = "Yes"
    return updated_data

>> print(titanic_data.Survived.map(updateUsingMap))
output :
0 No
1 Yes
2 Yes
3 Yes
4 No
.....
Name: Survived, Length: 891, dtype: object

# Here we define a function which will be used as parameter to apply()
>> def updateUsingApply(row):
    '''
    This function makes data more readable by changing
    Survived column values to Yes if 1
    and No if 0

    Parameters
    ----------
    row : Series

    Returns
    -------
    row : Series
    '''
    if row.Survived == 0:
        row.Survived = "No"
    else:
        row.Survived = "Yes"
    return row

>> print(titanic_data.apply(updateUsingApply, axis = 'columns'))
output :
PassengerId Survived Pclass .......
0 1 No 3
1 2 Yes 1
2 3 Yes 3
3 4 Yes 1
4 5 No 3
.. ... ... ...
[891 rows x 12 columns]
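
For a version you can run without the Titanic file, the following is a rough sketch of the same idea on a tiny made-up frame. The `sample` frame, the `label_row` helper and the variable names are assumptions introduced for illustration. It also shows that map() accepts a dict in place of a function, which is often shorter for simple value replacements.

```python
import pandas as pd

# Hypothetical stand-in for titanic_data with invented rows.
sample = pd.DataFrame({"Survived": [0, 1, 1, 0], "Pclass": [3, 1, 3, 2]})

# map() with a function: called once per value in the Series.
labels_from_function = sample["Survived"].map(
    lambda value: "No" if value == 0 else "Yes"
)

# map() with a dict: each value is looked up and replaced.
# Values missing from the dict would become NaN.
labels_from_dict = sample["Survived"].map({0: "No", 1: "Yes"})


def label_row(row):
    """Return a new row with Survived replaced by Yes/No (hypothetical helper)."""
    # Work on an object-dtype copy so the row can hold text next to numbers.
    updated = row.astype(object)
    updated["Survived"] = "No" if row["Survived"] == 0 else "Yes"
    return updated


# apply() with axis="columns": label_row is called once per row.
labeled_frame = sample.apply(label_row, axis="columns")

print(labels_from_dict)
print(labeled_frame)
```

Here label_row builds a new row instead of editing the one it receives, which avoids relying on how pandas handles changes made to the row that apply() passes in.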

One thing needs to be clear here: these methods do not change the original data. They return a new Series or DataFrame. As you may have noticed, we passed another parameter to apply(), namely axis. With axis = 'columns', the function is applied across the columns, so it receives one row at a time, which is what we need to change data row by row. To have the function receive one column at a time instead, we would pass axis = 'index', which is the default.
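
Both points can be checked with a short sketch. The `sample` frame and the column name `SurvivedLabel` are made up for illustration. The first part shows that the original column stays untouched until you store the result yourself, and the second part shows what the function receives for each value of axis.

```python
import pandas as pd

# Hypothetical frame with invented values.
sample = pd.DataFrame(
    {"Survived": [0, 1, 1, 0], "Fare": [7.25, 71.28, 7.92, 13.0]}
)

# map() returns a new Series; sample itself is not modified.
mapped = sample["Survived"].map(lambda value: "No" if value == 0 else "Yes")
print(list(sample["Survived"]))  # still the original integers

# To keep the result, store it explicitly, for example in a new column.
result = sample.assign(SurvivedLabel=mapped)

# axis="index" (the default): the function receives one column at a time.
column_max = sample.apply(lambda column: column.max(), axis="index")

# axis="columns": the function receives one row at a time.
row_max = sample.apply(lambda row: row.max(), axis="columns")

print(column_max)
print(row_max)
```

column_max has one entry per column and row_max has one entry per row, which is a quick way to confirm which direction a given axis value works in.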

I think that is enough for this article. Let this information sink in, and then we can move on to the next article to explore a few more methods in Pandas. Till then, keep practicing. Happy Coding! 😄
