Pandas: First Step Towards Data Science (Part 2)

Published in

Python in Plain English

4 min read · Sep 5, 2020

If you haven't read the first article, it is advisable to go through it before continuing with this one. You can find that article here. So far, we have learned how to access data in different ways. Now we will learn how to analyze data to understand it better, and then how to manipulate it.

To give an overview, in this article we are going to learn:

How to summarize data?
How to manipulate data?

Summarizing Data

We have been using different methods to view data, which is helpful when we want to look at specific rows or columns. However, pandas provides simpler methods to get a quick overview of the data.

If we want to see a few data items to understand what kind of data is present in the dataset, pandas provides methods like head() and tail(). head() returns the first few rows (five by default), and tail(), as you might have guessed, returns rows from the bottom of the dataset. You can also specify how many rows you want to display with head(n) or tail(n).

>> print(titanic_data.head())
output :
   PassengerId  Survived  Pclass  .......
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3
[5 rows x 12 columns]

>> print(titanic_data.tail())
output :
        PassengerId   Survived  Pcl            Name  .........
886          887         0       2     Montvila, Rev. Juozas
887          888         1       1     Graham, Miss. Margaret Edith
888          889         0       3  Johnston, Miss. Catherine Hele..
889          890         1       1     Behr, Mr. Karl Howell
890          891         0       3     Dooley, Mr. Patrick
[5 rows x 12 columns]

>> print(titanic_data.tail(3))
output :
        PassengerId   Survived  Pcl            Name  .........
888          889         0       3  Johnston, Miss. Catherine Hele..
889          890         1       1     Behr, Mr. Karl Howell
890          891         0       3     Dooley, Mr. Patrick
[3 rows x 12 columns]
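If you do not have the Titanic file at hand, the same calls can be tried on a small stand-in. The following is a minimal sketch: the DataFrame `df` and its values are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset.
df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

first_five = df.head()          # no argument: the first 5 rows
last_three = df.tail(3)         # the last 3 rows
all_but_last_two = df.head(-2)  # negative n: everything except the last 2 rows

print(first_five)
print(last_three)
print(len(all_but_last_two))
```

Note that head() and tail() keep the original index labels, so the rows returned by tail(3) here are still labelled 7, 8 and 9.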

We can also display summary statistics for our dataset. The describe() method returns statistics for every numerical column. We can also get statistics for a specific column.

>> print(titanic_data.describe())
output :
       PassengerId    Survived      Pclass         Age    SibSp  ...
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

>> print(titanic_data.Fare.describe())
output :
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

Remember, by default describe() returns statistics only for numerical columns. It displays statistics like count (the number of non-missing data points in that column), the mean, the standard deviation, and so on. If you do not want to see all of these statistics, you can also call each of them individually.

>> print(titanic_data.Fare.mean())
output :
32.204208
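To make the numeric-only default and the individual statistics concrete, here is a small sketch on made-up data. The DataFrame `df` and its values are assumptions for illustration; `include="all"` is an option of describe() that also summarizes non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data; one Age value is missing on purpose.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.25, 8.0, 53.5],
})

numeric_summary = df.describe()            # numeric columns only: Age and Fare
full_summary = df.describe(include="all")  # also summarizes the Name column

# Individual statistics on a single column.
fare_mean = df["Fare"].mean()
fare_median = df["Fare"].median()
age_count = df["Age"].count()  # missing values are not counted

print(numeric_summary)
print(fare_mean, fare_median, age_count)
```

The missing Age value is why count can be lower than the number of rows, which is also what the Age column shows in the Titanic output above (714 instead of 891).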

Manipulating Data

map(): It is used to manipulate data in a Series. We call map() on a column of the dataset. map() takes a function as its parameter, and that function takes a single data point from the specified column as its parameter. map() iterates over all data points of the column and returns a new, updated Series.
apply(): It is used to manipulate data in a DataFrame. It behaves almost the same as map(), but it passes a Series (a row or a column) to the given function, which returns an updated Series. After iterating over all rows or columns, apply() returns a new DataFrame.

# Here we define a function which will be used as parameter to map()
>> def updateUsingMap(data_point):
    '''
    This function makes data more readable by changing
    Survived column values to Yes if 1
    and No if 0

    Parameters
    ----------
    data_point : int

    Returns
    -------
    data_point : string
    '''
    updated_data = ''
    if(data_point==0):
        updated_data = "No"
    else:
        updated_data = "Yes"
    return updated_data

>> print(titanic_data.Survived.map(updateUsingMap))
output :
0       No
1      Yes
2      Yes
3      Yes
4       No
  .....
Name: Survived, Length: 891, dtype: object

# Here we define a function which will be used as parameter to apply()
>> def updateUsingApply(row):
    '''
    This function makes data more readable by changing
    Survived column values to Yes if 1
    and No if 0

    Parameters
    ----------
    row : Series

    Returns
    -------
    row : Series
    '''
    if(row.Survived==0):
        row.Survived = "No"
    else:
        row.Survived = "Yes"
    return row

>> print(titanic_data.apply(updateUsingApply,axis = 'columns'))
output :
     PassengerId Survived  Pclass  .......
0              1       No       3
1              2      Yes       1
2              3      Yes       3
3              4      Yes       1
4              5       No       3
..           ...      ...     ...
[891 rows x 12 columns]
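The same idea can be run end to end without the Titanic file. This is a sketch under assumptions: the small DataFrame `df` and the helper names `to_yes_no` and `label_row` are made up for illustration. It also shows that map() accepts a dict as an alternative to a function, and it builds a fresh Series for each row instead of modifying the row that apply() passes in.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Ann", "Bob", "Cid"],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

def to_yes_no(value):
    """Turn 0 into "No" and anything else into "Yes"."""
    return "No" if value == 0 else "Yes"

# map() works element by element on one column (a Series).
survived_labels = df["Survived"].map(to_yes_no)

# map() also accepts a dict that maps old values to new ones.
survived_from_dict = df["Survived"].map({0: "No", 1: "Yes"})

def label_row(row):
    """Return a new row (Series) with a readable Survived value."""
    values = row.to_dict()
    values["Survived"] = to_yes_no(values["Survived"])
    return pd.Series(values)

# apply() with axis="columns" calls label_row once per row.
labelled = df.apply(label_row, axis="columns")

print(survived_labels)
print(labelled)
```

When only one column changes, map() on that column is usually the simpler choice; apply() over rows is more useful when the new value depends on several columns at once.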

One thing needs to be clear: these methods do not change the original data. They create a new Series or DataFrame. As you may have noticed, we passed another parameter, axis, to apply(). With axis='columns', the function is applied to each row. To apply the function to each column instead, we would pass axis='index'.
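A quick way to check both points is a tiny numeric example. This is a sketch on made-up data: the DataFrame `df` and the column name `a_doubled` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis="columns" (same as axis=1): the function receives one row at a time.
row_sums = df.apply(lambda row: row.sum(), axis="columns")

# axis="index" (same as axis=0, the default): one column at a time.
col_sums = df.apply(lambda col: col.sum(), axis="index")

# map() returns a new Series; df itself is left untouched.
doubled = df["a"].map(lambda x: x * 2)

# To keep a result, store it explicitly, here in a new DataFrame.
df_updated = df.assign(a_doubled=doubled)

print(row_sums.tolist())    # one value per row
print(col_sums.to_dict())   # one value per column
print(list(df.columns))     # the original still has only "a" and "b"
```

If you want the change to stick, assign the result back to a column or to a new variable, as done with `df_updated` here.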

I think that is enough for this article. Let this information sink in, and then we can move on to the next article to explore a few more pandas methods. Until then, keep practicing. Happy Coding! 😄

Written by InfiniX