Getting Started With Pandas for Data Science

Jahid Hasan
Jul 21, 2019 · 26 min read


Pandas is one of my favorite libraries. It has a lot of advanced features, but before you can master those, you need to master the basics, and its data structures are the most fundamental piece of that knowledge.

There are two common data structures used by Pandas: Series and DataFrame. These data structures are built on top of Numpy arrays, which means they are very efficient. Let’s take a look at what these data structures look like.

# Import related library 
>>> import numpy as np
>>> import pandas as pd

Series

A Series is a one-dimensional array with a name and an index. Since it is an array, its elements have a data type: a Series can hold integers, floats, strings, Python objects, and so on.

Suppose there are 10 students in a school and the results of their Physics exam have been published today. We will randomly generate the 10 students' Physics marks with Python code. The exam is out of 25 marks.

>>> import random
>>> marks = []
>>> for x in range(10):
...     marks.append(random.randint(0, 25))
>>> print(marks)
[10, 11, 13, 12, 6, 23, 4, 11, 14, 21]

We can store these marks in a Series. Here we store the 10 students' Physics results (10, 11, 13, 12, 6, 23, 4, 11, 14, 21) by building the data into an array and assigning it to the data parameter.

>>> import pandas as pd
>>> exam_result = pd.Series(data=[10, 11, 13, 12, 6, 23, 4, 11, 14, 21])
>>> exam_result
0 10
1 11
2 13
3 12
4 6
5 23
6 4
7 11
8 14
9 21
dtype: int64

As you can see, you have correctly stored the exam results in the Series. But storing the marks alone is only half useful: how do I know which student each mark belongs to?

We can solve this problem with the Series index. Since there are ten exam results, we naturally need ten names, so we build an array of the same length as data and assign it as follows.

>>> exam_result.index = ['Kitty','Lily','Elizabeth','Johnson','Harry','Ramsay','Bear','Austin', 'Nelly', 'Parker']
>>> exam_result
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
dtype: int64

You see, the names and marks now line up exactly. Although we know that Kitty, Lily, and so on are names, other people may not. How do we tell them?

To let others know, we can give the index a name.

>>> exam_result.index.name = "Name"
>>> exam_result
Name
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
dtype: int64

You might still wonder: if someone is reading my code, how can they quickly tell what this Series represents?

Don’t worry, just as we give the index a name, we can also give the Series a name.

>>> exam_result.name = "Student's exam result"
>>> exam_result
Name
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
Name: Student's exam result, dtype: int64

Through the above series of operations, we have a basic understanding of the structure of the Series, in simple terms, a Series includes data, index, and name.

The step-by-step operations above are convenient for demonstration. If you want to build the same Series in one go, you can do it in the following way.

>>> student_name = pd.Index(['Kitty', 'Lily', 'Elizabeth', 'Johnson', 'Harry', 'Ramsay', 'Bear', 'Austin', 'Nelly', 'Parker'], name="name")
>>> exam_result = pd.Series(data=[10, 11, 13, 12, 6, 23, 4, 11, 14, 21], index=student_name, name="Student's exam result")
>>> exam_result
name
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
Name: Student's exam result, dtype: int64

In addition, it should be noted that when we construct the Series, we do not set the data type of each element. At this time, Pandas will automatically determine a data type and use it as the type of Series.

Of course, we can also specify the data type manually.

>>> exam_result = pd.Series(data=[10, 11, 13, 12, 6, 23, 4, 11, 14, 21], index=student_name, name="Student's exam result", dtype=float)
>>> exam_result
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 6.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 14.0
Parker 21.0
Name: Student's exam result, dtype: float64

What is a Series like?

A Series has some characteristics of a dictionary, which means you can use dictionary-like operations on it. We can think of the index labels as the keys of a dictionary.

>>> exam_result['Lily']
11.0

A value can also be obtained with the get method. The advantage of this approach is that no exception is thrown when the index label does not exist.

>>> exam_result.get("Kitty")
10.0
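To make the contrast concrete, here is a minimal sketch (the label "Nobody" is made up and is not in the index): bracket indexing on a missing label raises a KeyError, while get quietly returns None or a supplied default.

```python
import pandas as pd

exam_result = pd.Series(
    data=[10, 11, 13, 12, 6, 23, 4, 11, 14, 21],
    index=["Kitty", "Lily", "Elizabeth", "Johnson", "Harry",
           "Ramsay", "Bear", "Austin", "Nelly", "Parker"],
    name="Student's exam result",
    dtype=float,
)

# "Nobody" is not in the index: get() returns None instead of raising
print(exam_result.get("Nobody"))             # None
# A fallback value can also be supplied
print(exam_result.get("Nobody", default=0))  # 0
```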

In addition to behaving like a dictionary, a Series is also very similar to an ndarray, which means you can use slicing operations.

>>> exam_result[2]
13.0
>>> exam_result[:4]
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Name: Student's exam result, dtype: float64
>>> exam_result[exam_result > 22]
name
Ramsay 23.0
Name: Student's exam result, dtype: float64
>>> exam_result[[7,2]] # student of 7 index and 2 index
name
Austin 11.0
Elizabeth 13.0
Name: Student's exam result, dtype: float64

As you can see, however we slice the Series, the index automatically stays aligned with the values.
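Index alignment also shows up when combining two Series arithmetically. A small sketch (the bonus Series here is hypothetical, not part of the running example): values are matched by label, never by position.

```python
import pandas as pd

marks = pd.Series([10, 11, 13], index=["Kitty", "Lily", "Elizabeth"])
# Hypothetical bonus marks, deliberately in a different order,
# with one label ("Parker") that marks does not have
bonus = pd.Series([2, 1, 3], index=["Lily", "Kitty", "Parker"])

total = marks + bonus
print(total)
# Kitty gets 10 + 1 and Lily gets 11 + 2; labels present in only
# one of the two Series (Elizabeth, Parker) come out as NaN
```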

Series vectorization operation

A Series, like an ndarray, supports vectorized operations. It can also be passed to most NumPy functions that expect an ndarray.

>>> exam_result + 2
name
Kitty 12.0
Lily 13.0
Elizabeth 15.0
Johnson 14.0
Harry 8.0
Ramsay 25.0
Bear 6.0
Austin 13.0
Nelly 16.0
Parker 23.0
Name: Student's exam result, dtype: float64
>>> import numpy as np
>>> np.exp(exam_result)
name
Kitty 2.202647e+04
Lily 5.987414e+04
Elizabeth 4.424134e+05
Johnson 1.627548e+05
Harry 4.034288e+02
Ramsay 9.744803e+09
Bear 5.459815e+01
Austin 5.987414e+04
Nelly 1.202604e+06
Parker 1.318816e+09
Name: Student's exam result, dtype: float64

DataFrame

A DataFrame is a two-dimensional data structure with an index. Each column has its own name and can have its own data type. Think of it as an Excel sheet or a table in a database. The DataFrame is the most commonly used Pandas object.

We continue with the previous example to explain the DataFrame. When storing the students' information, in addition to their marks, I also want to store each student's age. How do we achieve this with a DataFrame?

You can build a dictionary whose keys are the fields you want to store and whose values are lists of data, then pass the dictionary to the data parameter.
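The dictionary-based construction presumably looked something like the following sketch (using the same marks and ages that appear in the array-based example below): each key becomes a column name and each value becomes that column's data.

```python
import pandas as pd

student_name = pd.Index(["Kitty", "Lily", "Elizabeth", "Johnson", "Harry",
                         "Ramsay", "Bear", "Austin", "Nelly", "Parker"],
                        name="name")

# Keys -> column names, values -> column data
data = {
    "marks": [10, 11, 13, 12, 6, 23, 4, 11, 14, 21],
    "age":   [15, 14, 16, 17, 18, 12, 13, 19, 14, 18],
}
exam_result = pd.DataFrame(data=data, index=student_name)
print(exam_result)
```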

As you can see, we successfully built a DataFrame: the index of this DataFrame is the student names, and the two columns hold the students' marks and age information.

In addition to the above-mentioned way of passing in the dictionary, we can build it in another way. This is done by constructing a two-dimensional array and then generating a list of column names.

>>> Data = [[10, 15], [11, 14], [13, 16], [12, 17], [6, 18], [23, 12], [4, 13], [11, 19], [14, 14], [21, 18]]
>>> columns = ["marks", "age"]
>>> exam_result = pd.DataFrame(data=Data, index=student_name, columns=columns)
>>> exam_result

Accessing rows

After generating the DataFrame, you can see that each row represents one student's information. If I want to access Nelly's information, what should I do?

One way is to access a row by index name, which requires the loc method.

>>> exam_result.loc["Nelly"]
marks 14
age 14
Name: Nelly, dtype: int64

In addition to accessing a row of data directly through the index name, you can also select this row by the location of the row.

>>> exam_result.iloc[8]
marks 14
age 14
Name: Nelly, dtype: int64

Now that I can access one student's information, how do I access several students at once, that is, multiple rows?

It’s easy to do with row slicing, look here.

>>> exam_result.iloc[1:4]

Accessing columns

Having learned how to access rows, you will naturally want to access columns. We can access a column either as an attribute (".column_name") or with the bracket form (["column"]).

If I want to get the marks of all students, I can do this.

>>> exam_result.marks
name
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
Name: marks, dtype: int64
>>> exam_result["marks"]
name
Kitty 10
Lily 11
Elizabeth 13
Johnson 12
Harry 6
Ramsay 23
Bear 4
Austin 11
Nelly 14
Parker 21
Name: marks, dtype: int64

What if you want to get both marks and age?

Pass a list of column names; the order of the names in the list also controls the column order of the result.

>>> exam_result[["marks","age"]]

Add/Remove Columns

After generating the DataFrame, suddenly you find that the student’s gender is missing, so how do you add it?

If all genders are the same, we can pass in a scalar and Pandas will automatically broadcast it to every row for us.

>>> exam_result["sex"] = "Male"
>>> exam_result

If you want to delete a column, you can use the pop method. Note that pop removes the column in place and returns it.

>>> exam_result.pop("sex")
name
Kitty Male
Lily Male
Elizabeth Male
Johnson Male
Harry Male
Ramsay Male
Bear Male
Austin Male
Nelly Male
Parker Male
Name: sex, dtype: object

If the students' genders differ, we can add the new column by passing in a list-like of values.

>>> exam_result["sex"] = ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male"]
>>> exam_result

As you can see from the above example, when we create a new column, we modify it on the original DataFrame, which means that if a new column is added, the original DataFrame will change.

If you want to ensure that the original DataFrame does not change, we can create a new column by the assign method.

>>> exam_result.assign(Bonus_Mark=exam_result["marks"] + 1)
>>> import numpy as np
>>> exam_result.assign(sex_code=np.where(exam_result["sex"] == "Male", 1, 0))

Commonly used basic functions

When we work with Series and DataFrames, which features do we use most often? Let's check them out together. Continuing the scenario from the previous section, we have the students' exam information stored in a DataFrame.

Because DataFrames are more common than Series in most cases, the examples below use a DataFrame, but many of these features are also available on a Series.

Generally, after getting data, our first step is to understand its overall shape; you can use the info method for that.

>>> exam_result.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, Kitty to Parker
Data columns (total 3 columns):
marks 10 non-null int64
age 10 non-null int64
sex 10 non-null object
dtypes: int64(2), object(1)

If the data volume is very large, I certainly don't want to print everything just to take a quick look. In that case we can view only the first or last n rows: use the head method to view the first n rows, and the tail method to view the last n rows.

>>> exam_result.head(5)
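tail works the same way from the other end; a quick sketch with the running example's data:

```python
import pandas as pd

exam_result = pd.DataFrame(
    {"marks": [10, 11, 13, 12, 6, 23, 4, 11, 14, 21],
     "age":   [15, 14, 16, 17, 18, 12, 13, 19, 14, 18]},
    index=pd.Index(["Kitty", "Lily", "Elizabeth", "Johnson", "Harry",
                    "Ramsay", "Bear", "Austin", "Nelly", "Parker"],
                   name="name"),
)

print(exam_result.head(5))  # first 5 rows (Kitty .. Harry)
print(exam_result.tail(3))  # last 3 rows (Austin, Nelly, Parker)
```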

In addition, the data structures in Pandas share common methods and properties with ndarray, such as getting the shape of the data through .shape and the transpose through .T.

>>> exam_result.shape
(10, 3)
>>> exam_result.T

If we want the raw data underlying a DataFrame, we can get it via .values; the result is actually an ndarray.

>>> exam_result.values
array([[10, 15, 'Female'],
[11, 14, 'Female'],
[13, 16, 'Female'],
[12, 17, 'Male'],
[6, 18, 'Male'],
[23, 12, 'Male'],
[4, 13, 'Male'],
[11, 19, 'Male'],
[14, 14, 'Female'],
[21, 18, 'Male']], dtype=object)

Description and statistics

Sometimes, after we get the data, we want to see the simple statistical indicators (maximum, minimum, average, median, etc.) of the data. For example, if you want to see the maximum marks, how to achieve it?

Simply call the max method on the marks column.

>>> exam_result.marks.max()
23

Similarly, the minimum, average, median, and sum can be obtained by calling the min, mean, median, and sum methods. As you can see, calling these methods on a Series returns just a single aggregated result.
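As a quick check with the running example's marks, those aggregations can be sketched as:

```python
import pandas as pd

marks = pd.Series([10, 11, 13, 12, 6, 23, 4, 11, 14, 21], name="marks")

print(marks.min())     # 4
print(marks.mean())    # 12.5
print(marks.median())  # 11.5
print(marks.sum())     # 125
```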

Let me introduce an interesting method: cumsum. From the name you can tell it is related to sum, and indeed cumsum also sums, but cumulatively: its result has the same size as the original Series or DataFrame.

>>> exam_result.marks.cumsum()
name
Kitty 10
Lily 21
Elizabeth 34
Johnson 46
Harry 52
Ramsay 75
Bear 79
Austin 90
Nelly 104
Parker 125
Name: marks, dtype: int64

As you can see, at each position cumsum adds the current value to the running total of all previous values. For example, 34 above is 21 + 13. cumsum can even operate on columns of string type.

>>> exam_result.sex.cumsum()
name
Kitty Female
Lily FemaleFemale
Elizabeth FemaleFemaleFemale
Johnson FemaleFemaleFemaleMale
Harry FemaleFemaleFemaleMaleMale
Ramsay FemaleFemaleFemaleMaleMaleMale
Bear FemaleFemaleFemaleMaleMaleMaleMale
Austin FemaleFemaleFemaleMaleMaleMaleMaleMale
Nelly FemaleFemaleFemaleMaleMaleMaleMaleMaleFemale
Parker FemaleFemaleFemaleMaleMaleMaleMaleMaleFemaleMale
Name: sex, dtype: object

If you want to get more statistical methods, you can refer to the official link: Descriptive statistics

Although each common statistic has its own method, calling multiple methods to get multiple indicators is a bit troublesome, isn't it?

Pandas designers naturally take this into account. To get multiple statistics at once, just call the describe() method.

>>> exam_result.describe()

As you can see, after calling the describe method, some statistical indicators of the numeric columns are displayed: count, mean, standard deviation, minimum, maximum, and the 25%/50%/75% quantiles. If you want statistics for the non-numeric columns, you can pass include=["object"].

>>> exam_result.describe(include=["object"])

The result shows some statistical metrics for the columns of non-numeric type: count, number of unique values, the most common value, and the frequency of the most common value.

Also, if I want to count the number of occurrences of each value in a column, how can I do that quickly? Call the value_counts method to get the number of occurrences of each value in a Series.

>>> exam_result.sex.value_counts()
Male 6
Female 4
Name: sex, dtype: int64

If you want the index label corresponding to the maximum or minimum value of a column, you can use the idxmax or idxmin method.

>>> exam_result.marks.idxmax()
'Ramsay'

Discretization

Sometimes we want to discretize the marks (binning): in plain terms, divide the marks into several intervals. Here we want to divide the marks into 3 interval segments. This can be done with Pandas' cut method.

>>> pd.cut(exam_result.marks, 3)
name
Kitty (3.981, 10.333]
Lily (10.333, 16.667]
Elizabeth (10.333, 16.667]
Johnson (10.333, 16.667]
Harry (3.981, 10.333]
Ramsay (16.667, 23.0]
Bear (3.981, 10.333]
Austin (10.333, 16.667]
Nelly (10.333, 16.667]
Parker (16.667, 23.0]
Name: marks, dtype: category
Categories (3, interval[float64]): [(3.981, 10.333] < (10.333, 16.667] < (16.667, 23.0]]

As you can see, cut automatically generates equal-width intervals; you can also define the bin edges yourself.

>>> pd.cut(exam_result.marks, [1, 3, 4, 7, 8, 11, 12, 15, 16, 19])
name
Kitty (8.0, 11.0]
Lily (8.0, 11.0]
Elizabeth (12.0, 15.0]
Johnson (11.0, 12.0]
Harry (4.0, 7.0]
Ramsay NaN
Bear (3.0, 4.0]
Austin (8.0, 11.0]
Nelly (12.0, 15.0]
Parker NaN
Name: marks, dtype: category
Categories (9, interval[int64]): [(1, 3] < (3, 4] < (4, 7] < (7, 8] ... (11, 12] < (12, 15] < (15, 16] < (16, 19]]

Sometimes after discretization, you want to give each interval a name, you can specify the labels parameter.

>>> pd.cut(exam_result.marks, [1, 3, 4, 7, 8, 11, 12, 15, 16, 19], labels=["bad", "not bad", "medium", "average", "good", "extremely good", "Brilliant", "Superb", "Golden"])
name
Kitty good
Lily good
Elizabeth Brilliant
Johnson extremely good
Harry medium
Ramsay NaN
Bear not bad
Austin good
Nelly Brilliant
Parker NaN
Name: marks, dtype: category
Categories (9, object): [bad < not bad < medium < average ... extremely good < Brilliant < Superb < Golden]

In addition to cut, qcut can also discretize. cut bins by value ranges of equal width, while qcut bins by sample quantiles, so each bin contains roughly the same number of values.

>>> pd.qcut(exam_result.marks, 4)
name
Kitty (3.999, 10.25]
Lily (10.25, 11.5]
Elizabeth (11.5, 13.75]
Johnson (11.5, 13.75]
Harry (3.999, 10.25]
Ramsay (13.75, 23.0]
Bear (3.999, 10.25]
Austin (10.25, 11.5]
Nelly (13.75, 23.0]
Parker (13.75, 23.0]
Name: marks, dtype: category
Categories (4, interval[float64]): [(3.999, 10.25] < (10.25, 11.5] < (11.5, 13.75] < (13.75, 23.0]]

If anything in this section is unclear, please visit the official documentation.

Sorting Function

When performing data analysis, data sorting is indispensable. Pandas support two sorting methods: sort by axis (index or column) and sort by the actual value.

First, look at sorting by index: the sort_index method defaults to ascending order on the index.

>>> exam_result.sort_index()

If you want to sort the columns in descending order by column name, set the parameters axis=1 and ascending=False.

>>> exam_result.sort_index(axis=1, ascending=False)

If you want to sort by actual values instead, for example by marks, how do you do it?
Use the sort_values method and set the parameter by="marks".

>>> exam_result.sort_values(by="marks")

Sometimes we may need to sort by multiple values, for example, sort by marks and age, you can set the parameter by as a list.

Note: The order of each element in the list affects the sorting priority.

>>> exam_result.sort_values(by=["marks", "age"])

Usually, instead of a full sort, we may just need the n largest or n smallest values. We can use the nlargest and nsmallest methods, which is much faster than sorting first and then calling head(n).

>>> exam_result.marks.nlargest(3)
name
Ramsay 23
Parker 21
Nelly 14
Name: marks, dtype: int64

Function Application

Although Pandas provides us with a very rich set of functions, sometimes we may need to customize some functions ourselves and apply them to DataFrame or Series. Commonly used functions are map, apply, applymap.

map is a Series-specific method that lets you transform every element in the Series.

If I want to judge whether a student scored medium marks or above (12 marks or more counts as medium), it is easily done with map.

>>> exam_result.marks.map(lambda x: "yes" if x >= 12 else "no")
name
Kitty no
Lily no
Elizabeth yes
Johnson yes
Harry no
Ramsay yes
Bear no
Austin no
Nelly yes
Parker yes
Name: marks, dtype: object

Another example: I want to label each student as a boy or a girl based on the sex column. I can pass a mapping dictionary to map.

>>> sex_map = {"Female": "Girl", "Male": "Boy"}
>>> exam_result.sex.map(sex_map)
name
Kitty Girl
Lily Girl
Elizabeth Girl
Johnson Boy
Harry Boy
Ramsay Boy
Bear Boy
Austin Boy
Nelly Girl
Parker Boy
Name: sex, dtype: object

The apply method supports both Series and DataFrame: on a Series it is applied to each value, and on a DataFrame it is applied to whole rows or whole columns (controlled by the axis parameter).

# For Series, the apply method is not much different from the map method. 
>>> exam_result.marks.apply(lambda x: "yes" if x >= 12 else "no")
name
Kitty no
Lily no
Elizabeth yes
Johnson yes
Harry no
Ramsay yes
Bear no
Austin no
Nelly yes
Parker yes
Name: marks, dtype: object
# For a DataFrame, apply receives a whole row or column (a Series) at a time
>>> exam_result.apply(lambda x: x.max(), axis=0)
marks 23
age 19
sex Male
dtype: object

The applymap method is DataFrame-only: it acts on each element of the DataFrame, similar to how apply acts on each element of a Series.

>>> exam_result.applymap(lambda x: str(x).lower())

Modify the column/index name

While working with a DataFrame you will often need to change column names, index names, and so on. It's easy to do with rename.
To modify column names, you only need to set the columns parameter.

>>> exam_result.rename(columns={"marks": "Marks", "age": "Age", "sex": "Sex"})

Similarly, to modify the index name, you only need to set the parameter index.

>>> exam_result.rename(index={"Kitty": "kitty", "Lily": "lily", "Harry": "harry"})

Type Operation

If you want to get the number of columns for each type, you can use the get_dtype_counts method.

>>> exam_result.get_dtype_counts()
int64 2
object 1
dtype: int64

If you want to convert the data type, you can do it with astype.

>>> exam_result[“marks”].astype(float)
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 6.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 14.0
Parker 21.0
Name: marks, dtype: float64

Sometimes you need to convert an object column to another type; the common targets are numbers, dates, and time deltas. Pandas provides the to_numeric, to_datetime, and to_timedelta methods, respectively.

Here is some information about the height of these students.

>>> exam_result["height"] = ["160", "158", "162", "156", "164", "155", "165", "158", "150", "157cm"]
>>> exam_result

Now let's turn the height column into numbers. Obviously, "157cm" is not a number, so the conversion will fail for it. We can control what happens on failure with the errors parameter.

By default errors='raise', meaning an exception is thrown as soon as a conversion fails. Setting errors='coerce' replaces the problematic elements with pd.NaT (for datetime and timedelta) or np.nan (for numbers). Setting errors='ignore' returns the original data unchanged when conversion fails.

>>> pd.to_numeric(exam_result.height, errors="coerce")
name
Kitty 160.0
Lily 158.0
Elizabeth 162.0
Johnson 156.0
Harry 164.0
Ramsay 155.0
Bear 165.0
Austin 158.0
Nelly 150.0
Parker NaN
Name: height, dtype: float64
>>> pd.to_numeric(exam_result.height, errors="ignore")
name
Kitty 160
Lily 158
Elizabeth 162
Johnson 156
Harry 164
Ramsay 155
Bear 165
Austin 158
Nelly 150
Parker 157cm
Name: height, dtype: object
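to_datetime (and to_timedelta) accept the same errors parameter. A small sketch with hypothetical birthday strings (not part of the running example):

```python
import pandas as pd

# Hypothetical birthdays; the last entry is deliberately malformed
birth = pd.Series(["2003-05-01", "2004-01-15", "not a date"])

parsed = pd.to_datetime(birth, errors="coerce")
print(parsed)
# The malformed entry becomes NaT, the datetime counterpart of NaN
```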

Pandas missing value processing

Content directory

1. What is a missing value

2. Discard missing values

3. Fill in missing values

4. Replace missing values

5. Fill with other objects

What is a missing value

Before learning how missing values (also called null values) are handled, the first thing to know is: what is a missing value? Intuitively, a missing value represents "missing data."

Think about it: what causes missing values? There are many reasons. Data may be incomplete when collected, may be lost through mis-operation, or may have been removed deliberately.

Let’s take a look at our example.

We can see that the student Harry's mark is None, Johnson's age is NaN, and Bear's birthday is NaT. In the eyes of Pandas these are all missing values, and we can detect them with the isnull() or notnull() methods.
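The examples in this chapter assume the student table has been modified to contain missing values, but the modified construction was never shown. Here is a plausible reconstruction (the exact values are assumptions inferred from the outputs later in this chapter):

```python
import numpy as np
import pandas as pd

student_name = pd.Index(["Kitty", "Lily", "Elizabeth", "Johnson", "Harry",
                         "Ramsay", "Bear", "Austin", "Nelly", "Parker"],
                        name="name")
exam_result = pd.DataFrame(
    {
        # Harry's and Nelly's marks are unknown; None is stored as NaN
        "marks": [10, 11, 13, 12, None, 23, 4, 11, None, 21],
        "age": [15, 14, 16, 13, 18, 12, 13, 13, 14, 18],
        # "unknown" is a recorded placeholder, not yet a real missing value
        "sex": ["Female", "Female", "unknown", "Male", "Male",
                "unknown", "Male", "Male", "Female", "unknown"],
        # Bear's birthday is missing: NaT marks a missing datetime
        "birth": pd.to_datetime(["2004-01-01", "2004-06-15", "2003-03-03",
                                 "2002-08-08", "2003-12-12", "2004-02-02",
                                 None, "2001-11-11", "2004-04-04",
                                 "2003-07-07"]),
    },
    index=student_name,
)
print(exam_result.isnull())
```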

>>> exam_result.isnull()

Besides simply identifying which values are missing or not, the most common operation is filtering out rows with missing values. For example, to keep only the students whose marks are not empty:

>>> exam_result[exam_result.marks.notnull()]

Discard missing values

Since there are missing values, one common approach is to discard them. Missing values can be discarded with the dropna method.

>>> exam_result.marks.dropna()
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Parker 21.0
Name: marks, dtype: float64

Using dropna on a Series is relatively simple; for a DataFrame you can set more parameters.

The axis parameter controls whether rows or columns are dropped: axis=0 (the default) operates on rows, and axis=1 operates on columns.

The how parameter takes 'any' (the default) or 'all': 'any' drops a row/column when any of its elements is null; 'all' drops it only when all of its values are null.

The subset parameter restricts which columns (or index labels) are considered when deciding what to drop.

The thresh parameter takes an integer: for example, thresh=3 keeps only the rows/columns that have at least 3 non-null values.

# Delete a row if any of its fields is null
>>> exam_result.dropna(axis=0, how="any")
# Delete a row only if all of its fields are null
>>> exam_result.dropna(axis=0, how="all")
# Delete a row if either age or sex is null
>>> exam_result.dropna(axis=0, how="any", subset=["age", "sex"])
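The thresh parameter was described above but not demonstrated; a minimal sketch with a small hypothetical table:

```python
import numpy as np
import pandas as pd

# Hypothetical table: each row has a different number of non-null values
df = pd.DataFrame(
    {"marks": [10, np.nan, 13],
     "age": [15, np.nan, np.nan],
     "sex": ["Female", "Male", "Female"]},
    index=["Kitty", "Lily", "Elizabeth"],
)

# Keep only rows with at least 2 non-null values:
# Kitty has 3 and Elizabeth has 2; Lily has only 1 and is dropped
print(df.dropna(thresh=2))
```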

Fill in missing values

In addition to discarding missing values, we can also fill them in; the most common way is with the fillna method.

One common way to fill in missing values is to use a scalar. For example, here I fill the missing marks with 0.

>>> exam_result.marks.fillna(0)
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 0.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 0.0
Parker 21.0
Name: marks, dtype: float64

In addition to filling with a scalar, you can also fill with the previous or the next valid value.

Setting method='pad' or method='ffill' fills with the previous valid value.

>>> exam_result.marks.fillna(method="ffill")
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 12.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 11.0
Parker 21.0
Name: marks, dtype: float64

Setting method='bfill' or method='backfill' fills with the next valid value.

>>> exam_result.marks.fillna(method="backfill")
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 23.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 21.0
Parker 21.0
Name: marks, dtype: float64

Besides fillna, missing values can also be filled with the interpolate method. It uses linear interpolation by default, which can be changed via the method parameter.

>>> exam_result.marks.interpolate()
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 17.5
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 16.0
Parker 21.0
Name: marks, dtype: float64

Replace missing values

Have you ever thought about what counts as a missing value? You may say the previous sections already covered it: None, np.nan, and NaT are missing values. But those are only what Pandas considers missing. Sometimes, in our eyes, certain outliers should also be treated as missing values.

For example, in our stored student information, suppose we know that no student in this class is 13 years old; then an age of 13 must be a recording error, and we can treat it as an outlier. Or: we all know gender is either male or female, and an unknown gender was recorded as "unknown"; clearly "unknown" can also be considered a missing value. Blank strings, likewise, can sometimes be treated as missing values.

In cases like these, we can use the replace method to turn such values into proper missing values.

>>> exam_result.age.replace(13, np.nan)
name
Kitty 15.0
Lily 14.0
Elizabeth 16.0
Johnson NaN
Harry 18.0
Ramsay 12.0
Bear NaN
Austin NaN
Nelly 14.0
Parker 18.0
Name: age, dtype: float64

You can also specify a mapping dictionary.

>>> exam_result.age.replace({13: np.nan})
name
Kitty 15.0
Lily 14.0
Elizabeth 16.0
Johnson NaN
Harry 18.0
Ramsay 12.0
Bear NaN
Austin NaN
Nelly 14.0
Parker 18.0
Name: age, dtype: float64

For a DataFrame, you can specify a value to replace for each column.

>>> exam_result.replace({"age": 13, "birth": pd.Timestamp("2002-08-08")}, np.nan)

Similarly, we can replace a specific string, such as "unknown".

>>> exam_result.sex.replace("unknown", np.nan)
name
Kitty Female
Lily Female
Elizabeth NaN
Johnson Male
Harry Male
Ramsay NaN
Bear Male
Austin Male
Nelly Female
Parker NaN
Name: sex, dtype: object

Besides replacing specific values, you can also replace by regular expression, for example replacing blank strings with a null value.

>>> exam_result.sex.replace(r"\s+", np.nan, regex=True)
name
Kitty Female
Lily Female
Elizabeth NaN
Johnson Male
Harry Male
Ramsay NaN
Bear Male
Austin Male
Nelly Female
Parker NaN
Name: sex, dtype: object

Fill with other objects

Besides manually discarding, filling, and replacing missing values, we can also fill them from another object.

For example, suppose there are two Series of student marks, one with missing values and one without. We can take the elements of the complete Series and use them to fill the gaps in the other with combine_first.

>>> marks_new = exam_result.marks.copy()
>>> marks_new.fillna(20, inplace=True)
>>> marks_new
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 20.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 20.0
Parker 21.0
Name: marks, dtype: float64
>>> exam_result.marks.combine_first(marks_new)
name
Kitty 10.0
Lily 11.0
Elizabeth 13.0
Johnson 12.0
Harry 20.0
Ramsay 23.0
Bear 4.0
Austin 11.0
Nelly 20.0
Parker 21.0
Name: marks, dtype: float64

As you can see, the missing values ​​for marks in the student information are populated with the marks_new collection.

Pandas text data processing

1. Why use the str attribute

2. Replacement and segmentation

3. Extract substring

Why use the str attribute

Text data is what we usually call strings. Pandas provides the str attribute on Series, which makes it easy to manipulate each element.

As we learned before, we can use the map or apply methods when dealing with each element in the Series.

For example, I want to convert each gender value to lowercase, so I try the following method.

>>> exam_result.sex.map(lambda x: x.lower())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-e09f49c63410> in <module>
----> 1 exam_result.sex.map(lambda x: x.lower())

~\Anaconda3\lib\site-packages\pandas\core\series.py in map(self, arg, na_action)
3380 """
3381 new_values = super(Series, self)._map_values(
-> 3382 arg, na_action=na_action)
3383 return self._constructor(new_values,
3384 index=self.index).__finalize__(self)

~\Anaconda3\lib\site-packages\pandas\core\base.py in _map_values(self, mapper, na_action)
1216
1217 # mapper is a function
-> 1218 new_values = map_f(values, mapper)
1219
1220 return new_values

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-39-e09f49c63410> in <lambda>(x)
----> 1 exam_result.sex.map(lambda x: x.lower())

AttributeError: 'float' object has no attribute 'lower'

What actually went wrong? The reason is that a float object has no lower attribute, and the missing value (np.nan) is of float type.

This time, let's use the str attribute and see how it works.

# Convert text to lowercase 
>>> exam_result.sex.str.lower()
name
Kitty female
Lily female
Elizabeth NaN
Johnson male
Harry male
Ramsay NaN
Bear male
Austin male
Nelly female
Parker NaN
Name: sex, dtype: object

As you can see, the methods accessed through the str attribute share their names with Python's built-in string methods, and they automatically skip missing values.

Let’s try some other methods. For example, count the length of each string.

>>> exam_result.sex.str.len()
name
Kitty 6.0
Lily 6.0
Elizabeth NaN
Johnson 4.0
Harry 4.0
Ramsay NaN
Bear 4.0
Austin 4.0
Nelly 6.0
Parker NaN
Name: sex, dtype: float64

Replacement and Segmentation

The .str attribute also supports replacement and splitting operations.

Let's look at replacement first, for example replacing the letter "e" with an underscore.

>>> exam_result.sex.str.replace("e", "_")
name
Kitty F_mal_
Lily F_mal_
Elizabeth NaN
Johnson Mal_
Harry Mal_
Ramsay NaN
Bear Mal_
Austin Mal_
Nelly F_mal_
Parker NaN
Name: sex, dtype: object

The replace method also supports regular expressions. For example, replace every gender value beginning with “F” with a space.

>>> exam_result.sex.str.replace("^F.*", " ")
name
Kitty
Lily
Elizabeth NaN
Johnson Male
Harry Male
Ramsay NaN
Bear Male
Austin Male
Nelly
Parker NaN
Name: sex, dtype: object
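One caveat worth knowing: since pandas 2.0, str.replace defaults to regex=False, so a pattern like "^F.*" is taken as a literal string unless you pass regex=True explicitly. A sketch with stand-in data:

```python
import numpy as np
import pandas as pd

sex = pd.Series(["Female", "Male", np.nan])

# Literal replacement, as in the first example above
literal = sex.str.replace("e", "_")

# Regex replacement: pass regex=True explicitly; since pandas 2.0 the
# default is regex=False, so "^F.*" would otherwise be matched literally
cleared = sex.str.replace("^F.*", " ", regex=True)

print(literal.tolist())
print(cleared.tolist())
```

In pandas versions before 2.0 (like the one used in this article), regex matching was the default, which is why the examples above work without the flag.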

Let’s look at the split operation, for example splitting each value on a space.

>>> exam_result.sex.str.split(" ")
name
Kitty [Female]
Lily [Female]
Elizabeth NaN
Johnson [Male]
Harry [Male]
Ramsay NaN
Bear [Male]
Austin [Male]
Nelly [Female]
Parker NaN
Name: sex, dtype: object

Elements of the resulting lists can be accessed with the get method or the [] notation:

>>> exam_result.sex.str.split(" ").str.get(1)
name
Kitty NaN
Lily NaN
Elizabeth NaN
Johnson NaN
Harry NaN
Ramsay NaN
Bear NaN
Austin NaN
Nelly NaN
Parker NaN
Name: sex, dtype: float64
>>> exam_result.sex.str.split(" ").str[1]
name
Kitty NaN
Lily NaN
Elizabeth NaN
Johnson NaN
Harry NaN
Ramsay NaN
Bear NaN
Austin NaN
Nelly NaN
Parker NaN
Name: sex, dtype: float64

Setting the parameter expand=True easily extends this to return a DataFrame.

>>> exam_result.sex.str.split(" ", expand=True)
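Because the gender values here contain no spaces, the effect of expand=True is easier to see with two-word strings. A sketch using hypothetical full names:

```python
import pandas as pd

full_names = pd.Series(["Kitty Pryde", "Harry Potter"])  # hypothetical data

# expand=True returns a DataFrame with one column per split part,
# instead of a Series of lists
parts = full_names.str.split(" ", expand=True)
print(parts)
```

Each part of the split lands in its own column (labeled 0, 1, ...), which is often more convenient than a column of lists.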

Extracting Substrings

Since it’s natural to manipulate strings, you might wonder if you can extract substrings from a long string. The answer is yes.

Extract the first matching substring

The extract method accepts a regular expression containing at least one capture group; specifying expand=True ensures that a DataFrame is returned.

For example, to match all the letters before a whitespace character, you can use the following operation:

>>> exam_result.sex.str.extract(r"(\w+)\s+", expand=True)

If the regular expression contains multiple capture groups, a DataFrame is returned with one column per group.

For example, to match all the letters before and after a whitespace character, the operation is as follows:

>>> exam_result.sex.str.extract(r"(\w+)\s+(\w+)", expand=True)

Match all substrings

extract only returns the first matching substring; use extractall to match all of the substrings.

For example, to match every group of letters followed by whitespace, you can do the following:

>>> exam_result.sex.str.extractall(r"(\w+)\s+")
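The gender values here contain no whitespace, so extractall returns an empty frame for them; a sketch with made-up strings that match more than once shows the shape of the result:

```python
import pandas as pd

s = pd.Series(["one two ", "three four five "])  # hypothetical data

# extractall yields one row per match; the result carries a
# MultiIndex of (original index, match number)
matches = s.str.extractall(r"(\w+)\s+")
print(matches)
```

The MultiIndex lets you trace each match back to the row it came from, while extract would only have returned "one" and "three".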

Test whether the substring is included

Besides extracting substrings, we can use contains to test whether a value contains a given substring. For example, test whether a gender value contains the substring "Fe".

>>> exam_result.sex.str.contains("Fe")
name
Kitty True
Lily True
Elizabeth NaN
Johnson False
Harry False
Ramsay NaN
Bear False
Austin False
Nelly True
Parker NaN
Name: sex, dtype: object

Of course, regular expressions are also supported. For example, you can test whether a value starts with the letter "l".

>>> exam_result.sex.str.contains("^l")
name
Kitty False
Lily False
Elizabeth NaN
Johnson False
Harry False
Ramsay NaN
Bear False
Austin False
Nelly False
Parker NaN
Name: sex, dtype: object
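Note that the NaN entries in the result mean it cannot be used directly as a boolean filter. The na parameter of contains fills missing values; a sketch with stand-in data:

```python
import numpy as np
import pandas as pd

sex = pd.Series(["Female", "Male", np.nan],
                index=["Kitty", "Johnson", "Elizabeth"])

# na=False treats missing values as non-matches, yielding a clean
# boolean mask that can be used for filtering
mask = sex.str.contains("Fe", na=False)
print(sex[mask])  # only rows whose value contains "Fe"
```

With na=False the result has bool dtype throughout, so it can be passed straight into `[]` for row selection.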

Generate dummy variables

This is a wonderful function: the get_dummies method converts a string column into dummy variables, and the sep parameter specifies the separator between values. Let's see the effect.

>>> exam_result.sex.str.get_dummies(sep=" ")
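Since each gender value is a single word, sep has little to do in the example above. The parameter matters when one cell holds several separator-joined values; a sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: each cell can hold several values joined by "|"
hobbies = pd.Series(["reading|music", "music", "sports|reading"])

# Each distinct value becomes a 0/1 indicator column
dummies = hobbies.str.get_dummies(sep="|")
print(dummies)
```

Every distinct token gets its own column, with a 1 in each row where that token appears, which is a common first step before feeding categorical data to a model.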

Import Pandas

The first thing, of course, is to ask our star, Pandas.

import pandas as pd  # This is the standard

This is the standard way of importing pandas. We don't want to write out the full name pandas every time, but keeping code concise and avoiding naming conflicts are both important, so the compromise is pd. If you read other people's pandas code, you'll see this import style everywhere.

Pandas data type

Source Link: https://pbpython.com/pandas_dtypes.html

Data types are essentially internal structures used by programming languages to understand how data is stored and manipulated. For example, a program needs to understand that you can add two numbers, such as 2 + 3 to get 5. Or, if they are two strings, such as “Iron” and “Man”, you can concatenate (add) them to get “IronMan”.

One potentially confusing aspect of the Pandas data type is the overlap between the data types of Pandas, Python, and numpy. The following table summarizes the key points:

Pandas dtype mapping (from the source linked above):

- object: Python str; NumPy string_, unicode_; text or mixed numeric and non-numeric values
- int64: Python int; NumPy int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64; integer numbers
- float64: Python float; NumPy float_, float16, float32, float64; floating point numbers
- bool: Python bool; NumPy bool_; True/False values
- datetime64: NumPy datetime64[ns]; date and time values
- timedelta[ns]: differences between two datetimes
- category: finite list of text values

For the most part, there is no need to worry about explicitly forcing a pandas type to its corresponding NumPy type. Most of the time, pandas' default int64 and float64 types will work. The only reason I included the NumPy types in this table is that you may sometimes see them pop up online or in your own analysis.
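To make the mapping concrete, you can inspect each column's dtype with .dtypes and convert explicitly with astype; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [1.5, 2.5, 3.5],
                   "c": ["x", "y", "z"]})

# Inspect the dtype pandas inferred for each column
print(df.dtypes)  # a: int64, b: float64, c: object

# Convert a column explicitly with astype
df["a"] = df["a"].astype("float64")
print(df["a"].dtype)
```

astype accepts either a NumPy dtype object or its string name, which is usually all you need for everyday conversions.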

Data Collection (Import data into Pandas)

Before we can modify, explore, and analyze data, we must first import it. Thanks to Pandas, this is easier than with NumPy alone.

Here I encourage you to find the data you are interested in and use it for practice. Your (or other) country’s website is a good source of data. The following are good sources of data collection:

Google: You can easily search your data through google.

Government Database: Refined data is available from various government databases. To give an example, the UK government data and US government data portals are good starting points.

University Data Repository: Some universities share their data freely. For example, Stanford University hosts a large network dataset collection.

Non-governmental organization: Various NGOs and companies share their data. For example, you can find BuzzFeed's news data in their own GitHub repository.

Kaggle: You can think of Kaggle as a hub for machine learning and data science. The site is dedicated to these fields, and you will find numerous datasets there.

Github: There is a lot of data available on GitHub; just search for the data you need. See, for example, Awesome Public Datasets.

Some links are given below:

  1. https://opendata.socrata.com/
  2. https://www.reddit.com/r/datasets/
  3. https://archive.ics.uci.edu/ml/index.php
  4. https://www.datasetlist.com/

I will use the Boston Housing data, which can be easily downloaded from the Kaggle website. Link: https://www.kaggle.com/c/boston-housing/data

Download both test.csv and train.csv then create a folder and place these two files. We will work with the train.csv file.

>>> import pandas as pd
>>> df = pd.read_csv('PATH/train.csv')  # substitute your own folder for PATH

For windows user,

PATH = C:\Users\msjahid\Documents\Boston Housing Dataset 
# Now escape each backslash with an extra \ and add the file name
# C:\\Users\\msjahid\\Documents\\Boston Housing Dataset\\train.csv

Now going to our jupyter notebook,

From here we import the csv file and store it in a DataFrame. This step is very simple: just call read_csv and pass in the path of the file. head(5) displays the first five rows of the dataset.
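If you don't have train.csv on hand, the same call can be tried on CSV text read from memory; a self-contained sketch with a few made-up rows under Boston-style column names (the call is identical with a file path):

```python
import io
import pandas as pd

# Stand-in for train.csv: read_csv accepts any file-like object,
# so StringIO lets us test without a file on disk
csv_text = "crim,zn,indus\n0.006,18.0,2.31\n0.027,0.0,7.07\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(5))  # show the first five rows
```

Swapping io.StringIO(csv_text) for the path string from the previous step is the only change needed to read the real file.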

Written by Jahid Hasan

Enthusiast of Data Science | Data Analyst | Machine Learning | Artificial intelligence | Deep Learning | Author || ps: https://msjahid.github.io/