Data Manipulation

Understanding Data Manipulation with Python — Pandas Library

Rashmi Duleesha
LinkIT
5 min read · Jun 19, 2020


Today I will be focusing on data manipulation with Python and why it is important for data science.

What is Data Manipulation?

Photo by Markus Spiske on Unsplash

So the first question is: what is data manipulation? Put simply, it is the process of making data more organized.

Let's start with what data is. Data is a set of values. By studying these values, you can discover more about a particular thing. Data manipulation plays a major role in that discovery process because it helps you get more value out of the data.

Data manipulation pays off whenever you work with datasets. You can reshape a dataset however you need: edit, delete, or insert values as required. You can also work with historical data, which can feed into future predictions.

Understanding the Pandas Library

Photo by Hitesh Choudhary on Unsplash

In this article, I aim to give a clear picture of the data manipulation process with the Pandas library. Pandas is an essential Python package designed for data manipulation and analysis.

Pandas is especially useful when you are dealing with large and complex datasets. It helps you perform matrix-style calculations, run queries and aggregations, find incorrect or missing values, and visualize data. Its two main data structures for analysis are Series and DataFrame.

Series

A Series is a one-dimensional array-like object that can hold any data type. Unlike a Python list, a Series lets you attach your own index labels to its values.

Example 1:

Output:
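As a minimal sketch of a Series with custom index labels (the `marks` name, the values, and the labels are made up for illustration and are not from the original example):

```python
import pandas as pd

# Hypothetical marks for three subjects; the labels replace the default 0, 1, 2 index.
marks = pd.Series([85, 72, 90], index=["maths", "science", "english"])

print(marks)             # each value is shown next to its label
print(marks["science"])  # look up a value by its label -> 72
print(marks.index)       # the labels themselves
```

A plain list would only let you reach these values by position, whereas the Series can be read by subject name.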

A Series also behaves like a dictionary, because each value can be looked up by its index label.

Example 2:

Output:
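The dictionary-like behaviour can be sketched as follows, assuming a hypothetical `stock` Series built from a made-up dictionary:

```python
import pandas as pd

# Hypothetical stock counts; the dictionary keys become the index labels.
stock = pd.Series({"apples": 10, "bananas": 4, "cherries": 25})

print(stock["bananas"])     # dictionary-style lookup by key -> 4
print("cherries" in stock)  # membership is tested against the index -> True
print(list(stock.keys()))   # keys() returns the index labels
```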

DataFrame

Let's move on to DataFrames. In contrast to a Series, a DataFrame is a two-dimensional array-like object. In the real world, a dataset often arrives as a bunch of separate files, which makes it hard to analyze. Combining those files into one DataFrame lets us analyze the data more effectively.

This is why DataFrames are very important in the data manipulation process. Let's understand the structure of a DataFrame: it looks like a table, with rows and columns.

Example:

Output:
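A minimal sketch of that table structure, and of combining separate pieces into one DataFrame. The two small frames stand in for two files; all names and values are hypothetical, and `pd.concat` is only one of several ways to combine data:

```python
import pandas as pd

columns = ["name", "age", "city"]

# Two hypothetical batches of records, standing in for two separate files.
batch_1 = pd.DataFrame([["Amal", 21, "Colombo"], ["Nimali", 23, "Kandy"]], columns=columns)
batch_2 = pd.DataFrame([["Kasun", 22, "Galle"]], columns=columns)

# Stack the rows into a single table; ignore_index renumbers the rows from 0.
students = pd.concat([batch_1, batch_2], ignore_index=True)

print(students)        # a table with three rows and three columns
print(students.shape)  # (rows, columns) -> (3, 3)
```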

Converting Dictionary into DataFrame

For small tasks, a plain dictionary may be faster. But when you need to deal with more complex datasets, DataFrames are more useful. When a dictionary is converted into a DataFrame, each key becomes a column name.

Example:

Output:
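The conversion can be sketched like this, assuming a hypothetical `data` dictionary whose values are lists of equal length:

```python
import pandas as pd

# Hypothetical data: each key is a column name, each list holds that column's values.
data = {
    "name": ["Amal", "Nimali", "Kasun"],
    "age": [21, 23, 22],
    "city": ["Colombo", "Kandy", "Galle"],
}

df = pd.DataFrame(data)

print(df)                   # one column per dictionary key
print(df.columns.tolist())  # ['name', 'age', 'city']
```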

Basic statistical operations

1. info()

The info() method prints a concise summary of a DataFrame: its index range, column names, non-null counts, data types, and memory usage.

Example:

Output:
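A minimal sketch of `info()` on a made-up DataFrame with one missing value, so that the non-null counts differ between columns:

```python
import pandas as pd

# Hypothetical data; the None in "age" is stored as a missing value (NaN).
df = pd.DataFrame({"name": ["Amal", "Nimali", "Kasun"], "age": [21, None, 22]})

# Prints the index range, each column's non-null count and dtype, and memory usage.
df.info()
```

Note that `info()` prints its summary directly and returns `None`, so there is nothing to assign from it.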

2. describe()

describe() shows basic statistical details for numeric columns, such as count, mean, standard deviation, min, max, and the quartiles (the median appears as the 50% row).

Example:

Output:
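As a sketch, here is `describe()` on a hypothetical `scores` DataFrame (column names and values are made up):

```python
import pandas as pd

# Hypothetical exam scores for four students.
scores = pd.DataFrame({"maths": [60, 70, 80, 90], "science": [55, 65, 75, 85]})

summary = scores.describe()

print(summary)                       # count, mean, std, min, 25%, 50%, 75%, max per column
print(summary.loc["mean", "maths"])  # 75.0
print(summary.loc["50%", "maths"])   # the median -> 75.0
```

The result is itself a DataFrame, so individual statistics can be picked out by row and column label.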

3. loc and iloc

These indexers are used to select and filter data. They are used with square brackets rather than called like functions: loc[] selects rows and columns by their labels, while iloc[] selects them by their integer positions.

Example: iloc

Output:
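A minimal sketch of position-based selection with `iloc`, assuming a made-up DataFrame whose rows are labelled with names:

```python
import pandas as pd

# Hypothetical data with string row labels, to make the contrast with positions clear.
df = pd.DataFrame(
    {"age": [21, 23, 22], "city": ["Colombo", "Kandy", "Galle"]},
    index=["amal", "nimali", "kasun"],
)

first_row = df.iloc[0]    # the first row, returned as a Series
first_two = df.iloc[0:2]  # rows at positions 0 and 1 (the end position is excluded)
cell = df.iloc[1, 0]      # row position 1, column position 0 -> 23

print(first_row)
print(first_two)
print(cell)
```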

Example 1: loc()

Output:
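Label-based selection with `loc` can be sketched on a similar made-up DataFrame (row labels, column names, and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 23, 22], "city": ["Colombo", "Kandy", "Galle"]},
    index=["amal", "nimali", "kasun"],
)

row = df.loc["nimali"]                     # one row, selected by its label
cell = df.loc["nimali", "city"]            # row label and column label -> 'Kandy'
subset = df.loc["amal":"nimali", ["age"]]  # label slices include both endpoints

print(row)
print(cell)
print(subset)
```

Unlike the position slices used with `iloc`, a label slice such as `"amal":"nimali"` keeps the end label.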

Example 2:

Output:
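`loc` also accepts a boolean condition, which is one common way to filter rows. This is a sketch under the assumption that filtering by condition is what is wanted here; the data and the age threshold are made up:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 23, 22], "city": ["Colombo", "Kandy", "Galle"]},
    index=["amal", "nimali", "kasun"],
)

# Keep only the rows where age is above 21, and only the "city" column.
older = df.loc[df["age"] > 21, ["city"]]

print(older)
```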

Performing aggregation functions

Example 1:

Output:
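A minimal sketch of single aggregation functions, using a hypothetical `scores` DataFrame (the choice of `mean()` and `max()` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical exam scores for four students.
scores = pd.DataFrame({"maths": [60, 70, 80, 90], "science": [55, 65, 75, 85]})

column_means = scores.mean()        # one mean per column, returned as a Series
top_maths = scores["maths"].max()   # a single value for one column -> 90

print(column_means)
print(top_maths)
```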

Getting sum & min using an aggregation function

Output:
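Getting the sum and the minimum in one call can be sketched with `agg()`, again on a made-up `scores` DataFrame:

```python
import pandas as pd

scores = pd.DataFrame({"maths": [60, 70, 80, 90], "science": [55, 65, 75, 85]})

# A list of function names applies every function to every column.
totals = scores.agg(["sum", "min"])
print(totals)  # rows "sum" and "min", one column per subject

# A dictionary applies a different function to each column.
mixed = scores.agg({"maths": "sum", "science": "min"})
print(mixed)   # maths -> 300, science -> 55
```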

The Pandas library is the heart of data manipulation in Python, and it has many more features than those covered here. It can merge and join datasets, filter data by conditions, sort data in ascending or descending order, read CSV and other file formats, and more.
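A few of these features can be sketched together. The CSV text, column names, and values below are hypothetical, and an in-memory string stands in for a real file so the example is self-contained:

```python
import io

import pandas as pd

# Read CSV data; io.StringIO stands in for a file path here.
csv_text = "name,age\nAmal,21\nNimali,23\nKasun,22\n"
people = pd.read_csv(io.StringIO(csv_text))

# A second hypothetical table that shares the "name" column.
cities = pd.DataFrame(
    {"name": ["Amal", "Nimali", "Kasun"], "city": ["Colombo", "Kandy", "Galle"]}
)

merged = people.merge(cities, on="name")              # join the two tables on "name"
older = merged[merged["age"] > 21]                    # filter rows by a condition
by_age = merged.sort_values("age", ascending=False)   # sort in descending order

print(merged)
print(older)
print(by_age)
```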

Beyond that, the Pandas library is used in data analysis and data cleaning. It also serves as a high-level building block when you are working with datasets.

I hope you learned the basics of DataFrames and their main functions. In real-world scenarios, these small concepts help you understand complex datasets and find more effective ways of solving problems.

Thanks for reading!
