A Guide to Data Manipulation with Python’s Pandas and NumPy

Hiba Rezek
Published in Tech Blog · Feb 13, 2024
Data Manipulation using Python (by the author)

Unlock the power of data manipulation with Python’s Pandas and NumPy. Within this comprehensive guide, explore the fundamental principles of refining, cleaning, and organizing raw data. Learn practical data handling techniques through step-by-step tutorials and real-world examples. By the end of this article, you’ll possess the skills to extract valuable insights from your data with ease. Start your data manipulation journey today!

What is data manipulation?

Data manipulation refers to the process of transforming, cleaning, and reorganizing data to make it suitable for analysis, visualization, and further processing. In data science, data manipulation is a critical step in the data preprocessing phase, where raw data is refined and structured to extract meaningful insights efficiently. This process often involves various operations, such as filtering, merging, reshaping, and aggregating data.

What are typical data manipulation tasks?

Some typical data manipulation tasks are listed below; a short code sketch illustrating a few of them follows the list:

Data manipulation pipeline (by the author)
  • Data Cleaning: Identifying and handling missing values, outliers, and inconsistencies in the data to ensure accuracy and reliability.
  • Data Transformation: Converting data from one format to another, changing data types, and scaling numerical values to bring them into a common range.
  • Filtering and Sub-setting: Selecting specific rows or columns based on certain conditions to focus on relevant data for analysis.
  • Data Aggregation: Combining data into groups and calculating summary statistics (e.g., mean, sum, count) for each group.
  • Data Joining and Merging: Combining data from multiple sources based on common attributes to construct a unified dataset.
  • Pivoting and Reshaping: Reorganizing data to change its structure, such as turning rows into columns or columns into rows, to see different summaries of the source data.
  • Data Encoding: Converting categorical variables into numerical representations for analysis.
  • Feature Engineering: Creating new features or variables from existing data that may improve the performance of machine learning models.
  • Data Imputation: Filling in missing values using various techniques to maintain the integrity of the dataset.
  • Data Normalization: Scaling numerical data to a standard range, often between 0 and 1, to prevent the dominance of certain features.
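
To make a few of these tasks concrete, here is a minimal sketch using Pandas on a small made-up dataset, covering imputation, encoding, and normalization:

import pandas as pd

# A small, made-up dataset with a missing value and a categorical column
df = pd.DataFrame({
    'City': ['New York', 'Boston', 'New York'],
    'Temperature': [30.0, None, 50.0]
})

# Data imputation: fill the missing temperature with the column mean
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].mean())

# Data encoding: convert the categorical 'City' column into numeric codes
df['City_Code'] = df['City'].astype('category').cat.codes

# Data normalization: scale 'Temperature' to the 0-1 range (min-max scaling)
df['Temperature_Scaled'] = (df['Temperature'] - df['Temperature'].min()) / (df['Temperature'].max() - df['Temperature'].min())

print(df)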

What programming languages does data manipulation require?

Data manipulation is typically performed using programming languages like Python or R and libraries such as Pandas (Python) and dplyr (R), which offer powerful functions and methods to efficiently handle large datasets and perform complex data manipulation operations. In this article, we will focus only on Python.

Which is better for data analysis: R or Python? (imaginarycloud.com)

Why is data manipulation important?

The quality of data manipulation directly impacts the accuracy and reliability of any data analysis or machine learning models built on the processed data. Therefore, data scientists spend a considerable amount of time and effort on data manipulation to ensure that the data is in the most suitable form for meaningful insights and predictions.

What are some data manipulation libraries in Python?

Python offers several powerful data manipulation libraries that are widely used in the data science community. These libraries provide various tools and functions for data cleaning, transformation, analysis, and visualization. Here are some of the most popular data manipulation libraries in Python:

Libraries in Python (https://teksands.ai/)
  1. Pandas:
    Pandas is a fundamental data manipulation library in Python. It provides data structures like DataFrame and Series, enabling the smooth handling of structured data. Pandas provides a set of tools to facilitate data cleaning, filtering, merging, grouping, and aggregation.
  2. NumPy:
    While NumPy is primarily known for its numerical computing capabilities, it also plays a significant role in data manipulation. It provides support for arrays and matrices, enabling efficient manipulation of large datasets.
  3. SciPy:
    SciPy builds on NumPy and provides additional scientific computing functionalities, including statistical functions, optimization, integration, interpolation, and more.
  4. Dask:
    Dask is a parallel computing library that extends the capabilities of Pandas and NumPy. It allows you to work with larger-than-memory datasets by performing operations in parallel and leveraging distributed computing resources.
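
Since Dask is designed to mirror much of the Pandas interface, a minimal sketch of how it might be used could look like this (assuming Dask is installed and that a hypothetical large CSV file named 'big_data.csv' with 'City' and 'Age' columns exists):

import dask.dataframe as dd

# Lazily read a large CSV file in partitions instead of loading it all into memory
df = dd.read_csv('big_data.csv')

# Familiar Pandas-style groupby; .compute() triggers the actual parallel computation
result = df.groupby('City')['Age'].mean().compute()
print(result)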

How can you determine which library to utilize?

These libraries cater to different use cases and dataset sizes, so the choice of library depends on the specific requirements of your project. Pandas, being the most widely used and beginner-friendly, is an excellent starting point for most data manipulation tasks. As you encounter larger datasets and more complex scenarios, you may explore other libraries that offer better performance and scalability.

In this article, our main focus is on the Python libraries that deal with data manipulation: Pandas and NumPy.

What is Pandas?

Pandas is a powerful and widely used Python library for data manipulation and analysis. It provides data structures like DataFrame and Series that allow you to handle structured data efficiently. Here’s an overview of how to use Pandas.

pandas library for data manipulation in Python (wikipedia.org)

How do you install Pandas?

If you haven’t installed Pandas yet, you can do so using pip, the Python package manager. Open a terminal or command prompt and run the following command:

pip install pandas

How do you import Pandas?

To use Pandas in your Python script or Jupyter Notebook, you need to import the library first. Conventionally, it is imported using the alias `pd`:

import pandas as pd

What is a DataFrame? How do you create it?

A DataFrame is a two-dimensional, tabular data structure, similar to a spreadsheet. You can create a DataFrame from various data sources such as dictionaries, lists, NumPy arrays, or CSV files. For example:

import pandas as pd
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)

How do you view your data?

To see the content of your data frame, you can use the `.head()` method to view the first few rows:

# Shows the first 5 rows of the DataFrame if no argument is specified
print(df.head())
# If an argument is specified, for example 3, this method shows the first 3 rows
print(df.head(3))

How can you access your data by column name?

You can access data in a data frame using various methods. For example, to access a specific column:

# Accessing a specific column
names = df['Name']

How do you filter the data based on a specific condition?

You can filter rows in a DataFrame based on certain conditions. For example, to filter people aged above 30:

# Filtering rows based on a condition
above_30 = df[df['Age'] > 30]

How can you add or modify data in a DataFrame?

You can add new columns or modify existing ones in a DataFrame. For example, to add a new column ‘Country’:

# Adding a new column
df['Country'] = 'USA'
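# Modifying an existing column, for example incrementing every value in 'Age' by one
df['Age'] = df['Age'] + 1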

How can you read or write data from and to various file types?

Pandas supports reading data from and writing data to various formats such as CSV, Excel, SQL databases, etc. CSV files are the most commonly used format for structured, tabular data. For example:

# Reading data from a CSV file
df = pd.read_csv('data.csv')
# Writing data to a CSV file
df.to_csv('output.csv', index=False) # Set index=False to exclude the row numbers in the output
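
Other formats work in much the same way. For instance, Excel files can be read and written with `pd.read_excel()` and `.to_excel()` (this requires an Excel engine such as openpyxl to be installed; the file names below are just placeholders):

# Reading data from an Excel file (requires an engine such as openpyxl)
df = pd.read_excel('data.xlsx')
# Writing data to an Excel file
df.to_excel('output.xlsx', index=False)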

Does Pandas provide functions for summary statistics?

Pandas provides functions for basic data summaries and statistics. For example:

# Summary statistics
print(df.describe())
# Grouping and aggregation
grouped_data = df.groupby('City')['Age'].mean()

How can you handle missing data in pandas?

Pandas provides tools to handle missing data, such as `.dropna()` to remove rows with missing values and `.fillna()` to fill missing values with specific values.

import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Drop rows with missing values
df_dropna = df.dropna()

# Fill missing values with a specific value
df_fillna = df.fillna(0)

print("Original DataFrame:")
print(df)
print("DataFrame after dropping rows with missing values:")
print(df_dropna)
print("DataFrame after filling missing values with 0:")
print(df_fillna)

What is the NumPy library?

NumPy is a fundamental Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of functions to operate on these arrays efficiently. NumPy is the foundation of many other Python data science libraries and is widely used for tasks like mathematical operations, data manipulation, and scientific computing. Here’s an overview of the NumPy library and its key features.

NumPy library for numerical computation in Python (Wikipedia)

How can you install NumPy?

If you don’t have NumPy installed, you can install it using pip:

pip install numpy

How can you import NumPy?

After installation, you can import NumPy into your Python script or Jupyter Notebook using the conventional alias `np`:

import numpy as np

How do you create NumPy arrays?

The primary data structure in NumPy is the `ndarray` (N-dimensional array), which can be created using various methods like `numpy.array()`, `numpy.zeros()`, `numpy.ones()`, `numpy.arange()`, etc. For example:

# Create a 1D array
arr1d = np.array([1, 2, 3, 4, 5])
# Create a 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
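# Create a 2x3 array filled with zeros
zeros_arr = np.zeros((2, 3))
# Create a 1D array of four ones
ones_arr = np.ones(4)
# Create a 1D array with the values 0, 2, 4, 6, 8
arange_arr = np.arange(0, 10, 2)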

What are some array operations that could be performed by NumPy?

NumPy arrays support element-wise operations and broadcasting, making it easy to perform mathematical operations on entire arrays at once. For example:

# Element-wise addition, adds 10 to each and every entry of the array
result = arr1d + 10
# Element-wise multiplication, multiplies each and every entry in the array by 2
result = arr1d * 2

What are some array attributes and methods?

NumPy arrays come with various attributes and methods that provide useful information and functionalities. For instance:

# Shape of the array
shape = arr2d.shape
# Maximum value in the array
max_value = arr2d.max()
# Transpose of the array
transpose_arr = arr2d.T

What are the indexing and slicing methods in the NumPy library?

Similar to Python lists, you can access elements in NumPy arrays using indexing and slicing. Slicing allows you to specify a range of indices to obtain a subset of the original sequence. For example:

# Accessing specific element
element = arr1d[2]
# Slicing a 1D array
sliced_arr = arr1d[1:4]
# Slicing a 2D array
sliced_arr = arr2d[0:2, 1:3]
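# arr2d[0:2, 1:3] keeps rows 0-1 and columns 1-2, giving the 2x2 array [[2, 3], [5, 6]]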

What is array broadcasting in NumPy?

NumPy allows arrays with different shapes to be combined or operated upon through a process called broadcasting, which automatically stretches the smaller array across the larger one so that the shapes become compatible. For example:

# Broadcasting example
arr1 = np.array([1, 2, 3])
arr2 = np.array([[10], [20], [30]])
result = arr1 + arr2 # Broadcasting the addition operation
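# arr1 has shape (3,) and arr2 has shape (3, 1), so the result is broadcast to shape (3, 3)
print(result.shape)  # (3, 3)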

What are some mathematical and statistical functions?

NumPy provides a wide range of mathematical and statistical functions to operate on arrays efficiently, like `numpy.sum()`, `numpy.mean()`, `numpy.median()`, `numpy.std()`, etc.

# Define an example array to operate on
arr = np.array([1, 2, 3, 4, 5])

# Calculate the sum of all elements in the array
sum_result = np.sum(arr)
print("Sum:", sum_result)

# Calculate the mean (average) of all elements in the array
mean_result = np.mean(arr)
print("Mean:", mean_result)

# Calculate the median of all elements in the array
median_result = np.median(arr)
print("Median:", median_result)

# Calculate the standard deviation of all elements in the array
std_result = np.std(arr)
print("Standard Deviation:", std_result)

Does NumPy deal with linear algebra?

NumPy includes linear algebra operations, such as matrix multiplication (`numpy.dot()` or `@` operator) and solving linear systems of equations (`numpy.linalg.solve()`).

# Two 2x2 matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Using numpy.dot() function
result_dot = np.dot(matrix1, matrix2)
print("Matrix Multiplication using numpy.dot():")
print(result_dot)

# Using @ operator for matrix multiplication
result_operator = matrix1 @ matrix2
print("\nMatrix Multiplication using @ operator:")
print(result_operator)

# Define the coefficients and the constants of your equations
# Mathematically, this is equivalent to 2x+3y=8 and x-2y=1
coefficients = np.array([[2, 3], [1, -2]])
constants = np.array([8, 1])

# Solving the linear system of equations using numpy.linalg.solve() function
solution = np.linalg.solve(coefficients, constants)
print("\nSolution of the linear system of equations:")
print(solution)
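# The printed solution is approximately [2.714, 0.857], i.e. x = 19/7 and y = 6/7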

How can you generate random numbers using NumPy?

NumPy has a submodule for random number generation (`numpy.random`) that allows you to generate random data, samples, and distributions.

# Generates a random integer between 0 and 9
random_int = np.random.randint(10)
print("Random integer:", random_int)

# Generates an array of 5 random integers between 0 and 9
random_integers = np.random.randint(10, size=5)
print("Array of random integers:", random_integers)

# Generates a random float between 0 and 1
random_float = np.random.rand()
print("Random float:", random_float)

# Generates a 2x3 array of random floats between 0 and 1
random_floats = np.random.rand(2, 3)
print("2x3 array of random floats:")
print(random_floats)

# Generates a random sample from a standard normal (Gaussian) distribution with mean 0 and standard deviation 1
random_normal = np.random.randn()
print("Random sample from normal distribution:", random_normal)

# Generates an array of 5 random samples from a standard normal distribution N~(0,1)
random_normals = np.random.randn(5)
print("Array of random samples from normal distribution:")
print(random_normals)

How do NumPy and Pandas work together?

NumPy and Pandas are two essential libraries that work together seamlessly in data science workflows. NumPy provides the foundation for data storage and manipulation through its multi-dimensional array (`ndarray`) data structure, while Pandas builds on top of NumPy to offer more high-level data structures like DataFrame and Series, which are specifically designed for data analysis and manipulation.

import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a Pandas DataFrame from the NumPy array
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

print("NumPy array:")
print(data)
print("\nDataFrame:")
print(df)
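
The conversion also works in the other direction: a DataFrame can be turned back into a NumPy array with its `.to_numpy()` method. For example:

# Convert the DataFrame back into a NumPy array
arr_back = df.to_numpy()
print("\nBack to a NumPy array:")
print(arr_back)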

Here’s a practical example of how NumPy and Pandas work together in data manipulation:

Let’s say you have a dataset of student information stored in a CSV file called “student_data.csv”. The dataset contains columns like ‘Name’, ‘Age’, ‘Gender’, ‘Math_Score’, and ‘Science_Score’. You want to read this data, perform some data manipulations, extract specific information from the dataset, and create a new DataFrame containing only male students with scores above the average.

# Import necessary libraries
import numpy as np
import pandas as pd

# Step 1: Reading the data using Pandas
df = pd.read_csv('student_data.csv')

# Step 2: Extracting specific columns and converting them to NumPy arrays
math_scores = np.array(df['Math_Score'])
science_scores = np.array(df['Science_Score'])

# Step 3: Performing data manipulations using NumPy
# For example, let's calculate the average math and science scores
average_math_score = np.mean(math_scores)
average_science_score = np.mean(science_scores)

# Step 4: Adding new calculated columns to the DataFrame
df['Average_Math_Score'] = average_math_score
df['Average_Science_Score'] = average_science_score

# Step 5: Applying conditions and filtering the DataFrame
# Let's create a new DataFrame that includes only male students with scores above the average
male_students_above_avg = df[(df['Gender'] == 'Male') & ((df['Math_Score'] > average_math_score) | (df['Science_Score'] > average_science_score))]

# Step 6: Saving the modified DataFrame to a new CSV file
df.to_csv('modified_student_data.csv', index=False)
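# Optionally, the filtered DataFrame could be saved as well (the file name here is just an example)
male_students_above_avg.to_csv('male_students_above_avg.csv', index=False)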

In this example, we used Pandas and NumPy together to turn raw student data into meaningful insights.

Conclusion

To summarize, data manipulation is essential for efficiently managing any dataset, and Pandas and NumPy help you work with clean, well-prepared data. Now it’s time to move on to the next phase: data visualization. Stay tuned for the upcoming article on data visualization approaches, which will help you extract even more insight from your data.

