An Overview of Python Libraries for Data Manipulation: Numpy and Pandas.
Python libraries like Numpy and Pandas give analysts the power to manipulate data with ease by providing sets of tools that can be used to perform a range of actions on data, from organization to arithmetic operations to visualization, among others.
Numpy provides tools for processing multidimensional arrays, while Pandas builds on it to handle labeled, table-like data whose columns can hold different types.
Python Numpy Library
Numpy is a general-purpose array processing package that provides a high-performance multidimensional array object, and tools for working with these arrays. These arrays are called ndarrays and have the following properties:
- Rank — number of dimensions of the array
- Shape — tuple of integers giving the size of the array along each dimension
- dtype — datatype of the array
Create an array with random values
import numpy as np
#this guarantees the code will generate the same set of random numbers whenever executed
np.random.seed(21)
#use np.random to create an ndarray of random integers from 1 up to, but not including, 500000
random_integers = np.random.randint(1, high = 500000, size = (20, 5))
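As a quick check, the three properties listed earlier can be read directly off the array. The sketch below rebuilds random_integers with the same seed so that it runs on its own; the exact integer dtype depends on the platform, so no specific width is assumed.

```python
import numpy as np

# Rebuild the array from the snippet above so this block is self-contained
np.random.seed(21)
random_integers = np.random.randint(1, high=500000, size=(20, 5))

print(random_integers.ndim)   # rank (number of dimensions): 2
print(random_integers.shape)  # size along each dimension: (20, 5)
print(random_integers.dtype)  # element type; a platform-dependent integer type
```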
Numpy is useful for performing operations on whole arrays at once, without writing explicit element-by-element loops. It gives us the power to perform arithmetic operations between arrays, reshape and resize them, and generally play around with the structure of the array.
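A minimal sketch of what this looks like in practice is shown here. The arrays prices and quantities are hypothetical values made up for illustration.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical values
quantities = np.array([1, 2, 3, 4])           # hypothetical values

# Arithmetic is applied to every pair of elements, with no explicit loop
totals = prices * quantities    # array([ 10.,  40.,  90., 160.])

# A scalar is broadcast to every element of the array
discounted = totals - 5

# reshape changes the layout of the array without changing its values
grid = totals.reshape(2, 2)     # 2 rows, 2 columns
flat = grid.ravel()             # back to one dimension
```

The same expressions work unchanged on arrays with more dimensions, as long as their shapes are compatible.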
Python Pandas Library
Pandas revolves around a structure called DataFrame. A dataframe is a 2-dimensional structure, that is, it structures the data along two axes: axis 0 runs down the rows while axis 1 runs across the columns. In short, dataframes provide a table-like structure with columns that could be of different types.
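To make the table-like structure concrete, here is a small sketch with columns of different types. The records are hypothetical; in Pandas, axis=0 aggregates down the rows and axis=1 aggregates across the columns.

```python
import pandas as pd

# Hypothetical records; each dictionary key becomes a column
people = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara"],
    "age": [34, 29, 41],
    "height_m": [1.62, 1.80, 1.75],
})

print(people.shape)    # (3, 3): 3 rows, 3 columns
print(people.dtypes)   # one dtype per column

numeric = people[["age", "height_m"]]

# axis=0 aggregates down the rows, giving one result per column
column_means = numeric.mean(axis=0)

# axis=1 aggregates across the columns, giving one result per row
row_sums = numeric.sum(axis=1)
```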
A dataframe is created by instantiating the pandas DataFrame class, whose constructor is defined as:
pandas.DataFrame(data, index, columns, dtype, copy)
Constructor arguments explained:
- data - the structure from which the dataframe instance is created. This could be a list, ndarray, dictionary, map, Series, or constant.
- index - optional labels for the rows.
- columns - optional labels for the columns.
- dtype - optional datatype to force on the data; only a single datatype is allowed.
- copy - optional flag for copying the input data.
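The following sketch passes every one of these arguments explicitly. The array values and the labels are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])  # hypothetical data

frame = pd.DataFrame(
    data=values,
    index=["a", "b", "c"],        # row labels
    columns=["left", "right"],    # column labels
    dtype="float64",              # convert every column to float64
    copy=True,                    # the frame holds its own copy of the data
)

# The frame owns its data, so this assignment leaves `values` untouched
frame.loc["a", "left"] = 100.0
```

When index and columns are omitted, Pandas numbers the rows and columns from 0.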
Below is a quick example of how to:
- Create a dataframe,
- Derive statistics about the data, and
- Calculate the probability that a dataframe of randomly created values contains duplicates.
Create a dataframe
Going back to the ndarray we created in the Numpy section (random_integers), we can use it to create a dataframe object as follows:
import pandas as pd
#head() limits our view to the first rows of the dataframe (5 by default)
random_int_dataframe = pd.DataFrame(data = random_integers).head()
random_int_dataframe
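This displays the first five rows of the dataframe. Since no labels were supplied, both the rows and the columns are numbered from 0 to 4.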
Let us modify the dataframe by labeling the columns so that they are more meaningful and easier to access. We can do this by creating a list of column labels and passing this list to the columns argument of the constructor:
columns = ["one", "two", "three", "four", "five"]
random_int_dataframe = pd.DataFrame(data = random_integers, columns = columns).head()
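With labels in place, columns and individual values can be selected by name. The sketch below rebuilds the labeled dataframe so that it runs on its own, then shows three common ways of selecting from it.

```python
import numpy as np
import pandas as pd

# Rebuild the labeled dataframe from the snippets above
np.random.seed(21)
random_integers = np.random.randint(1, high=500000, size=(20, 5))
columns = ["one", "two", "three", "four", "five"]
random_int_dataframe = pd.DataFrame(data=random_integers, columns=columns).head()

column_one = random_int_dataframe["one"]            # a whole column, as a Series
single_value = random_int_dataframe.loc[0, "two"]   # row label 0, column "two"
subset = random_int_dataframe[["one", "five"]]      # several columns, as a dataframe
```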
Derive statistics about the data.
We can now derive a high-level summary of statistics about the dataframe at once by calling its describe() method, that is, random_int_dataframe.describe().
describe() lists important statistics like mean, standard deviation, and maximum value per column.
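As a sketch of how the summary can be used, the block below rebuilds the labeled dataframe, calls describe(), and then looks up a single statistic by its row label in the result.

```python
import numpy as np
import pandas as pd

# Rebuild the labeled dataframe from the snippets above
np.random.seed(21)
random_integers = np.random.randint(1, high=500000, size=(20, 5))
columns = ["one", "two", "three", "four", "five"]
random_int_dataframe = pd.DataFrame(data=random_integers, columns=columns).head()

# One row per statistic (count, mean, std, min, quartiles, max), one column per column
summary = random_int_dataframe.describe()
print(summary)

# The summary is itself a dataframe, so statistics can be looked up by label
mean_of_one = summary.loc["mean", "one"]
```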
Calculate the probability that a dataframe of randomly created values contains duplicates.
Another useful operation is estimating how likely it is that some value in the dataframe is duplicated elsewhere in it (recall that np.random.randint() was used to create random values).
#keep a copy of the first entry in random_int_dataframe
some_value = random_int_dataframe.loc[0, 'one']
#overwrite that entry with 0, which randint(1, ...) never produces, so that it is not counted below
random_int_dataframe.loc[0, 'one'] = 0
#count all remaining occurrences of that value across the whole dataframe
sum_all_occurrences = (random_int_dataframe == some_value).sum().sum()
#calculate the probability by dividing the count by the number of entries in the dataframe
probability = sum_all_occurrences/random_int_dataframe.size
Zero, as expected! With only 25 values drawn from almost 500,000 possibilities, a repeat is very unlikely.
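For comparison, the chance that the full 20 by 5 array contains at least one repeated value can be worked out directly, in the manner of the birthday problem. This is a sketch rather than part of the walk-through above; it assumes each entry is drawn independently and uniformly from the 499,999 integers that np.random.randint(1, high = 500000) can return.

```python
import numpy as np

n_draws = 20 * 5          # number of entries in random_integers
n_values = 500000 - 1     # integers from 1 to 499999 (high is exclusive)

# P(all draws distinct) = product over i of (n_values - i) / n_values
p_all_distinct = np.prod((n_values - np.arange(n_draws)) / n_values)
p_any_duplicate = 1 - p_all_distinct
print(p_any_duplicate)    # roughly 0.01

# Empirical check on the seeded array itself
np.random.seed(21)
random_integers = np.random.randint(1, high=500000, size=(20, 5))
has_duplicates = np.unique(random_integers).size < random_integers.size
print(has_duplicates)
```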
Read a CSV into a dataframe
The above derivations are more useful with real-world data, for example, data from a survey, an election, or a registration process. Such data is usually stored in databases, Excel files, or CSV files, among others, and can be read into dataframes.
Take an example of the xyz_hospital_registration.csv file. This can be read into a dataframe as follows:
xyz_hosp_df = pd.read_csv('xyz_hospital_registration.csv')
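Since the contents of xyz_hospital_registration.csv are not shown here, the sketch below uses a small in-memory stand-in. The column names patient_id, age, and department and all of the rows are hypothetical; with a real file, the path would be passed to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of xyz_hospital_registration.csv
csv_text = """patient_id,age,department
1,34,cardiology
2,51,oncology
3,27,cardiology
4,45,pediatrics
"""

# read_csv accepts a file path or any file-like object
xyz_hosp_df = pd.read_csv(io.StringIO(csv_text))

print(xyz_hosp_df.describe())   # summary of the numeric columns

# Share of registrations per department, as an empirical probability
department_share = xyz_hosp_df["department"].value_counts(normalize=True)
print(department_share["cardiology"])   # 0.5
```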
Something to try:
Visit Kaggle and:
- Access one of the public datasets,
- Read it as a dataframe,
- Derive a summary of statistics, and
- Calculate the probability of a value occurring in any variable/column of interest.
Libraries like Numpy and Pandas enable their users to derive useful information from data manipulation. The tools that they provide produce clean, neat, and meaningful output that can be used to make important decisions in everyday life. For more on how to use them, see the Numpy and Pandas documentation.