# A Gentle Introduction to Pandas

Dec 19, 2018 · 5 min read

I hope you had an exciting intro to NumPy, because Pandas builds on it. What is Pandas? Forget the song ‘Panda’ by American hip-hop artist Desiigner, or the animal (sorry, Kung Fu Panda enthusiasts): this is a Python library that offers powerful and flexible data structures that make data manipulation and analysis easy. The name is often expanded as “Python Data Analysis Library”, and according to Wikipedia it is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.

Pandas has historically offered three data structures: Series, DataFrames and Panels.

1. A Series is a one-dimensional labelled array that can hold data of any type (integers, strings, floats, Python objects, etc.). Its axis labels are collectively called an index.
2. A DataFrame is a two-dimensional labelled data structure with columns of potentially different types.
3. A Panel is three-dimensional. Panels were deprecated in pandas 0.20 and removed in later releases, so this post does not discuss them.
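To make the dimensionality described above concrete, here is a minimal sketch that builds one Series and one DataFrame and inspects their `ndim` attribute. The values and column names are made up for illustration.

```python
import pandas as pd

# Illustrative values only; not taken from any dataset in this post.
s = pd.Series([10, 20, 30])
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

print(s.ndim)    # number of dimensions of the Series
print(df.ndim)   # number of dimensions of the DataFrame
print(s.index)   # the axis labels of a Series are its index
```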

# Series

The syntax is: `pd.Series(data, index, dtype, copy)`

```python
import pandas as pd
import numpy as np

data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data)
print(names)
```

Output:

```
0       Tom
1     Jerry
2      Nick
3     Harry
4      Ruth
5    Gloria
dtype: object
```
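The `index` and `dtype` arguments in the syntax above are optional. As a rough sketch with made-up scores and labels, passing them explicitly looks like this:

```python
import pandas as pd

# Hypothetical scores; dtype asks pandas to store the integers as floats,
# and index replaces the default 0, 1, 2 labels.
scores = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="float64")

print(scores)
print(scores.dtype)
```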

## Create a Series from an array with an index

```python
import pandas as pd
import numpy as np

data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data, index=[100, 101, 102, 103, 104, 105])
print(names)
```

Output:

```
100       Tom
101     Jerry
102      Nick
103     Harry
104      Ruth
105    Gloria
dtype: object
```

Notice the difference in the numbering: the index changed from 0–5 to 100–105.

## Create a Series from a dictionary

```python
import pandas as pd
import numpy as np

data = {
    'student': ['Tom', 'Jerry', 'Gloria', 'Hillary'],
    'age': [21, 34, 45, 67],
    'gender': ['Male', 'Female', 'Female', 'Male'],
}
Student = pd.Series(data)
print(Student)
```

Output:

```
student    [Tom, Jerry, Gloria, Hillary]
age                     [21, 34, 45, 67]
gender      [Male, Female, Female, Male]
dtype: object
```
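In the example above each dictionary value is a whole list, so each element of the Series holds a list. A more common pattern is a dictionary of scalar values, where the keys become the index. A small sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages keyed by student name; the keys become the index labels.
ages = pd.Series({"Tom": 21, "Jerry": 34, "Gloria": 45})

print(ages)
print(ages["Jerry"])  # look up a value by its label
```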

## Create a Series from Scalar

```python
import pandas as pd
import numpy as np

score = pd.Series(10, index=['A', 'B', 'C'])
print(score)
```

Output:

```
A    10
B    10
C    10
dtype: int64
```

## Accessing data from series with position

```python
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array([1, 2, 3, 4, 5])
>>> position = pd.Series(data)
>>> position[0]  # first element in the array
1
>>> position[:3]  # first three elements in the array
0    1
1    2
2    3
dtype: int64
>>> position[-1:]  # the last element in the array
4    5
dtype: int64
```
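The lookups above work neatly because the default index (0, 1, 2, ...) happens to match the positions. When the labels differ from the positions, it can be clearer to say which one you mean with `.iloc` (position) or `.loc` (label). A minimal sketch, with labels chosen purely for illustration:

```python
import pandas as pd

# Illustrative Series whose labels differ from its positions.
s = pd.Series([1, 2, 3, 4, 5], index=[100, 101, 102, 103, 104])

first_by_position = s.iloc[0]  # element at position 0
first_by_label = s.loc[100]    # element with label 100
last_two = s.iloc[-2:]         # last two elements by position

print(first_by_position, first_by_label)
print(last_two)
```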

# DataFrames

Those who have used R before can relate to this data structure, as it was inspired by R’s own data frames. A DataFrame can be built from another Pandas DataFrame, from Series, from a NumPy array, or from a dictionary of one-dimensional arrays or lists.
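As a rough illustration of those input types, the sketch below builds the same small table three ways. The column names `x` and `y` and the values are invented for the example.

```python
import numpy as np
import pandas as pd

# From a dictionary of lists: keys become column names.
from_dict = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# From a 2-D NumPy array: column names are supplied separately.
from_array = pd.DataFrame(np.array([[1, 3], [2, 4]]), columns=["x", "y"])

# From a dictionary of Series: each Series becomes a column.
from_series = pd.DataFrame({"x": pd.Series([1, 2]), "y": pd.Series([3, 4])})

print(from_dict)
print(from_array)
print(from_series)
```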

Let’s go ahead and get our hands dirty.

```python
import pandas as pd
import numpy as np

data = {
    'name': ['Kwadwo', 'Nana', 'Kwame', 'Naa'],
    'age': [20, 19, 22, 21],
    'favorite_color': ['red', 'orange', 'green', 'purple'],
    'grade': [67, 78, 90, 12],
}
df = pd.DataFrame(data)
print(df)
```

Output:

```
     name  age favorite_color  grade
0  Kwadwo   20            red     67
1    Nana   19         orange     78
2   Kwame   22          green     90
3     Naa   21         purple     12
```

```python
>>> df.columns
Index(['name', 'age', 'favorite_color', 'grade'], dtype='object')
>>> df.values
array([['Kwadwo', 20, 'red', 67],
       ['Nana', 19, 'orange', 78],
       ['Kwame', 22, 'green', 90],
       ['Naa', 21, 'purple', 12]], dtype=object)
>>> df.shape
(4, 4)
>>> df.dtypes
name              object
age                int64
favorite_color    object
grade              int64
dtype: object
```

From the above outputs we can tell that our data is a 4 by 4 table containing four rows and four columns. We can also tell the data type of each column. Notice that the columns are homogeneous, meaning that each holds entries of a single type: the grade column, for example, only holds grades, which are integers. (We don’t expect to see a string in that column, as that would be a data entry error, since grades are recorded as numbers in this scenario.)
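One way to see why a stray string matters is to compare the dtype of a clean column with one that contains a mistake. The sketch below uses a hypothetical bad entry (`"ninety"`) and `pd.to_numeric` with `errors="coerce"` to flag it; this is just one possible check, not the only approach.

```python
import pandas as pd

clean = pd.Series([67, 78, 90, 12])
dirty = pd.Series([67, 78, "ninety", 12])  # hypothetical data entry error

print(clean.dtype)  # an integer dtype
print(dirty.dtype)  # falls back to the generic object dtype

# Values that cannot be parsed as numbers become NaN, which isna() flags.
is_bad = pd.to_numeric(dirty, errors="coerce").isna()
print(dirty[is_bad])
```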

## Sorting

We can sort the rows of a DataFrame using one of its columns as the key. In this instance we will sort by age in ascending order.

```python
>>> df.sort_values(by='age')
     name  age favorite_color  grade
1    Nana   19         orange     78
0  Kwadwo   20            red     67
3     Naa   21         purple     12
2   Kwame   22          green     90
```
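Sorting can also go in the other direction. The sketch below rebuilds a trimmed copy of the table so it runs on its own, sorts in descending order with `ascending=False`, and shows that `sort_values` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# Oldest first: descending order on the age column.
oldest_first = df.sort_values(by="age", ascending=False)

# Sort by grade and renumber the rows 0..n-1 afterwards.
by_grade = df.sort_values(by="grade").reset_index(drop=True)

print(oldest_first)
print(by_grade)
print(df)  # the original row order is untouched
```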

## Slicing

```python
>>> df[['age', 'grade']]  # Display the age and grade columns only
   age  grade
0   20     67
1   19     78
2   22     90
3   21     12
>>> df['age']  # Display the age column only
0    20
1    19
2    22
3    21
Name: age, dtype: int64
```

When accessing data, you can select either by label or by position.

## Selection by label

```python
>>> df.loc[:2]  # Display the first three rows (label slices include the end label)
     name  age favorite_color  grade
0  Kwadwo   20            red     67
1    Nana   19         orange     78
2   Kwame   22          green     90
>>> df.loc[2:]  # Display the last two rows
    name  age favorite_color  grade
2  Kwame   22          green     90
3    Naa   21         purple     12
```

## Selection by position

```python
>>> df.iloc[3]
name                 Naa
age                   21
favorite_color    purple
grade                 12
Name: 3, dtype: object
>>> df.iloc[2:4, 0:1]
    name
2  Kwame
3    Naa
>>> df[df['age'] > 20]  # Display the rows that have age greater than 20
    name  age favorite_color  grade
2  Kwame   22          green     90
3    Naa   21         purple     12
```
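The difference between label and position is easy to miss while the labels are still 0, 1, 2, 3. The following sketch, on a trimmed copy of the table, highlights two points: `.loc` slices include the end label while `.iloc` slices exclude the end position, and after sorting, label 0 and position 0 no longer point at the same row.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
})

rows_by_label = df.loc[:2]      # labels 0, 1 and 2
rows_by_position = df.iloc[:2]  # positions 0 and 1 only

# Labels travel with their rows when the DataFrame is sorted.
sorted_df = df.sort_values(by="age")
name_at_label_0 = sorted_df.loc[0, "name"]
name_at_position_0 = sorted_df.iloc[0]["name"]

print(len(rows_by_label), len(rows_by_position))
print(name_at_label_0, name_at_position_0)
```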

## Summary statistics

We can compute summary statistics on our data. Only the numeric (integer or float) columns are summarized.

```python
>>> df['age'].mean()  # Get the mean of the age column
20.5
>>> df['age'].std()  # Display the standard deviation
1.2909944487358056
>>> df['age'].min()  # Display the minimum value
19
>>> df['age'].max()  # Display the maximum value
22
>>> df['age'].var()  # Display the variance
1.6666666666666667
```

Alternatively, we can display all of these statistics at once with a single line of code:

```python
>>> df.describe()
             age      grade
count   4.000000   4.000000
mean   20.500000  61.750000
std     1.290994  34.471002
min    19.000000  12.000000
25%    19.750000  53.250000
50%    20.500000  72.500000
75%    21.250000  81.000000
max    22.000000  90.000000
```
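If only a few of those statistics are needed, `agg` accepts a list of function names and returns just those rows. A minimal sketch on the same age and grade values:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# One row per requested statistic, one column per numeric column.
summary = df.agg(["mean", "median", "min", "max"])

print(summary)
```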

# Importing and exporting data

We have learnt how to create data in Pandas. But what if you have a dataset that you want to import or export? The dataset could be in any of several formats: .csv, .txt, .xlsx, .json, etc.

Take, for instance, a CSV file called Lemonade.csv containing 365 rows and 11 columns.

```python
data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
print(data)
```

The contents of ‘Lemonade.csv’ are now loaded into a DataFrame named ‘data’, which holds the same 365 rows and 11 columns as the file.
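If you don't have the file at hand, the same call can be tried on an in-memory string. The sketch below uses `io.StringIO` as a stand-in for a file on disk; the column names and values are invented and are not the real Lemonade columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; columns are invented.
csv_text = """Date,Temperature,Sales
2017-01-01,27.0,10
2017-01-02,28.9,13
2017-01-03,34.5,15
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (rows, columns)
print(data.head(2))  # first two rows
print(data.dtypes)   # the type pandas inferred for each column
```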

You can also write a file (also called exporting). Here we subset the Lemonade data that we imported earlier by picking the first five rows and storing them in a new variable called ‘new’, then export ‘new’ as ‘export.csv’, saving it in the Downloads folder.

```python
import pandas as pd

data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
new = data.head(5)  # keep only the first five rows
new.to_csv('/home/wilson/Downloads/export.csv')
```
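By default `to_csv` also writes the row labels as an unnamed first column, which then shows up as an extra column when the file is read back. Passing `index=False` leaves them out. A small sketch with invented data, writing to strings instead of files so it runs anywhere:

```python
import io

import pandas as pd

# Hypothetical data standing in for the Lemonade rows.
data = pd.DataFrame({"Day": ["Mon", "Tue", "Wed"], "Sales": [10, 13, 15]})
new = data.head(2)

with_index = new.to_csv()                # no path given: returns the CSV text
without_index = new.to_csv(index=False)  # same, minus the row labels

# Reading the index-free text back gives the original columns.
restored = pd.read_csv(io.StringIO(without_index))

print(with_index)
print(without_index)
print(restored)
```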

# Exercise

Go to https://github.com/wbusaka/Lemonade and pull the Lemonade dataset in .xlsx format. It contains 366 rows and 11 columns. The data comes from a lemonade stand’s sales, collected from January to December 2017.

1. Open the ‘Lemonade.xlsx’ file in MS Excel.
2. Convert the ‘Lemonade.xlsx’ dataset to ‘.csv’ and save it on your computer.
3. Import Pandas.
4. Import the new ‘Lemonade.csv’ file into Python.
5. Slice the dataset from 245 to 321 and store it in a new variable called ‘sliced_data’.
6. Filter the data to obtain only the month of ‘August’ and store it in a new variable called ‘August’.
7. Compute summary statistics for the month of August (i.e. mean, max, median, standard deviation, etc.).

• As always, post the result on social media (Twitter, Facebook or Instagram) to receive a confirmation or assistance :-)

NB: This exercise utilizes concepts in this tutorial.

# Conclusion

Pandas is the real deal when it comes to data science, as most real-world data is stored in a tabular format (rows and columns). As they say, learning is infinite as long as you’re breathing.

https://pandas.pydata.org/pandas-docs/version/0.22/cookbook.html#cookbook

Stay Tuned!
