Python Pandas:- Part 1

ayush shekhar
Published in Analytics Vidhya
Dec 27, 2020

(Advice:- It’s always good to know at least the basics of NumPy and Matplotlib before jumping into Pandas.)

Key Contents:-

  1. 3 W’s of Pandas (What/ Why/ Where)
  2. Installing Pandas
  3. Data Structures (Data Frame, Series)- Basic
  4. Important Questions

3 W’s of Pandas:-

What is Pandas?

Pandas is an open-source, BSD-licensed library that provides easy-to-use data structures and data analysis and manipulation tools for the Python programming language. It’s built on top of NumPy (used for mathematical operations on arrays), and its plotting methods rely on matplotlib (used for data visualization), which is an optional dependency.

Pandas is like the Excel of Python. A DataFrame is analogous to a table, and a Series is analogous to a single column (or row) of that table.

Why Pandas?

Pandas is built on top of NumPy, and some parts of it are implemented in C, which makes many operations much faster than equivalent pure-Python code. In the words of the Pandas docs, it provides “fast, flexible, and expressive data structures designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive.” These features let you spend less time on computation and more on analyzing data and building models.

Where Pandas?

You can use Pandas almost anywhere you are dealing with data. If you are working on a project where you are going to visualize data, analyze it and perform operations on it, you will most probably love using Pandas.

Installation:-

Note:- The Pandas release this was written against officially supports Python 3.6.1 and above; newer releases require newer Python versions, so check the installation docs for the version you install.

There are multiple ways to install and use Pandas. The easiest and most recommended way is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. Another widely used way is to create a virtual environment and install Pandas in it.

Create virtual environment

(Assuming that you have already installed python3, pip and virtualenv)

cd my-project                # go to the project directory
virtualenv -p python3 venv   # create the virtual env
source venv/bin/activate     # activate the virtual env
pip install pandas           # install pandas

It’s advisable to use Jupyter Notebook for learning Pandas, as it makes it easy to inspect the data at each step, which simplifies debugging and saves time.

Installing Jupyter Notebook:-

pip install jupyter   # install Jupyter Notebook
jupyter notebook      # launch Jupyter Notebook

Data Structures:-

The fundamental data structures in pandas are DataFrame and Series. The fundamental behaviour around data types, indexing, and axis labelling/alignment applies across all of these objects. To get started, import NumPy and pandas into your namespace (conventionally as np and pd).

Series

It’s a 1-D labelled array capable of holding any data type, e.g. integers, strings, floating point numbers, Python objects, etc. The data can be a Python dictionary, an ndarray, a scalar value, etc. The axis labels are collectively called the index.

Syntax :- s = pd.Series(data, index=index)
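
As a minimal sketch of the syntax above, the conventional imports and a first Series could look like this. The values and labels are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([10, 20, 30])     # hypothetical values
index = ["x", "y", "z"]           # one label per value

s = pd.Series(data, index=index)
print(s)
```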

Instantiation of Series from:-

a) python dictionary

Example 1) When we don’t give index arg to Series method

Series- Dictionary Instantiation 1

Notice:- We have not passed the index arg, but the keys of the dictionary are still used as the index.
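
A small sketch of this case, assuming a hypothetical dictionary:

```python
import pandas as pd

data = {"a": 1, "b": 2, "c": 3}   # hypothetical dictionary

# No index arg: the dictionary keys become the index.
s = pd.Series(data)
print(s)
```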

Example 2) When we pass the index arg with limited keys

Series- Dictionary Instantiation 2

Notice:- Here we have passed the index arg to the Series method. When working on large datasets, a dictionary often holds a lot of unwanted information, and this approach gives us a Series with only the keys we need.
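
For illustration, with a hypothetical dictionary of four keys of which only two are wanted:

```python
import pandas as pd

data = {"a": 1, "b": 2, "c": 3, "d": 4}   # hypothetical dictionary

# Only the keys listed in index are kept, in the order given.
s = pd.Series(data, index=["a", "c"])
print(s)
```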

Example 3) When we pass index keys which are not part of dictionary

Series- Dictionary Instantiation 3

Notice:- Here we can observe that “b” is one of the index keys passed, but the dictionary doesn’t have that key. Hence, the Series holds NaN for the “b” key.

Note:- For Python version ≥ 3.6 and Pandas version ≥ 0.23, the Series index will be ordered by the dictionary’s insertion order, but for Python version < 3.6 or Pandas version < 0.23, the Series index will be the lexically ordered list of dictionary keys.
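
A sketch of this behaviour, assuming a hypothetical dictionary that has no “b” key:

```python
import pandas as pd

data = {"a": 1, "c": 3}   # hypothetical dictionary without a "b" key

# "b" is requested but missing from the dictionary, so its value is NaN.
# The values become floats because NaN is a floating point value.
s = pd.Series(data, index=["a", "b", "c"])
print(s)
```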

b) ndarray

Example 1) When we don’t give index arg to Series method

Series -ndarray Instantiation 1

Notice:- In case of ndarray, if no index is passed, one will be created with the values [0, ..., len(data) - 1]
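
For example, with a hypothetical four-element array:

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # hypothetical values

# No index arg: pandas assigns the default index 0, 1, 2, 3.
s = pd.Series(arr)
print(s)
```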

Example 2) When we pass index arg to series

Series- ndarray Instantiation 2
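
A minimal sketch of passing an index alongside an ndarray. The labels are hypothetical, and there is one label per value.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # hypothetical values

# The index list has exactly as many labels as the array has values.
s = pd.Series(arr, index=["a", "b", "c", "d"])
print(s)
```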

Example 3) When the index list has fewer keys than the length of the ndarray

Series- ndarray Instantiation 3
Series- ndarray Instantiation 4

Notice:- This raises an error because we have taken over the job of assigning the index from pandas, and the index we passed does not have the same length as the data.
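
The sketch below reproduces the error with hypothetical values. In the pandas versions I am aware of the exception is a ValueError, though the exact message text varies between versions.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # four hypothetical values

raised = False
try:
    # Only two labels for four values: the lengths do not match.
    pd.Series(arr, index=["a", "b"])
except ValueError as err:
    raised = True
    print("ValueError:", err)
```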

c) Scalar Value

In case of a scalar value, an index must be provided, and the value is repeated to match the length of the index.

Series- Scalar Value Instantiation
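
A short sketch with a hypothetical scalar and index:

```python
import pandas as pd

# The scalar 5.0 is repeated once for every label in the index.
s = pd.Series(5.0, index=["a", "b", "c"])
print(s)
```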

We can treat Series like a dictionary or ndarray.

Series Instantiation using numpy array

*np.random.randn(4) creates a numpy.ndarray of length 4 filled with samples from the standard normal distribution

Series like an ndarray:-

Fetching Series value using indexing/slicing
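
A sketch of ndarray-style access on a Series built from np.random.randn(4). The labels are hypothetical. Position-based access is written with .iloc here, which is an assumption about style: plain s[0] on a labelled index is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(4), index=["a", "b", "c", "d"])

first = s.iloc[0]                   # single value by position
first_two = s.iloc[:2]              # slice by position, labels are kept
above_median = s[s > s.median()]    # boolean mask, as with an ndarray
exp_s = np.exp(s)                   # NumPy functions return a Series

print(first_two)
print(above_median)
```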

Series like a dictionary:-

Fetching Series value using key
Fetching key that doesn’t exist

Fetching a key that doesn’t exist in the Series raises a “KeyError”

KeyError
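
A sketch of dictionary-style access with hypothetical labels. The get() call is shown as one way to avoid the KeyError when a label may be missing.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])   # hypothetical data

value = s["a"]          # fetch by key, like a dictionary
has_z = "z" in s        # membership test checks the index labels

got_key_error = False
try:
    s["z"]              # "z" is not in the index
except KeyError:
    got_key_error = True

fallback = s.get("z")   # returns None instead of raising
print(value, has_z, got_key_error, fallback)
```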

One of the key differences between ndarray and Series is that operations between Series automatically align the data based on the labels. Hence, we don’t need to consider whether the Series involved have the same labels or not. The result of an operation between unaligned Series has the union of the indexes involved. If a label is missing from either Series, the result for that label is NaN. We can drop labels with missing data via the dropna function.
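
To illustrate the alignment, here is a sketch that adds two hypothetical Series whose labels only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position.
# "a" and "d" each exist in only one Series, so their result is NaN.
total = s1 + s2
print(total)

# dropna() removes the labels whose value is missing.
clean = total.dropna()
print(clean)
```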

Name attribute:-

Series can have a name attribute.

Name attribute in Series
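
A minimal sketch, with a hypothetical name:

```python
import pandas as pd

# The name can be set at creation time and read back via .name.
s = pd.Series([1, 2, 3], name="numbers")
print(s.name)
```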

We can also rename a Series using the pandas.Series.rename() method.

Renaming Series
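
A sketch of renaming with hypothetical names. With a scalar argument and the default settings, rename() returns a new Series and leaves the original name untouched.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="numbers")

# rename() returns a new Series carrying the new name.
renamed = s.rename("values")
print(s.name, renamed.name)
```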

DataFrame

It’s a 2-D labeled data structure with columns of potentially different types, like a spreadsheet, a SQL table, or a dictionary of Series objects. It accepts different kinds of input: a dictionary of 1-D ndarrays, lists, dicts or Series; a 2-D numpy.ndarray; a structured or record ndarray; or another DataFrame. Passing the index (row labels) and columns (column labels) args is optional. If you pass them, they determine the index and columns of the result, so a dictionary of Series plus a specific index will discard all data not matching the passed index.

Instantiation of DataFrame from:-

a) Dictionary of Series or Dictionary

If no index is passed, the resulting index is the union of the indexes of the Series, and if no columns are passed, the columns are the ordered list of dictionary keys. Nested dictionaries are converted to Series first. Row and column labels can be accessed through the index and columns attributes respectively. When a DataFrame is built from a dictionary of ndarrays or lists instead, all of them must be of the same length, and if an index is passed as an arg it must be of that same length too. If no index is passed, the resulting index will be [0, ..., n - 1], where n is the length of the arrays.

DataFrame Instantiation using dictionary of Series

Notice:- Here, we have created a dictionary of Series and passed it as the only arg for DataFrame instantiation. So the index is the union of the indexes of the Series, and the columns are the dictionary keys. The value at index “e” in the first_series column is NaN because the first_series Series has no “e” label.
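
A sketch consistent with the note above. The key first_series comes from the text; second_series and all values are hypothetical.

```python
import pandas as pd

data = {
    # first_series has no "e" label
    "first_series": pd.Series([1.0, 2.0, 3.0, 4.0],
                              index=["a", "b", "c", "d"]),
    "second_series": pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
                               index=["a", "b", "c", "d", "e"]),
}

# Index = union of both Series indexes; columns = dictionary keys.
df = pd.DataFrame(data)
print(df)
```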

Now we will pass the explicit indexes and columns:-

Passed Index and column arg
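
A sketch of passing explicit index and columns args, with hypothetical data. Rows not listed in index are dropped, and a requested column that is not in the dictionary (third_series here, a made-up name) is filled with NaN.

```python
import pandas as pd

data = {
    "first_series": pd.Series([1.0, 2.0, 3.0, 4.0],
                              index=["a", "b", "c", "d"]),
    "second_series": pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
                               index=["a", "b", "c", "d", "e"]),
}

df = pd.DataFrame(
    data,
    index=["a", "c", "e"],                        # keep only these rows
    columns=["second_series", "third_series"],    # third_series is not in data
)
print(df)
```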

Getting Index and Columns of a DataFrame

Getting Index and Columns
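
For example, with a hypothetical DataFrame built from a dictionary of lists:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3], "two": [4, 5, 6]},   # hypothetical data
    index=["a", "b", "c"],
)

print(df.index)     # row labels
print(df.columns)   # column labels
```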

b) From structured or record array

DataFrame Instantiation using structured or record array
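
A sketch using a NumPy structured array. The field names A, B, C and the values are hypothetical. Each field of the array becomes a column.

```python
import numpy as np
import pandas as pd

# A structured array: every element has three named, typed fields.
records = np.array(
    [(1, 2.0, "Hello"), (2, 3.0, "World")],
    dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")],
)

df = pd.DataFrame(records)
print(df)
```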

c) From list of dictionaries

DataFrame Instantiation using list of dictionaries
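
A sketch with a hypothetical list of dictionaries. Each dictionary becomes a row, and a key missing from a row shows up as NaN.

```python
import pandas as pd

rows = [
    {"a": 1, "b": 2},            # no "c" key in this row
    {"a": 5, "b": 10, "c": 20},
]

df = pd.DataFrame(rows)
print(df)
```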

Column Selection, Addition and Deletion

A DataFrame can be treated like a dictionary of like-indexed Series objects. Getting, setting and deleting columns uses the same syntax as the analogous dictionary operations.

i) Column Selection

Selecting values based on Column
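
A sketch of selecting a column, assuming a hypothetical DataFrame with library and license columns:

```python
import pandas as pd

df = pd.DataFrame({
    "library": ["lib_a", "lib_b"],   # hypothetical library names
    "license": ["BSD", "MIT"],       # hypothetical license values
})

# Dictionary-style access returns the column as a Series.
licenses = df["license"]
print(licenses)
```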

ii) Column Addition

Adding extra column(BSD License) in DataFrame
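
A sketch of adding a column by assignment. The data is hypothetical, and deriving the “BSD License” column from the license column is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "library": ["lib_a", "lib_b"],   # hypothetical library names
    "license": ["BSD", "MIT"],       # hypothetical license values
})

# Assigning to a new key adds a column, as with a dictionary.
df["BSD License"] = df["license"] == "BSD"
print(df)
```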

iii) Column Deletion

Columns can be deleted using either pop() or del.

Deleted Column “BSD License” using del
Deleted Column “license” using pop()

Notice:- When deleting a column using pop(), the deleted column is returned to us.
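
A sketch of both ways of deleting, using the same hypothetical DataFrame shape as the column names mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    "library": ["lib_a", "lib_b"],   # hypothetical library names
    "license": ["BSD", "MIT"],       # hypothetical license values
    "BSD License": [True, False],
})

# del removes the column in place and returns nothing.
del df["BSD License"]

# pop() removes the column in place and returns it as a Series.
popped = df.pop("license")

print(df)
print(popped)
```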

Questions:-

  1. Is NumPy included in Pandas?
  2. Which is faster, NumPy or Pandas?
  3. Should I learn NumPy or Pandas?
  4. Why is Pandas so fast?
  5. What is NumPy used for?

Thanks for reading! I hope this helped and served its purpose. Learning is a continuous process, and I would like to learn from you as well. If you think any improvement is required, if I missed anything, or if you have any query, you can reach out to me by mail at iamayushshekhar@gmail.com. If you wish, we can also connect on LinkedIn.
