A step-by-step coding guide to the Pandas library — Part 2

Hello! In the previous article of this series, I covered the basics of the Pandas library. Please check out the first part if you haven’t already. These articles are sequential in nature, so going through them in order will help you build and sharpen your basic knowledge.

It is crucial that you examine your data thoroughly before making any decisions about it. Careful examination leads to better answers and helps you spot the nuances during data analysis.

You are welcome to download my code file and follow along with the article for better understanding.

For more awesome analytics cartoons, visit https://timoelliott.com/blog/

Treat the dataset as a puzzle in front of you and the data analysis methods as the tools you are provided with. Which tools you select, and how you use them, depends on the particular puzzle; in data analysis, you cannot apply every tool to a dataset at once.

Photo by Markus Spiske on Unsplash

You need to assess the problem at hand and choose your method according to the accuracy required.

Developing a data analysis routine is vital if you play with data in your day-to-day work. Although there is no single standard technique for beginning a data analysis, it is recommended that you develop a routine of your own for first examining any dataset.

It will help you build familiarity with a dataset quickly; ideally, it should be a living checklist of tasks that evolves along with your knowledge of pandas and data analysis methods.

Here, I am sharing my own data analysis routine, which beginners can use to build theirs from scratch.

We will cover the following:

  1. Overview
  2. Reducing the memory footprint of a dataset
  3. Selection method — Smallest of the largest
  4. Selection method — Largest of each group
  5. Selection method — Subset of the data
  6. Speeding up the selection process
  7. Slicing Rows with iloc or loc indexers
  8. Slicing Rows lexicographically
  9. Filtering data using boolean indexers
  10. Finding the most common maximum across columns

I. Overview

I am going to explore data analysis methods, focusing on how to filter and select from DataFrames. You can’t work on the whole dataset at once; you need to manipulate the data to draw more insights out of it.

  • Here, I’ll be using the same Netflix and Spotify datasets that I used in my previous article. Let’s load the data.
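Below is a minimal sketch of that loading step. The file names netflix_titles.csv and spotify.csv are placeholders; point read_csv at wherever your copies of the two datasets live.

    import pandas as pd

    # File paths are assumptions; adjust them to your local copies of the datasets
    netflix = pd.read_csv('netflix_titles.csv')
    spotify = pd.read_csv('spotify.csv')

    # A quick glance at the first few rows of each
    netflix.head()
    spotify.head()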

II. Reducing the memory footprint of a dataset

In pandas, integer and float columns default to 64 bits, regardless of the maximum size actually needed for a particular DataFrame. There are two tricks you can use to decrease the size of the data you are dealing with and make your programs faster:

  • Not loading all the data at once, using subsets of the data (we will see this in detail later).
  • Changing the data type of the data.

A data type or simply type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.

  1. int8 / uint8: consumes 1 byte of memory, range -128 to 127 or 0 to 255.
  2. bool: consumes 1 byte, true or false.
  3. float16 / int16 / uint16: consumes 2 bytes of memory; int16 ranges from -32768 to 32767 and uint16 from 0 to 65535.
  4. float32 / int32 / uint32: consumes 4 bytes of memory; int32 ranges from -2147483648 to 2147483647 and uint32 from 0 to 4294967295.
  5. float64 / int64 / uint64: consumes 8 bytes of memory.

The steps shown above, reading from left to right (a code sketch follows the list):

  1. Checking the data types of the Spotify dataset.
  2. Checking the memory usage of the Spotify dataset.
  3. The columns mode, key and target are categorical variables; changing their data type to int8 and saving the result in a new dataset, spotify2.
  4. Checking the memory usage of the new dataset.

Mode, key and target shrink to 1/8 of their original memory usage
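A sketch of those four steps in code, assuming the column names mode, key and target mentioned above; everything else about the dataset is left untouched:

    # 1. and 2. Check the data types and the current memory usage
    print(spotify.dtypes)
    print(spotify.memory_usage(deep=True))

    # 3. mode, key and target hold small categorical codes, so int8 is enough;
    #    store the result in a new DataFrame, spotify2
    spotify2 = spotify.astype({'mode': 'int8', 'key': 'int8', 'target': 'int8'})

    # 4. Check the memory usage of the new DataFrame:
    #    each converted column now needs 1 byte per row instead of 8
    print(spotify2.memory_usage(deep=True))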

III. Selection method — Smallest of the largest

During analysis, you may first need to find a subset of the data containing the top n values of a single column and then, from this subset, find the bottom m values based on a different column. Or perhaps you don’t need the full dataset at all for your analysis, just the top 100 rows.

This recipe can be used to answer questions like “What are the top 100 songs you can dance to?” or “What are the top 100 high-energy songs?”.

Here, I first created a dataset with three continuous variables; chaining the nlargest method produced a dataset of the 50 songs with the highest danceability.

I then chained the nsmallest method to return the five lowest songs, based on a different column, among those top 50.
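A sketch of this recipe, assuming the column names song_title, danceability and energy (only danceability and energy are named explicitly in the article):

    # Keep three continuous variables, then chain nlargest and nsmallest:
    # the 50 most danceable songs, then the 5 lowest-energy songs among them
    result = (spotify[['song_title', 'danceability', 'energy']]
              .nlargest(50, 'danceability')
              .nsmallest(5, 'energy'))
    result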

IV. Selection method — Largest of each group

One of the most common operations in data analysis is selecting the largest value of a column within each group. This helps you answer questions such as “Which is the top-rated Netflix movie?”, “Which is the longest song of the year?” or “Which is the top-rated movie for each director/artist/cast?”.

In step III, I showed how to select the top 50 songs by chaining the nlargest method; we can achieve the same result with the sort_values() method.

You can also highlight the max, min and null values using the .style.highlight_max, .style.highlight_min and .style.highlight_null methods:

The default behaviour of the drop_duplicates method is to keep the first occurrence of each unique row, which would not drop any rows here as each row is unique. However, the subset parameter alters it to consider only the column (or list of columns) given to it. In this example, only one row for each key will be returned.
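A sketch of both ideas, using the Spotify columns named above; the styling call is optional and only renders in a notebook:

    # The same "top 50 by danceability" with sort_values instead of nlargest
    top50 = spotify.sort_values('danceability', ascending=False).head(50)

    # Largest of each group: the most danceable song for every musical key.
    # Sort first, then keep only the first occurrence of each key.
    best_per_key = (spotify.sort_values('danceability', ascending=False)
                           .drop_duplicates(subset='key'))

    # Optional (Jupyter): highlight the maximum of each numeric column
    best_per_key[['danceability', 'energy']].style.highlight_max(color='lightgreen')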

V. Selection method — Subset of the dataset

As discussed above, another way to reduce the memory footprint is not to work with the whole dataset but with subsets of it. Every dimension of data in a Series or DataFrame is labelled through an Index object, and it is this Index that separates pandas data structures from NumPy’s n-dimensional arrays.

Pandas allows its users to select data either by label or by the integer location of the rows and columns. This dual selection capability, one using labels and the other using integer locations, makes for powerful yet sometimes confusing syntax when selecting subsets of data.

I. Subset selection in Series using loc and iloc indexers:

Read the Netflix dataset with title as the index, and select the single column duration as a Series with the indexing operator:

To select several different integer locations I passed a list to .iloc. This returns a Series:

To select an equally spaced partition of data I used slice notation:

To select several disjoint labels I used a list:
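A sketch of these selections; the two titles passed to .loc are placeholders, so substitute titles that exist in your copy of the dataset:

    # Read the Netflix data with title as the index, then grab one column as a Series
    netflix = pd.read_csv('netflix_titles.csv', index_col='title')
    duration = netflix['duration']

    # Several integer locations -> returns a Series
    duration.iloc[[3, 10, 20]]

    # An equally spaced partition with slice notation (every 500th row)
    duration.iloc[::500]

    # Several disjoint labels via a list (placeholder titles)
    duration.loc[['Stranger Things', 'Dark']]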

II. Subset selection in DataFrame using loc and iloc indexers:

The most explicit and preferred way to select DataFrame rows is with the .iloc and .loc indexers. They are capable of selecting rows or columns independently and simultaneously.

Passing an integer to the .iloc indexer selects the entire row at that position.

I can use .loc with the index label to get the same row as in the preceding step.

To select a disjoint set of rows as a DataFrame, I passed a list of integers to the .iloc indexer; a list of index labels can equally be passed to .loc.
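A sketch of these row selections, continuing with title as the index (the labels are placeholders):

    # A single integer returns the entire row at that position, as a Series
    netflix.iloc[2]

    # The same kind of lookup by index label with .loc
    netflix.loc['Dark']

    # A disjoint set of rows as a DataFrame: a list of positions, or of labels
    netflix.iloc[[0, 5, 10]]
    netflix.loc[['Stranger Things', 'Dark']]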

III. Selecting Rows and Columns simultaneously

Directly using the indexing operator is the correct method to select one or more columns from a DataFrame. However, it does not allow you to select both rows and columns simultaneously.

To select rows and columns simultaneously, you will need to pass both valid row and column selections separated by a comma to either the .iloc or .loc indexers.
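For example (the column and label names are assumptions):

    # Row and column selections separated by a comma
    netflix.loc['Dark', 'duration']             # a single scalar value
    netflix.loc[:, ['country', 'duration']]     # all rows, two columns
    netflix.iloc[[0, 5, 10], [1, 3]]            # rows and columns by position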

VI. Speeding up the selection process

Both the .iloc and .loc indexers are capable of selecting a single element, a scalar value, from a Series or DataFrame. However, there are two more indexers, .iat and .at, which achieve the same thing at faster speeds. Like .iloc, the .iat indexer uses integer location to make its selection and must be passed two integers separated by a comma. Similar to .loc, the .at indexer uses labels to make its selection and must be passed an index label and a column label separated by a comma.
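A quick sketch of these faster scalar lookups (the label names are placeholders):

    # .iat takes two integer positions, .at takes a row label and a column label
    netflix.iat[2, 3]                 # third row, fourth column
    netflix.at['Dark', 'duration']    # row label, column label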

VII. Slicing Rows with iloc or loc indexers

A shortcut for selecting rows exists with the indexing operator itself. This is just a shortcut that shows additional flexibility of pandas; the primary purpose of the indexing operator is actually to select DataFrame columns. If you want to select rows, it is best to use .iloc or .loc, as they are unambiguous.
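For instance:

    # Slice notation on the indexing operator selects rows by position (a shortcut)
    netflix[3:9]

    # The unambiguous equivalent
    netflix.iloc[3:9]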

VIII. Slicing Rows lexicographically

The .loc indexer typically selects data based on the exact string labels of the index. However, it also allows you to select data based on the lexicographic order of the values in the index. Specifically, .loc lets you select all rows whose index falls within a lexicographic range using slice notation. This works only if the index is sorted.
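A sketch of lexicographic slicing on the title index (the bounds are arbitrary examples):

    # The index must be sorted for lexicographic slicing to work
    netflix_sorted = netflix.sort_index()

    # All titles that fall alphabetically between 'A' and 'B'
    netflix_sorted.loc['A':'B']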

IX. Filtering data using boolean indexers

Filtering data from a dataset is one of the most common and basic operations. There are numerous ways to filter (or subset) data in pandas with boolean indexing. Boolean indexing refers to selecting rows by providing a boolean value (True or False) for each row. These boolean values are usually stored in a Series or NumPy n-dimensional array and are usually created by applying a boolean condition to one or more columns in a DataFrame.

Create a variable to hold each criterion independently as a boolean Series, then combine all the criteria into a single boolean Series.

Pass this combined boolean Series to the indexing operator to get the filtered data.

After building the filter, we can select rows and columns together with the .loc indexer.
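A sketch of the whole filtering flow; the thresholds and the song_title column name are made-up examples:

    # Each criterion as its own boolean Series
    crit1 = spotify['danceability'] > 0.8
    crit2 = spotify['energy'] > 0.6
    crit3 = spotify['mode'] == 1

    # Combine them into a single boolean Series
    final_crit = crit1 & crit2 & crit3

    # Pass the combined filter to the indexing operator
    filtered = spotify[final_crit]

    # Or filter rows and pick columns in one step with .loc
    spotify.loc[final_crit, ['song_title', 'danceability', 'energy']]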

X. Finding the most common maximum across columns

The Spotify dataset contains thousands of songs. It would be interesting to see which song scores highest in each numeric column independently; we can check this using the idxmax() method.
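A sketch of that check, assuming a song_title column that can serve as the index:

    # Index by song title and keep only the numeric columns
    by_song = spotify.set_index('song_title').select_dtypes('number')

    # For every numeric column, the song with the highest value
    by_song.idxmax()

    # The most common maximum: which songs top the most columns
    by_song.idxmax().value_counts().head()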

This is the second post of my Data Wrangling series, with more to come. You can find the first post attached below.

Hey there, I am looking forward to your comments and suggestions.
Kindly clap, subscribe and share.
