Pandas: The ML Building Block - Part 1

Published in

Xebia Engineering Blog

5 min read · Apr 27, 2021

Python is open source, and with so many packages available it can be hard to know the best one for a specific task. There is one package we absolutely need to learn for data science, and it's called pandas. The powerful machine learning and glamorous visualisation tools may have drawn your attention; however, you won't get very far without good pandas skills.

There are two main data structures in pandas: Series and Dataframes. Data is stored in dataframes by default, so manipulating dataframes quickly is probably the most important skill for data analysis.

N.B.: The comments in the code screenshots will help you understand the code more precisely, so do read through them.

Jupyter_Notebook

1. The Pandas Series

A series is similar to a 1-D numpy array, and contains scalar values of the same type (numeric, character, datetime etc.). A dataframe is simply a table where each column is a pandas series.

Creating Pandas Series:

Series are one-dimensional array-like structures, though unlike numpy arrays, they often contain non-numeric data (characters, dates, time, booleans etc.).

We can create pandas series from array-like objects using pd.Series().
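As a minimal sketch of pd.Series() in action, the snippet below builds series from a list, a tuple, a numpy array and a date range. The values and variable names are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A series from a Python list of numbers
s_numbers = pd.Series([2, 4, 5, 6, 9])

# A series from a tuple of strings (non-numeric data is fine)
s_chars = pd.Series(("a", "b", "af"))

# A series from a numpy array
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# A series of dates, generated with pd.date_range (both endpoints included)
s_dates = pd.Series(pd.date_range(start="2021-04-01", end="2021-04-05"))

print(s_numbers)
print(s_chars)
print(s_array)
print(s_dates)
```

Printing a series shows the index on the left, the values on the right, and the dtype on the last line.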

You might have noticed that, when creating a series, pandas automatically indexes it from 0 to n-1, n being the number of rows. If we want, we can also set the index explicitly, using the 'index' argument of pd.Series().
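To make the difference concrete, here is a small sketch comparing the default index with one passed through the index argument. The labels "a", "b" and "c" are arbitrary placeholders.

```python
import pandas as pd

# Default index: 0 to n-1
s_default = pd.Series([10, 20, 30])

# Explicit index passed through the `index` argument
s_labelled = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s_default.index.tolist())   # [0, 1, 2]
print(s_labelled.index.tolist())  # ['a', 'b', 'c']
print(s_labelled["b"])            # 20
```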

Usually, we work with Series only as a part of dataframes. Let’s study the basics of dataframes.

2. The Pandas Dataframe

The dataframe is the most widely used data structure in data analysis. It is a table with rows and columns, with rows having an index and columns having meaningful names.

Creating dataframes from dictionaries:

There are various ways of creating dataframes, such as building them from lists, dictionaries or JSON objects, or reading them from text and CSV files.
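A minimal sketch of the dictionary route, assuming some made-up car data: each key becomes a column name and each list becomes that column's values.

```python
import pandas as pd

# Hypothetical data, for illustration only
cars = {
    "brand": ["Honda", "Toyota", "Ford"],
    "year": [2015, 2018, 2020],
    "price": [9500.0, 15200.0, 21000.0],
}

# Keys become column names; the rows get a default 0..n-1 index
df = pd.DataFrame(cars)

print(df)
print(df.shape)  # (3, 3)
```

All the lists must have the same length, otherwise pd.DataFrame() raises a ValueError.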

Importing CSV data files as pandas dataframes
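CSV files are read with pd.read_csv(). The sketch below uses a small in-memory CSV so that it runs on its own; the column names and values are assumptions made for illustration, and in practice you would pass the path of your own file instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
# With a file on disk you would write: pd.read_csv("path/to/your_file.csv")
csv_text = """Ord_id,Prod_id,Sales,Order_Quantity
Ord_1,Prod_5,136.81,6
Ord_2,Prod_3,42.27,2
Ord_3,Prod_5,4701.69,26
"""

# The first line is used as the header by default
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```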

Reading and Summarising Dataframes:

After you import a dataframe, you'd want to quickly understand its structure, its shape, and the meaning of its rows and columns. You may also want to look at summary statistics such as the mean and percentiles.

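head() and tail() show the top and bottom rows of the data (five rows by default).

info() gives a more detailed overview: column names, non-null counts, data types and memory usage.

describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns.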
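The following sketch runs these inspection methods on a small made-up orders table (the column names and numbers are assumptions, not the dataset from the original notebook).

```python
import pandas as pd

# Hypothetical orders data with ten rows
df = pd.DataFrame({
    "Ord_id": [f"Ord_{i}" for i in range(1, 11)],
    "Sales": [120.0, 80.5, 43.2, 990.0, 15.0, 230.4, 67.8, 310.0, 5.5, 48.0],
    "Order_Quantity": [3, 1, 2, 20, 1, 7, 2, 9, 1, 2],
})

print(df.shape)     # (number of rows, number of columns)
print(df.head())    # first 5 rows by default
print(df.tail(3))   # last 3 rows

df.info()           # prints column names, non-null counts and dtypes

summary = df.describe()  # summary statistics for the numeric columns
print(summary)
```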

Each column of a dataframe is a Series.

Now, arbitrary numeric indices are difficult to read and work with. Thus, we may want to change the index of the df to something more meaningful.

Let’s change the index to Ord_id (unique id of each order), so that you can select rows using the order ids directly.
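A minimal sketch of this step, assuming a tiny made-up table with an Ord_id column: set_index() returns a new dataframe by default, so the result is assigned back to df.

```python
import pandas as pd

# Hypothetical orders data
df = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2", "Ord_3"],
    "Sales": [136.81, 42.27, 4701.69],
    "Order_Quantity": [6, 2, 26],
})

# Each column of a dataframe is a Series
print(type(df["Sales"]))

# Use the order id as the row index instead of 0, 1, 2
df = df.set_index("Ord_id")

# Rows can now be selected by order id
print(df.loc["Ord_2"])
```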

3. Sorting dataframes

You can sort dataframes in two ways: 1) by the index and 2) by the values.

Sorting by index:
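As a small illustration, sort_index() orders the rows by their index labels, ascending by default. The data below is made up.

```python
import pandas as pd

# Hypothetical data with an unordered index
df = pd.DataFrame(
    {"Sales": [42.27, 4701.69, 136.81]},
    index=["Ord_2", "Ord_3", "Ord_1"],
)

by_index_asc = df.sort_index()                  # ascending by default
by_index_desc = df.sort_index(ascending=False)  # descending

print(by_index_asc)
print(by_index_desc)
```

Both calls return a new dataframe and leave df itself unchanged.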

Sorting by values:
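Sorting by values uses sort_values() with the column name passed through by. The sketch below uses made-up sales figures.

```python
import pandas as pd

# Hypothetical orders data
df = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2", "Ord_3"],
    "Sales": [136.81, 42.27, 4701.69],
})

by_sales = df.sort_values(by="Sales")                        # lowest first
by_sales_desc = df.sort_values(by="Sales", ascending=False)  # highest first

print(by_sales)
print(by_sales_desc)
```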

Variants of sort_values:
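One common variant, sketched here on made-up data, sorts by several columns at once and gives each column its own direction.

```python
import pandas as pd

# Hypothetical orders data
df = pd.DataFrame({
    "Prod_id": ["Prod_5", "Prod_3", "Prod_5", "Prod_3"],
    "Sales": [136.81, 42.27, 4701.69, 500.0],
})

# Sort by product id (ascending), then by sales within each product (descending)
sorted_df = df.sort_values(by=["Prod_id", "Sales"], ascending=[True, False])
print(sorted_df)

# The original row labels are kept; reset_index(drop=True) renumbers them from 0
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```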

4. Selecting Data

Please note that positional indexing in pandas starts at 0.

Position and Label Based Indexing: df.iloc and df.loc

We have seen some ways of selecting rows and columns from dataframes. Let’s now see some other ways of indexing dataframes, which pandas recommends, since they are more explicit (and less ambiguous).

There are two main ways of indexing dataframes:

1) Position based indexing using df.iloc

2) Label based indexing using df.loc

Note that simply writing df[2, 4] will throw an error: plain square brackets do not say whether the 2 is an integer position (the third row) or a row with label 2, and pandas in fact treats the whole key as a column label, which does not exist.

On the other hand, df.iloc[2, 4] tells pandas explicitly that it should assume integer indices.
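Here is a minimal sketch of position-based indexing on a made-up five-column table. It also shows the failing df[2, 4] call for comparison.

```python
import pandas as pd

# Hypothetical orders data, indexed by order id
df = pd.DataFrame(
    {
        "Prod_id": ["Prod_5", "Prod_3", "Prod_5", "Prod_1"],
        "Sales": [136.81, 42.27, 4701.69, 500.0],
        "Discount": [0.01, 0.07, 0.0, 0.05],
        "Order_Quantity": [6, 2, 26, 4],
        "Profit": [-30.51, 4.56, 1148.9, 72.3],
    },
    index=["Ord_1", "Ord_2", "Ord_3", "Ord_4"],
)

# Plain brackets look for a column labelled (2, 4), so this fails
try:
    df[2, 4]
except KeyError as err:
    print("KeyError:", err)

# iloc is purely positional: third row, fifth column
value = df.iloc[2, 4]
print(value)  # 1148.9

first_two_rows = df.iloc[0:2]         # rows at positions 0 and 1 (end excluded)
some_cells = df.iloc[[0, 3], [1, 4]]  # rows 0 and 3, columns 1 and 4

print(first_two_rows)
print(some_cells)
```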

Pandas provides the df.loc[] functionality to index dataframes using labels.
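The sketch below shows a few typical df.loc selections on made-up data. Note that, unlike positional slices, label slices include both endpoints.

```python
import pandas as pd

# Hypothetical orders data, indexed by order id
df = pd.DataFrame(
    {
        "Sales": [136.81, 42.27, 4701.69, 500.0],
        "Order_Quantity": [6, 2, 26, 4],
        "Profit": [-30.51, 4.56, 1148.9, 72.3],
    },
    index=["Ord_1", "Ord_2", "Ord_3", "Ord_4"],
)

# A single row, returned as a Series
row = df.loc["Ord_2"]

# A single cell: row label, then column label
cell = df.loc["Ord_3", "Sales"]

# A label slice of rows (both ends included) and a list of columns
subset = df.loc["Ord_1":"Ord_3", ["Sales", "Profit"]]

print(row)
print(cell)  # 4701.69
print(subset)
```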

I hope this has piqued your interest in what else we can learn in pandas. I assure you, my friend, there is a lot more available in pandas than we might expect. In the upcoming parts we will look at slicing and dicing, merging and concatenation, grouping of data, and much more. Stay tuned for the fun stuff coming your way.

Part 2