hacking-pandas

Vinay Kudari
hacking-datascience
5 min read · Jul 27, 2018

Pandas is a powerful Python package which makes importing, cleaning, analyzing, and exporting data easier. In this story I’d like to write down some of the key functions which I found useful.

I will keep appending new stuff to this story.

#Update 08/08/18 : Categorical Data, Grouping.

Importing the data

The first part of any data science problem is to import the data, so let’s see how a few hacks can make importing much easier and cleaner.

syntax

pd.read_csv(file_name, argument1, argument2, ...)

key arguments

delimiter = character : the delimiter is a sequence of one or more characters used to mark the boundary between separate, independent fields. The character can be ' ', ',' or a tab '\t'.

header = n : row number to use as the column names and the start of the data. If the header argument is not passed, it defaults to 0, so column names are inferred from the first line of the file.

index_col = n : column(s) to use as the row labels. If a sequence is given, a MultiIndex is used.

usecols = [ ] : list of column numbers or names; only the columns mentioned in the list will be read.

nrows = n : number of rows to read.

comment = character : character that indicates the rest of the line should not be parsed, e.g. '#'.

na_values = list : additional strings to recognize as NaN values.

parse_dates = True : parses date columns into datetime objects (shown in yyyy-mm-dd hh:mm:ss format).

example
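
A minimal sketch of how these arguments fit together, assuming a hypothetical file sales.csv with date, store and units columns:

import pandas as pd

# 'sales.csv' is a hypothetical file used only for illustration
df = pd.read_csv('sales.csv',
                 delimiter=',',                        # values are comma separated
                 header=0,                             # column names come from the first line
                 index_col='date',                     # use the 'date' column as row labels
                 usecols=['date', 'store', 'units'],   # read only these columns
                 nrows=1000,                           # read at most 1000 rows
                 comment='#',                          # ignore everything after a '#'
                 na_values=['NA', '?'],                # extra strings to treat as NaN
                 parse_dates=True)                     # parse the index into datetime objects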

Cleaning and analyzing the data

Data almost never comes in a form that can be used directly for analysis or training. The most common problems in raw data include missing values, inconsistent data, and duplicate records.

Tabular data can be organized in many ways; some are suitable for reporting and some for analysis. The process of turning an untidy dataset into a form suitable for data analysis is known as data tidying.

The principles of tidy data are

  1. columns represent unique variables.
  2. rows represent individual observations.

Let’s explore some functions which make data cleaning easier.

Melting

Melting is a way of tidying the data: columns are turned into rows, making the data good for analysis.

syntax

pd.melt(df, argument1, argument2, ...)

key arguments

id_vars = [ ] : represents columns you do not want to melt

value_vars = [ ] : columns you wish to melt into rows; if this argument is not provided, all the columns not set in id_vars will be melted.
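
example

A minimal sketch, assuming a hypothetical wide dataframe of yearly temperature readings:

import pandas as pd

wide = pd.DataFrame({'city': ['NYC', 'LA'],
                     '2017': [15.2, 20.1],
                     '2018': [16.0, 21.3]})

# keep 'city' fixed and melt the year columns into rows
tidy = pd.melt(wide, id_vars=['city'],
               value_vars=['2017', '2018'],
               var_name='year', value_name='temperature')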

Pivoting

Pivoting is the exact opposite of melting: every unique value of a column is turned into a column of its own, making the data report friendly.

syntax

df.pivot(argument1, argument2, ...)

key arguments

index = [ ] : represents columns you do not want to pivot

columns = [ ] : represents columns you want to pivot

values = [ ] : values to be used when the column is pivoted

aggfunc = func : used with pivot_table; in case of duplicate rows the values are aggregated with the given function, e.g. mean, max, min.
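
example

A minimal sketch that pivots the melted temperature data back into wide form; pivot_table is the variant that accepts aggfunc when duplicate rows exist:

import pandas as pd

tidy = pd.DataFrame({'city': ['NYC', 'NYC', 'LA', 'LA'],
                     'year': ['2017', '2018', '2017', '2018'],
                     'temperature': [15.2, 16.0, 20.1, 21.3]})

# turn the unique values of 'year' back into columns
wide = tidy.pivot(index='city', columns='year', values='temperature')

# pivot_table aggregates duplicate rows with aggfunc
wide_mean = tidy.pivot_table(index='city', columns='year',
                             values='temperature', aggfunc='mean')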

why?

Makes data cleaning easier, which leads to better data analysis.

Date Time Series

Time series is one of the most common kinds of data, and I have found pandas extremely useful for time series manipulation and analysis.

functions

pd.to_datetime(series, format=time_format)

Passing a list of strings and specifying the format in which the strings are written parses them into datetime objects.

example
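
A minimal sketch with made-up timestamp strings:

import pandas as pd

s = pd.Series(['27-07-2018 10:30', '28-07-2018 11:45'])

# the format string describes how the dates are written in the input
parsed = pd.to_datetime(s, format='%d-%m-%Y %H:%M')

print(parsed.dt.hour)   # individual components via the .dt accessor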

why?

The datetime object is powerful and lets us get the individual components of the date, e.g. datetime_series.dt.hour

df.reindex(list, method=fill_method)

Reindexes a dataframe df according to the indices given in the list; if an index is not present in df, its value is filled according to the optional fill_method, which can be 'ffill' or 'bfill' (forward fill and backward fill respectively).

example
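
A minimal sketch with a hypothetical price series that has gaps in its dates:

import pandas as pd

prices = pd.Series([10.0, 10.5, 11.0],
                   index=pd.to_datetime(['2018-07-02', '2018-07-04', '2018-07-06']))

# reindex to a full daily range; missing days are forward filled
daily_index = pd.date_range('2018-07-02', '2018-07-06', freq='D')
filled = prices.reindex(daily_index, method='ffill')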

why?

When we need to compare two dataframes it is better if both have the same index order.

df.resample(rule).func( )

Resampling involves changing the frequency of your observations; the records are up-sampled or down-sampled according to the rule, such as 'D' (daily), 'W' (weekly) or 'M' (monthly). This can be chained with any statistical function.

Downsampling reduces the frequency of records, for example turning daily data into weekly data; upsampling, in contrast, increases the frequency. When we upsample we often end up with many missing values, which we can overcome by interpolating.

example
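
A minimal sketch with a made-up daily series, downsampled to weekly and upsampled to hourly:

import pandas as pd
import numpy as np

daily = pd.Series(np.arange(14.0),
                  index=pd.date_range('2018-07-01', periods=14, freq='D'))

# downsampling: daily -> weekly, aggregating with the mean
weekly = daily.resample('W').mean()

# upsampling: daily -> hourly introduces NaNs, so interpolate them
hourly = daily.resample('H').interpolate()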

why?

To provide additional structure or insight into the learning problem for supervised learning models.

#Update 08–08–18

Categorical Data

Often in real-world data a column consists of repetitive values, like gender or country. We can convert the series data type to the category type using

syntax

df[column_name].astype('category')
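
example

A minimal sketch showing the conversion on a hypothetical column:

import pandas as pd

df = pd.DataFrame({'country': ['IN', 'US', 'IN', 'US', 'IN']})

# repetitive strings stored as a category take far less memory
df['country'] = df['country'].astype('category')

print(df['country'].dtype)           # category
print(df.memory_usage(deep=True))    # inspect the memory footprint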

why?

Saves memory, especially in the case of large datasets, and also speeds up operations like groupby.

Grouping

Grouping involves splitting the dataframe into groups by a column, applying a function to each group, and combining the results.

syntax

df.groupby(column_name) or df.groupby(condition)

groupby returns an object on which we apply a function and finally combine the results.

example
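
A minimal sketch with a hypothetical store sales dataframe:

import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B', 'B'],
                   'units': [3, 5, 2, 8]})

# split by 'store', apply sum to each group, combine the results
print(df.groupby('store')['units'].sum())

# grouping on a condition instead of a column name
print(df.groupby(df['units'] > 4)['units'].count())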

functions

The aggregation function returns a single aggregated value for each group.

syntax

df.groupby(column_name).agg([func1, func2, ...])

If we want to apply different functions to different columns we could pass a dictionary to agg( ) instead of a list.

df.groupby(column_name).agg({column1:func1, column2:func2})

The transformation function applies a function to each group and returns a dataframe of the same size, where every row of a group gets the same aggregated value.

syntax

df.groupby(column_name).transform(function)

The filter function filters whole groups out of the dataframe based on a Boolean condition.

syntax

df.groupby(column_name).filter(function)

example
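
A minimal sketch of agg, transform and filter on the same hypothetical dataframe:

import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B', 'B'],
                   'units': [3, 5, 2, 8],
                   'price': [10.0, 12.0, 9.0, 11.0]})

g = df.groupby('store')

# aggregation: one value per group, different functions per column
summary = g.agg({'units': 'sum', 'price': 'mean'})

# transformation: same-sized result, each row gets its group's mean
df['group_mean_units'] = g['units'].transform('mean')

# filter: keep only the groups whose total units exceed a threshold
big_stores = g.filter(lambda grp: grp['units'].sum() > 7)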

visualization

Thanks to pbpython :)

why?

Helps to properly analyse the data and also to fill missing values appropriately using transform.

See ya soon 😄
