hacking-pandas
Pandas is a powerful Python package that makes importing, cleaning, analyzing, and exporting data easier. In this story I’d like to write down some of the key functions I have found useful.
Will keep on appending new stuff to the story.
#Update 08/08/18 : Categorical Data, Grouping.
Importing the data
The first part of any data science problem is to import data. Now let’s see how a few hacks can make importing much easier and cleaner.
syntax
pd.read_csv(file_name, argument1, argument2, ...)
key arguments
delimiter = character
: a delimiter is a sequence of one or more characters that marks the boundary between separate, independent fields. The character can be ' ', ',', or a tab '\t'.
header = n
: header marks the row where the data starts. If the header argument is not passed it defaults to 0, so column names are inferred from the first line of the file.
index_col = n
: column(s) to use as row labels. If a sequence is given, a MultiIndex will be used.
usecols = [ ]
: the list can hold column numbers or names; only the columns mentioned in it will be read.
nrows = n
: number of rows to read.
comment = character
: the character marks the beginning of a line that should not be parsed; it can be '#' or '\\'.
na_values = list
: additional strings to recognize as NaN values.
parse_dates = True
: parses date columns into datetime objects (yyyy-mm-dd hh:mm:ss format).
example
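A minimal sketch with made-up sensor data; io.StringIO stands in for a file name so the snippet is self-contained, and the column names and NaN token are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV with a comment line and a custom NaN token
raw = """# sensor log
date,temp,city
2018-08-01,21.5,NYC
2018-08-02,missing,NYC
2018-08-03,23.1,LA
"""

df = pd.read_csv(
    io.StringIO(raw),       # stands in for a file name
    delimiter=',',          # boundary character between fields
    header=0,               # first non-comment line holds column names
    comment='#',            # skip lines starting with '#'
    na_values=['missing'],  # extra string to treat as NaN
    parse_dates=['date'],   # parse the date column into datetime64
    nrows=3,                # read only the first three rows
)
print(df.dtypes)
```

Note that commented lines are dropped before the header is located, so header=0 still points at the column-name row.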
Cleaning and analyzing the data
Data almost never comes in a form that can be used directly for analysis or training. The most common problems in raw data are missing values, inconsistent data, and duplicate records.
Tabular data can be organized in many ways; some are suitable for reporting and some for analysis. The process of turning an untidy dataset into a form suitable for data analysis is known as data tidying.
The principles of tidy data are
- columns represent unique variables.
- rows represent individual observations.
Let’s explore some functions that make data cleaning easier.
Melting
Melting is a way of tidying data: columns are turned into rows, making the data good for analysis.
syntax
pd.melt(df, argument1, argument2, ...)
key arguments
id_vars = [ ]
: represents columns you do not want to melt
value_vars = [ ]
: represents the columns you wish to melt into rows. If this argument is not provided, all columns not listed in id_vars
will be melted.
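A small sketch with invented sales data, showing id_vars and value_vars in action:

```python
import pandas as pd

# Wide table: one column per year (made-up example data)
wide = pd.DataFrame({
    'city': ['NYC', 'LA'],
    '2017': [100, 90],
    '2018': [110, 95],
})

# id_vars stay as-is; the year columns melt into (variable, value) rows
tidy = pd.melt(wide, id_vars=['city'], value_vars=['2017', '2018'],
               var_name='year', value_name='sales')
print(tidy)
```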
Pivoting
Exactly the opposite of melting: every unique value is turned into a column, making the data report-friendly.
syntax
df.pivot(argument1, argument2, ...)
key arguments
index = [ ]
: represents columns you do not want to pivot
columns = [ ]
: represents columns you want to pivot
values = [ ]
: values to be used when the column is pivoted
aggfunc = func
: pivot_table( ) only; when there are duplicate rows, the values are aggregated with the given function, e.g. mean, max, min.
why?
makes data cleaning easier for better data analysis.
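A sketch with invented sales data, contrasting pivot( ), which fails on duplicate index/column pairs, with pivot_table( ) and its aggfunc:

```python
import pandas as pd

tidy = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA', 'LA'],
    'year': ['2017', '2018', '2017', '2018'],
    'sales': [100, 110, 90, 95],
})

# pivot: unique 'year' values become columns
wide = tidy.pivot(index='city', columns='year', values='sales')

# pivot_table: same reshape, but duplicate rows are aggregated with aggfunc
wide_mean = tidy.pivot_table(index='city', columns='year',
                             values='sales', aggfunc='mean')
print(wide)
```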
Date Time Series
Time series are among the most common kinds of data, and I have found the pandas package extremely useful for time series manipulation and analysis.
functions
pd.to_datetime(series, format=time_format)
Passing a list of strings and specifying the format in which they are written parses the strings into datetime objects.
example
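A sketch parsing made-up timestamp strings with an explicit format (the strftime codes must match the strings):

```python
import pandas as pd

dates = pd.Series(['08-08-2018 10:30', '09-08-2018 11:45'])

# '%d-%m-%Y %H:%M' mirrors the day-month-year layout of the strings above
parsed = pd.to_datetime(dates, format='%d-%m-%Y %H:%M')
print(parsed.dt.hour)
```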
why?
the datetime object is powerful; we can get the individual elements of the date, e.g. series.dt.hour
df.reindex(list, method=func)
reindexes a dataframe df according to the indices given in the list; where an index doesn’t match, the values are filled according to the optional method func, which can be 'ffill' or 'bfill',
representing forward fill and backward fill.
example
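A sketch with an invented every-other-day temperature series, reindexed onto a daily index with forward fill:

```python
import pandas as pd

# Readings every other day (made-up data)
idx = pd.date_range('2018-08-01', periods=3, freq='2D')
df = pd.DataFrame({'temp': [20.0, 22.0, 24.0]}, index=idx)

# Reindex onto a daily index; ffill copies the last known value forward
daily = df.reindex(pd.date_range('2018-08-01', periods=5, freq='D'),
                   method='ffill')
print(daily)
```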
why?
when we need to compare two dataframes, it helps if both share the same index order.
df.resample(rule).func( )
Resampling involves changing the frequency of your observations: records are up-sampled or down-sampled according to the rule (the most common rules are in the table below), and the call can be chained with any statistical function.
Downsampling reduces the frequency of records, for example turning daily data into weekly data; upsampling, in contrast, increases the frequency. The latter often produces many missing values, which we can overcome by interpolating.
example
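A sketch with an invented daily series, downsampled to 3-day means and upsampled to 12-hourly values with interpolation:

```python
import pandas as pd

# Six days of made-up daily readings
idx = pd.date_range('2018-08-01', periods=6, freq='D')
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample: daily -> 3-day buckets, chained with a statistic
three_day = daily.resample('3D').mean()

# Upsample: daily -> 12-hourly; the new slots are NaN until interpolated
half_day = daily.resample('12h').interpolate()
print(three_day)
```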
why?
to provide additional structure or insight into the learning problem for supervised learning models
#Update 08–08–18
Categorical Data
Often in real-world data a column consists of repetitive values, like gender or country. We can convert such a series to the category type using
syntax
df[column_name].astype('category')
why?
Saves memory, especially on large datasets, and speeds up operations like groupby.
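A sketch with a made-up repetitive column, comparing memory usage before and after the conversion:

```python
import pandas as pd

# A highly repetitive column (invented data)
df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F', 'M'] * 1000})

before = df['gender'].memory_usage(deep=True)
df['gender'] = df['gender'].astype('category')
after = df['gender'].memory_usage(deep=True)

# The category column stores small integer codes plus two category labels
print(before, after)
```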
Grouping
It involves splitting the dataframe into groups by a column, applying a function to each group, and merging the results.
syntax
df.groupby(column_name) or df.groupby(condition)
groupby returns an object on which we apply a function and finally combine the results
example
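A sketch with invented sales data; groupby( ) alone computes nothing until a function is applied and the results are combined:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA', 'LA'],
    'sales': [100, 110, 90, 95],
})

grouped = df.groupby('city')      # a GroupBy object; nothing computed yet
totals = grouped['sales'].sum()   # apply a function, combine per group
print(totals)
```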
functions
aggregation function returns a single aggregated value for each group
syntax
df.groupby(column_name).agg([func1, func2, ...])
If we want to apply different functions to different columns we could pass a dictionary to agg( ) instead of a list.
df.groupby(column_name).agg({column1:func1, column2:func2})
transformation function applies the function to each group and returns a dataframe of the same size, where every row of a group carries that group’s aggregated value.
syntax
df.groupby(column_name).transform(function)
filter function keeps only the groups that satisfy a Boolean condition
syntax
df.groupby(column_name).filter(function)
example
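A sketch with invented sales data showing agg, transform, and filter side by side:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA', 'LA', 'SF'],
    'sales': [100, 110, 90, 95, 50],
})

# agg: one row per group, possibly several statistics per column
stats = df.groupby('city')['sales'].agg(['mean', 'max'])

# transform: same length as df; each row gets its group's mean
df['group_mean'] = df.groupby('city')['sales'].transform('mean')

# filter: keep only groups whose total sales exceed 100
big = df.groupby('city').filter(lambda g: g['sales'].sum() > 100)
print(stats)
```

The transform result is handy for filling missing values with a per-group statistic instead of a global one.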
visualization
why?
helps to properly analyze the data and to fill missing values appropriately using transform.
See ya soon 😄