Just Enough Python for Data Science
String
Strings are immutable
Subset
Extract a subset of characters from a given string.
str[start_index:stop_index:skip]
Subsetting is right-exclusive: the stop index is not included.
# get the 2nd character (index = 1)
str[1]
# start with 3rd character (index = 2) and till 4th character (index = 3)
str[2:4]
# get the first 6 characters
str[:6]
# start from 2nd character (index = 1) till the end
str[1:]
#start with 1st index (2nd char) and get till second last char (-1 is last char and subsetting is right exclusive)
str[1:-1]
# print alternate characters
str[::2]
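A quick worked example of the slices above (using s rather than str, since str shadows the built-in type):

```python
s = "Python!"   # indices: P=0, y=1, t=2, h=3, o=4, n=5, !=6
print(s[1:])    # ython!
print(s[2:4])   # th
print(s[:6])    # Python
print(s[1:-1])  # ython
print(s[::2])   # Pto!
```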
Lambda
Map
A map function applies a lambda function to each value in a list, producing a result with the same number of values as the original list. In Python 3 it returns a lazy map object; wrap it in list() to materialize the values.
m = map(lambda x: x + 10, range(10))
list(m)  # to display as a list
Reduce
A reduce function applies a lambda function cumulatively across a list, collapsing it to a single summary value.
import functools
x = functools.reduce( (lambda x, y: x * y), [1, 2, 3, 4] )
# The lambda function is applied cumulatively, e.g.
# 1, 2 -> 1 * 2 = 2
# 2, 3 -> 2 * 3 = 6
# 6, 4 -> 6 * 4 = 24
Filter
It filters the items of the original list based on a given criterion and returns the filtered list.
list( filter( lambda x: x % 2 != 0, range(10) ) )  # returns [1, 3, 5, 7, 9]
Zip
# merges the sequences passed as params, returning tuples that contain the element at the same index from each sequence
# the result is truncated to the length of the shortest sequence
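For example:

```python
ids = [1, 2, 3]
names = ['a', 'b']  # shorter sequence
pairs = list(zip(ids, names))
print(pairs)  # [(1, 'a'), (2, 'b')] - truncated to the shortest input
```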
Enumerate
# returns both index and element of the sequence as a tuple
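For example:

```python
fruits = ['apple', 'banana', 'cherry']
pairs = list(enumerate(fruits))
print(pairs)  # [(0, 'apple'), (1, 'banana'), (2, 'cherry')]

# typical use: iterate with both index and element
for index, fruit in enumerate(fruits):
    print(index, fruit)
```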
Comprehensions
List Comprehensions
new_list = [ n**2 for n in range(1, 21) if n % 2 == 0 ]
# the if clause works like the FILTER function
Dictionary Comprehensions
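The notes don't include an example here; a dictionary comprehension mirrors the list version, producing key: value pairs:

```python
squares = {n: n**2 for n in range(1, 6)}
print(squares)       # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# with a filter clause, like the list comprehension above
even_squares = {n: n**2 for n in range(1, 6) if n % 2 == 0}
print(even_squares)  # {2: 4, 4: 16}
```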
NumPy
Arrays
np.ndarray((2, 2, 3), dtype='float')  # low-level constructor; np.zeros((2, 2, 3)) is more common
returns a 2 x 2 x 3 array of type float (values are uninitialized)
It is represented as 2 blocks, each of shape 2 x 3.
slicing an array
b = a[1, 2:4]  # row at index 1, columns 2 and 3
trace
sum
sum(axis=0)
mean
sort
ndim
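A small example of these array methods (the matrix is my own):

```python
import numpy as np

a = np.array([[3, 1],
              [2, 4]])

print(np.trace(a))    # 7, the sum of the diagonal (3 + 4)
print(a.sum())        # 10, the sum of all elements
print(a.sum(axis=0))  # [5 5], column-wise sums
print(a.mean())       # 2.5
print(np.sort(a))     # [[1 3]
                      #  [2 4]], each row sorted
print(a.ndim)         # 2, the number of dimensions
```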
condition and lookup
Boolean Indexing
a > 2     # returns a same-size boolean array: True where element > 2, else False
a[a > 2]  # returns the elements where element > 2
Reshaping an array
f is a 4x4 array
np.reshape(f, (1, 16))  # returns a 2-D array with 1 row and 16 columns
f.reshape(-1)           # returns a true 1-D array of 16 elements
Numpy Maths
add
subtract
multiplication
a * b            # element-wise multiplication
np.matmul(a, b)  # matrix multiplication
# number of columns of a should be equal to the number of rows of b
Transpose
a.T # this will transpose
Function broadcasting
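Broadcasting lets NumPy apply an operation between arrays of different shapes without explicit loops; a small sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
b = np.array([10, 20, 30])   # shape (3,)

# b is stretched ("broadcast") across each row of a
print(a + b)  # [[11 22 33]
              #  [14 25 36]]
# scalars broadcast too
print(a * 2)  # [[ 2  4  6]
              #  [ 8 10 12]]
```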
Advanced Matrix Operations
Solving linear equations
2x + 2y + 3z = 5
3x + y + 4z = 7
4x + 3y = 10
[x, y, z] = np.linalg.solve(feature_matrix, results_matrix)
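Filled in with concrete matrices for the three equations above (the variable names match the notes; the solution check is mine):

```python
import numpy as np

# coefficients of x, y, z in each of the three equations
feature_matrix = np.array([[2, 2, 3],
                           [3, 1, 4],
                           [4, 3, 0]])
results_matrix = np.array([5, 7, 10])

x, y, z = np.linalg.solve(feature_matrix, results_matrix)
# plugging the solution back in reproduces the right-hand side
print(np.allclose(feature_matrix @ np.array([x, y, z]), results_matrix))  # True
```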
Matrix Inversion
Eigen value and Eigen Vector
To identify where is the variance in data
Identify key variables, that contribute most of the variance
In the above example, total variance = 1 + 2 + 3 = 6.
The third column represents 50% of the variance (3/6). Why the 3rd column and not the 3rd row? Because columns represent variables, not rows.
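Neither operation is shown in code in the notes; a minimal sketch with a matrix of my own choosing:

```python
import numpy as np

a = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Matrix inversion: a @ inv(a) gives the identity matrix
a_inv = np.linalg.inv(a)

# Eigen decomposition: for this diagonal matrix the eigenvalues
# are just the diagonal entries
values, vectors = np.linalg.eig(a)
print(values)  # [2. 3.]
```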
Data Frames
Subsetting data
To get a subset of a data frame:
data.iloc[row_start:row_end, col_start:col_end]  # give column indexes
data.loc[0:99, ['Id', 'Name']]                   # give column names
Creating new variables in existing data frame
data['new_var'] = data['var_one'] + data['var_two']
OR
new_data = data.assign(a = data.var_one + data.var_two, b = data.var_three + data.var_four)
# the existing data frame is not modified; a new data frame is returned
# multiple new attributes can be added in one go this way
Accessing variables
data['varname'] is the same as data.varname (dot access only works for column names that are valid identifiers and don't clash with DataFrame attributes)
Dropping variables
dropping columns
data.drop( ['col1', 'col2'], axis=1, inplace = True )
# axis = 1 means drop columns (axis = 0 drops rows)
inplace = True modifies the data frame itself
inplace = False returns a new data frame without those columns, leaving the original unchanged
dropping rows
data.drop( [0,1], inplace = True )
# this removes the rows with index labels 0 and 1
BINNING — converting numerical variables to categorical variables
- user defined | can be implemented using splitting of data frame
- equi-dense (equal frequency)
- equidistant (equal width)
income  # contains the numerical value of income
income_grp = High / Medium / Low

# Approach 1 - equi-dense binning
# assign income_grp a value based on the decile of the income
# each group has the same number of people

# Approach 2 - equidistant binning
# create equal-width ranges, then assign each value into a category
pd.cut( data['colname'], 3, labels = ['0-33', '33-66', '66-100'] )
# colname is the numerical column being converted into a categorical one
# the bin edges are sorted automatically
# returns rownum | colname (with categorical values)

# binning into quantiles
data['deciles'] = pd.qcut( data['colname'], 10, labels = range(1, 11, 1) )
Q> why do we convert numerical to categorical?
A> To simplify our interpretation
Rename columns
data.rename(columns = {"oldvar1":"newvar1", ... })
Sort data
data.sort_index(axis=1, ascending = False)
data.sort_values(by=['income', 'utilization'], ascending=[True, False], inplace = True)
Convert data types
data.MonthlyIncome.astype('int64', errors = 'raise')
# raise_on_error is deprecated in favour of errors; with errors = 'ignore', failed conversions return the original data unchanged
Indexing and de-indexing
data.set_index("id", inplace = True)
# now id is part of the index, not a regular column
# so column-style filtering / conditional statements no longer apply to id
data.reset_index()
# nested (hierarchical) indexes
data.set_index( ["customer_segment", "id"] )
Handling duplicates
data[ data.duplicated() ]  # checks all columns together for duplication
# selected columns
dups = data[ data['customer_name'].duplicated() ]
uniques = data[ data['customer_name'].duplicated() == False ]
uniques_alt = data.drop_duplicates()
uniques_alt = data.drop_duplicates(keep='last')
Missing value treatment
- systemic missing - no need to replace, or just replace with a sentinel like -9999
- random missing - replace
When replacing random missing values: if missing data < 10%, replace with the mean (or the median if there are very high outliers).
If > 10%, then within each quantile replace with a random value from that quantile. This ensures the overall mean is not changed.
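The quantile-based replacement above is not shown in code; as one hedged simplification, filling each gap by sampling from the observed values preserves the overall distribution (and hence the mean). The function name and seed are my own:

```python
import numpy as np
import pandas as pd

def fill_from_distribution(s, seed=0):
    """Fill NaNs by sampling from the observed values - a simplified
    stand-in for 'replace with a random value from the same quantile'."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    observed = s.dropna().to_numpy()
    missing = out.isnull()
    out[missing] = rng.choice(observed, size=missing.sum())
    return out

filled = fill_from_distribution(pd.Series([40.0, None, 55.0, None, 70.0]))
print(filled.isnull().sum())  # 0
```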
# to detect how many are missing
data['col'].isnull().sum()
# replace
data['col'].fillna(data['col'].mean())
data['col'].fillna(data['col'].median())
data['col'].fillna(-9999)
# drop rows
data[ data['col'].isnull() == False ]
# this creates a data set which has no missing values
# worst case
data.dropna()
# this can be done if the missing data is very small, say < 1% across all variables
Handling Outliers
how to detect
By one definition
- whatever is below the 1st percentile is an outlier
- whatever is above the 99th percentile is an outlier
Another definition
- if value > mean + 3 SD, or value < mean - 3 SD, then it is an outlier (in case of normal distributions)
# to find
np.percentile(data.income, 1)  # 1st percentile (np.percentile takes values 0-100)
data.income.quantile(0.99)     # 99th percentile (quantile takes values 0-1)
# 6 sigma method (mean +/- 3 SD)
upper_end = data.income.dropna().mean() + 3 * data.income.dropna().std()
lower_end = data.income.dropna().mean() - 3 * data.income.dropna().std()
how to handle outliers
# if we know the valid range of a variable, we can handle outliers based on that
data['income'].clip(upper=10000)  # replace greater values with 10000 (clip_upper in older pandas)
data['income'].clip(lower=0)      # replace negative values with 0 (clip_lower in older pandas)
# percentile method
data['income'].dropna().quantile(0.95)
data['income'].dropna().quantile(0.05)
1. indexing
2. create / drop variables or rows
3. rename variables
7. missing value treatment
8. duplicates
9. outliers
10. binning of numeric var
11. type conv
12. data summarization
- groupby
13. long / wide data conversion
- stack / unstack
- pivots
14. merging
Type Conversion
Converting Numerical Variables into Categorical Variables
- Use Binning
data['age_group'] = pd.cut(data['age'], bins=[0,25,35,100], labels=["young","medium","old"])
Converting Categorical Variables into Numerical Variables
- Create Dummy Variables
- It is required only for predictive analysis. For descriptive analysis, we don't need to convert categories into numerical values
Income has the following possible values: L, M, MH, H, VH
We cannot simply assign them the values 1 to 5:
this would imply VH is 5x the value of L, unnecessarily introducing an order in the values which may not exist in reality.
Instead we can use K-1 dummy variables, with each combination of dummy variables representing one category.
Dummy variables can take the values 0 or 1.
While even the 0s and 1s may seem like ordinals, this is handled by the intercept:
y = 3x + 5
5 is the value at x = 0
8 is the value at x = 1
Why K-1 dummy variables? Because the K'th value is accounted for in the intercept.
Code Sample
dummy_var_data = pd.get_dummies(data['age_group'], prefix="D")
# This creates K dummy variables; pass drop_first=True (or drop one column manually) to keep only K-1
Data Summarization
df.apply(lambda_function)
# to apply row-wise
df.apply(lambda_function, axis=1)
# E.g. a column summary function (its body is left as a TODO in the notes):
# var_summary(x) returns
# pd.Series([...], index=['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'])
data._get_numeric_data().apply(lambda x: var_summary(x)).T
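The notes leave var_summary as a TODO; one sketch consistent with the listed statistics (using the public select_dtypes instead of the private _get_numeric_data) might be:

```python
import pandas as pd

def var_summary(x):
    # one column of summary statistics per numeric variable
    return pd.Series(
        [x.count(), x.isnull().sum(), x.sum(), x.mean(),
         x.median(), x.std(), x.var(), x.min()],
        index=['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'])

df = pd.DataFrame({'a': [1.0, 2.0, None, 4.0], 'b': list('wxyz')})
print(df.select_dtypes('number').apply(var_summary).T)
```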
Group By
grouped_data = data[['gender', 'age']].groupby('gender').mean()
# average age w.r.t. gender
# gender | age
#   0    | 52
#   1    | 53
type(grouped_data)
grouped_data.max()
grouped_data.mean()

# apply multiple statistics at once
# df is a data frame, k1 is a categorical variable
pd.concat([ df.groupby(df.k1).agg('mean'),
            df.groupby(df.k1).agg('std') ], axis = 1)
# returns mean and std of each numerical variable, w.r.t. each distinct value of k1
# but now we cannot differentiate which is mean and which is std
pd.concat([ df.groupby(df.k1).agg('mean').add_prefix('mu_'),
            df.groupby(df.k1).agg('std').add_prefix('sigma_') ], axis = 1)
pd.DataFrame(pd.concat([grouped_data.max(), grouped_data.min()], axis=1), columns=['max', 'min'])
Pivot
df.pivot(index='date', columns='item', values='status')
# this is very similar to an Excel pivot
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc='sum') )
# for multiple aggregations
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc=['sum', 'mean']) )
Merging (Joins)
pd.merge(df1, df2, how="left", on="key", left_on=None, right_on=None, left_index=False, right_index=False, sort=True, copy=True, suffixes=("_l", "_r"))
# we can join on a specific key column, or on the left / right index
Another way
df1.join(df2)  # left join between 2 data frames on their indexes
pd.concat([s1, s2, s3], axis = 0)  # axis = 0 stacks vertically, axis = 1 joins horizontally
Random Sampling
1. With replacement
- the sampled item is put back, so it can be picked again
- also called SRSWR (simple random sample with replacement)
- samples are random
- representative of population
- can create a sample of any size; data can be amplified
- pick a ball from a jar, note the number on the ball, and put it back in jar
ts.sample(n=758, replace=True).duplicated().value_counts()
# sample 70% of the total population
df.sample(frac=0.7, replace=True).duplicated().value_counts()
2. Without replacement
- pick a ball from a jar, note the number on the ball, but don’t put it back in jar
- samples are unique
# without replacement
ts.sample(n=758, replace=False).duplicated().value_counts()
3. Stratified sampling
- e.g. people in the north are taller than people in the north east; if we don't take samples from each state, the MEAN / AVG will not give a correct picture
- select samples from each state, then apply SRS-WR or SRS-WOR state-wise
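A possible sketch of stratified sampling using pandas group-wise sampling (DataFrameGroupBy.sample needs pandas >= 1.1; the column names and data are my own):

```python
import pandas as pd

# toy data: two strata of different sizes
df = pd.DataFrame({
    'state':  ['N'] * 6 + ['NE'] * 4,
    'height': [180, 178, 182, 179, 181, 183, 160, 158, 162, 161],
})

# take the same fraction from every stratum (SRS-WOR within each state)
sample = df.groupby('state').sample(frac=0.5, random_state=0)
print(sample['state'].value_counts())  # N: 3, NE: 2 - every state is represented
```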
What do we do with Outliers
- We do flooring and capping
Share these with business. If this observation might reoccur, we can do some treatment.
Info / Profiling of DataFrames
pandas_profiling.ProfileReport(df)