Just Enough Python for Data Science

Gaurav Madan
8 min read · Dec 17, 2019


String

Strings are immutable

Subset

Extract a subset of characters from a given string.

s[start_index:stop_index:step]

Subsetting is right-exclusive: the character at stop_index is not included.


s = "just enough python"   # example string; named s rather than str to avoid shadowing the built-in

# the 2nd character (index = 1)
s[1]
# start with the 3rd character (index = 2) and go up to the 4th character (index = 3)
s[2:4]
# the first 6 characters (indexes 0 to 5)
s[:6]
# from the 2nd character (index = 1) to the end
s[1:]
# start with index 1 and go up to the second-last character (-1 is the last char and subsetting is right-exclusive)
s[1:-1]
# every alternate character
s[::2]

Lambda
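A lambda is an anonymous, single-expression function that can be assigned to a name or passed to other functions. A minimal sketch:

square = lambda x: x ** 2
square(5)   # returns 25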

Map

The map function applies a lambda function to each value in a list (or any iterable) and creates a new sequence from the results. The resulting sequence has the same number of values as the original.

m = map(lambda x: x + 10, range(10))
list(m)   # map returns a lazy iterator, so wrap it in list() to display the values

Reduce

The reduce function applies a lambda function cumulatively to the items of a list and reduces them to a single summary value.

import functools
x = functools.reduce(lambda x, y: x * y, [1, 2, 3, 4])
# The lambda function is applied cumulatively, e.g.
# (1, 2) -> 1 * 2 = 2
# (2, 3) -> 2 * 3 = 6
# (6, 4) -> 6 * 4 = 24
# so x = 24

Filter

filter is used to keep only the items of the original list that satisfy a given criterion and returns the filtered result.

list( filter( lambda x: x % 2 != 0, range(10) ) )   # returns [1, 3, 5, 7, 9]

Zip

# zip merges the sequences passed as parameters and returns tuples containing the elements at the same index from each of the sequences
# the result has as many elements as the shortest of the input sequences
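A minimal sketch (the lists are assumed for illustration):

list( zip([1, 2, 3], ['a', 'b', 'c', 'd']) )
# returns [(1, 'a'), (2, 'b'), (3, 'c')] - the extra 'd' is dropped because the shortest sequence has 3 elements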

Enumerate

# returns both the index and the element of the sequence as a tuple
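A minimal sketch:

list( enumerate(['a', 'b', 'c']) )
# returns [(0, 'a'), (1, 'b'), (2, 'c')]
for index, element in enumerate(['a', 'b', 'c']):
    print(index, element)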

Comprehensions

List Comprehensions

new_list = [ n**2 for n in range(1, 21) if n % 2 == 0 ]
# the if-clause is similar to the filter function

Dictionary Comprehensions
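A dictionary comprehension uses the same syntax to build a dict instead of a list. A minimal sketch:

squares = { n: n ** 2 for n in range(1, 6) if n % 2 == 0 }
# returns {2: 4, 4: 16}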

NumPy

Arrays

import numpy as np

a = np.ndarray((2, 2, 3), dtype='float')
# allocates an (uninitialised) 2 x 2 x 3 array of floats
# np.zeros((2, 2, 3)) is the more common way to create an initialised array
# it is represented as 2 blocks, each being a 2 x 3 array

slicing an array

b = a[1, 2:4]   # index 1 along the first axis, then positions 2 and 3 along the next axis

trace

sum

sum(axis=0)

mean

sort

ndim
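A minimal sketch of these array methods on a small array (the values are assumed for illustration):

import numpy as np

a = np.array([[3, 1], [2, 4]])
np.trace(a)     # 7, the sum of the diagonal elements
a.sum()         # 10, the sum of all elements
a.sum(axis=0)   # array([5, 5]), column-wise sums
a.mean()        # 2.5, the mean of all elements
np.sort(a)      # sorts each row: array([[1, 3], [2, 4]])
a.ndim          # 2, the number of dimensions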

Arrays

Define np.array and ndarray

Slicing an array

condition and lookup

Boolean Indexing

a > 2      # returns a boolean array of the same size: True where the element > 2, else False
a[a > 2]   # returns only the elements that are > 2

Reshaping an array

# f is a 4 x 4 array
np.reshape(f, (1, 16))   # returns a 1 x 16 array (1 row, 16 columns)
f.reshape(-1)            # returns a flat 1D array of all 16 elements

Numpy Maths

add

subtract
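Addition and subtraction are element-wise. A minimal sketch (the arrays are assumed for illustration):

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])
np.add(a, b)        # same as a + b  -> [[11, 22], [33, 44]]
np.subtract(b, a)   # same as b - a  -> [[ 9, 18], [27, 36]]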

Multiplication

a * b             # element-wise multiplication
np.matmul(a, b)   # matrix multiplication: the number of columns of a must equal the number of rows of b

Transpose

a.T   # returns the transpose of a (rows become columns)

function broadcasting
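Broadcasting lets NumPy apply an operation between arrays of different shapes by stretching the smaller one. A minimal sketch (values assumed):

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
a + 10                                 # the scalar is broadcast to every element
a + np.array([100, 200, 300])          # the 1D array is broadcast across each row -> [[101, 202, 303], [104, 205, 306]]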

Advanced Matrix Operations

Solving linear equations

# 2x + 2y + 3z = 5
# 3x +  y + 4z = 7
# 4x + 3y      = 10
feature_matrix = np.array([[2, 2, 3], [3, 1, 4], [4, 3, 0]])
results_matrix = np.array([5, 7, 10])
x, y, z = np.linalg.solve(feature_matrix, results_matrix)

Matrix Inversion
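A minimal sketch using np.linalg.inv (the matrix is assumed for illustration and must be invertible):

a = np.array([[1.0, 2.0], [3.0, 4.0]])
a_inv = np.linalg.inv(a)
np.matmul(a, a_inv)   # approximately the 2 x 2 identity matrix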

Eigen value and Eigen Vector

Eigen analysis is used to identify where the variance in the data lies.

It helps identify the key variables that contribute most of the variance, as in the sketch below.
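A minimal sketch of computing eigenvalues and eigenvectors with NumPy (the matrix values are assumed for illustration, chosen so the eigenvalues are 1, 2 and 3):

# a simple covariance-like matrix whose eigenvalues are 1, 2 and 3
cov = np.diag([1.0, 2.0, 3.0])
eigen_values, eigen_vectors = np.linalg.eig(cov)
# eigen_values  -> array([1., 2., 3.])
# eigen_vectors -> one column per eigenvalue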

In the example above, the total variance = 1 + 2 + 3 = 6.

The third column represents 50% of the variance (3 / 6). Why the 3rd column and not the 3rd row? Because a column represents a variable, a row does not.

Data Frames

Subsetting data

To get a subset of a data frame:

import pandas as pd

data.iloc[row_start:row_end, col_start:col_end]   # give column indexes (integer positions, right-exclusive)
data.loc[0:99, ['Id','Name']]                     # give names (label-based, right-inclusive)

Creating new variables in existing data frame

data['new_var'] = data['var_one'] + data['var_two']
# OR
new_data = data.assign(a = data.var_one + data.var_two,
                       b = data.var_three + data.var_four)
# assign does not modify the existing data frame; a new data frame is returned
# multiple new attributes can be added in one go this way

Accessing variables

data['varname'] is the same as data.varname (the dot form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute)

Dropping variables

dropping columns

data.drop( ['col1', 'col2'], axis=1, inplace=True )
# axis=1 means the labels refer to columns, so the listed columns are dropped
# inplace=True actually drops the columns from data
# inplace=False (the default) leaves data unchanged and returns a new data frame without those columns

dropping rows

data.drop( [0,1], inplace = True )
# this will remove first and second rows

BINNING — converting numerical variables to categorical variables

  • user defined | can be implemented by splitting the data frame
  • equi-dense | each bin has the same number of observations
  • equidistant | each bin covers a value range of the same width
# income contains the numerical value of income
# income_grp takes the values: High, Medium, Low

# Approach 1 - equi-dense binning
# assign income_grp a value based on the decile of the income
# each group has the same number of people

# Approach 2 - equidistant binning
# create ranges for each category, then assign each value to a category
pd.cut( data['colname'], 3, labels=['0-33', '33-66', '66-100'] )
# colname is the numerical column that is being converted into a categorical one
# the data is placed into the bins automatically
# returns rownum | colname (with categorical values)

# binning into quantiles
data['deciles'] = pd.qcut( data['colname'], 10, labels=range(1, 11, 1) )

Q> why do we convert numerical to categorical?

A> To simplify our interpretation

Rename columns

data.rename(columns = {"oldvar1":"newvar1", ... })

Sort data

data.sort_index(axis=1, ascending=False)
data.sort_values(by=['income', 'utilization'], ascending=[True, False], inplace=True)

Convert data types

data.MonthlyIncome.astype('int64', errors='raise')
# errors='raise' (the default) raises an exception if the conversion fails
# errors='ignore' returns the original data unchanged instead of raising
# (raise_on_error is the older, deprecated name of this parameter)

Indexing and de-indexing

data.set_index("id", inplace = True)
# now id is not a column in data
# we cannot apply filtering / conditional statements on IDs
data.reset_index()# nested indexes
data.set_index( np.array(["customer_segment", "id"]) )

Handling duplicates

data[ data.duplicated() ]   # checks all columns together for duplication
# selected columns
data[ data['customer_name'].duplicated() ]
dups = data[ data['customer_name'].duplicated() ]
uniques = data[ data['customer_name'].duplicated() == False ]
uniques_alt = data.drop_duplicates()
uniques_alt = data.drop_duplicates(keep='last')

Missing value treatment

  • systemic missing: no need to replace, or just replace with a sentinel value such as -9999
  • random missing: replace

When replacing randomly missing values, if the missing data is < 10% of the column, replace it with the mean (or with the mode if there are very high outliers).

If it is > 10%, then within each quantile replace with a random value from that quantile, as sketched below. This ensures the overall mean is not changed much.
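One simplified way to realise this idea is to draw replacements at random from the observed values of the column; this is a sketch, not the author's exact per-quantile method, and the column name 'col' is assumed:

import numpy as np

# replacement values are drawn at random from the observed (non-missing) values,
# which roughly preserves the distribution and the mean of the column
observed = data['col'].dropna().values
missing_mask = data['col'].isnull()
data.loc[missing_mask, 'col'] = np.random.choice(observed, size=missing_mask.sum())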

# to count how many values are missing
data['col'].isnull().sum()
# replace
data['col'].fillna(data['col'].mean())
data['col'].fillna(data['col'].median())
data['col'].fillna(-9999)
# drop rows
data[ data['col'].isnull() == False ]
# this creates a data set which has no missing values in that column
# worst case
data.dropna()
# this can be done if the missing data is very minimal, say < 1% across all variables

Handling Outliers

how to detect

By one definition

  • whatever is less than 1 percentile is outlier.
  • whatever is more than 99 percentile is outlier

Another definition

  • if a value > mean + 3 SD, or a value < mean - 3 SD, then it is an outlier (in the case of a normal distribution)
# to find
np.percentile(data.income, 1)   # 1st percentile (np.percentile takes values between 0 and 100)
data.income.quantile(0.99)      # 99th percentile (Series.quantile takes values between 0 and 1)
# 3-sigma method (mean +/- 3 standard deviations)
upper_end = data.income.dropna().mean() + 3 * data.income.dropna().std()
lower_end = data.income.dropna().mean() - 3 * data.income.dropna().std()

how to handle outliers

# if we know the valid range of a variable, we can handle outliers by clipping to that range
data['income'].clip(upper=10000)   # replace values greater than 10000 with 10000
data['income'].clip(lower=0)       # replace negative values with 0
# percentile method: cap at the 5th and 95th percentiles
data['income'].dropna().quantile(0.95)
data['income'].dropna().quantile(0.05)
Recap of data frame operations covered:

  • indexing
  • create / drop variables or rows
  • rename variables
  • missing value treatment
  • duplicates
  • outliers
  • binning of numeric variables
  • type conversion
  • data summarization (groupby)
  • long / wide data conversion (stack / unstack, pivots)
  • merging

Type Conversion

Converting Numerical Variables into Categorical Variables

  • Use Binning
data['age_group'] = pd.cut(data['age'], bins=[0,25,35,100], labels=["young","medium","old"])

Converting Categorical Variables into Numerical Variables

  • Create Dummy Variables
  • It is required only for predictive analysis; for descriptive analysis we don't need to convert categories into numerical values
Income has the following possible values: L, M, MH, H, VH. We cannot simply assign them the values 1 to 5.
This would imply VH is 5x the value of L; we would be unnecessarily introducing an order into the values which may not exist in reality.
Instead we can use K-1 dummy variables, with each combination of dummy variables representing a category. Dummy variables can take the values 0 or 1.
While even the 0s and 1s may seem like ordinals, this is handled by the intercept. For example, in
y = 3x + 5
5 is the value at x = 0
8 is the value at x = 1
Why only K-1 dummy variables? Because the K'th value is accounted for in the intercept.

Code Sample

dummy_var_data = pd.get_dummies(data['age_group'], prefix="D")
# this creates K dummy variables, because pandas does not know the category can be represented by K-1 variables;
# we can drop the K'th column ourselves (or pass drop_first=True)

Data Summarization

df.apply(lambda_function)           # by default the function is applied column-wise
df.apply(lambda_function, axis=1)   # if you want to apply it row-wise
# Eg. define a var_summary(x) function that returns a pd.Series of statistics
# indexed by ['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'],
# then apply it to every numeric column:
data._get_numeric_data().apply(lambda x: var_summary(x)).T
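A minimal sketch of such a var_summary function; the exact statistics are inferred from the index names above, so treat the implementation as illustrative:

import pandas as pd

def var_summary(x):
    # one row of summary statistics per numeric column
    return pd.Series([x.count(), x.isnull().sum(), x.sum(), x.mean(), x.median(),
                      x.std(), x.var(), x.min()],
                     index=['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'])

data._get_numeric_data().apply(lambda x: var_summary(x)).T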

Group By

grouped_data = data[['gender', 'age']].groupby('gender').mean()
# average age w.r.t. gender:
# gender | age
#    0   | 52
#    1   | 53
type(grouped_data)
grouped_data.max()
grouped_data.mean()

# apply multiple statistics at once
# df is a data frame, k1 is a categorical variable
pd.concat([ df.groupby(df.k1).agg('mean'),
            df.groupby(df.k1).agg('std') ], axis=1)
# returns the mean and std of each numerical variable for each distinct value of k1,
# but now we cannot tell which column is the mean and which is the std
pd.concat([ df.groupby(df.k1).agg('mean').add_prefix('mu_'),
            df.groupby(df.k1).agg('std').add_prefix('sigma_') ], axis=1)
pd.concat([ grouped_data.max(), grouped_data.min() ], axis=1, keys=['max', 'min'])

Pivot


df.pivot(index='date', columns='item', values='status')
# this is very similar to an Excel pivot
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc='sum') )
# for multiple aggregations
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc=['sum','mean']) )

Merging (Joins)

pd.merge(df1, df2, how="left", on="key", left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True, copy=True, suffixes=("_l", "_r"))
# we can do a left join on a specific key column (on / left_on / right_on),
# or join on the indexes using left_index / right_index

Another way

df1.join(df2)                     # left join between 2 data frames (on the index)
pd.concat([s1, s2, s3], axis=0)   # axis=0 joins vertically, axis=1 joins horizontally

Random Sampling

  1. With replacement
  • also called SRSWR (simple random sample with replacement)
  • the sampled item is not removed from the original: pick a ball from a jar, note the number on the ball, and put it back in the jar
  • samples are random and representative of the population
  • we can create a sample of any size; the data can be amplified
# with replacement: sample 758 rows
ts.sample(n=758, replace=True).duplicated().value_counts()
# sample 70% of the total population
df.sample(frac=0.7, replace=True).duplicated().value_counts()

2. Without replacement

  • pick a ball from a jar, note the number on the ball, but don’t put it back in jar
  • samples are unique
# without replacement
ts.sample(n=758, replace=False).duplicated().value_counts()

3. Stratified sampling

  • e.g. people in the north are taller than people in the north east; if we don't take samples from each state, the overall MEAN / AVG will not give a correct picture
  • select samples from each state, then apply SRS-WR or SRS-WOR state-wise, as in the sketch below
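A minimal sketch of stratified sampling with pandas, assuming a 'state' column (the column name and the 10% fraction are illustrative):

# take a 10% simple random sample (without replacement) from every state separately
stratified = data.groupby('state', group_keys=False).apply(lambda g: g.sample(frac=0.1))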

What do we do with Outliers

  • We do flooring and capping

Share these with the business. If such an observation is likely to recur, we can apply some treatment.

Info / Profiling of DataFrames

import pandas_profiling
pandas_profiling.ProfileReport(df)
