Just Enough Python for Data Science
String
Strings are immutable
Subset
Extract a subset of characters from a given string.
str[start_index:stop_index:skip]
Subsetting is right-exclusive: the stop index is not included.
# get the 2nd character (index = 1)
str[1]
# start with 3rd character (index = 2) and till 4th character (index = 3)
str[2:4]
# get the first 6 characters
str[:6]
# start from 2nd character (index = 1) till the end
str[1:]
#start with 1st index (2nd char) and get till second last char (-1 is last char and subsetting is right exclusive)
str[1:-1]
# print alternate characters
str[::2]
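A quick worked example of the slices above (using s rather than str, since str shadows the built-in type):

```python
s = "Python!"   # indices: P=0, y=1, t=2, h=3, o=4, n=5, !=6
print(s[1:])    # ython!
print(s[2:4])   # th
print(s[:6])    # Python
print(s[1:-1])  # ython
print(s[::2])   # Pto!
```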
Lambda
Map
A map function applies a lambda function to each value in a list, producing a result with the same number of values as the original list. In Python 3 it returns a lazy map object; wrap it in list() to materialize the values.
m = map(lambda x: x + 10, range(10))
list(m)  # to display as a list
Reduce
A reduce function applies a lambda function cumulatively across a list, collapsing it to a single summary value.
import functools
x = functools.reduce( (lambda x, y: x * y), [1, 2, 3, 4] )
# The lambda function is applied cumulatively, e.g.
# 1, 2 -> 1 * 2 = 2
# 2, 3 -> 2 * 3 = 6
# 6, 4 -> 6 * 4 = 24
Filter
It filters the items of the original list based on a given criterion and returns the filtered list.
list( filter( lambda x: x % 2 != 0, range(10) ) )  # returns [1, 3, 5, 7, 9]
Zip
# merges the sequences passed as params, returning tuples that contain the element at the same index from each sequence
# the result is truncated to the length of the shortest sequence
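For example:

```python
ids = [1, 2, 3]
names = ['a', 'b']  # shorter sequence
pairs = list(zip(ids, names))
print(pairs)  # [(1, 'a'), (2, 'b')] - truncated to the shortest input
```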
Enumerate
# returns both index and element of the sequence as a tuple
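For example:

```python
fruits = ['apple', 'banana', 'cherry']
pairs = list(enumerate(fruits))
print(pairs)  # [(0, 'apple'), (1, 'banana'), (2, 'cherry')]

# typical use: iterate with both index and element
for index, fruit in enumerate(fruits):
    print(index, fruit)
```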
Comprehensions
List Comprehensions
new_list = [ n**2 for n in range(1, 21) if n % 2 == 0 ]
# the if clause works like the FILTER function
Dictionary Comprehensions
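The notes don't include an example here; a dictionary comprehension mirrors the list version, producing key: value pairs:

```python
squares = {n: n**2 for n in range(1, 6)}
print(squares)       # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# with a filter clause, like the list comprehension above
even_squares = {n: n**2 for n in range(1, 6) if n % 2 == 0}
print(even_squares)  # {2: 4, 4: 16}
```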
NumPy
Arrays
np.ndarray((2, 2, 3), dtype='float')  # low-level constructor; np.zeros((2, 2, 3)) is more common
returns a 2 x 2 x 3 array of type float (values are uninitialized)
It is represented as 2 blocks, each of shape 2 x 3.
slicing an array
b = a[1, 2:4]  # row at index 1, columns 2 and 3
trace
sum
sum(axis=0)
mean
sort
ndim
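A small example of these array methods (the matrix is my own):

```python
import numpy as np

a = np.array([[3, 1],
              [2, 4]])

print(np.trace(a))    # 7, the sum of the diagonal (3 + 4)
print(a.sum())        # 10, the sum of all elements
print(a.sum(axis=0))  # [5 5], column-wise sums
print(a.mean())       # 2.5
print(np.sort(a))     # [[1 3]
                      #  [2 4]], each row sorted
print(a.ndim)         # 2, the number of dimensions
```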
condition and lookup
Boolean Indexing
a > 2     # returns a same-size boolean array: True where element > 2, else False
a[a > 2]  # returns the elements where element > 2
Reshaping an array
f is a 4x4 array
np.reshape(f, (1, 16))  # returns a 2-D array with 1 row and 16 columns
f.reshape(-1)           # returns a true 1-D array of 16 elements
Numpy Maths
add
subtract
multiplication
a * b            # element-wise multiplication
np.matmul(a, b)  # matrix multiplication
# number of columns of a should be equal to the number of rows of b
Transpose
a.T # this will transpose
Function broadcasting
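Broadcasting lets NumPy apply an operation between arrays of different shapes without explicit loops; a small sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
b = np.array([10, 20, 30])   # shape (3,)

# b is stretched ("broadcast") across each row of a
print(a + b)  # [[11 22 33]
              #  [14 25 36]]
# scalars broadcast too
print(a * 2)  # [[ 2  4  6]
              #  [ 8 10 12]]
```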
Advanced Matrix Operations
Solving linear equations
2x + 2y + 3z = 5
3x + y + 4z = 7
4x + 3y = 10
[x, y, z] = np.linalg.solve(feature_matrix, results_matrix)
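Filled in with concrete matrices for the three equations above (the variable names match the notes; the solution check is mine):

```python
import numpy as np

# coefficients of x, y, z in each of the three equations
feature_matrix = np.array([[2, 2, 3],
                           [3, 1, 4],
                           [4, 3, 0]])
results_matrix = np.array([5, 7, 10])

x, y, z = np.linalg.solve(feature_matrix, results_matrix)
# plugging the solution back in reproduces the right-hand side
print(np.allclose(feature_matrix @ np.array([x, y, z]), results_matrix))  # True
```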
Matrix Inversion
Eigen value and Eigen Vector
To identify where is the variance in data
Identify key variables, that contribute most of the variance
In the above example, total variance = 1 + 2 + 3 = 6.
The third column represents 50% of the variance (3/6). Why the 3rd column and not the 3rd row? Because columns represent variables, not rows.
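Neither operation is shown in code in the notes; a minimal sketch with a matrix of my own choosing:

```python
import numpy as np

a = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Matrix inversion: a @ inv(a) gives the identity matrix
a_inv = np.linalg.inv(a)

# Eigen decomposition: for this diagonal matrix the eigenvalues
# are just the diagonal entries
values, vectors = np.linalg.eig(a)
print(values)  # [2. 3.]
```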
Data Frames
Subsetting data
To get a subset of a data frame:
data.iloc[row_start:row_end, col_start:col_end]  # give column indexes
data.loc[0:99, ['Id', 'Name']]                   # give column names
Creating new variables in existing data frame
data['new_var'] = data['var_one'] + data['var_two']
OR
new_data = data.assign(a = data.var_one + data.var_two, b = data.var_three + data.var_four)
# the existing data frame is not modified; a new data frame is returned
# multiple new attributes can be added in one go this way
Accessing variables
data['varname'] is the same as data.varname (dot access only works for column names that are valid identifiers and don't clash with DataFrame attributes)
Dropping variables
dropping columns
data.drop( ['col1', 'col2'], axis=1, inplace = True )
# axis = 1 means drop columns (axis = 0 drops rows)
inplace = True modifies the data frame itself
inplace = False returns a new data frame without those columns, leaving the original unchanged
dropping rows
data.drop( [0,1], inplace = True )
# this removes the rows with index labels 0 and 1
BINNING — converting numerical variables to categorical variables
- user defined | can be implemented using splitting of data frame
- equi-dense (equal frequency)
- equidistant (equal width)
income  # contains the numerical value of income
income_grp = High / Medium / Low

# Approach 1 - equi-dense binning
# assign income_grp a value based on the decile of the income
# each group has the same number of people

# Approach 2 - equidistant binning
# create equal-width ranges, then assign each value into a category
pd.cut( data['colname'], 3, labels = ['0-33', '33-66', '66-100'] )
# colname is the numerical column being converted into a categorical one
# the bin edges are sorted automatically
# returns rownum | colname (with categorical values)

# binning into quantiles
data['deciles'] = pd.qcut( data['colname'], 10, labels = range(1, 11, 1) )
Q> why do we convert numerical to categorical?
A> To simplify our interpretation
Rename columns
data.rename(columns = {"oldvar1":"newvar1", ... })
Sort data
data.sort_index(axis=1, ascending = False)
data.sort_values(by=['income', 'utilization'], ascending=[True, False], inplace = True)
Convert data types
data.MonthlyIncome.astype('int64', errors = 'raise')
# raise_on_error is deprecated in favour of errors; with errors = 'ignore', failed conversions return the original data unchanged
Indexing and de-indexing
data.set_index("id", inplace = True)
# now id is part of the index, not a regular column
# so column-style filtering / conditional statements no longer apply to id
data.reset_index()
# nested (hierarchical) indexes
data.set_index( ["customer_segment", "id"] )
Handling duplicates
data[ data.duplicated() ]  # checks all columns together for duplication
# selected columns
dups = data[ data['customer_name'].duplicated() ]
uniques = data[ data['customer_name'].duplicated() == False ]
uniques_alt = data.drop_duplicates()
uniques_alt = data.drop_duplicates(keep='last')
Missing value treatment
- systemic missing - no need to replace, or just replace with a sentinel like -9999
- random missing - replace
When replacing random missing values: if missing data < 10%, replace with the mean (or the median if there are very high outliers).
If > 10%, then within each quantile replace with a random value from that quantile. This ensures the overall mean is not changed.
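The quantile-based replacement above is not shown in code; as one hedged simplification, filling each gap by sampling from the observed values preserves the overall distribution (and hence the mean). The function name and seed are my own:

```python
import numpy as np
import pandas as pd

def fill_from_distribution(s, seed=0):
    """Fill NaNs by sampling from the observed values - a simplified
    stand-in for 'replace with a random value from the same quantile'."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    observed = s.dropna().to_numpy()
    missing = out.isnull()
    out[missing] = rng.choice(observed, size=missing.sum())
    return out

filled = fill_from_distribution(pd.Series([40.0, None, 55.0, None, 70.0]))
print(filled.isnull().sum())  # 0
```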
# to detect how many are missing
data['col'].isnull().sum()
# replace
data['col'].fillna(data['col'].mean())
data['col'].fillna(data['col'].median())
data['col'].fillna(-9999)
# drop rows
data[ data['col'].isnull() == False ]
# this creates a data set which has no missing values
# worst case
data.dropna()
# this can be done if the missing data is very small, say < 1% across all variables
Handling Outliers
how to detect
By one definition
- whatever is below the 1st percentile is an outlier
- whatever is above the 99th percentile is an outlier
Another definition
- if value > mean + 3 SD, or value < mean - 3 SD, then it is an outlier (in case of normal distributions)
# to find
np.percentile(data.income, 1)  # 1st percentile (np.percentile takes values 0-100)
data.income.quantile(0.99)     # 99th percentile (quantile takes values 0-1)
# 6 sigma method (mean +/- 3 SD)
upper_end = data.income.dropna().mean() + 3 * data.income.dropna().std()
lower_end = data.income.dropna().mean() - 3 * data.income.dropna().std()
how to handle outliers
# if we know the valid range of a variable, we can handle outliers based on that
data['income'].clip(upper=10000)  # replace greater values with 10000 (clip_upper in older pandas)
data['income'].clip(lower=0)      # replace negative values with 0 (clip_lower in older pandas)
# percentile method
data['income'].dropna().quantile(0.95)
data['income'].dropna().quantile(0.05)
1. indexing
2. create / drop variables or rows
3. rename variables
7. missing value treatment
8. duplicates
9. outliers
10. binning of numeric var
11. type conv
12. data summarization
- groupby
13. long / wide data conversion
- stack / unstack
- pivots
14. merging
Type Conversion
Converting Numerical Variables into Categorical Variables
- Use Binning
data['age_group'] = pd.cut(data['age'], bins=[0,25,35,100], labels=["young","medium","old"])
Converting Categorical Variables into Numerical Variables
- Create Dummy Variables
- It is required only for predictive analysis. For descriptive analysis, we don't need to convert categories into numerical values
Income has the following possible values: L, M, MH, H, VH
We cannot simply assign them the values 1 to 5:
this would imply VH is 5x the value of L, unnecessarily introducing an order in the values which may not exist in reality.
Instead we can use K-1 dummy variables, with each combination of dummy variables representing one category.
Dummy variables can take the values 0 or 1.
While even the 0s and 1s may seem like ordinals, this is handled by the intercept:
y = 3x + 5
5 is the value at x = 0
8 is the value at x = 1
Why K-1 dummy variables? Because the K'th value is accounted for in the intercept.
Code Sample
dummy_var_data = pd.get_dummies(data['age_group'], prefix="D")
# This creates K dummy variables; pass drop_first=True (or drop one column manually) to keep only K-1
Data Summarization
df.apply(lambda_function)
# to apply row-wise
df.apply(lambda_function, axis=1)
# E.g. a column summary function (its body is left as a TODO in the notes):
# var_summary(x) returns
# pd.Series([...], index=['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'])
data._get_numeric_data().apply(lambda x: var_summary(x)).T
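The notes leave var_summary as a TODO; one sketch consistent with the listed statistics (using the public select_dtypes instead of the private _get_numeric_data) might be:

```python
import pandas as pd

def var_summary(x):
    # one column of summary statistics per numeric variable
    return pd.Series(
        [x.count(), x.isnull().sum(), x.sum(), x.mean(),
         x.median(), x.std(), x.var(), x.min()],
        index=['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN'])

df = pd.DataFrame({'a': [1.0, 2.0, None, 4.0], 'b': list('wxyz')})
print(df.select_dtypes('number').apply(var_summary).T)
```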
Group By
grouped_data = data[['gender', 'age']].groupby('gender').mean()
# average age w.r.t. gender
# gender | age
#   0    | 52
#   1    | 53
type(grouped_data)
grouped_data.max()
grouped_data.mean()

# apply multiple statistics at once
# df is a data frame, k1 is a categorical variable
pd.concat([ df.groupby(df.k1).agg('mean'),
            df.groupby(df.k1).agg('std') ], axis = 1)
# returns mean and std of each numerical variable, w.r.t. each distinct value of k1
# but now we cannot differentiate which is mean and which is std
pd.concat([ df.groupby(df.k1).agg('mean').add_prefix('mu_'),
            df.groupby(df.k1).agg('std').add_prefix('sigma_') ], axis = 1)
pd.DataFrame(pd.concat([grouped_data.max(), grouped_data.min()], axis=1), columns=['max', 'min'])
Pivot
df.pivot(index='date', columns='item', values='status')
# this is very similar to an Excel pivot
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc='sum') )
# for multiple aggregations
print( pd.pivot_table(data=df, index='date', columns='item', values='status', aggfunc=['sum', 'mean']) )
Merging (Joins)
pd.merge(df1, df2, how="left", on="key", left_on=None, right_on=None, left_index=False, right_index=False, sort=True, copy=True, suffixes=("_l", "_r"))
# we can join on a specific key column, or on the left / right index
Another way
df1.join(df2)  # left join between 2 data frames on their indexes
pd.concat([s1, s2, s3], axis = 0)  # axis = 0 stacks vertically, axis = 1 joins horizontally
Random Sampling
1. With replacement
- the sampled item is put back, so it can be picked again
- also called SRSWR (simple random sample with replacement)
- samples are random
- representative of population
- can create a sample of any size; data can be amplified
- pick a ball from a jar, note the number on the ball, and put it back in jar
ts.sample(n=758, replace=True).duplicated().value_counts()
# sample 70% of the total population
df.sample(frac=0.7, replace=True).duplicated().value_counts()
2. Without replacement
- pick a ball from a jar, note the number on the ball, but don’t put it back in jar
- samples are unique
# without replacement
ts.sample(n=758, replace=False).duplicated().value_counts()
3. Stratified sampling
- e.g. people in the north are taller than people in the north east; if we don't take samples from each state, the MEAN / AVG will not give a correct picture
- select samples from each state, then apply SRS-WR or SRS-WOR state-wise
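A possible sketch of stratified sampling using pandas group-wise sampling (DataFrameGroupBy.sample needs pandas >= 1.1; the column names and data are my own):

```python
import pandas as pd

# toy data: two strata of different sizes
df = pd.DataFrame({
    'state':  ['N'] * 6 + ['NE'] * 4,
    'height': [180, 178, 182, 179, 181, 183, 160, 158, 162, 161],
})

# take the same fraction from every stratum (SRS-WOR within each state)
sample = df.groupby('state').sample(frac=0.5, random_state=0)
print(sample['state'].value_counts())  # N: 3, NE: 2 - every state is represented
```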
What do we do with Outliers
- We do flooring and capping
Share these with business. If this observation might reoccur, we can do some treatment.
Info / Profiling of DataFrames
pandas_profiling.ProfileReport(df)