Top 10 Python tips to make life easier in data analysis

Tricks for quickly summarizing and styling data for data analysis

Published in

Nerd For Tech

6 min read · Jun 13, 2021

Presentation matters: if content is 80% of the value, looks are the other 20%. It pays to collect tips and tricks that make work easier and more efficient, because even minor shortcuts can speed you up noticeably. Some well-known and some lesser-known tricks are shown below with code and examples.

Data

# Importing libraries
import pandas as pd
import numpy as np

# Dataset
df = pd.DataFrame({
  'Subject':['S1', 'F1', 'A1', 'S1', 'S1','M1','F1'],
  'Marks1':[10, 20,10, 40, 20, 60, 20],
  'Marks2':[20, 40, 20, 30, 10, 80, 39],
  'Review': ['Good can do better. Better luck next time', 'Happy', 'Good', 'Best', 'Good! satisfy with the result', 'Better', 'Best'],
  'Code':[1,6,2,6,7,6,1]
               })
df.head(3)

1. Crosstab

“Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.” I personally find the crosstab function very useful.

# Data set
dff = pd.DataFrame({
'Name':['Alisa','Bobby','Cathrine', 'Alisa','Bobby','Cathrine',
'Alisa','Bobby'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
'History','History'],
'Score':[62,47,55, 74,31,77, 100,63],
'Group_rank': [1.0,3.0,2.0,2.0,3.0,1.0,1.0,2.0]})
dff

A crosstab can take multiple index factors and multiple column factors. Setting margins=True adds the total count of each row and column, and margins_name changes the label of those totals.

pd.crosstab([dff.Name, dff.Group_rank], [dff.Subject, dff.Score], margins=True, margins_name="Total")

crosstab also accepts a values parameter: an array of numerical values to aggregate for each combination of factors, which must be paired with aggfunc. Below, the result is additionally rounded to 2 decimals and NaN values are replaced with 0.

pd.crosstab(df.Subject, df.Code, values=df.Marks1, aggfunc='mean', margins=True, margins_name="Total").round(2).replace(np.nan, 0)

The most popular parameter is normalize, which accepts these options:

a) normalize=True or 'all' normalizes over all values, so the whole table sums to 1.

b) normalize='index' normalizes over each row.

c) normalize='columns' normalizes over each column.

pd.crosstab(df.Subject, df.Marks1, normalize=True)
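To see how the row and column options differ, here is a minimal sketch on a trimmed copy of the dataset above (only the Subject and Code columns are kept, and the names by_row and by_col are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', 'S1', 'M1', 'F1'],
    'Code': [1, 6, 2, 6, 7, 6, 1],
})

# Share of each Code within a Subject: every row sums to 1
by_row = pd.crosstab(df.Subject, df.Code, normalize='index')

# Share of each Subject within a Code: every column sums to 1
by_col = pd.crosstab(df.Subject, df.Code, normalize='columns')

print(by_row.round(2))
print(by_col.round(2))
```

Use 'index' when the question is "how is each subject split across codes" and 'columns' when it is "which subjects make up each code".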

2. Styling

a) Tables

Have you ever looked at a plain table and thought it is good but not great? To spice things up, our savior is CSS. For more examples, see the linked documentation below.

pandas.io.formats.style.Styler.set_table_styles — pandas 1.2.4 documentation

If supplying a list, each individual table_style should be a dictionary with selector and props keys. selector should be a CSS selector that…

pandas.pydata.org

# Styling the data frame
df.style.set_table_styles(
[{'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#eee')]},
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 {'selector': 'th',
  'props': [('background', '#000'),
            ('color', 'white'),
            ('font-family', "'Lato', sans-serif")]},
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},
]).hide_index()  # removed in pandas 2.0; on pandas >= 1.4 use .hide(axis="index")

If you want to make the table interactive, switching to the plotly library is a great option.

b) Markdown

Markdown cells in a Jupyter notebook also render HTML, so colored alert boxes like these make a notebook more fun.

<div class="alert alert-block alert-danger">
It is a danger box. </div>

<div class="alert alert-block alert-info">
It is an info box. </div>

<div class="alert alert-block alert-warning">
It is a warning box. </div>

<div class="alert alert-block alert-success">
It is a success box. </div>

3. Working with text data

a) Extract

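str.extract is used to pull words or digits out of text: each capture group in the regular expression becomes a column of the result, and rows that do not match yield missing values.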

pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"([abc])(\d)", expand=False)
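Named capture groups make the result easier to read, since the group names become the column names. A small sketch, where the extra "d4" value and the group names letter and digit are my own additions for illustration:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3", "d4"], dtype="string")

# (?P<name>...) names a group; each name becomes a column of the result
parts = s.str.extract(r"(?P<letter>[abc])(?P<digit>\d)")

# "d4" does not match [abc], so its row is filled with missing values
print(parts)
```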

b) lower(), upper()

pandas provides the str.lower() and str.upper() methods, which convert every string in a column to lowercase or uppercase in a single call.

df['upper'] = df['Review'].str.upper()
df['lower'] = df['Review'].str.lower()
df[:5]

4. Memory Usage

memory_usage reports how many bytes each column of the data frame (and its index) occupies. It is most useful when preparing large datasets for machine learning or deep learning models, where memory can become the bottleneck.

df.memory_usage()
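Two follow-ups are often handy here: deep=True, which also measures the contents of text columns, and converting a repetitive text column to the category dtype. The sketch below repeats the Subject values 1000 times (an assumption on my part, just to make the difference visible):

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', 'S1', 'M1', 'F1'] * 1000,
    'Marks1': [10, 20, 10, 40, 20, 60, 20] * 1000,
})

shallow = df.memory_usage()        # bytes per column, plus the index
deep = df.memory_usage(deep=True)  # also inspects the stored strings

# A column with few distinct values is usually cheaper as a category
before = df['Subject'].memory_usage(deep=True)
after = df['Subject'].astype('category').memory_usage(deep=True)

print(deep)
print(f"Subject column: {before} bytes -> {after} bytes as category")
```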

5. Options

When displaying a data frame, you have probably run into these issues:

When there are many columns, some of the middle columns get omitted.

Columns containing long text get truncated, which hurts when working with text data that is too long to fit.

Columns with a float datatype get cut short in the display when they have too many digits after the decimal point.

pd.set_option controls these display settings.

# Setting the col width value
pd.set_option('max_colwidth', 500)
df[['Subject','Review']]

pd.set_option('min_rows', 2)
df[['Subject','Review']]
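If you only need different settings for one cell, pd.option_context applies them temporarily and restores the old values afterwards. A minimal sketch, assuming a made-up 30-column frame named wide:

```python
import pandas as pd

# Hypothetical wide frame: 30 float columns with many decimals
wide = pd.DataFrame({f'col{i}': [1.23456789] for i in range(30)})

before = pd.get_option('display.precision')

# Show every column and only 2 decimals, but just inside this block
with pd.option_context('display.max_columns', None,
                       'display.precision', 2):
    inside = pd.get_option('display.precision')
    text = repr(wide)

after = pd.get_option('display.precision')  # back to the previous value
print(text)
```

pd.reset_option('display.max_colwidth') does the same for a setting changed with set_option.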

6. Groupby

groupby involves a combination of splitting the data into groups, applying a function to each group, and combining the results. It is mainly used to summarize large amounts of data group by group.

tbl = df.groupby(['Subject','Code']).agg({'Marks1': ['max', 'mean'],
                                 'Marks2': ['sum','min','count']})
tbl

If you don’t like this two-level column layout, there is always room for change: reset the index and assign flat column names.

tbl = tbl.reset_index()
tbl.columns = ['Subject', 'Code', 'Marks1_max', 'Marks1_mean', 'Marks2_sum', 'Marks2_min', 'Marks2_count']
tbl
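As an alternative to renaming the columns by hand, pandas also supports named aggregation, where each output column is declared as name=(column, function). This is a sketch of the same summary, not necessarily how the original notebook does it:

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', 'S1', 'M1', 'F1'],
    'Marks1': [10, 20, 10, 40, 20, 60, 20],
    'Marks2': [20, 40, 20, 30, 10, 80, 39],
    'Code': [1, 6, 2, 6, 7, 6, 1],
})

# as_index=False keeps Subject and Code as ordinary columns
tbl = (
    df.groupby(['Subject', 'Code'], as_index=False)
      .agg(Marks1_max=('Marks1', 'max'),
           Marks1_mean=('Marks1', 'mean'),
           Marks2_sum=('Marks2', 'sum'),
           Marks2_min=('Marks2', 'min'),
           Marks2_count=('Marks2', 'count'))
)
print(tbl)
```

The column names live next to the aggregation that produces them, so they cannot drift out of order.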

7. Listing all unique values in a group

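To list the unique values of one column within each group, drop the duplicate rows, group, collect the values into a list, and join them into a string.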

dff = pd.DataFrame(dict(A=['A','A','A','A','A','B','B','B','B'],
                       B=[1,1,1,2,2,1,1,1,2],
                       C=['CA','NY','CA','FL','FL',     
                          'WA','FL','NY','WA']))
dff[:3]

tbl = dff[['A', 'B', 'C']].drop_duplicates()\
                         .groupby(['A','B'])['C']\
                         .apply(list)\
                         .reset_index()
tbl['C'] = tbl.apply(lambda x: (','.join([str(s) for s in x['C']])), axis = 1)
tbl
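The same result can be reached in fewer steps by joining the unique values inside the aggregation. A possible shorter version, using the same data:

```python
import pandas as pd

dff = pd.DataFrame(dict(A=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                        B=[1, 1, 1, 2, 2, 1, 1, 1, 2],
                        C=['CA', 'NY', 'CA', 'FL', 'FL',
                           'WA', 'FL', 'NY', 'WA']))

# unique() keeps first-appearance order, so no drop_duplicates is needed
tbl = (
    dff.groupby(['A', 'B'])['C']
       .agg(lambda s: ','.join(s.unique()))
       .reset_index()
)
print(tbl)
```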

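8. Other functions

a) Cumulative sum

cumsum gives the running total of a column. The example below accumulates Marks1 over the whole data frame; combine it with groupby to get a cumulative sum for each group.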

df['cumulative_sum'] = df['Marks1'].cumsum()
df
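For a running total that restarts within each group, cumsum can be chained after groupby. A short sketch on the Subject and Marks1 columns (the new column names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', 'S1', 'M1', 'F1'],
    'Marks1': [10, 20, 10, 40, 20, 60, 20],
})

# Running total over the whole column
df['running_total'] = df['Marks1'].cumsum()

# Running total per Subject, aligned with the original row order
df['cumsum_by_subject'] = df.groupby('Subject')['Marks1'].cumsum()
print(df)
```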

b) squeeze

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case, you can safely call squeeze to ensure you have a Series. In the example below, filtering leaves a one-element Series, which squeeze reduces to a scalar.

sr = pd.Series([100, 215, 32, 123, 24, 65])
sr_temp = sr[sr % 13 == 0]
ans = sr_temp.squeeze()
print(sr_temp)
print("Squeeze: ", ans)
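The DataFrame case described above looks like this. It is a small sketch with made-up frames, showing that squeeze only removes axes of length one:

```python
import pandas as pd

# One column, several rows: squeezing the columns axis gives a Series
single_col = pd.DataFrame({'Marks1': [10, 20, 30]})
as_series = single_col.squeeze('columns')

# One column and one row: both axes collapse, leaving a scalar
one_cell = pd.DataFrame({'Marks1': [10]}).squeeze()

# Two columns, two rows: nothing to squeeze, the DataFrame is unchanged
two_cols = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).squeeze()

print(type(as_series).__name__, one_cell, type(two_cols).__name__)
```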

c) Sample

The sample method selects rows or values randomly from a Series or DataFrame. It is useful when we want a random sample from a large dataset.

df.sample(n=2)
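Two parameters are worth knowing: random_state makes the draw reproducible, and frac samples a fraction of the rows instead of a fixed count. A brief sketch (the seed values are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', 'S1', 'M1', 'F1'],
    'Marks1': [10, 20, 10, 40, 20, 60, 20],
})

# The same seed always returns the same rows
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)

# frac=1 returns every row in random order, i.e. a shuffle
shuffled = df.sample(frac=1, random_state=0)
print(first)
print(shuffled)
```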

9. Finding unique values

nunique counts the distinct values along an axis: per column by default, or per row with axis=1. It is mostly used with categorical features that have too many unique values to count manually.

df.nunique()
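By default nunique ignores missing values; dropna=False counts them as one more value, and value_counts shows how often each value occurs. A small sketch on a frame with one missing Subject, which I added for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': ['S1', 'F1', 'A1', 'S1', None],
    'Code': [1, 6, 2, 6, 1],
})

per_column = df.nunique()                # missing values are skipped
with_missing = df.nunique(dropna=False)  # missing counts as a value
counts = df['Subject'].value_counts()    # frequency of each value

print(per_column)
print(with_missing)
print(counts)
```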

10. Pandas profiling

You can find the notebook here and play around.

