Python & Jupyter notebook protips

A collection of Tips & Tricks to better your code

Show me all the code first; chatter later!

1. Time your code:

%timeit is an IPython magic function that times a particular piece of code. It's especially handy when you have some slow code and you're trying to pinpoint the bottleneck.

%%timeit
#or
import time
start = time.time()
l = [1, 2, 3, 4]
len(l)
print("--- %s seconds ---" % (time.time() - start))

%%timeit runs a statement repeatedly (up to 100,000 loops by default, chosen automatically) and reports statistics over the best runs, so don't leave it in your code; document the result inline instead. %%time, by contrast, reports a single run of the cell.
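Outside a notebook, the standard-library timeit module gives the same kind of measurement with an explicit loop count; a minimal sketch:

```python
import timeit

# run the statement 100,000 times and report the total elapsed seconds
elapsed = timeit.timeit("len([1, 2, 3, 4])", number=100_000)
print(elapsed)
```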

2. Check memory size of your data frames:

Want to write your data frame to a csv file on your local machine, but check its size first?

df.memory_usage()  #memory used by each column, in bytes
df.memory_usage(index=True).sum()  #total memory used, in bytes
#or
import sys
sys.getsizeof(df)  #approximate in-memory size of the whole object
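To turn that byte count into something readable, a small helper can format it; a sketch with made-up data (human_size is a hypothetical helper, not a pandas function):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [1.5] * 1000})

def human_size(n_bytes):
    # hypothetical helper: format a byte count as B/KB/MB/GB
    for unit in ("B", "KB", "MB", "GB"):
        if n_bytes < 1024:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:.1f} TB"

total = df.memory_usage(index=True).sum()
print(human_size(total))  # total size depends on dtypes and platform
```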

3. Compress all at once:

You can write pandas objects directly to compressed files (gzip, bz2, zip, and xz are supported), so you don't have to write each file to your local machine and then compress it.

df.to_json('my_dataframe.json.gz', orient='records',
           lines=True, compression='gzip')
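pandas can read the compressed file straight back as well; a quick round-trip sketch with a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})
df.to_json("my_dataframe.json.gz", orient="records",
           lines=True, compression="gzip")

# compression can also be inferred from the .gz extension
restored = pd.read_json("my_dataframe.json.gz", orient="records",
                        lines=True, compression="gzip")
```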

4. Pickle it up

Want to preserve the data types of every column of your data frame and make sure no file-format conversion affects them when you load it back up? Try pickling it. Pickle is a serialization format for storing a pandas data frame: you are writing the exact representation of your data frame to disk.

df.to_pickle("./dummy.pkl")
unpickled_df = pd.read_pickle("./dummy.pkl")
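A quick check, with made-up data, that pickling preserves a dtype a CSV round trip would lose; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01", "2021-06-15"]),
                   "n": [1, 2]})
df.to_pickle("./dummy.pkl")
unpickled_df = pd.read_pickle("./dummy.pkl")

# the datetime column keeps its dtype; a CSV round trip would load it as strings
print(unpickled_df.dtypes)
```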

5. I can do it in SQL, but how in Python?

On a time crunch to get your results but not sure how to do it in Python? The pandasql package lets you write your query in SQLite syntax and read the result into a data frame.

from pandasql import sqldf
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
query = """select *, (case when col > 5 then 1 else 0 end) as newcol
           from your_table limit 5"""
df = pysqldf(query)
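If you'd rather avoid the extra dependency, the standard-library sqlite3 module plus pandas' own to_sql/read_sql_query covers the same ground; a sketch with made-up data (your_table and col stand in for your own names):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"col": [3, 6, 8]})
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
df.to_sql("your_table", conn, index=False)
out = pd.read_sql_query(
    "select *, (case when col > 5 then 1 else 0 end) as newcol "
    "from your_table limit 5",
    conn,
)
conn.close()
```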

6. Missing out on R’s ggplot package?

Use Python's ggplot package, which has a similar syntax (its more actively maintained successor, plotnine, implements the same grammar).

from ggplot import *

ggplot(aes(x='date', y='value'), data=your_data) +\
geom_line() +\
stat_smooth(colour='blue', span=0.2)

7. High-resolution plots for notebooks:

%config InlineBackend.figure_format = 'retina'

8. Easily set your date parts as date index:

datecols = ['year', 'month', 'day']
df.index = pd.to_datetime(df[datecols])
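A minimal sketch with made-up data, assuming the frame already has year/month/day columns:

```python
import pandas as pd

df = pd.DataFrame({"year": [2021, 2021], "month": [1, 2],
                   "day": [15, 1], "value": [10, 20]})
datecols = ["year", "month", "day"]
# to_datetime assembles a DatetimeIndex from the date-part columns
df.index = pd.to_datetime(df[datecols])
```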

9. Display multiple data frames same cell:

This lets a single cell display every data frame it evaluates, not just the last expression.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

10. Suppress scientific notation:

All your floats showing up with e's and powers of 10 in them?

pd.set_option('display.float_format', lambda x: '%.3f' % x)
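A quick check of the option's effect on a small made-up Series; a sketch:

```python
import pandas as pd

pd.set_option('display.float_format', lambda x: '%.3f' % x)
s = pd.Series([1e-07, 1234567.0])
text = repr(s)  # fixed-point formatting instead of scientific notation
print(text)
```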

11. Outputs to pdf

import fpdf
pdf = fpdf.FPDF(format='letter')
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.write(5, "hello world")  #5 is the line height
pdf.ln()
pdf.output("example.pdf")
#or
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

12. Back up your scratch code & try some more magic commands:

%%writefile scratch.py  #saves the contents of that cell to an external file
%pdb  #toggle automatic debugging whenever an exception is raised
%debug  #post-mortem debugger for the most recent exception
%load_ext autoreload  #with %autoreload 2, reloads modules before running code
%prun  #profile a statement to identify time-consuming code
!ls *.csv  #check which datasets are available in the working folder

Stay tuned for part II of this article on Pyspark.

Share a tip in the comments to make fellow programmers' lives easier. And I'll respond to your comment with perhaps another tip!