Fundamental Python Data Science Libraries: A Cheatsheet (Part 2/4)

Published in HackerNoon.com · 7 min read · Jan 17, 2018

If you are a developer who wants to integrate data manipulation or data science into your product, or you are starting your journey in data science, here are the Python libraries you need to know.

NumPy
Pandas
Matplotlib
Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

pandas

This library is built on top of NumPy, which you may remember from my last article. Pandas takes NumPy’s powerful mathematical array-magic one step further. It allows you to store & manipulate data in a relational table structure.

Focus of the Library

This library focuses on two objects: the Series (1D) and the DataFrame (2D). Each allows you to set:

an index — that lets you find and manipulate certain rows
column names — that lets you find and manipulate certain columns

Having SQL deja-vu yet?

Installation

Open a command line and type in

pip install pandas

Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

A pandas data structure differs from a NumPy array in a couple of ways:

All data in a NumPy array must be of the same data type, whereas a pandas data structure can hold multiple data types (one per column in a DataFrame)
A pandas data structure allows you to name rows and columns
NumPy arrays can have many dimensions, whereas pandas data structures limit you to 1D and 2D.*

*There is a 3D pandas data structure called a Panel, but it is deprecated and has been removed in later pandas releases.
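To make the first difference concrete, here is a minimal sketch. The variable names and values are made up for illustration. It shows NumPy coercing mixed input to a single type, while a DataFrame keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed input is coerced to one type.
mixed_array = np.array([1, 2.5, 'three'])
print(mixed_array.dtype)   # a Unicode string dtype: every element became text

# A DataFrame tracks a dtype per column, so each column keeps its own type.
mixed_df = pd.DataFrame({
    'count': [1, 2, 3],
    'price': [0.5, 1.5, 2.5],
    'label': ['a', 'b', 'c'],
})
print(mixed_df.dtypes)
```

Note that the mix of types is across columns; within a single column pandas still prefers one type.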

Let’s dive in!

import pandas as pd
import numpy as np

Creation

It’s very simple!

You can create a Series or DataFrame from a list, tuple, NumPy array, or even a dictionary! Oh and of course from CSVs and databases.

From an array

# Series
future_array1 = [1,2,3,4,5,6]
array1 = np.array(future_array1)
s = pd.Series(array1)
>>> s
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

The printout you see above has two columns. The one on the left is the index and the one on the right is your data. This index looks like the indexes we are used to when using lists, tuples, arrays, or any other sequence. We will soon see that pandas lets us change it to whatever we like!

# DataFrame
future_array2 = [2,4,6,8,10,12]
array2 = np.array(future_array2)
df = pd.DataFrame([array1, array2]) # each array becomes one row
>>> df
   0  1  2  3   4   5
0  1  2  3  4   5   6
1  2  4  6  8  10  12

The printout you see above has a ton of numbers. The first column on the left is the index. The top row holds the column names (for now 0…5). Again, we will soon see that pandas lets us change them to whatever we like!
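As a quick preview of changing those labels, here is a small sketch. The label names ('counting', 'evens', 'a' to 'f') are invented for this example; assigning to the index and columns attributes is one of several ways to rename them.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]])

# Replace the default integer labels with names (hypothetical labels).
df.index = ['counting', 'evens']
df.columns = ['a', 'b', 'c', 'd', 'e', 'f']

print(df)
print(df.loc['evens', 'c'])  # row label first, then column label: 6
```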

From a dictionary

The dictionary keys will become the index in a Series

# Series
future_series = {0: 'A', 1: 'B', 2: 'C'}
s = pd.Series(future_series)
>>> s
0    A
1    B
2    C
dtype: object

It works a bit differently in a DataFrame — the keys become the column names

# DataFrame
# named future_df rather than dict, to avoid shadowing Python's built-in dict
future_df = {'Normal': ['A', 'B', 'C'], 'Reverse': ['Z', 'Y', 'X']}
df = pd.DataFrame(future_df)
>>> df
  Normal Reverse
0      A       Z
1      B       Y
2      C       X
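The key-to-label mapping is easier to see with non-numeric keys. The sketch below uses made-up data and an optional index argument to name the DataFrame rows; treat it as one illustration rather than the only way to build these objects.

```python
import pandas as pd

# Series: dictionary keys become the index labels.
prices = pd.Series({'apple': 1.2, 'banana': 0.5, 'cherry': 3.0})
print(prices['banana'])   # 0.5

# DataFrame: dictionary keys become column names; index= names the rows.
letters = pd.DataFrame(
    {'Normal': ['A', 'B', 'C'], 'Reverse': ['Z', 'Y', 'X']},
    index=['first', 'second', 'third'],
)
print(letters.loc['second', 'Reverse'])   # Y
```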

Load data

Pandas has many ways to load data, but let’s focus on the standard CSV format.

uploaded_data = pd.read_csv("filename.csv", index_col=0)

The keyword argument index_col is where you specify which column in your CSV should become the index of the DataFrame. For more details on the read_csv function, see the pandas documentation.
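If you want to try read_csv without a file on disk, the sketch below feeds it an in-memory string instead. The CSV contents and column names are invented for illustration, and parse_dates=True is an optional extra that asks pandas to parse the index as dates.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are made up for illustration.
csv_text = """date,open,close
2018-01-15,13.5,13.6
2018-01-16,13.6,11.4
2018-01-17,11.4,11.2
"""

# index_col=0 uses the first column as the index;
# parse_dates=True asks pandas to parse that index as dates.
prices = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(prices)
print(prices.index)
```

With a real file you would pass the path, as in the line above, in place of the StringIO object.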

I love that the pandas library only requires 1 line to import data from a CSV. Who else is over copying and pasting the same lines of code from the csv library? ;)

Use the Index

Your days of text wrangling are over! No more weird list comprehensions or for loops with comments like “# extract this column during given period” or “# sorry for the mess”.

Here is an example DataFrame:

dates = pd.date_range("20160101", periods=6)
data = np.random.random((6,3))
column_names = ['Column1', 'Column2', 'Column3']
df = pd.DataFrame(data, index=dates, columns=column_names)
>>> df
             Column1   Column2   Column3
2016-01-01  0.704351  0.151919  0.505881
2016-01-02  0.242099  0.887256  0.069512
2016-01-03  0.683565  0.305862  0.278066
2016-01-04  0.943801  0.388292  0.221318
2016-01-05  0.353116  0.418686  0.054011
2016-01-06  0.802379  0.720102  0.043310

Indexing a column

>>> df['Column2'] # use the column name's string
2016-01-01    0.151919
2016-01-02    0.887256
2016-01-03    0.305862
2016-01-04    0.388292
2016-01-05    0.418686
2016-01-06    0.720102
Freq: D, Name: Column2, dtype: float64

Indexing a row

>>> df[0:2] # use the standard indexing technique
             Column1   Column2   Column3
2016-01-01  0.704351  0.151919  0.505881
2016-01-02  0.242099  0.887256  0.069512
>>> df['20160101':'20160102'] # use the index's strings
             Column1   Column2   Column3
2016-01-01  0.704351  0.151919  0.505881
2016-01-02  0.242099  0.887256  0.069512

Indexing multiple axes — names

>>> df.loc['20160101':'20160102',['Column1','Column3']]
             Column1   Column3
2016-01-01  0.704351  0.505881
2016-01-02  0.242099  0.069512

Indexing multiple axes — numbers

>>> df.iloc[3:5, 0:2]
             Column1   Column2
2016-01-04  0.943801  0.388292
2016-01-05  0.353116  0.418686
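One detail worth noticing in the examples above: slicing by label includes the end label, while slicing by position excludes the end position, just like a Python list. Here is a small sketch with deterministic data (np.arange instead of random numbers, so the result is repeatable):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(18).reshape(6, 3), index=dates,
                  columns=['Column1', 'Column2', 'Column3'])

by_label = df.loc['20160102':'20160104']   # label slice: end label included
by_position = df.iloc[1:3]                 # position slice: end excluded

print(len(by_label))     # 3 rows
print(len(by_position))  # 2 rows
```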

View Your Data

Quickly check the top and bottom rows:

>>> df.head(2) # first 2 rows
             Column1   Column2   Column3
2016-01-01  0.704351  0.151919  0.505881
2016-01-02  0.242099  0.887256  0.069512
>>> df.tail(2) # last 2 rows
             Column1   Column2   Column3
2016-01-05  0.353116  0.418686  0.054011
2016-01-06  0.802379  0.720102  0.043310

View summary statistics before you dash off for a meeting:

>>> df.describe()
        Column1   Column2   Column3
count  6.000000  6.000000  6.000000
mean   0.621552  0.478686  0.195350
std    0.269550  0.273359  0.180485
min    0.242099  0.151919  0.043310
25%    0.435728  0.326470  0.057887
50%    0.693958  0.403489  0.145415
75%    0.777872  0.644748  0.263879
max    0.943801  0.887256  0.505881
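The result of describe() is itself a DataFrame, so you can pull individual statistics out of it with the same indexing tools shown earlier. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1.0, 2.0, 3.0, 4.0],
                   'Column2': [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()
print(summary.loc['mean', 'Column1'])          # 2.5
print(summary.loc[['min', 'max'], 'Column2'])  # smallest and largest value
```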

Control Your Data

Pandas brings the flexibility of SQL into Python.

Sort

>>> df.sort_index(axis=0, ascending=False) # sort using the index
             Column1   Column2   Column3
2016-01-06  0.802379  0.720102  0.043310
2016-01-05  0.353116  0.418686  0.054011
2016-01-04  0.943801  0.388292  0.221318
2016-01-03  0.683565  0.305862  0.278066
2016-01-02  0.242099  0.887256  0.069512
2016-01-01  0.704351  0.151919  0.505881
>>> df.sort_values(by='Column2') # sort using a column
             Column1   Column2   Column3
2016-01-01  0.704351  0.151919  0.505881
2016-01-03  0.683565  0.305862  0.278066
2016-01-04  0.943801  0.388292  0.221318
2016-01-05  0.353116  0.418686  0.054011
2016-01-06  0.802379  0.720102  0.043310
2016-01-02  0.242099  0.887256  0.069512
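Much like ORDER BY in SQL, sort_values can take several columns with a separate direction for each. The sketch below uses an invented team/score table, and also shows that sorting returns a new DataFrame rather than changing the original.

```python
import pandas as pd

df = pd.DataFrame({'team': ['b', 'a', 'b', 'a'],
                   'score': [3, 9, 7, 1]})

# Sort by team A to Z, then by score high to low within each team.
ranked = df.sort_values(by=['team', 'score'], ascending=[True, False])
print(ranked)

# Sorting returns a new DataFrame; df itself keeps its original order.
print(df['score'].tolist())   # [3, 9, 7, 1]
```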

Join

Here are new example DataFrames:

dates1 = pd.date_range("20160101", periods=6)
data1 = np.random.random((6,2))
column_names1 = ['ColumnA', 'ColumnB']

dates2 = pd.date_range("20160101", periods=7)
data2 = np.random.random((7,2))
column_names2 = ['ColumnC', 'ColumnD']

df1 = pd.DataFrame(data1, index=dates1, columns=column_names1)
df2 = pd.DataFrame(data2, index=dates2, columns=column_names2)

>>> df1.join(df2) # joins on the index
             ColumnA   ColumnB   ColumnC   ColumnD
2016-01-01  0.128655  0.181495  0.574188  0.628584
2016-01-02  0.278669  0.810805  0.634820  0.545531
2016-01-03  0.489763  0.397794  0.169862  0.300666
2016-01-04  0.911465  0.903353  0.058488  0.911165
2016-01-05  0.094284  0.890642  0.282264  0.568099
2016-01-06  0.512656  0.735082  0.141056  0.698386

If you want to join on a column other than the index, check out the merge method.
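Here is a minimal sketch of merge on a shared column. The orders and customers tables, and their column names, are hypothetical and exist only to show the shape of the call.

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                       'amount': [20.0, 35.5, 12.0, 8.25]})
customers = pd.DataFrame({'customer_id': [1, 2, 4],
                          'name': ['Ada', 'Grace', 'Linus']})

# how='left' keeps every row of orders, like a SQL LEFT JOIN;
# customer 3 has no match, so its name is missing (NaN).
merged = orders.merge(customers, on='customer_id', how='left')
print(merged)
```

Other values for how, such as 'inner' and 'outer', behave like their SQL namesakes.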

Group by

df3 = df1.join(df2)

# add a column to df3 to group on
df3['ProfitLoss'] = pd.Series(['Profit', 'Loss', 'Profit', 'Profit', 'Profit', 'Loss'], index=dates1)

>>> df3.groupby('ProfitLoss').mean()
             ColumnA   ColumnB   ColumnC   ColumnD
ProfitLoss
Loss        0.403947  0.759588  0.272969  0.305868
Profit      0.576668  0.477050  0.359661  0.406070
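The mean is only one option. As a rough sketch with invented numbers, agg lets you compute several aggregates per group in one call, similar to listing COUNT, AVG, and SUM in a SQL GROUP BY query.

```python
import pandas as pd

trades = pd.DataFrame({
    'ProfitLoss': ['Profit', 'Loss', 'Profit', 'Profit', 'Loss'],
    'amount': [10.0, -4.0, 6.0, 2.0, -8.0],
})

# One row per group, one column per aggregate.
summary = trades.groupby('ProfitLoss')['amount'].agg(['count', 'mean', 'sum'])
print(summary)
```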

Accessing Attributes

Notice how I was able to just add in a column using a key/value notation in the code above? Pandas allows you to add new data with ease. But it also allows you to access the core attributes of your data structures.

Access the Index

>>> df3.index
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

Access the Values

>>> df3.values
array([[0.441513594483238, 0.974419927787583, 0.20896018007846018,
        0.45913058454344435, 'Profit'],
       ...
       [0.6980963896232228, 0.7005669323477245, 0.09231336594380268,
        0.13264595083739117, 'Loss']], dtype=object)

Access the Columns

>>> df3.columns
Index([u'ColumnA', u'ColumnB', u'ColumnC', u'ColumnD', u'ProfitLoss'], dtype='object')

I’m providing here a link to download my pandas walkthrough using a Jupyter Notebook!

Never used Jupyter notebooks before? Visit their website here.

Overall, if you have a dataset you want to manipulate but don’t want to go through the hassle of hauling it all into SQL, I recommend searching for a pandas solution before anything else!

Applications

Let’s look at a scenario. Say you want to keep an eye on Bitcoin but don’t want to invest too much time in building out an infrastructure. You can use pandas to keep it simple.

You’ll need a Quandl account and the Python Quandl library.

pip install quandl

Let’s code:

import quandl

# set up the Quandl connection
api_key = 'GETYOURAPIKEY'
quandl.ApiConfig.api_key = api_key
quandl_code = "BITSTAMP/USD"

# get the data from the API
bitcoin_data = quandl.get(quandl_code, start_date="2017-01-01", end_date="2018-01-17", returns="numpy")

# set up the data in pandas
df = pd.DataFrame(data=bitcoin_data, columns=['Date', 'High', 'Low', 'Last', 'Bid', 'Ask', 'Volume', 'VWAP'])

# make the 'Date' column the index
df.set_index('Date', inplace=True)

# find a rolling 30 day average, shifted one day so it covers the previous 30 days
df['RollingMean'] = df['Last'].rolling(window=30).mean().shift(1)

# label when the last price is less than the last-30-days average
df['Buy'] = df['Last'] < df['RollingMean']

# create a strategic trading DataFrame
trading_info = df.loc[:,['Last', 'RollingMean', 'Buy']]

>>> trading_info.tail(10) # let's look at the last 10 days
                Last   RollingMean    Buy
Date
2018-01-08  16173.98  15693.421333  False
2018-01-09  15000.00  15704.147667   True
2018-01-10  14397.30  15716.680333   True
2018-01-11  14900.00  15706.590333   True
2018-01-12  13220.00  15655.209333   True
2018-01-13  13829.29  15539.209333   True
2018-01-14  14189.66  15458.548000   True
2018-01-15  13648.00  15384.760000   True
2018-01-16  13581.66  15258.109000   True
2018-01-17  11378.66  15070.668667   True
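If you don’t have an API key handy, you can follow the rolling-mean logic on a tiny hand-made series. The prices below are invented, not market data, and the window is shortened to 3 days so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Made-up prices, not real market data.
last = pd.Series([10.0, 12.0, 11.0, 13.0, 9.0],
                 index=pd.date_range("20180101", periods=5), name='Last')

frame = last.to_frame()
# 3-day mean, shifted one row so each day sees only the 3 previous days.
frame['RollingMean'] = frame['Last'].rolling(window=3).mean().shift(1)
# Comparisons against the missing (NaN) means in the first rows are False.
frame['Buy'] = frame['Last'] < frame['RollingMean']
print(frame)
```

The first three rows have no rolling mean because there are not yet three previous days to average, which is why the real example only becomes meaningful after the first 30 days of data.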

This is the power of pandas with real-life data! But what if we wanted to view the data shown above in a graph? That’s possible too: check out my next article on Matplotlib.

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.