Fundamental Python Data Science Libraries: A Cheatsheet (Part 3/4)
If you are a developer who wants to integrate data manipulation or data science into your product, or you are just starting your journey in data science, here are the Python libraries you need to know.
- NumPy
- Pandas
- Matplotlib
- Scikit-Learn
The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.
Matplotlib
This library is the go-to Python visualization package (Plotly is a popular alternative for interactive charts)! It allows you to create rich images displaying your data with Python code.
Focus of the Library
This library is extensive, but this article will focus on two objects: the Figure and the Axes.
Installation
Open a command line and type in
pip install matplotlib
Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.
Details
Matplotlib is split into two main sections: the Pyplot API (visualization functions for fast production) and the Object Oriented API (more flexible and robust).
We will focus on the latter.
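To make the difference between the two APIs concrete, here is a minimal sketch that draws the same line both ways. The data and titles are made up for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [1, 4, 9]

# Pyplot API: module-level functions act on an implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("Pyplot style")
pyplot_axes = plt.gca()  # grab the implicit axes so we can inspect it

# Object Oriented API: keep explicit references and call methods on them
fig = plt.figure()
oo_axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])
oo_axes.plot(x, y)
oo_axes.set_title("Object oriented style")
```

Both halves produce the same kind of plot. The second keeps the Figure and Axes in variables, which tends to be easier to manage once a script holds more than one graph.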
Let’s dive in!
import matplotlib.pyplot as plt
import numpy as np
Creation
In order to make a visualization, you need to create 2 objects one right after the other. First create a Figure object and then from that, create an Axes object. After that, all visualization details are created by calling methods.
# Figure is a blank canvas
fig = plt.figure(figsize=(8,5), dpi=100) # 800x500 pixel image
# Add axes at specific position (fractions of fig width and height)
position = [0.1, 0.1, 0.8, 0.8] # left, bottom, width, height
axes = fig.add_axes(position)
Some things to note about the Figure object:
- The figsize & dpi parameters are optional
- figsize is the width and height of the figure in inches
- dpi is the dots per inch (pixels per inch)
Some things to note about the add_axes method:
- The position of the axes can only be specified in fractions of the figure size
- There are many other parameters that you can pass to this method
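As a quick check on the notes above, this small sketch reads the sizes back from the objects: figsize times dpi gives the pixel dimensions, and the axes position comes back as fractions of the figure. It assumes the Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # left, bottom, width, height

# figsize is in inches, so pixels = inches * dpi
width_in, height_in = fig.get_size_inches()
pixel_size = (int(round(width_in * fig.dpi)), int(round(height_in * fig.dpi)))
print(pixel_size)  # (800, 500)

# The axes position is stored as fractions of the figure width and height
left, bottom, width, height = axes.get_position().bounds
```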
Plotting
Now we are going to create some simple data, plot it, label the graph, and save it to the same directory as where our code lives.
# Create data
x = np.array([1,2,3,4,5,6])
y = np.array([1,4,9,16,25,36])
# Plot a line
axes.plot(x, y, label="growth") # label keyword used later!
axes.set_xlabel('X Axis')
axes.set_ylabel('Y Axis')
axes.set_title("Simple Line")
# Save the image
fig.savefig("file1.jpg")
Here is the resulting image:
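The savefig method takes a few optional arguments of its own. The sketch below is one way to see them in action: it saves a PNG into an in-memory buffer (an assumption made here so the example leaves no files behind) and passes a dpi that overrides the one set on the Figure.

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes.plot([1, 2, 3], [1, 4, 9])

# Save to a buffer instead of a file; dpi here overrides the figure's dpi
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=200)
png_bytes = buffer.getvalue()

# A PNG stores its width and height as two big-endian integers at bytes 16-24
width_px, height_px = struct.unpack(">II", png_bytes[16:24])
print(width_px, height_px)  # 1600 1000
```

Doubling the dpi doubles the pixel dimensions while the figure stays 8 by 5 inches, which is handy when an image needs to look sharp in print.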
Legends
The best way to add a legend is to include the label keyword when you call the plot method on the Axes object (as we saw in the code above). Then you can make a legend and choose its location by calling another method.
# Location options: 0 = Auto Best Fit, 1 = Upper Right, 2 = Upper Left,
# 3 = Lower Left, 4 = Lower Right
axes.legend(loc=0)
# Save the image
fig.savefig("file2.jpg")
Here is the resulting image:
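The loc argument also accepts a descriptive string, which can be easier to read than a number. A small sketch, where the second "linear" series is made up purely so the legend has two entries:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes.plot(x, x**2, label="growth")
axes.plot(x, 5 * x, label="linear")  # hypothetical second series

# loc accepts strings such as "best", "upper right" or "upper left"
legend = axes.legend(loc="upper left")
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)  # ['growth', 'linear']
```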
Colors & Lines
You can control features of the lines by passing certain keyword arguments into the plot method. Some of the most commonly used keywords are:
- color: either a name (“b”, “blue”, “r”, “red”, etc.) or a hex code (“#1155dd”, “#15cc55”)
- alpha: transparency of the line
- linewidth
- linestyle: pattern of the line (‘-’, ‘--’, ‘-.’, ‘:’); step-shaped lines are set with the separate drawstyle keyword
- marker: pattern for each data point on the line (‘+’, ‘o’, ‘*’, ‘s’, ‘,’, ‘.’)
- markersize
# Use the keywords in the plot method
benchmark_data = [5,5,5,5,5,5]
axes.plot(x, benchmark_data, label="benchmark", color="r", alpha=.5, linewidth=1, linestyle ='-', marker='+', markersize=4)
axes.legend(loc=0)
# Save the image
fig.savefig("file3.jpg")
Here is the resulting image:
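Here is a short sketch that combines several of these keywords on two lines of made-up data. It also shows the drawstyle keyword, which controls whether the points are joined directly or in steps, independently of the dash pattern.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Dashed line in a hex color with a circle marker on each data point
(dashed_line,) = axes.plot(x, x**2, color="#1155dd", linestyle="--",
                           marker="o", markersize=4, label="dashed")

# Stepped line: the step shape comes from drawstyle, the pattern from linestyle
(stepped_line,) = axes.plot(x, 5 * x, color="r", alpha=0.5, linewidth=1,
                            drawstyle="steps-post", label="stepped")
axes.legend(loc=0)
```

The plot method returns a list of line objects, so unpacking it as shown gives a handle to each line for later inspection or restyling.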
Axes Range & Tick Marks
You can also control the range of the axes and override the tick lines of your graph.
# Control the range of the axes
axes.set_xlim([1, 6])
axes.set_ylim([1, 50]) # raise the y axis maximum to 50 (the data peaks at 36)
#axes.axis("tight") # to get auto tight fitted axes, do this
# Control the tick lines
axes.set_xticks([1, 2, 3, 4, 5, 6])
axes.set_yticks([0, 25, 50])
# Control the labels of the tick lines
axes.set_xticklabels(["2018-07-0{0}".format(d) for d in range(1,7)])
axes.set_yticklabels([0, 25, 50])
axes.legend(loc=0)
fig.savefig("file4.jpg")
Here is the resulting image:
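Long tick labels such as dates tend to overlap. One option, sketched here under the assumption that a 45 degree rotation suits your layout, is to pass text properties like rotation and ha (horizontal alignment) to set_xticklabels and leave a little extra room at the bottom of the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.2, 0.8, 0.7])  # extra bottom margin for the labels
axes.plot(x, x**2)

# Fix the visible range, then place one tick per data point
axes.set_xlim([1, 6])
axes.set_ylim([0, 50])
axes.set_xticks([1, 2, 3, 4, 5, 6])

# Replace the numeric tick labels with rotated date strings
date_labels = ["2018-07-0{0}".format(d) for d in range(1, 7)]
tick_texts = axes.set_xticklabels(date_labels, rotation=45, ha="right")
```

Setting the tick positions before the labels matters: the labels are assigned to the ticks in order, so the two lists need the same length.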
Subplots
So far we have created a Figure object with only one graph on it. It is possible to create multiple graphs on one Figure all in one go. We can do this using the subplots function.
# 2 graphs side by side
fig1, axes1 = plt.subplots(nrows=1, ncols=2, figsize=(8,5), dpi=100)
# Set up first graph
axes1[0].plot(x, x**2, color='r')
axes1[0].set_xlabel("x")
axes1[0].set_ylabel("y")
axes1[0].set_title("Squared")
# Set up second graph
axes1[1].plot(x, x**3, color='b')
axes1[1].set_xlabel("x")
axes1[1].set_ylabel("y")
axes1[1].set_title("Cubed")
# Automatically adjust the positions of the axes so there is no overlap
fig1.tight_layout()
fig1.savefig("file5.jpg")
Here is the resulting image:
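With more than one row and more than one column, subplots returns a 2D NumPy array of Axes rather than a flat list. A minimal sketch of a 2 by 2 grid, using made-up power curves, that loops over the array with its flat iterator:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# A 2x2 grid: axes_grid[row, col] addresses each graph
fig, axes_grid = plt.subplots(nrows=2, ncols=2, figsize=(8, 6), dpi=100)

# .flat walks the grid row by row, so one loop can fill every graph
powers = [1, 2, 3, 4]
for ax, power in zip(axes_grid.flat, powers):
    ax.plot(x, x**power)
    ax.set_title("x to the power {0}".format(power))

fig.tight_layout()
```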
I’m providing here a link to download my Matplotlib walkthrough using a Jupyter Notebook!
Never used Jupyter notebooks before? Visit their website here.
Applications
In my last article on pandas, we acquired data on Bitcoin and created a signal for when to buy and trade based on the rolling 30 day average price. We can use our new knowledge in Matplotlib to visualize this data.
You’ll need a Quandl account and the Quandl library for Python.
pip install quandl
Code from last time:
import quandl
import pandas as pd
# set up the Quandl connection
api_key = 'GETYOURAPIKEY'
quandl.ApiConfig.api_key = api_key
quandl_code = "BITSTAMP/USD"
# get the data from the API
bitcoin_data = quandl.get(quandl_code, start_date="2017-01-01", end_date="2018-01-17", returns="numpy")
# set up the data in pandas
df = pd.DataFrame(data=bitcoin_data, columns=['Date', 'High', 'Low', 'Last', 'Bid', 'Ask', 'Volume', 'VWAP'])
# make the 'Date' column the index
df.set_index('Date', inplace=True)
# find a rolling 30 day average
df['RollingMean'] = df['Last'].rolling(window=30).mean().shift(1)
# label when the last price is less than the rolling 30 day average
df['Buy'] = df['Last'] < df['RollingMean']
# create a strategic trading DataFrame
trading_info = df.loc[:,['Last', 'RollingMean', 'Buy']]
New code to visualize bitcoin data:
import matplotlib.pyplot as plt
# make figure
fig = plt.figure(figsize=(8,5), dpi=100)
# add axes at specific position
position = [0.1, 0.1, 0.8, 0.8]
axes = fig.add_axes(position)
# plot the bitcoin data
num_days = trading_info.index.size
x = range(num_days)
y = trading_info['Last']
axes.plot(x, y, label="Price", color="b") # label keyword used later!
axes.set_xlabel('Date')
axes.set_ylabel('Price')
axes.set_title("Bitcoin Price")
# plot the rolling mean
axes.plot(x, trading_info['RollingMean'], label="Rolling Mean", color="r", alpha=.5, linewidth=1, linestyle ='-')
# set up the legend
axes.legend(loc=0)
# set up the date tick marks
x_ticks_index = range(0, num_days, 100)
x_ticks_labels = [str(trading_info.index[indx])[0:10] for indx in x_ticks_index]
axes.set_xticks(x_ticks_index)
axes.set_xticklabels(x_ticks_labels)
# save the image
fig.savefig("Bitcoin.jpg")
Here is the resulting image:
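As an alternative to building the date tick labels by hand, Matplotlib can plot against a pandas DatetimeIndex directly and choose the date ticks itself. The sketch below assumes that approach and uses made-up, steadily rising prices (not real Bitcoin data) so it runs without a Quandl account; the column names mirror the ones used above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up prices for illustration only (NOT real Bitcoin data)
dates = pd.date_range("2017-01-01", periods=120, freq="D")
trading_info = pd.DataFrame({"Last": 1000.0 + 10.0 * np.arange(120)}, index=dates)

# Same signal as before: previous 30 day average, and a flag when price is below it
trading_info["RollingMean"] = trading_info["Last"].rolling(window=30).mean().shift(1)
trading_info["Buy"] = trading_info["Last"] < trading_info["RollingMean"]

fig = plt.figure(figsize=(8, 5), dpi=100)
axes = fig.add_axes([0.1, 0.15, 0.8, 0.75])

# Passing the DatetimeIndex as x lets Matplotlib place and label the date ticks
axes.plot(trading_info.index, trading_info["Last"], label="Price", color="b")
axes.plot(trading_info.index, trading_info["RollingMean"],
          label="Rolling Mean", color="r", alpha=0.5, linewidth=1)
axes.set_xlabel("Date")
axes.set_ylabel("Price")
axes.legend(loc=0)

# Rotate the date labels so they do not overlap
fig.autofmt_xdate()
```

Because the made-up prices only go up, the Buy column is never True here. With real data the same code would flag the days when the price dips below its trailing average.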
That’s Matplotlib! Fast, flexible, and easy visualizations with real data. But what if we wanted to analyze the data with something more sophisticated than a rolling 30 day average? The last library every Python data-oriented programmer needs to know is Scikit-Learn — learn about it in my next article!
Thanks for reading! If you have questions feel free to comment & I will try to get back to you.
Connect with me on Instagram @lauren__glass & LinkedIn