Automating the Boring Stuff with Python, Pandas, and SQL

Want to stop doing the same thing over and over again in Excel? You’ve heard about automating aspects of data analysis with Python, and want to learn more? Well, here is a tutorial the data science team at put together for other folks who have Macs and spend a lot of their time working on spreadsheets. All you’ll need to do is open your terminal (open the finder and search for terminal) and paste a few commands. But, make sure not to skip anything or else stuff won’t work. Good luck!

In order to use Python, Pandas, and SQL, you’ll need to install some stuff.

First, let’s turn “developer mode” on. Copy the next line into your terminal.

xcode-select --install

Next, let’s install brew. Copy the next line into you terminal.

ruby -e "$(curl -fsSL"

If you’re wondering, where that line of code came from. Click here. Otherwise, let’s move on and install Java and Python. Copy each line into your terminal and run separately.

brew doctor
brew update
brew install Caskroom/cask/eclipse-java
brew install python
pip install pandas
pip install jupyter
pip install sqlalchemy
pip install psycopg2

Nice work. You’ve got the dependencies installed. So, let’s automate some data analysis. Copy the next line into you terminal.

jupyter notebook

You should should see a page pop up in your browser at the address http://localhost:8888/tree. Now, let’s create a new python notebook. On the top right hand side, you’ll find a dropdown menu where you can create a new python notebook. Once you do that, you should have a new tab that looks like this

a brand new jupyter notebook

In that first cell, copy and paste the following lines.

import pandas as pd
from sqlalchemy import create_engine, text

Now, let’s get some data from your computer. First, click the + sign to add a new row. In the second row, copy and paste the following lines. Make sure you have a .csv file in your Desktop and replace “YourFileName” with whatever your actual file name is.

filename = "~/Desktop/YourFileName.csv"
data = pd.read_csv(filename)

Now, let’s see what the columns of the data are. Add a row and copy the code.


Let’s see what the first few rows look like


Here’s what the jupyter notebook should look like

that’s the pandas logo

Now, click the first row. Then, click the play button to run each row. You should see data! Next, we want to get some data from a SQL database. Add a row and copy the following code. For this to work, you’ll need to replace the missing username/password/url/db credentials with your database credentials.

redshift_username = ""
redshift_password = ""
redshift_url = ""
redshift_port = "5439"
redshift_db = ""
engine = create_engine(redshift_creds)
data_sql = pd.read_sql_query(text(""" 
SELECT * FROM students LIMIT 10
"""), engine)

After you run that row, you will now have two data frames: (1) data (2) data_sql,which we can merge together with the next few lines of code. When you merge data, you need a unique id. Here, I’ll assume that you have an email address in both data frames that serves as a unique id.

data_all = pd.merge(data, data_sql, on="email")

Sweet. You’re almost done.


So, let’s quickly recap. First, you installed all the dependencies. Then, you opened a jupyter notebook by typing jupyter notebookin the terminal. Third, you wrote code in the notebook to (i) import pandas and sqlalchemy (ii) read a local .csv (iii) get data from SQL (iv) merge the data together (v) saved the merged data to a .csv file on your desktop.

Now, you’re just getting started.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.