Automating the Boring Stuff with Python, Pandas, and SQL
Want to stop doing the same thing over and over again in Excel? You’ve heard about automating aspects of data analysis with Python, and want to learn more? Well, here is a tutorial the data science team at Raise.me put together for other folks who have Macs and spend a lot of their time working on spreadsheets. All you’ll need to do is open your terminal (open the finder and search for
terminal) and paste a few commands. But, make sure not to skip anything or else stuff won’t work. Good luck!
In order to use Python, Pandas, and SQL, you’ll need to install some stuff.
First, let’s turn “developer mode” on. Copy the next line into your terminal.
Next, let’s install brew. Copy the next line into you terminal.
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
If you’re wondering, where that line of code came from. Click here. Otherwise, let’s move on and install Java and Python. Copy each line into your terminal and run separately.
brew install Caskroom/cask/eclipse-java
brew install python
pip install pandas
pip install jupyter
pip install sqlalchemy
pip install psycopg2
Nice work. You’ve got the dependencies installed. So, let’s automate some data analysis. Copy the next line into you terminal.
You should should see a page pop up in your browser at the address http://localhost:8888/tree. Now, let’s create a new python notebook. On the top right hand side, you’ll find a dropdown menu where you can create a new python notebook. Once you do that, you should have a new tab that looks like this
In that first cell, copy and paste the following lines.
import pandas as pd
from sqlalchemy import create_engine, text
Now, let’s get some data from your computer. First, click the + sign to add a new row. In the second row, copy and paste the following lines. Make sure you have a .csv file in your Desktop and replace “YourFileName” with whatever your actual file name is.
filename = "~/Desktop/YourFileName.csv"
data = pd.read_csv(filename)
Now, let’s see what the columns of the data are. Add a row and copy the code.
Let’s see what the first few rows look like
Here’s what the jupyter notebook should look like
Now, click the first row. Then, click the play button to run each row. You should see data! Next, we want to get some data from a SQL database. Add a row and copy the following code. For this to work, you’ll need to replace the missing username/password/url/db credentials with your database credentials.
redshift_username = ""
redshift_password = ""
redshift_url = ""
redshift_port = "5439"
redshift_db = ""
engine = create_engine(redshift_creds)
data_sql = pd.read_sql_query(text("""
SELECT * FROM students LIMIT 10
After you run that row, you will now have two data frames: (1)
data_sql,which we can merge together with the next few lines of code. When you merge data, you need a unique id. Here, I’ll assume that you have an email address in both data frames that serves as a unique id.
data_all = pd.merge(data, data_sql, on="email")
Sweet. You’re almost done.
So, let’s quickly recap. First, you installed all the dependencies. Then, you opened a jupyter notebook by typing
jupyter notebookin the terminal. Third, you wrote code in the notebook to (i) import pandas and sqlalchemy (ii) read a local .csv (iii) get data from SQL (iv) merge the data together (v) saved the merged data to a .csv file on your desktop.
Now, you’re just getting started.