Why would you do this instead of using EMR? Well, great question. Under certain circumstances, using EC2 might be cheaper than using EMR, but otherwise, EMR may be advisable. In any case, here’s how to run a Spark application from an EC2 instance:

Photo by Steve Richey on Unsplash

I used a Deep Learning AMI (Ubuntu 16.04) Version 25.3 with a p3 instance, for accelerated computing.

SSH into your EC2 instance.

ssh -i pem_key.pem ubuntu@public_dns_key

Type the following into your EC2 terminal:

java -version

It returns:

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

Java 8 is what we want for Spark to run…
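
If you would rather check the Java version from a script than by eye, here is a minimal sketch. The helper names `parse_java_major` and `installed_java_major` are my own (hypothetical, not from the original post), and the sketch assumes `java` is on your PATH and Python 3.7 or newer.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_222" means Java 8; newer releases report "11.0.2" directly.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and parse its banner (raises FileNotFoundError if java is missing)."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling `installed_java_major()` on the instance above should give 8, since the banner reports `1.8.0_222`.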

So, you’ve been given an account, and access keys. Now what?

This blog will go over how to configure your AWS profile on both Windows and Mac machines so that you can successfully pull files from S3 to your local machine from a Jupyter notebook.


Python — pip, in particular

Step One: Install awscli


pip install awscli

Then update your environment variables:


To find where awscli is installed, type the following into your command line:


As much as I’d love to spend 100% of my time working on data, sometimes I find myself needing to solve different kinds of problems. Here I document the resources I used to fix a mistake I made: putting a space in my new PC’s user name.

I made a mistake within the first five seconds of turning on my new Windows 10 PC: I put a space in my username!

What was I thinking, putting a SPACE in my PC’s User name?

This screen doesn’t suggest you’re making any huge commitments, but in fact, you are. The User name you select here will be what your Windows uses as an…

I don’t know about you, but I love diving into my data as efficiently as possible. Pulling different file formats from S3 is something I have to look up each time, so here I show how I load data from pickle files stored in S3 into my local Jupyter Notebook.

This has got to be the ugliest picture I’ve ever used for one of my blogs. Thx Google Search and Print Screen!
  • Never hard-code your credentials! And if you do, make sure to never upload that code to a repository, especially GitHub. There are web crawlers looking for accidentally uploaded keys and your AWS account WILL be compromised. Instead, use boto3.Session().get_credentials()
  • In older versions of Python (before Python 3), you will…
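
Here is a minimal sketch of the idea, not necessarily the exact code from the full post. It assumes boto3 is installed and that your credentials already live in your AWS profile or environment variables. The function names, bucket, and key are hypothetical placeholders.

```python
import pickle


def make_s3_client():
    """Build an S3 client from the default credential chain (no keys in the code)."""
    import boto3  # imported here so the loader below can be exercised without AWS access

    session = boto3.Session()  # picks up ~/.aws/credentials or environment variables
    return session.client("s3")


def load_pickle_from_s3(s3_client, bucket, key):
    """Download one S3 object and unpickle it. Only unpickle data you trust."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response["Body"].read())
```

In a notebook, this would look something like `data = load_pickle_from_s3(make_s3_client(), "my-bucket", "path/to/file.pkl")`, with your own bucket and key swapped in.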

Jupyter Notebook is an incredible tool for learning and troubleshooting code. Here is a blog to show how to take advantage of this powerful tool as you learn Spark!

Spark is helpful if you’re doing anything computationally intense which can be parallelized. Check out this Quora question for more information.

This blog will be about setting up the infrastructure to use Spark via AWS Elastic MapReduce (AWS EMR) and Jupyter Notebook.

You’ll see the following at the top of the screen:

Image created using MS Word and google search “AWS RDS logo”

This blog will walk through the creation of an AWS RDS instance using the AWS console. Reasons why one would be interested in storing and maintaining their database using AWS include:

  • limited storage space on your local machine
  • storage or software limitations to what you can store on your work’s server
  • desire for increased security
  • desire for more efficient data pipeline capabilities
  • automation of database maintenance such as software updates, patches, and disaster recovery

If you’re new to working with big data, a popular application for small to medium data storage that you are likely familiar with is MS Access…

This blog will go into detail on extracting information from Word Documents locally. Since many companies and roles are inseparable from the Microsoft Office Suite, this is a useful blog for anyone faced with data transferred through .doc or .docx formats.

As a prerequisite, you will need Python installed on your computer. For those of you doing this at work, you likely do not have admin rights. This blog explains how to install Anaconda on a Windows machine without admin rights.

You can find the Notebook supporting this blog here.

Image created with Microsoft Word and google searches “Microsoft Word Logo” and “Python Logo”

We’ll be taking advantage of each word document’s XML make-up…
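
To give a feel for what that XML make-up looks like, here is a minimal sketch using only the standard library. It relies on the fact that a .docx file is a zip archive whose main body lives in `word/document.xml`. The function name `docx_paragraphs` is my own (hypothetical), and this is one possible approach rather than the code from the supporting Notebook. It does not handle the older binary .doc format.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each run keeps its text in <w:t> elements.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs("report.docx")` on a file of your own (the file name is a placeholder) returns one string per paragraph, which is easy to load into a DataFrame afterwards.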

Thomas Edison State University (TESU) of Trenton, NJ is continuing its tradition of taking the next steps towards democratizing education.

“Identified by Forbes magazine as one of the top 20 colleges and universities in the nation in the use of technology to create learning opportunities for adults, Thomas Edison State University is a national leader in the assessment of adult learning and a pioneer in the use of educational technologies. The New York Times has stated that Thomas Edison State University is ‘the college that paved the way for flexibility.’”


TESU is a member of a network of Competency-Based…

My brilliant colleague Nicole Eickhoff and I were going to Spain together, and we wanted to meet data scientists there. Since we’d be there for Fallas, a month-long celebration of community and tradition, we collected Valenbisi data for the week leading up to Crida Fallas. Crida Fallas is the first day of activities, kicking off a month of events. We wanted to see how Fallas affected bike share activity. Here I detail the project we presented and discussed with the Valencia Big Data Meetup on Tuesday, March 13, 2018.

As part of the celebration of Fallas, each neighborhood in Valencia creates fantastic street light displays.

In this blog I briefly describe my process…

With a powerful model like Bayesian Networks, it can be helpful to start with a simple example. After all, Bayesian Networks can be used to explain incredibly complex relationships and have been used to understand supply chain disruptions, diagnose diseases, and even predict terrorism. These examples are complex and require expertise to fully understand, so Shark Attacks serve as a great learning example, as information about them is (relatively) easy to access and something we all can agree is interesting.

Shark attacks are similar to many problems we encounter in our data science careers.

  1. They are infrequent
  2. The data we…

Natalie Olivo

