Why would you do this instead of using EMR? Well, great question. Under certain circumstances, using EC2 might be cheaper than using EMR, but otherwise, EMR may be advisable. In any case, here’s how to run a Spark application from an EC2 instance:
I used a Deep Learning AMI (Ubuntu 16.04) Version 25.3 with a p3 instance, for accelerated computing.
ssh -i pem_key.pem ubuntu@public_dns_key
You type into your EC2 terminal:
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
Java 8 is what we want for Spark to run…
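If you would rather check the Java version from Python than read it by eye, here is a minimal sketch. The helper name `java_major_version` is my own, not part of Spark or any library, and it assumes the version line uses straight quotes, as real terminal output does.

```python
import re


def java_major_version(version_output):
    """Return the major Java version found in `java -version` text, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Releases up to Java 8 report themselves as 1.x (for example 1.8.0_222),
    # so the real major version is the second number in that case.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


# Hypothetical usage with the first line of the output shown above.
sample = 'openjdk version "1.8.0_222"'
print(java_major_version(sample))
```

Note that `java -version` typically writes to stderr rather than stdout, so capture that stream if you run the command through `subprocess`.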
So, you’ve been given an account, and access keys. Now what?
This blog will go over how to configure your AWS profile on both Windows and Mac machines so that you can successfully pull files from S3 to your local machine in a Jupyter notebook.
Python — pip, in particular
pip install awscli
Then update your environment variables:
To find where awscli is installed, type the following into your command line:
As much as I’d love to spend 100% of my time working on data, sometimes I find myself needing to solve different kinds of problems. Here I document the resources I used to fix a mistake I made: putting a space in my new PC’s user name.
I made a mistake within the first five seconds of turning my new Windows 10 on: I put a space in my username!
This screen doesn’t suggest you’re making any huge commitments, but in fact, you are. The User name you select here will be what your Windows uses as an…
I don’t know about you but I love diving into my data as efficiently as possible. Pulling different file formats from S3 is something I have to look up each time, so here I show how I load data from pickle files stored in S3 to my local Jupyter Notebook.
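As a rough idea of what that can look like, here is a minimal sketch, not necessarily the approach the full post takes. It assumes you pass in a boto3 S3 client and that the object was written with Python's `pickle` module. The helper name `load_pickle_from_s3`, the bucket, and the key are placeholders of mine.

```python
import pickle


def load_pickle_from_s3(s3_client, bucket, key):
    """Download one S3 object and unpickle it into a Python object."""
    # get_object returns a dict whose "Body" entry is a readable byte stream.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Only unpickle files you trust: unpickling can run arbitrary code.
    return pickle.loads(response["Body"].read())


# Hypothetical usage in a notebook cell (needs boto3 and a configured AWS profile):
#   import boto3
#   s3 = boto3.client("s3")
#   data = load_pickle_from_s3(s3, "my-example-bucket", "path/to/data.pkl")
```

Passing the client in as an argument keeps the helper independent of how your credentials are set up.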
Jupyter Notebook is an incredible tool for learning and troubleshooting code. Here is a blog to show how to take advantage of this powerful tool as you learn Spark!
Spark is helpful if you’re doing anything computationally intense which can be parallelized. Check out this Quora question for more information.
This blog is about setting up the infrastructure to use Spark via AWS Elastic MapReduce (AWS EMR) and Jupyter Notebook.
You’ll see the following at the top of the screen:
This blog will walk through the creation of an AWS RDS instance using the AWS console. Reasons why one would be interested in storing and maintaining their database using AWS include:
If you’re new to working with big data, a popular application for small-to-medium data storage that you may already be familiar with is MS Access…
This blog will go into detail on extracting information from Word Documents locally. Since many companies and roles are inseparable from the Microsoft Office Suite, this is a useful blog for anyone faced with data transferred through .doc or .docx formats.
As a prerequisite, you will need Python installed on your computer. For those of you doing this at work, you likely do not have admin rights. This blog explains how to install Anaconda on a Windows machine without admin rights.
You can find the Notebook supporting this blog here.
We’ll be taking advantage of each word document’s XML make-up…
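To give a flavor of the idea, here is a minimal sketch using only the standard library; the linked Notebook may well do this differently. A .docx file is a zip archive whose main text lives in `word/document.xml`, so it can be read without Word installed. The helper name `docx_paragraphs` is mine, and this does not work for the older binary .doc format.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the non-empty paragraphs of a .docx file as a list of strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


# Hypothetical usage:
#   for line in docx_paragraphs("report.docx"):
#       print(line)
```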
Thomas Edison State University (TESU) of Trenton, NJ is continuing its tradition of taking the next steps towards democratizing education.
“Identified by Forbes magazine as one of the top 20 colleges and universities in the nation in the use of technology to create learning opportunities for adults, Thomas Edison State University is a national leader in the assessment of adult learning and a pioneer in the use of educational technologies. The New York Times has stated that Thomas Edison State University is ‘the college that paved the way for flexibility.’”
My brilliant colleague Nicole Eickhoff and I were going to Spain together, and we wanted to meet data scientists there. Since we’d be there for Fallas, a month-long celebration of community and tradition, we collected Valenbisi data for the week leading up to Crida Fallas. Crida Fallas is the first day of activities, kicking off a month of events. We wanted to see how Fallas affected bike share activity. Here I detail the project we presented and discussed with the Valencia Big Data Meetup on Tuesday, March 13, 2018.
In this blog I briefly describe my process…
With a powerful model like Bayesian Networks, it can be helpful to start with a simple example. After all, Bayesian Networks can be used to explain incredibly complex relationships and have been used to understand supply chain disruptions, diagnose diseases, and even predict terrorism. These examples are complex and require expertise to fully understand, so Shark Attacks serve as a great learning example, as information about them is (relatively) easy to access and something we all can agree is interesting.
Shark attacks are similar to many problems we encounter in our data science careers.