Programming Exercises in the Cloud

Ted Herman
Jun 18, 2015


Even in computer science, where job prospects are good, students feel pressure to learn about leading-edge developments. Everyone is now a customer of cloud services, so it makes sense to look under the hood and see what tools, languages, and infrastructure lie behind these services. I recently finished teaching a course on “big data infrastructure” in which students not only learned a few things about large distributed systems but were also required to write programs using Map-Reduce and Spark, which they’ll gladly add to their resumés.

I’m sharing below some experiences with Amazon Web Services (AWS), specifically with Elastic MapReduce (EMR) and with Apache Spark, which Amazon recently announced as officially supported in its EMR service. Some observations go beyond teaching to the wider community of cloud computing users, so even if you’re not teaching such a course, please read on. Overall, the current environment looks problematic for educators; that said, software stacks and services are evolving rapidly, and many of the issues described here might be resolved by the time you read this.

Lessons Learned

Planning is Cloudy

Using Amazon, it’s easy to launch an EC2 instance, but what kind of instance? For just playing around with commands, use the free-tier t2.micro (maybe soon there will be an equivalent option to launch a container). For learning about Hadoop and friends, we move on to launching an Elastic MapReduce cluster, which has many more parameters: security options, what software to install, and how many instances to put into the cluster.
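
For readers who prefer scripts to the dashboard, here is a rough sketch of launching a small EMR cluster from Python with the boto3 library. The key pair name, instance types, release label, and counts are placeholders rather than recommendations, and this is not the exact setup used in the course.

```python
# A minimal sketch: launching a small EMR cluster with boto3.
# All names and values below are illustrative placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='class-cluster',
    ReleaseLabel='emr-4.0.0',            # older EMR used AMI versions instead
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}],
    Instances={
        'MasterInstanceType': 'm3.xlarge',
        'SlaveInstanceType': 'm3.xlarge',
        'InstanceCount': 3,               # 1 master + 2 core nodes
        'Ec2KeyName': 'my-course-key',    # hypothetical key pair
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])
```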

How large should your EMR cluster be if you want your program to finish within an hour or so? If you have, say, a terabyte of data, how much memory, how many cores, and how much persistent storage should you provision? There doesn’t seem to be a good answer to this question. In the true spirit of elasticity, it shouldn’t matter: start with a minimal cluster and grow it as you need. In practice, that is time-consuming, since you might need to restart failed jobs, install software, reconfigure, or take other steps to integrate machines added to your cluster. Once you pick a machine type (EC2 instance type) for your cluster, it can’t be changed, and all the worker machines in the cluster have to be the same type. These things are certainly beyond beginner-level students.
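
For completeness, “grow it as you need” concretely means resizing an instance group on the running cluster. Here is a sketch with boto3, using placeholder ids; note that only the count can change, not the instance type.

```python
# A sketch of "growing" a running cluster by resizing its core instance group.
# The cluster id is a placeholder; the instance *type* of an existing group
# cannot be changed this way, only the count.
import boto3

emr = boto3.client('emr', region_name='us-east-1')
cluster_id = 'j-XXXXXXXXXXXX'            # placeholder cluster id

# Find the CORE instance group, then request more instances in it.
groups = emr.list_instance_groups(ClusterId=cluster_id)['InstanceGroups']
core = next(g for g in groups if g['InstanceGroupType'] == 'CORE')

emr.modify_instance_groups(
    InstanceGroups=[{'InstanceGroupId': core['Id'], 'InstanceCount': 6}]
)
```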

The typical answer to provisioning seems to be overprovisioning, since AWS is inexpensive, right? Students (and instructors) with limited budgets likely don’t see things that way. Something to keep in mind the next time you see a research paper claiming that a large dataset was mined for only ten dollars on AWS/EMR: the neglected cost of testing, tuning, debugging, and optimally provisioning the cluster.

Watch Your Wallet

Amazon has a very effective policy for controlling resource allocation: it’s the financial/credit system. To get an AWS account, you’ll need a credit card. After sign-up, it’s up to you to control your resources wisely. Most students take this seriously, and are even fearful of launching instances and putting data into S3 buckets. Amazon doesn’t show you charges accruing in real time; there is a delay before billing is visible, which can lead to a bit of anxiety. The real problem is the lack of quotas and controls with fine granularity. Currently one can’t launch a compute instance with a strict dollar limit: there is no automatic “at most $3.25” way to run things.
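
The closest substitute I know of is a CloudWatch billing alarm, which can at least send a warning when estimated charges pass a threshold. Here is a sketch with boto3, assuming billing alerts are enabled for the account and an SNS topic (placeholder ARN) already exists.

```python
# Not a hard spending cap, but a CloudWatch billing alarm can send a warning.
# Assumes billing alerts are enabled and an SNS topic exists (placeholder ARN);
# billing metrics live in us-east-1.
import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')

cw.put_metric_alarm(
    AlarmName='course-spend-over-10-dollars',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=21600,                      # six hours; billing data updates slowly
    EvaluationPeriods=1,
    Threshold=10.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'],
)
```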

Of course this kind of worry about small charges and controls is not relevant to large enterprises using AWS, so it’s unlikely to be on Amazon’s radar of upcoming features. It does add to the cognitive load for students. Typically, after launching an EMR cluster, a student might install software, load data into the Hadoop Distributed File System (HDFS), try a job, and then take a break. Amazon continues to charge for the cluster even while the student leaves the machines idle. Just terminating the cluster before taking a break could lose some work, so one is advised to save data in S3 buckets. There currently doesn’t seem to be a simple, automated way to save and restore, so clusters often sit idle.
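
In practice, “saving” before a break amounts to a manual routine along the following lines; the paths, bucket name, and cluster id are placeholders.

```python
# A manual "save and shut down" routine: copy HDFS results to S3 with
# s3-dist-cp (available on EMR nodes), then terminate the cluster.
import subprocess
import boto3

# Run on the master node: push HDFS output into an S3 bucket.
subprocess.check_call([
    's3-dist-cp',
    '--src', 'hdfs:///user/hadoop/output',
    '--dest', 's3://my-course-bucket/saved-output',
])

# Then, from anywhere with credentials, shut the cluster down.
emr = boto3.client('emr', region_name='us-east-1')
emr.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXX'])
```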

Out of Tune

Optimizing performance is and will continue to be a tough problem in the world of big data and cloud computing. For clusters, one can choose between more small instances or fewer large instances at roughly the same cost, and there are options on processor types, memory sizes, storage technology, and physical location (regions). The choices for software, and for configuration parameters within the software, are bewildering. I didn’t have time to compare Amazon, Microsoft, Cloudera, Hortonworks, Joyent, and the rest.

Apache Hadoop has many hidden tuning parameters that can be specified (if you have expert knowledge) on the EMR dashboard when launching a cluster. Though we didn’t try playing with the software stack, one can imagine that more possibilities will open up when customers can mix and match among different file systems, different orchestration systems that manage parallel computing libraries, and even different query engines and languages higher up in the stack, assuming these all play nicely together.
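
On newer EMR release labels, many of those same knobs can also be supplied programmatically as a list of configuration classifications at cluster-creation time; the classifications and values below are illustrative only, not recommendations.

```python
# Illustrative sketch: Hadoop/YARN settings expressed as EMR configuration
# classifications rather than dashboard entries (newer, release-label EMR).
configurations = [
    {
        'Classification': 'mapred-site',
        'Properties': {
            'mapreduce.map.memory.mb': '1536',
            'mapreduce.reduce.memory.mb': '3072',
        },
    },
    {
        'Classification': 'yarn-site',
        'Properties': {
            'yarn.nodemanager.resource.memory-mb': '11520',
        },
    },
]
# Passed as Configurations=configurations in a run_job_flow call like the
# earlier sketch.
```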

For the past semester’s teaching, a challenge for the students was to tune the performance of programs written using Apache Spark. Many things affect performance in Spark, controlled on the command line, through API calls, and by how the program organizes and processes data distributed across cores. Students either tended not to play around with all the parameters or found the choices confusing, with little documentation explaining how much various features of Spark would benefit from this or that tuning option. Several students did report (and complain about) long running times, and they did try a few different ways to partition data, cache results, increase parallelism, or even try different EMR clusters to complete their projects. All in all, it’s quite a mess to figure out whether a given amount of data can be handled in a reasonable amount of time using Spark and EMR.
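
To give a flavor of what that tuning looks like in practice, here is a sketch of the kind of knobs students ended up touching in a PySpark program; the S3 path and the numbers are illustrative, not recommendations.

```python
# A PySpark tuning sketch: a few executor/parallelism settings, a repartition
# of the input, and caching of a reused RDD. Path and numbers are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('tuning-sketch')
        .set('spark.executor.memory', '4g')
        .set('spark.default.parallelism', '64'))
sc = SparkContext(conf=conf)

lines = sc.textFile('s3://my-course-bucket/input/')   # placeholder dataset
lines = lines.repartition(64)                         # spread work across cores

words = lines.flatMap(lambda line: line.split())
words.cache()                                         # reused below, so cache it

counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b, 64)
print(counts.take(10))
print(words.count())
```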

Full Stack Favoritism

Students looking to enter “big data” studies fall far short of the 10,000 hours one might need to know about command-line tricks, system administration, and network and file utilities, let alone gain solid experience with common programming environments for Java, Scala, and Python. For those fortunate enough to approximate such background knowledge, exploring AWS EC2 and EMR is far less daunting. During a typical semester, a student can expect to have around a hundred hours outside of class to work on projects. That’s just not enough time to overcome configuration and software issues and also do something interesting using the Map-Reduce (Hadoop) or Spark frameworks.

The Sharing Economy

I initially required students to get their own Amazon accounts and install MRJob, a somewhat crufty Python package for running Map-Reduce jobs that automates cluster creation, job execution, and cluster shutdown. After that exercise, I showed students how to run a somewhat more ambitious job using Java on Hadoop. Learning Map-Reduce is valuable because it exposes issues of parallel computing while being a simple paradigm that students can experiment with and understand.
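
For readers who haven’t seen it, a word-count job in MRJob is only a few lines; the sketch below runs locally, or on EMR with the -r emr runner once AWS credentials are configured.

```python
# A minimal MRJob word count, of the sort used for the first exercise.
# Run locally: python wordcount.py input.txt
# Run on EMR:  python wordcount.py -r emr input.txt  (credentials configured)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # one-to-many: each input line yields many (word, 1) pairs
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # many-to-one: all counts for a word collapse to a single total
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
```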

Already within the first few weeks, there was enough pushback on using Amazon to make me reconsider my plan. Quite a few students installed Hadoop on their own laptops, picking different distributions, versions, and documentation; answering questions and helping students with problems then became unmanageable. So I fell back on that old workhorse of teaching computer science, the computer center, this time in the form of Amazon EMR clusters with Spark installed.

The setup I arrived at, after trying a few alternatives, consists of launching an EMR cluster, installing some software not found in the stock bootstrap scripts, and then running scripts to create accounts (with ssh keys) for each student. Thereafter, students used ssh clients to log in directly to the cluster. Large input datasets for projects were put into S3 buckets, which are easy for Hadoop or Spark programs to read. This worked rather well, though the provided distributions of Hadoop and Spark aren’t set up for multi-user environments, so extra tweaking was needed. The setup did make life much simpler for students, who could concentrate on their projects. Unfortunate side effects of this design include many hours of idle cluster resources when no student was working, and resource competition, where one student’s job might fail, say, from insufficient memory because of other students’ jobs. I doubt this strategy would scale to a hundred students.
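
The account-creation script was nothing fancy; here is a sketch of the idea, with placeholder usernames and keys, run with root privileges on the master node.

```python
# A sketch of account creation on the master node: for each student, make a
# Linux user and install their ssh public key. Usernames and keys below are
# placeholders; run with root privileges.
import os
import subprocess

students = [
    ('student01', 'ssh-rsa AAAA...placeholder... student01@example.edu'),
    ('student02', 'ssh-rsa AAAA...placeholder... student02@example.edu'),
]

for user, pubkey in students:
    subprocess.check_call(['useradd', '-m', user])
    ssh_dir = '/home/{}/.ssh'.format(user)
    os.makedirs(ssh_dir)
    with open(os.path.join(ssh_dir, 'authorized_keys'), 'w') as f:
        f.write(pubkey + '\n')
    subprocess.check_call(['chown', '-R', '{0}:{0}'.format(user), ssh_dir])
    subprocess.check_call(['chmod', '700', ssh_dir])
    subprocess.check_call(['chmod', '600',
                           os.path.join(ssh_dir, 'authorized_keys')])
```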

Some Conclusions

I’m assuming that most readers are familiar with the common wisdom among data-applications cognoscenti: you don’t have big data. See http://yourdatafitsinram.com to confirm this. I’d read similar bits of wisdom a few years back. A few students in the course independently discovered that the total time taken to prepare and run programs on AWS EMR was likely more than the time to handle the data directly on their own desktop machine or laptop. Yet “big data” is early enough in the hype cycle to merit teaching students something about the topic. My course projects limited the data size to under ten gigabytes, which was reasonable given student effort and computing constraints.

My impression from working on Amazon is that it’s better oriented to production, enterprise-scale use-cases than to experiments and single-case studies. In fact, this kind of thinking goes by the colorfully named “pets versus cattle” argument. The situation for teaching or creative experiments lies on the pet side, whereas industrial, repeatable, cattle-like production workloads justify the high cost of tuning, full-stack expertise, and knowing enough to avoid the stupid mistakes we amateurs make.

Observations similar to my experiences and mistakes have been documented by experts, but those people typically don’t teach courses on big data infrastructure.

One thing about Spark I came to appreciate is how well it is marketed. Lots of big players have thrown money into Spark, and the press on Spark has momentum. There are doubters out there, with solid observations critiquing Spark, though it is a moving target. I found many things about Spark to be disappointing. First, some library APIs were accessible only through Scala, which is fine, but setting up the Scala compiler and a working build environment (effectively needing Maven installed as well) is time-consuming. Second, the transforms and operations that Spark provides initially seem to be complete, but working on problems one comes to realize that quite a few things are missing. It shouldn’t be the case that Hadoop outperforms Spark on some problems; however, the lack of one-to-many and many-to-one mapping, a hallmark of Map-Reduce, just can’t easily be circumvented. Presumably, new versions of Spark will address these shortcomings, albeit leaving users with a confusing spectrum of tuning options and programming choices. Compared to the suite of functional operations found in, for example, the APL programming language, Spark isn’t so impressive. I’m not sure why this should be. Spark’s core design is centered on immutable data structures (RDDs), though I don’t know of a mathematical proof that this design is optimal.

Years ago, a colleague working in automated theorem proving complained that one technique was judged to be better than another because its implementation was superior. In other words, sometimes a bad idea can trump a better idea because of how well the bad idea is executed. We are at a similar stage in big data and cloud computing. People don’t evaluate architectures and algorithms on their own merits; they look at the bottom line in practice. Even then, the fairness of such evaluations is skewed by selective benchmarks, so we don’t really know what the enduring paradigms will be.
