App Engine ‘Big Data’ Tutorial

I originally wrote this in 2013 for a CS4HS workshop, and this is just a record for future generations.

Learn how to work with thousands of rows inside App Engine and the datastore.

Step 1: Introduction

Step 1 complete

Application skeleton

Create a folder somewhere, calling it bigdata or bigdata-project — or anything you like. We’ll create the files below.

app.yaml

Every App Engine app requires a configuration file named app.yaml. For most Python applications, it looks much the same.

application: your-app-id
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /.*
  script: bigdata.app

The above configuration says that every request (/.* matches everything) should be routed to app inside bigdata.py.

For now, your local application ID doesn’t matter — but if you want to put your app online later, you’ll need to register your own on appspot.com.

bigdata.py

This is the absolute bare minimum for an App Engine app written in Python. For requests to /, it responds by writing “Hello, World!”.

Each RequestHandler represents a single endpoint: a URL that end-users can hit to load some resource or perform some action (load some HTML, perform an API request, and so on). Create the bigdata.py file:

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.write('Hello, World!')

app = webapp2.WSGIApplication([
    ('/', MainPage),
], debug=True)

Trying it out

Load the App Engine launcher and press the button labeled with “+”. Select the folder containing the above two files, and press Create. Once that’s done, hit Run and then Browse. If this is your first application, it will probably be served by a local webserver, running on your own machine, at http://localhost:8080.

Using GoogleAppEngineLauncher to run your app

Congratulations! You’ve created your first app. If something went wrong, there’ll be information under Logs. Try changing the message you display, or if you’re feeling eager, you could open and display a file’s contents (but more on that later).


Step 2: Import Data

Step 2 complete

Upload a CSV

To work with big data, you first need a way to get it in: we’ll upload it as CSV and store it in the datastore. CSV is a pretty common format, and Python has a built-in library for dealing with its nuances.
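Before wiring it into App Engine, it’s worth seeing the csv module on its own. A minimal standalone sketch (the rows below are invented; the real GNR file has more columns, but the same idea applies):

```python
import csv

# csv.reader accepts any iterable of lines, not just an open file.
rows = list(csv.reader([
    'ref1,AUSTINMER,,,,,,,WOLLONGONG',
    'ref2,"SOME PLACE, WITH COMMA",,,,,,,WOLLONGONG',
]))

# Each row becomes a list of string fields.
# Quoted fields may contain commas; the reader handles that for us.
first_name = rows[0][1]   # 'AUSTINMER'
second_name = rows[1][1]  # 'SOME PLACE, WITH COMMA'
```

This is why the model code below can simply index into each row (row[0], row[1], row[8]) without worrying about quoting rules.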

model.py

We’re going to add a model class, in a new file, called Place. This is analogous to a table in traditional SQL-like databases, but in Python it’s just a class with several defined properties.

Also build a function that, for each row in the CSV file, creates a Place entity and ‘puts’ it into the datastore. This method is simple now, but we’ll revisit it later.

from google.appengine.ext import db
import csv

class Place(db.Model):
    name = db.StringProperty(indexed=True)
    lga = db.StringProperty(indexed=True)

def create_from_file(f):
    count = 0
    for row in csv.reader(f):
        if not row[0]:
            continue
        # Insert each row of the CSV into a "Place" entity.
        place = Place(key_name=row[0], name=row[1], lga=row[8])
        place.put()
        count += 1
    return count

bigdata.py

Update the original script to add a new RequestHandler that presents an upload form, and passes the uploaded file through to the create_from_file() method.

This uses the StringIO object to pretend that the data being uploaded by a user is a file, even though it hasn’t been written to disk. Place this new code below the other import statements.

import model
from StringIO import StringIO

class ImportPage(webapp2.RequestHandler):
    def get(self):
        # Quickly write an upload form.
        self.response.write('''
<form method="POST" enctype="multipart/form-data">
<input name="file" type="file" />
<input type="submit" />
</form>''')

    def post(self):
        f = StringIO(self.request.get('file'))
        count = model.create_from_file(f)
        self.response.write('%d rows OK' % (count,))
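StringIO itself is easy to try outside App Engine. A small sketch showing that it wraps an in-memory string in a file-like interface (the try/except import keeps it runnable on both Python 2, as used above, and Python 3):

```python
try:
    from StringIO import StringIO  # Python 2, as in the handler above
except ImportError:
    from io import StringIO        # Python 3 equivalent

# Wrap a plain string; nothing is ever written to disk.
f = StringIO('name,lga\nAUSTINMER,WOLLONGONG\n')

# It supports the usual file operations: readline, read, iteration.
first_line = f.readline()  # 'name,lga\n'
rest = f.read()            # 'AUSTINMER,WOLLONGONG\n'
```

Because csv.reader only needs an iterable of lines, the handler can hand this object straight to create_from_file() as if it were a real file.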

As there’s a new RequestHandler involved, you need to update the WSGIApplication at the bottom of the page.

app = webapp2.WSGIApplication([
    ('/', MainPage),
    ('/import', ImportPage),
], debug=True)

You should be able to load up http://localhost:8080/import in your browser (add it to the existing URL) and see a file upload form.

Download the sample CSV data here. This contains several files from data.gov.au — specifically, data from the Geographical Names Register of NSW.

Inside the sample data, there’s a small_gnr.csv file. This contains 10,000 rows of data. Select this file and press ‘submit’ — it might take a while, but it should eventually report the number of rows uploaded — great!


Step 3: Display data

Step 3 complete

API endpoint

One of the interesting constraints of working with ‘big data’ in the datastore is that a single query can return at most 1000 rows at a time, even though you can store millions of rows.

bigdata.py

We’re going to update our request handler to take AJAX requests. These requests will return data encoded in JSON. First, add the library to the top of the file with the other imports.

import json

And then, farther down, replace the entire MainPage with the following code. This allows requests made with a ?query= parameter to return places whose LGA starts with that query.

class MainPage(webapp2.RequestHandler):
    def get(self):
        query = self.request.get('query').upper()
        if query:
            q = model.Place.all()
            q.filter('lga >=', query).filter('lga <', query + u'\ufffd')
            output = []
            for place in q.run():
                output.append({'name': place.name, 'lga': place.lga})
            output.sort(key=lambda f: f['name'])
            self.response.write(json.dumps(output))
            return

        self.response.write(open('index.html').read())

Prefix queries are ‘cheap’ in App Engine, and are done via a bit of a trick. They closely match a query such as LIKE 'Prefix%' in SQL. This is done by:

  • Filtering for values greater than or equal to the query
  • Filtering for values less than the query plus the largest possible Unicode value, “\ufffd”.
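The trick works because the datastore sorts string properties lexicographically, just like ordinary Python string comparison. A sketch of the same half-open range check in plain Python (matches_prefix is a hypothetical helper, not part of any App Engine API):

```python
# The datastore sorts strings lexicographically, so a prefix match is
# equivalent to this half-open range check.
def matches_prefix(value, query):
    return query <= value < query + u'\ufffd'

# Everything starting with the query falls inside the range...
assert matches_prefix(u'WOLLONGONG', u'WOL')
assert matches_prefix(u'WOLLONDILLY', u'WOL')
# ...and nearby non-matches fall outside it.
assert not matches_prefix(u'WINGECARRIBEE', u'WOL')
assert not matches_prefix(u'WYONG', u'WOL')
```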

Once you’ve made these changes and saved the file, you should be able to load up e.g., http://localhost:8080/?query=wol to return all places that are in an LGA prefixed with ‘wol’.

Something interesting to try is to now load a terminal and request the contents of the API endpoint via a command-line call, e.g.:

curl http://localhost:8080/?query=gos
Using curl to find information

index.html

You might have noticed that the above change now opens and serves a file named index.html if there was no query. It’s time to create this file with a bit of HTML and JS (almost out of scope for this workshop) that loads data via AJAX.

<input id="query" placeholder="LGA search" />
<dl id="display">
</dl>
<script>
document.getElementById('query').addEventListener('keyup', function() {
  var list = document.getElementById('display');
  list.innerHTML = '';
  this.req && this.req.abort();
  if (this.value) {
    this.req = new XMLHttpRequest();
    this.req.open('GET', '/?query=' + this.value);
    this.req.onload = function() {
      var places = JSON.parse(this.responseText);
      places.forEach(function(place) {
        var html = '<dt>' + place.name + '</dt><dd>' + place.lga + '</dd>';
        list.innerHTML += html;
      });
    };
    this.req.send();
  }
});
</script>

Now that we serve this file via the MainPage, you should be able to refresh your site at http://localhost:8080/ and see a short HTML form. Typing keys will trigger the keyup event, which sends off an AJAX request to your new API endpoint.

With any luck, you’ll see data about the names of places you enter — almost as you type. If you have any problems, as before, please check Logs.


Step 4: Extension ideas

  • Geographic search
  • Show data on a map
  • MapReduce framework
  • Group by place?
  • Serve to a device