How to build fast and robust REST APIs with Scala

--

“There is more than one way to skin a cat.”

This is a popular saying and, although the mental picture can be disturbing, it expresses a universal truth, particularly in computer science.

What follows is thus a way to build REST APIs in Scala, not the way to build them.

For all practical purposes, let’s pretend we are building a couple of APIs for a Reddit-like application where users can access their profile and submit updates. To build on the Reddit metaphor, imagine we are (re)implementing api/v1/me and api/submit.

Some ground work

In a nutshell:

  1. Scala is an object-oriented programming language with functional roots in lambda calculus; it runs on a Java Virtual Machine and integrates seamlessly with Java.
  2. AKKA is a library built atop Scala that provides actors (thread-safe, message-driven objects) and more.
  3. Spray.io is an HTTP library built atop AKKA that provides a simple, flexible HTTP protocol implementation so you can roll your own cloud service.

The challenge

REST APIs are expected to provide:

  1. fast, secure call level authentication and permission control;
  2. fast business logic computation and I/O;
  3. all of the above under high concurrency;
  4. did I mention fast?

Step 1, authentication and permission

Authentication should be implemented with OAuth or OAuth 2, or some flavor of public/private key authentication.

The benefit of an OAuth 2 approach is that you get a session token (which you can use to look up the corresponding user account and session) and a signature token; more on that in a moment.

We will continue here under the assumption that this is what we use.

The signature token is normally obtained by signing the entire payload of the request with a shared secret key using SHA1 (in practice, an HMAC). The signature token thus kills two birds with one stone:

  1. it tells you whether the caller knows the right shared secret;
  2. it prevents data injection and man-in-the-middle attacks.

There are a couple of prices to pay for the above: first, you have to pull the data out of your I/O layer, and second, you have to compute a relatively expensive hash (i.e. SHA1) before you can compare the signature token from the caller with the one the server builds, which is considered the correct one since the back end knows all (almost).
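
As a concrete illustration, here is a minimal sketch of that server-side check in Scala, using HMAC-SHA1 from the JDK (a common way to sign a payload with a shared secret); the object and method names are illustrative, not part of any framework:

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object RequestSignature {
  // Sign the raw request payload with the shared secret and hex-encode the digest.
  def sign(payload: Array[Byte], sharedSecret: String): String = {
    val mac = Mac.getInstance("HmacSHA1")
    mac.init(new SecretKeySpec(sharedSecret.getBytes("UTF-8"), "HmacSHA1"))
    mac.doFinal(payload).map("%02x".format(_)).mkString
  }

  // Compare the signature token sent by the caller with the one the server builds.
  def isValid(payload: Array[Byte], sharedSecret: String, signatureToken: String): Boolean =
    sign(payload, sharedSecret) == signatureToken
}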

To help with I/O, one can add a cache (Memcache? Redis?) and remove the need for an expensive trip to the persisted stack (Mongo? Postgres?).
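
A minimal cache-aside sketch of that idea follows; the in-process map stands in for Memcache or Redis, and the UserProfile fields are purely illustrative:

import scala.collection.concurrent.TrieMap

case class UserProfile(id: String, permissionLevel: String) // illustrative shape

object SessionCache {
  private val cache = TrieMap.empty[String, UserProfile]

  // Resolve a session token from the cache first; fall back to the persisted
  // stack (Mongo? Postgres?) only on a miss, then remember the result.
  def resolve(sessionToken: String)(loadFromDb: String => Option[UserProfile]): Option[UserProfile] =
    cache.get(sessionToken).orElse {
      val loaded = loadFromDb(sessionToken)
      loaded.foreach(cache.put(sessionToken, _))
      loaded
    }
}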

AKKA and Spray.io are very effective in addressing the above. Spray.io encapsulates the steps needed to extract HTTP header information and the payload. AKKA actors enable asynchronous tasks to be performed independently from the API parsing. This combination reduces the load on the request handler and can be benchmarked so that most APIs have a processing time below 100ms. Note: I said processing time, not response time; I am not including network latency.

Note: using AKKA’s actors, it is possible to fire two concurrent processes, one for permission/authentication and one for the business logic, then register for their callbacks and merge the results. This parallelizes the API implementation at the call level, taking the optimistic approach that authentication will succeed. It requires minimal data repetition, in that the client has to send everything the business logic needs, like the user id and anything else you would normally extract from the session. In my experience this approach yields around a 10% reduction in execution time and is expensive both at design time and at run time, as it uses more CPU and more memory. There might be scenarios where the relatively small gain ties to the bottom line of processing millions of calls per minute, thus scaling up the savings; in most cases, however, I would not recommend it.
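
For the curious, here is roughly what that optimistic fan-out could look like with the AKKA ask pattern; the actors, messages, and result types below are made up for the example:

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical messages and result types, purely for illustration.
case class Authenticate(sessionToken: String)
case class RunBusinessLogic(userId: String, payload: String)
case class AuthResult(ok: Boolean)
case class LogicResult(body: String)

def handleOptimistically(authActor: ActorRef, logicActor: ActorRef,
                         sessionToken: String, userId: String, payload: String)
                        (implicit ec: ExecutionContext): Future[LogicResult] = {
  implicit val timeout: Timeout = Timeout(5.seconds)

  // Fire both requests concurrently: authentication and business logic.
  val authF  = (authActor ? Authenticate(sessionToken)).mapTo[AuthResult]
  val logicF = (logicActor ? RunBusinessLogic(userId, payload)).mapTo[LogicResult]

  // Merge the callbacks: surface the business result only if authentication succeeded.
  for {
    auth  <- authF
    logic <- logicF
  } yield if (auth.ok) logic else throw new SecurityException("authentication failed")
}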

Once the session token is resolved to a user, one can cache the user profile which includes the permission levels and simply compare these against the permission level required to perform the API call.

To get the permission level required by an API call, one parses the URI to extract the REST resource and identifier (if applicable) and uses the HTTP method to determine the type of access.

Say for example you want to allow registered users to get their profile via an HTTP GET

/api/v1/me

then this is what a permission configuration document would look like in such a system:

{
  "v1/me": [{
    "admin": ["get", "put", "post", "delete"]
  }, {
    "registered": ["get", "put", "post", "delete"]
  }, {
    "read_only": ["get"]
  }, {
    "blocked": []
  }],
  "submit": [{
    "admin": ["put", "post", "delete"]
  }, {
    "registered": ["post", "delete"]
  }, {
    "read_only": []
  }, {
    "blocked": []
  }]
}

The reader should note that this is a necessary but not sufficient condition for permitting a data access. So far we have established that the calling client is authorized to make the call and that the user has permission to access the API. However, in many cases we also need to ensure that user A cannot see (or edit) user B’s data. So let’s extend the notation with “get_owner”, meaning that the authenticated user has permission to execute a GET only if he or she owns the resource. This is how the config would look then:

{
  "v1/me": [{
    "admin": ["get", "put", "post", "delete"]
  }, {
    "registered": ["get_owner", "put", "post", "delete"]
  }, {
    "read_only": ["get_owner"]
  }, {
    "blocked": []
  }],
  "submit": [{
    "admin": ["put", "post", "delete"]
  }, {
    "registered": ["put_owner", "post", "delete"]
  }, {
    "read_only": []
  }, {
    "blocked": []
  }]
}

Now a registered user can access, read, and modify his/her own profile, but no one else can (other than an admin). Similarly, only the owner can update a submission with:

/api/submit/<identifier>

The power of this approach is that dramatic changes to what users can and cannot do with the data can be accomplished simply by changing the permission configuration; no code changes are required. Thus, during the life cycle of the product, the back end can match changes in requirements at a moment’s notice.

The enforcement can be encapsulated in a couple of functions that can be agnostic of the business logic of the API and just implement and enforce authentication and permission:

def validateSessionToken(sessionToken: String): UserProfile = {
  ...
}

def checkPermission(
  method: String,
  resource: String,
  user: UserProfile
): Unit = {
  ...
  // throws an exception on failure
}
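
A minimal sketch of what checkPermission might look like once the JSON document above has been parsed into a map (resource to role to allowed methods); the config is passed explicitly here for clarity, the _owner variants are left out, and UserProfile is assumed to carry a permissionLevel field:

object PermissionEnforcement {
  // resource -> role -> allowed HTTP methods, parsed from the JSON config above.
  type PermissionConfig = Map[String, Map[String, Set[String]]]

  case class PermissionException(msg: String) extends Exception(msg)

  // Throws an exception on failure, matching the stub above.
  def checkPermission(method: String, resource: String, user: UserProfile,
                      config: PermissionConfig): Unit = {
    val allowed = config
      .getOrElse(resource, Map.empty[String, Set[String]])
      .getOrElse(user.permissionLevel, Set.empty[String])
    if (!allowed.contains(method.toLowerCase))
      throw PermissionException(s"${user.permissionLevel} cannot $method $resource")
  }
}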

These would be called at the beginning of the Spray.io handling of the API calls:

// NOTE: profileReader and submissionWriter are omitted here; assume they are references to AKKA actors.
def route =
  pathPrefix("api") {
    // extract the headers and HTTP information ...
    try {
      val user: UserProfile = validateSessionToken(sessionToken)
      checkPermission(method, resource, user)
      pathPrefix("v1") {
        path("me") {
          get {
            complete(profileReader ? GetUserProfile(user.id))
          }
        }
      } ~
      path("submit") {
        post {
          entity(as[String]) { jsonStr =>
            val payload = read[SubmitPayload](jsonStr)
            complete(submissionWriter ? Submit(payload))
          }
        }
      }
      // ...
    } catch {
      case e: Exception => complete(completeWithError(e.getMessage))
    }
  }

As we can see, this approach keeps the Spray.io handler readable and easy to maintain, as it separates authentication/permission from the individual business logic of each API. Data ownership enforcement, not shown above, can be achieved by passing a Boolean to the I/O layer, which would then enforce user data ownership at the persistence level.
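
As a sketch of that ownership flag (the names and the query shape below are hypothetical), the I/O layer simply adds the owner to the query when the flag is set, so the restriction happens at the persistence level:

case class Submission(id: String, ownerId: String, text: String) // illustrative shape

// When enforceOwnership is true, the query is restricted to documents owned by
// the calling user; findOne stands in for whatever persistence client is in use.
def readSubmission(id: String, user: UserProfile, enforceOwnership: Boolean)
                  (findOne: Map[String, String] => Option[Submission]): Option[Submission] = {
  val baseQuery = Map("_id" -> id)
  val query = if (enforceOwnership) baseQuery + ("owner_id" -> user.id) else baseQuery
  findOne(query)
}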

Step 2, business logic

The business logic can be encapsulated in I/O actors like the submissionWriter mentioned in the snippet of code above. This actor would implement an asynchronous I/O operation that writes first to a cache layer, say Elasticsearch, and second to a DB of choice. The DB writes can be further decoupled into fire-and-forget logic that uses log-based recovery, so that the client does not have to wait for these expensive operations to complete.
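
A minimal sketch of such an actor follows; writeToCache and writeToDb are hypothetical stand-ins for the Elasticsearch and DB clients, and Submit and SubmitPayload correspond to the messages used in the route above, with illustrative field shapes:

import akka.actor.{Actor, ActorLogging, Props}

// Hypothetical message and payload types for the example.
case class SubmitPayload(userId: String, text: String)
case class Submit(payload: SubmitPayload)

class SubmissionWriter(writeToCache: SubmitPayload => Unit,
                       writeToDb: SubmitPayload => Unit) extends Actor with ActorLogging {
  def receive: Receive = {
    case Submit(payload) =>
      // 1. Write to the cache layer (e.g. Elasticsearch) so reads see the submission quickly.
      writeToCache(payload)
      // 2. Acknowledge right away; the caller does not wait for the DB.
      sender() ! "accepted"
      // 3. Fire-and-forget the expensive DB write; failures are only logged (log-based recovery).
      try writeToDb(payload)
      catch { case e: Exception => log.error(e, "deferred DB write failed") }
  }
}

object SubmissionWriter {
  def props(writeToCache: SubmitPayload => Unit, writeToDb: SubmitPayload => Unit): Props =
    Props(new SubmissionWriter(writeToCache, writeToDb))
}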

Note this is an optimistic non-locking approach and the only way for the client to be certain that the data was written would be to follow up with a read. Until such time, a mobile client should operate under the assumption that the respective cached data is dirty.

This is a very powerful design paradigm; however, the reader should be forewarned that with AKKA + Spray.io you cannot go more than three levels deep in the actor call stack. For example, if these are the actors in the system:

  1. S for the Spray router.
  2. A for the API handler.
  3. B for the I/O handler.

using x ? y notation to mean that x calls y requesting a callback, and x ! y to mean that x fires and forgets y, the following works:

S ? A ! B

However these do not:

S ! A ! B

S ? A ! B ! B

In these two cases all instances of B are destroyed as soon as A completes, so effectively you only have one chance to pack all your offloaded computation into a fire-and-forget actor. I believe this is a limitation of Spray and not AKKA, and it might have been addressed by the time this post is published.
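
In code, the working S ? A ! B shape looks roughly like the following (actor and message names are made up for the example): the router asks A, and A offloads to B with a tell before replying.

import akka.actor.{Actor, ActorRef}

// Hypothetical messages for the example.
case class HandleApi(payload: String)
case class OffloadWork(payload: String)

// A: the API handler. S asks (?) this actor; it fire-and-forgets (!) to B, then replies to S.
class ApiHandler(ioHandler: ActorRef) extends Actor {
  def receive: Receive = {
    case HandleApi(payload) =>
      ioHandler ! OffloadWork(payload) // the one chance to hand off deferred work
      sender() ! "ok"                  // reply flows back to the Spray router's ask
  }
}

// B: the I/O handler doing the deferred work; nobody waits for it.
class IoHandler extends Actor {
  def receive: Receive = {
    case OffloadWork(payload) =>
      // perform the slow write here
      println(s"writing $payload")
  }
}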

Lastly, I/O and persistence

As shown above, we can push slow write operations into asynchronous threads to keep API POST/PUT performance within acceptable execution time. Processing times usually range from the tens of milliseconds to the low hundreds of milliseconds, depending on the server profile and how much logic can be deferred using the fire-and-forget approach.

However, it is often the case that reads outnumber writes by one or more orders of magnitude. A good caching approach is thus critical to delivering high overall throughput.

Note: the opposite is true for IoT landscapes, where sensor data writes coming from nodes outnumber reads by several orders of magnitude. In this case the landscape can be set up so that one group of servers only accepts writes from IoT devices, while another group with different specs is dedicated to API calls from clients (front end). Most if not all of the code base can be shared between these two classes of servers, and features can simply be turned off via configuration to prevent security vulnerabilities.

A popular approach is to use a memory cache like Redis. Redis performs well when used to store user permissions for authentication, that is, data that does not change often. A single Redis node can store up to 250 million key/value pairs.

For reads that need to query the cache we need a different solution. Elasticsearch, an in-memory index, works exceptionally well for either geographical data or data that can be partitioned into types. For example, an index named submissions with types dogs and motorcycles can easily be queried to get the latest submissions (subreddits?) for certain topics.

For example, using Elasticsearch’s HTTP API notation:

curl -XPOST 'localhost:9200/submissions/dogs/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "created": {
            "gte": 1464913588000
          }
        }
      }
    }
  }
}'

would return all documents created after the specified date in the dogs type. Similarly, we could look for all posts in /submissions/motorcycles whose documents contain the word “Ducati”.

curl -XPOST 'localhost:9200/submissions/motorcycles/_search?pretty' -d '
{
  "query": { "match": { "text": "Ducati" } }
}'

Elasticsearch performs very well for reads when the index is carefully designed and created prior to data being entered. This might discourage some, since one of the benefits of Elasticsearch is the ability to create an index simply by posting a document and letting the engine figure out types and data structures. However, the benefits of defining the structure outweigh the costs, and it should be noted that migrating to a new index is straightforward, even in production environments, when using aliases.

Note: Elasticsearch indexes are implemented as balanced trees, thus insert and delete operations can be expensive when the tree gets large. Inserting into an index with tens of millions of documents can take up to tens of seconds, depending on the server spec. This can make your Elasticsearch writes one of the slowest-running processes in your cloud (aside from DB writes, of course). However, pushing the write into a fire-and-forget AKKA actor can ameliorate, if not resolve, the problem.

Conclusions

Scala + AKKA + Spray.io make a very effective technology stack for building high-performance REST APIs when married with in-memory caching and/or in-memory indexing.

I worked on an implementation not too far from the concepts described here where 2000 hits per minute per node barely moved the CPU load above 1%.

Bonus round: machine learning and more

Adding Elasticsearch to the stack opens the door for both in-line and off-line machine learning, as Elasticsearch integrates with Apache Spark. The same persistence layer used to serve the API can be reused by machine learning modules, reducing coding, maintenance costs, and stack complexity. Lastly, Scala allows us to use any Scala or Java library, opening the door to more sophisticated data processing and leveraging such things as Stanford’s CoreNLP, OpenCV, Spark MLlib, and more.

Links to technologies mentioned in this post

  1. http://www.scala-lang.org
  2. http://spray.io
  3. and for (2) to make sense, have a look at http://akka.io
