Serverless Superheroes: Ben Kehoe and the Roomba are vacuuming up servers

Forrest Brazeal
Published in A Cloud Guru · 8 min read · Oct 9, 2017

Welcome to “Serverless Superheroes”!
In this space, I chat with the toolmakers, innovators, and developers who are navigating the brave new world of “serverless” cloud applications.

For today’s edition, I chatted with Ben Kehoe, an AWS Community Hero who is doing some amazing things with serverless at iRobot, maker of the Roomba. The following interview has been edited and condensed for clarity.

Forrest Brazeal: I don’t usually think of “serverless” and “robots” in the same sentence. How are you bringing the two together at iRobot?

Ben Kehoe: While finishing my PhD at UC Berkeley in 2014, I was looking for ways to turn robotics algorithms into web services. That involved some familiar-sounding questions: how do we take people who don’t really know about the cloud, containerize their applications as services, and then run them on demand? I didn’t get very far with this idea before I graduated, and I was looking at it from a robotics-only perspective at the time.

But then AWS Lambda came out, and I was like — “oh, this sounds exactly like what I was looking for.”

iRobot had just launched our first cloud-connected Roomba with a full-solution cloud provider, but we knew we wanted to switch to an AWS solution built on top of their IoT service.

Now, we’re not a company with a long history of elastic cloud development. We’d built networked robots, we’d built single-tenant cloud robot applications. But we hadn’t done the sort of scalable public cloud infrastructure projects that would give us knowledge of how to build and scale the kind of application we needed.

So facing the fact that we’d have to scale up teams very quickly, we looked at Lambda and all these other managed services — DynamoDB, Kinesis, Redshift, Cognito, SQS, API Gateway, KMS, about 25 AWS services in all — and we said: “We can stitch these together into the solution we need.”

We figured that running the app was going to be easier with serverless — not no work, but easier — and building would be easier too. Because once you get up to speed on the services, you’re not bogged down in keeping things running for yourself.

The drawback, especially at the time, was that there wasn’t tooling available that fit our needs. Back then there was JAWS, which is now the Serverless framework, and it just didn’t do enough. It was very focused on client-side coordination of resources, and since we wanted to be cloud-side and were already heavy users of CloudFormation, we thought it was better for us to write our own tooling.

All that was a significant amount of work, but it got us to the point where we now have an application that can scale to millions of connected robots and runs with a single-digit number of full-time operations people. Which would never have been possible otherwise.

It sounds like you also had the advantage of not having a lot of technical debt in the cloud to start with.

Correct. This was totally greenfield development, which made it much easier.

So would you recommend that other people follow the path you did, particularly if they have investment in older cloud technologies?

I think there are two factors to that decision. If you’re building native serverless applications, it makes sense to design an event-driven architecture. In fact, I think event-driven design is so natural for serverless architecture that people tend to conflate the two.

But if you’re coming from a traditional architecture, you need to be able to port what you have over to serverless and have it start saving you money right away. And then, down the road, you can start remixing the underlying design to take advantages of the event-driven nature of FaaS systems, making your application simpler and more robust.

At the same time, we’re not at feature parity for synchronous, HTTP-driven web services in serverless. There are a lot of pieces around deployment and management that aren’t quite there yet. And until we fix those, it’ll really hurt adoption.

Can you elaborate on some of those pain points?

Service discovery is a major missing piece. In the world of servers, distributed systems management tools like Consul and etcd are low latency, highly available, and allow changing lookup tables over time. Those tools go a long way toward helping you deploy new parts of your infrastructure and stitch it together. You can also use VPCs and subnets to isolate different parts of your system.

There’s no notion of that in serverless — all your Lambda functions are in the same namespace in the same account. There aren’t established best practices for naming conventions or for doing phased rollouts. You can use aliases in Lambda or stages in API Gateway, and when you say “update,” AWS can drain your connections out of the old deployment and switch over to the new one — but it’s all one big lever.

If the new deployment doesn’t work properly, you won’t know until all your traffic has switched over. So the idea of switching 10% of your clients over to a new system, and in a sticky fashion, so those 10% stay the same 10%, is not something we’re able to do in serverless yet.
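To make that “one big lever” concrete, here’s a minimal sketch with boto3 (the function name, alias name, and workflow are hypothetical, not iRobot’s actual setup): once the alias is repointed, every caller of that alias gets the new version at once.

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish the currently-uploaded code as a new immutable version.
new_version = lambda_client.publish_version(
    FunctionName="robot-api"  # hypothetical function name
)["Version"]

# Repoint the "prod" alias at that version. Every invocation targeting
# robot-api:prod now runs the new code; there is no gradual, sticky shift.
lambda_client.update_alias(
    FunctionName="robot-api",
    Name="prod",
    FunctionVersion=new_version,
)
```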

So how are you getting around the service discovery limitations of serverless at iRobot?

We have a service discovery process by which a robot goes to a well-known URL and hands in some context data about itself. And then the service responds with information like “here’s the AWS IoT endpoint you need to talk to.” That allows us to use the same mechanism to control which AWS region the robots go to, and then which deployment they use within a region.

That’s also how we do phased rollout for deployments. We stand up multiple copies of our application, and then we use the service discovery mechanism to route the clients between those deployments within a single region.
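As a rough illustration of that mechanism (not iRobot’s actual implementation; the deployment names, endpoints, and request shape are all invented), a discovery function might hash the robot’s identifier to pick a deployment, which also gives you the sticky, percentage-based rollout that Lambda aliases alone don’t:

```python
import hashlib
import json

# Hypothetical deployment table for one region: each copy of the application,
# the AWS IoT endpoint it fronts, and the share of robots it should receive.
DEPLOYMENTS = [
    {"name": "blue",  "iot_endpoint": "a1example-ats.iot.us-east-1.amazonaws.com", "weight": 90},
    {"name": "green", "iot_endpoint": "a2example-ats.iot.us-east-1.amazonaws.com", "weight": 10},
]

def handler(event, context):
    """Lambda behind the well-known discovery URL (e.g. fronted by API Gateway)."""
    body = json.loads(event["body"])
    robot_id = body["robot_id"]

    # Hashing the robot ID makes the assignment sticky: the same robot keeps
    # landing in the same bucket until the weights change.
    bucket = int(hashlib.sha256(robot_id.encode()).hexdigest(), 16) % 100

    threshold = 0
    for deployment in DEPLOYMENTS:
        threshold += deployment["weight"]
        if bucket < threshold:
            chosen = deployment
            break

    return {
        "statusCode": 200,
        "body": json.dumps({
            "deployment": chosen["name"],
            "iot_endpoint": chosen["iot_endpoint"],
        }),
    }
```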

And one of the nice things about serverless is, it doesn’t cost you anything to have those multiple copies of your application! So it’s okay to just stamp out copies and leave them sitting in your environment.

The only downside is that it can become really hard to keep track of all those deployments!

We do everything through CloudFormation, so the stacks make each deployment traceable and auditable. We also send all our function logs to SumoLogic, which allows us to instrument various aspects of our operations.
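Conceptually, stamping out another copy is just creating the same CloudFormation template under a new stack name. Here’s a hedged sketch with boto3; the stack naming convention, template URL, and parameter are invented for illustration:

```python
import boto3

cfn = boto3.client("cloudformation")

# One template, many stacks: each copy of the application is its own stack,
# so every deployment stays individually traceable and auditable.
for deployment in ("blue", "green"):
    cfn.create_stack(
        StackName="robot-app-" + deployment,  # hypothetical naming convention
        TemplateURL="https://s3.amazonaws.com/example-bucket/robot-app.yaml",
        Parameters=[{"ParameterKey": "DeploymentName", "ParameterValue": deployment}],
        Capabilities=["CAPABILITY_IAM"],
    )
```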

You mentioned earlier that you’re excited about event-driven architecture. Why do you think this is a design paradigm that is particularly well suited to serverless?

Since I come from a background of Python, Java, and C++, the event-driven mindset is more of a shift for me. Whereas for web developers who live in JavaScript, the event-driven model is very natural.

Events are nice because you can trace the flow of information and make it manifest, without having to figure out how to coordinate things using only synchronous requests.

To use the common example of image thumbnailing — with events, my image uploader doesn’t need to know about thumbnailing. It doesn’t need to put that event somewhere so it can be picked up by somebody else. I can directly instrument onto that datastore. So the events arise out of the system itself rather than needing to be injected somewhere.
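A minimal sketch of that thumbnailing pattern (assuming Pillow is bundled into the deployment package and the bucket names are placeholders): the uploader just writes to S3, and this function is wired to the bucket’s ObjectCreated notifications, so the event comes from the datastore itself rather than being injected by the uploader.

```python
import io

import boto3
from PIL import Image  # assumes Pillow is packaged with the function

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated events; the uploader knows nothing about this code."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        image = Image.open(io.BytesIO(original)).convert("RGB")
        image.thumbnail((128, 128))

        out = io.BytesIO()
        image.save(out, format="JPEG")

        # Write to a separate (placeholder) bucket so the thumbnail doesn't
        # re-trigger this same function.
        s3.put_object(Bucket=bucket + "-thumbnails", Key=key, Body=out.getvalue())
```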

I find some of the work around event sourcing — like the work that Nordstrom is doing, where everything has to be an event — to be really interesting. It all sounds very good, but I have to admit that in a legacy system, where all the components don’t speak events, it’s not clear to me how all that stitches together in a seamless way.

There’s a school of thought out there among event-driven developers saying that a database — particularly a relational database — is just a giant piece of mutable shared state, and we should be getting away from sharing any data globally. What’s your perspective on that?

If you look at these event-sourced systems, they have databases all over the place. Everything has a local instantiation of the event stream. So does it make sense to do it that way, or to have one instance of the event stream?

That’s where Google Cloud Spanner, Cosmos DB, and FaunaDB are all going — they want to solve the problem of having mutable shared state at scale. The biggest question I have when I see these global databases is: how does data governance work? If I need German data to stay in Germany, how do I make that work in a so-called global DB? What about China?

But the big advantage of event streams is less architectural and more organizational. You don’t need DBAs who control the global schema. Of course you need events that are well-documented, but then any given team that’s working with the data can transform the events and store the events in ways that are useful to them. If they do it poorly, that only affects them, not other teams.

And that brings in the traditional serverless benefit of speeding up development and innovation time, too.
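To make that organizational point concrete, a team consuming a shared event stream might project just the events it cares about into a table it owns, along the lines of this hedged sketch (the stream wiring, event shape, and table name are all assumptions, not anything iRobot described):

```python
import base64
import json

import boto3

# A table owned entirely by this team; its schema can change without touching
# the shared event stream or any other team's projections.
table = boto3.resource("dynamodb").Table("robot-status-by-serial")

def handler(event, context):
    """Triggered by a Kinesis event stream; builds this team's own view of the data."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Project only the events this team cares about; everything else is ignored.
        if payload.get("type") != "robot.status.changed":
            continue

        table.put_item(Item={
            "serial": payload["serial"],
            "timestamp": payload["timestamp"],
            "status": payload["status"],
        })
```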

So looking toward the future of serverless — one thing that really fascinates me is the future of containers. As I’ve personally experienced, once you start doing more interesting things with serverless, it gets harder to manage all the packages and other dependencies that go into these little functions, to the point where using a container might actually be easier. In fact, Azure has already announced a “bring your own container” model for FaaS. What are your thoughts on this trend?

I work at a company staffed primarily by robot software engineers, who write C++ code that runs on very specific versions of Linux. Compiling that code against Amazon Linux to run in Lambda has been a real struggle. A “bring your own container” model would help us get around that by having an environment similar to what’s running on a robot.

So containers are always important. The question becomes — how lightweight can you be in what you’re bringing? If you only need to bring code, you’ll be better off, because then what’s running inside the function is managed by the provider, so you don’t have to worry about package vulnerabilities and so forth.

Docker has a great phrase for this — “batteries included, but swappable.” And I feel like serverless is the same. If I just have Python code or JavaScript code, I should be able to use that directly, and have the container managed for me. But if I want to go a step further and say “I have other stuff I want in this environment that’s hard to get into your container; let me bring my own,” that should be possible too.

It’s like reserved instances in EC2. On Demand is great, but if I know that I’m going to need the instance for a long time, I can tell AWS so they can get me better pricing. We’ll hopefully see that model brought into FaaS so that we can say to the cloud providers — “Make it cheaper for me. Let me help you with your planning problem, because I know exactly what I need.”

Ben Kehoe is a regular speaker at the A Cloud Guru Serverlessconf. Check out his talks from Serverlessconf Austin here and ServerlessConf NYC here.

Forrest Brazeal

AWS Serverless Hero. Cloud architect @Trek10inc, words and cartoons @acloudguru. Previously cloud infrastructure @Infor. Opinions here are mine.