Transitioning to a cloud-centric architecture

An in-depth look at how an IBM product development team quickly moved to microservices written in Node.js

Todd Kaplinger
DevOps For the Cloud
15 min read · Jan 20, 2017


By Greg Truty and Todd Kaplinger

By introducing an entirely new approach to solving problems, cloud-centric architectures are the latest disruptor in the technology industry. Businesses are now able to take solutions to market faster — and at cloud scale — with a smaller investment. Startups and born-on-the-cloud companies were the early adopters of cloud-native development, but the movement to cloud-centric architectures is gaining popularity in the enterprise as well.

Moving to a cloud-centric architecture typically requires decisions in five main areas: data transport, data format, security, APIs, and elasticity. This article describes our real-life journey, technology decisions, and lessons learned as we moved to a cloud-centric architecture. You’ll see how we transformed a traditional Java™ Platform, Enterprise Edition application to a lightweight set of microservices written in Node.js.

Our starting point and business goal

Our journey started with an IBM product called Presence Zones. It had been on the market for a few years, and we wanted to simplify its delivery and scale it out for broader customer adoption. Changing its architecture to deliver it as a cloud-based service on the IBM Bluemix® cloud platform enabled us to provide value to the customer and offer new capabilities incrementally. The offering was renamed Presence Insights.

Our business goal for the Presence Insights service was to provide the equivalent of web analytics for the physical world. For example, imagine that your favorite coffee shop wants to improve your experience. As you (and your mobile device) enter the store, the store knows who you are (you’ve opted in), asks whether you want the same thing you ordered in the past, and offers you a free Wi-Fi code because you recently spent 3 hours there. This awareness requires an Internet of Things architecture, where the system processes events emitted from sensors such as Wi-Fi and beacons to track device movements within a physical location.

Initially, our target audience for Presence Insights was retailers — since they were the early adopters trying to understand and engage customers in their physical spaces — but many other industries and solutions are now addressing the four pillars of sensing, analyzing, acting, and engaging as consumers move about a physical space.

“We transitioned from prototype to incubator to generally available service on Bluemix in less than 6 months.”

Technology choices for our microservices architecture

Our general approach was to start by building out a set of simple Node.js microservices. We leveraged the ecosystem of existing Node.js packages that interact over a lightweight messaging protocol called AMQP (Advanced Message Queuing Protocol) using MQ Light. This gave us loose coupling between the services, as well as an easy way to iterate on each service.

As we continued to build out the platform, we embraced newer technologies such as a NoSQL database (Cloudant) and a Lucene-based search engine (Elastic Search) for our analytics, as well as an in-memory key-value cache (Redis) that acts as a shock absorber in front of Cloudant.

Table 1: Comparing Presence Zones and Presence Insights

All of our technology choices were proven cloud-scalable architectures used by successful born-on-the-cloud companies. The ultimate result of these choices proved successful for us as well. We transitioned from prototype to incubator to generally available service on Bluemix in less than 6 months.

Figure 1 shows the data flow for sensor events entering our system and how they interact with the rest of it. Presence Insights is accessible from a runtime perspective via the sensors, and from a dashboard perspective, where administrators can set up a site, configure rules, or view analytical metrics on sensor event traffic for specific time periods.

Figure 1. Presence Insights technical cloud architecture

For sensor events, the main entry point is the connector layer. Each connector we support has a separate format described by the sensor vendor (Wi-Fi or beacon). Connectors are intended to “fire and forget”: their main purpose is to receive payloads from the sensors, validate them, and quickly publish the events to a topic on our message bus (MQ Light). There are multiple subscribers for each topic, and each subscriber has a well-defined purpose, in the spirit of the actor model popularized by open-source toolkits such as Akka: it does one thing and does it well. These purposes include hydrating events as they enter the system, storing the events in our analytics repository hosted inside Elastic Search on Compose, and firing webhook events for systems that have subscribed to specific actions (for example, a user dwelling in a particular location for a set amount of time).
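To make that concrete, here is a minimal sketch of a fire-and-forget connector, assuming the mqlight Node.js client and Express; the endpoint path, topic name, and required fields are hypothetical:

    // A connector sketch: validate the sensor payload, publish it to a
    // topic, and respond immediately (fire and forget).
    var express = require('express');
    var mqlight = require('mqlight');

    var client = mqlight.createClient({ service: 'amqp://localhost:5672' });
    var app = express();
    app.use(express.json());

    app.post('/connectors/wifi', function (req, res) {
      var event = req.body;
      // Hypothetical validation: require the fields the pipeline expects.
      if (!event.deviceId || !event.timestamp) {
        return res.status(400).send('missing deviceId or timestamp');
      }
      // Publish and move on; subscribers hydrate, persist, and fire webhooks.
      client.send('sensor/wifi/events', event);
      res.status(202).end(); // accepted; processing continues asynchronously
    });

    client.on('started', function () {
      app.listen(3000);
    });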

Intermixed in these flows are our in-memory cache (Redis) and our persistence tier (Cloudant). We designed the system to continue operating even when any one part is unhealthy: there are minimal dependencies and single points of failure as data flows through the system.
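The shock-absorber role of the cache amounts to a cache-aside read path. Here is a minimal sketch, assuming the redis and nano (Cloudant/CouchDB) Node.js clients; the account, database, and key names are hypothetical:

    // Cache-aside sketch: read from Redis first, fall back to Cloudant on
    // a miss, and repopulate the cache so read bursts don't hit the database.
    var redis = require('redis').createClient();
    var devices = require('nano')('https://account.cloudant.com').use('devices');

    function getDevice(id, callback) {
      redis.get('device:' + id, function (err, cached) {
        if (!err && cached) return callback(null, JSON.parse(cached)); // hit
        devices.get(id, function (err, doc) {
          if (err) return callback(err);
          // Cache with a short TTL so stale entries expire on their own.
          redis.setex('device:' + id, 60, JSON.stringify(doc));
          callback(null, doc);
        });
      });
    }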

Similarly, we built a set of microservices around the management dashboard UI. As our management capabilities grew, we began splitting out services by function, such as separate microservices for analytics, site setup, and rules processing. This let us quickly iterate on each element of the system and independently deliver updates to each service as needed.

The microservices architecture allowed for flexible deployment of each service. Over the course of our journey, we deployed our runtimes in a variety of ways: Node.js buildpacks, VM images, and Docker images, deployed natively in Bluemix as well as in SoftLayer data centers. For each deployment, we replicated the topologies in regions located in Dallas, London, and Sydney. This is one of the major benefits of building a cloud-centric architecture: you can rip and replace components while the remainder of the system stays intact.

How industry trends drive the evolution to cloud computing

A pattern has emerged around disruptors like cloud computing. Each disruptor is a natural evolution from the prior disruption.

Figure 2. Evolution of IT industry disruptors

When the shift away from client-server towards the web began, the industry witnessed a significant transition towards delivering web-based applications that run in the browser, and the content delivered became much richer. As a result, we needed to define standards for how we access data, especially as it relates to the systems of record and mainframe systems that have historically maintained this data in the enterprise.

SOA then became the prescribed manner for describing APIs and providing well-defined endpoints and data definitions (in the Web Services Description Language, or WSDL) for web services APIs. As Web 2.0 gained momentum, browsers spoke JSON much more naturally than XML, and defining services using nouns and verbs produced an even simpler model than web services.

This is where REST came in. Right around this time, mobile computing emerged, and the shift back towards client-server began, but in a much different form: the mobile device was gaining prominence as the application platform of choice. By using REST, we could benefit from investments made in prior disruptions.

With the movement to cloud, we see the same thing occurring. The Internet of Things is now further accelerating the number of devices in play. We need a model that allows us to quickly scale up and down based on the demand for our services. Using virtual environments in the cloud becomes a much more palatable model for running systems at scale.

Rationale behind our decisions

Let’s now dig into what it means to use a cloud-centric architecture and how we achieved its five main aspects of data transport, data format, security, APIs, and elasticity.

The decision of which network protocols to support is often open to debate. In our case, the choice was driven by application requirements, such as the types of clients that interact with our service for inbound data. To support the widest spectrum of sensor vendors, we chose the lowest common denominator: HTTP(S). While MQTT, WebSockets, and other protocols were available, most of the existing integrations we wanted to start with were HTTP-based, and HTTP gave us the farthest reach.

For data format, we chose JSON as the default. JSON has grown from its JavaScript Object Notation origins into a powerful multi-use format for interactions ranging from the browser to mobile to machine-to-machine communication. We felt that JSON was the easiest to prescribe, had the widest support among sensor vendors, and had the strongest set of libraries and tools for encoding and decoding payloads at scale and with high performance. JSON is also the format native to Node.js, our chosen technology for implementing the services.
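For example, a Wi-Fi sensor event might look like the following. This payload is hypothetical; field names vary by vendor and are illustrative only:

    {
      "deviceId": "aa:bb:cc:dd:ee:ff",
      "sensorId": "wifi-ap-42",
      "sensorType": "wifi",
      "timestamp": "2017-01-20T14:32:05Z",
      "rssi": -67,
      "site": "store-123"
    }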

Security and APIs are very much tied together. For our APIs to be easily consumable, we needed a low barrier to entry, so we grouped our APIs into two categories while adhering to the principles defined by REST. The first category was machine-to-machine APIs with no human interaction (sensors sending data to Presence Insights); for these, we embraced basic authentication. For the APIs more likely to involve human interaction (management APIs), we chose OAuth for the security model and used OAuth to drive our permissions model for the APIs. We decided against proprietary security models and focused on open, standards-based approaches to keep our APIs consumable, maintainable, and secure.
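A minimal sketch of that split, as hypothetical Express middleware (the real service’s credential and token checks were, of course, more involved):

    // Basic auth guards machine-to-machine sensor APIs; OAuth bearer
    // tokens guard the human-facing management APIs.
    var express = require('express');
    var app = express();

    function basicAuth(req, res, next) {
      var header = req.headers.authorization || '';
      var decoded = Buffer.from(header.split(' ')[1] || '', 'base64').toString();
      // Hypothetical check; the real service validated per-tenant credentials.
      if (decoded === 'tenant-key:tenant-secret') return next();
      res.status(401).end();
    }

    function oauth(req, res, next) {
      var token = (req.headers.authorization || '').replace('Bearer ', '');
      // Hypothetical check; the real service validated tokens and scopes
      // against the OAuth provider to drive the permissions model.
      if (token === 'valid-token') return next();
      res.status(401).end();
    }

    app.post('/sensors/events', basicAuth, function (req, res) { res.status(202).end(); });
    app.get('/management/sites', oauth, function (req, res) { res.json([]); });
    app.listen(3000);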

The final element is elasticity. Because we focused on building a highly scalable, multi-tenant set of APIs, we wanted to be able to quickly scale up and down on demand to meet the needs of our customers. To achieve this, we metered our APIs with metrics such as requests per instance and response times. By using these metrics to scale up and out as needed, and by following REST principles around eliminating state management in our APIs, we were able to scale quickly on demand.
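As a sketch of that kind of metering (Express middleware with hypothetical names; in practice the numbers would feed an autoscaling policy rather than a log):

    // Record per-instance request counts and response times: the raw
    // inputs for the scale-up and scale-out decisions described above.
    var express = require('express');
    var app = express();
    var requestCount = 0;

    app.use(function (req, res, next) {
      var start = Date.now();
      requestCount += 1;
      res.on('finish', function () {
        console.log('requests=%d %s %s %dms',
          requestCount, req.method, req.path, Date.now() - start);
      });
      next();
    });

    app.get('/ping', function (req, res) { res.send('ok'); });
    app.listen(3000);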

Patterns and best practices

Once we defined how we wanted to develop our set of services, we defined a set of patterns that were key to our success with Presence Insights. We first considered how to support sensor events at scale. We wanted a non-blocking solution for receiving events, which led us to event-based programming with Node.js and messaging. We focused on a “fire-and-forget” eventing system, leveraging our messaging background with MQ Light to drive events to a series of topics. By avoiding the blocking style common in languages like Java, we benefited from our Node.js runtimes without having to deal with concepts such as threading and waiting for threads to complete. The callback nature of Node.js was perfect for our architecture.

We also chose MQ Light, in a publish/subscribe model rather than HTTP, to communicate across our services. This was a simpler and faster way to scale up service processing in a loosely coupled manner.
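On the receiving side, a downstream service might subscribe as in the following sketch, again assuming the mqlight client; the topic pattern and share name are hypothetical:

    // Subscriber sketch: each service subscribes to the topics it cares
    // about and does one thing with every message it receives.
    var mqlight = require('mqlight');
    var client = mqlight.createClient({ service: 'amqp://localhost:5672' });

    client.on('started', function () {
      // A shared subscription ('analytics') lets multiple instances of this
      // service split the message load instead of each seeing every event.
      client.subscribe('sensor/+/events', 'analytics');
    });

    client.on('message', function (data, delivery) {
      // Hypothetical handler: index the event for analytics queries.
      console.log('indexing event from', delivery.message.topic, data);
    });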

When the question of persistence came up, we knew we wanted to support a variety of data formats and be able to quickly introduce new data formats on the fly. By moving to a NoSQL approach, we were able to quickly consume large amounts of unstructured data at scale while also maintaining a consistency model to support things such as querying for analytics. While there was definitely a need for structured data, we were able to define schemas for those use cases and leveraged Node.js validation scripts to ensure consistency in our data formats and to update documents as needed to migrate to new data models.
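As an illustration of those validation scripts, here is a minimal sketch with a hypothetical schema; the real scripts covered many more fields and also migrated older documents forward:

    // Check that a document matches the expected shape before writing it,
    // and stamp it with a schema version for later migrations.
    var schema = { deviceId: 'string', timestamp: 'string', rssi: 'number' };

    function validate(doc) {
      return Object.keys(schema).every(function (field) {
        return typeof doc[field] === schema[field];
      });
    }

    function prepare(doc) {
      if (!validate(doc)) throw new Error('document fails schema check');
      doc.schemaVersion = 2; // lets later scripts find and migrate old docs
      return doc;
    }

    console.log(prepare({ deviceId: 'aa:bb', timestamp: '2017-01-20T14:32:05Z', rssi: -67 }));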

The final piece of the puzzle is how to build a production-level service at scale. This involved dealing with failure, dependencies, and operations. In any distributed system, we have to assume from the outset that some piece is going to fail, whether through human error or machine failure. Expecting failure forced us to develop a much more resilient service.

We also limited dependencies. Introducing vast amounts of common code with large amounts of dependencies is a recipe for disaster. By embracing small composable services that have limited dependencies (but still used common core utilities), we had deeper insights into the running code and were able to monitor things such as memory and CPU usage much more closely.

Memory and CPU consumption can limit the scale of your service. Ensure that your services are stateless so they can be elastically scaled. This is probably the most important element of building out a cloud-centric architecture, and the most common failure seen in the industry, because application developers have relied on statefulness since the beginning of web programming.

Open source adoption

When we formed our development team, we wanted to build an environment where developers enjoyed working. This meant embracing open source technologies from a runtime perspective, and also from a development, build, and deployment perspective. Our end-to-end solution was driven by a combination of well-established technologies like JavaScript and newer tools favored by the broader IBM community like Git, Travis, and Jenkins.

To maximize our team resources, we embraced the “Platform as a Service” model. We leveraged as much of the Bluemix platform as possible so we could focus more energy on the architecture and implementation, and less on infrastructure management. As a result, we built a great platform using core foundational services such as Node.js (as a buildpack), MQ Light (transitioning to Kafka), Cloudant (as a NoSQL store), Elastic Search, and Redis (through the Compose.io acquisition). This focus on providing value to end users gave us a better understanding of what we needed from those underlying systems that wasn’t already there, and helped us decide whether to drive requirements down to the platform or do the hosting ourselves.

We were often asked why we chose Node.js. (Our choice pre-dated IBM’s 2015 acquisition of StrongLoop.) The answer was actually quite simple. We wanted to write code in a language that our developers loved and that had a rich selection of existing libraries that let us integrate open source technologies with ease. In fact, our use of Node.js across our entire development process is viewed as the #1 driver in our ability to deliver our solution to market in less than 6 months.

3 key lessons learned

We believe three decisions were vital to building our scalable Node.js architecture and determining the appropriate data strategy: managing shared modules, scoping services, and choosing data solutions.

When we first started developing in Node.js, we wanted to leverage a set of common node modules across all of our microservices to provide consistency for our development teams. These modules ranged from security authentication and authorization to logging, and everything in between, such as database and caching wrappers that abstract away some of the database interactions and improve debugging. While these packages were valuable to our development team, they were not candidates for the public npm registry, so we needed a solution for private repositories. As we continued to amass a large number of modules, we needed a way to build, deploy, and manage various versions of these libraries.

We found that an open source solution called Sinopia (see the sidebar) met our needs perfectly. It’s essentially a private Node Package Manager registry that can be deployed in the cloud. By hosting our own Sinopia server, we significantly reduced our deployment times to Bluemix, improved the versioning of our node modules, and deployed our service to the cloud in a highly secure and scalable manner. Learn more about deploying your own private Node Package Manager to Bluemix.

Giving back to the community
We’ve started to introduce new features to Sinopia, such as a CouchDB persistence tier to replace the default file-system-based storage, which did not lend itself to deployment in a clustered cloud environment where file systems are transient. At the time of writing this article, we are working to submit our updates to Sinopia via a pull request for the community to adopt.
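Pointing a microservice at a private registry like Sinopia is mostly configuration. A typical setup (hostname hypothetical) adds an .npmrc alongside each service so that npm installs resolve against the private server, which in turn proxies anything it doesn’t host to the public registry:

    # .npmrc checked in alongside each microservice (hostname is hypothetical)
    registry=https://sinopia.internal.example.com/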

Scoping services

One of the most difficult aspects of building a microservices architecture is determining the appropriate scope for a given service. In our definition of a microservice, each service does one thing and one thing well. But when we started to build out our management APIs, we found that having all of the management APIs in one microservice was not maintainable.

So we needed to decide how to break up the service. Because we built our APIs to be multi-tenant by design, the same API can be accessed by a tenant with a large configuration as well as a tenant with a small configuration. To test how to scale for both types of tenants, we had to evaluate what was optimal for our runtime. Does it make sense to deploy many Node.js applications, each with a small amount of memory, or fewer Node.js applications with more RAM? We ran similar experiments around processing inbound sensor events, where systems can be configured to send many small payloads at a rapid rate or to batch the payloads and send events at a longer interval.

We found that no single solution addresses all use cases. The best way to validate an approach is to capture metrics on the running system, then replay those events under various scaling policies and observe which one moves the data through the system with the lowest latency.

Choosing and testing data solutions

As mentioned, our business goal was to provide actionable insights based on sensor event data and to engage users in near real time via third-party integrations such as push notifications.

We found that not all data solutions are the same. It’s essential to leverage the right tool for the job. For example, some solutions optimize for queries, while others optimize for large numbers of writes. Let’s take a look at some of the data solutions we employed.

Cloudant

  • Tracks users (entry/exit/dwell) via Change Feed Listener
  • Has scalability issues with many Node.js instances of long-lived connections to Cloudant

MQ Light

  • Tracks users (entry/exit/dwell) via MQ topics
  • Lacks the ability to be notified when a topic key has expired

Redis

  • Tracks users (entry/exit/dwell) via publish/subscribe
  • Warlock (node-redis-warlock) is required for distributed locking (not native in Redis)

We started with Cloudant, since we were already persisting our sensor events there and wanted to exploit our scalable data store for tracking the movements of the devices that the sensors were detecting. We found that using the Change Feed Listener as a trigger, to be notified when a user moves around, does not scale well in a microservices architecture with many instances connecting to Cloudant.
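A sketch of that first approach, assuming the follow npm package for consuming the Cloudant/CouchDB _changes feed (the account URL and database name are hypothetical):

    // Changes-feed sketch: every instance running this holds a long-lived
    // connection to Cloudant, which is precisely what failed to scale.
    var follow = require('follow');

    follow({
      db: 'https://account.cloudant.com/sensor_events',
      include_docs: true,
      since: 'now'
    }, function (err, change) {
      if (err) return console.error(err);
      // Hypothetical handler: react to a device moving between zones.
      console.log('device update', change.id, change.doc && change.doc.zone);
    });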

We then moved to MQ Light and used topics to track device movements. The key missing element here was the ability to expire a key if we did not see a device for a period of time.

We then added Redis to our infrastructure to fill this particular gap, and later expanded our use of Redis to caching and real-time eventing.
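The expiry gap is exactly what Redis key TTLs and keyspace notifications cover. A minimal sketch with the redis Node.js client (key names and TTL hypothetical): refresh a per-device key on every sensor event, and treat the key’s expiry as the device leaving.

    // A device 'exits' when its key expires without being refreshed.
    var redis = require('redis');
    var client = redis.createClient();
    var sub = redis.createClient();

    // Enable expired-key events (can also be set in redis.conf).
    client.config('SET', 'notify-keyspace-events', 'Ex');

    sub.subscribe('__keyevent@0__:expired');
    sub.on('message', function (channel, key) {
      console.log('device exit detected:', key); // e.g. presence:aa:bb:cc
    });

    // Refresh on every sensor event; 5 minutes of silence means 'exit'.
    client.setex('presence:aa:bb:cc', 300, Date.now().toString());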

We learned that these sorts of experiments are not only inevitable but also extremely healthy and a sign of a growing cloud-centric architecture. In fact, having concrete data, as opposed to opinion and conjecture, helped us make architectural changes with more confidence.

Similar experiments around our data analytics led us to a mix of Cloudant, Elastic Search, and Spark for providing actionable insights for our product. As our main business focus was to provide actionable insights, we needed to be able to dissect and view data from multiple vantage points. These data sets needed to be highly available and distributed and have well-defined TTLs (Time To Live) and well-defined models for summarizing the data as it transitions towards historical data.

To reiterate, there is no single data solution that will solve all use cases in the cloud. For best results, consider a mix of technologies such as Cloudant, Redis, MongoDB, Cassandra, Spark, and Elastic Search.

Takeaways from our journey to microservices

We hope that you’ve found value in the lessons we’ve shared, and in the reasoning behind our technology decisions. If you follow these principles, you will likely be headed down the right path.

  • Develop a point of view on cloud-centric architectures. This is a moving target, and there are many ways to achieve success, so be sure to choose the right tool for the job!
  • Document the steps of your journey so others can benefit from the lessons you learned and the choices you made.
  • Adopt best-of-breed technologies. Open source solutions are great accelerators.
  • Create a vibrant technical team. Attracting and retaining top technical talent is priority #1. Be the team that other developers want to join!
  • Share lessons learned with the wider community through articles (like this one!), presentations, conferences, and social media.

Conclusion

Developers will continue improving (and disrupting) existing technologies. And that effort will continuously produce new tools, techniques, and open source solutions that can help you build and evolve your own cloud-centric architecture.

Originally published at www.ibm.com.
