Technology Adoption at Castlight

Robert Stewart
apree health (Castlight) Engineering
Dec 13, 2021
Technology Adoption and Migration at Caltrans (photo by Jitze Couperus)

I was recently rereading a very interesting blog post on technology adoption at Slack and was struck by how many similar patterns I’ve observed at Castlight. Two core ideas I took away from that post are that tech adoption frequently proceeds in three phases (exploration, expansion, and migration) and that understanding the characteristics of each phase can greatly increase the rate of successful adoption. Not every proposal is expected to succeed, of course, but too often good ideas fail because of how they are rolled out.

Angular

Castlight’s original web application came to life in 2010 as a monolithic Ruby on Rails app. We’ve since evolved it from Rails to Angular, initially through a standalone experiment that convinced us to begin a full migration.

Using the terminology from the Slack blog post, we entered Phase 1 in 2013 when we chose AngularJS for a new front end to support a few health plans. The requirements for the new front end were different enough from how we had built our Rails app that it would have been very hard to weave the new UI into the existing codebase. Another factor in our decision was that we had already needed to move our search and pricing backend from Rails to Java for performance reasons. We initially had a lot of success moving business logic into Java but struggled with the impedance mismatch between Rails and Java services. Exploring AngularJS also gave us an opportunity to experiment with a new front-end technology. Although that business venture didn’t work out, we learned a lot about Angular and confirmed that it was a promising technology for us.

Phase 2 began a few years later when we needed to build another app for a health plan. That effort was much more successful, partly due to the maturing of AngularJS. Soon after, we started to migrate a few pages from our main Rails app to AngularJS. Our developers immediately saw the benefit, and our use of Angular expanded quickly, especially after Angular 2 arrived to replace AngularJS.

We’re now nearing the final stage of Phase 3, with only a few administrative pages in the web app remaining to be migrated, and we’re already starting to experiment with Phase 1 of the next major front-end technology change.

Kafka

A software architect on our team was an early fan of Apache Kafka and was in search of a good use case that would justify the effort to introduce this relatively complex technology into our stack. He saw the promise but knew it wouldn’t take hold unless there was also clear value.

In Phase 1 we started a POC to switch an analytics event service from HornetQ to Kafka. It was initially a side project for a few other engineers with minimal infrastructure experience, so it struggled to gain traction. Once our Infrastructure team picked it up, a new implementation on Kafka was completed very quickly. We made a smooth transition in production in early 2018, and it has worked like a charm ever since. This use case was great for the exploration phase: it needed to handle very high throughput but required only a simple schema and a single Kafka topic, while also paving the way for the removal of HornetQ from our tech stack.
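
For readers less familiar with Kafka, the exploration-phase shape of that service is easy to picture: a producer writing events to a single topic, tuned for throughput. The sketch below uses the plain Java client; the topic name, keying scheme, and settings are illustrative assumptions rather than our actual configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AnalyticsEventPublisher {
    private final KafkaProducer<String, String> producer;

    public AnalyticsEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Favor throughput: let the client batch records briefly and compress each batch.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String userId, String eventJson) {
        // Single hypothetical topic, keyed by user so a user's events stay ordered within a partition.
        producer.send(new ProducerRecord<>("analytics-events", userId, eventJson));
    }

    public void close() {
        producer.close();
    }
}
```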

At around the same time, that same engineer was leading a complete rebuild of our ecosystem partner integrations in a far more reliable and scalable way than the original implementation. Since our partner integrations often involve a lot of asynchronous data processing, this was a perfect opportunity to utilize Kafka at scale on a more complex implementation. The migration to Kafka also enabled far greater fault tolerance and traceability, which is mission-critical when you are supporting integrations with over one hundred vendors. That effort has also been a massive success.
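
To illustrate the fault-tolerance point (this is a hedged sketch, not our integration code), an asynchronous worker that commits offsets only after it finishes processing will replay unfinished work after a crash instead of silently dropping it. The topic and consumer group names here are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PartnerFeedWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "partner-feed-workers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit: offsets are committed only after records are fully processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("partner-feed-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.key(), record.value());
                }
                // A crash before this point means the batch is re-delivered, not lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(String partnerId, String payload) {
        // Placeholder for the real integration logic.
        System.out.println(partnerId + " -> " + payload.length() + " bytes");
    }
}
```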

In Phase 2 we expanded our usage of Kafka to wellbeing challenges and to our next-generation segmentation framework, which Jordan Bragg recently blogged about. The segmentation framework makes sophisticated use of Kafka Connect and Kafka Streams. Its first requirement was to provide an abstraction over an earlier framework that was conceptually good but extremely difficult to maintain and scale, and with the new framework in place we were able to quickly phase out the old system. This effort forced us to greatly deepen our knowledge of Kafka and quickly made us realize the importance of automation. We rolled out Cruise Control for ops automation and Prometheus and Burrow for monitoring, which made it much easier to continue scaling out our usage of Kafka.
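
Jordan’s post covers the real design, so I’ll only sketch a bare-bones Kafka Streams topology to give a feel for the building blocks involved. The topic names and the attribute being filtered on are purely illustrative, not part of our framework.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SegmentationTopologySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "segmentation-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical topics: user attribute changes come in, segment membership events go out.
        KStream<String, String> attributeChanges = builder.stream("user-attribute-changes");
        attributeChanges
            // Naive string check for brevity; real code would deserialize and evaluate segment rules.
            .filter((userId, attributes) -> attributes.contains("\"challengeEligible\":true"))
            .mapValues(attributes -> "{\"segment\":\"wellbeing-challenge\"}")
            .to("segment-membership-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```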

The clear indication that we’ve been in Phase 3 for the last year is that it is a non-event when I learn that yet another team has successfully rolled out a feature using Kafka for data stream processing.

Standard Microservices Framework and Kubernetes

As mentioned above, we began extracting functionality from our monolithic Rails app into Java services in early 2011. At that time, Service-Oriented Architecture (SOA) was the dominant approach to building services, and what is now widely known as microservices was only beginning to emerge. In addition, many of the open-source libraries for key parts of a microservice, e.g., MVC, metrics collection, etc., were relatively immature, so we built our own implementations to meet our needs. Over the following years, the microservice architecture proved itself to be a clear improvement and the relevant open-source libraries matured significantly. We were unfortunately a few years ahead of these trends.

In early 2017, while revisiting our services architecture, we were also looking to move our service deployments to Kubernetes. Our previous approach relied on vertical scaling as much as horizontal scaling: we ran many services on multiple beefy servers with many cores, taking advantage of the extremely powerful concurrency capabilities of the JVM and JDK. While this worked quite well, it was not a good fit for scaling on Kubernetes. Since refactoring some of the services for k8s would require significant changes, we decided to combine these efforts, starting by building a few new services using what we called the Standard Model for deployment on k8s.

Castlight’s Standard Model is our official set of standard libraries, frameworks, and techniques for developing and managing microservices. In hindsight, I think we struck just the right level of “opinionated” in our choices. Our Standard Model specifies core foundational components like Spring Boot, Spring MVC, and Spring Cloud, as well as technologies or approved approaches for things like build tooling, request tracing, health checks, cache configuration, project structure, debugging, error handling, security, database schema migrations, code coverage, and more. Our team plans to cover the Standard Model in much more detail in a future blog post.
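
We’ll save the details for that post, but to give a rough, illustrative feel (not our actual code), the skeleton of a Standard Model-style service is essentially a Spring Boot application with Spring MVC endpoints, with Spring Boot Actuator on the classpath exposing health endpoints that Kubernetes probes can hit. The names below are made up and assume the standard Spring starters.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Minimal sketch: Spring Boot + Spring MVC. With spring-boot-starter-actuator on the
// classpath, /actuator/health is exposed automatically for liveness/readiness probes.
@SpringBootApplication
@RestController
public class SampleStandardService {

    public static void main(String[] args) {
        SpringApplication.run(SampleStandardService.class, args);
    }

    // A simple MVC endpoint; a real service adds tracing, security, error handling, etc.
    @GetMapping("/api/status")
    public String status() {
        return "ok";
    }
}
```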

Although most services were migrated by the end of 2020, that left a long tail of older services that would require more work. By that time, the benefits of on-demand deploys and easier scalability made the final migration effort overwhelmingly worth taking on. While we’ve not done full Standard Model rewrites of the oldest, largest services, we’ve significantly refactored them to better take advantage of distributed caching, thus better fitting the Kubernetes model with smaller services that are faster to start up. To quote the Slack article, “Even very successful projects might not migrate every last use case to the new way of doing things.”
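
As a hedged illustration of that kind of refactoring (assuming a Spring CacheManager backed by a distributed store such as Redis; the class, method, and cache names are made up), moving a hot lookup behind a shared cache lets each instance stay small and stateless:

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// Requires caching to be enabled (@EnableCaching) and a distributed CacheManager configured.
@Service
public class ProviderDirectoryService {

    // Any instance can serve a cached entry, so pods can stay small and scale horizontally
    // instead of relying on one large, warmed-up process.
    @Cacheable(cacheNames = "provider-profiles", key = "#providerId")
    public String lookupProviderProfile(String providerId) {
        // The expensive call to the source of record runs only on a cache miss.
        return fetchFromDatabase(providerId);
    }

    private String fetchFromDatabase(String providerId) {
        // Placeholder for the real data access.
        return "{\"id\":\"" + providerId + "\"}";
    }
}
```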

Explore Versus Exploit

When evolving the infrastructure of large systems, the balance between exploring and exploiting is constantly shifting, often in different directions in different parts of the infrastructure. While this can sometimes be stomach-churning for application teams focused on releasing features on committed deadlines, it is very healthy for an organization overall. An over-emphasis on exploration can result in counterproductive technological diversity and a brittle infrastructure. An over-emphasis on exploitation can result in a stagnant infrastructure that becomes a major drag on velocity and forces other teams to create shadow infrastructure. At Castlight, we’ve walked this fine line by allowing some systems to experiment more aggressively than others, paving the way for innovation and letting us make sweeping infrastructure improvements reliably over time.

The optimal rate of technology adoption for an organization will vary based on many factors specific to that organization. However, there are some common approaches that I think most teams would benefit from considering. Hopefully, these examples from our experience are a useful complement to the original article.
