2017 was an important year for DevOps as the number of ecosystem players grew substantially and CNCF projects tripled. Looking to the year ahead, we expect innovation and market changes to accelerate even further. Below we’ve compiled our thoughts on 2018 microservices trends: service meshes, event-driven architectures, container-native security, GraphQL, and chaos engineering.
We’ll be watching these trends and the companies that build businesses around them in the coming year. What trends are you seeing? Comment below to let us know what we’re missing or if you agree/disagree with the ones we’ve outlined here.
1. Service meshes are hot!
Service meshes, a dedicated infrastructure layer for improving service-to-service communication, are currently the most buzzed-about cloud-native category. As containers become more prevalent, service topologies have grown increasingly dynamic, requiring improved network functionality. Service meshes can help manage traffic through service discovery, routing, load balancing, health checking, and observability. In short, service meshes attempt to tame unruly container complexity.
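To make those functions concrete, here is a minimal sketch of the service discovery, health checking, and round-robin load balancing a mesh data plane performs on behalf of an application. The registry, addresses, and service names are all hypothetical; real meshes like Linkerd or Istio discover and update this state dynamically.

```python
import itertools

# Hypothetical service registry mapping names to instance addresses.
REGISTRY = {
    "payments": ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"],
}

# Instances that (hypothetical) health checks have marked unhealthy.
UNHEALTHY = {"10.0.0.2:8080"}

def healthy_instances(service):
    """Service discovery plus health filtering, as a data plane does."""
    return [addr for addr in REGISTRY[service] if addr not in UNHEALTHY]

def round_robin(service):
    """Round-robin load balancing across the healthy instances."""
    return itertools.cycle(healthy_instances(service))

lb = round_robin("payments")
picks = [next(lb) for _ in range(4)]
# Requests cycle through the two healthy instances, skipping the bad one.
```

The value of a mesh is that this logic lives in the infrastructure layer rather than being reimplemented inside every service.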
It’s clear service meshes are growing in popularity as load balancers like HAProxy, traefik, and NGINX have started repositioning themselves as data planes. We haven’t seen widespread deployment yet, but we do know of businesses running service meshes in production. Moreover, service meshes are not exclusive to microservices or Kubernetes environments and can be applied to VM and serverless environments as well. For example, the National Center for Biotechnology Information isn’t running containers, but it is using Linkerd.
Service meshes could also be leveraged for chaos engineering, “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions.” Instead of installing a daemon that runs on every host, a service mesh could inject latency and failures into the environment.
Istio and Buoyant’s Linkerd are the most high-profile offerings in the space. Note that Buoyant released Conduit v0.1, an open source service mesh for Kubernetes, in December.
2. Rise of event-driven architectures.
As the need for business agility increases, we’ve started seeing a movement toward a “push,” or event-based, architecture: one service emits an event, and one or more observer containers watching for that event respond by running logic asynchronously, without the producer ever being aware of them. Unlike request-response architectures, in event-driven systems the functional process and transaction load of the initiating container do not depend on the availability and completion of remote processes in downstream containers. An added advantage is that developers can be more independent when designing their respective services.
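A toy in-process event bus illustrates this decoupling; the event name and handlers are hypothetical, standing in for separate observer containers.

```python
from collections import defaultdict

# Minimal publish/subscribe bus: producers never reference consumers.
subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    # The producer fires and forgets: it does not know who is listening
    # and does not depend on any consumer completing.
    for handler in subscribers[event_type]:
        handler(payload)

log = []
subscribe("order.created", lambda e: log.append(("audit", e["id"])))
subscribe("order.created", lambda e: log.append(("email", e["id"])))

publish("order.created", {"id": 42})
# Both observers ran their own logic; the producer named neither of them.
```

In a production system the bus would be a broker such as Kafka and each handler a separate service, but the decoupling property is the same.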
While developers can architect container environments to be event-driven, Function-as-a-Service (FaaS) inherently embodies this quality. In FaaS architectures a function is stored as text in a database and is triggered by an event. Once the function is called an API controller receives the message and sends it through a load balancer to the message bus, which queues it up to be scheduled and provisioned to an invoker container. After the execution, the result is stored in a database, the user is sent the result, and the function is decommissioned until triggered again.
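The lifecycle described above can be sketched in a few lines. Everything here is a stand-in: the queue plays the message bus, a dict plays the results database, and `eval` of stored source is purely illustrative of “a function stored as text,” not how a real FaaS platform compiles or sandboxes functions.

```python
import queue

invocation_queue = queue.Queue()   # stands in for the message bus
results_db = {}                    # stands in for the results database

# Function source stored as text, per the description above (hypothetical).
FUNCTIONS = {
    "double": "lambda x: x * 2",
}

def enqueue(event):
    """API controller: accept the triggering event and queue it."""
    invocation_queue.put(event)

def invoker():
    """Invoker container: pull an event, run the function, store the result."""
    event = invocation_queue.get()
    fn = eval(FUNCTIONS[event["function"]])  # illustrative only
    results_db[event["id"]] = fn(event["payload"])

enqueue({"id": "req-1", "function": "double", "payload": 21})
invoker()
# results_db["req-1"] now holds 42; the invoker is then free to be
# decommissioned until the next trigger.
```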
Benefits of FaaS include 1) shortened time from writing code to running a service because there is no artifact to create or push beyond the source code and 2) decreased overhead as functions are managed and scaled by FaaS platforms like AWS Lambda. However, FaaS is not without its challenges. Since FaaS requires the decoupling of each piece of a service there can be a proliferation of functions which can be hard to discover, manage, orchestrate, and monitor. Finally, without comprehensive visibility including dependencies, it is difficult to debug FaaS systems and infinite loops could emerge.
Currently, FaaS is a poor fit for processes that require longer invocations, a large amount of data loaded into memory, and consistent performance. While developers are using FaaS for background jobs and temporal events, we believe the use cases will expand over time as the storage layer accelerates and platforms become more performant.
In Fall 2017, the Cloud Native Computing Foundation (CNCF) surveyed over 550 people, of which 31% use serverless technologies and 28% planned to use serverless in the next 18 months. The survey followed up by asking which specific serverless platform is being used. Of the 169 that use serverless technology, 77% said they used AWS Lambda. While Lambda may be leading serverless platforms, we believe there could be interesting opportunities at the edge. Edge compute will be especially powerful for the IoT and AR/VR use cases.
3. Security needs are changing.
Applications packaged in containers are fundamentally easier to secure because of the visibility they offer into the kernel. In VM environments the only point of visibility is the virtual device driver; in a container environment, the OS exposes syscalls and semantic meaning. This is a much richer signal. Previously, operators could have achieved some of this signal by dropping an agent into a VM, but that was complex and a lot to manage. Containers offer cleaner visibility, and integration in a container environment is trivial compared to a VM environment.
With this in mind, a 451 Research survey still found that security was the biggest hurdle to container adoption, so challenges persist. Initially, vulnerabilities were the main security concern in container environments. As the number of ready-to-use container images in public registries multiplied, it became important to ensure those images were free of vulnerabilities as well. Over time, image scanning and authentication have become a commodity.
Unlike virtualized environments, where a hypervisor serves as a point of access and control, any container that gains root access to the kernel ultimately has access to every other container on that kernel. In turn, organizations must secure how containers interact with the host and which containers may perform certain actions or system calls. Hardening the host to ensure that cgroups and namespaces are appropriately configured is also important for maintaining security.
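Restricting which containers may perform which system calls amounts to an allowlist check, in the spirit of a seccomp profile where anything not explicitly allowed is denied. A minimal sketch, with invented container names and policies:

```python
# Hypothetical per-container syscall allowlists; anything absent is denied.
POLICIES = {
    "web-frontend": {"read", "write", "socket", "connect"},
    "batch-job":    {"read", "write"},
}

def allowed(container, syscall):
    """Deny by default: only explicitly listed syscalls pass."""
    return syscall in POLICIES.get(container, set())

# Check two attempted calls against policy.
attempts = [("web-frontend", "connect"), ("batch-job", "ptrace")]
violations = [(c, s) for c, s in attempts if not allowed(c, s)]
# The batch job attempting ptrace is flagged; the frontend's connect is not.
```

Real enforcement happens in the kernel via seccomp-bpf or an orchestrator's security context, but the policy model is the same deny-by-default shape.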
Finally, traditional firewalls rely on IP address rules to allow network flows. This technique isn’t extensible to container environments because dynamic orchestrators reuse IPs. Runtime threat detection and response is crucial for production environments and achieved by fingerprinting the container environment and building a detailed picture for a behavioral baseline so it is easy to detect anomalous behavior and sandbox the attacker. A 451 Research report noted that 52% of companies surveyed are running containers in production, suggesting an acceleration of businesses adopting container-native runtime threat detection solutions.
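The behavioral-baseline idea can be sketched simply: fingerprint what a container normally does, then flag activity outside that baseline. The process names and counts below are hypothetical.

```python
from collections import Counter

# Hypothetical fingerprint: processes observed during a baselining period.
baseline = Counter({"nginx": 120, "sh": 2})

def anomalous(observed, baseline, min_seen=1):
    """Flag processes never seen (or barely seen) during baselining."""
    return [p for p in observed if baseline.get(p, 0) < min_seen]

# At runtime a crypto-miner process appears; it was never in the baseline.
alerts = anomalous(["nginx", "xmrig"], baseline)
# Only the unfamiliar process is surfaced for response or sandboxing.
```

Production tools build far richer baselines (syscalls, network flows, file access), but the detection principle is this comparison against expected behavior rather than against static IP rules.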
4. Moving to GraphQL from REST.
Created by Facebook in 2012 and open sourced in 2015, GraphQL is an API specification comprising a query language and a runtime for fulfilling queries. The GraphQL type system allows developers to define data schemas. New fields can be added, and old fields can be aged out, without affecting existing queries or restructuring the client application. GraphQL is powerful because it isn’t tied to a specific database or storage engine.
The GraphQL server operates as a single HTTP endpoint that expresses the full set of capabilities of the service. By defining relationships between resources in terms of types and fields (not endpoints, as REST does), GraphQL can follow references between properties, so services can receive data from multiple resources using a single query. REST APIs, by contrast, require loading multiple URLs for a single logical request, increasing network hops and slowing down the query. With fewer roundtrips, GraphQL decreases the resources required for each data request. The data returned is typically formatted as JSON.
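A toy resolver shows what “following references in a single query” means. The schema and data are invented, and a real GraphQL server would use a proper parser and type system; this only illustrates why one nested selection replaces several REST calls.

```python
# Invented data with a reference between two resources.
USERS = {1: {"name": "Ada", "post_ids": [10, 11]}}
POSTS = {10: {"title": "Intro"}, 11: {"title": "Follow-up"}}

def resolve_user(user_id, selection):
    """Walk a nested selection in one pass, following references,
    instead of issuing one request per resource as REST would."""
    user = USERS[user_id]
    result = {}
    for field in selection:
        if field == "posts":
            # Follow the reference and resolve the nested resource inline.
            result["posts"] = [
                {"title": POSTS[pid]["title"]} for pid in user["post_ids"]
            ]
        else:
            result[field] = user[field]
    return result

data = resolve_user(1, ["name", "posts"])
# One query returns the user and their posts together, shaped like
# the selection itself.
```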
There are additional benefits to using GraphQL over REST. First, clients and servers are decoupled, so they can be maintained separately. Unlike REST, clients and servers communicate in the same query language, so debugging is easier. The shape of a query fully matches the shape of the data fetched from the server, making GraphQL highly efficient and effective compared to other query languages such as SQL or Gremlin. Because queries reflect the shape of their responses, deviations can be detected and fields that aren’t resolving correctly can be identified. Since queries are simpler, the entire process is more stable. The specification is best known for supporting external APIs, but we find it’s being utilized for internal APIs as well.
GraphQL users include Amplitude, Credit Karma, KLM, NY Times, Twitch, Yelp, etc. In November, Amazon validated the popularity of GraphQL by launching AWS AppSync which included GraphQL support. It will be interesting to watch how GraphQL evolves in the context of gRPC and alternatives like Twitch’s Twirp RPC framework.
5. Chaos engineering becomes more well-known.
Popularized initially by Netflix, and later practiced by Amazon, Google, Microsoft, and Facebook, chaos engineering experiments on a system to improve confidence in its ability to withstand production issues. Chaos engineering has evolved over the past ten years: it started with Chaos Monkey, which turned off services in production environments, and expanded in scale with Failure Injection Testing (FIT) and Chaos Kong for larger environments.
Superficially it appears chaos engineering is just about injecting turmoil. While breaking systems can be fun, it may not always be productive or provide useful information. Chaos engineering embodies a broader scope of not just injecting failures but also other symptoms like traffic spikes, unusual request combinations, etc. to discover existing issues. Beyond verifying assumptions, it should also surface new properties of the system. By unearthing system weaknesses teams can help improve resiliency and prevent poor customer experiences.
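The basic mechanism is small enough to sketch: wrap a call so that some fraction of invocations fail, then observe whether callers degrade gracefully. The function names and failure rate below are illustrative, and a real experiment would inject faults at the proxy or mesh layer rather than in application code.

```python
import random

def chaotic(fn, failure_rate, rng):
    """Wrap a call so a fraction of invocations raise, as a
    fault-injection experiment might do at the data-plane layer."""
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure")
        return fn(*args)
    return wrapped

rng = random.Random(0)                  # seeded so the experiment repeats
flaky_fetch = chaotic(lambda x: x, 0.3, rng)

outcomes = []
for i in range(10):
    try:
        outcomes.append(flaky_fetch(i))
    except RuntimeError:
        outcomes.append("failed")       # caller handles the injected fault
```

The point of the experiment is the handler branch: verifying the system's behavior under failure, not merely causing failure.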
Newer technologies like neural networks and deep learning are so complex that determining how something works may become less important than proving that it works. Chaos engineering helps with this challenge by testing the system holistically to identify instability. It’s likely to become an even more accepted practice as engineers work to make their increasingly convoluted systems more robust.
As chaos engineering becomes more mainstream, it could take the form of existing open source projects, commercial offerings, or, as mentioned above, implementations built on a service mesh.