Recently I had the chance to sit down with some of the other engineers at Praekelt.org to discuss their use of the Elixir programming language. I wanted to get a sense of how they were using Elixir to solve problems, so as to move beyond the theoretical advantages of the language and into the practical use-cases.
Historically, the engineering teams at Praekelt.org have used Python, and I have already written about some nice features of Elixir compared to Python.
As we move to higher-throughput messaging systems such as WhatsApp, greater scale, and more complex, distributed infrastructure, we’re starting to explore new technologies — such as Elixir — to solve the problems we have.
WhatsApp for Social Impact
WhatsApp for Social Impact is a new platform from Praekelt.org for engaging with users on WhatsApp. Currently in private beta, it provides a higher-level API for WhatsApp, integrates with technology such as RapidPro, and provides tools such as Natural Language Processing (NLP) to help interact with WhatsApp users.
Our initial work with WhatsApp used RapidPro (more on RapidPro later). The issue was that, at the time, different conversations were interleaved, so if multiple people were demoing our WhatsApp integration it became difficult to tell who was saying what. So the first problem was simply to create a user interface (UI) that displayed different conversations in different views.
Simon’s first iteration of this system still used Django, but only for the backend, with Django REST framework providing a RESTful API. While he got this system up and running quickly, he also quickly ran into performance problems: the interactive UI carried a lot of overhead because it needed to make many REST API calls to maintain its state.
It quickly became apparent that the REST API wasn’t going to work so well — specifically for a real-time messaging application — because it revolved around polling the backend for state updates.
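To make the cost of polling concrete, here is a minimal sketch in Python. It is not taken from the real application: the `Backend` class, its methods, and the `simulate` helper are all hypothetical, and the "network" is just in-memory function calls. It counts how many requests a polling client makes compared with how many callbacks a push-style client receives.

```python
class Backend:
    """Toy in-memory backend; every name here is hypothetical."""

    def __init__(self):
        self.messages = []        # messages in arrival order
        self.requests_served = 0  # how many polling requests were answered
        self.subscribers = []     # callbacks for push-style clients

    def fetch_since(self, index):
        """Polling endpoint: return any messages after position `index`."""
        self.requests_served += 1
        return self.messages[index:]

    def subscribe(self, callback):
        """Push endpoint: call `callback` whenever a message arrives."""
        self.subscribers.append(callback)

    def publish(self, message):
        self.messages.append(message)
        for callback in self.subscribers:
            callback(message)


def simulate(ticks, arrivals):
    """Run `ticks` UI refresh intervals; `arrivals` maps tick -> message."""
    backend = Backend()
    polled, pushed = [], []
    backend.subscribe(pushed.append)
    for tick in range(ticks):
        if tick in arrivals:
            backend.publish(arrivals[tick])
        # The polling client must ask on every tick, even if nothing changed.
        polled.extend(backend.fetch_since(len(polled)))
    return backend.requests_served, polled, pushed
```

With two messages arriving over ten refresh intervals, `simulate(10, {3: "hi", 7: "bye"})` reports ten polling requests, while the push client receives exactly two callbacks. Both clients end up with the same messages.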
His next approach was to try using Django Channels for WebSocket support. “It just felt like a less than ideal solution”, said Simon. He described how Django Channels 1.0 mixed some of Django’s synchronous logic with some of Twisted’s asynchronous logic. “It didn’t feel like a coherent stack.” The WebSocket support did allow him to stream messages in real-time without polling a REST API, but he didn’t go further with Django Channels. “It just didn’t feel native…it just felt disjointed.”
Finally, Simon decided to try Elixir. “I’ve always liked Erlang, but I’ve never written anything in it that got to production… Mostly because every time I’ve come back to Erlang, the syntax has made it feel like I’ve had to completely re-introduce myself to the language to get productive.” Simon found Elixir a lot easier to get into, particularly as he had written some production systems in Ruby and Ruby on Rails, so Elixir’s syntax felt familiar.
“The whole mismatch I saw with WebSockets with Django Channels… In Elixir and with the Phoenix framework it didn’t feel like a mismatch at all. It just felt native to the whole thing.” With the socket abstraction in Phoenix, it didn’t matter if a client was streaming over a WebSocket or polling an API, which made designing the application straightforward.
Simon also took the opportunity to switch over to GraphQL using Absinthe, in order to reduce the large number of API calls needed with a RESTful API. Initially this used an HTTP API, but he later discovered he could instead provide the GraphQL interface over WebSockets, and it ended up being only a one-line change in Phoenix. With GraphQL over a single connection, it became much easier to reason about which operations were mutations, so maintaining caches and the overall UI state became much simpler. As a result, the UI was snappier and more responsive.
Expanding RapidPro’s functionality
At Praekelt.org we have built several of our recent products around RapidPro. RapidPro is a tool developed by Nyaruka that we use to build messaging-based services. It allows us to define message flows, dashboards, and other functionality. While RapidPro is flexible, it is developed by a third party, and we eventually hit the limits of what we can do with it alone.
RapidPro’s answer to extra customisation is to implement features via webhooks. It can also be interacted with via a RESTful API. In particular, when we need to interact with third-party services for which RapidPro does not have an existing integration, we have to develop a “middleperson” between RapidPro and the third party in order to support communication between the two.
For the NurseConnect program, we were moving from our older “Seed” application stack to RapidPro and needed to do so without losing any functionality. An application called NurseConnect Companion was developed that filled the niche use cases that we weren’t able to achieve with RapidPro alone.
Two pieces of functionality were developed, the first of which was to handle opt-out messages from users. Opt-outs needed to be recorded in the South African National Department of Health’s data store, DHIS2.
When the app received a webhook from RapidPro, it would insert a row into a database and return. A background process (implemented with the Honeydew library) would then collect the information it needed and submit an HTTP request to DHIS2 to finalise the opt-out. Careful error handling was necessary to ensure all opt-outs were processed correctly.
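The shape of this first flow can be sketched as follows. This is a rough illustration in Python, not the Elixir and Honeydew code used in NurseConnect Companion. The table layout, the `msisdn` field, the status values, and the `submit` callable (a stand-in for the HTTP request to DHIS2) are all assumptions made for the example.

```python
import sqlite3


class SubmissionError(Exception):
    """Raised by the hypothetical `submit` callable when a request fails."""


def make_store():
    """Create an in-memory database with a hypothetical opt-out table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE opt_outs ("
        " id INTEGER PRIMARY KEY,"
        " msisdn TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def handle_webhook(db, payload):
    """Webhook handler: record the opt-out and return straight away."""
    with db:  # commits the insert
        db.execute("INSERT INTO opt_outs (msisdn) VALUES (?)", (payload["msisdn"],))
    return 202  # accepted; the real work happens in the background


def process_pending(db, submit, max_attempts=3):
    """Background job: try to submit every pending opt-out once."""
    rows = db.execute(
        "SELECT id, msisdn, attempts FROM opt_outs WHERE status = 'pending'"
    ).fetchall()
    for row_id, msisdn, attempts in rows:
        try:
            submit(msisdn)
            status = "done"
        except SubmissionError:
            # Leave the row pending for a later run until attempts run out.
            status = "failed" if attempts + 1 >= max_attempts else "pending"
        with db:
            db.execute(
                "UPDATE opt_outs SET status = ?, attempts = attempts + 1 WHERE id = ?",
                (status, row_id),
            )
```

Because the row is written before the webhook returns, a failed submission is not lost. The row stays pending and is picked up on the next run of `process_pending`.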
The second piece of functionality was to send WhatsApp templated messages. This time around, a different design was used that was closer to being a proxy between RapidPro and WhatsApp. Upon receiving a webhook, the app would send a request to the WhatsApp API, receive a response, and send that response back to RapidPro.
Rudi Giesler, one of the engineers for NurseConnect Companion, explained his decision to use Elixir. “What I wanted was an async system. The work it would be doing would be very async — it’s consuming a webhook and then making requests to other APIs, all of which can happen independently.” The equivalent Python async tooling at the time didn’t seem mature, and there was momentum in the organisation around Elixir, with other teams starting to use Elixir, particularly in conjunction with WhatsApp APIs.
It’s worth noting that in the past we would have designed a system like this much more like the first flow described above. We often use a synchronous Django application that hands off tasks to a background worker system, such as Celery. Designs like the second flow are difficult to build, as a typical Django deployment cannot handle many long-running connections before running out of resources. Django and the webserver tooling around it were designed for short requests, rather than long-running or streaming connections.
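For contrast, the second, proxy-style flow depends on holding many requests open cheaply while waiting on an upstream API. The sketch below uses Python's `asyncio` purely to illustrate that shape. It is not the Elixir implementation, and `fake_whatsapp_send`, the payload fields, and the delay are hypothetical stand-ins for a real HTTP call.

```python
import asyncio


async def fake_whatsapp_send(payload, delay=0.05):
    """Hypothetical stand-in for the HTTP request to the WhatsApp API."""
    await asyncio.sleep(delay)  # simulates waiting on the network
    return {"status": "sent", "to": payload["to"]}


async def handle_webhook(payload):
    """Proxy-style handler: keep the incoming request open until the
    upstream call returns, then pass the response back to the caller."""
    return await fake_whatsapp_send(payload)


async def serve_many(payloads):
    """Handle all webhooks concurrently; results keep the input order."""
    return await asyncio.gather(*(handle_webhook(p) for p in payloads))


def run_demo(n=200):
    payloads = [{"to": f"user-{i}"} for i in range(n)]
    return asyncio.run(serve_many(payloads))
```

Each waiting handler costs only a suspended coroutine here, much as each connection costs only a lightweight process in Elixir. A pool of synchronous workers would instead tie up one worker per open request.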
Dynamic load-balancer configuration
Relay ties several bits of infrastructure together in order to configure our load-balancers dynamically.
Relay is an Envoy Discovery Service. What this means is that it provides the configuration (where and how to route requests) to the Envoy load-balancers. This configuration is generated based on the state of containers in the cluster, which is sourced from the container orchestrator, Marathon. Additionally, Let’s Encrypt certificates are stored in Vault by our tool, marathon-acme, and must be provided to Envoy so that it can serve HTTPS.
Relay must do several things at once and support various asynchronous protocols:
- Provide configuration to each Envoy instance via bidirectional gRPC streams.
- Accept reload requests from marathon-acme and then fetch certificates from Vault.
- Periodically sync with Vault in case marathon-acme reload requests are missed.
- Watch events for changes from Marathon via a server-sent events (SSE) stream.
- Periodically sync with Marathon in case events are missed.
Elixir’s actor-based concurrency model makes it easy to have several independent components that each have a single responsibility (watching for events, streaming Envoy configuration, collating routes with certificates, etc.) and communicate only with the other components they need to know about.
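As a rough illustration of two of these responsibilities, the Python sketch below keeps a routing table up to date from incremental events, repairs it with a periodic full sync, and collates routes with certificates. It is not Relay's code. The event shapes, field names, and the `RouteTable` class are assumptions for the example, and the real Marathon, Vault, and Envoy formats differ.

```python
from dataclasses import dataclass, field


@dataclass
class RouteTable:
    """Hypothetical in-memory view of what the load-balancer should route."""

    backends: dict = field(default_factory=dict)  # domain -> ["host:port", ...]
    certs: dict = field(default_factory=dict)     # domain -> certificate text

    def apply_event(self, event):
        """Incremental update from the event stream (assumed event shape)."""
        domain, address = event["domain"], event["address"]
        if event["type"] == "up":
            addresses = self.backends.setdefault(domain, [])
            if address not in addresses:
                addresses.append(address)
        elif event["type"] == "down":
            addresses = self.backends.get(domain, [])
            if address in addresses:
                addresses.remove(address)
            if not addresses:
                self.backends.pop(domain, None)

    def full_sync(self, snapshot):
        """Periodic sync: replace our view with a full snapshot, which
        repairs any drift caused by missed events."""
        self.backends = {d: list(addrs) for d, addrs in snapshot.items()}

    def config(self):
        """Collate routes with certificates; TLS only where a cert exists."""
        return {
            domain: {"backends": sorted(addrs), "tls": domain in self.certs}
            for domain, addrs in sorted(self.backends.items())
        }
```

If a "down" event is missed, the table keeps routing to a backend that no longer exists until the next `full_sync`. This is why the list above pairs each event stream with a periodic sync.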
Jeremy Thurgood, an SRE working on Relay, emphasised the importance of the single-responsibility principle (SRP) when building Relay. “The component that watches for Marathon events doesn’t know or care about certificate stuff…or whatever else, which makes it simpler and more reliable. Things that don’t interact with each other can’t get in each other’s way.”
He also noted the advantages of Supervisor-based error-handling. “You don’t need to handle invalid messages, timeouts, precondition failures, etc. specifically everywhere, you just assert that things are in the state you expect and crash if they aren’t. Anything unexpected causes a crash, any crash causes a restart of the necessary subsystems, and the Supervisor applies whatever threshold or backoff rules might be necessary for the things it’s supervising.” An important part of how this works is that data is immutable in Elixir and so can’t be corrupted during a restart.
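The "crash and restart" idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how OTP is implemented. A real Supervisor runs processes concurrently and counts restarts within a time window, whereas this hypothetical `supervise` function runs one worker at a time and counts restarts only.

```python
class TooManyRestarts(Exception):
    """Raised when the worker keeps crashing past the restart threshold."""


def supervise(start_worker, max_restarts=3):
    """Run `start_worker`, restarting it from scratch whenever it crashes.

    Each restart calls `start_worker` again, so the worker begins with
    fresh state rather than whatever state it crashed with.
    """
    restarts = 0
    while True:
        try:
            return start_worker()
        except Exception as exc:
            if restarts >= max_restarts:
                raise TooManyRestarts(f"gave up after {restarts} restarts") from exc
            restarts += 1


def make_flaky_worker(failures):
    """Build a hypothetical worker that crashes `failures` times, then succeeds."""
    remaining = {"count": failures}

    def worker():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            # No recovery logic here: the worker simply crashes.
            raise RuntimeError("unexpected state")
        return "ok"

    return worker
```

For example, `supervise(make_flaky_worker(2))` returns `"ok"` after two restarts, while a worker that always crashes eventually raises `TooManyRestarts`. The worker itself contains no recovery code, since deciding when to retry and when to give up is left entirely to the supervisor.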
You’ll notice that marathon-acme was written in Python and handles some similar tasks to Relay. marathon-acme uses the Twisted library for asynchronous networking operations — so it can do things like stream events from Marathon while also issuing and renewing certificates.
We have also used Twisted for a lot of messaging applications, for example Junebug. Twisted’s event-loop concurrency model is certainly scalable, but it is often difficult to reason about, and error-handling can be challenging. Until recently, Python didn’t have native syntax support for asynchronous operations, leading to various hacks to the language in an attempt to avoid “callback hell”. In Elixir, asynchronicity is natural as it has always been part of the language.
The maturity of Elixir
All the engineers I interviewed were satisfied with the tools that Elixir provided out-of-the-box. Mix provides a useful set of tools and everybody seemed happy with the package management system.
Engineers were also pleased with Phoenix and how easy and fast it was to develop simple and interactive web apps. Rudi did feel there were a few things he was writing from scratch that would already have been available to him in Django. “Auth[orization] was quite a pain to get running and if I needed to do something more complicated with users and groups and roles and specific permissions… I feel like I would basically be writing that from scratch”, he said.
In some areas engineers were unsure of how best to use the available tooling. In particular, we use a container orchestration platform to deploy our software and some people were unsure of how best to use Docker with Elixir. Phoenix relies heavily on Mix for tasks such as database migrations, which doesn’t necessarily play nicely with deployment tools like Distillery.
The Erlang VM provides tools like hot code swapping, but how that should be used in conjunction with container orchestration is unclear. We’re more likely to use deployment patterns that are platform-agnostic and that we already have tooling for, like blue-green deployments. Similarly, Elixir supports running distributed tasks that communicate over the network, but we haven’t explored this option yet. We are more likely to use common protocols we know well, like HTTP APIs.
Again, the message from this blog post is not that you should drop everything and write all your software in Elixir. We find that Elixir works well for particular kinds of problems — messaging, interactive UIs, and multi-tasking infrastructure tools.
We still have much less experience with Elixir than with Python, and we’d like to see some of Elixir’s tooling refined to work more naturally with the container-based or “cloud native” tooling that a lot of organisations are now adopting. We’re excited to see how the community will improve popular projects like the Phoenix framework so that they can be easily used for more complex systems.