Elixir |> GenServers and a real example

I know, I know… So many posts about this. But what if we think of a real example? Let's check out how we used GenServer in production to track the level of "interest" of a conversation!

The truth is that GenServer is such a cool concept that I wanted to use it from the first time I saw it, but I never had the chance to apply it in a real project until very recently. So before I show you how I used it, let's first dig into why I used it, and why (I think) it was the right use case.

Basically what we had was a chat application that involved groups. Groups were formed by multiple people and they could chat freely once they were part of the group. We wanted to know if the conversation was overall going "well", without looking at the messages. Also potentially we could have many groups, so it was useful knowing how many were doing well, for instance.
Not only that, but we also wanted to periodically send updates to our clients about these conversations, and capture certain events. Now there are many ways to do this, but we were out for a simple MVP and nothing fancy, so we thought:

  1. Why not count the number of messages? That is a poor indicator, because a conversation has value when there is "back and forth" between people. We could have a conversation with many messages but just one or two people participating.
  2. How about how long the conversation lasted? Also not a good indicator, since it has some of the same problems as #1. Besides, we wanted to see in the short term, right after the group was created, whether it sparked a conversation.

Cool! So our idea was fairly simple: let's just assign a score to each message sent, but with some constraints:

  1. Each message will have an initial score value (constant).
  2. Subsequent messages from the same person will have less value, unless someone else sent a message in between. So, if one person "spams" the conversation, the value of each of their messages will go down to near 0.
  3. The scores of the messages contribute to an overall "group score", which represents the conversation score.
  4. The conversation score should naturally go down over time, so that if no one talks, it will eventually come close to 0.
  5. If the conversation is very active and it surpasses a threshold, we want to trigger an event!
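
To make the five rules above concrete, here is a minimal sketch of the scoring as pure functions. Everything in it is an assumption for illustration: the module name `ScoreSketch`, the constants, and the choice of halving the value of each consecutive message from the same sender are not the values or formulas we used in production.

```elixir
defmodule ScoreSketch do
  # All constants are illustrative assumptions, not production values.
  @initial_message_score 10.0
  @repeat_penalty 0.5
  @decay_factor 0.9
  @event_threshold 50.0

  defstruct score: 0.0, last_sender: nil, streak: 0

  # Returns an empty conversation state.
  def new, do: %__MODULE__{}

  # Rules 1-3: value the message and add it to the group score.
  def add_message(%__MODULE__{} = state, sender_id) do
    # The streak counts consecutive messages from the same sender.
    # A message from anyone else resets it to zero.
    streak = if sender_id == state.last_sender, do: state.streak + 1, else: 0
    value = @initial_message_score * :math.pow(@repeat_penalty, streak)
    %{state | score: state.score + value, last_sender: sender_id, streak: streak}
  end

  # Rule 4: called periodically so a silent conversation drifts toward 0.
  def decay(%__MODULE__{} = state), do: %{state | score: state.score * @decay_factor}

  # Rule 5: true once the conversation is active enough to trigger an event.
  def event?(%__MODULE__{score: score}), do: score >= @event_threshold
end
```

With these assumed numbers, two messages in a row from the same person are worth 10 and then 5, while a reply from someone else is worth the full 10 again.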

Cool, so as soon as we understood the requirements, I started thinking: could this be done in a cron job? Do I truly need to increase the complexity of the system with a GenServer? I thought this would be a great use case for two reasons. First, the data was ephemeral and not critical (meaning that if for some reason the score had to be reset or start over, there would be no major downside). Second, the score had some logic to it, so a GenServer seemed a better fit than a database or cache plus some sort of periodically running job. That was the train of thought that led to this solution. Feel free to send your ideas in the comments if you have suggestions.

That said, the final architecture looked something like this (I'll explain):

So let's dig into this baby!

  1. Supervisor was an actual supervisor.
  2. GroupsManager was a GenServer responsible for deciding which of our many groups needed a process to take care of it. We did this so that we didn't need to run a process for each Group (conversation) all the time. Conversations that were long abandoned or didn't meet certain criteria (which we don't need to explain too deeply here) did not get a cool GenServer to track them, so in effect we implemented a "manual supervisor".
  3. GroupWorkers were the actual workers: GenServers that tracked the group score and acted upon it. This is also the place where the score for each message was calculated and then added to the total score.

1. Supervisor

Very simple supervisor. Notice that it starts both our Registry (more on that later) and our Manager. Also notice that the syntax for this changed in Elixir 1.5, which introduced child specifications in place of the older `Supervisor.Spec` helpers.
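
As a rough illustration of that shape, here is a minimal sketch using the child-specification syntax. The module names (`SupSketch.Supervisor`, `SupSketch.Manager`, `SupSketch.Registry`) are hypothetical, and the Manager is only a placeholder so the snippet runs on its own.

```elixir
defmodule SupSketch.Manager do
  # Placeholder standing in for the real Manager GenServer.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}
end

defmodule SupSketch.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # Unique keys, so each group id maps to at most one worker pid.
      {Registry, keys: :unique, name: SupSketch.Registry},
      # Children start in order, so the Registry is up before the Manager.
      {SupSketch.Manager, []}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Calling `SupSketch.Supervisor.start_link([])` starts both children. With `:one_for_one`, a crash in the Manager restarts only the Manager and leaves the Registry alone.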

2. Manager

Ok, now we start getting to some of the good stuff, nothing crazy though. Let's walk through the startup, then some of the more interesting Client and Server APIs. If you're not familiar with GenServers at all, it might be worth checking the docs in parallel at this point.

First things first: inside `start_link` you will notice we are making use of Singleton. I don't want to get too deep into this, but briefly, it is related to our deployment strategy. Because we deploy many nodes into a cluster, we don't want many Managers running at the same time, and there is no need to run more than one process per group either. Singleton is a very nice wrapper around `:global` that allows us to ensure only one Manager process runs at a time in the entire cluster. If you are running only one server, you do not need to worry about this.
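
To show the underlying idea without the Singleton library, here is a minimal sketch that registers a GenServer under a `:global` name directly. This is not Singleton's API, and `GlobalSketch` is a hypothetical module. It only demonstrates that a second attempt to start a process under the same global name is rejected.

```elixir
defmodule GlobalSketch do
  use GenServer

  # Registering under {:global, name} makes the name unique across connected nodes.
  def start_link(arg) do
    GenServer.start_link(__MODULE__, arg, name: {:global, __MODULE__})
  end

  # Returns the pid registered under the global name, or :undefined.
  def whereis, do: :global.whereis_name(__MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end
```

A second call to `GlobalSketch.start_link/1` returns `{:error, {:already_started, pid}}` with the pid of the process that is already running. A real setup also has to decide what the other nodes do when that process dies, which is the part a library like Singleton takes care of.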

After starting up and calling `schedule_workers`, the manager will have started its child processes. After that, periodically it will wake up, check the cache for new pending groups, check the current groups for unhealthy ones (and terminate them if necessary) and schedule itself to run again.
This is a great example of how to run a scheduled process with GenServer. It can substitute for a cron job, for instance, since all it does is manage its children and make sure everything's good every 10 minutes.
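
The scheduling pattern itself is small. Here is a minimal sketch, assuming a hypothetical `ManagerSketch` module with a configurable `:interval` option and a `:check_groups` message. The actual group handling is left as a comment, and a run counter takes its place so there is something to observe.

```elixir
defmodule ManagerSketch do
  use GenServer

  # 10 minutes, as in the post. The :interval option allows a shorter cycle.
  @default_interval :timer.minutes(10)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  # Returns how many scheduled runs have happened so far.
  def runs(pid), do: GenServer.call(pid, :runs)

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :interval, @default_interval)
    schedule(interval)
    {:ok, %{interval: interval, runs: 0}}
  end

  @impl true
  def handle_info(:check_groups, state) do
    # Hypothetical spot for the real work: start workers for pending groups
    # and terminate unhealthy ones.
    schedule(state.interval)
    {:noreply, %{state | runs: state.runs + 1}}
  end

  @impl true
  def handle_call(:runs, _from, state), do: {:reply, state.runs, state}

  # The process sends itself a message after the interval has elapsed.
  defp schedule(interval), do: Process.send_after(self(), :check_groups, interval)
end
```

Because each run schedules the next one only after it has been handled, runs never overlap, which is one thing you would have to guard against yourself with a cron job.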

The rest of the code is pretty self-explanatory, but there are two points I think are worth mentioning:
`start_worker`: Notice how we are making use of a Registry here for the process name. Its main use is to ensure we can later reference the child process by name. This is pretty neat, since I can reference any group worker by id without knowing the actual PID (keep in mind that a Registry is local to the node it runs on). The Registry becomes responsible for mapping data from the Cache (or DB) to the GenServers. This is also how the main application will be able to send messages to our workers later on.
`lookup`: Again, notice how we're making use of the Registry to find the process we need.
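
Here is a minimal sketch of both points, assuming a hypothetical `RegistrySketch` module and registry name. An `Agent` stands in for the group worker to keep the example short. The `{:via, Registry, {registry, key}}` tuple is what lets a process be named by group id.

```elixir
defmodule RegistrySketch do
  @registry RegistrySketch.Registry

  # In a real app the Registry would be started by the supervisor.
  def start_registry, do: Registry.start_link(keys: :unique, name: @registry)

  # Builds the name under which the worker for a group id is registered.
  def via(group_id), do: {:via, Registry, {@registry, group_id}}

  # Starts a stand-in worker registered under the group id.
  def start_worker(group_id) do
    Agent.start_link(fn -> 0 end, name: via(group_id))
  end

  # Finds the worker pid for a group id, if one is running.
  def lookup(group_id) do
    case Registry.lookup(@registry, group_id) do
      [{pid, _value}] -> {:ok, pid}
      [] -> :error
    end
  end
end
```

Since the registry uses unique keys, trying to start a second worker for the same group id returns `{:error, {:already_started, pid}}` instead of creating a duplicate.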

3. Worker

Ok, from a high-level view:

  1. Worker starts up and sets an initial score for a group.
  2. Worker will periodically wake up (every 2s) and run the `:report` routine. When that happens, it will:
    1. Check if the score reached the "event" threshold. Remember, when people are chatting the score will go up and this should cause us to be notified 🎉
    2. Check if the score reached the lower threshold, in which case the group is dead and we should notify the clients ☠️
    3. If neither happens, just broadcast the current (updated) score to our clients. Notice that here we also "decay" the score a little bit (so if nothing else happens it will eventually run down to 0).
  3. The client API `new_message` is responsible for bumping the score every time a new message is created on the application side. Notice how the application has no information about pids or anything of the sort. It just calls this API passing the message (which contains a group_id), and `lookup` is used to find the proper worker.
  4. The process stays alive until one of the limits is reached, at which point it should be properly terminated and the Manager notified.
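
The lifecycle above can be sketched as follows. This is a simplified illustration under several assumptions: the module name `WorkerSketch`, every constant except the 2s interval, and the `:notify` option (a pid that receives plain messages in place of real client broadcasts and Manager notifications) are all made up. To stay self-contained, `new_message/1` takes the worker pid directly instead of looking it up by group id.

```elixir
defmodule WorkerSketch do
  use GenServer

  # Illustrative values. Only the 2s report interval comes from the post.
  @initial_score 20.0
  @message_score 10.0
  @decay_factor 0.9
  @upper_threshold 100.0
  @lower_threshold 1.0
  @default_interval :timer.seconds(2)

  # opts: :notify (pid that receives reports) and :interval (ms between reports).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Client API: bump the score for a new message.
  def new_message(pid), do: GenServer.cast(pid, :new_message)

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :interval, @default_interval)
    Process.send_after(self(), :report, interval)
    {:ok, %{score: @initial_score, notify: Keyword.fetch!(opts, :notify), interval: interval}}
  end

  @impl true
  def handle_cast(:new_message, state) do
    {:noreply, %{state | score: state.score + @message_score}}
  end

  @impl true
  def handle_info(:report, %{score: score} = state) do
    cond do
      score >= @upper_threshold ->
        # Very active conversation: fire the event and stop.
        send(state.notify, {:event, score})
        {:stop, :normal, state}

      score <= @lower_threshold ->
        # Dead conversation: report it and stop.
        send(state.notify, {:dead, score})
        {:stop, :normal, state}

      true ->
        # Report the current score, decay it, and schedule the next report.
        send(state.notify, {:score, score})
        Process.send_after(self(), :report, state.interval)
        {:noreply, %{state | score: score * @decay_factor}}
    end
  end
end
```

Stopping with the `:normal` reason means a linked caller is not taken down when the worker finishes, and anyone monitoring the worker receives a `:DOWN` message they can use for cleanup.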

The rest of the code should be simple to understand… 😀

Final considerations

Working with GenServer was actually very fun and once you have the basics working it is really easy to expand and transform the processes to do more.
It is worth mentioning that the above example is modified and not meant for production code, but instead tries to illustrate an interesting situation. I know there are many places we could refactor the code to make it prettier and more efficient… (feel free to contribute :) )

Some of the things that we should consider, which I don't go into much in this article, are:

  1. What happens if the GenServer goes down for some reason? In this specific case, I'm using the cache to rebuild the child process, but the current score will be reset if that happens. I am ok with that, which is why it is important to understand how much fidelity you need around the data. If you need a fault-tolerant system, you will have to design your application so that it is smart enough to handle problems, and you will most likely need to store the ephemeral data somewhere (cache?). Luckily, I didn't have that problem in this situation, but it is something I spent some time wondering about (imagine many processes hitting the cache every 2s to store the information… is it worth it?). Remember that if you deploy separate nodes into a cluster, you'll likely face that problem, since it is normal to spin up new containers and kill old ones.
  2. What happens if bad data corrupts your source of truth? In my particular case, I always start with the same score (`@initial_score`) but if this data was being modified or being fetched somewhere and it was corrupt, it could make the Supervisor keep spawning processes that were fated to die. I researched around this a little and there are some solutions and suggestions, but I'd love to hear more about it in the comments if you have ideas 💡.

And I think that's it! I hope you found this post entertaining. It certainly was a lot of fun playing with GenServers in a real application. If you have any questions, suggestions or ideas, please feel free to leave a comment!

Next post I will talk about strong params in Phoenix! You know you want it… 🙃
You may also be interested in admin routes in Phoenix or how sometimes Ecto preloads can be evil!