Setting up an Elixir cluster on EC2 with Edeliver and Distillery

Bruce Pomeroy
Feb 12, 2017 · 8 min read


Here’s an article that helped me a lot; please check it out:

https://dockyard.com/blog/2016/01/28/running-elixir-and-phoenix-projects-on-a-cluster-of-nodes

It’s trivial to deploy to multiple servers using Edeliver: simply list the IPs or hostnames of your instances in .deliver/config, separated by spaces.

#.deliver/config
PRODUCTION_HOSTS="1.1.1.1 1.1.1.2" #separate ips or hostnames with spaces
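
With the hosts listed, deploying to the whole cluster is the same Edeliver workflow as for a single server. Roughly, assuming a standard Edeliver/Distillery setup with a build host already configured:

$ mix edeliver build release
$ mix edeliver deploy release to production
$ mix edeliver start production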

Obviously you’ll want to put a load balancer, proxy or DNS router in front of these servers so users can browse to your app at a single domain name without caring about the IPs of the individual servers in your cluster, but that’s another topic. For the purposes of this example we can simply run Cowboy on port 80 on each server.

# config/prod.exs
config :my_app, MyApp.Endpoint,
  http: [port: 80],
  ...

If you don’t have any hostname restrictions in your router you should be able to browse to your app at http://1.1.1.1 or http://1.1.1.2 (obviously replacing those with the IPs of your own servers). Great!

There’s something wrong though. We have two instances of our app, each running on its own machine, but they’re not connected to each other. In many cases that’s ok: you still get the redundancy and scalability of multiple machines, but you’re missing out. The first place you’ll notice it is if you’re using Phoenix Channels’ broadcast and related functions. You have two instances of your app, and your load balancer has connected some of your visitors to one instance and some to the other.

For the sake of example, open two browser tabs and browse to your app in both. If your app broadcasts on a channel, you should see a broadcast initiated by an action in one tab reflected in the other (depending, of course, on what your app does). Now try going to your app using the IP of one instance in one tab and the IP of the other instance in the second tab. Notice that the broadcast from one tab no longer reaches the other. So how do we make the instances of our app collaborate with one another, so channels can broadcast to all users no matter which instance they’re connected to?
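
For instance, with the two nodes unclustered, a broadcast like the one below (the topic, event and payload are made up for illustration) only reaches the websocket connections held by the node that runs it:

# Hypothetical broadcast; topic, event and payload are illustrative
MyApp.Endpoint.broadcast("room:lobby", "new_msg", %{body: "hello"})

Once the nodes are connected, Phoenix’s PubSub distributes broadcasts across the whole cluster, which is what we’re after.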

There are a few things we need to do:

  1. Ensure each instance runs with a unique Erlang node name.
  2. Ensure each instance is configured to try to connect to the other instances by name when it starts.
  3. Ensure the necessary ports are open in our AWS security group so the nodes can communicate.

Let’s use the vm.args file to configure each node. Note that this configuration file is read by the node when it starts; it isn’t compiled into the app at build time like config/prod.exs and the other Elixir config files. We’ll tell Edeliver to create a symlink to a vm.args file that we’ll create outside of the release. In .deliver/config add the following:

LINK_VM_ARGS=/home/myapp_deployer/myapp/vm.args
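
If you build your releases with Distillery, this setting only takes effect once edeliver’s link-config plugin is enabled in rel/config.exs (there’s a warning about this, with a link, in the next paragraph). A sketch of what that looks like; the plugin module name is taken from the edeliver source linked below, so double-check it against your edeliver version:

# rel/config.exs
environment :prod do
  # ... your existing prod release settings ...
  # lets Edeliver symlink the external vm.args into the release
  plugin Releases.Plugin.LinkConfig
end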

Obviously pick a path that makes sense for your app. We’re telling Edeliver that it should expect to find a vm.args file at that location and that it should create a symlink to it. WARNING: if you’re using Distillery you need to add a plugin for LINK_VM_ARGS to take effect, as sketched above; see this link for more info: https://github.com/boldpoker/edeliver/blob/master/lib/distillery/plugins/link_config.ex#L5-L15. Now we need to make sure the file is there. Use scp or ssh to create the following file at that location on each server. Note that the file will be different for each server; specifically, it includes the IP of the server it’s on:

# vm.args
-name myapp@1.1.1.1
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-setcookie 123456789123456789123456789123456789
-config /home/myapp_deployer/myapp/myapp.config
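
The vm.args on the second server is identical apart from the name line, which uses that server’s own address, e.g.:

# vm.args on the 1.1.1.2 server
-name myapp@1.1.1.2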

This file does a few things:

  1. Declares the unique name of this node. The text before the @ sign is arbitrary and useful for making the name unique when you have multiple nodes on one machine.
  2. Declares that Erlang should use ports 9100-9155 for intra-cluster communication.
  3. Specifies a “cookie”; this must be the same for all nodes in the cluster. It should be kept secret; look for “magic cookie” in the Erlang docs for more info.
  4. References a config file which will list the “peer” nodes that this node should try to connect to. Let’s create that file next:
# myapp.config
[{kernel,
  [
    {sync_nodes_optional, ['myapp@1.1.1.2']},
    {sync_nodes_timeout, 30000}
  ]}
].

Note that this file lists the “peer” nodes: ideally it should list every node in the cluster except the node the file belongs to. Also note that it uses `sync_nodes_optional`, meaning the node will still start even if it’s unable to connect to its peers. The timeout, specified in milliseconds, is how long the node should keep trying to connect.
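
For example, the copy of myapp.config on the 1.1.1.2 server would list the first node instead:

# myapp.config on the 1.1.1.2 server
[{kernel,
  [
    {sync_nodes_optional, ['myapp@1.1.1.1']},
    {sync_nodes_timeout, 30000}
  ]}
].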

So here’s what we’ve done:

  1. Created a file on every server called vm.args that declares the name of the node along with some additional info.
  2. Created a file called myapp.config that lists all the other nodes (the nodes this node should try to connect to).
  3. Used `LINK_VM_ARGS` in our Edeliver config to tell Edeliver to create a symlink to our vm.args file so that the Erlang app will read from it on startup.

One more thing: in order for our nodes to collaborate we need to make sure they can reach each other on the ports they expect. At this stage we could get everything working by adding the following inbound rules to our AWS security group. If you’re working with a test app you could do this now for the sake of the example, but I don’t recommend it in production; you’ll see why soon.
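
A sketch of what those rules look like: epmd’s default port (4369), which Erlang distribution uses for node discovery, plus the port range we chose in vm.args, open to everyone:

Type         Protocol   Port range    Source
Custom TCP   TCP        4369          0.0.0.0/0
Custom TCP   TCP        9100-9155     0.0.0.0/0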

If everything went smoothly, you should be able to restart your apps and the nodes should connect. To troubleshoot or verify, ssh into any member of your cluster and locate the release executable. You should find it somewhere like:

/<deploy_path>/myapp/bin/myapp. You can run ./myapp stop to stop the node if it’s currently running, then run ./myapp console to run the node with a REPL attached; the REPL is an iex shell running in the context of your node, so you can type Elixir into it. For example, to list the other nodes this node is connected to, type Node.list. If things went well you should see the node names of all the other members of your cluster.
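
An idealised session on the first server, using the example node names from above, would look something like this:

$ ./myapp console
iex(myapp@1.1.1.1)1> Node.list
[:"myapp@1.1.1.2"]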

So those are the mechanics of it, and pretty much the bare minimum required to get a cluster connected.

There’s something wrong though: the 0.0.0.0/0 in our security group rules. It means these ports are open to the entire internet. To demonstrate the problem, let’s connect to our cluster from an iex session on our laptop:

$ iex --name intruder
> Node.set_cookie(:"magic-cookie") # match the cookie from vm.args
true
> Node.connect(:"myapp@1.1.1.1") # match one of your node names
true
> Node.list()
[:"myapp@1.1.1.1", :"myapp@1.1.1.2"]

Sure, we had to know the magic cookie and the name of one of our nodes, but this isn’t good: it’s way too easy for an outsider to connect to our cluster. You can read more about the magic cookie and Erlang network security in general at http://learnyousomeerlang.com/distribunomicon, but the key point is that magic cookies shouldn’t be considered a way of securing clusters.

So first things first, let’s use the AWS security group associated with our web servers to restrict access so that only connections from other members of the same group are allowed. Replace those 0.0.0.0/0 sources with the id of the security group that your servers are associated with:
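
A sketch of the locked-down rules; same ports, but the source is now the security group itself (sg-xxxxxxxx is a placeholder for your group’s id):

Type         Protocol   Port range    Source
Custom TCP   TCP        4369          sg-xxxxxxxx (this security group)
Custom TCP   TCP        9100-9155     sg-xxxxxxxx (this security group)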

That’s better. But now you’ll find that your nodes can no longer connect. That’s because we used the public IPs of the servers; if we’re locking down the security group as shown above, we need to use private IPs to refer to the machines in the cluster. That’s ok: in your myapp.config and vm.args files, replace every public IP with the corresponding private IP. When you restart your app the nodes should connect again, and this time they’re connecting within the AWS security group. You’ll no longer be able to connect to your cluster from your laptop, and neither will anyone else.
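
Using the example addressing from the script later in this post (1.1.1.x as the public IPs, 5.5.5.x as the private ones), the changed lines on the first server would look something like this:

# vm.args on the server whose private ip is 5.5.5.5
-name myapp@5.5.5.5

# myapp.config on that same server
{sync_nodes_optional, ['myapp@5.5.5.6']},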

So we have a functioning cluster now. It’s not very practical though; it will be hard to maintain. You don’t want to be sshing into your servers and manually copying config files to them, especially when each server needs a slight variation on the config file. Adding or replacing machines is going to be tedious and error-prone. We’re also hardcoding IPs in the config files, and depending on your setup you may find that if you shut down a server and start it back up, its IP has changed, and then you’ll need to ssh in and edit the config files again.

Some of the awesome guys on the Elixir Slack channel mentioned that using a service discovery tool like consul or etcd is a good approach. They also suggested using Chef or similar to manage the cluster and, presumably, automate the creation of the appropriate config files. I believe there are also some Elixir libraries that help with discovering other nodes to connect to. I’ll be looking into those approaches in the future.

For now I have a crude Bash script which takes a list of IPs and loops over them, creating the appropriate config files for each and copying them up to each machine using ssh. It’s still essentially a manual solution, but it’s much quicker and less error-prone than editing the files manually on each server. Whenever the membership of the cluster changes (due to machines being added or removed), I can update the script with the latest list of IPs and run it again to update the config files on all the members of the cluster. I know it’s nasty but maybe it will be a useful starting point for someone. I pushed my Bash scripting skills to their limits on this one ( ;

I’m sure someone can radically improve this, but it serves the purpose of setting up all the nodes you list with the correct config to connect to all the other nodes. Obviously you’ll need to adapt it for your needs, but perhaps it will be a useful starting point. Note that it needs your public and private IPs in pairs (the public IP for sshing and the private IP for configuring nodes). HOSTS is a list of public_ip,private_ip pairs separated by spaces. In this example the 1.1.1.x IPs are the public IPs and the 5.5.5.x are the private ones.

#!/bin/bash
# List of public_ip,private_ip pairs, separated by spaces
HOSTS="1.1.1.1,5.5.5.5 1.1.1.2,5.5.5.6"
VM_ARGS_LOCATION="/.../vm.args"
CONFIG_LOCATION="/.../myapp.config"
NODE_NAME_PREFIX="myapp"
SSH_USER="deployer"
CONNECT_TIMEOUT=10000
MAGIC_COOKIE="123456789123456789123456789"

for HOST_AND_PRIVATE_IP in $HOSTS; do
  IFS=, read HOST PRIVATE_IP <<< "$HOST_AND_PRIVATE_IP"
  echo "======================="
  echo "Setting up server $HOST with private ip $PRIVATE_IP"

  # Generate content to write to vm.args (continuation lines are not
  # indented so they end up flush-left in the file):
  VM_ARGS_CONTENT="-name $NODE_NAME_PREFIX@$PRIVATE_IP
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-setcookie $MAGIC_COOKIE
-config $CONFIG_LOCATION
"
  echo "--------------"
  echo "$VM_ARGS_LOCATION"
  echo "$VM_ARGS_CONTENT"
  # Write the vm.args file for this host:
  echo "$VM_ARGS_CONTENT" | ssh $SSH_USER@$HOST "cat > $VM_ARGS_LOCATION"

  # Build a comma-separated list of the names of all the nodes except
  # the current one (all the nodes this node should try to connect to):
  PEER_NAMES=""
  for MAYBE_PEER in $HOSTS; do
    if [ "$MAYBE_PEER" != "$HOST_AND_PRIVATE_IP" ]; then
      IFS=, read PEER_HOST PEER_PRIVATE_IP <<< "$MAYBE_PEER"
      PEER_NAMES="$PEER_NAMES '$NODE_NAME_PREFIX@$PEER_PRIVATE_IP'"
    fi
  done
  PEER_NODES="$(echo $PEER_NAMES | tr ' ' ',')"

  # Generate the content for the config file, including the names of
  # all the other nodes and the connect timeout:
  CONFIG_CONTENT="[{kernel,
  [
    {sync_nodes_optional, [$PEER_NODES]},
    {sync_nodes_timeout, $CONNECT_TIMEOUT}
  ]}
]."
  echo "--------------"
  echo "$CONFIG_LOCATION"
  echo "$CONFIG_CONTENT"
  # Write the config file for this host:
  echo "$CONFIG_CONTENT" | ssh $SSH_USER@$HOST "cat > $CONFIG_LOCATION"
done
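
Run it from your own machine, then restart the nodes so they pick up the new config. Something like this, where the filename is just an example:

$ chmod +x cluster_config.sh
$ ./cluster_config.sh
$ mix edeliver restart production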
