Elixir/OTP : Basics of processes

Published in Elemental Elixir · 30 min read · Jan 23, 2024

Processes in Elixir are the basic units of concurrency and form the foundation for the properties of a highly available system, such as scalability, fault tolerance, resilience and seamless distribution. Code in Elixir runs inside processes, and thousands of processes run in isolation and communicate with each other via messages to keep such a system running. The Open Telecom Platform (OTP) is a framework that provides many commonly used, readily available abstractions built around processes, which make creating highly available applications much easier. This article explains what Elixir processes are, their internal memory structure, syntax, usage and applications. Please note that Elixir processes are deeply rooted in the Erlang Virtual Machine (BEAM). Therefore, when we talk about Elixir processes in this article, we are implicitly discussing the underlying concurrent execution model inherited from Erlang.

Elixir processes are extremely lightweight and hence thousands of processes can be run efficiently in a machine. To understand how lightweight they are, let’s first understand the difference between OS processes, OS threads, virtual threads, Elixir processes and how all of these units of concurrent execution relate with running an Elixir application.

OS process

An OS process is a heavyweight, fundamental unit of concurrency created and managed by the operating system. Every program or application that you start runs inside an OS process. Every OS process has its own memory space and can create and manage multiple OS threads within it. Since every OS process has its own memory space, it is strongly isolated from the other processes run by the OS. Since an OS process is created and managed by the OS directly, starting one takes a considerable amount of memory and time. Applications like the Google Chrome browser use a separate OS process for each tab that you open. This provides isolation between tabs, ensuring that problems in a single tab do not affect the other tabs, as the underlying OS processes that run these tabs are strongly isolated from each other. This isolation comes at the cost of using a lot of memory. Elixir applications run on the Beam VM, and every Beam VM instance that you start takes up a single OS process.

OS thread

An OS thread is a lighter unit of concurrency and execution in the operating system that can be created within an OS process. It takes less time and memory to create and manage an OS thread. Each OS thread has its own memory areas, such as the stack and registers, that are used to run instructions independently from the other OS threads in the process. But all of the OS threads within an OS process share the data present in the OS process’s memory space. Hence, OS threads within an OS process are not as isolated from each other. These threads can communicate easily with each other through the shared memory space, but for multiple OS threads to read or modify the same data in that shared memory space, synchronisation mechanisms such as locks and mutexes must be employed, which makes concurrent modification of shared data difficult to get right. Programming languages like Java use threads to handle concurrent requests while sharing the memory within the OS process. Each Java thread directly maps to an OS thread created within the OS process that runs the Java VM. The Beam VM that runs within an OS process creates a pool of OS threads that are used for handling different parts of the VM.

Virtual threads

Virtual threads are a lighter version of OS threads. Unlike OS threads that are created and managed by the OS, the virtual threads are created and managed by the runtime or the application itself, thus reducing the OS overhead. They are lighter than the OS threads and hence take up less memory and time to start up. The individual application or runtime defines the properties of these virtual threads such as their memory structure, communication mechanism, level of isolation etc and multiple virtual threads are created within an OS thread to obtain finer and lighter units of concurrency. Internally multiple virtual threads may still map to a particular OS thread. Various programming languages like Java, Ruby, Go etc offer lightweight concurrency via virtual threads as a part of the language or through libraries.

Elixir process

Elixir processes are similar to virtual threads but they are not exactly the same. The Beam VM creates and manages the Elixir processes. Unlike virtual threads which may be internally mapped to an OS thread, Elixir processes are not tied to any particular OS thread and are completely managed internally by the Beam VM. Elixir processes are strongly isolated as each process has its own memory space. The only way of communication between processes is through passing messages. This strong isolation forms the foundation for fault tolerance, as issues with one particular process do not affect the other running processes in the Beam VM. Each process takes a very small amount of initial memory and hence spawning a process is extremely fast since they do not require direct intervention from the OS.
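
To get a feel for how cheap processes are, the following sketch spawns ten thousand of them and waits for each one to report back. The count of 10,000 and the variable names are illustrative choices for this example, not something prescribed by the runtime.

```elixir
# Spawn 10_000 processes; each one sends its index back to the parent and exits.
parent = self()

pids =
  for i <- 1..10_000 do
    spawn(fn -> send(parent, {:done, i}) end)
  end

# Collect one reply per spawned process (give up on a reply after 5 seconds).
replies =
  for _ <- pids do
    receive do
      {:done, i} -> i
    after
      5_000 -> :timeout
    end
  end

reply_count = Enum.count(replies, &is_integer/1)
IO.puts("spawned #{length(pids)} processes, received #{reply_count} replies")
```

On a typical machine this finishes almost instantly, which would be hard to achieve with the same number of OS processes or OS threads.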

Concurrency and Parallelism

Theoretically, the number of tasks or instructions that can be processed in parallel at a single point in time is restricted by the number of cores or CPUs present in a machine. Hence, if your machine has four cores, then the number of tasks that can be performed in parallel at a single point in time is 4, one for each core. But a typical machine has thousands of tasks and operations that need to be processed seemingly at once. This is where concurrency comes in.

Concurrency involves allocating small chunks of time slices to different tasks that need to be performed. Each task is scheduled and is run for the provided time slice and then suspended, after which another task is then run in the same manner. This process of running a task, suspending it and picking up and running another task is called context switching. Context switching is done very fast so that it appears that all those tasks are being run simultaneously. Context switching also requires saving the state of a task before suspending it so that the next time it is scheduled for running, the previous saved state can be loaded to continue from where it left off.

Every unit of concurrency that we have seen above involves scheduling, and for OS processes and OS threads, the OS handles the scheduling and context switching, while for virtual threads and Elixir processes, the runtime or application manages scheduling and context switching. Thus concurrency allows for the execution of multiple tasks in overlapping time intervals, while parallelism involves multiple tasks being run at the same time, simultaneously by using multiple CPU cores.

Beam VM has its own scheduler within it and by default it employs one scheduler per core, with each scheduler mapped to an OS thread run within the Beam VM OS process. These schedulers make sure that multiple processes are queued in run queues and scheduled for execution, thus utilising all of the CPU cores present in the machine. This is how the Beam VM employs both concurrency and parallelism to run thousands of processes in a machine. The Beam VM also uses separate threads or mechanisms for I/O operations so that the execution of processes is not affected. Furthermore Beam VM uses preemptive multitasking, where the scheduler has the control to suspend processes at will and perform context switching of processes after a provided time window, ensuring that processes that have long running tasks do not block execution of other processes.
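
The number of schedulers in a running VM can be checked from Elixir itself. A minimal sketch, using the System module; the printed numbers depend on the machine and on how the VM was started.

```elixir
# Number of scheduler threads currently available to run processes.
online = System.schedulers_online()

# Total number of scheduler threads the VM was started with.
total = System.schedulers()

IO.puts("schedulers online: #{online} of #{total}")
```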

In Beam VM, the execution time slice for a process is determined by a reduction/function call limit. Once a process starts execution and crosses the allocated reduction limit, context switching will be performed by the scheduler to run another process. This is how Elixir enables responsiveness of an application. Unlike OS processes and OS threads, context switching in Elixir processes is much faster and requires a minimum amount of state to be stored and loaded. Elixir’s immutability property also plays a major role in enabling concurrency and parallelism by simplifying state management and providing thread safety. Even though immutability takes a toll on performance by creating new versions of data every time a piece of data is modified, it is a required trade off for achieving a high level of concurrency.
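
The reduction counter of a process can be observed with Process.info/2. The sketch below reads the counter before and after some arbitrary work; the workload (summing a range) is only an illustration, and the exact numbers differ between runs and VM versions.

```elixir
# Reductions consumed so far by the current process.
{:reductions, reds_before} = Process.info(self(), :reductions)

# Do some work; the function calls below consume reductions.
_sum = Enum.reduce(1..100_000, 0, fn x, acc -> x + acc end)

{:reductions, reds_after} = Process.info(self(), :reductions)
IO.puts("reductions used by the work: #{reds_after - reds_before}")
```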

A Comparison between units of concurrency

Actor model

Elixir’s concurrency model is based on the Actor model which defines the following properties.

Actors are the fundamental units of computation in the Actor Model. They are independent entities that can run concurrently, each with its own isolated state. The state of one actor cannot be directly accessed or modified by another. This isolation enhances fault tolerance since issues in one actor do not impact others, aiding in localising the effect of an error or issue to the process.
Communication between actors is done exclusively through message passing. Actors can send messages asynchronously(non-blocking) to each other to exchange information and trigger behaviour. Each actor has an address and a mailbox where it receives messages from other actors. The actor can read the messages one by one from the mailbox and can decide how to process the message.
Actors can be distributed across multiple nodes, and the programming model remains the same, as actors interact through messages, regardless of their physical location. This property is the foundation for seamless distribution and simplified scalability.
Actors can create more actors if required and thus enable creation, termination and supervision of new actors when required. This property is the foundation for building resilient systems which heal automatically in case of errors and issues.
Actors can execute concurrently without any need for explicit synchronisation, allowing for parallelism and efficient use of resources.

Thus in Elixir, the implementation of the Actor Model is deeply integrated into the language and runtime environment. Processes in Elixir act as lightweight actors, and the Beam VM provides the necessary infrastructure for message passing, isolation, and supervision.
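
The isolation property can be seen in a small experiment: one process exits abnormally while an unrelated process and the parent keep running. This is a sketch that assumes the processes are started with plain spawn/1, which does not link them to the parent; the 100 ms pause is an arbitrary value chosen to let the first process finish.

```elixir
# A process that exits abnormally right after starting.
# exit/1 is used here so that no error report is printed.
crashing = spawn(fn -> exit(:boom) end)

# A second, unrelated process that just waits for a message.
waiting =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# Give the first process a moment to terminate.
Process.sleep(100)

crashing_alive = Process.alive?(crashing)
waiting_alive = Process.alive?(waiting)
parent_alive = Process.alive?(self())

IO.inspect(crashing_alive, label: "crashing process alive?")
IO.inspect(waiting_alive, label: "waiting process alive?")
IO.inspect(parent_alive, label: "parent alive?")

# Let the waiting process finish.
send(waiting, :stop)
```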

Internal memory structure of an Elixir process

An Elixir process’s internal memory structure contains several memory areas such as the stack, heap, process control block (PCB), mailbox, heap fragments and process dictionary. A newly spawned Elixir process takes up about 338 words of initial memory (roughly 2.7 KB on a 64-bit system, where a word is 8 bytes), which covers all of the above mentioned memory areas. The exact figure varies between Erlang/OTP versions.

The stack and heap are allocated in a single contiguous memory block, where the heap starts at the lowest address and grows upwards, while the stack starts at the highest address and grows downwards. When these two memory structures meet and the block is out of memory, garbage collection runs to reclaim memory. If the heap still needs more memory after that, its size is increased, thus increasing the memory used by the whole process. Garbage collection happens per process and does not affect the execution of other processes. The stack is used to store function calls and their associated local variables, including immediate terms and references to boxed terms. The initial stack takes up well under a kilobyte when a process is spawned. The heap, on the other hand, is used to store the data of data structures such as tuples, maps, lists, floats, binaries etc. The default initial heap size for a process is 233 words (about 1.8 KB on a 64-bit system). This initial heap size can be configured using flags when starting the VM.
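
These figures can be inspected for a live process with Process.info/2. The sketch below spawns a process that only waits for a message and then reads a few of its memory attributes; the numbers printed depend on the Erlang/OTP version and the VM flags in use.

```elixir
# A process that simply waits, so that it can be inspected while it is alive.
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# :memory is reported in bytes; :heap_size and :stack_size are reported in words.
info = Process.info(pid, [:memory, :heap_size, :stack_size])

# Size of one word in bytes (8 on a 64-bit system).
word_size = :erlang.system_info(:wordsize)

IO.inspect(info, label: "process info")
IO.puts("word size: #{word_size} bytes")

# Let the inspected process finish.
send(pid, :stop)
```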

The process control block(PCB) of a process is a small fixed size memory area used mainly to store metadata about the process. It contains information such as the process identifier(PID), registered name, current state of the process, pointers to stack and heap locations, reduction count, message queue length(unread messages in mailbox), memory usage information etc.

The mailbox is the memory area that is used for storing the received messages sent by other processes. It uses two FIFO queues built on top of linked lists. When a message is received, the original data related to the message is added to the heap of the process and the reference to that message data is added to the tail of the message queue in the same order that they are received. These messages can be read one by one in the same order of arrival and can be processed as required by the process. The messages are removed from the message queue and are pattern matched to find the respective code block that needs to be executed for the respective message. The second queue called the save queue is used to temporarily store messages that do not match any of the patterns used to match the messages. These unmatched messages from the save queue are added back to the head of the message queue for retrying, after the process successfully matches and processes a valid message. Since message passing is asynchronous, the sender process, by default, will not wait once it sends out a message. Synchronous message passing can be programmatically performed which will be discussed later.

The heap fragments memory area is used as a temporary small additional memory for the heap, when there is not enough memory in the heap and the garbage collection process is yet to be performed to reclaim memory for the heap. Once the garbage collection process is done, they will be merged back into the heap memory. They are only created in the above scenario and may not be present initially in a process with already sufficient heap memory.

Process dictionary is a memory area that can be used to store and read values based on keys. They are internally used to store additional metadata about the process such as the initial function call of the process, parents of the process etc. Even though they can be used for storing custom key-value data and can be used for certain use cases, it is advised against explicitly using the process dictionaries, since relying on them could lead to side effects and cause bugs that would be hard to debug.
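
For completeness, this is roughly what using the process dictionary looks like. The sketch uses Process.put/2 and Process.get/1 and shows that each process has its own separate dictionary; the key :request_id is a made-up example.

```elixir
# Store and read a value in the current process's dictionary.
Process.put(:request_id, 42)
own_value = Process.get(:request_id)

# A different process has its own dictionary and does not see the value.
parent = self()
spawn(fn -> send(parent, {:child_value, Process.get(:request_id)}) end)

child_value =
  receive do
    {:child_value, value} -> value
  after
    1_000 -> :timeout
  end

IO.inspect(own_value, label: "parent sees")
IO.inspect(child_value, label: "child sees")
```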

Simple representation of a process’s memory structure

Lifecycle of a process

A process can be spawned or created by another process and the spawned process will be provided a function to run. Once the Beam VM spawns a new process, the newly spawned process will start executing the function provided to it. Once it finishes execution of the provided function it will be terminated. Hence a spawned process will be alive as long as it has something to execute. Once a process is terminated all of the memory used by it is reclaimed for use. This also includes any resources that the particular process has opened such as files, ports etc.

Process creation

Code always runs inside processes in Elixir, and this is true for an iex shell as well. Whenever an iex shell is fired up, a process is spawned and the code/expressions entered in the shell are executed within that process. A PID is a data type in Elixir that serves as an address for a process. Communication between processes occurs mainly by using the PID to identify the target process. The PID of the process that is currently executing code can be obtained by using the self/0 function.

iex> self()
#PID<0.109.0>

A process can be spawned primarily using spawn/1 and spawn/3 functions. They take in either an anonymous function or a named function in the form of module, function and arguments. As discussed above, a new process will be spawned and the provided function will be executed after which the spawned process will be terminated.

spawn(fn -> IO.puts("from another process") end)
from another process
#PID<0.110.0>

spawn(IO, :puts, ["from another process"])
from another process
#PID<0.111.0>

In the above examples, a new process will be spawned and the IO.puts call will be executed within the newly created process. You can also see a PID printed after every spawn call. The spawn functions always return a process identifier(PID) of the newly created process which can be stored for future communication with the spawned process. The execution in a newly spawned process is isolated and the current process that spawned it will not have any idea about the context or execution in the spawned process. We are seeing the message being printed in the console only because the console is common for all the processes. If a function that does not print anything in the console is provided in the spawn function call, there will be no indication of any execution for the parent process.

spawn(fn -> String.length("hello") end)
#PID<0.113.0>

The Process module provides various functions for working with processes. The Process.alive?/1 function takes in a PID and returns a boolean indicating whether the process associated with the provided PID is alive at the moment. This can be used to verify that once a spawned process finishes execution, it terminates and is no longer alive.

current_pid = self()

Process.alive?(current_pid)
true

spawned_pid = spawn(fn -> IO.puts("Executing...") end)
Executing...
#PID<0.112.0>

Process.alive?(spawned_pid)
false

Sending and receiving messages

Communication between processes is one of the major properties of Elixir’s concurrency model. A message can be sent from one process to another using the send/2 function that takes in the receiver process’s PID and the message data that needs to be sent. Message sending is asynchronous and hence, the sender process will move on to execute the rest of the code as soon as it has sent the message to the receiver process. The send function call will immediately return the same message passed into it and move on to the next expression. This is true even if you send a message to a PID associated with a terminated process. A process can also send messages to itself by using its own PID as the first argument to the send/2 function.

send_fn = fn ->
  IO.puts("sending msg to myself..")
  send(self(), :msg)
  IO.puts("executing next expression")
end

send_fn.()
sending msg to myself..
executing next expression
:ok
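
The behaviour of send/2 towards a terminated process can be checked in the same way. In this sketch the spawned process exits immediately; the 100 ms pause is an arbitrary value chosen to make sure it has terminated before the message is sent.

```elixir
# Spawn a process that finishes right away.
pid = spawn(fn -> :ok end)

# Give it a moment to terminate.
Process.sleep(100)
alive = Process.alive?(pid)

# send/2 still returns the message, even though nobody will receive it.
result = send(pid, :hello)

IO.inspect(alive, label: "alive?")
IO.inspect(result, label: "send returned")
```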

Once a message is received by the receiver process, the data related to the message will be copied to its heap and a reference to this heap message data will be added to the tail of the message queue. We can verify this by using the Process.info/1 function that takes in a PID and returns a lot of information about the process associated with the PID. You can also get information about specific attributes by using the Process.info/2 function.

Process.info(self(), :message_queue_len)
{:message_queue_len, 1}

Process.info(self(), :messages)
{:messages, [:msg]}

As you can see above, there is the one message :msg present in the message queue yet to be processed. The messages by default stay there in the message queue until the receiver process explicitly uses the receive macro to read and process the message. The receive macro takes in a do block and an optional after block. The do block’s contents follow the same syntax and behaviour as the case construct. It has a series of patterns that will be matched against the message data of received messages in the message queue. The first pattern that matches with the message data will have its code block executed. Once the receive macro successfully matches a message and runs the respective code block, the execution will get out of the receive block and will move on to the next expression. The process will terminate once all the expressions are executed in the receiver process.

send(self(), {:msg, "Message"})

receive do
  {:msg, message}-> IO.puts("message found in mailbox: #{message}")
end
message found in mailbox: Message
:ok

The receive block goes through the messages one by one, in the same order that they arrived, until it finds a message that matches a provided pattern. The messages that do not match any pattern are added to the save queue. Once the receive block finds a matching message and executes the respective code block, any messages in the save queue are added back to the head of the message queue. Hence the messages are always read from the mailbox in order from oldest to newest. The more messages there are in the save queue, the longer it takes to reach a valid message, as the unmatched messages are added back to the head of the message queue every time and have to be scanned again before the next valid message is reached.

Moreover the message data referenced by these invalid messages will keep clogging up the heap memory leading to a poor performance of the process. A common practice of dealing with this is to add a match-all pattern as the last pattern in the receive block so that all invalid messages are matched and removed from the mailbox, thus not being in the way of the next valid messages. Once a message is removed from the mailbox then its respective message data in the heap will also be eligible for garbage collection. The code block associated with the invalid messages usually logs them, sends them to another process for debugging or just ignores them if required. For a process that is anticipated to receive a lot of messages, it is also possible to store the message data outside the process heap so that the process performs well. This can be done by changing the default value, :on_heap of the :message_queue_data flag of the process to :off_heap using Process.flag/2 function.

send(self(), {:invalid, "invalid Message"})
send(self(), {:msg, "Message"})

receive do
  {:msg, message}-> IO.puts("message found in mailbox: #{message}")
  _ -> IO.puts("Invalid message ignored")
end
Invalid message ignored
:ok

receive do
  {:msg, message}-> IO.puts("message found in mailbox: #{message}")
  _ -> IO.puts("Invalid message ignored")
end
message found in mailbox: Message
:ok
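
The :message_queue_data flag mentioned above can be set from inside the process that is expected to receive many messages. A minimal sketch, assuming a throwaway process that sets the flag and then waits; Process.flag/2 returns the previous value of the flag, which is :on_heap unless the VM default was changed.

```elixir
parent = self()

pid =
  spawn(fn ->
    # Switch this process's mailbox storage and report the previous setting.
    previous = Process.flag(:message_queue_data, :off_heap)
    send(parent, {:previous, previous})

    receive do
      :stop -> :ok
    end
  end)

previous =
  receive do
    {:previous, value} -> value
  after
    1_000 -> :timeout
  end

# Read the flag back from the outside.
current = Process.info(pid, :message_queue_data)

IO.inspect(previous, label: "previous setting")
IO.inspect(current, label: "current setting")

# Let the process finish.
send(pid, :stop)
```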

When execution in a process reaches a receive block and there are no messages matching the provided patterns, or no messages at all in the mailbox, the process’s state is changed to waiting. The scheduler then performs a context switch to execute another process instead. This is done so that a waiting process does not take up resources and block the execution of other processes. This is one of the scenarios where a process yields to the scheduler before reaching the allotted reduction limit. Once a waiting process receives a message, its state is changed to scheduled and it is added to a run queue, from where the scheduler picks it up and executes it until it reaches its reduction limit or goes into the waiting state again.

Process.info(self(), :message_queue_len)
{:message_queue_len, 0}

receive_fn = fn ->
  IO.puts("starting :: #{Time.utc_now()}")
  receive do
    # inspect/1 lets any term be printed, not only strings and atoms
    x -> IO.puts("Message received : #{inspect(x)}")
  end
  IO.puts("Outside receive block")
end

receive_fn.()
starting :: 14:25:20.134000

The above receive block when run in an iex shell with no messages in the mailbox, will become unresponsive since the process running the iex shell goes into waiting state. The optional after block at the end of the receive macro’s do block can take a single clause that can be used to specify a particular time in milliseconds for which the process will be in waiting state. After the provided time has elapsed, the process will get out of the waiting state and the execution is resumed starting from the after block’s code and then the expressions outside the receive block in the receiver process’s code. If the value 0 is given to the after block, the execution will get out of the receive block as soon as it scans all the messages in the mailbox for a match and doesn’t find one.

receive_fn = fn ->
  IO.puts("starting :: #{Time.utc_now()}")
  receive do
    x -> IO.puts("Message received : #{inspect(x)}")
  after
    # runs only if no message arrives within 5000 milliseconds
    5000 -> IO.puts("5 seconds elapsed :: #{Time.utc_now()}")
  end
  IO.puts("Outside receive block")
end

Process.info(self(), :message_queue_len)
{:message_queue_len, 0}

receive_fn.()
starting :: 14:30:20.142000
5 seconds elapsed :: 14:30:25.165000
Outside receive block
:ok
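
The waiting state described earlier can also be observed from another process. The sketch below spawns a process that blocks in a receive block and then asks the VM for its status; the 100 ms pause is an arbitrary value chosen to let the process reach the receive block first.

```elixir
# A process that blocks in receive until it is told to stop.
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# Give the new process time to reach its receive block.
Process.sleep(100)

status = Process.info(pid, :status)
IO.inspect(status)

# Wake the process up so that it can finish.
send(pid, :stop)
```

While the process sits in the receive block with an empty mailbox, the reported status is :waiting.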

The fact that the save queue keeps all the unmatched messages for retrying later can be utilised to process messages based on a criterion such as priority, instead of relying only on the arrival order. This involves tagging each message with a term indicating its priority and having multiple receive blocks, one per priority level, ordered from high to low. This way, the first receive block waits for the provided time to process a high priority message, and any low priority messages that arrive in the meantime accumulate in the save queue. Once the first receive block processes a high priority message, or if no high priority message arrives within the time given in the after block, the process moves on to the next receive block, which processes the low priority messages that were set aside.

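receive_fn = fn ->
  # First pass: only take messages tagged with an integer priority above 7.
  # is_integer/1 is needed because atoms compare as greater than any number.
  receive do
    {priority, _msg_data} when is_integer(priority) and priority > 7 -> IO.puts("high priority")
  after
    0 -> IO.puts("No high priority msgs. Moving on to low priority")
  end

  # Second pass: take the oldest remaining tagged message, whatever its priority.
  receive do
    {_priority, _msg_data} -> IO.puts("low priority")
  after
    0 -> IO.puts("No msgs")
  end
end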

send(self(), {4, :low})
send(self(), {5, :low})
send(self(), {9, :high})

Process.info(self(), :messages)
{:messages, [{4, :low}, {5, :low}, {9, :high}]}

receive_fn.()
high priority
low priority

Process.info(self(), :messages)
{:messages, [{5, :low}]}

receive_fn.()
No high priority msgs. Moving on to low priority
low priority

receive_fn.()
No high priority msgs. Moving on to low priority
No msgs

Synchronous message passing

Whenever a message is sent to another process, the message sending is asynchronous by default. The sender process will not have any idea about the execution happening in the receiver’s process. But in most of the cases, the sender process would require the result of the receiver process’s execution before moving on to execute other expressions.

defmodule Receiver do
  def receive() do
    receive do
      {:sum, a, b} -> a + b
      _ -> IO.puts("Invalid operation")
    end
  end
end

defmodule Sender do
  def sum(receiver_pid, a, b) do
    result = send(receiver_pid, {:sum, a, b})
    IO.inspect(result, label: "Result")
  end
end

receiver_pid = spawn(Receiver, :receive, [])

Sender.sum(receiver_pid, 1, 2)
Result: {:sum, 1, 2} # prints the return value of the send/2 call
# returned immediately instead of the executed result

Synchronous messaging can be simulated by using the send and receive constructs. Whenever a message is sent to a receiver process, the PID of the sender can be attached to it. On the receiver’s end, the message can be pattern matched to obtain the sender’s PID, and the rest of the message data can be used to determine what operation to perform. Once the respective code block is executed and the result is obtained, the result can be sent back to the sender with the same send/2 function, using the sender’s PID received as part of the message. On the sender’s end, the expression right after sending the message can in turn be a receive block that waits for the result sent back by the receiver process. This way the sender sends a message and waits for the result before moving on to the next expression, thus simulating synchronous message passing.

defmodule Receiver do
  def receive() do
    receive do
      {sender_pid, :sum, {a, b}} -> send(sender_pid, {:result, a + b})
      _ -> IO.puts("Invalid operation")
    end
  end
end

defmodule Sender do
  def sum(receiver_pid, a, b) do
    send(receiver_pid, {self(), :sum, {a, b}})
    receive do
      {:result, res} -> IO.inspect(res, label: "Result")
      _ -> IO.puts("Invalid message")
    end
  end
end

receiver_pid = spawn(Receiver, :receive, [])

Sender.sum(receiver_pid, 1, 2)
Result: 3
3

It is a common practice to use tuples as messages, with an atom tag for pattern matching, the sender’s PID when a reply is expected, and the message data as the remaining elements.
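
One refinement that is often added to this pattern, and which is not part of the example above, is to tag each request with a unique reference created by make_ref/0 so that the sender only accepts the reply to its own request. The module names RefReceiver and RefSender below are hypothetical and exist only for this sketch.

```elixir
defmodule RefReceiver do
  # Replies to a single request, echoing back the reference it was given.
  def run do
    receive do
      {sender_pid, ref, :sum, {a, b}} -> send(sender_pid, {ref, :result, a + b})
    end
  end
end

defmodule RefSender do
  def sum(receiver_pid, a, b) do
    # A reference that is unique to this request.
    ref = make_ref()
    send(receiver_pid, {self(), ref, :sum, {a, b}})

    # The pin operator (^) only matches a reply carrying this exact reference.
    receive do
      {^ref, :result, res} -> res
    after
      5000 -> {:error, :timeout}
    end
  end
end

receiver_pid = spawn(RefReceiver, :run, [])
result = RefSender.sum(receiver_pid, 1, 2)
IO.inspect(result, label: "Result")
```

With the reference in place, an unrelated or late message sitting in the sender’s mailbox cannot be mistaken for the reply.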

Long running processes

In our code above, once the receiver process receives a message, computes the result and sends it back to the sender PID, the execution will get out of the receive block and since there is no other code to execute, the receiver process will ultimately terminate. If we have to send another message to the receiver process, then it has to be spawned again every time.

defmodule Receiver do
  def receive() do
    receive do
      {sender_pid, :sum, {a, b}} -> send(sender_pid, {:result, a + b})
      _ -> IO.puts("Invalid operation")
    end
  end
end

defmodule Sender do
  def sum(receiver_pid, a, b) do
    send(receiver_pid, {self(), :sum, {a, b}})
    receive do
      {:result, res} -> IO.inspect(res, label: "Result")
      _ -> IO.puts("Invalid message")
    after
      5000 -> "No result"
    end
  end
end

receiver_pid = spawn(Receiver, :receive, [])

Sender.sum(receiver_pid, 1, 2)
Result: 3
3

Process.alive?(receiver_pid)
false

Sender.sum(receiver_pid, 1, 2)
"No result"

In order to start a process and keep it alive, we can make use of recursion. After the end of the receive block, we can call the same executed function again so that the execution will keep going into the receive block over and over again due to recursion, keeping the process alive and continuously processing messages one by one. Thus the process will stay alive since it will never run out of code to execute.

defmodule Receiver do
  def loop() do
    receive do
      {sender_pid, :sum, {a, b}} -> send(sender_pid, {:result, a + b})
      _ -> IO.puts("Invalid operation")
    end
    loop() # recursion to keep process alive
  end
end

defmodule Sender do
  def sum(receiver_pid, a, b) do
    send(receiver_pid, {self(), :sum, {a, b}})
    receive do
      {:result, res} -> IO.inspect(res, label: "Result")
      _ -> IO.puts("Invalid message")
    after
      5000 -> "No result"
    end
  end
end

receiver_pid = spawn(Receiver, :loop, [])

Sender.sum(receiver_pid, 1, 2)
Result: 3
3

Process.alive?(receiver_pid)
true

Sender.sum(receiver_pid, 1, 2)
Result: 3
3
Sender.sum(receiver_pid, 1, 4)
Result: 5
5
Sender.sum(receiver_pid, 1, 6)
Result: 7
7

Managing state in long running processes

So far we have seen how to create a long running process by keeping it alive using recursion. But the process did not have any state to maintain. It took data from the messages it received, computed a result and sent it back to the sender PID. We can use the same recursion technique and pass in an initial state to the function being run by the receiver, update the state based on the messages received and pass the updated state as an argument again to the same function call and recurse. The updated state will be available within the function’s context which can then be accessed inside the receive block to continuously maintain and update state. This is a commonly used technique to run long running state machines. Let us create a simple key-value store using this technique.

defmodule KeyValueStore do
  def start(init_data \\ %{}), do: loop(init_data)

  def get(key, store_pid) do
    send(store_pid, {:get, self(), key})
    receive do
      {:result, val} -> val
    after
      5000 -> {:error, "Timed out"}
    end
  end
  
  def put(key, val, store_pid) do
    send(store_pid, {:put, {key, val}})
    :ok
  end
  
  def delete(key, store_pid) do 
    send(store_pid, {:delete, key})
    :ok
  end

  defp loop(state) do
    # Each clause returns the state to carry into the next iteration.
    new_state =
      receive do
        {:get, sender_pid, key} ->
          send(sender_pid, {:result, state[key]})
          state

        {:put, {key, val}} -> Map.put(state, key, val)
        {:delete, key} -> Map.delete(state, key)
        _ -> state
      end

    loop(new_state)
  end
end
---------------------------------------------------------------------------
key_val_pid = spawn(KeyValueStore, :start, [])

KeyValueStore.put(:one, 1, key_val_pid)
:ok
KeyValueStore.put(:two, 2, key_val_pid)
:ok
KeyValueStore.get(:one, key_val_pid)
1
KeyValueStore.delete(:one, key_val_pid)
:ok
KeyValueStore.get(:one, key_val_pid)
nil

As you can see in the code above, we recurse into the function loop with the updated state to simulate a long-running state machine. The functions get, put and delete are called by the sender processes and are commonly called the client API. The function loop, entered through start, runs inside the receiver process to query, update and maintain state and to perform other operations based on the messages received; it is commonly called the server API. It is a common practice for server processes to contain both public client API and private server API functions within a single module.
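
One weakness of the get function above is that it accepts any {:result, val} message in the caller’s mailbox, including a late reply to an earlier call that already timed out. A common remedy is to tag each request with a unique reference created by make_ref/0 and to match the reply on that reference. The following is a minimal sketch of the idea; the module name TaggedStore, the argument order and the message shapes are assumptions made for illustration.

```elixir
defmodule TaggedStore do
  # Hypothetical variant of the key-value store that tags replies.
  def start(init_data \\ %{}), do: spawn(fn -> loop(init_data) end)

  def put(store_pid, key, val) do
    send(store_pid, {:put, key, val})
    :ok
  end

  def get(store_pid, key) do
    ref = make_ref()
    send(store_pid, {:get, self(), ref, key})

    receive do
      # ^ref pins the reference: only the reply to this request matches.
      {:result, ^ref, val} -> val
    after
      5000 -> {:error, "Timed out"}
    end
  end

  defp loop(state) do
    new_state =
      receive do
        {:get, sender_pid, ref, key} ->
          send(sender_pid, {:result, ref, state[key]})
          state

        {:put, key, val} ->
          Map.put(state, key, val)

        _other ->
          state
      end

    loop(new_state)
  end
end

store = TaggedStore.start()
TaggedStore.put(store, :one, 1)
IO.inspect(TaggedStore.get(store, :one), label: "one") # one: 1
```

A stale {:result, old_ref, val} message stays in the mailbox instead of being mistaken for the current reply.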

Registering processes

So far, to send a message to a process, we have been using its PID in the send/2 function call. This can be inconvenient, since the client processes need to explicitly know and store the PID of a server process. Moreover, if the server process is terminated and restarted, it will be associated with a new PID, which then has to be updated in all the client processes that use the server process. To overcome this difficulty, a process can be registered with a unique name, and the registered name can be used instead of the PID to discover and communicate with the process.

The Process.register/2 function takes the PID as the first argument and the name, an atom, as the second, and creates a mapping between the two. It raises an ArgumentError if the PID is not valid or is not associated with a running process, if the name is already registered to another process, or if the process is already registered under a name. The name must also not be nil, false, true or :undefined. To change the registered name of a process, the existing registration must first be removed with Process.unregister(registered_name) before registering the process under the new name. To look up the PID mapped to a registered name, use Process.whereis(registered_name).

current_pid = self()
#PID<0.109.0>

Process.register(current_pid, :shell)
true

Process.whereis(:shell)
#PID<0.109.0>

send(:shell, :msg)
:msg

Process.info(current_pid, :messages)
{:messages, [:msg]}
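
The failure and renaming rules described above can be observed directly. The sketch below registers a helper process under the made-up names :worker_a and :worker_b and rescues the ArgumentError raised by the second registration; the names and the helper process are purely illustrative.

```elixir
# A helper process that just waits, so it stays alive while we register it.
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

true = Process.register(pid, :worker_a)

# Registering an already registered process under a second name fails.
second_name =
  try do
    Process.register(pid, :worker_b)
  rescue
    ArgumentError -> :already_registered
  end

# Renaming works once the old registration has been removed.
true = Process.unregister(:worker_a)
true = Process.register(pid, :worker_b)

IO.inspect(second_name)                       # :already_registered
IO.inspect(Process.whereis(:worker_a))        # nil
IO.inspect(Process.whereis(:worker_b) == pid) # true
```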

Let’s modify the KeyValueStore module by registering the spawned process and using the registered name instead of the PID.

defmodule KeyValueStore do
  def start(init_data \\ %{}) do 
    if !Process.whereis(__MODULE__) do
      server_pid = spawn(fn -> loop(init_data) end)
      Process.register(server_pid, __MODULE__)
    end
    :ok
  end

  def get(key) do
    send(__MODULE__, {:get, self(), key})
    receive do
      {:result, val} -> val
    after
      5000 -> {:error, "Timed out"}
    end
  end
  
  def put(key, val) do
    send(__MODULE__, {:put, {key, val}})
    :ok
  end
  
  def delete(key) do 
    send(__MODULE__, {:delete, key})
    :ok
  end

  # Private: only the spawned server process runs the loop.
  defp loop(state) do
    new_state =
      receive do
        {:get, sender_pid, key} ->
          send(sender_pid, {:result, state[key]})
          state

        {:put, {key, val}} -> Map.put(state, key, val)
        {:delete, key} -> Map.delete(state, key)
        _ -> state
      end

    loop(new_state)
  end
end
---------------------------------------------------------------------------
KeyValueStore.start()
:ok
KeyValueStore.put(:one, 1)
:ok
KeyValueStore.put(:two, 2)
:ok
KeyValueStore.get(:one)
1
KeyValueStore.delete(:one)
:ok
KeyValueStore.get(:one)
nil

The process spawning logic, which the client previously performed explicitly to obtain the server PID, is now abstracted away into the server module. It is a common practice to use the name of the server module as the registered name for the spawned process. The clients need not worry about any of the internal details and can use the server process through the exposed client API functions. We also check whether a process is already running, using the Process.whereis function, before trying to spawn a process and register it with the module name.

Process linking

In the sections above we have seen that a process terminates once it finishes executing all of its code. A process also terminates when a runtime error occurs during its execution. Runtime errors can be controlled and prevented only to a certain extent, since many external factors or dependencies may cause them. One of the main principles of Elixir’s concurrency model regarding errors is “let it crash”. Instead of writing defensive code that anticipates and handles every possible error, Elixir’s concurrency model focuses on recovery and healing after a crash.

In a concurrency model where thousands of processes are running together every process will depend on multiple other processes and they all function together by communicating via messages. Hence, if one process terminates due to an error, it will eventually affect multiple processes depending on it. Let’s look at an example with the synchronous messaging simulation.

defmodule Server do
  def start() do
    if !Process.whereis(__MODULE__) do
      server_pid = spawn(fn -> loop() end)
      Process.register(server_pid, __MODULE__)
    end
    :ok
  end

  def divide(a, b) do
    send(__MODULE__, {:divide, self(), {a, b}})
    receive do
      {:result, res} -> res
      _ -> IO.puts("Invalid message")
    end
  end
  
  defp loop() do
    receive do
      {:divide, sender_pid, {a, b}} -> send(sender_pid, {:result, div(a, b)}) 
      _ -> IO.puts("Invalid message")
    end
    loop()
  end
end
----------------------------------------------------------------------------
Server.start()
:ok
Server.divide(3,1)
3
Server.divide(4,2)
2
Server.divide(8, 0)
18:33:28.024 [error] Process #PID<0.150.0> raised an exception
** (ArithmeticError) bad argument in arithmetic expression
    :erlang.div(8, 0)
    iex:54: Server.loop/0

In the above code, we have created a long-running server process that computes the integer division of two numbers and returns the result. The client API function divide/2 simulates synchronous messaging by waiting for a reply after sending the request to the server. The first two calls work fine, but in the third call the divisor is zero, so an ArithmeticError is raised in the server process. This terminates the server process, but the client process waiting for the result has no clue about the error and keeps waiting indefinitely. We only see the error in the shell because the console is shared by all processes. In real-world applications, the client process would have no indication of the failure. Even if you add an after block to leave the receive block on a timeout, the next call to the client API will raise an error in the client process, because the registered name no longer refers to a live process.
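
Without links, the most a client can do is defend itself. A minimal sketch of such a defensive call is shown below: it checks that the name is registered before sending and gives up after a timeout. The module name SafeDivider, the 500 ms default timeout and the error tuples are assumptions made for illustration.

```elixir
defmodule SafeDivider do
  # Hypothetical server/client pair modelled on the Server module above.
  def start do
    pid = spawn(fn -> loop() end)
    Process.register(pid, __MODULE__)
    :ok
  end

  def divide(a, b, timeout \\ 500) do
    case Process.whereis(__MODULE__) do
      nil ->
        {:error, :server_down}

      pid ->
        send(pid, {:divide, self(), {a, b}})

        receive do
          {:result, res} -> {:ok, res}
        after
          timeout -> {:error, :timeout}
        end
    end
  end

  defp loop do
    receive do
      {:divide, sender_pid, {a, b}} -> send(sender_pid, {:result, div(a, b)})
      _other -> :ignored
    end

    loop()
  end
end

SafeDivider.start()
IO.inspect(SafeDivider.divide(8, 2)) # {:ok, 4}
IO.inspect(SafeDivider.divide(8, 0)) # {:error, :timeout}, the server crashed
IO.inspect(SafeDivider.divide(8, 2)) # {:error, :server_down}
```

The client learns that something went wrong only after the timeout, and it never learns why. Closing that gap requires the client to be told about the termination itself.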

To avoid the above scenario, dependent processes need a way to know that a process has terminated. This is where process linking comes in. Process linking lets you link two processes bidirectionally, so that when one of them terminates, all of its linked processes are sent an exit signal. A link can be created with the function Process.link(pid), which links the calling process with the process identified by the pid argument. Alternatively, you can use spawn_link/1 and spawn_link/3, which spawn a process and link it to the calling process in one step.

The exit signal sent to the linked processes on termination carries the reason for termination. For a process that finishes executing its code and terminates normally, the reason is :normal; for a process that terminates abnormally, the reason describes what went wrong. When the exit signal reaches a linked process and the reason is anything other than :normal, that process is terminated as well and in turn sends an exit signal to all of its own linked processes.

defmodule Server do
  def start do
    server_pid = spawn_link(fn -> loop() end)
    Process.register(server_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :err -> raise "error"
    end
    loop()
  end
end

defmodule Client do
  def start do
    client_pid = spawn(fn -> loop() end)
    Process.register(client_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :spawn_link -> Server.start()
      :err -> raise "error"
    end
    loop()
  end
end
---------------------------------------------------------------------------
Client.start()

client_pid = Process.whereis(Client)
Process.alive?(client_pid)
true

send(Client, :spawn_link)
server_pid = Process.whereis(Server)
Process.alive?(server_pid)
true

send(Client, :err)

Process.alive?(server_pid)
false
Process.alive?(client_pid)
false
---------------------------------------------------------------------------
Client.start()

client_pid = Process.whereis(Client)
Process.alive?(client_pid)
true

send(Client, :spawn_link)
server_pid = Process.whereis(Server)
Process.alive?(server_pid)
true

send(Server, :err)

Process.alive?(server_pid)
false
Process.alive?(client_pid)
false

In the above code, we are initially creating a client process that will in turn create and link with another process called the server process. The goal is to see what happens when abnormal termination happens in either of the linked processes. In the first scenario the client process is abnormally terminated by making it raise an error on receiving the :err message. The Process.alive? function call returns false for both the client and server processes indicating that the abnormal termination of the client process has in turn sent an exit signal to its linked server process and has terminated it as well. Similarly in the second scenario, abnormal termination of the server process is simulated which in turn has also propagated and terminated the linked client process, thus proving a bidirectional link between processes.

defmodule Server do
  def start do
    server_pid = spawn_link(fn -> loop() end)
    Process.register(server_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :msg -> IO.puts("received")
    end
    IO.puts("Exited normally")
  end
end

defmodule Client do
  def start do
    client_pid = spawn(fn -> loop() end)
    Process.register(client_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :spawn_link -> Server.start()
    end
    loop()
  end
end
---------------------------------------------------------------------------
Client.start()

client_pid = Process.whereis(Client)

send(Client, :spawn_link)
server_pid = Process.whereis(Server)
Process.alive?(server_pid)
true

send(Server, :msg)
received
Exited normally

Process.alive?(server_pid)
false
Process.alive?(client_pid)
true

In the above code, we are letting the server process terminate normally and checking if the linked Client process is still alive, thus proving that exit signals with :normal reason will not terminate the linked processes.

Trapping exits

Process linking makes sure that when a process terminates, an exit signal with exit reason is sent to all of the linked processes. When a process terminates abnormally, the exit signal will propagate and terminate all of its linked processes. This is not ideal for building resilient systems that self heal. In order to handle this, you can trap exits for a process. Trapping exits for a process makes sure that any exit signal that reaches the particular process is converted into a message and added to the process’s mailbox. This gives the linked processes the ability to view details about a linked process’s termination such as the exact PID of the process that terminated and the exact reason for termination. Even if it is an exit signal with abnormal reason, the signal will not terminate the linked process if it traps exits. This gives you more control on how to process a linked process’s termination.

Exits can be trapped for a process by setting its :trap_exit flag to true. Any flag value for a process can be updated using the Process.flag/2 and Process.flag/3 functions. The trapped exit message has the format {:EXIT, terminated_process_pid, reason}, which can be pattern matched in the receive block to take the required action, such as restarting the terminated process, logging termination info or communicating with other processes.
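
Stripped to its essentials, trapping an exit looks roughly like the sketch below. It runs inside a throwaway process so that the flags of the calling process stay untouched; the exit reason :boom and the {:trapped, reason} report message are made up for illustration.

```elixir
parent = self()

spawn(fn ->
  # From here on, exit signals arrive in this process's mailbox as messages.
  Process.flag(:trap_exit, true)

  # The linked child exits abnormally with the made-up reason :boom.
  child = spawn_link(fn -> exit(:boom) end)

  receive do
    {:EXIT, ^child, reason} -> send(parent, {:trapped, reason})
  end
end)

trapped =
  receive do
    {:trapped, reason} -> reason
  after
    1000 -> :nothing_trapped
  end

IO.inspect(trapped, label: "Trapped exit reason") # Trapped exit reason: :boom
```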

defmodule Server do
  def start do
    server_pid = spawn_link(fn -> loop() end)
    Process.register(server_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :err -> raise "error"
    end
    loop()
  end
end

defmodule Client do
  def start do
    client_pid = spawn(fn -> loop() end)
    Process.register(client_pid, __MODULE__)
  end
  
  defp loop do
    receive do
      :spawn_link -> Server.start()
      :trap_exits -> Process.flag(:trap_exit, true)
      {:EXIT, _pid, _reason} = exit_msg -> 
        IO.inspect(exit_msg, label: "Exit message")
        Server.start() 
    end
    loop()
  end
end
---------------------------------------------------------------------------
Client.start()

client_pid = Process.whereis(Client)

send(Client, :trap_exits)
Process.info(client_pid, :trap_exit)
{:trap_exit, true}

send(Client, :spawn_link)
server_pid = Process.whereis(Server)
#PID<0.110.0>
Process.alive?(server_pid)
true

send(Server, :err)
Exit message: {:EXIT, #PID<0.110.0>,
 {%RuntimeError{message: "error"},
  [
    {Server, :loop, 0,
     [file: ~c"iex", line: 49, error_info: %{module: Exception}]}
  ]}}

Process.whereis(Server)
#PID<0.111.0>

In the above code, once we start a Client process, we are changing its :trap_exit flag’s value to true. Then we are creating a Server process and linking both the processes. We are adding a pattern to the Client process’s receive block to match the exit signal message and to restart the Server process once it terminates. When we terminate the Server process abnormally, the exit signal sent to its linked process, Client, is converted into a message and is added to the mailbox. We are printing the exit signal message and restarting the Server process in this case. You can verify this by looking at the last function call Process.whereis(Server) that returns a new PID, indicating that the Server has been restarted after its termination as a new process by the Client process.

On the other hand, if the Client process terminates abnormally, the Server process will die as well, because we have not changed its :trap_exit flag. Whether a process should trap exits depends on its role. For example, consider a parent process that has spawned and linked to two other processes to perform additional tasks. These two child processes exist solely to serve the parent process, and the only messages they receive are from the parent process. If the parent process terminates, there is no need for the child processes to stay alive. Hence exits should not be trapped in the two child processes, which ensures that no abandoned processes are left behind. When the parent process is restarted, two new child processes can be created along with it.

Conversely, if one of the child processes terminates due to an error, the parent process need not be terminated. Instead, the parent can trap exits and simply restart the child process. This forms the foundation of self-healing and resilient systems in Elixir. More complex supervision and management techniques can be built by combining these options.
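
The parent and child arrangement described above could be sketched as follows. This is only an illustration under stated assumptions: the module name MiniParent, the child names :a and :b, the :crash message and the {:child_started, name, pid} notification are all hypothetical, and the notification exists only so that the example can observe restarts.

```elixir
defmodule MiniParent do
  # Hypothetical parent: traps exits and restarts any child that dies.
  # `notify` is a PID that is told about every child start and restart.
  def start(notify) do
    spawn(fn ->
      Process.flag(:trap_exit, true)
      children = Map.new([:a, :b], fn name -> {start_child(name, notify), name} end)
      loop(children, notify)
    end)
  end

  defp start_child(name, notify) do
    pid = spawn_link(fn -> child_loop() end)
    send(notify, {:child_started, name, pid})
    pid
  end

  # Children do not trap exits, so they die together with the parent.
  defp child_loop do
    receive do
      :crash -> exit(:crashed)
      _other -> child_loop()
    end
  end

  defp loop(children, notify) do
    receive do
      # A known child died: start a replacement under the same name.
      {:EXIT, pid, _reason} when is_map_key(children, pid) ->
        {name, rest} = Map.pop(children, pid)
        loop(Map.put(rest, start_child(name, notify), name), notify)

      _other ->
        loop(children, notify)
    end
  end
end

# Waits for the next start notification of child :a.
await_a = fn ->
  receive do
    {:child_started, :a, pid} -> pid
  after
    1000 -> nil
  end
end

_parent = MiniParent.start(self())
first_a = await_a.()
send(first_a, :crash)
second_a = await_a.()

IO.inspect(is_pid(second_a) and second_a != first_a, label: "child :a restarted")
```

Only the parent traps exits, so a crashing child is replaced while a crashing parent still takes both children down with it.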

Processes can also voluntarily terminate themselves by using the exit/1 function. It takes in a reason term that will be part of the exit signal sent to all its linked processes. The Process.exit/2 function can also be used to send an exit signal from the caller process to another process via PID, without termination of the caller process.
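
As a rough illustration of Process.exit/2, the sketch below sends an exit signal to a process that traps exits and reports what it received. The reason :please_stop and the :ready and {:worker_got, from, reason} messages are made up for the example.

```elixir
parent = self()

# Hypothetical worker that traps exits and reports what it receives.
worker =
  spawn(fn ->
    Process.flag(:trap_exit, true)
    send(parent, :ready)

    receive do
      {:EXIT, from, reason} -> send(parent, {:worker_got, from, reason})
    end
  end)

# Wait until the flag is set, then send an exit signal to the worker.
receive do
  :ready -> :ok
after
  1000 -> :timeout
end

Process.exit(worker, :please_stop)

report =
  receive do
    {:worker_got, from, reason} -> {from == parent, reason}
  after
    1000 -> :no_report
  end

IO.inspect(report)                 # {true, :please_stop}
IO.inspect(Process.alive?(parent)) # true, the caller is unaffected
```

Because the worker traps exits, the signal arrives as an {:EXIT, from, reason} message whose from element is the PID of the caller. The one reason that cannot be trapped is :kill, which terminates the target unconditionally.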

Monitoring processes

In addition to process linking, which is bidirectional, Elixir also offers monitoring, a unidirectional way of observing another process’s termination. The process that initiates monitoring does not receive an exit signal when the monitored process terminates. Instead, it receives a message in the format {:DOWN, uniq_reference, :process, PID, reason}. Hence, if you do not want linked crashes and just want to observe other processes, the spawn_monitor/1, spawn_monitor/3 and Process.monitor/1 functions can be used.

spawn_monitor(fn -> IO.puts("monitored process") end)
monitored process
{#PID<0.175.0>, #Reference<0.28121606.2759065601.221218>}

Process.info(self(), :messages)
{:messages, [{:DOWN, #Reference<0.28121606.2759065601.221218>, :process,
               #PID<0.175.0>, :normal}]}
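
In practice the :DOWN message is usually consumed in a receive block rather than inspected through Process.info/2. The sketch below monitors an already running process with Process.monitor/1 and matches on the returned reference; the worker, the :crash message and the reason :boom are made up for illustration.

```elixir
# Hypothetical worker that exits abnormally when told to.
worker =
  spawn(fn ->
    receive do
      :crash -> exit(:boom)
    end
  end)

ref = Process.monitor(worker)
send(worker, :crash)

down =
  receive do
    # Pinning ref and worker ensures only this monitor's message matches.
    {:DOWN, ^ref, :process, ^worker, reason} -> reason
  after
    1000 -> :no_down_message
  end

IO.inspect(down, label: "Worker went down with reason")               # :boom
IO.inspect(Process.alive?(self()), label: "Monitoring process alive") # true
```

Unlike with a link, the abnormal exit of the worker does not affect the monitoring process.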