Exploring The Apple FaceTime Bug with ReasonML State Machines

Photo by Wynand Uys on Unsplash

The Apple FaceTime bug swept the news in early 2019 — callers using FaceTime were able to hear the microphone of a new participant being added to a call, before the new participant had accepted the incoming call request. Yikes!

@DavidKPiano recently wrote a very interesting Medium article about the bug, using it to illustrate the dangers of implicit state machines in your codebase. David’s post presents a strong case for explicitly identifying your state machines — by making them a prominent part of your design and your code itself, your thinking about changes becomes more robust. I encourage you to check out his post and also his JS library xstate!

I’m going to explore the adapted example from that post from a slightly different angle — we’ll look at various ways of encoding his example as a state machine in ReasonML code, and show how this process can help us prevent invalid states from being admitted in the first place. We’ll also have a look at using Imandra to verify properties of our state machines (Imandra offers a mechanised formal semantics for OCaml/Reason code, so we can use it to reason about Reason programs in an automated way).

I’ll also add that we’re in no way making any assumptions or judgements about how Apple employees actually develop FaceTime here — this post is more to illustrate that these kind of bugs are pervasive, and the more tools we have to help eliminate them, the better.

Overall, I hope to reiterate David’s message that state machines are a very useful tool for modelling parts of your applications. Writing them explicitly and using a tool like xstate can help make it easier for you to reason about your system, but it turns out they also make it easier for computers to reason about your system too! Let’s dive in.

If you want to play with the examples at any point, open the version of this post on our docs site and hit the ‘Try this!’ buttons to get an interactive notebook where you can experiment.

A simplified model

Let’s start with some simple types represeting actions in an application for receiving incoming calls. We’ll assume that once remote users make it into the peopleInCall list, they can hear the microphone of the phone running the app:

We can receive incoming calls from a person, accept or reject an incoming call, and then end a call. So far so good — next, let’s come up with a state type for these actions to act on:

Nothing looks too out of the ordinary here — these are all fields we’re likely to need as our app runs. Now let’s write the core state update logic:

This function represents the transitions of a state machine!

Encoding this in the Reason/OCaml type system gives us a couple of advantages — the switch statement (match in OCaml syntax) can also check that we’ve exhaustively covered all the actions. Also, if we decide to wire this up to a React UI, ReasonML’s React bindings include a ReducerComponent out of the box that expects a function of this shape. React’s new Hooks API also has a similar primitive in the form of the useReducer Hook.

Changing the model

Now let’s imagine a new requirement comes in.

Handle messages from the system requesting new people be added to a call.

Let’s give that a go. We’ll add a new action AddPerson representing the system message:

We’ll even write a test for our new action:

# val test_add_person_action : test_outcome = Pass

…and all is right with the world. Or is it? If you look carefully, we’ve baked into our test an assumption about the starting status of the state when handling the AddPerson action — something that’s easily done if we’ve got our feature implementation blinkers on.

What the test misses is the AddPerson action being used in a state that isn’t CallActive. As nothing actually validates that the state is correct, we could potentially end up having a person added to peopleInCall before the call is active — similar to what happened with the real bug.

The ideal approach

Although the type system has helped us out here a bit, we can leverage it even further by writing our states in a way that makes expressing this bug impossible in the first place.

Here we’re using a variant type to represent mutually exclusive bits of state. This means we can’t have any call state unless we’re actually in an active call, and similarly it eliminates some possible uncertainty in the CallIncoming state about the person we’re receiving the call from.

We also define a type for the outcome of an update:

Previously we only returned the original state for unexpected actions (returning state for the state.callIncomingFrom == None branch of the CallAccepted action, in update_v2). We know intuitively this shouldn’t happen, but let’s use the type system to our advantage by explicitly making cases like that an invalid outcome. That way the code that’s actually running our state machine can check for invalid outcomes and log an error message to let us know that something is wired up incorrectly, which may indicate another bug somewhere.

Now let’s use these new types in our new and improved update function:

The types have actually guided us here to make our transition function more correct — we are now forced to switch on both the action coming in and the current state in order to have access to the call state that we need to update when adding a new person to the call.

Meeting halfway

While this representation is ideal for this small example, we can’t always quite pull this off. Sometimes we simply can’t express the type of constraint we want, due to the limitations of the type system. Sometimes we can, but by doing so we make the code unreadable or very unwieldy to use. Alternatively, we may have started with the original representation, but have too much code dependent on it to be able to make the change under given time constraints (although the Reason/OCaml type system would definitely help with the refactoring!). So let’s not let the ‘perfect’ be the enemy of the good — instead, let’s take a step back to the original version of our update function in order to demonstrate what it looks like to meet somewhere in the middle.

One way of looking at static type systems like this is as tools for proving properties of your code at compile time. By writing our code in a way that satisfies the rules of the type system, we’re given a certain class of guarantees about our programs — the trade-off is that we lose some expressiveness. In our ideal representation we can no longer express certain invalid states, which is what we want! But sometimes by following the rules of the type system we find it hard to express certain things that we do want.

If we’re using Reason/OCaml, we have another option — we can use Imandra, a cloud-native automated reasoning engine.

Like a type system, Imandra is also a tool that can be used to prove that properties of your program hold true, but we can express these properties themselves as snippets of Reason/OCaml code. This allows us to direct the checking process in a way that’s much more tailored to the domain of the program itself, rather than on the more generic language level.

What does this all mean? Let’s start by writing a property, or verification goal, about our earlier implementation update_v2:

This is a function that takes an arbitrary state and person, performs some actions on them, and then returns true or false to indicate whether the property holds. So here we’re saying that if there are people in the call after responding to the AddPersonmessage, we’re in the CallActive state, which is a nice general concept we’d like to be true! We are specifically testing the action that we know is broken here, but let’s accept that for now and see what happens when we ask Imandra to verify the property for us (we’re specifying a max unrolling depth for verification of 100 with the ~upto flag — we’ll come back to this later):

# verify ~upto:100 add_person_property;
- : state -> int -> bool = <fun>
1 base ground recursive instance
Counterexample (after 2 steps, 0.017s):
let (_x_0 : state) = {status = Idle; callIncomingFrom = None; peopleInCall = []}
let (_x_1 : int) = 2
[✗] Conjecture refuted.
module CX : sig val _x_0 : state val _x_1 : int end

Imandra analyses our code symbolically and we don’t just get a pass/failure result — we get concrete counterexample values for the inputs to our function that illustrate the flaw in our thinking. If we want, we can compute and experiment with them (which can be very useful for investigating bugs Imandra has found):

# List.length(CX._x_0.peopleInCall);
- : Z.t = 0

As we said, the goal above is fairly focused in on our new feature, and we’ve written it already knowing there’s a bug in the area we’re targeting— let’s work with something that’s more universal, helped by the fact that we’ve modeled things as a state machine. A helpful pattern when dealing with state machines is to check that properties hold under arbitrary sequences of actions:

This checks the same underlying idea that we never have more than 0 people in the call unless the call is active, but in a more general way — we no longer check the single specific action. The nice thing about properties like this is that we can imagine that we’d still want them to hold true as we make additional changes to our state machine, and staying decoupled from specific actions helps us achieve that.

We simulate actions run in a react-like reducer using a fold (fold and reduce of reducer fame are synonymous). We’re also passing in a target update_fn as we’re going to use it on multiple versions of update as we progress. Imandra will check that the property holds for all possible values of the type ‘list of actions’ (list(‘a) — the last parameter to our property function). Let’s try it out:

# let initial_state = { status: Idle, callIncomingFrom: None, peopleInCall: [] };
# verify ~upto:100 call_with_people_in_is_active_property(update_v2, initial_state);
- : action_v2 list -> bool = <fun>
2 base ground recursive instances
Counterexample (after 4 steps, 0.020s):
let (_x_0 : action_v2 list) = [(AddPerson (4))]
[✗] Conjecture refuted.
module CX : sig val _x_0 : action_v2 list end

The sequence of actions with a single AddPerson item already contradicts our property, which immediately shows us our issue and gives us an example to help out:

# update_v2(initial_state, AddPerson(4));
- : state = {status = Idle; calling = None; peopleInCall = [4]}

Let’s also try running the same property on our original update function, from before we added the new AddPerson action:

# verify ~upto:100 call_with_people_in_is_active_property(update, initial_state);
- : action list -> bool = <fun>
2 base ground recursive instances
Counterexample (after 7 steps, 0.026s):
let (_x_0 : action list) = [(CallIncomingFrom (4)); CallAccepted; (CallIncomingFrom (5))]
[✗] Conjecture refuted.
module CX : sig val _x_0 : action list end

This reveals another case we hadn’t considered! We don’t handle the CallIncomingFrom action from the CallActive state — it drops us straight out of CallActive back into CallIncoming while leaving people in the call, which might not be what we want:

# update(update(update(initial_state, CallIncomingFrom(4)), CallAccepted), CallIncomingFrom(5));
- : state = {status = Calling; callIncomingFrom = Some 5; peopleInCall = [4]}

Now we know there’s a problem, we can re-work our update logic to accommodate. We learnt earlier while working on our ‘ideal’ representation that checking the current state is a good idea, so let’s incorporate that here:

Next, let’s check it with our general property:

# verify ~upto:100 call_with_people_in_is_active_property(good_update, initial_state);
- : action_v2 list -> bool = <fun>
2 base ground recursive instances
[✓] Conjecture proved up to 100 steps.

Imandra’s standard unrolling verification method can’t find any issues up to our fairly high bound of 100 here. Although it hasn’t totally proved things for us, this is a good indicator that we’re on the right lines as it can’t find any counter-examples. It’s pretty hard for us to prove things completely in this case using this method, due to the nature of our property — as the list of actions is arbitrary and our state machine contains cycles, there are valid sequences of actions that are infinite, for example [CallIncomingFrom(1), CallAccepted, EndCall, CallIncomingFrom(1), CallAccepted, EndCall, …].

If we want to increase our level of confidence even further, we can spend a bit longer to get a complete proof. In this case we can try Imandra’s [@auto] method, which performs a proof by induction for all possible inputs:

# [@auto] verify(call_with_people_in_is_active_property(good_update, initial_state));
... (output truncated for brevity) ...
Error: Maximum induction depth reached (3). You can set this with #max_induct.

Examining the output (truncated here), we can see we run into a limit due to our use of fold_left. A common trick when using induction is to switch to using fold_right instead, which is easier to induct on (this also means the actions list is ‘reduced’ in reverse order, so any counter-examples Imandra finds will come out in the opposite order from before, but that doesn’t make a difference here):

# [@auto] verify(call_with_people_in_is_active_property_fold_right(good_update, initial_state));
[✓] Theorem proved.

Fully proved! For larger functions going into Imandra’s more advanced features will require more expertise, and may or may not be worth it over the guarantees that the basic unrolling method gives us. Whether the cost makes sense will depend on what you’re working on.

One other improvement we could make is to incorporate the update_outcome concept from our ideal version, and check that the property holds for valid resulting states only. This would enable us to stop passing the concrete, valid initial_state in as an argument and allow Imandra to verify that this works for all possible state values in our property arguments, as the update function would handle filtering out invalid states for us. Analogously we can ‘guard’ our states with a separate is_valid function as part of the property. If you’re interested, have a go at doing this yourself in an Imandra notebook!

We’ve given you a quick taste of some new possibilities here, but the takeaway is that state machines are a useful tool — having a formalized way of thinking about a class of system makes life easier for both you and the computer, whether that’s via pure human brain power, a type system or an automated reasoning tool like Imandra.

If you found this interesting and would like to discuss any part of it, we’d love to hear from you — come and join us in Discord. We’ll also be at ReasonConf in Vienna in April — come and talk to us there!