Optimizing Permission Lookups in Freehand

Pan Thomakos
InVision Engineering
Jun 2, 2021 · 8 min read

Freehand is InVision’s online collaborative whiteboard. It enables your entire team (designers, engineers, product managers) to brainstorm, wireframe, run meetings, and more — all in real-time.

As you might imagine, building a real-time collaborative web experience requires a lot of speedy communication between the application running in your browser and the server in the cloud that coordinates everyone’s simultaneous changes. There’s a whole slew of interesting problems related to managing change conflicts that I’m going to save for another day. Today, I want to talk about two other parts of this architecture: buffered changes and permissions. Before we dive into the specific change, let’s cover some background to give you a better sense of our architecture.

The Background

Each Freehand is represented by a single binary protocol buffer encoded file stored in S3. Each protocol buffer is a collection of entities (such as lines, shapes, and text areas) and their positions and properties (color, size, etc.) that make up the collaborative Freehand file at a point in time.

An active Freehand is one with at least one user connected to it. While a Freehand is active, its protocol buffer is decoded and kept in memory on a single Freehand Go server for as long as any users are actively working on the file. It is also persisted to S3 every ten seconds.

The in-memory representation of a Freehand is called a HUB. To support virtually unlimited horizontal scaling, we want to avoid connecting every one of a possible hundred or thousand users accessing a single Freehand to a single HUB on a single server. Our solution is to define two types of HUBs: primaries and secondaries. There is only ever one primary HUB per Freehand, while there can be any number of secondary HUBs. Each user is pinned via a websocket connection to one HUB, and whether that HUB is primary or secondary, the user’s experience is identical.

Primary HUBs are the only ones with an in-memory representation of the Freehand. They receive incoming changes from all of the secondary HUBs, resolve conflicts, apply changes, and then publish the canonical version of each entity on the whiteboard canvas to all of the secondary HUBs — and subsequently to all of the users connected to those HUBs. As you can imagine, primary HUBs do a lot more work, so we want them evenly distributed among our servers in order to share the load equally.

Secondary HUBs do not have an in-memory representation of a Freehand. They collect messages and pass them along to the primary HUB, and they broadcast messages back to their connected users. The collection of messages that the secondary hub passes to the primary HUB is where some of the buffering and permissions checks come into play. If we can process and combine messages on secondary HUBs, that relieves load and pressure on primary HUBs, and also improves performance and scalability.

Buffering… Please Wait

Consider the case of one user dragging an object across the screen, and how that change is communicated to another user connected to the same file.

In this scenario, we need to be careful about how often we send changes to the server. If we issued a new message to the server for every mouse-move event, we would generate a ton of unnecessary traffic. Likewise, if every incoming message to a secondary HUB were propagated immediately to the primary HUB, we would need to apply and then publish an equivalently massive number of messages to all connected clients.

There are two buffers that mitigate this load. One is on the client side — we don’t send every mouse move event to the server. Instead, we buffer changes and only send the final state of every entity that has changed per tick of a timer to the server. This ticker is configurable, so let’s say it triggers every 50ms. The other buffer is on the server HUB. Each HUB buffers incoming messages on a similar timer so that it doesn’t overwhelm the primary HUB with unnecessary messages.
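As a rough sketch of this idea (the `Buffer` and `EntityState` names and fields here are illustrative, not our actual types), the buffering boils down to keeping only the latest state per changed entity and draining everything on each tick of the timer:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// EntityState is the last-known state of one entity (illustrative fields).
type EntityState struct {
	ID   string
	X, Y float64
}

// Buffer keeps only the final state per entity between ticks.
type Buffer struct {
	mu      sync.Mutex
	pending map[string]EntityState
}

func NewBuffer() *Buffer {
	return &Buffer{pending: map[string]EntityState{}}
}

// Record overwrites any earlier state for the same entity, so only
// the newest position survives until the next flush.
func (b *Buffer) Record(s EntityState) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[s.ID] = s
}

// Flush drains the buffer and returns one message per changed entity.
func (b *Buffer) Flush() []EntityState {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]EntityState, 0, len(b.pending))
	for _, s := range b.pending {
		out = append(out, s)
	}
	b.pending = map[string]EntityState{}
	return out
}

func main() {
	buf := NewBuffer()
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()

	// Simulate many mouse-move events for one entity within one tick.
	for i := 0; i < 100; i++ {
		buf.Record(EntityState{ID: "box-A", X: float64(i), Y: float64(i)})
	}
	<-ticker.C
	fmt.Println(buf.Flush()) // one message, carrying only the final position
}
```

The same shape works on both sides of the wire: the client flushes to the server on its tick, and each HUB flushes upstream on its own.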

I want to dig into the server buffer and explain how it works. I’d also like to explain a recent change we made to improve performance and simplify our architecture. Before I do that, it’s worth mentioning that the clients can avoid some choppiness in animation by interpolating a path between updates and then animating that change. This is entirely configurable, but it does enable us to create a smoother experience without having to send all mouse move events from every user to all other users.

Finally, aside from buffering frequent move events, the server buffer also modulates throughput in another important way. HUBs communicate with one another over Redis pub/sub.

This enables a secondary HUB to communicate directly with the primary HUB, and enables any HUB to broadcast a message to all other HUBs (for example, when a primary has a new state for an entity). In order to avoid excessive I/O, we batch changes that we communicate through the Redis pub/sub system.

Let’s look at this from the perspective of a secondary HUB: you are secondary HUB A and you receive four changes from connected users within a 50ms interval:

  • Andrea moved box A to position x1, y1 at T0.
  • Philipe changed the color of box A from green to blue at T1.
  • Daniel moved box A to position x2, y2 at T2.
  • Andrea moved box A to position x3, y3 at T3.

Let’s assume that the desired outcome of these four operations (all of which occurred within 50ms of each other) is that we move box A to position x3, y3 and set its color to blue. Our clients have already buffered changes, so there may be many intermediate positions they never sent us. The duplicate position change from Andrea could come from multiple connected sessions, or we may have experienced a delay in receiving or processing messages. It doesn’t really matter — we have two messages from her that we need to process.

In our server buffers, we combine all of these changes into a single message that updates the state of box A. That message is batched together with all other updates that happened in that 50ms interval and published once to the primary HUB via Redis pub/sub. When that change is propagated to the primary HUB, the primary only needs to process one message.
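To make the merge concrete, here is a minimal sketch (the `Change` and `Merged` shapes are illustrative stand-ins for our real message types) that folds that 50ms window into one update for box A, with later fields winning:

```go
package main

import "fmt"

// Change is one buffered operation on an entity (illustrative shape).
type Change struct {
	User     string
	EntityID string
	X, Y     *float64 // set when the change moves the entity
	Color    *string  // set when the change recolors it
}

func f(v float64) *float64 { return &v }
func s(v string) *string   { return &v }

// Merged is the combined state the secondary HUB sends upstream.
type Merged struct {
	EntityID string
	X, Y     float64
	Color    string
}

// MergeWindow folds changes in arrival order: later fields win, and
// fields untouched by later changes are preserved.
func MergeWindow(entityID string, changes []Change) Merged {
	m := Merged{EntityID: entityID}
	for _, c := range changes {
		if c.EntityID != entityID {
			continue
		}
		if c.X != nil {
			m.X, m.Y = *c.X, *c.Y
		}
		if c.Color != nil {
			m.Color = *c.Color
		}
	}
	return m
}

func main() {
	window := []Change{
		{User: "Andrea", EntityID: "box-A", X: f(1), Y: f(1)},  // T0
		{User: "Philipe", EntityID: "box-A", Color: s("blue")}, // T1
		{User: "Daniel", EntityID: "box-A", X: f(2), Y: f(2)},  // T2
		{User: "Andrea", EntityID: "box-A", X: f(3), Y: f(3)},  // T3
	}
	// Box A ends at x3, y3 with color blue — one message instead of four.
	fmt.Println(MergeWindow("box-A", window))
}
```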

Permissions — or are you allowed to do that?

But there’s a problem here. What if, in our previous scenario, Andrea did not have permission to move box A? Well, then the expected outcome of these four operations would be to move box A to position x2, y2 (Daniel’s final change) and set its color to blue.

In the past, in order to accommodate this possibility, we made all of our secondary HUBs permission-aware. In other words, they needed to be able to verify the validity of every operation before applying it to their internal buffers. So how expensive is it to make secondary HUBs permission-aware? That depends, of course, on the complexity of the permission system!

We have two broad permission levels to account for. The first is file level access such as: user X has access to edit all of Freehand Y. The second is entity level access such as: user X only has access to manipulate entities she has created on Freehand Y. File level access is much easier to validate, while entity level access is more complex. If I can move only my own entities, I need to inspect each message I receive and validate that I created the entity that it modifies before permitting the operation.
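Roughly, the two checks look like this (a hedged sketch — `FileRole`, its values, and the creator map are stand-ins for our real permission model): the cheap file-level check runs first, and the per-entity creator lookup only happens when the role requires it.

```go
package main

import "fmt"

// FileRole is the file-level permission (illustrative values).
type FileRole int

const (
	NoAccess        FileRole = iota
	EditOwnEntities          // may only modify entities they created
	EditAll                  // may modify anything on the Freehand
)

// CanModify runs the cheap file-level check first, and falls back to
// the per-entity creator lookup only when the role requires it.
func CanModify(role FileRole, userID, entityID string, creators map[string]string) bool {
	switch role {
	case EditAll:
		return true
	case EditOwnEntities:
		return creators[entityID] == userID
	default:
		return false
	}
}

func main() {
	creators := map[string]string{"box-A": "daniel"}
	fmt.Println(CanModify(EditAll, "andrea", "box-A", creators))         // true
	fmt.Println(CanModify(EditOwnEntities, "andrea", "box-A", creators)) // false
	fmt.Println(CanModify(EditOwnEntities, "daniel", "box-A", creators)) // true
}
```

The asymmetry is visible in the signature: the file-level check needs only the role, while the entity-level check needs a creator entry for every entity being touched.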

The data backing the first of these permission checks is also relatively easy to store on secondary HUBs and refresh periodically. It is bounded by the number of users connected to the HUB. The second set of permission checks is more difficult to store. It requires a creator lookup table with a key for every entity on the whiteboard. And the operation needs to be fast! Remember that my buffer is flushed every 50ms and can contain updates from multiple users.

If I store all of those entity ownership relationships in memory on every secondary HUB, I begin to border on being a primary HUB, and my memory footprint grows linearly with the size of the file. While this approach is not infeasible, the way we solved the problem in the past was by storing entity ownership information in a hash in Redis. We call this Redis entity ownership cache the EntityStore. It enabled us to do a quick lookup when that permission check was needed, without having to store the data in memory. The downside was that we now needed at least one bulk Redis lookup per set of changes before determining whether we could apply them.
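In spirit, the old flow amounted to something like the following sketch. The interface and names are illustrative; in production the lookup was a bulk read against a Redis hash keyed by entity ID, while here an in-memory map stands in for Redis so the sketch is self-contained:

```go
package main

import "fmt"

// EntityStore is the creator lookup the old secondary HUBs needed.
// In production it was backed by a Redis hash (one bulk read per
// flush); an in-memory map stands in here.
type EntityStore interface {
	// Creators returns the creator for each entity ID, in order.
	Creators(freehandID string, entityIDs []string) []string
}

type memStore struct {
	data map[string]map[string]string // freehandID -> entityID -> creator
}

func (m *memStore) Creators(freehandID string, entityIDs []string) []string {
	out := make([]string, len(entityIDs))
	for i, id := range entityIDs {
		out[i] = m.data[freehandID][id]
	}
	return out
}

// FilterAllowed marks which changes come from the entity's creator —
// the bulk permission check a secondary HUB ran before flushing.
func FilterAllowed(store EntityStore, freehandID string, users, entityIDs []string) []bool {
	creators := store.Creators(freehandID, entityIDs)
	allowed := make([]bool, len(users))
	for i := range users {
		allowed[i] = creators[i] == users[i]
	}
	return allowed
}

func main() {
	store := &memStore{data: map[string]map[string]string{
		"freehand-1": {"box-A": "daniel", "box-B": "andrea"},
	}}
	// Andrea may touch box-B (hers) but not box-A (Daniel's).
	fmt.Println(FilterAllowed(store, "freehand-1",
		[]string{"andrea", "andrea"}, []string{"box-A", "box-B"}))
}
```

Even with the round trip hidden behind an interface, the cost is plain: every 50ms flush blocks on that lookup before anything can be applied.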

An Improved Buffer

Recently, we made a change to the buffering system to avoid the Redis EntityStore entirely. The new approach also removes the need for secondary HUBs to perform most permission checks. Here’s how the new approach works.

The buffer now has an additional hash level. Instead of buffering all changes per entity, we create a new entity buffer for every user. This means that we don’t need to resolve cross-user conflicts right away, and we can send all of the buffered changes in bulk to the primary HUB. The primary HUB is the only one that needs to check permissions (which it can do because it already has all of the entity ownership data in memory by virtue of being the primary HUB).
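Sketched out (again with illustrative names rather than our real types), the extra hash level just keys the buffer by user before entity, so a secondary HUB never has to decide between two users’ conflicting edits:

```go
package main

import "fmt"

// Move is a buffered position change (illustrative).
type Move struct{ X, Y float64 }

// UserBuffer buffers changes per user, then per entity, so a secondary
// HUB never resolves conflicts between different users' edits.
type UserBuffer struct {
	pending map[string]map[string]Move // userID -> entityID -> final move
}

func NewUserBuffer() *UserBuffer {
	return &UserBuffer{pending: map[string]map[string]Move{}}
}

// Record still collapses repeated changes from the same user to the
// same entity, keeping only that user's final state.
func (b *UserBuffer) Record(userID, entityID string, m Move) {
	if b.pending[userID] == nil {
		b.pending[userID] = map[string]Move{}
	}
	b.pending[userID][entityID] = m
}

// Flush hands everything to the primary HUB, which applies each user's
// changes only after its own in-memory permission check.
func (b *UserBuffer) Flush() map[string]map[string]Move {
	out := b.pending
	b.pending = map[string]map[string]Move{}
	return out
}

func main() {
	buf := NewUserBuffer()
	buf.Record("andrea", "box-A", Move{1, 1}) // T0
	buf.Record("daniel", "box-A", Move{2, 2}) // T2
	buf.Record("andrea", "box-A", Move{3, 3}) // T3
	flushed := buf.Flush()
	// Andrea's two moves collapse to one; Daniel's stays separate, so the
	// primary can drop Andrea's entry if she lacks permission, leaving
	// Daniel's x2, y2 as the final position.
	fmt.Println(flushed)
}
```

Per-user collapsing still happens on the secondary, so the usual buffering win is preserved; only the cross-user merge moves to the primary.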

This new approach certainly has the potential to increase the buffer message size. But the increase only manifests when multiple users connected to the same HUB modify the same entity within the same buffer window; in practice, the overhead is low in the vast majority of cases. The other possible downside is that the primary HUB must now perform additional permission checks and process more messages than it did before. Again, in practice the overhead is low because these are in-memory checks, but it is a trade-off. The benefits of the new approach are clear: we avoid an entire class of Redis I/O operations and the maintenance of the EntityStore cache. The change also let us remove about 2,000 lines of Go code.

What’s Next?

Hopefully you learned a little bit about our cool Freehand product and how it is designed. We’re so excited to be leveraging this technology to shape the future of design and collaboration. If you’re interested in working with me on these awesome multiplayer and canvas technologies, we’re always hiring!

Pan Thomakos
Principal Engineer at InVision, previously at Strava, he/him