Pass the Remote: User Input on TV Devices

by Andrew Eichacker

The Netflix TV team works with device manufacturers to explore new input methods (like your phone!) and improve the screens we watch our favorite shows on. Beyond that, we’re testing the boundaries for content discovery and playback while bringing Netflix to more users around the world.

We’ve come a long way from the television dial. From simple remotes to dedicated tablets to waving hands to saying “Hi TV!”, there are a variety of ways that users interact with their TV today. For the sake of this post, we’ll keep it to the two primary input methods: standard and pointer remotes.

Up Up Down Down Left Right Left Right…

We use the acronym LRUD to describe input via directional controls — that is: Left, Right, Up, and Down. Of course, there is also a selection button (e.g. OK) and usually a Back button. Users navigate the screen via UI elements that can be focused via a directional key, and subsequent key events are handled with that element as the target. For web developers familiar with accessibility requirements, this might sound similar to tab order, with the additional dimension of directional navigation.

Unlike tab order, there is no default handling by the platform. Navigational order is defined by UI developers, as if every element had to specify a tabindex. While it might seem simple to navigate to another element in one direction, it can be challenging to maintain order in the midst of dynamic UI layouts and A/B tests. One simple approach is to use delegation to let the common parent control the flow of navigation.
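To make the delegation idea concrete, here is a minimal TypeScript sketch in which a parent container decides where focus goes among its children. The names (RowContainer, handleKey, getFocusedKey) are hypothetical, and the single-row layout is an assumption made for brevity.

```typescript
type Direction = "left" | "right" | "up" | "down";

// Hypothetical container that owns navigation among a single row of children.
class RowContainer {
  private readonly childKeys: string[];
  private index = 0;

  constructor(childKeys: string[]) {
    this.childKeys = childKeys;
  }

  // Key of the child that currently holds focus.
  getFocusedKey(): string | undefined {
    return this.childKeys[this.index];
  }

  // Returns true if the key press was handled here. Returning false lets
  // the event bubble up so this container's own parent can decide.
  handleKey(direction: Direction): boolean {
    const step = direction === "left" ? -1 : direction === "right" ? 1 : 0;
    const next = this.index + step;
    if (step === 0 || next < 0 || next >= this.childKeys.length) {
      return false;
    }
    this.index = next;
    return true;
  }
}

const row = new RowContainer(["play", "rate", "related"]);
row.handleKey("right"); // focus moves from "play" to "rate"
```

Because the children never reference each other, adding or removing one (say, for a test cell) only changes the list handed to the parent.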

Working with a device that only supports LRUD is a bit different from working on a web or mobile app. Maintaining the correct focus for a single element is essential, as the user doesn’t have direct interaction like they do with a mouse or with touch. If focus gets in a bad state, there is no way for the user to recover by clicking or tapping around, so the user will think the app has frozen.

They Say It’s Rude to Point

Some TVs support a pointer remote, which allows the user to point the remote at their TV to interact with what’s on screen. Pointer navigation should be familiar to most web developers, as it is very similar to mouse/touch input. The UX tends to parallel mouse usage more than touch, such as using a scroll wheel or arrow affordances to navigate lists instead of swiping. Due to the distance to the TV and the difficulty of holding the pointer still while your arm is in the air, buttons require large targets, much as in a touch-based UI.

TV pointer remotes aren’t just mice, however; they also have LRUD buttons. This has the potential to make things rather confusing for the user. If the user focuses one element with the pointer and then presses Up, which element should end up focused: the one above, or the one under the pointer?

As a result, most devices have introduced modality to the remote: when using the pointer, only the pointer behavior is respected. If the user starts interacting with LRUD, the pointer is hidden and the interaction switches to LRUD only.

There are a couple of things we need to do to support and reinforce this modality. When in pointer mode, focus could be lost by pointing over a non-interactive element, so we must establish a reasonable focus when switching to LRUD mode — either the last-focused element or some screen default. Since pointer scroll affordances only make sense for pointer mode, we hide those in LRUD mode.
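The mode switch described above can be sketched as a small state transition. This is only an illustration under assumptions: ModeState, onPointerMove, onLrudKey, and DEFAULT_FOCUS_PATH are hypothetical names, and the default path is made up for the example.

```typescript
type InputMode = "pointer" | "lrud";

interface ModeState {
  mode: InputMode;
  focusPath: string | null;       // currently focused element, if any
  lastFocusPath: string | null;   // last element that held focus
  showScrollAffordances: boolean; // pointer-only scroll arrows
}

// Hypothetical screen default, used when nothing was focused before.
const DEFAULT_FOCUS_PATH = "/app/menu/play";

// Pointer moved: focus whatever is under it. Over a non-interactive
// area hoveredPath is null, so focus is lost but the last focus is kept.
function onPointerMove(state: ModeState, hoveredPath: string | null): ModeState {
  return {
    mode: "pointer",
    focusPath: hoveredPath,
    lastFocusPath: hoveredPath ?? state.lastFocusPath,
    showScrollAffordances: true,
  };
}

// LRUD key pressed: leave pointer mode and make sure something is focused,
// preferring the current focus, then the last focus, then the default.
function onLrudKey(state: ModeState): ModeState {
  const focusPath = state.focusPath ?? state.lastFocusPath ?? DEFAULT_FOCUS_PATH;
  return {
    mode: "lrud",
    focusPath,
    lastFocusPath: focusPath,
    showScrollAffordances: false,
  };
}
```

Keeping the transition in one place means every screen re-establishes focus the same way when the user picks up the LRUD buttons.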

Building a Better Mouse (and LRUD) Trap

Over time, we found that handling input was rather cumbersome. Many of our views had custom-built focus handling that broke when the composition of the screen changed, such as with features introduced by A/B tests. The UX differed slightly from screen to screen for things like re-establishing focus when switching to LRUD mode or LRUD navigation in asymmetrical layouts. We had a number of bugs that we ran into repeatedly on multiple views, such as:

  1. Multiple items try to focus at once, causing both to appear highlighted but with undefined behavior regarding which one will get key events
  2. Something behind the top-most view steals focus, causing navigation in an unseen area
  3. Switching between pointer/LRUD modes causes dual focus or no focus
  4. A focused element is removed, and nothing claims focus — leaving nothing focused

When we set out to build our app with React, we wanted to craft a more robust solution for user input. We landed on three core systems: declarative focus, spatial navigation, and focus requests.

Declarative Focus

While moving to React, we tried to rethink a lot of our controls that were historically based on imperative APIs and see how we could design them to be more declarative. This also allows us to avoid the usage of refs to imperatively focus.

Instead of divs, our core building block is a Widget. To differentiate between any widget and one that can receive focus, we created FocusableWidget, which takes two additional props:

  • focused: a boolean indicating whether this widget should have focus
  • focusKey: used to identify the FocusableWidget and construct a focusPath

FocusableWidgets can be nested, giving a structure to how a FocusableWidget relates to others. A focusPath is just a way to signify a path from the root FocusableWidget all the way down to a leaf, e.g. ‘/app/menu/play’. These will come into play more with the other two systems.
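As a small illustration of how nested focusKeys could combine into focusPaths, here is a TypeScript sketch. FocusableNode, toFocusPath, and leafFocusPaths are hypothetical names, and the tree below is an assumed example rather than a real screen.

```typescript
// Hypothetical description of nested FocusableWidgets by their focusKey.
interface FocusableNode {
  focusKey: string;
  children?: FocusableNode[];
}

// Joins focusKeys from the root down into a path such as "/app/menu/play".
function toFocusPath(keys: string[]): string {
  return "/" + keys.join("/");
}

// Collects the focusPath of every leaf below (and including) the given node.
function leafFocusPaths(node: FocusableNode, ancestors: string[] = []): string[] {
  const keys = ancestors.concat(node.focusKey);
  const children = node.children ?? [];
  if (children.length === 0) {
    return [toFocusPath(keys)];
  }
  const paths: string[] = [];
  for (const child of children) {
    paths.push(...leafFocusPaths(child, keys));
  }
  return paths;
}

const tree: FocusableNode = {
  focusKey: "app",
  children: [
    { focusKey: "menu", children: [{ focusKey: "play" }, { focusKey: "rate" }] },
    { focusKey: "related" },
  ],
};

console.log(leafFocusPaths(tree));
```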

Since focus is declared as part of rendering, we can validate and apply the declared focus once all elements have finished rendering. This gives us an opportunity to assert on error conditions (e.g. multiple focused widgets, nothing focused).
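A post-render check along these lines might look like the sketch below. It assumes the render pass has collected one record per leaf FocusableWidget; DeclaredFocus and validateDeclaredFocus are hypothetical names for illustration.

```typescript
// One record per leaf FocusableWidget seen during a render pass (assumed shape).
interface DeclaredFocus {
  focusPath: string;
  focused: boolean;
}

// Returns the single focused path, or throws with a message that tells
// the developer which error condition was hit.
function validateDeclaredFocus(widgets: DeclaredFocus[]): string {
  const focusedPaths = widgets.filter((w) => w.focused).map((w) => w.focusPath);
  if (focusedPaths.length === 0) {
    throw new Error("Nothing is focused");
  }
  if (focusedPaths.length > 1) {
    throw new Error("Multiple focused widgets: " + focusedPaths.join(", "));
  }
  return focusedPaths[0];
}
```

Running the check once per render, rather than inside each component, is what makes it possible to report both error conditions with the full list of offending paths.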

Spatial Navigation

Spatial navigation is intended to make it easier to determine what should be focused by an LRUD event without having to write custom navigation code. The idea is to identify the most-likely desired element in the direction of the key press.

The primary concerns with this approach were:

  • Performance (a recurring theme for us): how are we going to look through all of these elements quickly?
  • Correctness: how do we ensure the correct element is focused, even in cases where the closest element is not the correct target? We made sure spatial navigation can be interrupted so that custom handling can be implemented when necessary, but we’d prefer to avoid doing that all over the place.

Focusable Tree

Part of the answer for both of these is the focusable tree, which is a structure of FocusableWidgets culled from the widget tree.

For performance, this limits the number of elements to only those that could influence the end result. Not by much, in this example, but a full UI has far more Widgets than FocusableWidgets.

For correctness, this gives us a way to influence navigation structurally instead of just spatially. If we always want the rate button above to be focused first, for example, we can make the menu a FocusableWidget container. Moving left from related would then focus menu instead of play, which can then drive down focus to its children as it sees fit.
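To show what culling could look like, here is a TypeScript sketch that keeps only focusable nodes and hoists focusable descendants past plain widgets. WidgetNode, FocusableTreeNode, and cullFocusable are hypothetical names, and the widget tree is an assumed example in which menu is a focusable container.

```typescript
// Assumed widget shape: only FocusableWidgets carry a focusKey.
interface WidgetNode {
  name: string;
  focusKey?: string;
  children: WidgetNode[];
}

interface FocusableTreeNode {
  focusKey: string;
  children: FocusableTreeNode[];
}

// Returns the focusable nodes found at or below the given widget.
// Plain widgets are dropped and their focusable descendants move up a level.
function cullFocusable(widget: WidgetNode): FocusableTreeNode[] {
  const below: FocusableTreeNode[] = [];
  for (const child of widget.children) {
    below.push(...cullFocusable(child));
  }
  if (widget.focusKey === undefined) {
    return below;
  }
  return [{ focusKey: widget.focusKey, children: below }];
}

const leaf = (name: string, focusKey?: string): WidgetNode => ({ name, focusKey, children: [] });

const widgetTree: WidgetNode = {
  name: "App",
  focusKey: "app",
  children: [
    {
      name: "Layout", // plain widget, not focusable
      children: [
        { name: "Menu", focusKey: "menu", children: [leaf("Play", "play"), leaf("Icon"), leaf("Rate", "rate")] },
        leaf("Title"),
        leaf("Related", "related"),
      ],
    },
  ],
};

const focusableTree = cullFocusable(widgetTree);
```

In the resulting tree, related has menu as its only sibling, so a search among siblings never sees play or rate directly.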

Nearest Neighbor

The nearest neighbor algorithm itself was also tuned for performance, and was inspired by collision detection algorithms in game programming.

  1. Provide the current element and its focusable siblings from the focusable tree
  2. Filter out elements that don’t lie within a frustum extending in the direction of the keypress
  3. Determine the Minkowski difference box between each element and the focused element
  4. Find the shortest vector from the origin to each box
  5. Select the element with the shortest vector as the new focus
  6. If nothing is found, we repeat the algorithm recursively with the current element’s parent
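Steps 1 through 5 can be sketched in TypeScript as follows. This is an approximation under stated assumptions, not the production algorithm: elements are axis-aligned rectangles, the frustum is a 45-degree cone measured between element centers, and all names (Rect, Candidate, inFrustum, minkowskiDistance, nearestNeighbor) are hypothetical. Step 6 is left to the caller, which would retry with the parent when null is returned.

```typescript
type Direction = "left" | "right" | "up" | "down";

interface Rect { x: number; y: number; width: number; height: number; }
interface Candidate { focusKey: string; rect: Rect; }

// Step 2 (assumed frustum shape): the candidate's center must lie in the
// key direction, within a 45-degree cone from the focused element's center.
function inFrustum(from: Rect, to: Rect, dir: Direction): boolean {
  const dx = to.x + to.width / 2 - (from.x + from.width / 2);
  const dy = to.y + to.height / 2 - (from.y + from.height / 2);
  const along = dir === "left" ? -dx : dir === "right" ? dx : dir === "up" ? -dy : dy;
  const across = dir === "left" || dir === "right" ? Math.abs(dy) : Math.abs(dx);
  return along > 0 && across <= along;
}

// Steps 3 and 4: build the Minkowski difference box (other minus focused)
// and return the length of the shortest vector from the origin to it.
// The result is 0 when the two rectangles overlap.
function minkowskiDistance(focused: Rect, other: Rect): number {
  const minX = other.x - (focused.x + focused.width);
  const maxX = other.x + other.width - focused.x;
  const minY = other.y - (focused.y + focused.height);
  const maxY = other.y + other.height - focused.y;
  const px = Math.max(minX, Math.min(0, maxX)); // origin clamped to the box
  const py = Math.max(minY, Math.min(0, maxY));
  return Math.sqrt(px * px + py * py);
}

// Steps 1 and 5: search the siblings and keep the one with the shortest vector.
function nearestNeighbor(focused: Candidate, siblings: Candidate[], dir: Direction): Candidate | null {
  let best: Candidate | null = null;
  let bestDistance = Infinity;
  for (const sibling of siblings) {
    if (sibling.focusKey === focused.focusKey) continue;
    if (!inFrustum(focused.rect, sibling.rect, dir)) continue;
    const distance = minkowskiDistance(focused.rect, sibling.rect);
    if (distance < bestDistance) {
      best = sibling;
      bestDistance = distance;
    }
  }
  return best; // null means: repeat one level up with the parent
}
```

Working with the difference box means each candidate costs a handful of subtractions and one clamp, with no per-edge or per-corner comparisons.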

Focus Requests

Spatial navigation exists only to find the right element to focus after a key press. Pointer input needs no such algorithm: we can just use the FocusableWidget the pointer is over. We also save the last focused element so that focus can be restored when switching to LRUD mode, making the LRUD/pointer switch a breeze. In all cases, once we have a target element, we can emit a focus request.

The focus request carries the target’s focusPath. It is handled by the root of the application, which saves the path as part of our application state. This kicks off a new top-down render, communicating the focusPath downward to designate the path to the component that should receive focus.

We use a Higher Order Component to convert the path into helpful props (like focused and entered) so that components can modify their visual styles as necessary, and ultimately assign focus to the proper FocusableWidget.
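The flow from focus request to props can be sketched without React. The names here (AppState, handleFocusRequest, focusPropsFor) are hypothetical, and the meaning of entered is an assumption: it is taken to be true when focus is on the component or anywhere beneath it.

```typescript
interface AppState {
  focusPath: string;
}

interface FocusProps {
  focused: boolean; // this component is the focus target
  entered: boolean; // assumed meaning: focus is on or below this component
}

// Root-level handler: saves the requested path as application state.
// Returning the same object when nothing changed avoids a needless render.
function handleFocusRequest(state: AppState, requestedPath: string): AppState {
  return requestedPath === state.focusPath ? state : { ...state, focusPath: requestedPath };
}

// What a higher-order component could compute for the component at widgetPath.
function focusPropsFor(widgetPath: string, focusPath: string): FocusProps {
  const focused = widgetPath === focusPath;
  // The trailing slash keeps "/app/men" from matching "/app/menu/play".
  const entered = focused || focusPath.indexOf(widgetPath + "/") === 0;
  return { focused, entered };
}

const before: AppState = { focusPath: "/app/menu/play" };
const after = handleFocusRequest(before, "/app/menu/rate");
const menuProps = focusPropsFor("/app/menu", after.focusPath);
```

Since every component derives its props from the same path, two components can never both conclude that they hold focus.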

Impact

With these systems working together, UI developers could compose dynamic views without building custom navigation logic. Less customization means a more consistent UX and a single place to fix problems, avoiding regressions.

Allowing components to utilize a single source of truth (the focusPath) avoids issues where individual components try to focus or relinquish focus out of turn. Centralizing the assignment of focus enables validation to find bugs early and provide clear messaging to the developer.

From Prototype to Product

We built and tested these systems with a simple UI in an odd layout and a handful of different focusable tree configurations. It was pretty amazing to see LRUD and pointer working perfectly together without a single line of code customizing the navigation. We use it today on millions of TV devices ranging from streaming sticks to high-end game consoles.

Does this spark your interest, or do you have a better idea for handling TV input? Join us, and help us make the Netflix experience even better!