Most authoring applications like Word, Excel and PowerPoint combine three key elements: a complex data model for the artifacts that the application creates, a complex runtime for interpreting and presenting that data model, and a complex authoring experience for creating and manipulating that data model. Complexity is often viewed as a negative — we want applications to be simple, intuitive and easy to use. But historically, complexity — or to put a positive spin on it, “richness” or “power” — has been critical to the success of the Office applications. In fact, an argument can be made that the network effects and competitive moat created by the complex Office data formats have been key to the incredible longevity and robustness of Office’s competitive position over decades.
While it may be intuitively obvious that it is harder to build an easy-to-use experience for a complex data model, I would argue that the challenges are really “just math,” and it is helpful to understand the basic reasoning.
When I say one data model is “more complex” than another, I mean that there are more possible states that the data model can represent. The challenge this creates is that any editing action needs to map the data model from one state to another within this larger state space. The user is still making simple gestures but is operating in a more complex space. This inherently creates ambiguity. An example will probably help.
Consider that I want to draw a text box onto a document surface. In an application like PowerPoint, all boxes are specified using absolute coordinates relative to the top of the slide, so an interface that allows me to simply drag out a box is unambiguous. An application like FrontPage supports the full HTML data model. In HTML, that box can be positioned relative to any object in the document, it can be sized using multiple different units (including percentages, or left unsized to be automatically sized based on content), and it can be anchored relative to the left, top, bottom or right (with an arbitrary offset, again using multiple different units). If the user simply drags out a box, the application has a much harder time mapping that simple gesture into an explicit user intent. In fact, if the user is not aware of all the power, they might not even have a clear intent — or at least not one that can be cleanly mapped into a new data model state.
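To make the ambiguity concrete, here is a minimal sketch (illustrative only — these are not FrontPage internals) that enumerates a few of the many valid CSS encodings a single dragged rectangle could map to:

```python
def candidate_encodings(left_px, top_px, width_px, height_px, page_width_px):
    """Enumerate a few of the many valid CSS encodings for one dragged box.

    In PowerPoint's model only the first encoding exists; in HTML, all of
    these (and more) are legitimate interpretations of the same gesture.
    """
    pct = round(width_px / page_width_px * 100)
    right_px = page_width_px - left_px - width_px
    return [
        # 1. Absolute pixels, anchored top-left (the only option in PowerPoint).
        f"position:absolute; left:{left_px}px; top:{top_px}px; "
        f"width:{width_px}px; height:{height_px}px",
        # 2. Width as a percentage of the page, so the box reflows on resize.
        f"position:absolute; left:{left_px}px; top:{top_px}px; width:{pct}%",
        # 3. Anchored to the right edge instead of the left.
        f"position:absolute; right:{right_px}px; top:{top_px}px; width:{width_px}px",
        # 4. No explicit size at all: let the content determine the box.
        f"position:relative; left:{left_px}px; top:{top_px}px",
    ]
```

The drag gesture carries no information about which of these the user intended, which is exactly the state-space problem described above.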
This challenge of mapping a limited set of user actions into user intent and then into a specific transition of the state space of the data model is the core challenge. Whenever the data model is made more powerful, you inherently create ambiguity in interpreting user intent when they make simple gestures. That’s just math.
One characteristic of this challenge is that recursion and composition are comparatively easy to leverage in the design of both a data model and a runtime. Programmers love recursion and composition. In the authoring experience, however, recursion and composition often create problems. For example, adding the ability to nest tables is powerful, but if I copy a table row and then go to paste that copied element into another table, do I want to paste it as a top-level row of the table or do I want to create another nested table? This is the “larger state space” problem I alluded to above. I now have ambiguity about how to interpret this simple user gesture and must either force the user to disambiguate (dreaded interstitial dialogs such as Outlook’s “Should I open this instance of the meeting or the recurring meeting?”) or must try to use subtle heuristics to divine user intent (e.g. precisely where the selection was located when the paste was done).
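The nested-table paste ambiguity can be sketched with a toy recursive table model (hypothetical names, not any real application’s model). Both interpretations below are valid state transitions, and a single Paste gesture does not say which one the user meant:

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    # Each row is a list of cells; a cell may hold text or another Table.
    rows: list = field(default_factory=list)

def paste_as_sibling_row(target: Table, copied_row):
    """Interpretation 1: insert the copied row as a top-level row."""
    target.rows.append(list(copied_row))

def paste_as_nested_table(target: Table, row_idx, col_idx, copied_row):
    """Interpretation 2: wrap the copied row in a new table inside one cell."""
    target.rows[row_idx][col_idx] = Table(rows=[list(copied_row)])
```

The application must choose between these by asking the user or by guessing from context such as the precise selection location.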
A deeper problem that recursion and composition create is around selection. In a complex, composed data model, it is harder to make selections, and then harder to specify which composed elements I want to operate on within a selection. Selection is such a basic element of most object-verb interfaces that challenges there translate directly into challenges in simplifying the user experience.
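A minimal sketch of the selection problem (illustrative names only): in a composed model, a single click lands on a whole containment chain of elements, each of which is a plausible selection target.

```python
def candidate_targets(path):
    """Given the containment path under a click, list the selectable
    elements from innermost (the full path) to outermost."""
    return [path[: i + 1] for i in range(len(path) - 1, -1, -1)]

# A click on a word inside a paragraph inside a text box inside a group
# yields four nested candidate selections the interface must choose among:
# candidate_targets(["group", "text_box", "paragraph", "word"])
```

Flat models avoid this entirely: with no nesting, the containment chain has length one and a click is unambiguous.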
As a data model gets more powerful, users can often use the primitive capabilities to represent domain-specific semantics that the application does not fully understand. The application will then evolve over time to directly support and encode some of these semantics in the data model in order to provide a better authoring experience for these scenarios. This often involves adding a level of indirection (style sheets in word processors are the classic example). As every programmer knows, any problem can be solved by an additional level of indirection. Indirection is both very powerful and especially difficult to represent in the user experience. This often ends up being a slippery slope since those new capabilities now complicate the interface for all users, not just the ones whose special scenario you were trying to support.
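The style-sheet example of indirection can be sketched as follows (a toy model with hypothetical names, not any real word processor’s format). Paragraphs reference a named style rather than carrying their own formatting, so one edit to the style changes many paragraphs:

```python
# Direct formatting: every paragraph carries its own properties.
direct = [
    {"text": "Intro", "font": "Calibri", "size": 14, "bold": True},
    {"text": "Body",  "font": "Calibri", "size": 11, "bold": False},
]

# With a level of indirection, paragraphs reference a named style.
styles = {
    "Heading": {"font": "Calibri", "size": 14, "bold": True},
    "Normal":  {"font": "Calibri", "size": 11, "bold": False},
}
styled = [
    {"text": "Intro", "style": "Heading"},
    {"text": "Body",  "style": "Normal"},
]

def resolve(paragraph):
    """Resolve a styled paragraph to its effective formatting."""
    return {"text": paragraph["text"], **styles[paragraph["style"]]}

# One edit to the style now updates every paragraph that uses it --
# powerful, but the interface must now answer "am I editing this
# paragraph or the style behind it?" for every user.
styles["Heading"]["size"] = 16
```

The power and the UI difficulty come from the same place: the visible formatting is no longer stored where the user sees it.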
Attempts to simplify the interface for users who don’t have a mental model of the application’s full power often run into a related problem: we create simple ways of producing data model states that the user does not understand. Word has a very powerful “section” model and uses it for the “Insert Cover Page” feature accessible from the ribbon. That’s a very easy-to-use interface, but if you don’t understand how it works (in particular, that page numbering and page layout are bound to the section), it is easy to get into trouble where the behavior of the document becomes hard to understand and control. Even long-time Word users sometimes find themselves editing a document that behaves unexpectedly, most often because it uses a feature whose behavior they are unfamiliar with.
What’s the lesson here?
My main point is that you want to go into these design trade-offs with eyes wide open and recognize that slippery slope. If you can actually keep the data model simple (e.g. Twitter’s “140 character message”), simplification of the user experience typically follows. As you make the data model (and runtime) more powerful, you will inherently run into problems about how to map user intent. An application interface can be unnecessarily complex, but in many cases the introduced application complexity is inherent in the increase in power.
How to map user intent is the hard problem. In some cases, we can simplify by providing an entry point where the user directly expresses their intent. So a user interacting with PowerPoint’s slide thumbnail view is always operating at the full slide level (copying, pasting, rearranging). The entry point itself is used to disambiguate the user’s intent when they take some action like Paste. We can also take this approach at the app level — so Microsoft provides Office Lens as a separate application that is tuned around the user’s intent to create new OneNote pages with a camera image on them.
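The entry-point idea can be sketched as a simple dispatch (hypothetical names, not PowerPoint’s actual code): the same Paste gesture resolves to a different intent depending on where in the interface the user performed it.

```python
def interpret_paste(context, clipboard):
    """Map a single Paste gesture to an intent based on the entry point."""
    if context == "thumbnail_pane":
        # In the slide thumbnail view, the user always operates on
        # whole slides, so Paste unambiguously means "insert a slide".
        return ("insert_slide", clipboard)
    if context == "editing_surface":
        # On the slide surface, Paste means "insert content at the cursor".
        return ("insert_content", clipboard)
    raise ValueError(f"unknown entry point: {context}")
```

The gesture itself never changes; the entry point carries the disambiguating information, which is why narrowly scoped surfaces (or whole apps, like Office Lens) can stay simple over a powerful data model.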
The fundamental problem is how to map user intent. You cannot create more power in the data model without introducing matching problems in understanding what the user actually intended when they took some action.