Software Robots & You

Nicolas Ouporov
8 min read · Nov 22, 2022


Before the advent of buttons, menus, drag-and-drop, tabs, sliders, and dropdowns — all of the modern UI/UX innovations that shape our digital lives—the dominant mode of controlling computers was with language.

It is easy to look back on this period, before Steve Jobs’ chance visit to Xerox PARC and the massive wave of graphical computing that followed, as the awkward adolescent stage that preceded modern GUIs. That view couldn’t be farther from the truth.

On the contrary, this time produced some of the most enlightened innovations in human-computer interaction. Just have a look at these demos: Classic HCI.

In 1970, MIT grad student Terry Winograd released SHRDLU, a program that carried out conversations in a virtual “blocks world”: moving items at the user’s request and explaining the position of objects when asked.

An illustration of SHRDLU

And in 1980, four years before the release of the first Macintosh and three decades before Siri, Richard Bolt and Chris Schmandt of MIT’s Architecture Machine Group (the precursor to the Media Lab) demoed Put That There, an interactive map that placed elements based on voice input and hand gestures: Schmandt saying “Put that boat there” and pointing with his finger.

Schmandt & “Put That There”

Yet today, the dream of software as a perfect collaborator, responding seamlessly to voice, text, and gestural controls, seems to have been lost somewhere between the 1970s and now.

Half a century later, the tools we use on a daily basis — the Google Calendars, Salesforces, Figmas, and Photoshops of the digital world — simply do not offer comprehensive natural language controls or native multimodal support.

In contrast to SHRDLU and “Put That There,” these modern tools, in all their glory, seem to take their interfaces for granted.

At the same time, we have to deal with a digital world that is constantly growing in complexity. As so much of our lives has moved online and the amount of content on the internet has exploded, we spend more and more of our time just searching, bookmarking, and cataloging content.

And even learning how to use our software tools is becoming increasingly challenging.

In the case of photo editing, tools like Photoshop have had to trade away simplicity as they increased the sophistication and number of their features. As a result, beginners turn to online tutorials, classes, and books just to get started on the platform. Or they give up. In fact, this problem is so widespread that one-third of internet users watch online tutorials each week.

Beyond their complex workflows and high barriers to entry, many of these applications are also siloed in “walled gardens” that don’t communicate with each other, restricting the amount of useful information about a user that the software has access to.

Just take a look at all the buttons, panels, sliders, inputs, and important web information on this screen.

Like many others, I am a firm believer that “if you can say it, you can do it.” Or, at least in the digital world, you should be able to. To me, until machines can understand our intentions and the barrier to entry for new tools is negligible, there is still work to be done.

Fortunately, there is a technology on the horizon that shows promise in simplifying our online lives.

From “Visualizing A Neural Machine Translation Model” by Jay Alammar

Enter large language models, LLMs for short. These billion-parameter models are trained on internet-scale language data, enabling them to accurately predict the next word, or series of words, given an input sequence.
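To make that prediction loop concrete, here is a toy sketch in TypeScript. The `nextTokenDistribution` function and its tiny bigram table are inventions for illustration only; a real LLM learns these probabilities across billions of parameters, but the generation loop has the same basic shape.

```typescript
// Toy sketch of the next-token loop behind LLM text generation.
// The bigram table is purely illustrative; a real model learns its
// probabilities from internet-scale text rather than a hand-written map.
type TokenProbs = Record<string, number>;

const bigrams: Record<string, TokenProbs> = {
  "put": { "that": 0.9, "there": 0.1 },
  "that": { "boat": 0.6, "there": 0.4 },
  "boat": { "there": 1.0 },
};

function nextTokenDistribution(context: string[]): TokenProbs {
  const last = context[context.length - 1];
  return bigrams[last] ?? { "<end>": 1.0 };
}

// Greedy decoding: repeatedly append the single most likely next token.
// (Real systems often sample instead, trading determinism for variety.)
function generate(prompt: string[], maxTokens: number): string[] {
  const tokens = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    const probs = nextTokenDistribution(tokens);
    const best = Object.entries(probs).sort((a, b) => b[1] - a[1])[0][0];
    if (best === "<end>") break;
    tokens.push(best);
  }
  return tokens;
}

console.log(generate(["put"], 5).join(" ")); // "put that boat there"
```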

It turns out that these models demonstrate some pretty interesting emergent properties, like reasoning and artistry. They can generate poetry, come up with your next startup name, and write your emails for you. Researchers at Google are now using them to control robots. And by encoding games as sequences of text, these models can even play decent chess.

The most compelling thing about these large language models is not just that they are smart (when prompted correctly), but that they understand conversation. And conversation is one of the strongest tools humans wield.

For fun, imagine an interface where you can talk to an AI agent and ask it questions about the world like “What instrument should I learn next?” and it can combine reasoning with your past conversations about your love of jazz to craft the perfect response: “You should learn the sax.”
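A rough sketch of how such an agent might fold remembered context into a prompt. The `Memory` shape, the prompt format, and the `AskLLM` signature are all assumptions for illustration, not any particular product’s API.

```typescript
// Any text-completion model can be plugged in as `ask`.
type AskLLM = (prompt: string) => Promise<string>;

interface Memory {
  facts: string[]; // distilled from past conversations, e.g. "loves jazz"
}

async function answerWithMemory(ask: AskLLM, question: string, memory: Memory): Promise<string> {
  // Fold what the agent remembers about the user into the prompt, so the model
  // can ground its answer in that context instead of answering generically.
  const prompt = [
    "You are a helpful assistant. What you know about the user:",
    ...memory.facts.map((f) => `- ${f}`),
    `User question: ${question}`,
    "Answer:",
  ].join("\n");
  return ask(prompt);
}

// Usage: answerWithMemory(ask, "What instrument should I learn next?", { facts: ["loves jazz"] })
// could come back with "You should learn the sax."
```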

Just consider the potential of this new approach. In a few years, wouldn’t it be nice to be able to speak with a chatbot like this, one that can offer sensible advice?

The truth is I was being deceptive. Google already made this. And it was so convincing that one of their engineers thought it was sentient.

Google’s LaMDA: Language Model for Dialogue Applications

If these successes are any indication, language might be the universal interface we are looking for. These large language models then become the common-sense engines that will power our software tools.

All signs point to a need to revisit language as a dominant mode for controlling software.

Right now, most workflows, like AI text or image generation, rely on a single prompt that determines the output. But the real opportunity lies in expanding this process into a free-flowing conversation, a partnership, between a human user and an LLM-based collaborator.

Using these LLMs, we can convert lengthy software processes into conversations between a helpful AI agent and the user. Say a user wants to add a calendar UI component to their design, and the AI guides them through the available design decisions.

We could also have the agent carry out menial and routine tasks on our behalf. Like collecting examples of website landing pages that feature a rotating globe. In theory, the AI could ask the user whether they want a realistic or stylized globe and refine the results it returns.

The key is that the user asks the AI agent to generate an idea, a sequence of actions, or an asset (like a logo), and then responds with their thoughts. As this back-and-forth goes on, the user gets a solution better suited to their needs and the AI learns more about the user’s preferences.
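Here is a minimal sketch of that loop, with an assumed `Agent` interface standing in for any LLM-backed collaborator; the transcript format is my own simplification.

```typescript
interface Turn {
  role: "user" | "agent";
  content: string;
}

interface Agent {
  // Given the whole conversation so far, return the next proposal
  // (an idea, a sequence of actions, or an asset description).
  propose(transcript: Turn[]): Promise<string>;
}

async function collaborate(
  agent: Agent,
  initialRequest: string,
  getFeedback: (proposal: string) => Promise<string | null>, // null = user is satisfied
): Promise<string> {
  const transcript: Turn[] = [{ role: "user", content: initialRequest }];
  let proposal = await agent.propose(transcript);

  while (true) {
    transcript.push({ role: "agent", content: proposal });
    const feedback = await getFeedback(proposal);
    if (feedback === null) return proposal; // good enough: stop iterating
    // Each round of feedback teaches the agent more about the user's preferences.
    transcript.push({ role: "user", content: feedback });
    proposal = await agent.propose(transcript);
  }
}
```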

In this paradigm, where you control complex software by having a conversation, a radical new world opens up. Finally, we have a model for technology that doesn’t just understand our needs, but can translate that understanding into a concrete set of actions on our behalf.

This is a paradigm where we work more creatively and collaboratively with our software. More human.

Soon every profession and process we carry out online will come with its own AI assistant-collaborator. GitHub’s Copilot and Replit’s Ghostwriter make it clear that AI assistants are incredibly useful for programming, and once you get used to them, the experience is nothing short of magical.

After using these tools, it’s hard not to envision a future with:

  • An AI copilot for filmmaking, that can color-grade for you or suggest edits. (Runway ML is close to this already)
  • An AI copilot for composing music, that suggests new chords and can find obscure hip-hop samples for you.
  • An AI copilot for lawyers, that can tell you relevant legal cases and analyze the probable outcomes of cases using knowledge of legal precedents.
  • An AI copilot for designers, that can understand your design preferences and source inspiration automatically or generate many possible versions of an existing webpage mockup.

In these examples, the AI serves as a welcome assistant that you can offload work to, one that understands your preferences and can do valuable work for you on the side.

Given that these types of AI tools can understand the digital content around them and then take action, let’s call them software robots. Intelligent software robots.
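One plausible shape for such a robot, sketched below with an assumed action schema: an LLM parses the natural-language instruction into structured data, and a thin executor applies it to the application. None of these types belong to a real library; they are placeholders for illustration.

```typescript
// Hypothetical action schema: the kinds of things a software robot might do.
type Action =
  | { kind: "insert"; element: string; x: number; y: number }
  | { kind: "recolor"; target: string; color: string }
  | { kind: "collect"; query: string };

// LLM-backed step: free-form text in, one structured action out.
type ParseInstruction = (instruction: string) => Promise<Action>;

async function runRobot(
  parse: ParseInstruction,
  execute: (action: Action) => Promise<void>, // app-specific: performs the action
  instruction: string,
): Promise<void> {
  const action = await parse(instruction);
  // Keeping the parsed action as structured data makes it easy to show the user
  // what the robot is about to do, and to refuse actions it isn't confident about.
  await execute(action);
}

// e.g. runRobot(parse, execute, "change the hero button to brand blue")
```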

Companies like Adept have already made great progress in creating robots that demonstrate general intelligence and reasoning capabilities in the digital world, and they have produced some of the coolest demos in the process. However, I am inclined to believe that the most useful applications of intelligent software robots will be for specific use cases and professions: places where there is a certain level of domain knowledge, jargon, and expertise that the AI can be trained on.

Robot Chess. Generated by AI using Lexica.art

I believe that the next major advancements in computing will be a direct result of teaching software robots to understand us and to take actions fluently, safely, and accurately on our behalf. And the best applications of these advancements are all the places where automation is most needed.

To me, one of the most compelling places to apply intelligent software robots is in design. Design tools are notoriously complex and have historically been splintered between professional and consumer use. The users of these tools also spend much of their time on tedious tasks that detract from the design process:

  • Searching for inspiration online
  • Seeing how other people worked with the same brief
  • Repeating tasks like copying, moving items, and changing colors
  • Searching for tutorials on advanced features
  • Managing complex UI kits
  • Aligning existing designs to the company “style”
  • Creating variations of existing designs
  • Making designs look believable by inserting text and stock images

Many of these problems get even more pronounced for entry-level users.

As such, I am most interested in creating AI tools for designers, marketers, and creatives: the people who live and breathe ambiguity, hate switching contexts between apps and workflows, and value their time and productivity to the nth degree.

It is also one of the best environments for stress-testing this technology. A good AI assistant for design should understand the world via text, voice, and vision, and then convert all of that information into a useful end result for the user. And because design is, by nature, a process filled with ambiguity, an AI tool can be useful even if it doesn’t always have a perfect solution.

My core belief is that an AI collaborator will enable creatives to be more productive, reduce context switching between apps, and spend more time in the flow of being creative and less time on menial tasks.

Ultimately, I hope that this technology will make complex software tools approachable, supercharge understanding, and democratize access to design.

Graph Adapted from Silvio Savarese’s “The Age of Conversational AI”

That’s the dream behind Newt. So far, we are building a tool in Figma that allows object insertion and manipulation using just text, offers AI-generated text and image creation, and supports both basic shapes and custom elements from a user’s UI kit. Using the tool, even at this early stage, feels like a superpower. And this is just the beginning of a larger system that seamlessly integrates multiple creative apps and workflows.
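For a sense of the mechanics, here is a rough sketch (not Newt’s actual code) of what the Figma side of such a plugin can look like. The `interpret` function is a hard-coded stand-in for the LLM step that would really parse the user’s command; the plugin calls themselves (`createRectangle`, `resize`, `fills`, `appendChild`) come from Figma’s standard plugin API.

```typescript
// Figma plugin main-thread code; assumes the standard plugin typings.
interface InsertShape {
  shape: "rectangle" | "ellipse";
  width: number;
  height: number;
  color: { r: number; g: number; b: number };
}

// Stand-in for the LLM step: map a phrase to a structured insertion request.
function interpret(command: string): InsertShape | null {
  if (command.includes("red circle")) {
    return { shape: "ellipse", width: 100, height: 100, color: { r: 1, g: 0, b: 0 } };
  }
  if (command.includes("blue card")) {
    return { shape: "rectangle", width: 320, height: 180, color: { r: 0.2, g: 0.4, b: 1 } };
  }
  return null;
}

figma.showUI(__html__); // the plugin UI collects the user's text command

figma.ui.onmessage = (msg: { command: string }) => {
  const request = interpret(msg.command);
  if (!request) return;

  const node = request.shape === "ellipse" ? figma.createEllipse() : figma.createRectangle();
  node.resize(request.width, request.height);
  node.fills = [{ type: "SOLID", color: request.color }];
  figma.currentPage.appendChild(node);
};
```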

We are dedicated to building intelligent software robots — starting out with design and eagerly expanding across disciplines.

If this vision of the future appeals to you, reach out.

Let’s build.

-Nic
