Actionable Large Language Models

Sivesh Sukumar
Balderton
4 min read · Dec 8, 2022

The launch of ChatGPT (alongside other models from the likes of Stability and Cohere) has really brought AI into the limelight, and it’s now safe to say that AI is being consumerised. Countless posts have hypothesised about what will be possible in the very near future with these models, and it seems like the upper bound is constantly rising.

AI can generate content but what if AI could generate actions? At Balderton, we see a future in which AI not only generates instructions for a problem but also goes on to solve it (and we don’t think it’s far away!).

The technology underpinning today’s leading LLMs, such as ChatGPT, T5 and RoBERTa, is known as the transformer. It succeeded the recurrent neural network (RNN), which produced breakthroughs in sequential analysis problems such as natural language processing and time-series analysis (i.e. anything that can be modelled as a sequence). LLMs have shown how transformers have levelled up NLP, and there’s now evidence that transformers are just as effective in other time-series problems such as trading. We’ve also seen transformers used beyond sequential analysis, for example in computer vision, by using clever techniques to convert an image into a sequence: the paper, aptly named “An Image is Worth 16x16 Words”, achieves SOTA performance with substantially lower computational resources.

Architecture of Vision Transformers — converting images into sequences
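
To make the “image as a sequence” idea concrete, here’s a minimal sketch (in plain NumPy, with an assumed 224×224 RGB input) of the patching step: the image is cut into 16×16 patches and each patch is flattened into one “word” of the sequence. Real ViT additionally applies a learned linear projection and adds a class token and position embeddings.

```python
import numpy as np

# Sketch of ViT-style patching: cut a 224x224 RGB image into 16x16
# patches and flatten each patch into one "word" of a sequence.
image = np.random.rand(224, 224, 3)   # stand-in for a real image
P = 16                                # patch size

H, W, C = image.shape
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4)    # (14, 14, 16, 16, 3): a grid of patches
sequence = patches.reshape(-1, P * P * C)     # (196, 768): 196 "words" of 768 dims

print(sequence.shape)  # (196, 768)
```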

This post explores one use case of transformers and LLMs which we’re particularly excited about.

Actionable LLMs

It’s clear that transformers and other breakthroughs in AI are great for generating content (such as text, code, images and videos), but what if AI could generate decisions and take actions based on simple plain-language prompts?

AI has previously made headlines by being very good at making decisions (mainly courtesy of DeepMind), beating world champions at complex games such as Go. The technology underpinning these breakthroughs is known as reinforcement learning (RL): a framework for building decision-making agents that learn optimal behaviour by interacting with their environment through trial and error, with rewards as their only feedback. RL has led to huge advances in a wide range of real-life, decision-based use cases such as industrial automation, healthcare, marketing and autonomous cars.

Reinforcement Learning framework
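
As a toy illustration of that loop, the sketch below uses tabular Q-learning: an agent in a five-cell corridor, knowing nothing in advance, learns through trial, error and reward that walking right (towards the goal in the last cell) is optimal. The environment and hyperparameters are invented purely for illustration.

```python
import random

# Toy RL loop: an agent in a five-cell corridor learns purely from
# trial, error and reward that walking right (towards the goal at
# cell 4) is the optimal behaviour.
N_STATES, ACTIONS = 5, (-1, +1)          # actions: step left / step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration

def greedy(state):
    """Pick the best-known action, breaking ties at random."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0  # reward is the only feedback
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print(greedy(0))  # prints 1: the agent has learned to head for the goal
```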

Decision Transformers were introduced by Facebook AI Research and Google Brain last year, applying transformers to the RL framework. In the same way “An Image is Worth 16x16 Words” abstracted an image into a sequence, Decision Transformers abstract RL into a sequence modelling problem. If you want to dig further, there’s a great Hugging Face blog post exploring this. This is just one way of building actionable models; there are plenty of other frameworks that set out to solve the same problem, such as ReAct by Google and MRKL by AI21 Labs.
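
The core trick is simple enough to show in a few lines: each trajectory is laid out as a flat sequence of (return-to-go, state, action) tokens, and a causal transformer is trained to predict each action from everything before it; at inference time, conditioning on a high return-to-go asks the model for a good trajectory rather than an average one. The toy trajectory below is made up for illustration.

```python
# Sketch of the Decision Transformer data layout: a trajectory becomes
# a flat sequence of (return-to-go, state, action) tokens, and a causal
# transformer learns to predict each action slot from the prefix.
# This trajectory is made up for illustration.
rewards = [0.0, 0.0, 1.0]
states  = ["s0", "s1", "s2"]
actions = ["right", "right", "stay"]

# Return-to-go at step t = sum of all rewards from step t onwards.
returns_to_go = [sum(rewards[t:]) for t in range(len(rewards))]  # [1.0, 1.0, 1.0]

sequence = []
for rtg, s, a in zip(returns_to_go, states, actions):
    sequence += [("R", rtg), ("S", s), ("A", a)]

# Training objective: given the sequence up to an "A" slot, predict
# the action that fills it.
print(sequence[:5])  # [('R', 1.0), ('S', 's0'), ('A', 'right'), ('R', 1.0), ('S', 's1')]
```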

ChatGPT has shown us that the next era of computing will be defined by natural language interfaces, which allow us to tell computers what we want directly. The real beauty is that they can interpret intent. Adept is taking this to the next level by developing the Action Transformer (ACT-1), a model that acts within the action space of the UI elements on a web page, i.e. you can tell the model to do anything within a browser or enterprise application. If reading this hasn’t already excited you, it’s worth watching a few of Adept’s demos to really appreciate what this could mean.

Screenshot of ACT-1 at work

Adept is going for the OpenAI approach, building a broad foundation model with an insanely large “action space” (the bounds within which actions can be taken). Whilst the concept of Decision Transformers is cool, they’re not trivial to build and it’s still unclear how they’ll be used. However, there’s now an immediate opportunity to leverage LLMs to build out logic and act within verticalised action spaces, whilst also focusing on a great UX.
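
One simple way to picture that opportunity: define a small, vertical action space up front, ask an LLM to translate a plain-language request into one of those actions, and only then execute it with ordinary code. Everything in the sketch below (the action list, the prompt format and the call_llm stub) is hypothetical.

```python
import json

# Hedged sketch of an LLM acting inside a small, verticalised action
# space. The action list, prompt and call_llm stub are hypothetical;
# the point is that the model only ever chooses from a closed set of
# actions, and deterministic code does the actual execution.
ACTION_SPACE = {
    "draft_follow_up_email": ["recipient", "context"],
    "update_crm_stage":      ["deal_id", "stage"],
    "schedule_call":         ["attendee", "time"],
}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM completion call."""
    raise NotImplementedError

def act(user_request: str) -> dict:
    prompt = (
        "Choose exactly one action for the request below and reply as "
        'JSON {"action": ..., "args": {...}}.\n'
        f"Allowed actions and their arguments: {json.dumps(ACTION_SPACE)}\n"
        f"Request: {user_request}"
    )
    decision = json.loads(call_llm(prompt))
    assert decision["action"] in ACTION_SPACE  # never act outside the space
    return decision  # hand off to a real, deterministic handler from here

# act("Move the Acme deal to negotiation and draft a follow-up email")
```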

There are already signs of this happening: Glyphic is building a product to take actions within the action space of B2B sales, ShiftLab is building a product to take actions within the action space of e-commerce, and Harvey is building for the action space of a lawyer.

It’s worth noting that many action spaces aren’t widely perceived as action spaces, e.g. Jasper.ai took on the action space of a blank advertisement and Copilot took on the action space of VS Code. Any no-code tool is essentially an action space, so it’s only a matter of time before these tools all start building AI features that let users interact with their platforms via natural language; Glide, Fillout and Qatalog are already exploring this.

There are bound to be AI use cases in the ultimate action space, the physical world, and we’re already seeing advances in robotics via unsupervised learning.

Conclusions

We believe that the most useful models will be “models that act” rather than models that merely generate, and that we’re moving towards a world of domain-specific versions of Copilot that unlock new levels of productivity.

In the past 12 months we’ve backed many AI-native companies, such as Levity and Photoroom. If you’re building in this space, we’d love to speak with you; feel free to reach out at ssukumar@balderton.com.
