From Cups to Consciousness (Part 1): How are cups related to intelligence?
The philosophy within cups, language grounding and a survey of 3D environments
At MTank, we work towards two goals: (1) to model and distil knowledge within AI, and (2) to make progress towards creating truly intelligent machines. As part of these efforts, the MTank team releases pieces about our work for people to enjoy and learn from, completely for free. If you like our work, then please show your support. Thanks in advance!
- Part 1: How are cups related to intelligence?
- Part 2: From simulation to the real world
- Part 3: Mapping your home with SLAM
The ‘philosophy of consciousness’ debate
“True happiness is found at the bottom of a cup”
A core idea that brought the MTank team together was our interest in building ‘conscious machines’ — a vague statement that we started chucking around together a few years back. Given enough time, each grouping of members in our team, regardless of configuration, will eventually have their conversations converge on this idea. A conversational gravity well against which we do not struggle.
These conversations are not unique to us; debates over thought and consciousness appear everywhere — René Descartes’ ‘I think, therefore I am’, Alan Turing’s ‘Computing Machinery and Intelligence’, John Searle’s ‘Chinese Room Debate’, David Chalmers’ ‘The Hard Problem of Consciousness’, etc. They are a staple of modern, and classical, philosophical thought. They are also staples of modern media, although oftentimes the original nuance is bludgeoned into a one-size-fits-all-terminator-box. There is considerable daylight between these ideas and what Twitter hype and modern tech journalism have to say about topics like ‘machine consciousness’ and ‘artificial general intelligence’ (AGI).
Even more so when we look at the divergent methods through which researchers believe that general intelligence(s) may be created — for instance, see Yann LeCun and Gary Marcus arguing about what is necessary for AGI for the umpteenth time, or the Bitter vs Sweet vs Better Lesson. Suffice it to say that there are near-countless thoughts, theories, definitions and supposed answers to the question of whether machines will ever equal or surpass human capabilities.
Pinning down what makes an intelligence generalisable is difficult; agreeing on a single definition is near-impossible. The shifting goal-posts of intelligence-definition are one of the most interesting aspects of the field itself. Philosophers argue constantly about what consciousness is, stating that it may be one of, if not the, single most mysterious concept in the universe. And the gravity well has brought us towards its centre.
So all of this consciousness stuff is very exciting, but we don’t actually intend to provide any answers to these deepest questions. Nor can we shed much light on intelligence or consciousness. But what we can do is:
Teach a robot to pick up cups at an alarming rate…
In the real world. Teach it to reason about cups, to measure its worth in cups, and to understand cups, and cup-related things in hundreds of different homes. Yes, even yours.
The philosophy ‘at the bottom of every cup’
“In the beginning, God created Heaven & Earth. At the end, MTank created robots that collect cups.”
Right, so why cups?
Beginning with cups allows us to abstract away the difficulties related to ‘consciousness’, and focus on the simple, pragmatic tasks (which may be) required to create truly general intelligences. An apt and pragmatic definition of AGI, for our purposes, would be a machine which can complete the majority of tasks that humans are capable of. This could be narrowed by environmental domains, i.e. all household tasks, or by general roles, e.g. farming, cooking, accountancy, gym instruction, etc.
Again, why cups?
So, as we pointed out, consciousness is something that has captivated and vexed humanity for centuries, even millennia. But cups — cups we figured out ages ago. Cups have remained relatively constant throughout human development, a bulwark against a changing world, a stable representation throughout time and geography. We like that people have pondered the biggest questions clutching a humble cup in reflection; we like that cups are everywhere.
As individual as it is, intelligence is also built on social interactions. Intelligent machines will have to manage within our environments; they should be adapted to interact with the modern world. For our purposes, we thought that household environments, i.e. apartments and such, would make an interesting stomping ground. Places where our agent can navigate an intimate space with people, and (potentially) help them in their day-to-day.
Also there were a raft of kick-ass 3D household environments released recently and we were eager to test them out!
If consciousness is an emergent property of sufficiently complex systems, then we want our agent’s consciousness to emerge while on the hunt for cups. We want our little fella to ponder his back-and-forth in search of cups, ponder its usefulness, ponder its true value — ponder his true value.
Sure this might seem far-fetched, unsubstantiated and honestly, a little cruel — but we do promise to adjust the variety, complexity and difficulty of the agent’s tasks over time. Just so that sentience isn’t extinguished by crushing boredom, as is so often the case in the real world.
The why, what and how of cup-picking
Start small, start simple. Let’s train embodied agents to navigate through realistic 3D house environments, while picking up cups, and let’s increase the speed (dramatically) over time. As you read this, there are armies of PhDs attempting to program hand-like grippers to pick up arbitrary objects, and legions of industry researchers and engineers programming robots to do increasingly complicated tasks; we’ll start here too.
We intend to iterate quickly, meting out task configurations that require an agent to understand natural language instructions, that require our agent to communicate, that require our agent to understand. Together, instruction, environment and nascent common sense guide him to accomplish tasks defined in response to our deepest desires, e.g. “pick up the cup”.
Natural language-instructed cup-picking, as we define it, involves a combination of Reinforcement Learning (RL), Computer Vision (CNN architectures) and Natural Language Processing (NLP) to land in the realm of language grounding.
What is this mythical language grounding of which you speak? Well, grounding takes concepts expressed through abstract symbols or representations, beyond the physical presence of objects, and “grounds” them in some meaning represented in the real world. Natural language itself is one such symbolic system.
To illustrate this we can use the concept of “dog” as an example. When you read the word “dog” in this document, you likely have a clear picture of an animal with 4 legs, fur and other features that weren’t mentioned. That is, we have a “common ground” or understanding of what a dog is, and can discuss the concept without needing to carry an example around with us all the time.
We have grounded the concept of “dog” with language, i.e. we use the word as an abstract equivalent of the animal. But it could be any other arbitrary symbol that we agreed upon, not just a word, and not necessarily the letters d-o-g together. Language is often how we understand another’s thought process; it will also allow us to communicate with the agent.
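As a toy sketch of what grounding could look like computationally (every name and feature value here is invented purely for illustration, not taken from any real system), a symbol can be tied to a perceptual representation and then matched against what the agent currently observes:

```python
# Hypothetical example: each symbol maps to a hand-made feature vector
# standing in for a learned perceptual representation.
symbol_to_features = {
    "cup":    (1.0, 0.8, 0.1),   # (roundness, graspability, size)
    "fridge": (0.2, 0.1, 1.0),
    "torch":  (0.3, 0.9, 0.3),
}

def ground(word, observations):
    """Return the name of the observed object whose features
    best match the symbol's representation (smallest squared distance)."""
    target = symbol_to_features[word]
    def dist(obj):
        _, features = obj
        return sum((a - b) ** 2 for a, b in zip(features, target))
    return min(observations, key=dist)[0]

# Two objects the agent "sees", with their perceived features.
observed = [("object_a", (0.95, 0.75, 0.15)),
            ("object_b", (0.25, 0.15, 0.90))]
print(ground("cup", observed))  # object_a
```

In a real agent the feature vectors would of course be learned from pixels rather than written by hand; the point is only that the symbol and the percept meet in a shared representation.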
Any true intelligence (likely) requires a capacity for language, whether shared or internal, as well as embodiment within the real world. Such an agent could then adapt its internal goals and priorities based on the situation.
Language instructions are one of the ways in which we can train a reinforcement learning system, or policy, to accomplish multiple tasks. We chose language because it makes human-agent communication more convenient, while also enabling multi-task learning.
Recent work, like Devendra Chaplot’s, showed that the foundations of language grounding are already within reach in the ViZDoom environment. For instance, place an agent in a room containing several objects in randomised positions. Depending on the language instruction given as input at the beginning of each episode, the agent learned to navigate to the specified object while avoiding the others, e.g. “go to the small green torch”. DeepMind also contributed by adding relational instructions, e.g. “the object next to” or “in the green room”, to this task in their DeepMind Lab environment.
This involves multi-modal inputs — vision and language (see our publication on ‘Multi-Modal Methods’) — since the agent will have to use the natural language symbolic system to establish the meaning of the word “cup” as he observes cups in the environment, and to establish that “pick up” actually defines a specific goal in the world.
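One way vision and language can be fused is the gated-attention idea from Chaplot’s ViZDoom work: the instruction embedding produces a per-channel gate over the visual features. A minimal numpy sketch (all shapes and values here are illustrative, not the actual architecture’s dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Convolutional visual features (channels x height x width) and a
# sentence embedding with one value per visual channel.
visual_features = rng.standard_normal((64, 7, 7))
instruction_emb = rng.standard_normal(64)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each visual channel is scaled by a gate in (0, 1) derived from the
# instruction, so "go to the small green torch" can emphasise channels
# that respond to green, torch-like things.
gates = sigmoid(instruction_emb)                  # shape (64,)
fused = visual_features * gates[:, None, None]    # broadcast over H, W

print(fused.shape)  # (64, 7, 7)
```

The fused tensor then feeds the policy network, so the same visual backbone can serve many instructions.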
From here, the symbol of “cup” can then be mapped to a particular desire of, or “meaning” for, an agent — a burning desire to go find a cup within his visual system. Such desires are only the beginning. Should our agent have limbs, then he may wish to manipulate the cup and engage his haptic feedback systems.
Our agent shall learn the smooth roundness of a cup and associate the symbol of “cup” to this smooth roundness which he is particularly fond of. He will begin to understand what is “cup-like”. He will have unknowingly grounded himself in the meaning of ‘cups’ and have begun his journey From Cups to Consciousness.
Viva la Robo-lution, Oui?
For a machine to ever achieve this dream it would need a world to play with, i.e. an environment. To date, robots have been stuck in strictly controlled environments, mostly imprisoned within factories.
The upcoming robot revolution is inevitable given the quickening development of learning machines: machines which can handle ambiguity, as well as the complex, dynamic environments with changing requirements that our homes represent. Most people would welcome a robot to relieve them from mundane tasks like vacuuming, cleaning and setting the table. And the list goes on.
So what is a good starting point for all of this? Can we finally move away from simplistic 2D Atari games as the test bed for our strongest RL algorithms? The short answer is yes…
Training and experimenting with our agents in the real world would be cumbersome, slow and difficult. However, with an accurate and realistic enough simulation of typical rooms, we can train thousands of agents simultaneously on tasks relevant to the real world. It also becomes possible to train for rare edge cases that we couldn’t possibly recreate, or safely handle, in reality. All of which can be done much faster than real time.
Our goal is, therefore, to start with simulated but realistic 3D household environments, e.g. kitchens and living rooms. With these we can create tasks that better represent day-to-day problems involving a person’s most common requests (“make me tea”, “clean the kitchen”, “find my favourite cup”). To deploy these systems in our homes, we need to transfer knowledge and skills from the simulation to the real world.
“Simulation is doomed to succeed” — Rodney Brooks
“Prediction: Any AI problem that you can simulate and sample endlessly many training samples for can be solved with today’s algorithms such as deep and reinforcement learning.” — Richard Socher
3D Environments: More than just plain games
There has been a huge effort on the part of major companies and research institutions to publish new environments that tackle the challenges of learning in complicated 3D worlds. In a similar vein to how ImageNet propelled progress within computer vision, these environments open up the path to a different, broader and potentially multi-modal revolution. One that will bring totally new RL and AI algorithms, as well as approaches that combine multiple advancements together, e.g. recent work in variational inference, world models, RL and intrinsic motivation.
In recent years, there has been an explosion of released 3D environments that are finally reaching a scale and realism resembling the real world. These environments include AI2Thor, HoME, MINOS, ViZDoom, House3D, Habitat, CHALET, UnrealROX and Gibson. Many of them build on 3D datasets such as Matterport3D and SUNCG (45K different scenes). The environments differ in a variety of ways, e.g. customisability, scale, manipulation, physics and photo-realism.
Check the tables below!
As one can see, this is a very exciting time for RL algorithms. We have arrived at a point where it is possible to test on an increasing number of difficult environments and the tasks within them. The increased demand for ‘intelligent, increasingly-autonomous systems’ has produced a supply of environments for training them, creating a feedback loop: researchers push the environments forward, and the environments push the researchers.
That is, many clearly defined tasks, within many variants of 3D environments, with measurable metrics, will force researchers to eventually come up with agents that can learn some high level of perception, prediction, reasoning, egomotion, manipulation, grounding, affordances, abstraction, planning and awareness — eventually arriving at something we can call “truly intelligent”.
Exciting yes, but where do we start?
There are many pros and cons to each of these environments, so we worked backwards from what we would like our agents to do. For instance, to pick up cups, more physics and realism were desirable, and customisability was very important to us. Due to our particular focus on interactions with objects in the environment, we settled on The House Of inteRactions (AI2THOR) as the foundation upon which our research shall be built.
If ‘AI2Thor’ isn’t your cup of tea (hehe), we created a comprehensive list of the hottest RL frameworks and environments that you can explore. It can be found at the RL-code-resources repository we created on GitHub. If you have any questions about selection, or the environments on the list, feel free to ping us and we’ll be more than happy to discuss our thoughts.
AI2Thor
We began with AI2Thor because of the simplicity of its API. Writing “pip install ai2thor” in your terminal is nearly the extent of the setup, and the interface to the environment itself is user-friendly with a lot of customisation possibilities.
We quickly found that AI2Thor not only had cups, but also allows object interaction, e.g. opening and closing microwaves and fridges, turning on taps, and placing cups and other objects into/onto receptacles. The second we saw that this environment contained cups, and allowed simple interaction with said cups, we jumped on it: this was the essence of what is needed to complete our incredibly lofty goal of reaching synthetic sentience through human-machine symbiosis. Or, at the very least, we foresaw an agent being able to recognise cups and navigate towards them as a good starting point for many of our algorithms.
In the next instalment of C2C #cuplife
In the next blog we’ll go deeper into an interface we are developing to make AI2Thor an OpenAI gym environment. Such an interface allows for task customisation, and our repo also includes the code of several state-of-the-art algorithms that can be trained on these tasks. OpenAI popularised gym not only as a set of environments, but as a general way of defining them in code; we extended this by adding the idea of tasks, and how they relate to the environment itself, as a separate concept.
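The separation of environment from task can be sketched roughly as follows. This is a hypothetical, self-contained toy (the class and method names here are illustrative, not the actual cups-rl API, and the AI2Thor calls are stubbed out): the task object owns reward and termination logic, while the env object owns the gym-style reset/step interface.

```python
class PickUpCupTask:
    """Defines reward and termination; knows nothing about rendering
    or simulation details, so tasks can be swapped independently."""
    def reward_and_done(self, event):
        picked_up = event.get("picked_up_cup", False)
        return (1.0 if picked_up else 0.0), picked_up

class AI2ThorGymEnv:
    """Gym-like interface: reset() -> obs, step(action) -> (obs, reward, done, info).
    In the real wrapper, step() would forward the action to the AI2Thor
    controller; here the simulator is faked for illustration."""
    def __init__(self, task):
        self.task = task
        self.t = 0

    def reset(self):
        self.t = 0
        return {"frame": None}  # would be the first RGB frame

    def step(self, action):
        self.t += 1
        # Pretend the cup gets picked up on the third step.
        event = {"picked_up_cup": self.t >= 3}
        reward, done = self.task.reward_and_done(event)
        return {"frame": None}, reward, done, {}

env = AI2ThorGymEnv(PickUpCupTask())
obs = env.reset()
done, total = False, 0.0
while not done:
    obs, reward, done, info = env.step("MoveAhead")
    total += reward
print(total)  # 1.0
```

Keeping the task as its own object is what lets the same environment host “pick up the cup”, “find my favourite cup”, and so on, without touching the simulation code.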
Please feel free to like our material, follow along or laugh with (at) us as we set out on our modest, ambitious, aggressive, provocative and light-hearted attempt at telling the story of the journey from cups to consciousness. And, if you can’t wait for the next instalment, you can follow along with the latest releases on our repo.
Be sure to give our repo a star, fittingly named cups-rl (Customisable Unified Physical Simulations (CUPS) for Reinforcement Learning algorithms). Otherwise find us here, or at our website.
And as always, thank you for reading!