Software 3.0 — the era of intelligent software development
TL;DR: Software 2.0 is revolutionizing how we develop a meaningful yet small portion of our software, while the Software 3.0 stack will take us into an era of ubiquitous intelligent software development, accessible to everyone.
“Standing on the shoulders of giants”: this article is inspired by texts and opinions of wonderful people worldwide, and mainly by Andrej Karpathy’s blog post “Software 2.0” [1]. Let’s begin by quoting [1] (with slight modification):
The “classical stack” of Software 1.0 is what we’re all familiar with — it is written by a programmer in languages such as Python, C++, JS, CSS, etc. It consists of explicit (declarative or imperative) instructions to a compiler/interpreter. By writing each line of code/instruction, the programmer identifies a specific point in the program space with some desirable behavior.
In contrast, Software 2.0 is written in a much more abstract, human-unfriendly language, such as the weights of a neural network. No human is involved in writing this code directly because there are a lot of weights (typical networks might have millions).
The latest breakthroughs enable us to envision Software 3.0.
In SW 3.0, programmers provide a set of instructions and a dataset that define the program’s desired behavior. An AI agent then takes these instructions and the dataset and generates the program. This creation process includes the agent generating programmer-readable code and training the neural-net models.
The generated programmer-readable code could be in languages and syntax from SW 1.0 and SW 2.0, such as JavaScript code for an application front end or Python with a neural-network model definition.
The instructions may come in many formats. For example, they could be web-application design or wireframe files accompanied by the app’s user stories (in natural language or other forms). In another example, the instructions could be a set of guardrails and requests about the expected output of a conversational AI bot, given in natural language, to further influence the bot’s behavior in addition to (or in contrast with) the biases induced by the dataset.
To make the analogy explicit, in SW 1.0, human-engineered source code (e.g., some .cpp files) is compiled into a binary that does valuable work.
In SW 2.0, most often, the source code comprises 1) fixed “classical” code with business logic, I/O control and pre/post-processing of data, 2a) the dataset that defines a significant part of the desirable behavior, 2b) the neural-net (NN) architecture with many details (the weights) to be filled in, and 3) code for filling in the NN weights and compiling the final neural network.
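The three-part structure above can be sketched in miniature. This is a toy illustration only (the scaling, the linear “model”, and the gradient-descent loop are stand-ins I chose for brevity, not a real framework):

```python
# Toy illustration of the SW 2.0 source-code structure:
# 1) fixed "classical" code, 2a) a dataset, 2b) a parametric model
# with weights to be filled in, and 3) code that fills in the weights.

# 1) Fixed classical code: pre/post-processing written by hand.
def preprocess(x):
    return x / 10.0          # scale the raw input

def postprocess(y):
    return round(y, 2)       # format the model output

# 2a) The dataset defines the desired behavior (here: y = 2x).
dataset = [(x, 2.0 * x) for x in range(1, 6)]

# 2b) A "model" with a single weight to be filled in.
w = 0.0

# 3) Optimization code fills in the weight (gradient descent on MSE).
for _ in range(200):
    grad = sum(2 * (w * preprocess(x) - y) * preprocess(x) for x, y in dataset)
    w -= 0.5 * grad / len(dataset)

def program(x):
    """The 'compiled' SW 2.0 program: fixed code wrapping learned weights."""
    return postprocess(w * preprocess(x))

print(program(7))  # ≈ 14.0 (the learned weight compensates for the scaling)
```

The point of the sketch: only the two `def`s at the top and the dataset are hand-written; the weight that makes `program` useful is found by the optimizer.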
In SW 3.0, the business logic, I/O control, and the data pre/post-processing code are partially or even entirely created by the AI agent during the optimization process.
SW 3.0 expressiveness (i.e., program space) covers both the SW 1.0 and SW 2.0 program spaces, as it can produce the output of both. That is, a program created by the SW 3.0 stack may include the knowledge or computational power of any code line, data point and neural-network layer from the SW 1.0 and SW 2.0 stacks.
Paradigm transition
In most practical machine-learning-based applications today [2022 April], the neural-net architectures and the training systems are increasingly standardized into a commodity, so most of the active “Software 2.0 development” takes the form of curating, growing, massaging, and cleansing labeled datasets. (Of course, there are other glorious machine-learning acts and methods! Just one example: self-supervision techniques are improving, so less labeling is required.) This is fundamentally altering the programming paradigm by which we iterate on our software (SW 1.0 → SW 2.0), as teams split into three:
- the 2.0 programmers (which include data/ML engineers and data scientists, for example) collect and cleanse the datasets and run training/optimization processes,
- a few 1.0 programmers (which include MLOps engineers, for example) maintain and iterate on the surrounding training-code infrastructure, analytics, visualizations and labeling interfaces,
- and another group of 1.0 programmers develops the applications’ user interface (UI), the non-statistical part of the business logic, workflows, etc.
In most cases, a significant portion of an overall application’s logic is actually written manually by the SW 1.0 programmers and is “fixed” (i.e., most of the logic and code are not suggested by a machine during an optimization process).
Winds Of Change. Thanks to the latest large-language-model (LLM) technologies and training schemes, we see a new shift in the programming paradigm, where code is written (or suggested) by an intelligent agent, like GitHub Copilot [3]. When SW 3.0 matures, most programmers will use the SW 3.0 stack, as it should cover most of the program space (as opposed to the SW 2.0 stack; see Figure 2 above).
It is an ongoing transition (SW 1.0 → 2.0 → 3.0)
Visual Recognition, Speech Recognition, Speech Synthesis, Machine Translation, Game Agents, and even Database Caching/Indexing/Querying.
In each of the aforementioned areas, we’ve already made improvements. Over the last few years, we gave up on trying to completely address a complex problem by writing explicit code using the SW 1.0 stack, and instead transitioned to the 2.0 stack.
Lately, we have seen meaningful improvements in technology and products for code auto-completion/suggestion, such as Codex [2] and its embodiment in GitHub Copilot [3] (there are earlier attempts too, e.g. TabNine [17]).
Many users report a magical feeling using these, but it is also reported that the suggested code is not always accurate. See, for example, a review by Jeremy Howard “Is GitHub Copilot a blessing, or a curse?” [4].
Interest in the field is growing fast, and new solutions for automatic bug discovery are in the making [5].
Under the hood, LLMs (GPT-3 [6] and alike) provide code suggestion capabilities. LLMs are not restricted to having natural language as the input and output. They can, for example, be used to transform a natural language request into SQL, Python, or other languages that fit many useful frameworks and services APIs.
Let’s take an example: AI21 Labs’ latest work, Jurassic-X [7] (see Figure 3 below). Their team has paired their LLM (named Jurassic-1) with a set of expert modules, each being more accurate on a specific field or task, such as performing a calculation or participating in planning. This is a neat demonstration of a way to get around some of the shortcomings of contemporary generative models.
“A MRKL system consists of an extendable set of modules [on top of a Jurassic-1 model], which we term ‘experts’, and a router that routes every incoming natural language input to a module that can best respond to the input,” the authors write.
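The routing idea from the quote can be illustrated with a minimal sketch. The module names and the keyword-based routing rule below are illustrative choices of mine, not AI21’s actual implementation:

```python
# Minimal sketch of a MRKL-style router: each "expert" handles the
# inputs it is best at; everything else falls through to the LLM.
import re

def calculator_expert(query):
    # Handles simple arithmetic like "what is 17 * 3".
    expr = re.search(r"(\d+)\s*([-+*/])\s*(\d+)", query)
    a, op, b = expr.groups()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
           "*": lambda x, y: x * y, "/": lambda x, y: x / y}
    return str(ops[op](int(a), int(b)))

def llm_expert(query):
    # Placeholder for the general-purpose language model.
    return f"[LLM answer to: {query}]"

def router(query):
    """Route each input to the module that can best respond to it."""
    if re.search(r"\d+\s*[-+*/]\s*\d+", query):
        return calculator_expert(query)
    return llm_expert(query)

print(router("what is 17 * 3"))     # the calculator answers: 51
print(router("who wrote Hamlet?"))  # falls through to the LLM
```

A real router would itself be learned rather than a regular expression, but the division of labor is the same: deterministic experts where exactness matters, the LLM everywhere else.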
In another example, GPT-3 [6] can be tweaked by providing different instructions (and/or prompts).
Some practitioners already report that today’s machine-learning work largely comprises providing instructions (and/or prompts) to GPT-3 and the like. For example, see L2P (Learning to Prompt) [16], prompting visual-language models [18], or the tweet below.
(A prompt is a reformulation of the input query to the LLM, usually a piece of text inserted at the beginning of the query. Its purpose is to guide the pre-trained LLM to provide its output according to the desired format and behavior.)
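As a concrete (hypothetical) example of such a prompt, a natural-language-to-SQL request might be wrapped like this before being sent to the LLM; the few-shot examples and table names here are invented for illustration:

```python
# Illustrative prompt construction for NL-to-SQL translation.
# The few-shot examples steer the LLM toward the desired output format.
PROMPT_PREFIX = """Translate the request into SQL.

Request: show all users older than 30
SQL: SELECT * FROM users WHERE age > 30;

Request: count orders placed in 2021
SQL: SELECT COUNT(*) FROM orders WHERE year = 2021;
"""

def build_prompt(user_request):
    """Prepend the few-shot prefix so the LLM continues in the same format."""
    return PROMPT_PREFIX + f"\nRequest: {user_request}\nSQL:"

prompt = build_prompt("list product names and prices")
print(prompt)
```

The LLM sees the prefix plus the new request and, by pattern completion, is strongly biased to emit a single SQL statement rather than free-form text.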
The limitations of Software 3.0
The 2.0 stack has some disadvantages of its own. At the end of the optimization, our program consists of large networks that work well, but it’s very hard to tell how.
The 2.0 stack can fail in unintuitive and embarrassing ways [8], or worse, it can “silently fail”, e.g., by silently adopting biases from its training data, which may be very difficult to properly identify, analyze and resolve.
We’re still discovering some of the peculiar properties of this stack. For instance, the existence of adversarial examples [9] and attacks [10] highlights the unintuitive nature of this stack.
The limitations above also apply to the 3.0 stack.
More specifically for SW 3.0: since the code is generated by a statistical model (the aforementioned AI agent), it could also produce buggy and unoptimized code (similar to a human?) that might be overlooked, even when provided in human/programmer-readable language.
The benefits of Software 3.0
Why should we prefer SW 3.0 over SW 2.0?
Larger program space. Neural networks and their inference engines [2022 April] cannot practically express and execute all the behaviors that are possible with “classic” programming languages.
A combination of deterministic and statistical models. Deterministic behavior is encoded in code, statistical behavior in NNs.
Explainability and interpretability. In SW 3.0 there is perhaps less incentive to express the non-statistical program behavior with NNs, as it becomes feasible to automatically transform a desired behavior to programmer-readable code, which is probably more interpretable to humans than NNs.
Adaptability. It is easier for a programmer to fix or change code than to play with NN weights.
More people can become programmers. In SW 3.0, a bigger portion of the overall application is expressed with natural-language-like instructions, making it more feasible to enlarge the pool of programmers.
This could have a huge impact on society and companies. There is currently a massive mismatch between the supply of and demand for programmers, while both the SW 1.0 and SW 2.0 stacks introduce significant entry barriers despite the growing online education materials.
Increased development velocity. SW programs can be written faster once automation is introduced into both the deterministic and statistical aspects of building SW applications.
A step towards AGI. :) … more about this in the last section.
Programming in the 3.0 stack
Software 1.0 is code we write. Software 3.0 is code written via an optimization process guided by an evaluation criterion (such as “best fit to the desired behavior given by the instructions and data”). It is likely that any setting where the program is not obvious but one can repeatedly evaluate its performance (e.g., did you classify some images correctly? do you win programming competitions?) will be subject to this transition, because the optimization can probably find much better code than what an average human can write.
Considering this transition, it makes sense that we will also see new automated [intelligent] testing tools and processes.
It is likely that testing [automation] will become even more crucial in the SW 3.0 stack compared to SW 1.0 & SW 2.0, for a number of possible reasons. To name two: 1) programmers, including less experienced ones, would be able to do more; 2) applications and programs will be much more dynamic, and statistically-based programs will power more parts of the applications.
For example, with SW 3.0, it might be much easier to develop websites and mobile apps that adapt their layout, composition and content per user, as well as to introduce intelligent chatbots that help with navigation, automation and user satisfaction. This could make manual testing impractical. Instead, testing AI agents would mimic many different kinds of users using the site or app.
Application examples that will benefit from Software 3.0
Thinking [and working] on the above already gets me really excited. But thinking about what we can do with the technology and how it will evolve makes me ultra-excited.
Web design and development — Will enable product managers to make meaningful changes or suggestions on web or mobile apps using a simple GUI, and then an AI agent will accordingly publish a relevant code release into the programmers’ code repository for them to review and approve.
No-code platforms — Will be much more flexible.
Next-gen RPA — Will automate many workflows and back-office tasks [12, 13].
Data tools — Will be ubiquitous and usable by most knowledge workers. This includes data gathering [15].
Testing tools — Will become ultra-smart, automated and informative. This will help to greatly improve the way manual testing is done today ($10B+ market?). This is more than “nice to have”; it will become a necessity to develop intelligent testing tools to cope with the sophistication that SW 3.0 will enable.
Research tools — Will offload a big part of the research from the researchers, e.g. doing an automated literature review. As an example, this can include automated web crawling [15].
And many more — I would love to read suggestions in the comments.
Summary and Future
Is SW 3.0 a framework for reaching AGI (Artificial General Intelligence)?
I read opinions in both directions [12].
In SW 3.0, a desired behavior for a new program is provided by means of natural-language-like instructions and data to an AI agent, which in turn outputs an AI program compiled from programmer-readable code and a neural network having the desired behavior:
Instructions + Data → AI agent → Code + NN
Then, imagine a specific use case where the “Instructions + Data” express a desired behavior for a better AI agent:
Thus,
Instructions + Data → AI agent → Improved AI Agent
Question 1) What would this kind of “Instructions + Data” look like? Can we create it with some simple or sophisticated simulation?
Question 2) Then, can we run this over and over again?
Before discussing this magical AI circle further and debating what could enable it, note that even a more straightforward setup might do wonders. Take for example AlphaCode [11], announced by DeepMind, a SW 3.0 embodiment. AlphaCode placed in the top 54% of participants in programming competitions hosted on Codeforces, participating in contests that post-dated its training data. AlphaCode:
Instructions + Data → AI agent → Winning programming competitions
“The problem-solving abilities required to excel at these competitions are beyond the capabilities of existing AI systems. However, by combining advances in large-scale transformer models (that have recently shown promising abilities to generate code) with large-scale sampling and filtering, we’ve made significant progress in the number of problems we can solve,” DeepMind writes.
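The “large-scale sampling and filtering” from the quote can be sketched abstractly. In this toy stand-in, a random choice over a handful of one-liners plays the role of the generative model, and the problem’s example tests do the filtering (AlphaCode’s actual pipeline is far larger and uses transformer sampling plus clustering):

```python
# Toy sketch of sample-and-filter: generate many candidate programs,
# keep only those that pass the problem's example tests.
import random

random.seed(0)

# Stand-in "model": samples a candidate one-line program at random.
CANDIDATE_BODIES = ["x + 1", "x * 2", "x ** 2", "x - 1", "2 * x + 1"]

def sample_candidate():
    body = random.choice(CANDIDATE_BODIES)
    return body, eval(f"lambda x: {body}")

# The problem's example tests (desired behavior: double the input).
examples = [(1, 2), (3, 6), (10, 20)]

def passes(fn):
    return all(fn(x) == y for x, y in examples)

# Sample many candidates; filter by the tests; keep the survivors.
survivors = {body for _ in range(100)
             for body, fn in [sample_candidate()] if passes(fn)}
print(survivors)  # only "x * 2" satisfies all three examples
```

The filter is what turns a weak, noisy generator into a usable programmer: correctness is checked mechanically, so the model only has to be right occasionally.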
Does the specific setup used in AlphaCode suffice to ignite the magical AI circle? Probably not, since the programming tasks currently do not include the task of creating an improved AI Agent.
In the coming years, we will see immense research efforts to improve the setup and learning schemes used to create AI agents that generate programs.
Regardless of the setup, when AlphaCode is in the top 1% of participants, I expect it to already have disruptive power.
In addition, with such powerful capabilities arising, I believe we will see much more research, demand and achievements in the following fields:
1) Responsible and Truthful AI [14]
2) Intelligent software testing, evaluation and monitoring
3) Advanced data analysis
Software 3.0 is about humans and machines working closely together to create intelligent software development.
We will see both bottom-up and top-down progress in using the SW 3.0 stack and related development processes.
Bottom-up: Functions, classes, and capabilities will be generated with newly introduced SW 3.0 IDEs and tools for code management, testing, code generation and analysis.
Top-down: Graphical and programmable application interfaces will be generated with newly introduced no/low-code SW 3.0-based platforms.
Last note: if you are about to read Karpathy’s original post, I suggest also reading the posted comments and seeing if you think they aged well or not.
Thanks!
[1] https://karpathy.medium.com/software-2-0-a64152b37c35
[2] https://openai.com/blog/openai-codex/
[3] https://copilot.github.com/
[4] https://www.fast.ai/2021/07/19/copilot/
[5] https://www.microsoft.com/en-us/research/blog/finding-and-fixing-bugs-with-deep-learning/
[6] Instruct GPT-3
[7] https://www.ai21.com/blog/jurassic-x-crossing-the-neuro-symbolic-chasm-with-the-mrkl-system
[8] motherboard.vice article about bias
[9] https://blog.openai.com/adversarial-example-research/
[10] https://github.com/yenchenlin/awesome-adversarial-machine-learning
[11] AlphaCode
[12] https://techcrunch.com/2022/04/26/2304039/
[13] A data-driven approach for learning to control computers https://arxiv.org/pdf/2202.08137.pdf
[14] Truthful AI post
[15] https://openai.com/blog/webgpt/
[16] https://ai.googleblog.com/2022/04/learning-to-prompt-for-continual.html?m=1
[17] https://github.com/codota/TabNine
[18] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
Acknowledgment: Special thanks to Asaf Noy, Prof. Lihi Zelnik, Tal Ridnik, Avi Ben-Cohen, Shai Geva, Brian Sack and Dedy Kredo for the post review.