
The Practicalities of Software 2.0 or How Deep Learning Will Replace Software Engineering


Now, a couple of days ago I happened to stumble upon a rather interesting read by Andrej Karpathy on Software 2.0. I do implore you — just for the duration of this article — not to roll your eyes or splash your screen with the cup of coffee you’ve just brewed or perform any other act of closed-minded aggression against your reading device. I assure you, you can leave your invective below in the comment section. In fact, I would very much appreciate that. Yes, ladies and gentlemen, we’ve got another version of something. You can put it right next to Web 3.0, thank you.

Of course, by now, it would be reasonable of you to demand an explanation of what Software 2.0 even is. I shall oblige: No matter if you’re a budding script kiddie, Dennis Ritchie (although you would be very much dead), or a flower arranger, you’ve heard about “Software.” It is a collection of instructions that your computer can understand and execute. But you can also define “Software” as the most precise specification for a business process that’s possible to write.

Let me explain.

Suppose your product owner waddles by your desk. You’re busy doing other important things, but Agile requires you to be flexible, so you drop everything, turn to the product owner with your usual vexed countenance and take off your headphones.

“Now look, me and the people higher up have decided that we need a registration page. Can you build one for us?”

“Yes,” you sigh and take out a sheet of coffee-stained paper. “So, we’ll have this form, and all buttons will be red — ”

“No, not red, blue.”

“Alright, blue. So, we have this form, and all the buttons are blue — ”

“Except the ‘register’ button. It should be a different shade of blue.”

“What shade?”

“Like a darker kind of blue.”

“Is #0033CC OK?”

“Bang on.”

“Alright, so we have this form, all buttons are blue, this one button has hex #0033CC.”

“Right, and don’t forget about form validation. We should only accept UK phone numbers.”

“Gotcha. We’ve got a form that only accepts UK phone numbers, and all buttons are blue apart from this one that’s #0033CC.”

This back and forth continues for a couple of hours. You’ve used up 10 sheets of paper, but you’ve got yourself a solid, unambiguous, precise execution plan. In fact, the execution plan is so precise and so unambiguous that… well… it may as well be code. Through this fun exercise, we’ve arrived at the most detailed specification possible.

But what if we could write this specification differently? What if, instead of specifying all the exact steps your computer should execute, we simply gave it a collection of inputs and expected outputs? Given

{"name": "Arthur", "phoneNumber": "(510)668-7059"}

the program should return


but given

{"name": "Arthur", "phoneNumber": "+441234567891"}

it should give us back


What if we wrote software purely by specifying what output we want to see for a given input? What if we stopped caring what goes on inside a black box and only cared what comes out of it?
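One way to picture such a specification is as a plain list of input and expected-output pairs against which any candidate program is scored. This is only a minimal sketch: the `{"accepted": ...}` output shape, the `accuracy` helper and the `handwritten_program` stand-in are all assumptions made for illustration, since the text does not fix what the outputs look like.

```python
from typing import Callable

# Each example pairs an input record with the output we expect for it.
# The {"accepted": ...} shape is a hypothetical output format.
EXAMPLES = [
    ({"name": "Arthur", "phoneNumber": "(510)668-7059"}, {"accepted": False}),
    ({"name": "Arthur", "phoneNumber": "+441234567891"}, {"accepted": True}),
]


def accuracy(program: Callable[[dict], dict], examples: list) -> float:
    """Fraction of examples for which the program returns the expected output."""
    hits = sum(1 for given, expected in examples if program(given) == expected)
    return hits / len(examples)


def handwritten_program(form: dict) -> dict:
    """A Software 1.0 stand-in: accept numbers that start with the UK prefix +44."""
    return {"accepted": form["phoneNumber"].startswith("+44")}


print(accuracy(handwritten_program, EXAMPLES))
```

The scoring function never looks inside `program`, so a handwritten function and a trained model can be judged by exactly the same yardstick.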

Interestingly, this is where neural nets can help us. Being universal function approximators, we can use them to approximate (or learn) a great variety of functions: given this set of physical characteristics, does this person have cancer? Given this set of historical data, does this set of people exhibit high churn risk? Given this sequence of words in English, what is the likeliest corresponding sequence of words in French?

Given this sequence of input bits, what is the most appropriate sequence of output bits?

Perhaps I’m getting ahead of myself.

For anyone interested, I recommend reading the original article, “Software 2.0”, by Andrej Karpathy.

Plain and simple, Software 2.0 is a new iteration of software where your code is replaced by a neural net.

I won’t go into much detail, but a particularly curious property of Software 2.0, in my opinion, is that it allows you to forget about time complexity. For a fixed architecture and a fixed-size input, a forward pass performs the same number of operations whatever values you feed it, so it is effectively O(1). Forward propagation in neural nets is cheap. Optimising a program would reduce to simply finding the best architecture for your neural net. With prominent tools like AutoML and AutoKeras, I think that will be a reality soon.
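To make the constant-cost point concrete, here is a rough sketch of a dense forward pass in plain Python. The 3 -> 2 -> 1 network and its weights are made up for illustration. The thing to notice is that the number of multiply-adds is determined by the layer shapes alone, not by the input values.

```python
import math


def forward(x, layers):
    """Run a dense network; `layers` is a list of (weights, biases) pairs."""
    for weights, biases in layers:
        x = [
            math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]
    return x


def multiply_adds(layers):
    """Number of multiply-add operations in one forward pass."""
    return sum(len(weights) * len(weights[0]) for weights, _ in layers)


# A hypothetical 3 -> 2 -> 1 network with arbitrary weights.
LAYERS = [
    ([[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]], [0.0, 0.1]),
    ([[0.7, -0.6]], [0.2]),
]

print(multiply_adds(LAYERS))             # depends only on the layer shapes
print(forward([1.0, 2.0, 3.0], LAYERS))
```

The caveat is that the input has to fit the fixed width of the first layer; variable-length inputs need a different kind of model.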

To develop Andrej’s point, I think that this can be applied to traditional software too, not just to ML applications, which were the focus of his article. Here, I would like to propose some practical pathways to achieving a migration to Software 2.0.

Functional Programming

I would first like to briefly discuss a collection of programming styles.

The most common programming style is the imperative. We’ve all done this:

content = parseJson(request.body)
db = getConnection(dbConfig)

Each line does something straightaway. Each line is an action. Each line either modifies data, wrangles data, obtains data — even adds some data to some remote location. By far, it is the most common and intuitive programming style — one most of us start off with.

Another common programming style is the declarative:

stream = getStream()

Each line doesn’t do a whole lot on its own. Instead, we declare our intent. We give our computer a shopping list of what we want it to do, and it obliges.

You’ll recognise this style of programming from packages like TensorFlow and Keras. First you statically define the architecture of your neural net, then you train and evaluate it (of course, PyTorch is a whole different beast where you can pretty much build and prototype your neural net dynamically).

The third style I would like to discuss is the functional.

The thing about functional programming — it’s not easy. You start googling it, and very soon you fall into a rabbit hole of contravariance, categories and functors. Did someone say monads? What the hell even are those?

A common belief is that you’re doing “functional programming” whenever you use constructs like map or reduce or foldLeft. It’s much more than that: those are simply design patterns borrowed from functional programming and do not represent the entirety of it.

Functional programming is based on lambda calculus, which was originally developed by this guy called Alonzo Church in the 30s. In essence, it is a way of reasoning about computation — a very mathematically rigorous way at that. For those who have gone through the blood, sweat, and toil of mathematical education, you’ll be glad to know (or not) that the correctness of programs written in a strict functional style can be verified through the machinery of said calculus. In other words, if you write your program in a very special way, you can write down a proof that it does what you want it to do. Without any bugs. Great, right?

Now, one of the tenets of functional programming, especially the strict kind (i.e. not just using constructs like map and foldLeft), involves the use of “pure” functions. These are functions that don’t cause any outside mutation. Practically, what it means is that these functions will not modify your database; they will not change any of the variables that you pass in by reference; they will simply take in some values and return some values, nothing more. Same input values, same output values. We may as well replace every call to these functions with its output and run the program that way. The technical name is referential transparency, which sounds like something you’d hear in court. Let’s look at some pseudocode before I give you PTSD.

Consider this simple example:

function addToCart(item: String, sc: ShoppingCart) returns Nothing {
    sc.add(item)  // mutates the caller's cart in place
}
addToCart('Jawbreakers', oldCart)

This function is as impure as they get. In fact, it’s making me uncomfortable and somewhat nauseous. Not to mention that I fear for my jaw. Why is it impure? Because

  1. It doesn’t return anything
  2. It causes mutation to some external object — it changes its internal state

Consider another example:

function addToCart(item: String, oldCart: ShoppingCart) returns ShoppingCart {
    return new ShoppingCart(oldCart, item)
}
newCart = addToCart('Jawbreakers', oldCart)

What’s the difference? The difference is that we can replace the call to addToCart with its output. For the same input, we will always get the same output. We might as well write the code like this:

newCart = new ShoppingCart(oldCart, 'Jawbreakers')
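For readers who prefer something runnable, the same contrast can be sketched in Python. The frozen `ShoppingCart` dataclass and the function names below are my own illustrative choices and not part of the pseudocode above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShoppingCart:
    """An immutable cart: adding an item produces a new cart."""
    items: tuple = ()


def add_to_cart_impure(item: str, cart: list) -> None:
    """Impure: returns nothing and mutates the list passed in by the caller."""
    cart.append(item)


def add_to_cart(item: str, old_cart: ShoppingCart) -> ShoppingCart:
    """Pure: same inputs always give the same output, and nothing is mutated."""
    return ShoppingCart(old_cart.items + (item,))


old_cart = ShoppingCart(("Gobstoppers",))
new_cart = add_to_cart("Jawbreakers", old_cart)

# Referential transparency: the call can be swapped for its value.
assert new_cart == ShoppingCart(("Gobstoppers", "Jawbreakers"))
assert old_cart == ShoppingCart(("Gobstoppers",))  # the old cart is untouched
```

Freezing the dataclass means an accidental mutation raises an error instead of silently changing state, which is a cheap way of enforcing purity in a language that does not insist on it.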

So this is pretty nifty. And before your blood pressure starts to soar and you start foaming at your mouth mumbling something about “space inefficiency in functional programs,” let me explain why we’re even discussing this.

The fact of the matter is that, with enough finesse, any program, written in any style originally, can be converted into a functional program. Any functional program is a collection of pure functions. So, as a whole, a functional program can be treated as a black box that accepts some values and spits out some values.

Now, does that remind us of something? Black box? Same output for the same input? Neural nets. The argument is that, if there is a kind of programming style that’s the best candidate for migration to Software 2.0, it is functional.

Why is imperative programming an unsuitable candidate? Your functions don’t always return something. External state mutation is an issue too:


Can’t be a black box.

Why is declarative programming an unsuitable candidate? It mutates state:


The stream object enters a new state where it suddenly needs to remember to strip all the white space. Oh and there’s more stuff coming.

Can’t be a black box.

Functional programs are the closest match to the notion of a black box, and hence are the best starting place to begin exploring the migration to Software 2.0.

Now, while there is a multitude of functional languages like Haskell and Lisp, I’m not pretending to be a proponent of them. With diligence, you can rewrite your program in a functional style using a language of your preference, like Scala or Python.

The next step in this argument is to discuss a disciplined way of upgrading your existing software to version 2.0.

Enter: Microservices.

Transition to Software 2.0 Through Microservices

We shall borrow the migration approach from the emerging field of microservices. I hope you’re keeping a count of the buzz words. I’ve lost track, personally.

Traditionally, we would write software as one huge monolith. One huge application with a bunch of modules interacting with each other in somewhat unclear ways. I don’t think I need to explain why it’s a bad idea in some situations: some modules get updates faster than others; sometimes you only want to scale up a part of your application but not other parts; it’s difficult to track what’s talking to what. This is where microservices come in.

The microservice-based approach to software engineering involves breaking up your monolith into mini-applications. You chip away bits of code in a special way (specifically, in a way where closely related code stays together and such) and make each piece into its own mini-program.

These mini-programs (or microservices) then talk to each other to achieve the same goal as the original monolith but in a much more scalable and clean way. Often, a microservice-based architecture involves some messaging queue through which microservices exchange messages with each other (think WhatsApp). The messages don’t go to your friend’s phone directly; they go through a central message exchange server first.

Monolith vs Microservices Architecture

How would we now migrate to this architecture? A common tactic is to chip away some code into a microservice and just monitor it alongside your original code. For a period of time, you feed the microservice the same data that your old code is getting, except you don’t do anything with the microservice’s output. You observe the behaviour of your newly created microservice for a couple of months, and you only deprecate your old code for good once the new output fully matches that of your old code for the same input.
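This shadow-running tactic can be sketched in a few lines. Everything named here (`shadow_run` and the two discount functions, one with a deliberate boundary bug) is hypothetical and only meant to show the shape of the comparison.

```python
def shadow_run(old_impl, new_impl, inputs):
    """Serve results from old_impl; run new_impl on the side and log mismatches."""
    served, mismatches = [], []
    for value in inputs:
        expected = old_impl(value)
        candidate = new_impl(value)  # computed but never served
        if candidate != expected:
            mismatches.append((value, expected, candidate))
        served.append(expected)
    return served, mismatches


def old_discount(total: int) -> int:
    """Hypothetical monolith code: 10% off orders strictly above 100."""
    return total - total // 10 if total > 100 else total


def new_discount(total: int) -> int:
    """Hypothetical microservice with a boundary bug (>= instead of >)."""
    return total - total // 10 if total >= 100 else total


served, mismatches = shadow_run(old_discount, new_discount, [50, 100, 150])
print(mismatches)
```

Users only ever see `served`, so the new code can be wrong for months without anyone outside the team noticing. The old code is retired only once `mismatches` stays empty.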

Monolith Decomposition and Monitoring

But this is interesting. We’re talking about inputs and outputs again. Let’s do something similar for our Software 2.0 migration. Let’s rewrite our old code block in the functional style to make it more black box like and record all the incoming inputs as features and all the outgoing outputs as labels.

Training Dataset Collection

Correct me if I’m wrong, but I do believe we’ve just found a way of collecting an arbitrarily large training set for our future Software 2.0 model. Continue this process and, apart from a potential message exchange and maybe a couple of preprocessing modules, you’ll have migrated your monolith into software fully powered by Deep Learning.
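As a rough illustration of how the recording could work, a decorator can log every call of a pure function as a (features, label) pair. The names `record_examples`, `TRAINING_SET` and `is_uk_number` are assumptions made for this sketch; a real system would write to durable storage, not to a list in memory.

```python
import functools


def record_examples(log):
    """Decorator: append every (inputs, output) pair of a pure function to `log`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            result = func(*args)
            log.append((args, result))  # features, label
            return result
        return wrapper
    return decorate


TRAINING_SET = []


@record_examples(TRAINING_SET)
def is_uk_number(phone: str) -> bool:
    """Hypothetical pure function that is being migrated."""
    return phone.startswith("+44")


is_uk_number("+441234567891")
is_uk_number("(510)668-7059")
print(TRAINING_SET)
```

This only yields clean labels because the wrapped function is pure. If it read from a database or mutated its arguments, the recorded inputs would no longer fully determine the recorded outputs.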

Whichever language the original block of code is written in, I think functional programming can help us migrate it to a state that complies with our black box-like format in a disciplined way. If we slowly start introducing concepts like function purity and immutability into our old code, I think that after several iterations of refactoring, we can start recording the inputs and the outputs. Eventually, we’ll have ourselves a training dataset. What certainly helps is that there is already a wide array of migration techniques for functional programming, and so this should be a path devoid of unexpected surprises.

Data Scientific Considerations: Initial Modelling

What’s left to talk about is the Data Science behind this venture. We’ve talked a lot about code and migration techniques, and there was this bit about jawbreakers, which I still reminisce about with great warmth in my heart, but now we shall discuss how we’re going to set up and train this black box.

A major theme in Machine Learning is representation, and often, how effectively you can solve the problem at hand depends on your chosen representation. A great example is NLP. How much information about a word do you want to capture? Simply its presence? Its normalised frequency? Its semantics? That will determine if you want to go with one-hot encoding, or tf-idf, or word2vec. All of these are representations, which make the training examples more digestible for our learning algorithm. For the purposes of this article, I’ll make a distinction between low-level representation and high-level representation and discuss their merits.
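To give a feel for how much the choice of representation changes what the model sees, here is a small sketch of the two simplest options: presence (one-hot) and normalised frequency. The vocabulary and tokens are made up, and the second function is a plain frequency share, not full tf-idf.

```python
from collections import Counter


def one_hot(word, vocabulary):
    """Presence only: a 1 in the word's slot, 0 elsewhere."""
    return [1 if word == entry else 0 for entry in vocabulary]


def normalised_frequencies(tokens, vocabulary):
    """Share of the document taken up by each vocabulary word."""
    counts = Counter(tokens)
    return [counts[entry] / len(tokens) for entry in vocabulary]


VOCABULARY = ["blue", "button", "form", "red"]
TOKENS = "blue form blue button".split()

print(one_hot("form", VOCABULARY))                 # [0, 0, 1, 0]
print(normalised_frequencies(TOKENS, VOCABULARY))  # [0.5, 0.25, 0.25, 0.0]
```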

We’ll start with low level representation.

As discussed, a functional program in our view is one that accepts some values and returns some values, nothing more. Let’s dig into that statement. What are these values exactly at the most basic level? At the lowest level, if you will. (No, I’m not going to talk about voltages, go away.)

Well, from basic Computer Science we know that these values are simply bits stored in your computer’s registers, and there are funny schemes developed by the IEEE that decide which bits correspond to which parts of the number. For example, the very first bit often tells you if the number is positive or negative. You can learn about these schemes here.

Consider the function below (which may for all intents and purposes be our whole program after converting it to a functional style):

function functional_program(var1: Double, var2: Integer) returns Boolean {
    // some computation
}

This program accepts a Double and an Integer, and then outputs a Boolean after some computation. We, as humans, prefer base ten numbers, but our computer crunches raw binary. In most programming languages, a Double takes up 8 bytes (64 bits), and an Integer takes up 4 bytes (32 bits). Let’s agree that a Boolean takes up 1 bit (even if most implementations insist on 1 byte or more for technical reasons).

Now let’s suppose you give this function numbers 223195843.5 and 25, and you get back True. Under the hood, the CPU sees that you’re giving it

0100000110101010100110110110010110000111000000000000000000000000

00000000000000000000000000011001

and it gives back 1 in response.

For convenience, let’s concatenate the first two sequences to get a single contiguous input sequence.

Now our model maps the sequence

010000011010101010011011011001011000011100000000000000000000000000000000000000000000000000011001

to this sequence:

1
With this view in mind, it seems to me that the most appropriate black box should come from the realm of sequence-to-sequence models.
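If you want to reproduce the bit patterns for the example inputs yourself, Python’s `struct` module will do it. This is a sketch that assumes big-endian byte order and prints the most significant bit first; the helper names `double_bits` and `int32_bits` are mine.

```python
import struct


def double_bits(value: float) -> str:
    """64-bit IEEE 754 pattern of a double, most significant bit first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", value))
    return format(as_int, "064b")


def int32_bits(value: int) -> str:
    """32-bit two's complement pattern of an integer, most significant bit first."""
    (as_int,) = struct.unpack(">I", struct.pack(">i", value))
    return format(as_int, "032b")


features = double_bits(223195843.5) + int32_bits(25)  # 96 input bits
label = "1"                                           # True as a single bit

print(features)
print(label)
```

Each (features, label) pair produced this way is one training example for the bit-to-bit model.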

Sequence-to-sequence models (often built from Long Short-Term Memory networks) are becoming so ubiquitous, I have no doubt you’ve used one before. Researchers are coming to a consensus that the most accurate way to translate text to or from a foreign language is through the use of a sequence-to-sequence model. Google Translate has been using one since 2016. Sentences are sequences of words. Data is sequences of bits.

Alright, suppose we’ve decided to map sequences of bits to different sequences of bits. We concatenate all our variables together because we want a contiguous sequence as input and a contiguous sequence as output instead of having separate sequences for each variable. Suppose we’ve even been clever enough to help our black box differentiate between different variables in the sequence by putting “stop symbols” in between each pair of variables; we could instil a convention whereby we put a 101 between each sequence. Perhaps, we could even use Huffman coding to further massage our representation for the black box. There are many design considerations we can take at this point.
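The stop-symbol convention is easy to sketch, and the sketch also exposes its main weakness: the pattern 101 can occur inside the data itself, so two different inputs can collapse into the same sequence. The function name and sample bit strings are hypothetical.

```python
STOP = "101"  # the separator convention suggested above


def join_with_stops(bit_strings):
    """Concatenate per-variable bit strings with a stop symbol between each pair."""
    return STOP.join(bit_strings)


print(join_with_stops(["0110", "1", "0000"]))  # 011010111010000

# Ambiguity: two variables "1" and "1" look exactly like one variable "11011".
print(join_with_stops(["1", "1"]) == join_with_stops(["11011"]))  # True
```

A production-grade scheme would need either fixed widths per variable or some form of escaping. Whether the model can cope with the ambiguity on its own is an open question.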

Bit Sequence Translator

The main advantage is that this is the most general representation you can come up with. Period. Anything and everything is encoded as bits in your computer, so in theory you should be able to train your model on any type of data imaginable. The best thing is that it’s variable width. If your function expects a String type as argument, you can’t always anticipate its size. However, since we’re using a variable-width sequence representation, we don’t need to worry about whether the user passes in “Goodbye” or “Until the next time we lay our eyes upon each other.” The sequences will have different lengths, and it’s fine. They can still be used for supervised training.

The main disadvantage is that it’s difficult to make sense of such unstructured data. The only modicum of structure we’ve agreed upon is the stop symbols. Your model isn’t going to be too happy about that, and it will show in the evaluation metrics.

Bit sequence translations give you a lot of power — but also a lot of problems.

If you’re fine with having less expressiveness but more structure, perhaps you could use a higher-level representation and go with plain old multidimensional regression.

This time, let’s entertain a much more sane scheme whereby we map fixed-sized vectors to other fixed-sized vectors through regression. It’ll look something like this:

Software 2.0 Through Regression

In this case, it’s somewhat less intuitive as to how the regression model would handle String inputs and other variable-width miscellaneous inputs that can be easily represented as a bit sequence; however, the model is relatively straightforward since we agree on fixed-width inputs. With some care, we can even deal with object inputs (as long as we treat them as key-value containers). We can expand them recursively:

Recursive Object Expansion to Fixed-Width Vectors

The final input vector will have a fixed width of six. This recursive expansion continues until we are left with nothing but primitive types (Integer, Float, Double etc). We can have this as an additional preprocessing step before regressing on the features.
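Here is a minimal sketch of that recursive expansion, assuming objects arrive as nested dictionaries. The `order` object is invented for illustration and happens to expand to six primitives. Sorting the keys is my own design choice so that every field always lands in the same position of the vector.

```python
def flatten(value):
    """Recursively expand nested key-value containers into a flat list of numbers."""
    if isinstance(value, dict):
        flat = []
        for key in sorted(value):  # fixed key order gives fixed positions
            flat.extend(flatten(value[key]))
        return flat
    if isinstance(value, bool):    # checked before int: bool is a subclass of int
        return [1.0 if value else 0.0]
    if isinstance(value, (int, float)):
        return [float(value)]
    raise TypeError(f"unsupported type: {type(value).__name__}")


# A hypothetical object with six primitive fields after expansion.
order = {
    "quantity": 3,
    "price": 9.99,
    "customer": {"age": 42, "premium": True, "address": {"lat": 51.5, "lon": -0.12}},
}

print(flatten(order))  # [51.5, -0.12, 42.0, 1.0, 9.99, 3.0]
```

Strings are deliberately rejected here, which mirrors the limitation mentioned above: variable-width values have no obvious place in a fixed-width vector.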

As for the model itself, we can even start experimenting with vanilla multi-layer perceptron networks — anything that allows us to do regression. It’s unlikely that the regression model will output exactly 0 or 1 for expected Boolean outputs, so we’ll have to implement a natural threshold scheme through the use of a sigmoid function or otherwise. The same technique can be used in regards to Integers through the rounding of outputs.
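The post-processing of raw regression outputs could look roughly like this. The 0.5 threshold is a conventional default and not something the scheme dictates, and the helper names are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def to_boolean(raw_output: float, threshold: float = 0.5) -> bool:
    """Squash the raw output and compare it against a threshold."""
    return sigmoid(raw_output) >= threshold


def to_integer(raw_output: float) -> int:
    """Snap a real-valued regression output to the nearest integer.

    Note that Python's round() sends exact halves to the nearest even integer.
    """
    return round(raw_output)


print(to_boolean(2.3), to_boolean(-0.7))  # True False
print(to_integer(24.8))                   # 25
```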

Data Scientific Considerations: Architecture Search and Afterthoughts

Hopefully you’re still with me. We’re almost done.

We’ve discussed how we’re going to chip away a piece of our code and make it into a pure black box by using techniques from Functional Programming. We’ve discussed how we’re going to collect a training dataset using practices borrowed from Microservice Engineering. We’ve discussed a couple of models of varying degrees of expressiveness that will do the mapping for us. What’s left to discuss is how we’re going to optimise these models and a couple of theoretical considerations.

Undoubtedly, one of the major breakthroughs in Deep Learning of the past years is an automated way of searching for the neural net architecture that would best solve your problem. Let’s be honest, determining how many layers you want to have in your neural net is a black art. You fiddle with it for a bit, adding a few hidden neurons here and a few layers there, each time checking how well your experimental model performs (i.e. you check its fitness), and eventually you may or may not converge on an optimum architecture after much frustration and many broken keyboards. Up until recently, this experimentation has been manual. Now, however, agents searching the infinite space of possible architectures can be made virtual thanks to techniques from Reinforcement Learning. You can now delegate these architecture searches to tools like AutoKeras and AutoML.

As always, a major problem is overfitting. In the world of Software Development, overfitting neatly translates into a concept called memoisation (yes, without the r), whereby a function, instead of producing the result from the ground up, first looks in an answer bank (a memo) to see if it has computed the result for the given input before; if it has, it simply returns it, otherwise it goes through all the predefined steps. Our neural network would be no better than a function with a memoisation mechanism if it just remembered all the mappings. This is a real issue, even if we use an architecture search layer. The architecture search algorithm may find your model to be unsuitable to perform all the requested mappings, so it would add a new hidden layer. And then another one. And another one. And another one. Each addition of a new hidden layer entails adding more weights, which is equivalent to adding more “memory” to our model. Eventually, the model would bloat up into a lookup table. A model that memorises all the mappings is not what Software 2.0 is all about.
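For anyone who hasn’t met memoisation in the wild, here is what it looks like with the standard library’s `functools.lru_cache`. The `slow_square` function is a made-up stand-in for an expensive pure function.

```python
import functools


@functools.lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    """Stand-in for an expensive pure function."""
    return n * n


slow_square(12)  # computed from scratch and stored in the memo
slow_square(12)  # answered straight from the memo

info = slow_square.cache_info()
print(info.hits, info.misses)  # 1 1
```

The memo helps only with inputs it has already seen. An overfitted model is in the same position, except that it has no original function to fall back on when a new input turns up.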

Instead, we want a model that deduces patterns and rules from the supplied dataset. A model that determines if a number is divisible by two is going to take up much less space if it remembers that even numbers end in zero, two, four, six, and eight than a model that stores mappings like (2 -> even, 4 -> even, 6 -> even, 8 -> even, 10 -> even, 12 -> even, ...).
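The difference between memorising and generalising can be shown with the parity example itself. This sketch contrasts a lookup table built from a dozen made-up observations with the last-digit rule; both names are illustrative.

```python
# Memorised mapping: grows with every number it has seen and knows nothing else.
LOOKUP = {n: "even" if n % 2 == 0 else "odd" for n in range(1, 13)}


def parity_by_rule(n: int) -> str:
    """Constant-size rule: only the last decimal digit matters."""
    return "even" if str(n)[-1] in "02468" else "odd"


print(LOOKUP.get(1000))      # None: never seen during "training"
print(parity_by_rule(1000))  # even
```

Both agree on everything in the table, but only the rule has anything to say about a number it has never met.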


And this is it. In this article, we’ve ruminated on the practicalities of implementing Software 2.0. The proposed approach is to

  1. Convert a piece of your software into the functional style
  2. Monitor the code’s inputs and outputs. Record them for training
  3. Choose a low-level sequence-to-sequence model, or a more realistic high-level regression model
  4. Train it on the data from the logs
  5. Productionise it as a microservice
  6. Rinse and repeat for the rest of your Software 1.0

There are many more possible avenues to explore here. It is perfectly possible that there are much more sensible representations than ones I’ve proposed here. Perhaps sequence-to-sequence models don’t deal well with such “unnatural” data as bit sequences. Functional programs can accept other functions as inputs — how do we encode that in the regression model? How do we deal with variable-width inputs in the regression model? Perhaps converting your software into a functional style and removing all mutation is infeasible. Perhaps not.

Another question is, if Software 2.0 is so great, can we use it to produce hyper-efficient approximation algorithms for NP-complete problems?

Without a doubt, this is a fascinating field that requires further inquiry — perhaps even a series of articles by yours truly.