LLMs and Data Science: A transformative landscape awaits…

Melbourne Centre for Data Science
Published in KERNEL-MCDS · Jun 6, 2024

LLMs and Data Science

The Large Language Model (LLM) extends beyond the realms of traditional statistical tools and generative AI technologies, heralding a revolutionary paradigm in software engineering.

It wasn’t until I began to build that I realized the space of all LLM creations was actually more tightly constrained than I thought. Entering this seemingly unlimited space of potential projects, web-apps and money-making schemes can be dizzying, but only through experimentation do you begin to feel out the framework (the LLM OS) within which many good ideas lie.

This article delves into how the LLM transcends its roots in data science to redefine software development, while sharing reflections on how I recalibrated my software engineering mindset to navigate this transformative landscape.


The LLM OS

It is often forgotten that the field now known as computer science was originally termed ‘data science’ [1]. This historical perspective is important when considering the role of the LLM in the landscape of software. Data remains central to software. Writing good applications, particularly on the web, is fundamentally about the efficient storage and manipulation of data: text messages, images and videos, currency and transactions, likes and dislikes.

It would be inaccurate to claim that the LLM will reunite Data Science and Software Engineering. Nevertheless, to discard the LLM as a statistical toy and demarcate it from true software engineering would be limiting. Like most useful science originating in a ‘lab’, we should eagerly track its transition from theoretical science into practical engineering.

Andrej Karpathy is rapidly gaining recognition as a leading contemporary computer scientist. With hands-on experience developing innovative technologies at Tesla and OpenAI, he is now focusing more on pioneering through experimentation and intellectual leadership.

One of Karpathy’s most fundamental ideas is that the LLM will become a central component of software engineering, a new paradigm he calls the ‘LLM OS’. Karpathy posits the LLM as a pivotal component in the evolving landscape of modern software architecture, serving at the same level of importance as the database or the web server.

The Details

Before any good engineering can be done with the LLM, it is important to critique it, beginning with its first principles.

Next word prediction


LLMs thrive on colossal datasets — language corpora that provide the raw material from which they learn. LLMs read trillions of words, and eventually, they learn to predict the next one. They do not reason, perform arithmetic, or answer questions. Their billions of parameters go towards optimizing one single function: predicting the next word, given the previous ones [2]. It is as a ‘side effect’ of predicting the next word that these emergent behaviours are realised. Here is a fiery Reddit post I stumbled across, which I think explains it best (in response to the question ‘When will LLMs figure out mathematics?’) [3].

‘…It [the LLM] does not explicitly model reasoning nor any part of the world, but since language itself has been used to describe reasoning and systems in the real world, it manages to capture the shadows of them in its probability density functions…’
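To see the mechanism stripped of all abstraction, here is a minimal sketch, assuming the Hugging Face transformers library and the small GPT-2 model. It does nothing but score every candidate next token and pick the likeliest:

```python
# A minimal sketch of next-word prediction, assuming the Hugging Face
# transformers library and the small GPT-2 checkpoint are available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The model's single job: a score for every possible next token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))  # likely ' Paris'
```

Everything else — question answering, ‘reasoning’, conversation — is layered on top of this one call.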

Hallucination

Many do not realize the degree of unstructuredness in the data from which these LLMs are bred, and when hallucinations inevitably arise, they are quick to blame those responsible for training the LLMs; or even worse, the LLM itself! I do not personally believe this blame should be dismissed — it plays an important role in holding the scientists and users of the LLM accountable. But we should be careful not to misdirect our anger. Best put colloquially by a dear friend (and Pediatrician), ‘it is a bloody LLM… it has no brain. It is using predictive algorithms to try and sound smart!’ [10]

For added perspective, another legendary quote from Karpathy goes:

‘I always struggle a bit when I’m asked about the “hallucination problem” in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.

We direct the LLM’s dreams with prompts. The prompts start the dream, and based on the LLM’s hazy recollection of its training documents, most of the time the result goes someplace useful.

It’s only when the dreams go into deemed factually incorrect territory that we label it a “hallucination”. It looks like a bug, but it’s just the LLM doing what it always does.’

In engineering terms, hallucination is neither a bug nor a ‘side effect’, but the strict function of the LLM.

Compute

The architectural breakthrough behind LLMs is the transformer, introduced by Google researchers in the paper ‘Attention Is All You Need’ [4]. Put simply, this rethink in design allows the model to ‘read’ a whole passage simultaneously, rather than word by word (as if you were to read every sentence of this article at the same time instead of top to bottom). Older models, like the Long Short-Term Memory (LSTM) network, did have to read top to bottom. This shift not only elevated the model’s ability to ‘understand’ language, but, by eliminating the temporal constraint, allowed corpora of text to be processed in parallel, as the toy sketch below illustrates.
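Here is a NumPy sketch of the scaled dot-product attention at the heart of the transformer (toy sizes, random data): every position is scored against every other in a single matrix multiply, with no word-by-word loop.

```python
# Toy scaled dot-product attention: the whole sequence in one shot.
import numpy as np

def attention(Q, K, V):
    """Every position attends to every other in one matrix multiply."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq, seq) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted blend of values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                                # toy sizes
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)                        # (5, 8): all positions at once
```

There is no loop over positions anywhere: that absence is exactly what lets GPUs chew through entire corpora in parallel.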

Thus, the real power of transformers is actually less flashy — it lies in their orchestration with modern GPUs and cloud computing infrastructure, which enables platforms like ChatGPT to deliver deceptively effortless user interactions. LLMs are the confluence of beautiful statistical models and the legendary compute which enables their predictions.

Indeed, we know in retrospect that it was not necessarily Marvin Minsky’s XOR takedown of AI [11] that was responsible for the AI winters, but compute. And today, it is the scale afforded by the modern GPU which gives rise to the titan of a machine that is the LLM. Thus, it is energy, not compute, that will be the main bottleneck to AI’s progress in the future [5].

Reconstruct

In my previous examination, it was necessary to demean the LLM and to expose its internal engine: a function which accepts text (input) and returns the text most likely to follow (output). Any subsequent ‘intelligence’ is merely emergent behaviour, and all subsequent utility is merely an abstraction. So, in order to engineer the LLM into systems able to serve more inputs and outputs, it is necessary to nullify one’s expectations.

For example, even ChatGPT, a question-answer service, is just an exploitation of this ‘text continuation’ mechanism. Just try to build your own chatbot — you will quickly realise that you must re-insert the entire conversation history each time, just to get the next reply. And once you accomplish this, be cognizant that there are still many more hidden abstractions enabling your chatbot. No LLM comes ‘out of the box’ with the ability to answer questions, or to do anything other than predict the next word.
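Here is a minimal sketch of that loop, assuming the OpenAI Python SDK with an API key in the environment (the model name is illustrative). The ‘memory’ of the conversation lives entirely in the list we re-send on every turn:

```python
# A minimal chatbot loop: the LLM itself is stateless, so the entire
# history is re-sent on every turn. Assumes the OpenAI Python SDK and
# an API key in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    history.append({"role": "user", "content": input("> ")})
    # 'Conversation' is an abstraction we maintain client-side.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    print(reply)
```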

Only once we have understood all of the above may we permit ourselves to embrace these abstractions and harness the extraordinarily useful tool that is the LLM!

The LLM OS (Again)


If you have read this far, you have stuck by the LLM at its worst, so now you deserve it at its best. A common pattern emerging in LLM applications is that the LLM acts as an operating system. It executes instructions, determines control flow, and handles data storage (all of this, just from predicting the next word!) [6].

If you have, as in the previous section, successfully distilled your interpretation of the LLM down to a next-word predictor, it shouldn’t come as a surprise that many technologists (Karpathy and Andreessen, to name a couple) believe it inaccurate to think of LLMs as mere ‘chatbots’.

To Karpathy, the LLM may become the ‘kernel process of an emerging operating system, coordinating resources such as memory or computational tools for problem solving’ [7]. This is a large upwards climb in abstraction, so bear with me.

We can build our mental wireframe by first drawing on the equivalences. Most of these equivalences lie in the analogy of the ‘memory hierarchy’. If the LLM is like the kernel process of an operating system, then…

Disk is to the Internet (accessible through browsing)

RAM is to the context window (the maximum number of tokens the model can attend to when predicting the next one)

I/O is to RAG [8], function calling, and multimodality (images, video, audio); see the sketch at the end of this analogy

The analogy continues effortlessly into the emergence of proprietary vs open-source LLMs…

Proprietary: Windows and macOS are to OpenAI’s GPT and Anthropic’s Claude

Open-source: Linux is to LLaMA, BERT, etc. [9]

Then there are the corollaries, such as the idea of user space vs. kernel space applying to LLMs too. What parts of the LLM’s I/O should the user have privileged access to? For example, an enterprise LLM should never leak its internal system prompts.
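To make the I/O arm of the analogy concrete, here is a hedged sketch of function calling with the OpenAI Python SDK: the kernel (LLM) requests a syscall (tool call), our code executes it, and the result re-enters the context. The search_docs tool, its schema, and the model name are illustrative assumptions, not a prescribed design.

```python
# The LLM as kernel: it requests a tool call, we execute it, and the
# result is fed back as new context. search_docs is hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",  # hypothetical RAG-style retrieval tool
        "description": "Search internal documents for relevant passages",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What does our refund policy say?"}]
response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=tools
)
msg = response.choices[0].message

if msg.tool_calls:  # the LLM chose to 'read from disk' rather than answer
    call = msg.tool_calls[0]
    query = json.loads(call.function.arguments)["query"]
    passages = "Refunds are accepted within 30 days."  # stand-in retrieval
    messages += [msg, {"role": "tool", "tool_call_id": call.id,
                       "content": passages}]
    final = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```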

Building towards the LLM OS

To be honest, I didn’t truly understand the motivation behind the LLM OS, nor how to apply it to my own work, so I set out on a reckless journey of trial and error…

Trivial LLM applications utilize the LLM as the core of the product rather than as the kernel process. Examples include ChatGPT or GitHub Copilot. These applications are lucrative, but their transparency makes them highly competitive territory.

I. So, I entertained the low-hanging fruit first — products which purely leveraged the LLM’s text generation and conversational abilities. This included applications with a similar flavour to Duolingo, or creative endeavours such as writing books. Yet, the end result always seemed to lack substance and fell short of actually being useful.

II. Letting my commercial brain rest for a second, my first creative LLM application was a small macro which sent an automated message to my girlfriend every day. Unbeknownst to me, this first attempt had landed very close to a realisation of an LLM OS. It pulled emojis and text into memory, augmented its generation using a weather-fetching API, directed peripheral devices (my MacBook) to send messages, and integrated with software tooling (Scheduler) to execute more than once. (A loose sketch of this pattern follows the list.)

III. Mistakenly believing that I needed to hone my understanding of the fundamentals, I attempted to operate closer to the underlying weights and biases, fine-tuning GPT-3 on my personal diary.

IV. I continued on this wild tangent, simulating a neural net with water simulations.

V. Getting back on track, I used OpenAI’s new function calling feature (another abstraction) to help me redesign my bedroom.

VI. At ML/AI Hack 2023, I got closer, using an LLM to interface with the Public Transport Victoria API.

VII. At UniHack 2024, I started to nail it, embedding an LLM within a project management suite to communicate with team members and prioritize tasks across the project, and taking home 1st place.

VIII. Eventually, I teamed up with a startup, Lyrebird Health — a medical data software provider — as the lead machine learning engineer, applying what I knew about the LLM OS.
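For the curious, here is a loose reconstruction of the pattern in (II), not the original code: the weather endpoint, the AppleScript snippet, and the prompt are all assumptions standing in for whatever I actually used. Notice how it already exercises each layer of the LLM OS: memory, I/O, peripherals, and scheduling.

```python
# A loose reconstruction of the daily-message macro, not the original code.
# Weather endpoint, AppleScript, recipient, and prompt are all assumptions.
import subprocess
import requests
from openai import OpenAI

client = OpenAI()

def fetch_weather() -> str:
    # I/O: pull fresh context from the outside world (any weather API works)
    return requests.get("https://wttr.in/Melbourne?format=3", timeout=10).text

def compose_message(weather: str) -> str:
    # The kernel: generation augmented with the retrieved weather
    prompt = ("Write a short, warm good-morning text with an emoji. "
              f"Work in today's weather: {weather}")
    return client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def send_imessage(body: str) -> None:
    # Peripherals: drive Messages.app on macOS via AppleScript
    script = f'tell application "Messages" to send "{body}" to buddy "Partner"'
    subprocess.run(["osascript", "-e", script], check=True)

if __name__ == "__main__":  # scheduled daily via cron/launchd
    send_imessage(compose_message(fetch_weather()))
```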

If you had known about the App Store in 2008, you might have predicted the explosion of app-driven businesses, new models of software distribution, and the shift toward a mobile-first computing environment. Understanding the LLM OS allows us to foresee a similarly transformative paradigm shift in the next generation of software.

Footnotes

[1] History of Data Science, Forbes

[2] word ≈ token

[3] This Reddit comment

[4] Attention is All You Need

[5] Mark Zuckerberg, Dwarkesh Patel

[6] Weng, 2023; Shen et al., 2024

[7] Intro to Large Language Models — Andrej Karpathy [YouTube]

[8] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[9] Note that many ‘open source’ models aren’t truly open source, and instead are more akin to “tossing over a binary for an operating system.” — Andrej Karpathy, at a Sequoia Capital talk

[10] Dr. Kim Drever

[11] The New XOR Takedown

— Justin Lee


Melbourne Centre for Data Science
KERNEL-MCDS

Where stories about data science are written by our Researchers, Associates, Investigators and Ph.D. students. Visit us at: https://science.unimelb.edu.au/mcds