AOAI Builders Newsletter Issue #11

Ozgur Guler
Published in Microsoft Azure · 9 min read · Jan 31, 2024

We are in a completely different LLM landscape.

It is looking more likely that data is not the moat. Frontier models are being trained on so much high-quality data that their zero-shot performance exceeds fine-tuned models even in vertical domains (e.g. Microsoft’s Medprompt paper suggests that with proper prompting, GPT-4 exceeds fine-tuned models).

LLM value is being compressed into the best model(s), creating a “winner takes all” dynamic in the model space. OpenAI’s head start and the workflow stickiness Microsoft Copilots provide across Microsoft’s huge SaaS portfolio seem to have taken hold, although others are filling in gaps in the LLM tech stack too.

Evaluation is becoming a big problem. With most LLM training data contaminated with eval data, and new models eager to take the spotlight by claiming to approach GPT-4 performance, sometimes with rigged evals, crowdsourced platforms like LMSYS’s Chatbot Arena are the best we have for evaluating foundation models. That said, arena evaluation seems better suited to testing alignment than to rigorous LLM benchmarking.

Agency is becoming big… The long-awaited shift to System-2 thinking implies we will move on from today’s static LLM “flows”, like the ones we build in Azure ML, towards teams of “agents” that can interact with their environments, cooperate to solve problems, and in the meantime adapt and learn. Microsoft’s AutoGen 2.0, which I cover in this issue, is an elegant first step towards agency with LLMs.

In the realm of app development, the brightest minds, once dedicated to capturing human attention (a feat that appears to have been mastered), are now turning their expertise towards hacking human intimacy. This trend is evident in the rising popularity of “virtual companion” apps such as character.ai and Codeway’s Genesia, reminiscent of concepts seen in films like “Ready Player One” and “Blade Runner 2049”, where holographic partners and immersive, interactive features blur the lines between reality and the digital world.

No matter how the future unfolds, it is exciting and thrilling. Enjoy the ride!

Evolving Cloud Economics: The New Era of MaaS

midjourney — openai cloud scale

Last week, OpenAI announced a reduction in pricing for their models. The cost of GPT-3.5 Turbo has decreased to a quarter of what it was just 10 months ago. We don’t have visibility into the hardware and software innovations deployed by OpenAI. However, it is worth mentioning one key advantage OpenAI / Azure OpenAI has: larger batch sizes. Essentially, this is the application of “economies of scale” to Large Language Models (LLMs).

For LLMs, “batch size” refers to the number of input tokens processed in parallel in one forward pass of the network. The batching process aggregates these inputs into one very large matrix and feeds it through the computational graph of the neural network. Larger batches mean a larger proportion of a GPU’s parallel processing capability is actually utilized. This decreases the ratio of overhead (such as memory transfers and kernel launches) to actual computation, reducing the time required per token processed.
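To make the intuition concrete, here is a minimal NumPy sketch (a toy illustration, not OpenAI’s serving stack): stacking many requests into one matrix and running a single large matrix multiply amortizes per-call overhead compared to looping over requests one at a time.

```python
import time
import numpy as np

d_model, n_requests, seq_len = 1024, 64, 128
W = np.random.randn(d_model, d_model).astype(np.float32)           # one "layer" weight matrix
requests = [np.random.randn(seq_len, d_model).astype(np.float32)   # token embeddings per request
            for _ in range(n_requests)]

# Unbatched: one matmul call per request, so per-call overhead is paid n_requests times
t0 = time.perf_counter()
outputs_sequential = [x @ W for x in requests]
t_sequential = time.perf_counter() - t0

# Batched: stack all requests into one big matrix and do a single matmul
t0 = time.perf_counter()
batch = np.concatenate(requests, axis=0)        # shape: (n_requests * seq_len, d_model)
outputs_batched = batch @ W
t_batched = time.perf_counter() - t0

print(f"sequential: {t_sequential * 1e3:.1f} ms, batched: {t_batched * 1e3:.1f} ms")
```

On a GPU the gap is far larger than on CPU, because kernel launch and memory-transfer overhead dominate small workloads.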

This efficiency is particularly beneficial for Model as a Service (MaaS) offerings. Providers like OpenAI / Azure OpenAI, with their high volume of user requests, can aggregate more tokens into larger batches. This reduces per-token computational cost, making LLMs via MaaS more advantageous for both providers and users. Along with the cost benefits, MaaS offerings also bring added value such as security and reliable service-level agreements.

Azure AI Studio already offers LLaMA 2 as a MaaS service and plans to include Mistral models and Jais soon.

Future Outlook — Adaptive Computing, Sparse Models & LLM Routers

In a recent interview, Sam Altman highlighted the potential of “adaptive computing” for LLM inference as a future direction. Adaptive computing works by prioritising the processing of the more important tokens (as established by self-attention scores), potentially through dedicated hardware.

Sparse models, where not all of an LLM’s parameters are activated for each query (sparse activation), are also becoming popular with the recent rise of Mixture of Experts (MoE) architectures such as Mixtral from Mistral. (There are already signs that GPT models may be MoEs too; see the papers section.)

LLM routers can dynamically route between multiple models. Martian is a startup that recently came out of stealth and does “model mapping” to optimise LLM inference costs.
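The routing idea can be sketched in a few lines of Python (the model names, threshold and complexity heuristic below are illustrative assumptions, not Martian’s actual method): cheap requests go to a smaller model, hard ones to a stronger one.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative, price-ordered model pair; real routers learn this mapping from data
CHEAP_MODEL, STRONG_MODEL = "gpt-3.5-turbo", "gpt-4"

def estimate_complexity(prompt: str) -> float:
    """Crude stand-in for a learned difficulty predictor (hypothetical heuristic)."""
    signals = ["prove", "step by step", "refactor", "legal", "diagnose"]
    return min(1.0, len(prompt) / 2000 + 0.3 * sum(s in prompt.lower() for s in signals))

def route(prompt: str, threshold: float = 0.5) -> str:
    # Pick the cheapest model that is likely good enough, then call it
    model = STRONG_MODEL if estimate_complexity(prompt) > threshold else CHEAP_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("Summarise this paragraph in one sentence."))
```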

Matryoshka Embeddings

midjourney — Openai Matryoshka embeddings

Last week OpenAI announced two new embedding models: a smaller and highly efficient text-embedding-3-small model, and a larger and more powerful text-embedding-3-large model. What is interesting about the new embeddings is that they can be “shortened”: e.g. if a vector DB only supports embeddings up to a certain length, you can shorten them without losing significant performance…
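In practice there are two ways to get a shortened vector: request one directly via the dimensions parameter of the embeddings API, or truncate the full vector yourself and re-normalize it. A minimal sketch with the OpenAI Python SDK (the 256-dimension target is an arbitrary example):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
text = "Matryoshka embeddings can be shortened with little quality loss."

# Option 1: ask the API for a shortened vector directly
short = client.embeddings.create(
    model="text-embedding-3-large", input=text, dimensions=256
).data[0].embedding

# Option 2: take the full vector, truncate it, and re-normalize to unit length
full = np.array(
    client.embeddings.create(model="text-embedding-3-large", input=text).data[0].embedding
)
truncated = full[:256]
truncated = truncated / np.linalg.norm(truncated)

print(len(short), truncated.shape)
```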

The text-embedding-3 models are built on MRL (Matryoshka Representation Learning), published at NeurIPS 2022. (OpenAI PM Owen Campbell-Moore confirmed on X that text-embedding-3 is based on MRL and that an updated blog post will replace the existing one.)

The central idea of MRL is to learn representations that contain information at various levels of detail. Just as Matryoshka dolls have smaller dolls nested within, MRL embeddings pack multiple layers of information within a single high-dimensional vector. Each “layer” of this embedding corresponds to a different level of detail or granularity.

The adaptability of MRL means that a single representation vector can be used for multiple tasks with varying computational and statistical requirements. For instance, a lower-dimensional segment of the vector can be used for simpler tasks, while more complex tasks might utilize the full high-dimensional representation.
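One way to exploit this nesting is adaptive, coarse-to-fine retrieval: shortlist candidates with a cheap low-dimensional prefix, then re-rank the shortlist with the full vectors. Below is a hedged sketch with random stand-in vectors (real embeddings would come from text-embedding-3-large; the dimensions and shortlist size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d_full, d_short, n_docs = 3072, 256, 10_000

# Stand-ins for unit-normalized document and query embeddings
docs = rng.standard_normal((n_docs, d_full)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.standard_normal(d_full).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: coarse scoring with the cheap 256-dim prefix of every vector
coarse_scores = docs[:, :d_short] @ query[:d_short]
shortlist = np.argsort(-coarse_scores)[:100]

# Stage 2: re-rank only the shortlist with the full 3072-dim vectors
fine_scores = docs[shortlist] @ query
top10 = shortlist[np.argsort(-fine_scores)[:10]]
print(top10)
```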

Usually, the shorter the embedding the better: longer vectors require more vector DB storage as well as more compute during vector search. A shortened text-embedding-3 vector will likely reduce your Azure AI Search vector DB costs significantly, so it is advisable to move to text-embedding-3.

From Prompt Engineering to FlowEngineering with Microsoft’s Agency Framework AutoGen2.0

midjourney — openai agents

We are moving from prompts to flows…(PromptFlow was ahead of its time it seems).

Let’s start with the flow itself… Using the flow definition from Azure ML PromptFlow, a flow is an executable workflow that streamlines the development of your LLM-based AI application. A flow is a DAG (directed acyclic graph) for data flow and processing within your application. In PromptFlow / AI Studio you can use different kinds of tools, for example LLM, Python, Serp API, Content Safety, etc., to build a “flow”. In that sense a conventional flow resembles a “microservices” architecture, where individual services come together to form an app.

Ok, so what is different with AutoGen? With AutoGen, each element in a flow can be an “agent”. An agent is an entity that can interact with its environment, learn, and adapt to carry out a task. This means every component, or “agent”, within a flow is not only capable of performing its designated task but can also make autonomous decisions, evolve, and learn new tasks over time. Look at the example below…


When you install and run AutoGen Studio (a UI wrapping the autogen library: pip install autogenstudio, then run autogenstudio ui --port 8081) and ask for a specific task to be carried out, AutoGen creates the code, runs it, and when it detects problems with the environment it installs the required packages and re-runs the code. This is a basic example of multiple agents interacting to carry out a task. This agency empowers each agent to dynamically adapt and respond to changes in the workflow or the environment, making the entire system more versatile and intelligent.
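Underneath the Studio UI, the same two-agent pattern takes only a few lines with the autogen library. The sketch below is illustrative; the model name and API key in llm_config are placeholders you would replace with your own Azure OpenAI or OpenAI settings.

```python
# pip install pyautogen
from autogen import AssistantAgent, UserProxyAgent

# Placeholder config: swap in your own model deployment and key
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# LLM-backed agent that plans and writes code
assistant = AssistantAgent("assistant", llm_config=llm_config)

# Proxy agent that executes the generated code locally and reports results back
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The two agents converse (write code, run it, fix errors) until the task is done
user_proxy.initiate_chat(
    assistant,
    message="Plot NVDA and MSFT stock price change YTD and save it as stocks.png.",
)
```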

In this way, AutoGen brings together LLMs, human inputs, and tools (functions or APIs). However, an AutoGen workflow is not a DAG per se but a multi-agent conversation in which each “self-starter” agent takes on tasks, does its part, and talks to other agents. An AutoGen workflow is therefore a very high-level, declarative task definition compared to a PromptFlow DAG, where we explicitly define how the app should work and in which order tasks should execute.

Below is Satya Nadella’s vision of the future from a recent interview: “…I think this idea that people will have agents, these agents will interoperate with each other, there will be some type of super apps that whoever cracks, there’ll be a few runtimes where naturally people will gravitate to, which will be these multi-agent frameworks.”

Andrej Karpathy is a supporter of agency in LLM apps too…


“…having AI systems that can interact with each other in pursuit of their owner’s goals, employing social skills such as cooperation, coordination, and negotiation while doing so. I remain convinced that this dream must be a central part of the future of AI.”

— Michael Wooldridge, Professor and Head of the Department of Computer Science at the University of Oxford and Program Director for AI at the Alan Turing Institute in London

Silly ML Genre


There is a new genre of non-fiction books emerging that recounts an endless list of horror stories about algorithmic biases and injustices with tragic consequences, offering no solution or workaround, and leaving you chilled as if after a ghost story. ML systems perpetuate human biases, algorithmic biases abound, and ML systems in general are not interpretable or fair. (Kudos to Anthropic for trying to reverse-engineer neural networks through mechanistic interpretability.)

Ok, so what’s a better use of your time? Roger Grosse from Anthropic is delivering an AI alignment course, CSC2547: AI Alignment, at the University of Toronto, with slides and reading material open to everyone.

Interesting (& speculative) reads…

  • Non-determinism in GPT-4 may imply it is an MoE model too [link]. This post argues that GPT models being non-deterministic even at temperature=0 cannot be explained by non-deterministic CUDA floating-point operations alone; an MoE architecture better explains the speed, the removal of logprobs, and the non-determinism, based on the original MoE paper.
  • Self-rewarding Language Models
  • Sleeper Agents: Training deceptive LLMs that persist through safety training. This paper raised further concerns with LLM safety, showing that we can train models to have backdoors that, when triggered, switch from writing safe code to inserting code vulnerabilities. These backdoors are resistant to alignment training, and this robustness increases with model scale. They can be deceptive too, looking aligned during training while serving misaligned goals in deployment, although the reasons are not fully understood.

GPU church with 4k H100s in Barcelona — MareNostrum4

ChatGPT gives better results when you tip!

The GenAI space is moving at warp speed, and it’s like trying to drink from a firehose! We hope this newsletter helps you sift through the flood and catch the highlights.

Your thoughts? We’d love to hear what you think and if there’s anything we can do better. Don’t forget to subscribe here

Some of my earlier write-ups on Medium

(Disclaimer: The views expressed in this newsletter are solely my own and do not represent those of my employer or any other organization.)

Originally posted on LinkedIn…
