Stories by AdMor on Medium

5 golden rules to deliver a successful GenAI application

AdMor — Tue, 17 Jun 2025 10:26:40 GMT

Deploying a GenAI app should look like this to you, generated with Flux.schnell

You may have a dozen years of experience in the Machine Learning industry, but GenAI applications are only starting to appear on the market.
While building some myself, I discovered a few patterns worth mentioning. So here are the 5 golden rules for building great GenAI applications.

🎳 1 — Defining what you are trying to solve has never been more important

This has always been true in Machine Learning, but the open-ended generation space of GenAI makes these models harder to control.
Being clear about what you want to achieve leads to better performance.

Take a question like : how does Gutenberg’s printing press work ?
You can produce a concise answer in 2 lines or a fully detailed guide in 20 pages.
And a smarter Claude or GPT won’t make this choice for you.

So you have to be very precise when defining the inputs and outputs of your AI agent-based system, especially the outputs.
The output definition needs to be understandable by a human so that it can actually be implemented by an LLM.

Let’s say you are building the best customer support agent.
In your chat, you decide that the agent should be joyful.
What exactly does that mean?
It could be many things: making jokes, being friendly, including little encouraging phrases for the user, using emoji.
Only you and your team can choose the right set of rules.

However, this responsibility does not necessarily belong to data scientists or ML engineers.
It should fall to domain expert teams, such as product or business teams.
They usually know best what the product experience should be, exactly like product specifications in traditional engineering.

One thing is different however : specifying an “intelligent” system is much harder than specifying traditional software, and it will take time.
It will probably be an iterative effort while you learn about your problem space.

Here are some great references to get started or get inspiration for your use case :

A great introduction video from Galileo : they give 2 examples of evaluation in the coding assistant and generic agents space
KPIs that you need to define for your GenAI use case : https://cloud.google.com/transform/gen-ai-kpis-measuring-ai-success-deep-dive
An example of metrics used to measure the impact of code assistants on productivity inside a company
A study on GenAI use cases in the procurement industry
The most surprising use case I’ve encountered is the AI-optimized Cheetos, even though this is not GenAI.

Once you have defined what you want your GenAI product to look like, your ML team can start implementing offline metrics.

📏 2 — Make these definitions measurable with dozens of metrics

Having a metric for each of the criteria defined above lets you know whether you are achieving your mission.
By metrics, I mean offline metrics that will be measured on a dataset representative of the problem you are tackling.

Most probably, you will rely heavily on LLM as a judge.
Why ? Because traditional metrics capture semantic similarity poorly, and a well-tuned LLM judge can come close to a human review.

There are great resources out there to learn how to use LLM as a Judge to evaluate your problem :

Ragas is a great start : text comparison and rubric-based scoring can be implemented within minutes.
Google Cloud also provides a great list of off-the-shelf LLMaJ metrics : verbosity, safety and many more.
New metrics are also appearing, like BLANC : it specializes in summary usefulness.

Let’s take an example

Use case : you have a chatbot answering your users for Customer Support, and you need to define what tone it should use.
For instance, for a cordial tone, you might specify:

> always start with a greeting,

> address the person by their first name,

> always thank them at the end of the conversation,

> be able to handle angry users by showing empathy

> emoji usage is forbidden

This initial definition of cordiality, with these five criteria, can then be implemented as a set of “LLM as a judge” tools that can measure these criteria in your agent’s replies.

The implementation can be simple once you manage to define these measurements.

Example: implementation of “Always start with a greeting”
Implementation : Ragas Rubric
Rubric levels :
0 : The agent does not say hello to the user and answers directly
1 : The agent says hello but does not add a reassurance element before answering
2 : The agent says hello and adds a sentence of reassurance before answering the question

With dozens of these implementations, you can now measure whether your AI agent does a good job at serving your users.
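To make the rubric above concrete, here is a minimal sketch of a rubric-based judge in plain Python. It is not the Ragas API: `build_judge_prompt`, `grade` and `fake_judge` are hypothetical names, and `call_llm` stands for whatever client you use to reach your judge model.

```python
from typing import Callable

# Rubric levels taken from the example above.
GREETING_RUBRIC = {
    0: "The agent does not say hello to the user and answers directly.",
    1: "The agent says hello but does not add a reassurance element before answering.",
    2: "The agent says hello and adds a sentence of reassurance before answering the question.",
}


def build_judge_prompt(reply: str, rubric: dict[int, str]) -> str:
    """Turn a rubric into a grading prompt for a judge LLM."""
    levels = "\n".join(f"{score}: {text}" for score, text in sorted(rubric.items()))
    return (
        "Grade the agent reply with the rubric below.\n"
        f"Rubric:\n{levels}\n\n"
        f"Agent reply:\n{reply}\n\n"
        "Answer with the score only."
    )


def grade(reply: str, rubric: dict[int, str], call_llm: Callable[[str], str]) -> int:
    """Ask the judge for a score and reject anything outside the rubric."""
    raw = call_llm(build_judge_prompt(reply, rubric)).strip()
    score = int(raw)  # raises ValueError if the judge did not answer with a number
    if score not in rubric:
        raise ValueError(f"Judge returned an unknown score: {raw!r}")
    return score


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your provider's client)."""
    return "2"


score = grade("Hello Anna! No worries, we will sort this out.", GREETING_RUBRIC, fake_judge)
```

Rejecting scores outside the rubric is a deliberate choice: a judge that answers with free text should fail loudly rather than pollute your metrics.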

👸 3 — If data is king, synthetic data is your prince

To measure how well your AI agent scores on your freshly defined metrics, you need data. And a lot of it. But if you build a new product, you don’t have any…

One of the revolutions around GenAI is that you can generate your own dataset — and this is called synthetic data — but it’s still a process that is more or less custom for each project.
So you need to find the right way to generate this synthetic data.

The simplest approach is usually to ask an LLM to generate one row of your dataset, and then reuse that sample to test your AI agent.
It might sound crazy, especially since you can use the same LLM to create the problem and to answer it, but it is actually a great baseline.
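As a rough sketch of this baseline, the snippet below crosses a few topics and user moods and asks an LLM for one message per combination. The function name, the prompt wording and the `call_llm` callable are all assumptions for illustration; plug in your own client and schema.

```python
import itertools
from typing import Callable


def generate_dataset(
    topics: list[str],
    moods: list[str],
    call_llm: Callable[[str], str],
) -> list[dict]:
    """Generate one synthetic user message per (topic, mood) combination."""
    rows = []
    for topic, mood in itertools.product(topics, moods):
        prompt = (
            f"Write one customer support message from a {mood} user about {topic}. "
            "Return only the message."
        )
        rows.append({"topic": topic, "mood": mood, "question": call_llm(prompt)})
    return rows


# Stub LLM so the sketch runs offline; a real call would return generated text.
dataset = generate_dataset(
    topics=["late delivery", "refund"],
    moods=["angry", "polite"],
    call_llm=lambda prompt: f"[generated from: {prompt[:40]}...]",
)
```

Keeping the topic and mood next to each generated question makes it easy to slice your metrics later, for instance to check how the agent handles angry users.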

However, once you go further than this basic version, things start to get more complicated. There’s a lot of research being done on these topics right now.

There are approaches to transform non-GenAI-oriented datasets into, say, conversational datasets : AgentClinic transforms a QA dataset into a conversational one, and YourBench lets you generate more of a dataset.
Other approaches create LLM traces for fine-tuning from datasets not originally meant for that purpose : TxAgent creates guided LLM traces that are later used to fine-tune an LLM for their specific use case.

You should spend a lot of time on this step. These different approaches can give your application the edge it needs to be good enough to collect real user data.

👨‍⚖ 4 — Get an expert-level rating with your LLM as a Judge

Up to here, we assumed our metrics were perfect.
This is a common assumption in traditional ML : there is nothing to review about the definition of precision or recall.
But LLM judges are based on a prompt, so there is no guarantee that your metric computation will work exactly as you’d like.
So it makes sense to look at the alignment between an LLMaJ and a domain expert review.

Here are a few things that you may need to fix in your LLMaJ implementations :

Non-deterministic : 2 runs on the same data can give different grades.
Narcissistic Bias: LLMs may favor the answers generated by themselves.
More is More: LLM judges tend to prefer more verbose texts over more concise ones.
Not-so-Fine-Grained Evaluation Scores: LLMs can be reliable judges when making high-level decisions. However, as the scoring scale becomes more detailed with finer intervals, LLMs are more likely to produce arbitrary scores.
Position Bias: When used for pairwise comparisons, LLM judges may favor the answer in a given position.

All these elements can make the judge deviate from what a domain expert would have chosen.
It can be useful to measure this gap as a sanity check, but also to tune the judge as you would a normal AI agent.

Once you reach a sufficient level of correlation between human and LLM ratings, you have successfully automated your domain expert and can run robust experiments to improve your AI agent.
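One possible way to quantify that gap is to compare the judge's grades with an expert's grades on the same samples. The sketch below computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) in plain Python; the grades are made-up values for illustration, and kappa is my choice of statistic, not the only valid one.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Share of samples where the judge gives the same grade as the expert."""
    if len(human) != len(judge) or not human:
        raise ValueError("Both lists must be non-empty and have the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up rubric grades (0, 1 or 2) on six replies.
expert_grades = [2, 2, 1, 0, 2, 1]
judge_grades = [2, 1, 1, 0, 2, 2]

rate = agreement_rate(expert_grades, judge_grades)
kappa = cohen_kappa(expert_grades, judge_grades)
```

Tracking this number each time you change the judge prompt tells you whether the judge is getting closer to your expert or drifting away.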

🚼 5 — Start simple when building your AI agent

Everything is now in place to finally improve your AI agent.
The recommendation here is simple: start simple and add components as your evaluation pipeline improves.

The baseline is generally prompt engineering, using one of the state-of-the-art LLM providers. For example, GPT-4.1 with the right prompt that precisely describes the task to perform is a good baseline.
Then, you can think about adding RAG (Retrieval-Augmented Generation), which provides contextual information depending on the task assigned to the agent.
From there, you can start moving in a more agentic direction: first, making tool calls based on context, and finally, truly letting the agent perform an undetermined number of tasks.

I think it is very important to remember that, no matter how complex your system is, its value cannot be proven without a dataset and a clear set of metrics to measure its quality.

With a good base on your dataset and metrics, you can unlock automatic optimisation of your prompts and agents with frameworks like DSPy or Optuna.

🥁 Conclusion

For GenAI applications, you need to complete many preparation steps before any AI agent development.
As the field is new, it can be tempting to do a lot of manual tuning without building the right foundations.

LlamaIndex, better than LangGraph and LangChain ?

AdMor — Mon, 23 Dec 2024 09:50:13 GMT

How you could save time by learning LlamaIndex

Point of view : you just built your “simple” LLM application, generated with Flux.schnell

1 — LLM applications will be complex

AI is on everyone’s lips, yet it is hard to find many meaningful adoptions of Large Language Models in products.

But the potential exists : there are many applications where automating human tasks is within reach, like customer support, note taking or maybe soon math and code problems [1].

For Natural Language Processing use cases, LLMs are also wonderful zero-shot models. But their power comes with large challenges in controllability and performance measurement.

Some LLM agent frameworks offer to tame some of that complexity, at the cost of learning their abstractions.
LangChain and then LangGraph were among the first to offer a framework for coding LLM-based applications.
In this post, we will focus on LlamaIndex. In my opinion, it offers the same value with a lower degree of complexity.

2 — LlamaIndex : more like a toolbox than a framework

My major concerns when I learned to use LangGraph were :
- You need to learn many custom operators
- Poor typing/classing of the data exchanged by the functions
- Function oriented processing lacks flexibility

In order to illustrate this, I will use a complex example from LangGraph tutorials [2].
Why ? Because with complexity, you discover the hidden design choices made for you by the framework.

😵 You need to learn many custom operators

Let’s have a look at the following piece of code.
It defines a MapReduce process. You will see how heterogeneous the definition can be.

summary_llm_chain = (
    summary_prompt | ChatAnthropic(model="claude-3-haiku-20240307") | StrOutputParser()
    # Customize the tracing name for easier organization
).with_config(run_name="GenerateSummary")
summary_chain = summary_llm_chain | parse_summary

# Now combine as a "map" operation in a map-reduce chain
# Input: state
# Output: state U summaries
# Processes docs in parallel
def get_content(state: TaxonomyGenerationState):
    docs = state["documents"]
    return [{"content": doc["content"]} for doc in docs]

map_step = RunnablePassthrough.assign(
    summaries=get_content
    # This effectively creates a "map" operation
    # Note you can make this more robust by handling individual errors
    | RunnableLambda(func=summary_chain.batch, afunc=summary_chain.abatch)
)

map_reduce_chain = map_step | reduce_summaries

If you did not get it or did not read it, that is normal. Too many things happen at once.

There is a lack of consistency : summary_llm_chain is built with pipe operators, then summary_chain is built with yet another pipe operator between summary_llm_chain and parse_summary : why not everything at once ?
You need extra operators to do the map reduce : a custom user-defined function get_content and glue functions from LangChain, RunnablePassthrough and RunnableLambda

So the purely functional approach is not the best fit for keeping pipelines simple to maintain.

🔏 Poor typing/classing of the data exchanged by the functions

In LangGraph, the default way to define the state is through a TypedDict.
The state is kept during all the steps of a graph.
If new entities appear, you need to add them as elements of the state. Not all keys are filled from the start, so you can make mistakes when filling the state at any time.

Here is another extract illustrating the idea.
A state is defined with 3 different attributes, but you only see the documents field being accessed in the examples.
The true format of the minibatches and clusters remains mysterious to the attentive reader.

class TaxonomyGenerationState(TypedDict):
    # The raw docs; we inject summaries within them in the first step
    documents: List[Doc]
    # Indices to be concise
    minibatches: List[List[int]]
    # Candidate Taxonomies (full trajectory)
    clusters: Annotated[List[List[dict]], operator.add]

def get_content(state: TaxonomyGenerationState):
    docs = state["documents"]
    return [{"content": doc["content"]} for doc in docs]

def reduce_summaries(combined: dict) -> TaxonomyGenerationState:
    summaries = combined["summaries"]
    documents = combined["documents"]
    return {
        "documents": [
            {
                "id": doc["id"],
                "content": doc["content"],
                "summary": summ_info["summary"],
                "explanation": summ_info["explanation"],
            }
            for doc, summ_info in zip(documents, summaries)
        ]
    }

Please note that the different functions of the graph feed themselves from the state by using keys !
This can be very error prone when you have a long list of processing steps.
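To illustrate why key-based access is fragile, here is a small standalone sketch. `PipelineState` and `count_minibatches` are hypothetical names, not part of the tutorial. A TypedDict annotation is only a hint for type checkers: Python does not validate the dict at call time, so a partially filled state only fails when a missing key is finally read.

```python
from typing import List, TypedDict


class PipelineState(TypedDict):
    documents: List[dict]
    minibatches: List[List[int]]


def count_minibatches(state: PipelineState) -> int:
    # The annotation is not enforced: nothing checks the dict when the function is called.
    return len(state["minibatches"])


# A state where only one key has been filled so far.
partial_state = {"documents": [{"id": 1, "content": "hello"}]}

try:
    count_minibatches(partial_state)
    outcome = "ok"
except KeyError as err:
    outcome = f"missing key: {err.args[0]}"
```

A static type checker would flag this call, but only if it runs on every code path that builds the state.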

Note : In LangGraph, you can define custom input and output states, but it requires additional effort (link).

💀 This flavour of functional programming lacks flexibility

There are many minor examples that illustrate why the design pattern of LangGraph is not ideal.

> You still need to draw the edges of the graph yourself

You have to define the functions and the state.
But that does not spare you from linking the functions together yourself.

graph = StateGraph(TaxonomyGenerationState)
graph.add_node("summarize", map_reduce_chain)
graph.add_node("get_minibatches", get_minibatches)
graph.add_node("generate_taxonomy", generate_taxonomy)
graph.add_node("update_taxonomy", update_taxonomy)
graph.add_node("review_taxonomy", review_taxonomy)
graph.add_edge("summarize", "get_minibatches")
graph.add_edge("get_minibatches", "generate_taxonomy")
graph.add_edge("generate_taxonomy", "update_taxonomy")

> Doing a for loop over a set of samples for a MapReduce pattern

From the same TNT LLM example seen before, the for loop is done in a strange manner… with a conditional edge.

def should_review(state: TaxonomyGenerationState) -> str:
    num_minibatches = len(state["minibatches"])
    num_revisions = len(state["clusters"])
    if num_revisions < num_minibatches:
        return "update_taxonomy"
    return "review_taxonomy"

graph.add_conditional_edges(
    "update_taxonomy",
    should_review,
    # Optional (but required for the diagram to be drawn correctly below)
    {"update_taxonomy": "update_taxonomy", "review_taxonomy": "review_taxonomy"},
)
graph.add_edge("review_taxonomy", END)

The reason seems to be graph representation. But this alone is a smell : such workarounds could be needed for more concerning reasons.
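For comparison, here is roughly what that conditional edge expresses when written as an ordinary loop. This is a simplified sketch, not the tutorial's code: `run_taxonomy_loop` is a hypothetical helper, the generate and update steps are merged into a single callable, and the steps are stubs.

```python
from typing import Callable, List


def run_taxonomy_loop(
    minibatches: List[List[int]],
    update_taxonomy: Callable[[List[dict], List[int]], dict],
    review_taxonomy: Callable[[List[dict]], dict],
) -> dict:
    """Revise the taxonomy once per minibatch, then review the result."""
    clusters: List[dict] = []
    # Same stop condition as should_review: one revision per minibatch.
    for batch in minibatches:
        clusters.append(update_taxonomy(clusters, batch))
    return review_taxonomy(clusters)


# Stub steps, only here to make the sketch runnable.
result = run_taxonomy_loop(
    minibatches=[[0, 1], [2, 3], [4]],
    update_taxonomy=lambda clusters, batch: {
        "revision": len(clusters) + 1,
        "size": len(batch),
    },
    review_taxonomy=lambda clusters: clusters[-1],
)
```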

> Unit testing turns out to be quite complex

If you want to test one of the chains defined in the articles by calling it, it may be hard to do without reading the code.

class TaxonomyGenerationState(TypedDict):
    # The raw docs; we inject summaries within them in the first step
    documents: List[Doc]
    # Indices to be concise
    minibatches: List[List[int]]
    # Candidate Taxonomies (full trajectory)
    clusters: Annotated[List[List[dict]], operator.add]

# update_taxonomy expects a TaxonomyGenerationState
rez = update_taxonomy({
    "documents": my_docs, 
    "minibatches": ????, 
    "clusters": ????}, 
    configurable)
# What is the expected format of minibatches and clusters ?

What is the format of the dict and the list[list[int]] ?
Well, you need to read and run the code to know.

In conclusion, there are clear drawbacks to using the LangGraph design patterns.
But can another challenger do it better ?

3 — LlamaIndex workflow : an object oriented programming way of doing LLM apps

LlamaIndex is really not that different from LangGraph.
It has similar operators for LLMs, RAG and other usual suspects. But the main way to build your apps is different : with workflows.

🚼 — Introduction of the Workflow class

Here is an example :

An app with 2 steps : starting with StartEvent and finishing with a StopEvent
A custom JokeEvent defines, in a joint manner, the routing between the steps of your graph and the content of the state.

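from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step

class JokeEvent(Event):
    joke: str  # generate_joke passes a str named joke to critique_joke

class JokeFlow(Workflow):
    @step
    async def generate_joke(self, ev: StartEvent) -> JokeEvent:
        pass  # placeholder: build and return a JokeEvent here

    @step
    async def critique_joke(self, ev: JokeEvent) -> StopEvent:
        pass  # placeholder: return a StopEvent carrying the result here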

The goal of this example is to show that you can run your own experiment in 5 minutes rather than 1 hour.
But what matters most is how it scales to a complex use case like the TNT LLM example we have seen before.

👩‍🔬 — A complex use case with LlamaIndex

We can reproduce the TNT LLM code by learning only a subset of custom concepts :

Custom events : you define in a joint manner the routing between the steps of your graph and the content of the state
PromptTemplate : this one is actually shared by all frameworks, but the syntax changes between them

# If you have defined TAXONOMY_UPDATE_SYSTEM and TAXONOMY_UPDATE_USER as your 
# prompts, you can easily adapt it to the right format
tnt_taxonomy_update_template_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=TAXONOMY_UPDATE_SYSTEM,
    ),
    ChatMessage(role=MessageRole.USER, content=TAXONOMY_UPDATE_USER),
]

tnt_taxo_update_template = ChatPromptTemplate.from_messages(tnt_taxonomy_update_template_msgs)
tnt_taxo_update_template

Map Reduce : it relies on 2 functions, ctx.send_event and ctx.collect_events, to respectively send map requests and collect their results. Here is an example :

class BatchUpdateTaxnomyWorkflow(Workflow):
    @step
    async def map_dataset(self, ctx: Context, ev: StartEvent) -> MapEvent:
        documents = ev["documents"]
        ...
        for doc in documents:
            ctx.send_event(MapEvent(text=doc))

    @step
    async def reduce_fn(self, ctx: Context, ev: MapDoneEvent) -> StopEvent:
        ...
        results = ctx.collect_events(ev, [MapDoneEvent] * 5)
        ...

If you understood these concepts, this is all you need to implement the TNT-LLM paper in LlamaIndex.
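If the fan-out / fan-in idea behind these two calls is new to you, the sketch below shows the same pattern with plain asyncio. It is only an analogy and does not use the LlamaIndex API: `summarize` is a stub standing in for an LLM call, and `map_reduce` is a hypothetical helper.

```python
import asyncio


async def summarize(doc: str) -> str:
    """Stub for the map step; a real version would await an LLM call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return doc.upper()


async def map_reduce(documents: list[str]) -> str:
    # Map: start one task per document and wait for all of them.
    summaries = await asyncio.gather(*(summarize(doc) for doc in documents))
    # Reduce: combine the results, which gather returns in input order.
    return " | ".join(summaries)


result = asyncio.run(map_reduce(["first doc", "second doc"]))
```

In the workflow version, sending one event per document plays the role of starting the tasks, and collecting the events plays the role of waiting for all of them before reducing.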

For the curious, you can see the details of the workflow in the appendix section.

🔈 — Conclusion

The 3 types of concerns met with LangGraph can be mitigated by simple operators in LlamaIndex :

😵 — You need to learn many custom operators ==> Workflows in LlamaIndex
🔏 — Poor typing/classing of the data exchanged by the functions ==> extending Events in LlamaIndex
💀 — Function oriented processing lacks flexibility ==> the MapReduce pattern is a great example, it is quite simple to implement with LlamaIndex

I see many benefits in the design pattern of LlamaIndex, and this only scratches the surface of the use cases it can cover.

So give it a try.

Generated with flux.schnell

Appendix — The TNT LLM code done with LlamaIndex

The TNT LLM has 3 steps :

Summarize all records
For each group of B records, build or update the taxonomy
Refine the final taxonomy

We build a workflow to do the first 2 steps.

class BatchUpdateTaxnomyWorkflow(Workflow):
    @step
    async def map_dataset(self, ctx: Context, ev: StartEvent) -> MapEvent:
        documents = ev["documents"]
        taxonomy = ev.get("taxonomy")
        batch_size = len(documents)
        _ = await ctx.set("batch_size", batch_size)
        _ = await ctx.set("taxonomy", taxonomy)
        for doc in documents:
            ctx.send_event(MapEvent(text=doc))

    @step
    async def map_fn(self, ev: MapEvent) -> SummaryEvent:
        tnt_prompt = tnt_template.format_messages(
            summary_length=20, explanation_length=30, content=ev.text
        )
        result = llm.chat(tnt_prompt)
        return SummaryEvent(summary=result.message.content)

    @step
    async def reduce_fn(self, ctx: Context, ev: SummaryEvent) -> StopEvent:
        batch_size = await ctx.get("batch_size")
        taxonomy = await ctx.get("taxonomy")
        results = ctx.collect_events(ev, [SummaryEvent] * batch_size)
        if results is None:
            return None
        summary_xml = format_docs([x.summary for x in results])
        if taxonomy is None:
            prompt = tnt_taxo_template.format_messages(
                content=summary_xml,
                nb_categories=15,
                cluster_name_length=15,
                cluster_description_length=30,
                explanation_length=30,
            )
        else:
            #taxonomy_xml = format_taxonomy(taxonomy)
            prompt = tnt_taxo_update_template.format_messages(
                data_xml=summary_xml,
                cluster_table_xml=taxonomy,
                nb_categories=15,
                cluster_name_length=15,
                cluster_description_length=30,
                explanation_length=30,
            )
            
        result = llm.chat(prompt)
        rez = parse_taxa(result.message.content)
        taxo = format_taxonomy(rez["clusters"])
        return StopEvent(result=taxo)

References

[1] — Chollet et al., ARC Prize 2024
[2] — LangChain, TNT LLM implementation tutorial
[3] — LlamaIndex, documentation reference

The AI streamer experiment

AdMor — Sat, 12 Oct 2024 12:41:37 GMT

A story of scaling a Video+LLM AI system to run on Twitch

Jesus as a streamer

In the past months, I developed AI Jesus, an LLM-based system which interacts with the Twitch interface.
Its goal is to interact with people chatting in the stream chat by sending video responses to their questions.

An example of the app : people ask questions on the right and a video response appears after a few seconds of processing

This experiment was inspired by other examples. In fact, this is not even the first AI Jesus on Twitch.

In this article, you will have an overview of some AI entertainment systems but also details about how I built the architecture of this AI Jesus.

1 — State of the art review

The AI live stream idea is not new. Many tests have been run.
I’ll present a few I found.

⛪ AskJesus

This idea already existed. Ask Jesus offers an English-speaking avatar answering people in the chat.

It is interesting to observe their claimed server costs : >10k$

Interestingly, there is actually a complete service behind this demo.

I believe several channels on Twitch use the same backend. There could be a whole economics study of having AI avatars speaking about niche topics on Twitch.

👑 AI Melenchon

Created by Clad3815. This bot, based on a French politician, used ElevenLabs and GPT-4.
It made me laugh on several occasions. I tried to understand the prompt used to make the bot actually funny but did not find all the details.

One thing I guessed is that using the word “sarcastic” in the prompt makes the bot very aggressive towards the person asking the question, which provides some elements of surprise and fun.

One of the clips from the now-stopped channel : a summary of the candidate’s program using slang words

❔Nothing Forever

A very weird but captivating show.
It reproduces exchanges in the style of the Seinfeld sitcom but prompted with different topics. On top of that, an animated 3D scene makes the video feel almost like watching a real sitcom.

💬 NotebookLM

Even big players like Google have started to play with this idea of AI-generated content.

NotebookLM can create REALISTIC podcasts from any content. This is way more advanced than the previous examples because it simulates a chain of interactions in a way that was hardly covered by the applications on Twitch.

2 — Building an AI streamer factory

Now that we have seen the world of possible applications, we can get into the details of how to build a simple one. And there is already quite a collection of challenges to face.

🏯 — High level Architecture

The general idea is simple : we need to transform an input text into a video that answers this input.

Multiple steps unfold in order to do this :

Read the chat message from Twitch as soon as it arrives
Ask an LLM to generate an answer to this question
Transform the answer into audio with text-to-speech
Use Wav2Lip to morph the template video based on the audio of the response
Send the video to Twitch

You can see these steps with the diagram below.

Main steps of the AI streamer : purple means cloud service, blue means local
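The steps above can be sketched as a single function that chains injected services. This is not my production code: every name below is hypothetical, and the lambdas are stubs standing in for the LLM, the text-to-speech model, the lip-sync model and the streaming side.

```python
from typing import Callable


def answer_chat_message(
    message: str,
    ask_llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
    lip_sync: Callable[[bytes], str],
    publish: Callable[[str], None],
) -> str:
    """Run one chat message through the pipeline and return the video path."""
    answer = ask_llm(message)        # generate the text answer
    audio = text_to_speech(answer)   # synthesize the voice
    video_path = lip_sync(audio)     # morph the template video on the audio
    publish(video_path)              # hand the video over to the streaming side
    return video_path


# Stubs standing in for the real services.
published: list[str] = []
video = answer_chat_message(
    "What is the meaning of life?",
    ask_llm=lambda question: f"Answer to: {question}",
    text_to_speech=lambda text: text.encode("utf-8"),
    lip_sync=lambda audio: f"/tmp/answer_{len(audio)}.mp4",
    publish=published.append,
)
```

Passing the services as arguments keeps each step replaceable, which matters here because some run locally and one runs in the cloud.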

It is possible to host most of the models on a local computer in order to reduce costs. Only the LLM part uses a cloud service, OpenAI GPT-4o mini, in order to answer rapidly and at a low cost.

🎥 — Streaming

The largest challenge in this diagram was not using the models, which are all off the shelf, but streaming the video.

You have 2 options : do the low level streaming yourself or use a tool with its associated software stack.
I went back and forth between the two but finally chose OBS, an open source tool, and learned how to use it to achieve the desired outcome.

OBS — an open-source live video mixing tool

OBS is pretty popular in the world of streaming. You can create complex scenes with overlays and mix several signals. But it is not a tool that automates very well.

a) On the fly video switch 📺

The first issue was launching the latest created video. There is no built-in way to do this in OBS, so you need custom configuration from the first minute you use it.

The first option I found was to enable a script that reads all videos starting from the most recent, and to toggle that script through the OBS API from an external watcher script.
If it already sounds convoluted, that is normal : the logic was insane for such a simple thing. But it worked at first, until I started to industrialize the codebase.

b) Dockerization 🐳

The first version of the system used multiple scripts, but switching to Docker Compose with separate clean containers made everything easier.

Dockerizing the services in charge of the ML models was relatively straightforward.
They talk to each other through an API, and this works simply with Docker Compose.

But for the video streaming service, things were a bit more subtle.
OBS is GUI first, so automating the deployment is not officially supported.
Luckily, you can reverse engineer how the default config files work and get OBS to boot with your config.
The Docker image also needs a graphical interface, which is usually not the case when building from an Ubuntu base image.

A virtual desktop container is needed to launch OBS with a GUI and all its options

With this configuration, the setup is still not 100% automated, closer to 90%, but this is fine as it lets me control the “going to production” part.

🏃 — Performance and cost control

Why not run everything on the cloud ? Because 💸

The service runs 24/7 so if I rent a GPU at 1$/hour, I won’t be able to support this app for long.
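To put a number on it, here is the arithmetic for a GPU rented around the clock. The 1$/hour rate is the one quoted above; the 30-day month and the helper name are my own simplifications.

```python
HOURS_PER_DAY = 24


def rental_cost(hourly_rate_usd: float, days: int) -> float:
    """Cost of renting one GPU around the clock for a number of days."""
    return hourly_rate_usd * HOURS_PER_DAY * days


monthly_cost = rental_cost(1.0, days=30)   # 1 * 24 * 30 = 720 dollars
yearly_cost = rental_cost(1.0, days=365)   # 1 * 24 * 365 = 8760 dollars
```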

So running most of it locally is mandatory. Luckily, I had already invested in a GPU, so the worker can run decently at home.

The other challenges are the following :

Reliability of the internet connection : over Wi-Fi, connection losses seem to happen more often for the Streamer service
Everything needs to fit in 16 GB of VRAM : using an LLM service was the best of all worlds, with no extra VRAM usage, good performance and a simple upgrade once newer models come out.
Speed : users expect an almost instantaneous response to their question. Telling them on the stream UI that their question is being processed helped. The wav2lip model could also be made drastically faster by reducing image quality, which allowed near real time.

The amount of VRAM used is relatively low, GPU usage rarely spikes high

Beyond this, monitoring can also be challenging.

Token refresh with Twitch was surprisingly difficult to implement, and a failure can leave the system unresponsive to users.
Access to the virtual desktop can help to see what is actually being broadcast
Docker Compose can display some of the errors encountered, but if the services are spread across multiple machines, issues become difficult to diagnose

Right now, this is the main blocker to reaching a larger scale of operations.

🗽 — Scalability & reliability

Once the goal became to scale beyond a single machine, new elements appeared in the diagram :

S3 was needed to store the videos between the services
A message queue was also used to allow inter-process communication

What explains this decomposition :

Service 1 is the worker node : it does all the ML processing and is equipped with a strong GPU
Service 2 is the streaming node : it hosts the OBS instance. It should have a GPU but a budget one should be enough.

New bricks (S3 and message service) are added once decoupling becomes important. Purple is for cloud services, blue for locally hosted.

a) What benefits does this new architecture bring ? 💚
- The workers can scale independently
- 1 worker could serve several streaming services if you have multiple AI streamers
- The streaming service can benefit from a reliable internet connection. Having it in the cloud can be a good idea.

b) Remote video handling 📚

The scalable design changed a lot how the videos are read.

If the two main services are spread across different machines, local video sharing is not possible anymore. The files must be hosted on cloud storage like S3 in order to be shareable.

From there, you need an additional service responsible for the orchestration of the video files. The VideoServer listens to a message queue for new videos and downloads them locally for the OBS streaming service.
The VideoServer also owns the video display logic : when should we display old videos and when should we play the latest answer to a user.
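A minimal sketch of that display decision could look like the class below. It is an illustration under my own assumptions, not the actual VideoServer: `VideoScheduler` and its methods are hypothetical names, and downloading from S3 or reading the message queue is left out.

```python
from collections import deque
from itertools import cycle


class VideoScheduler:
    """Play fresh answers first, and loop over old videos when nothing is pending."""

    def __init__(self, idle_videos: list[str]) -> None:
        if not idle_videos:
            raise ValueError("At least one idle video is required")
        self._pending: deque[str] = deque()
        self._idle = cycle(idle_videos)

    def on_new_video(self, path: str) -> None:
        """Called when the message queue announces a newly downloaded video."""
        self._pending.append(path)

    def next_video(self) -> str:
        """Called when the player is ready for the next file."""
        if self._pending:
            return self._pending.popleft()
        return next(self._idle)


scheduler = VideoScheduler(["idle_1.mp4", "idle_2.mp4"])
first = scheduler.next_video()       # nothing pending: an idle video
scheduler.on_new_video("answer_42.mp4")
second = scheduler.next_video()      # the fresh answer takes priority
third = scheduler.next_video()       # back to the idle loop
```

A first-in first-out queue keeps answers in the order the questions were asked, which seems like the least surprising behavior for viewers.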

Final words

I love this project : it started very simple and got quite complex to address some very basic needs.
Scaling at the lowest possible cost is also something I find very enjoyable, but that we don’t get to do so often in professional day-to-day life.

Foundational LLM pre-training on your hardware at home

AdMor — Sun, 15 Sep 2024 07:38:39 GMT

Foundational LLM pre-training on your cheap hardware at home 🏡

And it won’t cost you that much, I promise

What a successful pretraining will look like

Motivation

Why not play in the big league ? 👨‍🍳
Why not pretrain your own LLM on a large text dataset ?
So you can give it a silly name like CharlesGPT or FoxTerrier0.75B.

But it should be cheap. 💸
Let’s avoid the thousand-dollar bill from AWS for our “small” experiment.

Where should you start ? 😨
There are so many papers, models and more.

In this article, we decipher all of this for you 🤝.

Outline :

Exploring the datasets useful for LLM training
Hardware : what can we realistically train with consumer-grade equipment ?
Optimization : use the latest innovation to get the most compute / $

🏋 Large datasets

You know what everybody says : “No data = no AI”.
So this is a good starting point to know what is feasible or not.

The AI will be as good as its data.
But there are a lot of datasets out there.
What do the grown-ups use ? What is the state of the art in good text datasets ?

We answer all these questions next 🤓.

Some references [1] list the available text datasets.

Common Crawl : 500 TB, the majority of the open web is there.
Project Gutenberg : a few GB, 50k books
arXiv : 270 GB
The Pile : 800 GB, https://pile.eleuther.ai/, a collection of text composed of several datasets
MassiveText : 10.5 TB, https://paperswithcode.com/dataset/massivetext
Dolma : 3 trillion tokens, https://github.com/allenai/dolma?tab=readme-ov-file, 20 TB decompressed
C4 dataset
RedPajama : 20t

There is already too much data for our use case. What should we use ?
Luckily for us, HuggingFace has explored this question with their FineWeb dataset.
They show that raw text is not optimal to train on.
Filtered datasets are usually better : more informative text, removal of explicit or racist content.

FineWeb is a way to reproduce RefinedWeb, the dataset used to train the Falcon LLM.
It consists of : 15 trillion tokens, 93 TB uncompressed, 44 TB

One direction explored with FineWeb is to create an educational subset. The educational quality is meant to boost the learning of LLMs.

FineWeb-Edu : 1.5T, 9 TB, HuggingFaceFW/fineweb-edu

For our purpose, we consider fineweb-edu to be the right balance quality x size vs cost.

📏 Dataset references

Which dataset sizes were used for other well-known models ?

GPT2 was trained on 40 GB of text
GPT3 on approximately 500B tokens
Phi3 was trained on 3T tokens

So a good order of magnitude is 1T tokens. Most probably, we will have to go lower because of budget issues.

Next, we check if this is indeed possible by doing back-of-the-envelope calculations.
We use the Chinchilla scaling law to estimate how much processing power is needed for our big idea.

The formula is :

C = 6 * Nb_of_model_params * Dataset_tokens * epochs

We use :

Model size = 1B
Dataset size = 1T
Epoch = 1

C = 6 * 1e9 * 1 epoch * 1e12

C = 6e21 FLOPs
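The arithmetic above can be replayed in a few lines of Python. This is only a sketch of the formula as written in this post; the function name `training_flops` is made up for the illustration.

```python
def training_flops(n_params: float, n_tokens: float, epochs: int = 1) -> float:
    """Approximate training compute: C = 6 * params * tokens * epochs."""
    return 6 * n_params * n_tokens * epochs


# Values used in this post: 1B parameters, 1T tokens, 1 epoch
flops = training_flops(n_params=1e9, n_tokens=1e12, epochs=1)
print(f"C = {flops:.1e} FLOPs")
```

Changing the model size or the dataset size scales the result linearly, which is what the resizing later in the post relies on.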

Next part will investigate what is needed to achieve this number hardware wise.

💾 Hardware needs

FineWeb-Edu is our starting point. For this, we need 9 TB.
Luckily, it is easy to find 10+ TB drives on the web.

Used HDDs are a gamble, but they are the way to stay cheap.

We go for a 12 TB drive to store our dataset and models.

Now, the GPU choice

Model pre-training is a compute-hungry task and GPUs are in high demand, so no miracle is possible budget-wise.
The GPU selected is an RTX 4080.

According to HuggingFace, this is the right middle ground between being GPU poor and GPU rich.

It delivers 48 TFLOPS. This figure is important to estimate how fast we can go through the dataset.

Alternatively, we consider the RTX 4060 Ti 16 GB, which is a cheaper option. However, its FLOPS are 2 times lower, so much larger trade-offs would be needed.

⏰ Math time : what can we afford to train ?

So our initial dataset would cost 6e21 FLOPs to train on.
We need to compare it to the processing power of our RTX 4080 :
4.874e13 FLOPS.

Time needed to train = 6e21 / 4.874e13 / 3600 / 24 ≈ 1425 days of compute for 1 epoch.

Only 4 years of training

Ok, let’s redesign the size of everything :

model size = 100M
dataset = 0.1T

Training time now becomes 15 days for 1 epoch 🎉🎉🎉
This is long but we could stop it earlier if we want to.
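To play with other model and dataset sizes, the same estimate can be wrapped in a small helper. This is a sketch that assumes the GPU sustains its peak 4.874e13 FLOPS the whole time, which is optimistic; the helper name `training_days` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600
RTX_4080_FLOPS = 4.874e13  # peak figure quoted in this post; real utilization is lower


def training_days(n_params, n_tokens, gpu_flops_per_s, epochs=1):
    """Days of compute for C = 6 * params * tokens * epochs at a given FLOPS rate."""
    total_flops = 6 * n_params * n_tokens * epochs
    return total_flops / gpu_flops_per_s / SECONDS_PER_DAY


big = training_days(1e9, 1e12, RTX_4080_FLOPS)    # 1B params, 1T tokens
small = training_days(1e8, 1e11, RTX_4080_FLOPS)  # 100M params, 0.1T tokens
print(f"1B params, 1T tokens : {big:.0f} days")
print(f"100M params, 0.1T tokens : {small:.1f} days")
```

With these inputs the first configuration lands around 1,400 days and the second around two weeks. Small differences with the figures quoted in the text come from rounding the inputs.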

⏰ Inference based estimation

There is another way to estimate the training time needed : the empirical way.
Run one batch of data through your network and see how long it takes.

This reference allows us to do it.

An alternative to check how much time will be needed for this pretraining.

We keep the assumptions of the previous computation. We use the GPT2 architecture as a way to approximate the training cost of one iteration.

We find :

374.80 ms for 10 batches of 1000 tokens

It gives 43 days of training for 1 epoch of this large dataset.

So the truth is between 15 and 45 days of training.
This is a bit long, so we might consider shrinking the model size even more.
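The empirical estimate can be reproduced from the measured timing. The sketch below assumes the 0.1T-token dataset from the previous section and a throughput that stays constant during the whole run; the function name is made up for the illustration.

```python
def days_from_benchmark(elapsed_ms, tokens_processed, dataset_tokens, epochs=1):
    """Extrapolate a measured throughput to a full pass over the dataset."""
    tokens_per_second = tokens_processed / (elapsed_ms / 1000)
    seconds = dataset_tokens * epochs / tokens_per_second
    return seconds / (24 * 3600)


# Measurement from this post: 374.80 ms for 10 batches of 1000 tokens
days = days_from_benchmark(
    elapsed_ms=374.80,
    tokens_processed=10 * 1000,
    dataset_tokens=1e11,  # 0.1T tokens
)
print(f"{days:.1f} days for 1 epoch")
```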

👾 8-bit model training

However, the previous calculation doesn’t take into account the possibility of training in 8-bit precision.

This capability has existed for approximately 2 years and was enabled on Ada-generation GPUs.
Instead of training in 16-bit, we can use 8-bit precision, enabling more data to be processed on the same hardware.

The exact speedup numbers provided by Nvidia are unclear.
They range from +20% to +100% and really depend on your implementation details.

Throughput gained for LLM training, various precision and hardware used, extracted from the Nvidia conference 2023

By using a raw pytorch implementation of the model, it is straightforward to use nvidia’s TransformerEngine autocast.

Extracted from the Nvidia conference 2023

I launched a benchmarking script to validate the speedup

root@e485ce1a22f2:/workspace/test# python3 min_example.py --dtype bf16 --depth 4
Mean time 253.556015625 ms per iteration (8911.3125 GB used)
root@e485ce1a22f2:/workspace/test# python3 min_example.py --dtype fp8 --depth 4
Mean time 191.7706640625 ms per iteration (9891.125 GB used)
root@e485ce1a22f2:/workspace/test# python3 min_example.py --dtype fp8 --depth 2
Mean time 93.829794921875 ms per iteration (5879.4375 GB used)
root@e485ce1a22f2:/workspace/test# python3 min_example.py --dtype bf16 --depth 2
Mean time 117.90201171875 ms per iteration (5059.3125 GB used)

There is a 25 to 33% speedup for roughly 1 GB of additional memory (the values printed by the script are in MB, despite the GB label).
So we can expect to reach a training time of 10 to 30 days instead.
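The speedup can be derived directly from the benchmark log. The sketch below only reuses the timings printed above and expresses the gain as extra throughput (bf16 time divided by fp8 time, minus 1); applying it to the 15 and 43 day estimates assumes the gain carries over to the full training run.

```python
# Mean time per iteration (ms) copied from the benchmark output above
timings_ms = {
    4: {"bf16": 253.556015625, "fp8": 191.7706640625},
    2: {"bf16": 117.90201171875, "fp8": 93.829794921875},
}


def throughput_gain(bf16_ms, fp8_ms):
    """Relative throughput gained by switching from bf16 to fp8."""
    return bf16_ms / fp8_ms - 1


gains = {depth: throughput_gain(t["bf16"], t["fp8"]) for depth, t in timings_ms.items()}
for depth, gain in sorted(gains.items()):
    print(f"depth {depth} : +{gain:.1%} throughput")

# Apply the best and worst gains to the two earlier estimates
fastest = 15 / (1 + max(gains.values()))
slowest = 43 / (1 + min(gains.values()))
print(f"Expected training time : {fastest:.0f} to {slowest:.0f} days")
```

The result stays in the same order of magnitude as the range quoted above.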

🌜Conclusion

Even though it is a difficult task to train an LLM at home, it is doable.

In fact, similar sized experiments have been tested by HuggingFace with their SmolLM models (but on the cloud).

In the next post, we will implement the configuration from this post and report on the results.

📗 References

Nvidia transformer engine : https://github.com/NVIDIA/TransformerEngine

💎 Bonus

>> How can I find a HDD at the right price ?

If you buy a brand new HDD, it could be very expensive. By buying used HDDs, you can greatly reduce the acquisition cost.
A good price is 10 € / TB.

Where should you go :
- eBay : it may be the cheapest.
- Amazon also works, you could get a better level of guarantee there

>> Faulty hdd

For the price listed above, you will get used HDDs. It means that they can fail or die on the way to your place.

Be careful of these pitfalls with resellers :
- 0-hour usage : it is usually not true.
- Disk dead on arrival. Transportation can break some HDDs. Don’t be surprised. This is why having a guarantee is important.

With these few tips, you should be able to find the right piece of hardware for your needs.

KDD 2024 part 2 : How hard ML problems are framed in different industries

AdMor — Sun, 08 Sep 2024 15:22:41 GMT

KDD 2024 part 2 : How hard product and business problems are framed using machine learning in different tech industries

KDD is one of the largest conferences for AI and Machine learning in the world.

I published a first post on trends seen at KDD. This second post is centered on what I learned about how major ML-centered businesses understand their main problem and how they solve it using applied math methods.

What you will see in this post :

Autonomous agents according to Boeing AI Chief Technologist
Equilibrium in a 3-sided marketplace with Glovo
SOON : Why finding something on a map is different by Airbnb

1 — Building end-to-end decision & autonomous systems

This was an invited talk with Dragos Margineantu, AI chief technologist @ Boeing.
He summarized his presentation as “What I learned building end-to-end decision & autonomous systems”

1-a) High level overview

An autonomous system will perform tasks that require a high level of reasoning.
Dragos presented in this talk a generic design for this kind of AI. This first diagram summarizes it.
But the focus for Boeing is to build an autonomous plane, which is challenging by today’s standards.

The controller is the main piece of software. It has to make :
- High level decisions : go to the bakery store
- Low level decisions : turn right or wait

To do so, it uses an internal list of rules (knowledge base) but most importantly a perception engine (computer vision for example).

A more concrete example of the perception engine is shared for the case of image processing :
The use case is an autonomous car or plane case.

Autonomous systems are complex : the perception part is already composed of many pieces

There are multiple levels of complexity to acknowledge :
🖼 — Multiple modalities : different cameras on the plane, lidar
👫 — Redundancy : in the object detection box, you can see that several detectors are present. It is a core aspect of these decision systems : failure will happen for sure, so redundancy is not a luxury anymore.
🗼 — 3 levels of abstraction before reasoning : raw processing (segmentation), concept extraction (object detection), spatio-temporal reasoning (tracking). The tracker output is the basis of high order reasoning.

1-b) Challenges of these systems

i — Robustness

Learn to say “I don’t know”

All the AI systems making decisions should have 2 outputs :
- value
- uncertainty

This approach is particularly relevant because you have an ensemble of models and not all models may fail at the same time.

“You don’t understand it if you understand it in 1 way”

There is a large direction of work in engineering redundancy.

Example :
They created “synthetic” data in places where they have little data.
They faked an incursion on the airport

ii — Trust

What took them a lot of time was complying with the actions or decisions that humans expect.
Example given : the avoidance maneuver. The AI would direct the plane toward another plane in order to pass beyond it, scaring the pilots at first.

This direction is very close to the well-known subject of explainable AI

iii — Anticipation

This topic was less detailed, but you can understand why it is really relevant here.
The high-level reasoning module needs to make assumptions about the future.
For an autonomous plane, maybe you can’t investigate directly to get more information. But for other autonomous systems, it could make sense.

Their goal is :
Time series prediction ===> trajectory prediction

They published a paper on this topic : Generative methods for anticipating unknowns : normalizing flows

1-c) My take-aways

You want to build an intelligent AI assistant. You might end up using the following architecture principles in the future.
Why ?
Making your system more robust to edge cases and abuses will bring values to your customer, especially if the core technology like GPT-4 is commoditized.

2 — How do Uber, Deliveroo and Glovo frame their business ?

Introduced by Glovo during the TSMO workshop on 2-sided marketplace : Simulation based Mixed Integer Linear Programming (MILP)

2-a) The use case

P1 is the order to deliver and D1 is its destination. The left rider is still working on D0 delivery.

Given 2 potential riders, who should take order P1 ?

The right rider, who is available but further away,
or the left rider, who is already on an order but, once finished, will be closer to P1 ?

This problem can be solved as an optimization under constraints.

2-b) The equation

How to understand the equation :
x = assignment
yo = order is assigned or not
fo = the cost of not assigning an order (cost of delay) ==> trick by creating a driver that is very far
c_ro = is the quality times revenue of the order

The term c_ro is the core of the system. Glovo defines it with the following terms :
- cro = α0 · riderDistanceRo + α1 · riderDTRo + α2 · customerDTRo

🏃 — riderDistanceRo = Estimated distance traveled by the rider.

🚴‍ — riderDTRo = Estimated time the rider r will spend delivering the order o. This term is used for controlling the quality of our service at peak times, when the number of orders can be considerably higher than the number of riders available.
🌯 — customerDTRo = Estimated delivery time of order o if it is assigned to rider r, that is, the elapsed time since the order is created until the order is delivered.
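To make the cost term concrete, here is a toy sketch of the assignment problem. It is not Glovo’s implementation : a brute-force search stands in for the MILP solver, the alpha weights, the unassigned-order penalty and the feature values are invented for the illustration, and each rider is assumed to take at most one order.

```python
from itertools import product


def pair_cost(features, alpha):
    """c_ro = a0 * riderDistance + a1 * riderDT + a2 * customerDT."""
    rider_distance, rider_dt, customer_dt = features
    return alpha[0] * rider_distance + alpha[1] * rider_dt + alpha[2] * customer_dt


def best_assignment(features, alpha, unassigned_cost):
    """Try every rider/order pairing and keep the cheapest one (toy sizes only)."""
    riders = sorted(features)
    orders = sorted({order for per_rider in features.values() for order in per_rider})
    best_cost, best_plan = float("inf"), None
    # None means "leave this order unassigned" and costs f_o
    for choice in product(riders + [None], repeat=len(orders)):
        used = [rider for rider in choice if rider is not None]
        if len(used) != len(set(used)):
            continue  # a rider cannot take two orders in this sketch
        cost = 0.0
        for order, rider in zip(orders, choice):
            if rider is None:
                cost += unassigned_cost
            else:
                cost += pair_cost(features[rider][order], alpha)
        if cost < best_cost:
            best_cost, best_plan = cost, dict(zip(orders, choice))
    return best_plan, best_cost


# Hypothetical numbers : (distance in km, rider time in min, customer time in min)
features = {
    "left": {"P1": (1.0, 12.0, 25.0)},   # busy now, but close once done
    "right": {"P1": (4.0, 15.0, 22.0)},  # free now, but further away
}
plan, cost = best_assignment(features, alpha=(1.0, 0.5, 0.2), unassigned_cost=100.0)
print(plan, cost)
```

With these made-up weights the left rider wins. Raising the weight on customerDT enough flips the decision, which is exactly the kind of trade-off the alphas control.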

How do you improve the business ?

Finding the right alphas : they look for a reduction in delivery time with a limited increase in rider distance (e.g. <10%)
They use a simulator to estimate the probable values per city
Hypothesis testing is done by switch-back testing across different sections of the day in multiple cities.

2-c) My take-aways

To me, learning to recognize this class of problem and which tools can solve it is an underrated skill.
It has been overshadowed by trendier topics like deep learning.
For a company like Glovo, this kind of data science is a matter of life and death for the business.
As a reader, it might be worth wondering whether it can be applied in your case.

3 — How to display your results on a map with Airbnb

Soon

KDD 2024 — An overview of the emerging AI landscape

AdMor — Sun, 25 Aug 2024 19:14:05 GMT

KDD 2024 — Highlights of the emerging AI landscape

The place to explore the latest technical reports and new application fields

The KDD conference is known to be more applied than other A-tier ML conferences.
For this reason, it is a great place to catch up on the latest technical trends but also to discover emerging topics.

In this post, I’ll present the top ideas that interested, surprised or at least felt novel to me.

So glad to be there :)

📝 — Detecting the AI pen

This workshop was about detecting text generated by AI.
There can be many different motivations, but a major one is that AI makes it easier to create misinformation, which we want to prevent.

The 5 main classes for AI-text detection : classes are not exclusive

The main take-aways :

Soft watermarks may be the most efficient way to detect AI writing. Zero-shot detection would be second.
But you need long chunks of text. With short ones, even humans can be detected as AI.

Some examples

Zero-shot detection — Detect GPT

We can do a lot based on the distribution of the generated text. Humans usually don’t produce the text with the highest probability, and variations of their text also have different properties.

x_fake generated by AI has specific properties when rephrased compared to x_real. This is the basis of many approaches.

Soft Watermark — The green / red algorithm [ref]

Split all words into 2 groups : each red word should have a green synonym. Red words can represent less than 50% of all words.
LLM inference will use more red words so that its output can be flagged.
Detection is based on the probability of having words from only 1 of the 2 sets
Hypothesis testing is used to decide whether a text with a lot of red words is AI-generated
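The hypothesis test in the last step can be sketched as a one-proportion z-test. This is a simplified illustration, not the reference implementation : it assumes a fixed word list instead of a list re-drawn at each token, the vocabulary and sentences are made up, and `favored_words` is the set the LLM is pushed toward (the red words in the description above).

```python
import math


def watermark_z_score(tokens, favored_words, favored_fraction):
    """z-score of the number of favored words vs. what chance alone would give."""
    n = len(tokens)
    hits = sum(1 for token in tokens if token in favored_words)
    expected = favored_fraction * n
    std = math.sqrt(n * favored_fraction * (1 - favored_fraction))
    return (hits - expected) / std


# Hypothetical split : half of the vocabulary is favored
favored = {"big", "quick", "glad", "start"}
fraction = 0.5

ai_like = ["big", "quick", "glad", "start"] * 4          # 16 favored words out of 16
human_like = ["big", "large", "quick", "fast"] * 4       # 8 favored words out of 16

z_ai = watermark_z_score(ai_like, favored, fraction)
z_human = watermark_z_score(human_like, favored, fraction)
print(f"AI-like text : z = {z_ai:.1f}, human-like text : z = {z_human:.1f}")
```

A large z-score means the text contains far more favored words than chance would explain. Since the standard deviation grows with the square root of the text length, short texts rarely reach a convincing score, which matches the remark that long chunks are needed.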

There can be different green-red lists that give better results

Why is it called “soft” ?
It is not always possible to apply the green/red split to every word : a word like “Obama” has no synonym.
It only becomes effective on longer texts. In that scenario, it starts to be very hard for an attacker to remove all traces of AI-ness.

🕴 — AI impact on the job market, talent management and recruiting

This workshop explored how new AI capabilities could shift the global job market. The topics discussed were in fact very wide.

>> Human learning direction

The first lecture, from Professor Hui Xiong, highlighted that one area of knowledge work could quickly see its added value drop : describable knowledge.

Non-describable knowledge is easier to illustrate than its opposite. People management was the first illustration given.

Describable knowledge is where AI could shine in the next years

Some use cases for the job market were interesting to think about (coming from “A Comprehensive Survey of Artificial Intelligence Techniques for Talent Analytics”)

Describe your curriculum and the LLM generates a CV
Given your CV, generate a job description and find the closest matching job, THEN find the key skills that you need to learn
Multi-modal LLMs could bring a simple way to structure all CV formats

>> LLM as novice qualitative research assistant, a talk from Talent management research from Amazon

Talent management research is the use of science and data to equip employees with the resources to best navigate their career.
Talent management works on either core research (what does a promotion / a good employee / etc. look like), product development or metrics and evaluation.

Their base material is an interview dataset. They didn’t reveal their internal dataset but used a public one composed of 8 transcripts of 1-hour interviews.
A RAG-based model achieves close to human performance.

The key takeaway is that they rate the RAG LLM at the level of a junior qualitative researcher.
But more realistically, the tool will mainly boost the work of humans rather than replace them. They mentioned that LLMs do not yet replace humans when it comes to handling bias and making deductions.

🏭 — Preprocessing large multimodal dataset

Another great find of the first day of this conference was the data-juicer package.

It is aimed at preprocessing very large amounts of multimodal data for LLM training. The maintainers explained the key differences that pushed them to develop a tool different from, say, Spark :

A model is an operator like any other
The data is AI-native, meaning it is intended primarily for AI. An example could be partially filtering a video

An example of using a GPT3-based score to apply a filter on data from Common Crawl

I recommend reading the excellent blog post from HuggingFace on how they built the FineWeb dataset. Many of the custom operators mentioned in their post are present in DataJuicer.
The scale and cost of this pre-processing seem new to me, as for the majority of companies it was probably limited to text until now.

💡 — Conclusion

This first day was rich in discoveries.
I would recommend attending several different sessions, as you can often make random discoveries on topics you know nothing about.

3 docker best practices to muscle up your game

AdMor — Thu, 08 Aug 2024 20:04:25 GMT

Not so long ago, I realised something disturbing.

I knew many best practices in different domains : python, how to train a good model, how to save money on the cloud.

But for Docker, things were not so clear. That was a clear indicator that I should probably learn a lot more about it.

After a few discussions and reading on the topic, I collected a small set of resources that should help everyone feel more confident that they use Docker properly

The essentials of Docker

This part is an intro for the best practices.
Without it, the article could feel incomplete and frustrating if you did not spend enough time on Docker.

Docker in a nutshell

Docker is a platform for containerization that had a transformative impact in the software world. Docker enables the packaging of applications and their dependencies into portable containers, which can run consistently across diverse computing environments.

Docker also simplifies application management, accelerates deployment, and optimizes resource utilization by sharing the host operating system’s kernel.

Why Docker ?

Docker started to be used around 2013 and became a key skill to have in the tech industry. However, this does not say much about why it became popular.

Docker gained widespread adoption because it provided a better alternative to the other options at the time. Let’s see it in detail :

Lighter than Virtual Machines:
Virtualization technologies, such as VMware and VirtualBox, were commonly used. VMs provide full virtualization of an entire operating system, allowing multiple applications with their dependencies to run on a single physical server. While effective, VMs are heavier than containers in terms of resource usage, and the startup time is typically slower.
Simpler than traditional configuration management tools:
Tools like Puppet, Chef, and Ansible were used for automating the configuration of servers and ensuring consistency across different environments. These tools focused on managing the software configuration on servers but didn’t provide the same level of isolation and portability as containers.
More automated deployment:
In many cases, deployment involved manual steps, where developers or system administrators configured servers and installed dependencies manually. This manual process often led to inconsistencies between development, testing, and production environments.

Docker 101

We have seen why Docker has become widespread, but not how to use it.

However, there are countless tutorials on Docker, so I won’t create a new one.

But for illustration purposes, here is what you can expect to see in a Dockerfile. It will be useful when we review the best practices.

# Use an official Python runtime as a base image
FROM python:3.9
# Set the working directory in the container
WORKDIR /usr/src/app
# Copy the application source code to the working directory
COPY . .
# Install app dependencies
RUN pip install -r requirements.txt
# Define the command to run the application
CMD ["python", "app.py"]

In a nutshell :

We start with a python:3.9 image
Create a directory in the system /usr/src/app that will be used for the next operations
The COPY command will copy all the current directory into the container
We proceed to dependencies installs
The default command when the container will run will be python app.py

Docker best practices

In the previous section, we saw why Docker is useful. Now the question is how it works exactly and how to make it run more efficiently.

1 — Avoid unnecessary build time

You can speed up subsequent builds by optimizing the order of the operations.

Let’s see an example of the consequences of an unoptimized Dockerfile (like the one in the previous section)

In this example, a COPY . . is done early in the build. Hence, after any modification of a file in the project, the build process will restart from the COPY and redo a potentially unnecessary install of the dependencies.

On the other hand, if only the requirements file is copied, one can skip the requirements install if no dependency changed.
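The optimized ordering described above could look like the sketch below. It reuses the Dockerfile from the previous section and only assumes that a requirements.txt file sits at the project root, as in that example.

```dockerfile
# Use an official Python runtime as a base image
FROM python:3.9
# Set the working directory in the container
WORKDIR /usr/src/app
# Copy only the dependency list first : this layer changes rarely
COPY requirements.txt .
# Install app dependencies (cached as long as requirements.txt is unchanged)
RUN pip install -r requirements.txt
# Copy the application source code last : it changes often
COPY . .
# Define the command to run the application
CMD ["python", "app.py"]
```

After editing a source file, only the last COPY layer and what follows it are rebuilt, and the dependency install is served from the cache.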

2 — Multiple stage

Stages allow you to build your Docker image in separate parts, usually a builder image and a runtime image.

Let’s see an example.

# Stage 1: Build Stage
FROM python:3.8 AS builder

WORKDIR /app
# Copy only the dependency files to leverage Docker cache
COPY pyproject.toml poetry.lock ./
# Install dependencies only (--no-root : the project source is not copied in this stage)
RUN pip install --upgrade pip poetry && \
    poetry config virtualenvs.create false && \
    poetry install --no-root --no-interaction --no-ansi
# Stage 2: Server Stage
FROM python:3.8-slim AS server
WORKDIR /app
# Copy installed dependencies from the builder stage
COPY --from=builder /usr/local/lib/python3.8/site-packages/ /usr/local/lib/python3.8/site-packages/
# Copy the rest of the application code
COPY . .
# Command to run your application
CMD ["python", "app.py"]
# Stage 3: Worker Stage
FROM python:3.8-slim AS worker
WORKDIR /app
# Copy installed dependencies from the builder stage
COPY --from=builder /usr/local/lib/python3.8/site-packages/ /usr/local/lib/python3.8/site-packages/
# Copy the rest of the application code
COPY . .
# Command to run your application
CMD ["python", "worker.py"]

What are the benefits ?

No need to rebuild the builder stage if the dependencies stay the same
You can build with a full python image and only keep a slim image for runtime, saving around 300 MB of space
You can install the dependencies once and reuse them for 2 different runtimes : a server and a worker

3 — Add a baked-in healthcheck to your web app

You can add a health check to your images.

Why ? When your app depends on other services, you can know in real-time when things go wrong.

How to implement it ?

FROM nginx:latest
HEALTHCHECK CMD curl --fail http://localhost/api/healthcheck || exit 1

This can also be implemented in a docker compose file, where it makes more sense from a system perspective.

version: '3'
services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3

Conclusion

I hope these tips will support you in building more efficient Docker apps.

Dora vs Lora : the new solution to finetune LLMs

AdMor — Sun, 30 Jun 2024 13:33:17 GMT

Dora vs Lora : the new solution to finetune LLMs

A new challenger enters the race !

💁 — Do you know Dora?

It’s the new replacement for Lora!

Lora stands for Low-Rank Adaptation, an ultra-popular LLM finetuning method thanks to its efficiency.
Dora, which stands for Weight-Decomposed Low-Rank Adaptation, is said to be better in terms of quality and just as efficient as Lora.

Nvidia recently highlighted this method in one of its posts.
- It compares Dora and Lora on NLP and vision tasks
- Dora sometimes wins with very high margins
- This prompted Nvidia to include this method in their toolkits (Nemo and others).

So it might be the tool of the future to learn right now, right ?

🥇 — Why is it better?

Dora does not offer more freedom to tune models, but less.
So why does it work ?

Reminder about Lora:

Credit to lightning.ai — Common formulation of the finetuning problem, efficiently learn the DW

- We learn a weight increment dW relative to the main model W.
- This weight is learned using low-rank matrices to improve learning speed. dW = AB
- For transformers, it is often accepted that only the attention of the model should be tuned.
- The lower the rank, the faster and more efficient the learning will be than a complete finetuning.

Caption from the Dora paper : the decomposition of dW into magnitude and direction is visible.

For dora :
- dW is simplified, with only the direction of the vector being modified granularly
- The amplitude of the vector will be a multiplier with a lower dimensionality
- So there’s a smaller solution space, which may explain why convergence is more efficient.
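The difference between the two updates can be sketched on a tiny matrix. This is an illustration in plain Python, not the paper’s code : it assumes the magnitude is a vector of column norms initialized from the pretrained weight, and that the low-rank update dW = AB starts at zero.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def column_norms(w):
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def lora_weight(w0, a, b):
    """Lora : W = W0 + dW with dW = A B (low rank)."""
    return add(w0, matmul(a, b))


def dora_weight(w0, a, b, magnitude):
    """Dora : the low-rank update only sets the direction, magnitude rescales each column."""
    v = lora_weight(w0, a, b)
    norms = column_norms(v)
    return [[magnitude[j] * row[j] / norms[j] for j in range(len(norms))] for row in v]


w0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # pretrained weight, 3 x 2
a = [[0.0], [0.0], [0.0]]                  # 3 x 1, zero at initialization
b = [[0.5, -0.5]]                          # 1 x 2, rank 1
magnitude = column_norms(w0)               # one learnable scalar per column

at_init = dora_weight(w0, a, b, magnitude)         # equals W0 before any training
a_trained = [[0.2], [-0.1], [0.3]]                 # pretend A has been trained
updated = dora_weight(w0, a_trained, b, magnitude)
print(column_norms(updated), magnitude)
```

Whatever the low-rank matrices become, the column norms of the Dora weight stay equal to the magnitude vector, so direction and magnitude are learned separately.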

🙋 — Ok I want to use it

It’s already available in HuggingFace’s diffusers lib:

Use the --use_dora flag :
https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md?utm_source=pocket_reader#dora-training

You can find some examples on the model hub of HuggingFace and on Twitter :

Ref : https://x.com/linoy_tsaban/status/1760713331227381950

Example of model

Overall, the method lacks wide adoption as of today.

But it is supported in the main text-to-image frameworks, which shows that some early adopters prefer it over other finetuning methods.
You can find it in the 1.9.0 version of the sd-webui.

🤔 - I have a sudden doubt, why does it work so well?

There have been many attempts to improve on Lora, so why is this one the best ?

> The paper shows the performance gain as a function of the rank r.
The biggest gain of Dora over Lora occurs at low ranks (r <= 8).
This is understandable:
- Fewer parameters: at low ranks, Lora lacks the parameters to represent the update well. In comparison, Dora concentrates on direction

Llama 7B perf improvement with dora vs lora : (caption from the paper)

> A different correlation between magnitude and direction.
Lora does not necessarily learn perfectly.
Direction and magnitude are correlated but do not have the same correlation profile as in full finetuning.
This implies a certain sub-optimality which motivated this research direction.

The correlation of magnitude M and direction D is truer to the original distribution with Dora (caption from the paper)

💯 — Conclusion

If Nvidia starts to include a given technique in its frameworks, it is usually a good time to have a look at it.

As it can work for both text and image, you won’t waste your time testing it.

Do you still do your diagrams yourself in the AI era ?

AdMor — Sun, 09 Jun 2024 13:02:47 GMT

Still doing your diagrams yourself in the AI era ?!?

Free productivity for you with these new tools.

🖼 — An image is worth a thousand words

Not very original, but very true when you write complex reports.
I find it true for diagrams especially : try to explain the following tree in words, your reader will have a hard time figuring it out despite the simplicity of the content.

I took 5 seconds to draw it. How much more would it be on an online tool ?

But diagrams are slow to build with software (using a UI or not).
I often use draw.io, which is simple and gets the job done.
But you need to have finished your design first : updating a diagram is usually as complex as creating it.

In the tree above, if D and E end up attached to B rather than C, I can redraw it quickly on paper. In diagram software, the update could be more complex.

📝 — The speed of the paper, the slickness of the machine

Could we get the best of both worlds ?

YES

I came across this wonderful open source demo : DeTikzify

It accepts many types of input : doodles from a canvas but also pictures.

It works out of the box for this simple use case

It started simple but it went very deep

⚙️ — TLDR : how does it work ?

They collected many diagrams from Arxiv to build their training set.

They use a LLaVA-based model to do the image/text embedding and code generation.

They use VerMCTS to make sure the code compiles.

🤔 — So could we do more ?

These authors also produced AutomaTikZ.

The idea is to write a prompt that builds a diagram template that you can update later on.

In the following example, you can imagine how much time you would need to draw this simple perceptron diagram manually.

Prompt: Visual representation of a multi-layer perceptron: an interconnected network of nodes, showcasing the structure of input, hidden, and output layers that facilitate complex pattern recognition.

I tried to input a vague concept in the prompt and see what the model would output. Not 100% ready yet.

A test on a customer support scenario. This model is not really usable yet.

🤑 — We could become rich with this idea

Unfortunately, you are not the first to have this idea. There are many many AI diagram startups.

Let’s quote a few of them :

Text to diagram :

EdrawMax : prompt to diagram templates

ChatUML : more focused on a chat experience to iterate on the template

LLM chat way of building complex diagram templates

But some of these startups try to offer more than a tool : solutions to business problems built on their chart tools :

Database relationship designer : Softbuilder : you write the user stories and you get a tech specification document

Softbuilder offer : user need to tech specs

Decision tree building based on context : Flowcharts.ai. I would describe it as an advanced Google Form


Most probably, staying too close to a pure tooling positioning is dangerous. One open-source model can challenge your market position.

More applied tools are harder to challenge. With the text-to-diagram example, you saw how limited the generation was when you explain a use case in a prompt.

🔮 — Prediction time : how will all of this evolve

For the productivity aspect of LLM-based personal assistants, this direction of diagramming looks promising.

Many companies hire people to compile data into graphs or represent processes with beautiful decision trees. So doing this instantly is very valuable.

Given how good LLMs are at summarizing information and how compressed the information can be in a diagram, mixing both worlds would be a game changer.

So this is my bet : for the future of Google Gemini, Microsoft Copilot and OpenAI GPT4, generating diagrams will be a big value addition.

The final conclusion of this article

GPT4o : Is it a breakthrough or an incremental innovation ?

AdMor — Sun, 19 May 2024 18:40:12 GMT

GPT4o : Is it a breakthrough or an incremental innovation ?

The demo culture of OpenAI is impressive

You may have seen the OpenAI demo of GPT4-o, which was impressive.
After the emotional shock, it is interesting to break down what was shown in order to reverse engineer what is behind the product, how far ahead OpenAI is, and what their new key competitive advantage is.

What are the key improvements of GPT4-o ?

📹 — Multimodal : video, audio, image. Everything will be understood.
⏰ — Improved latency and can be interrupted : we get much closer to a human conversation
🗣 — Improved text-to-speech : high quality voice, singing capacity

📹 — Going Further in the multi-modal world

ChatGPT has been multi-modal for some time now. It began with GPT4-V, the vision+text version of GPT-4.

It has also been possible to talk to ChatGPT by voice since September 2023.

The main innovation here is the large reduction in latency, made possible by skipping the speech-to-text step that was previously required.

How do you get rid of the speech-to-text step ?
The model has to accept audio tokens directly as input, which is not much different from accepting image tokens.

Multimodal tokenization — source : https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit
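
To make the idea of audio tokens more concrete, here is a minimal sketch of one possible way to turn a waveform into discrete tokens: cut the signal into fixed-size frames (much like cutting an image into patches) and map each frame to the closest entry of a codebook. This is only an illustration under my own assumptions. The function names, the tiny codebook and the frame size are hypothetical, and nothing here is known to match what OpenAI actually does.

```python
from typing import List, Sequence


def frame_audio(samples: Sequence[float], frame_size: int, hop: int) -> List[List[float]]:
    """Cut a waveform into fixed-size frames, like cutting an image into patches."""
    frames = []
    start = 0
    # Only full frames are kept; trailing samples that do not fill a frame are dropped.
    while start + frame_size <= len(samples):
        frames.append(list(samples[start:start + frame_size]))
        start += hop
    return frames


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the frame (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def audio_to_tokens(samples: Sequence[float], codebook: Sequence[Sequence[float]],
                    frame_size: int, hop: int) -> List[int]:
    """Map a waveform to a sequence of discrete token ids."""
    return [nearest_code(f, codebook) for f in frame_audio(samples, frame_size, hop)]


# Hypothetical toy codebook: entry 0 looks like silence, entry 1 like a loud frame.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_wave = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
tokens = audio_to_tokens(toy_wave, toy_codebook, frame_size=2, hop=2)
print(tokens)
```

In a real system the codebook would be learned and the frames would first go through an encoder, but the principle stays the same: once audio is a sequence of ids, a transformer can consume it like text or image tokens.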

If you dig a bit into recent papers, many teams have started to train multi-modal networks on large amounts of unlabelled data :
- In MERLOT, a multi-modal network is trained in a masked auto-regressive manner on YouTube videos. It picks up temporal common sense and also learns to link images with text.
- In VATT (for Video-Audio-Text Transformer), a multi-modal network is trained on videos with a contrastive loss (similar to the idea of CLIP)
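
To give an intuition of what a contrastive loss does, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It assumes you already have one embedding per modality for each item in a batch (for example audio and text), and it rewards the matching pair for being more similar than every mismatched pair. The function names and the temperature value are my own assumptions, and this is not necessarily the exact loss used in VATT.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(a_embs: List[Sequence[float]], b_embs: List[Sequence[float]],
                     temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: a_embs[i] and b_embs[i] form the positive pair."""
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    # Cosine similarity between every a and every b, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each item is closest to its own counterpart in the other modality and high when the pairs are mixed up, which is what pushes the network to align modalities without any labels.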

⏰ — Improved latency and interrupting the robot

The architecture described above avoids the costly speech-to-text step.
But another improvement makes the complete system more human friendly : a switch triggered by Voice Activity Detection.

Why ? If a voice is detected while GPT is speaking, it should probably stop and listen.

Voice Activity Detection (VAD) was initially designed to distinguish voice segments from noise in order to activate an agent (think Alexa).

With a VAD, you can identify when a person is speaking and when there is only background noise

But is it new ? Well, no.

A solo developer even implemented this logic in a local-first project
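
The switch itself can be sketched in a few lines. Below is a deliberately naive, energy-based VAD: a frame counts as voice when its RMS energy is above a threshold, and the assistant is interrupted once enough consecutive frames are flagged. Production VADs rely on trained models rather than a simple energy threshold, and the threshold, the frame count and the function names here are all hypothetical choices of mine.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (assumes a non-empty frame)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def detect_voice(frames: Sequence[Sequence[float]], threshold: float,
                 min_frames: int = 3) -> List[bool]:
    """For each frame, say whether at least `min_frames` consecutive frames
    ending at that frame had an RMS energy above `threshold`."""
    run = 0
    flags = []
    for frame in frames:
        run = run + 1 if frame_rms(frame) > threshold else 0
        flags.append(run >= min_frames)
    return flags


def should_interrupt(assistant_speaking: bool, frames: Sequence[Sequence[float]],
                     threshold: float = 0.1, min_frames: int = 3) -> bool:
    """Stop the assistant only if it is speaking and a voice has been detected."""
    return assistant_speaking and any(detect_voice(frames, threshold, min_frames))


# Toy microphone input: five quiet frames followed by three loud ones.
quiet = [[0.01] * 4 for _ in range(5)]
loud = [[0.5] * 4 for _ in range(3)]
print(should_interrupt(True, quiet + loud))
```

Requiring a few consecutive loud frames is a cheap way to avoid stopping the assistant on a single click or cough. The trade-off is a small delay before the interruption kicks in.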

https://medium.com/media/ffc7d73e137a74316836867c4f514735/href

Some people have also estimated the processing cost on OpenAI's side and guessed the network size from there.

According to Microsoft, 4o is 12 times cheaper to run.

A smaller model would also help explain why the experience feels more natural.

🗣 — Improved text-to-speech

The least you can say about the GPT4-o demo is that talking to an AI is starting to feel very human-like.

Much of this comes from the voice, which is very natural : emotion, laughter, singing.

Has anyone achieved this before ? In fact, YES

Suno AI released Bark, an open-source model capable of singing, generating background noise and more, about a year ago.

https://medium.com/media/b4d97da54579f67bd57a6cedfcbda4ac/href

So with enough data, you might actually be able to do what OpenAI did and build your own expressive voice model.

🤔 — Guesses about GPT4-o

So we have seen some probable GPT4-o innovations.
But other guesses are possible :

Does GPT4-o output text or audio tokens directly ? Generating audio tokens directly would save additional time when producing an answer
A new closed-source dataset may have been used to reach this level of reasoning performance : probably one based on educational content, given the demo

🔚 — Conclusion time

When will open-source (or other companies) catch up with OpenAI ?

Some people (a Research Scientist @ Meta AI (FAIR), for example) already have guesses.

Others are more confident than they were after the GPT4 release

All in all, we can expect the other GAFAM companies to be able to produce similar apps in the near future.

As always, the core of the battle is the data needed to train the models.