Reflections on the 2024 AI Engineering World’s Fair

Martin Arroyo
Published in 99P Labs
13 min read · Jul 15, 2024
The view from the SkyView Lounge at the AI Engineering World’s Fair Opening Celebration

Introduction

This past June, I attended the 2nd annual AI Engineering World’s Fair. Created by the team at Latent.Space, it brings together leading AI companies, founders, VPs of AI, and AI Engineers for three days of workshops, showcases, and talks centered on sharing ideas and insights about building and deploying AI-powered applications. Hundreds of developers, leaders, and representatives from company sponsors like Microsoft, Anthropic, OpenAI, and AWS converged on San Francisco to share and learn from one another.

The workshops were divided by track, of which there were nine in total. The tracks that I focused on primarily were Retrieval Augmented Generation (RAG) Frameworks, Evaluations and LLMOps, and AI Leadership.

What is an AI Engineer?

A nascent role, the AI Engineer is considered “someone with conventional engineering skills in addition to knowledge of language models and prompt engineering, without being a full-fledged Machine Learning expert.” The name was coined by Latent.Space in their seminal article, “The Rise of the AI Engineer”.

“In numbers, there’s probably going to be significantly more AI Engineers than there are ML engineers / LLM engineers. One can be quite successful in this role without ever training anything.” — Andrej Karpathy

After meeting some fellow attendees, I found that this description mostly held true. The attendees I interacted with were generally a mix of software engineers with a keen interest in AI and some experience building LLM-based applications. There were also researchers in attendance who have been working with these technologies for decades and who consider themselves AI Engineers as well. My own perspective is certainly closer to the former — that of a software engineer with an interest in AI and its applications.

Our Research Goals

I work as a Research Engineer for 99P Labs as part of the Software-defined Intelligence team. Our group mainly focuses on research related to software-defined mobility and data. We study mobility issues with an eye towards building a future hybrid, harmonious society — a blended world that integrates technologies such as artificial intelligence.

Given the surge of interest in applied AI, I have spent the past several months surveying the current landscape and learning about the latest state-of-the-art techniques for building these applications at scale. As part of our research, we have built a chatbot that can answer questions about our blog. You can learn more about that project here, as well as the follow-up article where we discuss powering our chatbot with a knowledge graph.

Conference Breakdown

I will summarize each day of my experience at the conference, then distill the key insights and trends that I found overall. It should be noted, however, that this is far from comprehensive — there were many sessions I simply could not attend, whether for lack of time or because of scheduling conflicts. This section reflects my own personal journey through the conference offerings.

Registration desk for the AI Engineering World’s Fair

Day 1

The first day of the conference kicked off with sessions focused on building with RAG frameworks. These were longer, very hands-on sessions, averaging about 2.5 hours in length.

In the first session, Lance Martin from Langchain discussed the importance of reliable agent frameworks, and introduced LangGraph for graph-based workflows to enhance flexibility and reliability. The session emphasized corrective RAG and the critical role of testing and monitoring in AI deployment. Having used the LangGraph framework for a recent project, I can attest to how flexible and reliable it can make your application. By the end of the session, we had built several small examples of corrective RAG pipelines.
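To give a flavor of those exercises, here is a minimal corrective-RAG sketch using LangGraph’s StateGraph API. The retriever, grader, rewriter, and generator below are placeholder functions of my own, not the workshop’s code — a real pipeline would back them with a vector store and LLM calls:

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class RAGState(TypedDict):
    question: str
    documents: List[str]
    answer: str


def retrieve(state: RAGState) -> dict:
    # Placeholder retriever; swap in a real vector store lookup.
    return {"documents": [f"A document loosely related to: {state['question']}"]}


def grade(state: RAGState) -> str:
    # Placeholder relevance grader; a real one would call an LLM or a reranker.
    return "generate" if state["documents"] else "rewrite"


def rewrite(state: RAGState) -> dict:
    # Placeholder query rewriter for the corrective loop.
    return {"question": f"(rephrased) {state['question']}"}


def generate(state: RAGState) -> dict:
    # Placeholder generator; a real one would prompt an LLM with the context.
    return {"answer": f"Answer grounded in {len(state['documents'])} document(s)."}


graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("rewrite", rewrite)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
# The "corrective" step: route to generation if retrieval looks good,
# otherwise rewrite the question and retrieve again.
graph.add_conditional_edges("retrieve", grade, {"generate": "generate", "rewrite": "rewrite"})
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()
print(app.invoke({"question": "What does 99P Labs research?", "documents": [], "answer": ""}))
```

The graph structure is what makes the pattern reliable: every routing decision is an explicit, testable edge rather than logic buried inside a single prompt.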

The next session was run by Microsoft and highlighted their AI templates, which help streamline the deployment of AI applications. This session was aimed particularly at startups. We used Azure and GitHub Codespaces to put together and deploy several RAG applications without writing a single line of code.

The day concluded with a workshop hosted by Neo4j on creating knowledge graphs from structured data, showcasing techniques for graph building, vector search, and semantic search. This was the most packed session of the day, as many people were interested in the potential of knowledge graphs. The team at Neo4j taught us a lot, and I was able to take what I learned and apply it directly to a project.
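To give a sense of the approach, here is a minimal sketch of loading structured rows into Neo4j and querying the resulting graph, assuming a local Neo4j 5.x instance. The rows, labels, and relationship types are illustrative placeholders rather than the workshop’s material:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical structured rows, e.g. from a CSV export of blog posts.
rows = [
    {"title": "Building a Chatbot for Our Blog", "author": "Martin Arroyo", "topic": "RAG"},
    {"title": "Powering a Chatbot with a Knowledge Graph", "author": "Martin Arroyo", "topic": "Knowledge Graphs"},
]

with driver.session() as session:
    for row in rows:
        # MERGE keeps the load idempotent: each node and relationship is
        # created once, no matter how many times the script runs.
        session.run(
            """
            MERGE (p:Post {title: $title})
            MERGE (a:Author {name: $author})
            MERGE (t:Topic {name: $topic})
            MERGE (a)-[:WROTE]->(p)
            MERGE (p)-[:ABOUT]->(t)
            """,
            title=row["title"], author=row["author"], topic=row["topic"],
        )

    # Once the graph exists, retrieval becomes a traversal question.
    result = session.run(
        "MATCH (a:Author)-[:WROTE]->(p:Post)-[:ABOUT]->(t:Topic {name: $topic}) "
        "RETURN a.name AS author, p.title AS title",
        topic="RAG",
    )
    for record in result:
        print(record["author"], "-", record["title"])

driver.close()
```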

Learning about knowledge graphs with Neo4j at the AI Engineering World’s Fair

Day 2

Day two was packed full of shorter (yet impactful) sessions. I attended 11 in total, not including the keynote. The first two sessions from the AI Leadership track focused on creating frameworks for evaluating the ROI of AI in software development, as well as lessons learned from the team at Weights & Biases about productionizing AI models and applications.

The first keynote speech of the conference was given by Simon Willison, where he talked about the “GPT-4 Barrier,” which posits that reasoning quality and costs for powerful LLMs have had an inverse relationship over the last year. He believes this trend will only continue. The rapid changes in the AI landscape and the ongoing AI trust crisis were highlighted. Willison pointed out that customer trust in solutions is paramount moving forward.

He also highlighted issues like data privacy, prompt injection, and the “gullibility” of language models — essentially that a strength and a weakness of models is that they’ll believe whatever you tell them. His parting words were a call-to-action to not publish AI content that is unchecked and for creators to take accountability for the content they produce.

“It’s on us (AI Engineers) to establish patterns for how to use this stuff responsibly… and help get everyone else on board.” — Simon Willison

Following Simon was Mozilla, who introduced Llamafile — an open-source project that lets you run language models on your local machine (or most other environments, for that matter). This development makes it easier for builders to iterate quickly, save on costs, and deploy to most devices.
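Because a llamafile starts a local server with an OpenAI-compatible API (on port 8080 by default), trying one out from Python is straightforward. A minimal sketch, assuming you have already downloaded and launched a llamafile:

```python
# Talk to a locally running llamafile with any OpenAI-compatible client.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="http://localhost:8080/v1",  # the llamafile's local endpoint
    api_key="sk-no-key-required",         # placeholder; no key is needed locally
)

response = client.chat.completions.create(
    model="LLaMA_CPP",  # llamafile accepts a generic model name
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response.choices[0].message.content)
```

Pointing an existing client at a local endpoint like this is what makes iteration cheap: the rest of your application code does not change when you switch between local and hosted models.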

I spent the rest of the day in sessions from the RAG Frameworks and Evaluations & LLMOps tracks. Nikhil Thota, an engineer from the team at Perplexity, gave an insightful session on building scalable RAG systems that emphasized the importance of optimizing latency, cost, and backend performance. He highlighted the need for human-in-the-loop evaluations and effective orchestration of system components to manage the complexities of production environments. A key insight here, echoed throughout the conference, was that, at scale, building production-grade LLM applications is largely a systems issue and much less about which model(s) to use.

Ian Webster from Discord shared valuable lessons from deploying Clyde AI to millions of users. He emphasized the importance of security, legal, and safety considerations, along with building a culture of evaluation and integrating it into your existing CI/CD pipelines. The session introduced tools like promptfoo for red-teaming LLMs and strategies for mitigating toxic inputs and outputs.

Salesforce then led a session on increasing efficiency in AI deployments, providing strategies for choosing more efficient (and smaller) models, along with a discussion of techniques like quantization and LoRA that reduce the size of model weights. Other sessions delved further into RAG optimization, domain-specific evaluation metrics, and innovative approaches like Extended Mind Transformers to enhance retrieval and context handling.
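As a rough illustration of the quantization and LoRA techniques mentioned above, here is a sketch using Hugging Face’s transformers and peft libraries: the model loads with 4-bit quantized weights, then LoRA adapters make only a small fraction of parameters trainable. The model choice and target module names are illustrative and model-specific, not from the session:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",  # an example small model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # which layers get adapters (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```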

Overall, these sessions underscored the critical need for continuous monitoring, evaluation, and domain-specific customization.

The Expo Hall at the AI Engineering World’s Fair

Day 3

The final day began with a keynote talk by Chris Lattner from Modular discussing AI infrastructure and the introduction of Mojo, a Pythonic systems programming language. The talk focused on simplifying GenAI deployment and improving model efficiency. Other keynotes from the morning included insights from Amazon on integrating AI into software development and Anthropic’s latest advancements with the Claude model.

Alex Volkov from Weights & Biases (and ThursdAI) kicked off the workshop sessions with insights on building effective evaluation frameworks. He emphasized the importance of traceability, human-in-the-loop evaluations, and warned us to avoid premature fine-tuning. Volkov highlighted methods such as evaluation datasets, programmatic scoring, and LLM-as-a-judge to ensure robust performance assessments.
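To make those methods concrete, here is a toy evaluation loop of my own that pairs a cheap programmatic assertion with an LLM-as-a-judge check. The dataset, rubric, and application under test are all stand-ins:

```python
from openai import OpenAI

client = OpenAI()

eval_set = [
    {"question": "What is RAG?", "must_mention": "retrieval"},
]

def my_app(question: str) -> str:
    # Stand-in for the application under test.
    return "RAG augments generation with retrieval from an external source."

def programmatic_score(answer: str, case: dict) -> bool:
    # Assertion-based check: did the answer mention the required term?
    return case["must_mention"].lower() in answer.lower()

def judge_score(question: str, answer: str) -> bool:
    # LLM-as-a-judge: ask a (preferably stronger) model for a yes/no verdict.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nAnswer: {answer}\n"
                       "Is this answer accurate and complete? Reply YES or NO.",
        }],
    )
    return "YES" in verdict.choices[0].message.content.upper()

for case in eval_set:
    answer = my_app(case["question"])
    print(case["question"],
          "assertion:", programmatic_score(answer, case),
          "judge:", judge_score(case["question"], answer))
```

Keeping the scores binary, as Volkov and later speakers suggested, makes results easy to aggregate and track across runs.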

Following this, a session by Parlance Labs and Rechat provided a systematic approach to constructing domain-specific LLM evaluation systems. The presenters stressed the significance of assertions, logging, and aligning LLM judges with human evaluations. Practical steps included using synthetic data generation to create evaluation datasets and iterating quickly with a human-in-the-loop approach.
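Synthetic data generation for evaluation sets can be as simple as prompting a model to invent realistic questions from a source document and then having a human review them before they enter the eval set. A minimal sketch (the prompt and model choice are my own, not the presenters’):

```python
from openai import OpenAI

client = OpenAI()

def synthesize_questions(doc: str, n: int = 5) -> list[str]:
    # Ask a model to invent realistic user questions answerable from `doc`;
    # these become evaluation cases after a quick human review.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write {n} realistic user questions, one per line, "
                       f"that can be answered using this text:\n\n{doc}",
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

questions = synthesize_questions("99P Labs researches software-defined mobility and data.")
print(questions)
```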

After that, Kyle Corbitt from OpenPipe discussed the circumstances under which fine-tuning is necessary, outlining the costs, latency benefits, and quality improvements that come with it, as well as when it isn’t necessary. Essentially, whether or not to fine-tune comes down to your own performance and cost needs. Over time, fine-tuning may cost less than using pricier models while also providing a performance boost, as fine-tuned models generally perform better on domain-specific tasks than RAG systems.

Nathan Peck from Amazon showcased their new AI coding assistant, Q Developer. It streamlines software development by generating scripts, explaining code sections, and assisting with documentation. It is designed to “enhance rather than replace developers.”

Q Developer can currently handle around 75% of development tasks according to Nathan. The session highlighted the potential of integrating AI assistants into development workflows to improve speed and accuracy, demonstrating capabilities similar to but more advanced than tools like Copilot. It looks promising, and from the short amount of time I got to demo it, it seems like it could be a useful addition to my development toolset. I can definitely see this helping to improve my development speed and quality.

The final session on GraphRAG with Neo4j, led by Andreas Kollegger, delved into the integration of knowledge graphs for enhanced data retrieval and the practical application of combining structured and unstructured data. He explained how to create schemas that separate data sources by layers, enabling effective vector searches and entity extraction. Kollegger advised starting with a minimum viable graph and then building layers to support multiple uses. Throughout, he emphasized the potential of knowledge graphs to enhance contextual understanding and retrieval accuracy.
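As a rough sketch of what that layered retrieval can look like at query time: a vector search over an unstructured “chunk” layer, followed by a hop into the structured entity layer so the answer can draw on both. The index name, labels, and relationship types below are illustrative, not Kollegger’s schema:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
question_embedding = [0.1] * 384  # placeholder; use a real embedding model here

# Vector search over the unstructured layer, then traverse into the
# structured layer (Neo4j 5.x vector index syntax).
query = """
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $qvec)
YIELD node AS chunk, score
OPTIONAL MATCH (chunk)-[:MENTIONS]->(e:Entity)
RETURN chunk.text AS text, score, collect(e.name) AS entities
"""

with driver.session() as session:
    for record in session.run(query, qvec=question_embedding):
        print(round(record["score"], 3), record["entities"])

driver.close()
```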

Closing Keynote

The closing keynote featured a demonstration of OpenAI’s latest multimodal model, GPT-4o (Omni). It integrates text, vision, and voice to facilitate natural human-computer interaction. GPT-4o is designed to be twice as fast and 50% cheaper than previous models, while supporting five times higher rate limits.

In the first demo, the speaker interacted with GPT-4o in real time, having a natural conversation with it, then showing it a hand-drawn picture of the Golden Gate Bridge, which the model used to infer that the conference was taking place in San Francisco.

The next demo showed how well the visual reasoning of the model performs. GPT-4o was shown a random page from a book and asked to both read and summarize it. The results were pretty impressive.

The keynote concluded with a talk by the team behind the publication, “What We’ve Learned From A Year of Building with LLMs” — a comprehensive review of lessons learned from a year of developing LLM applications. Strategically, it emphasized that the model itself is not the competitive advantage; rather, leveraging product expertise and focusing on niche applications are crucial. Building applications that are well-designed, solve real problems, and provide utility to users is essential. The importance of continuously iterating based on user feedback and setting clear evaluation objectives was emphasized throughout as a means to ensure that applications remain relevant and effective.

Operationally, they addressed common misconceptions about AI engineering roles and the importance of hiring the right talent at the right time. Misunderstandings about skills and inflated expectations can hinder progress. Emphasizing specific, applied roles and continuous learning can help manage these challenges. Tactically, the talk underscored the critical role of evaluations in the success of AI applications. Effective evaluations should be broken down into binary dimensions and assertion-based tests. Regularly analyzing data, minimizing skew, and implementing guardrails are essential for maintaining safety and reliability.

The talk concluded by acknowledging the difficulty of transitioning from a demo to a production-ready product, again emphasizing the need for robust evaluation frameworks and thorough planning to make that transition.

“The model is not the moat… the system around it is”

Emerging Trends & Key Insights

The conference sessions covered a wide variety of topics; however, some key trends and insights emerged over the course of the three days. They are:

Evaluations and monitoring are critical for taking projects from proof-of-concept to production

Consistently emphasized across sessions, robust evaluation frameworks to measure AI performance, reliability, and impact are crucial for success.

These evaluations need to be domain-specific and should ideally be built out first to enable faster iteration and incorporate user feedback. Methods include human-in-the-loop evaluations, model-based evaluations, and domain-specific metrics.

Ensuring traceability and monitoring throughout the AI lifecycle, from pre-training to production, is critical for maintaining performance and addressing issues.

We can count on foundation models to continue to get smarter and less expensive to use

The current best-in-class models like GPT-4o and Claude Sonnet are free to use in their respective chat interfaces, and API calls to these models are less expensive than calls to comparable models were even six months ago. This inverse relationship between model intelligence and cost is expected to continue as the technology progresses.

RAG is here to stay (for a while, at least)

While there is still ongoing research into how to expand current context window sizes to enhance retrieval — Extended Mind Transformers show a lot of promise in this area — the consensus is that RAG is here to stay. Context windows are still limited, and fine-tuning is still expensive, so this technique won’t be going anywhere soon. And even in a world of infinitely large context windows, RAG will still have a place for practicality’s sake, especially for working with large amounts of data.

GraphRAG is emerging as a trend in developing RAG pipelines and representing data for retrieval. Many are curious about the possibilities, and Microsoft recently released their GraphRAG solution for extracting entities and relationships from unstructured data.

Use smaller models that perform well instead of defaulting to larger models

There is a shift towards using smaller models to help reduce both latency and costs. Techniques like quantization and LoRA are being used to decrease the size of the model weights while retaining performance. Models with smaller weights are less expensive to fine-tune and have lower response latency.

Smaller models can also be deployed on more platforms. For example, Phi-3 is a small language model (SLM) developed by Microsoft that can outperform models twice its size (including GPT-3.5 Turbo) across a variety of language, reasoning, coding, and math benchmarks.

Tangentially related to this point, there is also an interest in optimizing inference on CPUs given the costs and other difficulties associated with GPU development. This is exemplified by projects like Llamafile.

Agentic architectures are emerging as the most prevalent system for building AI applications

There is significant interest in building agents into AI applications for more flexible and dynamic solutions. However, there are still challenges with using agents in production, namely reliability and consistency. Much of this is due to the non-deterministic nature of LLM outputs. Frameworks like CrewAI and LangGraph are emerging as different ways to help improve the reliability of agents.

AI is already being integrated into day-to-day workflows

Many teams and organizations are working on ways to integrate AI into their workflows. For engineers, this takes the form of coding assistants that help with writing project code. For business users, it often takes the form of chatbots they can use to ask questions of some knowledge base.

AI Applications are easy to demo and very difficult to productionize

Developing AI applications that can be used reliably at scale takes time for issues to be ironed out, often on the order of weeks or months. However, it is worth taking the time to get things right.

AI is more accessible than ever, but is still hard to use

While AI is more accessible to a wider variety of users than ever before, it is still relatively difficult to use and tends to reward power users. For instance, the concept of prompt engineering, and the nuances of prompting different models, can be difficult for non-technical users. Even uploading PDFs to something like ChatGPT requires understanding how to structure the PDF file to get the best response possible. Accessibility does not imply ease of use, and this is especially true with AI currently.

It’s on us (AI Engineers) to establish patterns for how to use this stuff responsibly… and help get everyone else on board

Establishing trust and safety guardrails for AI is crucial for its continued adoption and use. This includes handling issues of data privacy, prompt injection, toxic inputs, and overall ethical usage. AI Engineers are in a unique position to help establish and model best practices.

The model is not the moat — the system around it is

Given the rapid pace of innovation, teams that aren’t building models should focus their efforts on what will provide lasting value: evaluation frameworks, guardrails to prevent undesired outputs, reductions in latency and cost, and the infrastructure that powers iterative improvement of all of the above.

Summary and final thoughts

I had a great time and learned a lot from attending the AI Engineering World’s Fair. There was so much rich content from enthusiastic vendors and developers that I had to bookmark the talks I didn’t get to attend so I can watch the recordings when they are released. There was a lot of excitement in the air, as all of this technology is still so new and everyone is still trying to figure out the best practices that lead to success. It seems as though some companies have come close, but we are not yet at a place where standards can be defined, although we are approaching that point rather quickly.

We value your interest in 99P Labs and appreciate your time spent reading our blog. If you have any questions, concerns, or would like to discuss potential collaborations, we encourage you to reach out to us. You can connect with us on LinkedIn or Medium to stay updated on our latest research and innovations. Additionally, you can email us at research@99plabs.com to initiate a conversation. We are always excited to engage in meaningful discussions and explore exciting opportunities.

Thank you for your support, and we look forward to hearing from you.
