LLMs, GPT-4, LangChain…Are we there yet for industrialized AI?

Date: April 2023
Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

This is not an AI-generated image. This is a photo of a real object at the Museum of Illusions, Austin TX.

Large Language Models (LLMs), such as GPT-4 and Llama (and many others), are revolutionizing the world, with huge potential to transform software engineering, business applications, and many other facets of our lives. These models are evolving at a pace that surpasses the ability of the market and end users to adapt. A blog post like this one, along with countless other online articles, news stories, forum discussions, and even AI-generated content like images, videos, and fake interviews (example), will eventually become part of the vast training data for LLMs in the wild.

We are only scratching the surface of how generative AI models will change the landscape of industry applications in the near future. There is a lot of hype and excitement around the success of OpenAI's GPT-4 and similar LLMs. With the proper regulations in place, LLMs will ultimately change how we develop data products today, with the clear advantage of lowering cost, level of effort, and time-to-market for various industry use cases.

As we understand it today, Large Language Models (LLMs) can be put to work in business applications through several methods (level of effort, cost, risk, and complexity increase from 1 to 3):

  1. Developing an AI-powered, data-driven application (eventually AI agents) featuring a user interface (UI) that stores OpenAI API responses in a back-end database or presents them directly to the user. The user experience can be further enhanced with techniques such as prompt engineering, measuring the similarity between embeddings, and even integrating additional approaches like knowledge graphs. Streamlit, the Python-based data application framework in the Snowflake Data Cloud, plays a huge role in this space because most LLM developer tools and frameworks are Python-based. Rapidly evolving frameworks like LangChain also allow developers to create agentic, data-aware applications such as autonomous agents and chatbots.
  2. Fine-tuning an LLM, such as GPT-4, by providing a select set of labeled examples to enhance its performance and adaptability in specific contexts. Fine-tuning entails refining a pre-trained model on specific tasks or domains by adjusting its weights with additional training. This process allows the model to become more specialized, making it better suited for the desired context. For example, fine-tuning an LLM for the healthcare domain involves adapting the model to understand and process labeled medical terminology and context.
  3. Training a brand-new LLM, potentially smaller in scale than a foundation model, tailored to the unique requirements and constraints of specific business applications, using a combination of raw and curated datasets. These models can be trained on proprietary data within an organization's boundaries in a secure and governed way.
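To make the first approach concrete, here is a minimal, hypothetical sketch of the request/store loop such an application might use. The names `SYSTEM_PROMPT`, `build_messages`, and `store_exchange` are invented for illustration (not part of any real API), and the actual OpenAI call is left commented out:

```python
# Sketch of method 1: wrap a user question in an engineered prompt and
# keep each exchange so it could later be written to a back-end database.
SYSTEM_PROMPT = (
    "You are a support assistant for a retail data application. "
    "Answer only from the provided context; say 'I don't know' otherwise."
)

def build_messages(question: str, context: str) -> list[dict]:
    """Build a chat-style message list (the format OpenAI's chat API expects)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Stand-in for a back-end table: in a real app this would be an INSERT.
conversation_log: list[dict] = []

def store_exchange(messages: list[dict], response_text: str) -> None:
    conversation_log.append({"request": messages, "response": response_text})

messages = build_messages(
    "What is the return policy?", "Returns accepted within 30 days."
)
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)  # real call omitted
store_exchange(messages, "Returns are accepted within 30 days.")
print(conversation_log[0]["response"])
```

A Streamlit app would wrap this same loop in a text input and a chat-style display, persisting `conversation_log` to a database table instead of an in-memory list.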

Snowflake Data Cloud's innovative TILT model by Applica is a multi-modal LLM for document intelligence that organizations can fine-tune on their own documents. (More to read in the recent Snowflake blog post by Torsten Grabs.) There is also a great blog post on integrating OpenAI into your SQL queries in Snowflake via external functions.

A significant number of early adopters, who display commendable courage given the uncertainties and risks around the use of LLMs, have chosen to explore and implement innovative methods for showcasing the capabilities of large language models in a variety of fields. Some applications, like GitHub Copilot and Notion AI, already leverage OpenAI. Nonetheless, much more work needs to be done before LLMs can be integrated into mission-critical industry solutions while maintaining realistic Service Level Agreements (SLAs) concerning explainability, security, and reliability. As we encounter announcements of new LLMs from companies or open-source communities, it is crucial to consider the necessities around fundamental building blocks, including rules and regulations, that will shape the successful implementation and adoption of these powerful models.

The Role of Data and Data Gravity

"Garbage in, garbage out" also applies to LLMs. As a daily user of ChatGPT, I am very impressed with the breadth of knowledge, memory, and understanding of GPT-4, which helps immensely with text generation and summarization. However, one thing to remember is that GPT models are trained on many (undisclosed) datasets, such as news articles, books, social data, and forums like Twitter and Reddit, which can contain uncertified and inaccurate information. Furthermore, LLMs are still known to have "hallucinations," returning fabricated information.

https://openai.com/research/gpt-4

Snowflake Data Cloud offers a wealth of opportunities for large enterprises and the advancement of industrialized AI. First, Snowflake's ability to bring new compute resources to data (aka "data gravity") further enhances its potential, making it an ideal platform for organizations seeking to harness the power of AI-driven solutions. Second, Snowflake's extensive Marketplace contains a diverse collection of trusted datasets. Additionally, Snowflake's robust security and governance features for protecting the internal data used to train and deploy enterprise-ready LLMs create huge opportunities to build reliable AI-powered industry solutions for enterprises.

“Don’t share any of your secrets with LLMs” — even when asking questions or prompting

Security and data privacy concerns surrounding LLMs have become increasingly relevant amid all the hype. As these powerful AI-driven tools continue to advance, they are capable of processing vast amounts of data for training, including sensitive and private information. Many researchers, developers, and product companies are working hard to capitalize on the trend as quickly as possible. As a result, the potential for misuse or unintended consequences poses a huge risk for organizations. It is essential that developers, researchers, and users of LLMs take steadier steps when building AI applications, prioritizing data protection and privacy and adopting robust security measures such as data anonymization and encryption.
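As one illustration of the anonymization idea, the sketch below masks a couple of obvious identifier patterns before a prompt is sent anywhere. This is deliberately naive: a production system would rely on dedicated PII-detection tooling rather than two hand-written regexes.

```python
import re

# Naive anonymization pass: mask emails and US SSN-like patterns
# before a prompt ever leaves the organization.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(prompt: str) -> str:
    prompt = EMAIL_RE.sub("[EMAIL]", prompt)
    prompt = SSN_RE.sub("[SSN]", prompt)
    return prompt

safe = redact("Customer jane.doe@example.com (SSN 123-45-6789) reported an issue.")
print(safe)  # Customer [EMAIL] (SSN [SSN]) reported an issue.
```

The key design point is that redaction happens on your side of the API boundary, so the raw identifiers never appear in the request payload at all.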

A recent Analytics Engineering podcast from dbt Labs talks about techniques, such as differential privacy and data clean rooms, that can potentially help with some of the most serious data privacy concerns.
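To give a flavor of the differential privacy technique mentioned there, here is a toy sketch that releases a count with Laplace noise so that no single record can be pinned down from the published number. The `epsilon` value and sensitivity of 1 are illustrative choices for this example, not recommendations:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via inverse-CDF transform."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity of a counting query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # seeded for reproducibility in this sketch
noisy = private_count(1000, epsilon=1.0, rng=rng)
print(round(noisy, 2))
```

Smaller `epsilon` means more noise and stronger privacy; the published value is close to the truth in aggregate but deniable for any individual.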


Other Considerations

As LLMs become more integrated into decision-making and customer-facing applications, the "black box" nature of these advanced AI systems and their lack of explainability is another big concern. The lack of transparency can create difficulties in ensuring compliance with regulations, maintaining accountability, and building credibility with stakeholders as well as customers. Furthermore, the black-box nature of LLMs can lead to unintended consequences or biases in their outputs.

Additional challenges include the scalability constraints of training LLMs and the latency limitations of model inference, particularly in scenarios that demand high concurrency.

All of these constraints necessitate a comprehensive ecosystem of tools and technologies to move from mere "hype" to practical industry applications backed by AI agents.

What’s there for the developers?

Whether you are a data engineer, data analyst, or data scientist, it is good to start learning about Transformers (check out the free course from HuggingFace). If you have an OpenAI API key, start experimenting with the new Python packages in the LLM ecosystem by building sample Streamlit applications on public datasets. For more advanced use cases, embeddings and vector similarity matching are good techniques to experiment with.
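As a starting point for experimenting with vector similarity matching, the sketch below ranks documents by cosine similarity. The tiny hand-made vectors stand in for real model embeddings (which would typically be high-dimensional, e.g. 1536 dimensions from an OpenAI embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for two documents; real ones would come from a model.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I get my money back?"

best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # refund policy
```

The same ranking step is the core of retrieval-style applications: embed the query, score it against stored document embeddings, and feed the best matches into the prompt.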

Code generation and documentation are among the relatively less risky uses of GPT. The VSCode Code GPT plugin is one of the good ones to cautiously experiment with for boosting your productivity.

If you are new to LLMs, here is a great session by Diego Oppenheimer on the DevTools for Language Models: https://youtu.be/tkL2c-16fXc

As we step into the realm of industrialized AI, the widespread adoption and integration of LLMs are closer than ever. However, we still need a mature, rich ecosystem of enterprise tools, frameworks, and infrastructure, as well as rules and regulations around data protection.

Please register for the Snowflake Summit 2023 in Las Vegas for exciting announcements related to all this!


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect