Gen AI Series: Alternatives

This article continues the Gen AI series, with particular attention to Large Language Models (LLMs). It discusses alternatives to Gen AI and the questions organizations should ask before implementing a Gen AI capability. The first question organizations should ask themselves when evaluating ML solutions is: “Are there simpler solutions available to me?” Like most organizations, you are probably getting bombarded with decks that have an image of a robot looking at a computer as “slide 1.” (Because nothing signals AI knowledge better than a picture of a Terminator-style robot; see the image below.) These decks look savvy, dazzling, and full of incredible action-word jargon. But does your organization really need powerful LLM tools? Do you really need the proverbial Ferrari to run errands around town when a Honda Civic can get you to the grocery store in the same amount of time and with much less overhead?

99% guarantee that a Gen AI slide deck will have a similar robot on “slide 1”. Notice the AI action words! Source

Why not “Gen AI all the things”?

LLM offerings such as Google’s PaLM, ChatGPT through Azure OpenAI, and AWS JumpStart’s access to Cohere, HuggingFace, and AI21 models make using Gen AI appear straightforward, especially when a managed service allocates compute resources, ties in storage, seamlessly links IAM and security credentials, and lays down template code to put pipelines in place. Managed services are fantastic! But managed services are expensive. If you are a smaller business and your use case can be served by a traditional ML solution, then it may be time to evaluate your options.

There are terrific resources for text extraction, classification with text inputs, and other solutions leveraging LLMs. For example, AWS offers Bedrock, which uses an LLM under the hood to classify text. Time will tell whether these models outperform more traditional document intelligence solutions, such as Optical Character Recognition (OCR) coupled with a classification layer using Naive Bayes or a tree-based model like XGBoost, or whether AWS’s Comprehend platform for document intelligence is a better solution than AWS Bedrock. As companies adopt one or the other, it will become clearer which use cases each platform, traditional ML or managed LLM services, is better suited for.
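To make the “traditional” path concrete, the kind of lightweight text classifier mentioned above can be prototyped in a few lines of scikit-learn. The snippet below is a minimal sketch with made-up documents and labels (not data from this article), standing in for OCR output feeding a simple classifier:

```python
# A minimal "traditional ML" text-classification baseline: TF-IDF features
# feeding a Naive Bayes classifier. Documents and labels are illustrative placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "invoice total due upon receipt",
    "patient intake form and medical history",
    "purchase order for office supplies",
    "lab results and physician notes",
]
train_labels = ["finance", "healthcare", "finance", "healthcare"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # sparse word/bigram features
    ("clf", MultinomialNB()),                        # fast to train, cheap to serve
])
baseline.fit(train_texts, train_labels)

print(baseline.predict(["past due invoice for consulting services"]))
```

A model like this trains in seconds and can be served from a single container, which is exactly the kind of low-overhead alternative worth ruling out before reaching for an LLM.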

These are the questions that organizations should be thinking about when deciding whether to leverage an LLM over traditional machine learning models:

  1. How risk-averse is your organization?
  2. What is the model’s latency?
  3. What is the marginal cost per inference call?
  4. Understand the difference between inference for traditional ML vs LLMs
  5. Is explainability important to you?
  6. What is the scope of the AI application?
  7. What is the time to value?

Can I solve this problem with a traditional ML approach?

It is important to define why your organization is using machine learning to solve a business use case. In the world of machine learning, risk is often associated with how often, and how badly, a model gets answers wrong. This article is not a deep dive into the trade-offs between recall and precision, but for those interested in how false negatives and false positives drive decision-making, I highly recommend the blog series from Dr. Jason Brownlee at “Machine Learning Mastery.”

If a model’s accuracy is paramount to your organization and you are leading a risk-averse solution, such as making healthcare decisions that impact people’s lives, then your decision-making will be different from that of an organization more comfortable with risk. Companies that work in credit card fraud detection tend to be comfortable with a model that throws false positives, allowing the company to alert users when there is little downside other than an annoying text. Companies that deal with HIPAA-related information, such as health insurers, will be far more careful to avoid false positives, which can have life-impacting results if a client is notified in error.
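To make the trade-off concrete, here is a small sketch using scikit-learn with made-up labels and scores; the numbers are illustrative only, but they show how the decision threshold shifts false positives (precision) against false negatives (recall):

```python
# Illustrative only: how a decision threshold trades false positives for
# false negatives, which is where an organization's risk tolerance shows up.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                      # made-up ground truth
y_scores = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.3, 0.55, 0.7])   # made-up model scores

for threshold in (0.5, 0.7):
    y_pred = (y_scores >= threshold).astype(int)
    print(
        f"threshold={threshold}: "
        f"precision={precision_score(y_true, y_pred):.2f}, "
        f"recall={recall_score(y_true, y_pred):.2f}"
    )
```

A fraud team might accept the lower threshold (more false alarms, fewer missed cases), while a healthcare workflow might insist on the higher one to keep false positives down.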

What is the model’s latency?

When evaluating ML solutions, how quickly do you need to get responses back to the user? I recall a VP on a client engagement in the entertainment industry saying that five seconds is an eternity for a user to wait for a dashboard to load. A frontend built with HTML, CSS, and JavaScript can fire off an API call almost instantly, which means the ML model’s inference time, the time it takes to send the user-generated input to the model via API, calculate a prediction, and send the result back to the frontend, is what the user actually feels. This takes time, and time makes the model look ineffective.

Did someone say “latency”? Source

More complex models require more time at inference, and thus longer wait times for clients. If the model produces predictions at timed intervals (batch processing, also known as offline inference), then this doesn’t matter much: you can schedule the batch predictions to run during low-traffic periods, such as graveyard shifts. But if your predictions need to happen in real time (online inference), then model latency matters. To further complicate the dilemma, the longer an LLM’s text output, the longer the wait for results. Again, more complex models take more time. Your organization needs to determine how time factors into the user experience.
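If you want to quantify this rather than guess, a simple approach is to time the inference path directly. The sketch below times a local tree-based model; the commented-out `call_llm_api` function is a hypothetical placeholder for whichever hosted LLM client you use:

```python
# Rough latency check: local tree-model inference vs. a hosted LLM call.
# call_llm_api() is a hypothetical stand-in for your provider's SDK.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(1_000, 20)
y_train = np.random.randint(0, 2, size=1_000)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

def average_latency(fn, *args, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

local = average_latency(model.predict, np.random.rand(1, 20))
print(f"local tree-model inference: ~{local * 1000:.1f} ms per call")

# remote = average_latency(call_llm_api, "Summarize this support ticket ...")
# Hosted LLM calls are typically hundreds of milliseconds to several seconds,
# and grow with the length of the generated output.
```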

Currently, there is little information from companies developing LLMs as to how Service-Level Agreements (SLAs) will be structured to give explicit latency guarantees. At the time of this writing, OpenAI has not provided an SLA for their GPT-powered models.

What is the marginal cost per inference call?

Each time the user calls a model’s API, it costs money. If you are hitting an API backed by a simple ML algorithm, such as a tree-based model like random forest or XGBoost, then the cost is minimal: the serialized model performs a quick calculation on the inputs to return a prediction. If you are using an algorithm that needs the entire training dataset at the point of inference (a so-called “lazy” learner, as with many clustering and nearest-neighbor methods), then your costs will be higher, since you are scanning the dataset at each inference call.

If you are calling an LLM, you could be paying as much as $0.25 for each inference call. Adding to that, with prompt chaining, many inference calls are really a collection of lower-level API calls that combine into one larger call to the model. Include prompt management for in-context learning, such as LangChain; throw in vector database storage using open-source solutions such as Zilliz (Milvus) and ChromaDB, paid solutions such as Pinecone and DeepLake, or managed solutions like GCP Matching Engine and AWS Kendra; and it all adds up.
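A back-of-the-envelope cost model makes this tangible. The per-token prices below are placeholder assumptions (check your provider’s current rate card); the point is how chained calls multiply the marginal cost:

```python
# Back-of-the-envelope marginal cost per request. All prices are
# illustrative placeholders, not any provider's actual rate card.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed input price (USD)
PRICE_PER_1K_OUTPUT_TOKENS = 0.006  # assumed output price (USD)

def llm_call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# A chained request: retrieval prompt + reasoning prompt + final answer.
chain = [(1500, 200), (2500, 400), (3000, 800)]
request_cost = sum(llm_call_cost(i, o) for i, o in chain)

print(f"cost per chained LLM request: ${request_cost:.4f}")
print(f"cost per 10,000 requests:     ${request_cost * 10_000:,.2f}")
```

Compare that with a tree-based model served from your own container, where the marginal cost is essentially compute time plus data egress.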

Understand the difference between inference for traditional ML vs LLMs

As previously stated, when a traditional classification ML model is pushed to production, the model weights are saved in a serialized file, such as a pickle file. That file is stored somewhere the serving code can reach and run reproducibly, such as inside a Docker container. When the model is called for a prediction, the inputs go through a data transformation that allows the data to “talk” to the model weights. The model (algorithm) then applies those weights to the transformed user inputs. Once those calculations are complete, the model sends the results back to the user interface, normally through a REST API. The cost of the inference call is straightforward: data transforms, weight calculations, and data ingress/egress fees.
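As a sketch of that flow (with placeholder data, feature shapes, and file name), a serialized pipeline is fit and dumped at training time, then loaded once by the serving process and applied to each request:

```python
# Minimal sketch of traditional-ML inference: serialize a trained pipeline,
# load it in the serving process, transform the request, return a prediction.
# The training data, feature shapes, and file name are illustrative placeholders.
import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# --- training time: fit and serialize ---
X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
joblib.dump(pipeline, "model.pkl")

# --- serving time: load once, score each request ---
serving_model = joblib.load("model.pkl")

def predict(payload: list) -> dict:
    features = np.array([payload])                     # transform raw inputs
    prediction = serving_model.predict(features)[0]    # apply the stored weights
    return {"prediction": int(prediction)}

print(predict([0.4, 0.7, 0.1]))
```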

Traditional unsupervised ML models behave differently at the point of inference, which changes the cost profile. When users call the API via a frontend interface, the data still needs to go through a transformation so the inputs can interact with the algorithm’s calculations. The divergence from classification is that there are no fitted model weights sitting in a file for the request to hit. Furthermore, many unsupervised ML models behave like lazy learners, meaning they need to take the training data, calculate the clusters/groupings, and then produce a prediction. Scanning the data and recalibrating groupings makes each new prediction take more time and, thus, cost more. The predictions are sent back to the user interface through the API and presented to the user. All of this affects the cost of each prediction returned.
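A nearest-neighbor model is the clearest example of this behavior: in the sketch below (random placeholder data), there is no compact set of weights to serialize, and every prediction compares the query against the stored training set:

```python
# Lazy, instance-based inference: the "model" is essentially the training data,
# so inference cost grows with the size of that data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(10_000, 8)            # placeholder training data
y_train = np.random.randint(0, 3, size=10_000)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # fit mostly just stores the data

query = np.random.rand(1, 8)
print(knn.predict(query))  # distance comparisons against stored examples happen here
```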

Large Language Models (LLMs) are typically trained for days, weeks, or months, depending on the size of the training dataset. (For more information on training datasets used in LLM architecture, please refer to the following Medium post.) When the user submits a prompt for prediction, the LLM pipeline has a few moving parts that diverge from traditional methods. The user’s prompt text is split into tokens, and an embedding model converts those tokens into dense vectors of floating-point numbers. Those embeddings are then compared to the embeddings stored in a vector database, where the indexed document vectors live. Often referred to as semantic space, the vector database holds embeddings in the form of vectors that can be queried for similarity, most notably with Approximate Nearest Neighbors (ANN). “Approximate” is used instead of “exact” because scanning the entire vector database for exact matches would not be time- or cost-efficient.

Semantic space (vector DB) looks like it came from Tron. Source
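The core of that retrieval step is vector similarity. Below is a minimal, brute-force sketch with made-up embeddings and documents; a real system would use a trained embedding model and an ANN index (for example, Milvus or FAISS) instead of scanning everything:

```python
# Minimal semantic-search sketch: rank stored documents by cosine similarity
# to a query vector. Embeddings here are random placeholders; real systems use
# a learned embedding model and an ANN index rather than this brute-force scan.
import numpy as np

rng = np.random.default_rng(0)
doc_texts = ["refund policy", "shipping times", "warranty claims", "account login help"]
doc_vectors = rng.normal(size=(len(doc_texts), 384))   # placeholder document embeddings
query_vector = rng.normal(size=384)                    # placeholder query embedding

def cosine_similarity(matrix: np.ndarray, vector: np.ndarray) -> np.ndarray:
    return (matrix @ vector) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))

scores = cosine_similarity(doc_vectors, query_vector)
top_k = np.argsort(scores)[::-1][:2]                   # two nearest documents
for idx in top_k:
    print(f"{doc_texts[idx]}  (score={scores[idx]:.3f})")
```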

Once a match is found in this semantic space, the result is passed to the model, which then decides which response to generate based on predicted probability scores. If the user is chaining prompts together, such as through LangChain, the application will store those prompts as “context” held in local memory, similar to how the vector database returns results. Once the model has produced its predictions, they are returned to the user in the form of text. Since the LLM’s inference architecture is more complex and its training process more costly, the marginal cost of inference is much higher.
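Conceptually, that “context” is little more than the previous exchanges prepended to each new prompt. Here is a framework-free sketch; the `generate` function is a hypothetical stand-in for an LLM client call and simply echoes, so the flow runs without any provider:

```python
# Conceptual sketch of prompt chaining with accumulated context.
# generate() is a hypothetical placeholder for a real LLM client call.
context = []

def generate(prompt: str) -> str:
    return f"[model response to: {prompt[:40]}...]"   # placeholder output

def chat(user_message: str) -> str:
    # Each call sends the running context plus the new message, which is
    # why chained prompts multiply token counts, latency, and cost.
    prompt = "\n".join(context + [f"User: {user_message}"])
    reply = generate(prompt)
    context.extend([f"User: {user_message}", f"Assistant: {reply}"])
    return reply

print(chat("Summarize our refund policy."))
print(chat("Now shorten that to one sentence."))
```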

Is explainability important to you?

Some businesses are required to explain how their models work, particularly those in the insurance and financial sectors. With new laws developing around both traditional and generative ML, model explainability is an area many companies are keenly interested in. However, some businesses may not need explainability, particularly for decisions that don’t directly affect people’s lives. Model explainability becomes more difficult as model complexity increases, and it often adds to the cost of model development through new explainability tooling. This is an area the business should be aware of.

Model explainability on some LLMs. Source
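For traditional models, explainability can be as simple as checking which features move the prediction. The sketch below uses scikit-learn’s permutation importance on placeholder data with made-up feature names; comparable post-hoc tooling for LLMs is far less mature:

```python
# Simple explainability for a traditional model: permutation importance shows
# how much each (placeholder) feature contributes to model performance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # outcome driven by features 0 and 2

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)

for name, score in zip(["age", "income", "tenure", "region_code"], result.importances_mean):
    print(f"{name:12s} importance: {score:.3f}")
```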

What is the scope of the AI application?

Bringing an LLM to production takes time, just as a traditional machine learning solution does. When done well, the model itself is only one part of a well-oiled MLOps (machine learning operations) pipeline. Building both traditional and Gen AI solutions within an MLOps infrastructure allows for ease of scaling, repeatability of results, fast iteration for data scientists to experiment, and lower technical debt. However, MLOps takes time. Organizations should first understand the type of solution they want to build. The following are product scopes that need to be determined before building a traditional ML or Gen AI solution. Needless to say, one should always have a well-defined process for adding features to the scope of the AI application to avoid scope creep.

Maintain clear definitions of what the product will look like before the build phase to avoid scope creep. Source
  • Proof of Concept: The POC is a toy example of a working AI solution. Normally this runs in a local environment, such as a data scientist’s laptop, and lives in either a Jupyter Notebook or an undeployed snippet of code. The POC shows general functionality but will lack many of the features desired for organizational deployment.
  • Minimum Viable Product: The MVP is a deployable version of the POC. The MVP should be used to gather organizational buy-in to sponsor the development of a fully built solution that can be taken to production. A key element of an MVP is that users can interact with it in the organization’s cloud environment, local network, or VPC.
  • Soft Launch Product: Soft launches are when a development team pre-selects a group of users to test the application in an organization’s environment. Sometimes referred to as the Quality Assurance (QA) portion of the development, the QA process is a great time to collect user feedback and improve on the application. The time this takes can range from days to months.
  • Deployed: If following proper MLOps guidelines, the AI application should be deployable behind a simple REST API (see the sketch below). However, security and infrastructure teams need to be actively involved in the deployment, since they will be the gatekeepers for future use.
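As a hedged illustration of what “deployable behind a simple REST API” can mean in practice, here is a minimal Flask sketch wrapping the kind of serialized model discussed earlier; the file path, endpoint name, and payload shape are assumptions:

```python
# Minimal model-serving sketch with Flask. The pickle path, endpoint name,
# and payload shape are illustrative assumptions, not a prescribed layout.
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.pkl")   # serialized pipeline produced at training time

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                  # e.g., {"features": [0.4, 0.7, 0.1]}
    features = np.array([payload["features"]])
    prediction = model.predict(features)[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # in production this sits behind a gateway inside the VPC
```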

Needless to say, each of these steps takes time. Timelines are further complicated by how involved the data science work, infrastructure development, data engineering tasks, and organizational governance constraints are. These can include, but are not limited to, the following:

  • Data engineering (ETL, processing, querying, prompt engineering, prompt templates, feature engineering)
  • Feature store curation
  • Modeling (traditional ML algorithms, fine-tuning LLMs, hyperparameter tuning)
  • Repository management
  • Container registration and orchestration
  • Model housing (model registration and tracking)
  • Model performance tracking
  • Governance templates
  • Infrastructure as Code (IaC) components
  • Pipeline orchestration
  • Security: API endpoints and VPC networking (subnets and IP ranges)

Where does MLOps fit? Source

What is the time to value?

Taking into account the scope of your organization’s AI application discussed above, how long does it need to run to yield results? Going further, how long will it take before those results yield value for the organization? Normally, time-to-value encompasses two areas:

  1. Time to train
  2. Time to results

Time to train refers to how long the model takes to complete its training and tuning and be ready for use. Time to results refers to how long the model needs to sit in production, gathering user data and feedback in the form of API calls, before it provides business value. For example, when using cloud provider AI platforms such as Google’s Discovery Retail API (a real-time recommendation engine), the application needs to collect user data for at least one month to yield accurate predictions, and the model needs to train for two to four days before being deployed. Many of these larger cloud-based models require specific time frames to collect enough user data to make predictions. On the other hand, smaller ML models can be trained in minutes, though with less accuracy and for narrower use cases. Whether your organization is deploying a robust, managed cloud AI solution through AWS, Google, or Microsoft, or building the deployment infrastructure in-house, time-to-value calculations should be made well before work begins.
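A rough way to put a number on this before work begins is to add the stages up; every figure below is a placeholder to replace with your own estimates:

```python
# Back-of-the-envelope time-to-value estimate. Every number is a placeholder
# assumption to be replaced with your own project estimates.
build_and_deploy_days = 30    # MLOps pipeline, infrastructure, security reviews
time_to_train_days = 4        # model training and tuning
data_collection_days = 30     # time in production before predictions are trustworthy

time_to_value_days = build_and_deploy_days + time_to_train_days + data_collection_days
print(f"estimated time to value: ~{time_to_value_days} days "
      f"({time_to_value_days / 7:.1f} weeks)")
```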

Last Word

Deciding when and where to use Gen AI, particularly LLMs, is a large task for most organizations. Companies need to weigh a variety of factors when considering the gains from LLMs. While not all use cases are appropriate for LLMs, certain areas shine when leveraging Gen AI, most notably those that involve language and text inputs. These areas are developing at a rapid pace, and the benefits in the textual space, particularly at scale, are growing quickly.

Special thanks to Anurag Bhatia, Steven Pais, and Rishi Sheth for the technical feedback during the writing process.

Author’s note: This article was not written by Gen AI.


Nicholas Beaudoin
Eviden Data Science and Engineering Community

Nicholas is an accomplished data scientist with 10 years in federal and commercial consulting practice. He specializes in ML operations (MLOps).