ML vs DL vs LLM, from the perspective of uncertainty quantification (UQ)

Shuo Li
8 min read · May 3, 2023


Motivation

In the past two decades, the machine learning (ML) community has witnessed dramatic shifts in mainstream techniques. Before the 2010s, most research and projects revolved around classic ML algorithms, like logistic regression (LR) and the Support Vector Machine (SVM). In the early 2010s, AlexNet and other deep learning (DL) models carried the world into the DL age, and ever since, despite their controversies, DL models have been increasingly deployed in practical scenarios. Finally, just a decade after the rise of DL, we are beginning to witness large language models (LLMs) capturing everyone’s attention. Models (systems) like GPT-4, CLIP, and DALL-E outperform their DL counterparts and enable new capabilities like zero-shot classification.

From the uncertainty quantification (UQ) perspective, the techniques of these three eras pose different challenges. While much work has been devoted to the first two eras, especially to DL models, LLMs present unique challenges, like hallucination, which existing techniques cannot trivially solve because of differences in training process, training data, target tasks, and more between DL models and LLMs. These differences drive us to explicitly identify the characteristics of each era, which can suggest suitable uncertainty quantification techniques for the LLM era.

Classic ML Era: Principled but Less Powerful

Let’s start with the classic ML era. The classic ML era was the “golden age” of uncertainty quantification. Models were designed from first principles or reasonable heuristics. Consequently, users mostly understood what features were used and how decisions were made. Take SVM, for example. The original SVM was just a linear classifier maximizing the margin between the positive and negative classes. Features for the linear classifier were usually designed based on domain knowledge, and the optimal weights were “learned” via constrained optimization. Most classic ML models can be described in a similarly simple but convincing way.

The first characteristic of these models is that the decision boundary is explicit. For linear models (e.g., LR and SVM), the decision boundary is just a hyperplane specified by the model parameters; for K-Nearest Neighbors (KNN), the decision boundary is the manifold of points equidistant from the different neighboring clusters; for decision trees (DT), the decision boundary is the manifold specified by the branchings. Second, these models work with small datasets; usually, these datasets are so small that we expect only limited generalization ability from the trained models. Furthermore, these models tend to have far fewer parameters, dispelling concerns about computational complexity.

In this age, UQ techniques were straightforward, as we could measure decision uncertainty by the distance between a newly observed point and the decision boundary. Further, computationally demanding methods, like Bayesian inference, could be used because of the limited number of parameters. Last but not least, these models needed to be made sufficiently powerful despite their limited size and training data; as a result, attention was mostly paid to improving performance.
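
To make this concrete, here is a minimal sketch (mine, not from any particular classic reference) of distance-to-boundary uncertainty for a linear SVM, assuming scikit-learn; points close to the hyperplane are the most uncertain:

```python
# Minimal sketch (illustrative): using the distance to a linear SVM's hyperplane
# as a simple uncertainty score. Assumes scikit-learn; data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
clf = LinearSVC(C=1.0).fit(X, y)

# decision_function returns w·x + b; divide by ||w|| to get the geometric distance.
margins = clf.decision_function(X)
distances = np.abs(margins) / np.linalg.norm(clf.coef_)

# Points close to the hyperplane (small distance) are the most uncertain.
most_uncertain = np.argsort(distances)[:5]
print("Indices of the 5 most uncertain points:", most_uncertain)
```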

DL Era: Supervised and Specialized

In the DL era, we started to obtain models with human-level performance on specialized tasks. Training a DL model starts with collecting a curated, labeled dataset; to get good performance, this dataset should be large. Then, according to the underlying task, a DL architecture is selected or developed; next, the DL model is trained, usually with gradient-based methods, on a training set. After training, the DL model is evaluated on a testing set, which is expected to be sampled from the same distribution as the training set.
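
As a minimal sketch of this pipeline (illustrative only; the synthetic data, architecture, and hyperparameters are placeholders I chose, assuming PyTorch):

```python
# Minimal sketch of the standard DL pipeline described above (assumes PyTorch);
# the dataset, architecture, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

# 1) A curated labeled dataset (here: synthetic data standing in for a real one).
X = torch.randn(1024, 20)
y = (X.sum(dim=1) > 0).long()

# 2) Select an architecture for the task.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# 3) Train with a gradient-based method on the training split.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X[:800]), y[:800])
    loss.backward()
    optimizer.step()

# 4) Evaluate on a held-out test split assumed to come from the same distribution.
with torch.no_grad():
    acc = (model(X[800:]).argmax(dim=1) == y[800:]).float().mean()
print(f"test accuracy: {acc.item():.2f}")
```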

First, compared to classic ML, DL models are often regarded as “black boxes.” Specifically, the learned features are implicit and usually hard to interpret. Furthermore, the decision boundaries are not explicit, making it almost impossible to measure the distance between an observed data point and the decision boundary. Second, the datasets are different. Compared to the classic ML era, datasets in the DL era are usually larger but still carry the notion of “in-distribution” (iD), the distribution from which the data points were sampled. DL models perform well on in-distribution data points and badly on out-of-distribution (OOD) data points. Third, while DL models are complicated, they still target a single task and always require extra training to adapt to new tasks. For instance, to adapt a CNN trained on an image classification dataset (e.g., ImageNet) to object detection, extra neural networks, like the classification and localization heads in Faster R-CNN, need to be trained. Fourth, it has been empirically demonstrated that DL models are ill-calibrated, meaning that their raw estimated confidence does not necessarily reflect their true confidence. Last, the most successful DL models are discriminative models, estimating the label distribution given the current observation, i.e., \(\Pr(y \mid x)\).
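
To make the calibration point concrete, here is a small sketch (mine, not from the post) of expected calibration error (ECE), which compares a model’s stated confidence with its empirical accuracy within bins:

```python
# Minimal sketch (illustrative): expected calibration error (ECE) compares
# predicted confidence with empirical accuracy inside confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted max-class probabilities; correct: 0/1 outcomes."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of points
    return ece

# Toy example: a model that is systematically overconfident.
conf = np.array([0.9, 0.95, 0.85, 0.9, 0.8])
hit = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hit))
```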

In general, UQ techniques have been widely developed in the DL era. Most of these UQ techniques are post hoc and supervised. Specifically, the first characteristic of UQ in the DL era is that it is distribution-free, meaning that the UQ results hold no matter what the underlying distribution is. This is critical in the DL age because the estimated \(\Pr(y \mid x)\) has no closed form, making any assumption about the distribution impossible to verify. Representative UQ techniques in this age include conformal prediction, temperature/Platt scaling, histogram calibration, etc. The second characteristic is that these techniques are supervised. In particular, UQ techniques in this age always require a held-out calibration set, which is supposed to be sampled from the in-distribution. The third characteristic is the independent and identically distributed (i.i.d.) or exchangeability assumption: they assume that the calibration and testing sets share the same distribution (i.e., the in-distribution); the training distribution can differ, but a mismatch degrades UQ efficiency. There are techniques extending the i.i.d./exchangeability assumption to distribution-shift or time-series scenarios, but they still require labeled data from the shifted distribution.
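
As a rough illustration of the post hoc, supervised flavor, here is a minimal sketch of split conformal prediction for classification (my own illustration, not a specific library’s API), assuming softmax outputs and a labeled, held-out calibration set:

```python
# Minimal sketch of split conformal prediction for classification (illustrative).
# Assumes softmax scores and a labeled calibration set drawn from the in-distribution.
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs/test_probs: (n, K) softmax outputs; cal_labels: (n,) integer labels."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Quantile with the finite-sample correction used in split conformal prediction.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # Prediction set: all labels whose score does not exceed the threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

# Toy usage with random stand-in "softmax" outputs.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=100)
cal_labels = rng.integers(0, 3, size=100)
test_probs = rng.dirichlet(np.ones(3), size=5)
print(conformal_prediction_sets(cal_probs, cal_labels, test_probs))
```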

One limitation of these UQ techniques is their reliance on labeled data. Because labeled data are costly, scaling these UQ techniques by collecting more data points is rarely feasible. Furthermore, these “supervised” UQ techniques only work for distributions that have already been observed; they might fail drastically when generalizing to unseen distributions. Another limitation is marginal vs. conditional guarantees. Specifically, while these techniques provide guarantees on their UQ performance, the guarantees usually hold only on average over the whole data distribution. However, in the scenarios that most need UQ (e.g., medical, autonomous driving), the most desired property is a “conditional” guarantee, meaning the UQ results hold for each individual data point.
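
As a tiny, hypothetical illustration of the gap between marginal and conditional guarantees (the groups and numbers are made up), coverage can look acceptable on average while failing entirely on a rare subgroup:

```python
# Illustrative sketch: marginal coverage can look fine while coverage within a
# specific subgroup (e.g., a rare patient population) is much worse. The groups
# and numbers here are hypothetical placeholders.
import numpy as np

covered = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # 1 = true label was in the prediction set
group = np.array(["common"] * 8 + ["rare"] * 2)

print("marginal coverage:", covered.mean())                 # 0.8 overall
for g in ["common", "rare"]:
    print(f"coverage for {g}:", covered[group == g].mean())  # 1.0 vs 0.0
```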

LLM Era: Self-Supervised and Versatile

We finally arrive at the large language model era, which everybody is excited about. Training LLMs usually involves three stages: pre-training, fine-tuning, and in-context learning. In the pre-training stage, because learning is self-supervised, in either a causal or non-causal (masked) manner, the training process can utilize vast amounts of unlabeled data. Intuitively, pre-training aims to learn general (possibly semantic) features that are helpful across related tasks. In the fine-tuning stage, a relatively smaller labeled dataset is usually used so that the LLM can achieve superior performance on a specific task. Finally, in the in-context learning stage, a limited number of demonstrations (labeled examples) are provided; the hope is that the LLM learns the pattern within the few demonstrations and answers in the same way.
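
As a toy sketch of the in-context learning stage (the demonstrations and the commented-out completion call are hypothetical, not a real API), a handful of labeled examples are placed directly in the prompt and the model is expected to continue the pattern:

```python
# Toy sketch of in-context learning: a few labeled demonstrations are placed in
# the prompt and the model is asked to continue the pattern. The examples and
# the `complete` function are hypothetical placeholders, not a real API.
demonstrations = [
    ("The movie was fantastic.", "positive"),
    ("I would not recommend this product.", "negative"),
]

def build_prompt(demos, query):
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt(demonstrations, "The plot dragged, but the acting was great.")
print(prompt)
# answer = complete(prompt)  # hypothetical call to an LLM completion endpoint
```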

The first characteristic of LLMs is their datasets. Because of the self-supervised learning process, LLMs are exposed to far more data points than their predecessors. On the one hand, observing more data points gives LLMs a broader in-distribution. On the other hand, because these data points are unlabeled, the notion of in-distribution in the LLM era differs from that in the DL era, where both observations and labels are present. It therefore remains an open question how the observed unlabeled data benefit a downstream task (or how best to utilize them to obtain good downstream performance). One interesting, related concept is parametric memory, which regards an LLM as a big “memorizer” whose knowledge consists of the observed data points. The second characteristic is model size. LLMs are drastically larger than previous models, making it almost impossible to use computationally complex techniques. Third, LLMs are not specialized to a single task. LLMs, often discussed alongside foundation models, are expected to learn general features from unlabeled data; after observing a huge amount of such data, the learned features are potentially general/semantic, so they are helpful for many related downstream tasks. Take the Segment Anything Model (SAM) from Meta as an example. After self-supervised learning, the features extracted by SAM are regarded as semantic in that they can segment out semantic objects without any label. More powerful features also enable new capabilities like zero-shot classification, where an image is classified solely by comparing its features to the features of a text description of an object class (like “a cat”). Fourth, LLMs are primarily generative models, instead of the discriminative models of the DL era. Last but not least, it has been empirically shown that sufficiently large LLMs are well-calibrated on fixed-choice (multiple-choice) problems.
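
A minimal sketch of the zero-shot classification idea (in the spirit of CLIP; the encoders and embeddings here are hypothetical stand-ins for a real pretrained model):

```python
# Minimal sketch of CLIP-style zero-shot classification: embed the image and the
# candidate class descriptions into a shared space and pick the most similar class.
# `text_encoder` and the image embedding are hypothetical stand-ins for a real model.
import numpy as np

def zero_shot_classify(image_embedding, class_names, text_encoder):
    texts = [f"a photo of {name}" for name in class_names]
    text_embeddings = np.stack([text_encoder(t) for t in texts])
    # Cosine similarity between the image embedding and each class text embedding.
    sims = text_embeddings @ image_embedding
    sims /= np.linalg.norm(text_embeddings, axis=1) * np.linalg.norm(image_embedding)
    return class_names[int(np.argmax(sims))]

# Toy usage with a fake 4-dimensional embedding space.
rng = np.random.default_rng(0)
fake_text_encoder = lambda t: rng.standard_normal(4)
fake_image_embedding = rng.standard_normal(4)
print(zero_shot_classify(fake_image_embedding, ["a cat", "a dog"], fake_text_encoder))
```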

There is ongoing UQ research on LLMs, but for now there is no fully convincing technique. From one perspective, we can divide the existing approaches into supervised and unsupervised.

Generally, supervised methods collect answers from LLMs along with feedback from humans. Supervised methods can then calibrate the LLMs using these collected data, but only on the specific tasks represented by the data. The advantage of supervised methods is that they are essentially the same as UQ techniques in the DL era, so mature techniques from that era can be reused. If the underlying task is well specified, e.g., answering GRE questions, data can be collected more purposefully, and better UQ performance can be expected. The downsides of supervised methods are the same as those of supervised learning: data are costly to collect, making the techniques hard to scale, and the UQ only works on the distribution specified by the collected data, so it cannot generalize to unseen tasks.
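
As a hedged sketch of this supervised route (not a specific published method), one could fit a Platt-style calibrator that maps an LLM’s raw confidence score to the probability that humans judged its answer correct; the scores and labels below are made up:

```python
# Hedged sketch of the supervised route (illustrative, not a specific published
# method): fit a Platt-style calibrator mapping an LLM's raw confidence score to
# the probability that humans judged its answer correct. Assumes scikit-learn;
# the scores and labels below are made-up placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_confidence = np.array([[0.95], [0.90], [0.80], [0.60], [0.99], [0.70]])
human_judged_correct = np.array([1, 0, 1, 0, 1, 0])

calibrator = LogisticRegression().fit(raw_confidence, human_judged_correct)

# Calibrated probability of correctness for new answers with given raw scores.
print(calibrator.predict_proba(np.array([[0.85], [0.65]]))[:, 1])
```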

There exist only limited unsupervised UQ techniques for LLMs. The advantage of unsupervised UQ is that it is easy to run and can generalize across tasks. However, it has a major limitation: it can only quantify subjective uncertainty. Specifically, with unsupervised methods, the quantified uncertainty is based solely on the LLM’s memory, which could store wrong information. Without access to true information, the quantified subjective uncertainty might deviate from the objective uncertainty, which is grounded in “ground truth” information. For instance, an LLM might absorb several fake news stories about a celebrity from the internet. Then, when asked a related question about the celebrity, even with a perfect unsupervised method, the LLM would output the wrong answer with low uncertainty, because the answer is consistent with its memory.
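
One common unsupervised idea is self-consistency: sample the model several times and measure how much it agrees with itself. A minimal sketch follows (the `sample_answer` function is a hypothetical stand-in for sampling an LLM at nonzero temperature); as discussed above, this only captures subjective uncertainty:

```python
# Minimal sketch of an unsupervised, self-consistency-style uncertainty score:
# sample the model several times and use the disagreement (entropy) among its
# own answers. `sample_answer` is a hypothetical stand-in for sampling an LLM at
# nonzero temperature; note that this only measures subjective uncertainty.
from collections import Counter
import math
import random

def self_consistency_uncertainty(question, sample_answer, n_samples=10):
    answers = [sample_answer(question) for _ in range(n_samples)]
    counts = Counter(answers)
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    majority = counts.most_common(1)[0][0]
    return majority, entropy  # high entropy = the model disagrees with itself

# Toy usage with a fake sampler that answers inconsistently.
fake_sampler = lambda q: random.choice(["Paris", "Paris", "Lyon"])
print(self_consistency_uncertainty("What is the capital of France?", fake_sampler))
```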

If we imagine what UQ techniques are desirable in the LLM era, I would first consider scalability, as this is the property that distinguishes LLMs from their predecessors. For UQ, scalability means that as we observe more (unlabeled) data, UQ performance should improve significantly, or even enable new capabilities like conditional calibration.

From my understanding, there are two extremes of “scalability.” At one extreme, methods derived from first principles can scale: they work no matter how large the data are. The challenge is that such principled methods are extremely difficult to identify or incur huge computational complexity. Moreover, most UQ methods derived from first principles require labeled data, which makes them hard to scale. At the other extreme, purely data-driven methods might be scalable. This requires the UQ techniques to exploit the information embedded in unlabeled data points. But again, the deviation between subjective and objective uncertainty might still be encountered.

Conclusion

I briefly compared three important eras of machine learning development, namely classic ML, DL, and LLM, and summarized their characteristics mainly in terms of datasets, model size, and specialization. Furthermore, I analyzed the uncertainty quantification techniques and their characteristics in each era. Last, I speculated about the desired properties of UQ in the LLM era. Though there are many uncertainties in how the LLM era will develop, one thing is certain: a better future is on the horizon!
