Standardized Tests: What We Learn from GPT-4
Standardized tests are used to screen applicants for work training programs and colleges; they also are used for licensure for many roles. The performance of GPT-4 on standardized tests raises questions. First, are its high scores as predictive of its capabilities as such scores would be for humans? Second, does its performance tell us we need to rethink our reliance on predictive validity and search for measures that directly assess competence rather than correlates of that competence? A path to such methods is appearing.
Among the first reports of GPT-4’s capability has been an account of its ability to score well on a variety of standardized tests, including the Uniform Bar Exam and a number of AP tests.[1] This has been touted by OpenAI as an indication of how powerful the new large language model (LLM) is. However, it also raises some questions about those standardized tests and what they measure.
Background
Historically, standardized tests have been a product of psychometric research, and the focus has been on reliability and predictive validity. By reliability, we mean that two people with the same competence should receive the same score on the test. Standardized test items are evaluated to make sure this is so. Partly, this is done using a technique called differential item functioning analysis[2] to evaluate candidate items for inclusion in the test. To be a fair and reliable predictor, the correlation between performance on a specific test item and overall performance on the complete test should be the same for all identifiable groups of test takers. So, for example, a test developer would reject a test item that correlated highly with overall test score for, say, white males but did not correlate well with overall score for African American females, or vice versa.
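Operational differential item functioning analyses typically rely on methods such as Mantel-Haenszel statistics or item response theory models, but the simplified sketch below illustrates the group-by-group comparison of item-total correlations described above. The function names and the flagging threshold are invented for illustration, not taken from any test developer's actual procedure.

    import numpy as np

    def item_total_correlation(item_scores, total_scores):
        # Correlation of one item's scores with examinees' overall test scores.
        return np.corrcoef(item_scores, total_scores)[0, 1]

    def flag_item_for_dif(responses, groups, item_index, threshold=0.15):
        # `responses` is an (examinees x items) array of 0/1 item scores, and
        # `groups` is one group label per examinee. The item is flagged if its
        # item-total correlation differs across groups by more than `threshold`
        # (the threshold here is illustrative, not a psychometric standard).
        totals = responses.sum(axis=1)
        correlations = {}
        for g in set(groups):
            mask = np.asarray(groups) == g
            correlations[g] = item_total_correlation(responses[mask, item_index], totals[mask])
        spread = max(correlations.values()) - min(correlations.values())
        return spread > threshold, correlations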
More important is validity, whether the score a person gets on the test is an accurate index of the skills or competence the test purports to measure. Historically, standardized tests have been anchored in predictive validity, whether one’s score on the test predicts a competence that can be measured either at the time the test is taken or later. For example, a major requirement of the SAT exam is that it predict a student’s grades in the freshman year of college, which it does quite well, though better for some populations than for others (among other things, some groups have different expectations of how well they will perform on the test, and that affects their scores[3]).
Note that predictive validity does not mean that the test directly measures the quality of performances it is meant to predict. In fact, most standardized tests rely upon the fact that people who have mastered a given body of knowledge tend to have learned a number of details about that domain. For example, someone who can tell you the value of π is more likely to have some math competence than a person who cannot (though the emergence of more web traffic about Pi Day likely has changed that). Someone who can recognize key elements of the plot of The Hound of the Baskervilles is more likely to have read and digested Sherlock Holmes stories than someone who cannot. And someone who can recognize which of several legal outcomes relates to Plessy v. Ferguson is more likely to know a bit about how the law relates to segregation than someone who cannot.
A test is more reliable if it has more items. That way, if a particular student missed a topic in class due to illness but has learned the tested domain overall, their score will not be lowered excessively because of an unlucky item selection. Having more items also makes it easier to decrease the bias that might come from one group being just a bit more familiar with the item content than another group. This is a challenge because the time needed to complete a standardized test ordinarily is limited. When the test is given in a school, it takes up time that could be used for learning. If given at a testing center, there is cost attached to the time needed for the test — payments to test proctors, rental of the testing location, etc. To have many items completed in a limited amount of time requires making each item short, often multiple choice or short answer format.
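The claim that longer tests are more reliable has a classical quantitative form, the Spearman-Brown prophecy formula. The article does not invoke it, but this small sketch shows how projected reliability rises when comparable items are added, and why test developers must balance that gain against the testing time each added item consumes.

    def spearman_brown(reliability, length_factor):
        # Spearman-Brown prophecy formula: projected reliability when a test is
        # lengthened (or shortened) by `length_factor` using comparable items.
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    # A 20-item test with reliability 0.70, doubled to 40 comparable items:
    print(spearman_brown(0.70, 2))  # about 0.82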
Some Characteristics of GPT Performance Today
The performance of GPT-4 on standardized tests, which is impressive,[4] raises a couple of questions. First, are its high scores as predictive of its capabilities as such scores would be for humans? Second, does its performance tell us we need to rethink our reliance on predictive validity and search for measures that directly assess competence rather than correlates of that competence? Such new measures are quite attainable, as discussed below.
We all know about people who talk a good game but may not always be able to play a good game. When GPT-4 can pass a standardized test, it may or may not be able to do the performance for which that test is meant to predict success. In some cases, it will do fine. For example, there remain many first-year college courses that give exams requiring only the regurgitation of bits of knowledge. ChatGPT should do fine on many such tests. Table 1 provides an example of its response to a question one could imagine on a psychology test. Plenty of other examples could be given.
However, sometimes performance on a test may, for humans, predict performance capabilities but the prediction might not hold for a large language model. Consider the response of ChatGPT to being asked how to turn an airplane while flying, shown in Table 2. It is complete, and if a pilot followed the directions ChatGPT gives and had the necessary sensorimotor experience, the turn would be successful. However, I am not sure we would be ready to fly with ChatGPT running the plane. We would need to know that it knows how to enact each of the actions it lists. If we asked this question of a human who had enough hours of flying experience with an instructor, the response would be predictive of the actual ability to make the turn. However, whether from a large language model or a smart person who searched the Internet for flying instructions, being able to answer the question might not adequately predict being able to do the task.
What we can learn from these examples is that predictive validity is context-bound. Talking a good game predicts playing a good game only when there is also evidence of sufficient practice doing what the talk describes.
Of course, some college courses live in a space where the talk is the game. For example, a literary criticism course involves learning how to ask certain kinds of questions of a text and how to write clearly and in an engaging way about the answers to those questions. For that kind of knowledge domain, ChatGPT performance on course assessments will probably be as good as its performance on standardized admission tests.
Sometimes, a query gets handled by ChatGPT in a way that is sort of what is needed but maybe not exactly. Consider, for example, Table 3. ChatGPT was asked how to determine the area of a farm with irregular boundaries. It mostly listed devices to do that task, though the third option it gave was a brief description of how one could do the task without devices, using the grid approach. It could have given more detail on how to use the grid approach, but it didn’t. So, we cannot be sure ChatGPT could actually figure out the farm’s area. Still, ChatGPT likely would provide the extra detail if asked in a follow-up question.
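For readers who want the missing detail, here is one minimal sketch of the grid approach (mine, not ChatGPT's), under the assumption that the farm boundary has been traced as a polygon of surveyed corner points: overlay a square grid and count the cells whose centers fall inside the boundary.

    # Approximate the area of an irregular plot by counting grid cells whose
    # centers lie inside the boundary polygon (a list of (x, y) corner points).
    from typing import List, Tuple

    def point_in_polygon(x: float, y: float, poly: List[Tuple[float, float]]) -> bool:
        # Standard ray-casting test for whether (x, y) lies inside `poly`.
        inside = False
        n = len(poly)
        for i in range(n):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside

    def grid_area(poly, cell=1.0):
        # Count cells of size `cell` x `cell` whose centers fall inside the polygon.
        xs = [p[0] for p in poly]
        ys = [p[1] for p in poly]
        count = 0
        x = min(xs)
        while x < max(xs):
            y = min(ys)
            while y < max(ys):
                if point_in_polygon(x + cell / 2, y + cell / 2, poly):
                    count += 1
                y += cell
            x += cell
        return count * cell * cell

    # A 100 m x 50 m rectangular field should come out near 5000 square meters.
    print(grid_area([(0, 0), (100, 0), (100, 50), (0, 50)], cell=1.0))

Shrinking the cell size trades computation for accuracy, which is exactly the kind of follow-up detail one would hope to elicit from ChatGPT with further questioning.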
Overall, ChatGPT can answer a lot of questions, which is why it can do so well on various standardized tests. Where it has struggled, at least until the larger GPT-4 was produced, has been with tasks like simple arithmetic or mapping of a statement into a different formalism or a different ontological base, as sometimes is required on standardized reading and math tests. At present, it does remarkably well most of the time, occasionally producing only a statement about why it is not providing the requested response.
On the other hand, some standardized test items might work precisely because the test taker does not know too much. When applying to graduate school in a previous century, I was required to submit my score on the Miller Analogies Test. The version of the test that I took would be seen as culturally biased today. I still remember one of its items that today would fail the screening for differential item functioning. It was a fill-in-the-blank analogy item: Napoleon is to brandy as Caesar is to _____. I did get the item correct, with the answer of salad. Interestingly, ChatGPT, at least today, would fail this item, not because it lacks analogical capability but because it knows too much and “knows” some things that are not true. Examine the conversation in Table 4. ChatGPT’s knowledge of a range of facts convinces it that the Caesar salad response is imperfect. That is, it shows more capability than humans for this specialized purpose rather than less. At the same time, while it knows that champagne was invented long after Caesar was dead (see Table 5), it initially answers the analogy question using reasoning based upon Caesar having enjoyed champagne.
Overall, we can see that while ChatGPT can do quite well on many standardized tests, it does not exhibit quite the same performance as a competent human might. This prompts me to consider some issues that merit more exploration.
Issues Meriting Further Research
Talking a good game vs. playing a good game. Clearly, ChatGPT is rapidly becoming able to pass all kinds of standardized tests. The big question is whether it can perform as well as its scores would predict. To some extent, the answer is that it can. There is no question that it can handle the content of all sorts of large lecture courses and perform well on the knowledge-regurgitating tests often given in such courses. What is less clear is how well it could apply what it knows to real world situations without an intermediary interpreter.
It is clear, for example, that its ability to specify the steps needed to turn an aircraft would likely not predict its ability to actually complete the turn. If nothing else, no one has built a motor interface for GPT-4 yet, nor is there any reason to believe that with such an interface GPT-4 would have the right sense of feel to complete the turn successfully. Existing tests for pilots in training can assess whether the future pilot knows what to do in principle, but to know that the pilot can actually fly the plane, real flying performance, or performance in a realistic simulator, is required. Still, ChatGPT might in time be able to participate in performance testing and especially in scoring those performances.
Knowing what humans know at a given point in life does not predict what GPT-4 could learn next. The area measurement case is a bit more complex. It would, at least, be necessary to have further dialogue with ChatGPT to determine that it knows how to perform the relevant math, but conceivably it does or soon will. Table 3 does raise an interesting issue, though. ChatGPT knows how to use available tools to determine the plot’s area, and indeed humans generally do not need to resort to the grid scheme given the tools available. At the same time, as humans progress through the learning of mathematics, there likely are times when their ability to carry out the grid method is predictive of their readiness for additional math learning.
More broadly, one purpose of standardized tests is to determine how far along a curriculum path a student has progressed. GPT-4 has learned what it knows by an entirely different path than humans take in school. Consequently, we cannot count on its standardized test performance to predict what it could learn next or how that learning might occur. For example, in the US, the K-12 math curriculum tends to follow a fixed order. Generally, knowing that a person can do all the math taught by Grade 6 is a good reason to believe that a person can quickly learn or already knows how to do Grade 7 math. ChatGPT learns in a very different way, so its performance on a standardized test may not predict how it would do on a test for a later grade.
Not everything ChatGPT “knows” is true. As mentioned above, sometimes ChatGPT knows things that are not true. Consider Table 5, which shows the next part of my conversation with ChatGPT after that in Table 4. What is interesting is that with appropriate probing, one generally can get ChatGPT to acknowledge that a false claim it has made is indeed false. Still, unless its attention is focused, it can make semantic generalizations that at some level it knows are wrong.
At least for now, ChatGPT behaves like what Piaget would call a preoperational child, except that it knows much more than children do at that stage. Currently, ChatGPT lacks the functionality that we gain from our prefrontal cortex and other brain structures. We can self-check our thinking. So, even though Caesar likely drank a lot of expensive beverages and champagne is an expensive beverage, we can check the association between those two facts so we do not conclude that Caesar drank champagne. It would not be all that difficult to build a “wrapper” for ChatGPT that did such self-checking, but that is not the current state of that particular large language model. Bing Chat, Microsoft’s implementation of a large language model, does provide references for some of the claims it makes, which is a step in the right direction. Just like many of my grade school math teachers, I sometimes want to scream at ChatGPT to “show your work.”
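To make the “wrapper” idea concrete, here is one minimal sketch of what such self-checking might look like. It is an illustration only: ask_model stands in for whatever LLM call is available, and the prompts are invented, not drawn from any existing system.

    from typing import Callable

    def self_checking_answer(question: str, ask_model: Callable[[str], str]) -> str:
        # One sketch of the "wrapper" idea: draft an answer, have the model list and
        # check the factual claims the draft relies on, then revise accordingly.
        # `ask_model` is a hypothetical stand-in for any large language model call.
        draft = ask_model(question)
        critique = ask_model(
            "List each factual claim in the following answer and say whether it is "
            "actually true, giving your reason:\n\n" + draft
        )
        revised = ask_model(
            "Rewrite the answer below, keeping only claims judged true in the critique, "
            "and show the supporting reasoning ('show your work').\n\n"
            "Answer:\n" + draft + "\n\nCritique:\n" + critique
        )
        return revised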
Generative AI Provides New Testing Opportunities
Given the power of large language models and related generative AI, it is time to rethink the whole idea of predictive validity. Standardized tests try to predict how well a person will do in a real-life situation from their performance on a number of small test items. To keep tests to reasonable length, those items tend to be answerable in a minute or so — or even less — so there can be enough items to make up for any accidents of experience or performance that might reduce the reliability of the test. Because the current technology was developed in the age of paper and pencil testing and even now is delivered either that way or on a relatively small screen, the items tend to be “one shot” in nature. The test asks a question, and the testee quickly answers it.
Real-time dialogical standardized tests. However, ChatGPT and similar tools have dialogical capability. They can follow up earlier interactions. Currently, we use that to test ChatGPT. We ask it questions and then follow up to find out more about what it really knows. An exciting possible future for large language models like ChatGPT will be to do standardized testing by having the models engage in dialogues with the testee rather than getting short responses to quick questions. In addition, it might be possible to use generative AI tools to interact with ongoing testee performances and score them.
Until the psychometric era, academic performance characteristically was tested in two ways. Performance products, such as theses and dissertations, were evaluated by committees of experts, and oral examinations by experts also were used. Both approaches remain to this day.
However, because such evaluations demand considerable time and effort from both the examiners and the testee, they tend to occur mainly as capstone evaluations at the end of a lengthy period of training or study.[5] Generative AI, to the extent that it can do the evaluating, might open the door to shorter and more frequent use of both products and dialogues as testing devices. One could imagine a student having a dialogue with an intelligent tutor/tester daily or weekly. An intelligent system might be able to ask questions, evaluate answers, and then probe more deeply to establish whether a student has mastered particular curriculum elements.
Some tiny bits of such an approach are already in use. For example, a company called Amira Learning[6] already offers a tool for coaching early reading in young children. It is based on work done starting two decades ago by Jack Mostow.[7] The initial offering is based on a simple idea. If the computer already knows what text a child is trying to read aloud, it can process that child’s efforts and note what words or even symbol-sound correspondences are still a problem and then both provide coaching and give practice that addresses those challenges.
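The core mechanism can be illustrated with a toy alignment of the expected text against what a speech recognizer heard. This is only a schematic sketch of the idea, not Amira's or Project LISTEN's actual method, and the example words are invented.

    import difflib

    def reading_miscues(expected_text: str, heard_text: str):
        # Because the expected text is known in advance, align it with the speech
        # recognizer's output and report which expected words were skipped or misread.
        expected = expected_text.lower().split()
        heard = heard_text.lower().split()
        matcher = difflib.SequenceMatcher(a=expected, b=heard)
        problems = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                problems.append({"expected": expected[i1:i2], "heard": heard[j1:j2]})
        return problems

    # e.g., the child skips the word "quick" while reading the sentence aloud:
    print(reading_miscues("the quick brown fox", "the brown fox"))

A real system would of course work at the level of phonemes and symbol-sound correspondences rather than whole words, and would feed the detected problems into coaching and practice, as the paragraph above describes.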
This approach can be generalized substantially to include dialogues around all sorts of knowledge and performance. And, it might involve artifacts as well. One could readily imagine, for example, a dialogue between an intelligent system backed by a large language model that showed medical data like lab tests and radiological images to a medical student and engaged in a dialogue about what might be wrong with a real or simulated patient. Such an approach would require speech-based interaction, which already is quite doable, a large language model to assure that variations in the exact language the student uses can be handled, a means of evaluating the quality of the student’s diagnosis of the patient, and a means of learning to predict how performance variations on one diagnosis task might predict what further tutoring is needed and how competent the student is becoming.
As just mentioned, the first two requirements, spoken dialogue capability and large language models, already exist and are improving quickly. The other two needs, scoring capability and the ability to predict future performances and to select optimal next learning opportunities, are coming along nicely, too.
One example of efficient production of scoring capability is expert policy capture.[8] The basic idea is that trainees are given a set of tasks that are the target of a training program. They do these tasks, and a transcript is developed of their performance. Experts in the domain are asked to rank the various performances according to how adequate they are. Then, they are asked to justify the relative rankings of different performances. The justifications are used to develop scoring rules that can be applied to performance records, with each rule assessing some number of points, depending on the severity of the imperfection in performance. The rules also signal what a coach needs to address with a trainee, so they can become the initial basis for extending the assessment dialogue to include coaching to improve performance. This approach has been used successfully in intelligent tutoring systems for a few decades,[9] and it is becoming much more feasible to implement when the interaction is supported by large language models and automated speech processing and generation.
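As one concrete illustration of the last step, captured expert justifications might be encoded as scoring rules and applied to a performance record as sketched below. The rules and record fields are hypothetical, invented for illustration rather than taken from the cited work.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class ScoringRule:
        # One rule derived from experts' justifications of their relative rankings.
        description: str                      # what imperfection the rule looks for
        penalty: int                          # points assessed when the rule fires
        triggered_by: Callable[[dict], bool]  # test applied to a performance record

    def score_performance(record: dict, rules: List[ScoringRule], max_score: int = 100):
        # Apply the rule set to one trainee's performance record; the rules that fire
        # also tell a coach what to address next.
        fired = [r for r in rules if r.triggered_by(record)]
        score = max_score - sum(r.penalty for r in fired)
        coaching_targets = [r.description for r in fired]
        return max(score, 0), coaching_targets

    # Hypothetical rules for a simulated troubleshooting task:
    rules = [
        ScoringRule("Skipped the required safety check", 20,
                    lambda rec: not rec.get("safety_check_done", False)),
        ScoringRule("Replaced parts before running the diagnostic test", 10,
                    lambda rec: rec.get("parts_replaced_before_diagnosis", 0) > 0),
    ]
    # Prints the numeric score and the list of issues a coach should address.
    print(score_performance({"safety_check_done": True,
                             "parts_replaced_before_diagnosis": 2}, rules))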
Knowing what a student or trainee does not yet know, which the expert policy capture approach supports, is a step in the right direction, but it is quickly becoming possible to do a bit better. With deep learning and related approaches, it should be possible to learn the best next issues to coach based upon what scoring rules were triggered by the student/trainee’s most recent performances. This then would enable both advising to teachers based upon machine dialogues with the student and direct coaching of the student by the machine. It is likely that different learning-promoting organizations will split the work between humans and machines in different ways, but regardless of how much teaching such systems do, they certainly will be extremely useful in making it easier and more successful to help students learn.
Performance tests. As already has been hinted in the above discussion, the future of machine-assisted assessment need not be limited to dialogical assessment schemes. However, it turns out that many of the capabilities we want people to acquire are, in a sense, dialogical, whether verbal or not. A competent person can interact with the environment in positive ways. Sometimes, that interaction is a verbal dialogue, while other times it involves a person carrying out nonverbal actions, noting how the environment responds, and then acting further based upon the response. This is one reason ChatGPT creating the list of steps in Table 2 does not mean that it knows how to turn the plane. Some of the needed competence is sensorimotor and can only be assessed by looking at how the plane, or a high-fidelity simulator, responds to the testee’s actions.
With time, it will become even more possible to describe and represent such environmental responses for many areas of competence. Once that capability exists, then the same approach to assessment proposed above for dialogue can be applied to other interactions as well. Some of this already is possible, and much more will become possible much more quickly than we expect. Hopefully, learning-providing organizations will leverage the new capabilities to become more efficient in assessment and more capable of continual assessment of interactions with students or trainees.
In the US, we spend 6% to 7% of our gross domestic product on education and more on corporate internal efforts that may not be included in that estimate. Given that it often seems like our efforts are incomplete and inadequate, it will be important to leverage the new capabilities afforded by the rapid emergence of the machine capabilities described above: speech processing, large language models, and machine learning capability. We will make mistakes, and we may even fail some populations more than others initially. However, we can keep improving the systems built to support learning, and whatever biases they might have initially are almost certainly less damaging than those that remain in our highly fragmented systems of education today. Exciting improvements in education and training are becoming possible on a scale that can mean a lot, especially to those most threatened by the displacement of some human roles by machines.
Finally, generative AI and related tools offer the possibility of integrating job orientation and predictive licensure testing. In many areas, including law and nursing, passing of licensure tests and initial on-the-job orientation and training are completely separate requirements, often requiring very different preparation. As we become more able to directly assess rather than simply predict competence, those two separate barriers to readiness for full work status can become much more connected — and unified.
[1] https://openai.com/research/gpt-4.
[2] Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Psychology Press.
[3] Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797–811.
[4] See OpenAI’s technical report on GPT-4, available at https://arxiv.org/pdf/2303.08774.pdf.
[5] There remain exceptions, such as the periodic meetings of students with their tutors at Oxford and Cambridge, but the approach taken there is costly and not generalizable to the range of evaluation currently provided by standardized testing.
[6] https://www.amiralearning.com/.
[7] Mostow, J., & Aist, G. (2001). Evaluating tutors that listen: An overview of Project LISTEN. In K. D. Forbus & P. J. Feltovich (Eds.), Smart machines in education: The coming revolution in educational technology (pp. 169–234). The MIT Press.
[8] Pokorny, R. A., Lesgold, A. M., Haynes, J. A., & Holt, L. S. (2021). Expert policy capture for assessing performance in simulation-based training and games. In H. F. O’Neil, E. L. Baker, R. S. Perez, & S. E. Watson (Eds.), Theoretical issues of using simulations and games in educational assessment: Applications in school and workplace contexts. New York, NY: Routledge/Taylor and Francis.
[9] See Pokorny et al. reference in Footnote 8 and the articles it references.
