Intelligence, Complexity, and the Failed “Science” of IQ
A Conceptual Tour of Why Intelligence Testing Fails in Both Scientific Validity and Real-World Utility
UPDATE - September 25, 2019
I have removed the A Short History of IQ section. As with any history people often get caught up in contesting the minute details of historical accounts. This section was only written to provide some historical background. History cannot influence current claims on validity or practical utility and this article has always been a scientific and technical assessment of IQ. Thus to keep its focus there the history section has been removed.
I have removed Medium comments since the majority ranged from non-arguments to racial propaganda, which I won’t support. All genuine arguments will be addressed on Twitter, the original medium of this debate, so bring your arguments there. Twitter is more suited to focused discussion since its brevity restricts comments to the arguments themselves, whereas Medium responses allow for long diatribes that often obscure arguments (if they exist). Please see my Rebuttals Welcome* section at the end on how to construct an argument. If your arguments contain fallacies (e.g. straw men, circular reasoning, etc.) I will label them as such, and if you harass you will get blocked.
The debate over the validity of IQ is nothing new. Every once in a while a fresh report pops up against the background of ongoing intelligence research, either defending or attacking its foundations. Proponents of IQ argue the research behind the intelligence quotient is sound and its utility proven. Opponents argue IQ studies have never rested on good science, exaggerate their claims, and have done far more damage than good.
Recently this debate has arrived on social media by way of Twitter. 2 major voices in this debate are Nassim Nicholas Taleb (author of Incerto) and Claire Lehmann (founding editor of Quillette). Taleb argues that intelligence testing is a pseudoscientific swindle, lacking the mathematical and statistical rigor required of any scientific measure, while Lehmann argues it’s a valid form of research with real-world utility.
With no less fervor than the debates of the past, people choose their sides and argue vehemently. Most arguments are baseless, involving more ad hominem than reasoned articulation, but a few stand out as insightful.
Regardless of which side you’re on, the IQ debate is deeply important. The idea that human beings can be ranked by their mental worth strikes at something fundamental. Taking sides often says more about one’s world view than their knowledge of science. Intelligence has never been well-defined, and it’s for that reason such intense debates continue. Its mystery both invites speculation and precludes obvious answers, opening the door to countless interpretations.
Yet intelligence research finds its way into our institutions and our policies. It impacts real lives, opening or closing doors to opportunity depending on the individual. Passing this debate off as just another online quarrel misses a chance to educate people on why such matters are controversial.
I myself weighed in on the debate with the following set of tweets:
Forcing a statistically convenient model onto something overtly complex is not science. This is caused by “physics-envy” where those studying complex phenomena want to anchor their narrative on simplistic models that are easy to understand (and influence policy with).
Fitting a model to data and discovering a model from data are 2 very different things. Taking something like intelligence, which lacks any cohesive definition, and shows the hallmarks of emergence, cannot be understood by mapping simplistic trend lines to scattered data.
Forcing nature to adhere to a naive, convenient story isn’t science. Neither is making circular predictions that show scoring well on tests makes you better at scoring well on tests.
You can tell which side of the debate I am on. I find the notion of IQ entirely unscientific, but explaining why cannot be done through a medium like Twitter. Spewing tidbits of scientific wisdom only gets you so far, and those tidbits are often lost on those untrained in science.
But one’s lack of training should never exclude them from contributing to a conversation this important. Everyone has the right to voice their opinion, but they also have the obligation to try and understand the issue objectively. Doing so requires grasping a number of fundamental aspects of how science works, and how things change when our subject of interest increases in complexity.
Providing a pedagogical overview of the foundation needed to discuss IQ intelligently is a nontrivial task. But dismissing those who lack the scientific literacy necessary to form an educated opinion about IQ only reinforces the very ignorance many of us wish to fight.
And so I wrote this article to be a conceptual overview helping others understand why intelligence testing doesn’t stand up to the rigors of science. While many will still choose to disagree I encourage readers to approach this article as objectively as possible.
My article is framed under the 2 overarching pillars of all science, explanation and prediction, covered as distinct pieces in parts I and II.
PART I: WHAT ARE WE MEASURING?
Every field of scientific study requires measurement and interpretation. Measurement allows us to capture information from our experiments, while interpretation is how we draw conclusions from the resulting data. In PART I we look at what it means to interpret measurements, and how interpretation not only fuels our chase for why in science but also underscores the scientific validity of our theories.
Measurement and Interpretation in Science
At the heart of science is the scientific method. This is the empirical approach we use to acquire knowledge. It’s how we follow up on the observations we find scientifically interesting so we can create theories regarding how things work. We observe, ask questions, research, form hypotheses, run experiments, collect data and draw conclusions, all in an iterative fashion. Doing this will either produce a model explaining and/or predicting the phenomenon of interest, or produce nothing.
A more succinct way to think about this effort is as an attempt to understand some unseen function driving what we observe. Think of nature as a “black box” that converts inputs to outputs. The outputs (Y) are what we observe, and the inputs (X) are whatever goes into nature’s black box.
Any observation we make in nature (colors in a rainbow, black holes in the center of galaxies, dissolving salts in solutions, tides in the ocean, behavior of birds, etc.) comes about because nature turns a set of available resources into something we observe. Sunlight is dispersed, massive stars collapse, ionic solids dissolve, tides bulge, and birds aggregate. In every case some underlying function is converting inputs into outputs and the purpose of science is to understand how that conversion works.
Both Figure 3 and its black box analogy tell us that measurement is a core part of the scientific method. Measurement is how we gather data during an experiment so we can draw conclusions about what goes on inside the black box. Drawing conclusions is interpretation.
A necessary condition for the scientific method to work is the connection between our measurement and our interpretation.
By “connection” I mean any interpretation of the data we collect must be reasonably aligned to the results of the measurement. Let’s use an example.
I catch a train to work where an old bedsheet hangs on the opposite side of the tracks. Every once in a while the bedsheet moves, which I originally assumed was due to wind, but I’ve also noticed it happening on days with no wind. People are not allowed to cross the tracks so I cannot simply observe the other side of the bedsheet. What is causing the bedsheet to move?
Despite my observation that it moves on non-windy days I am going to propose it’s the wind. I believe wind can occur on only one side of the tracks due to the surrounding geography. Also, I’ve never heard animal noises in the area. Since we can’t see the cause of the bedsheet moving I must come up with a way to test my proposed explanation for the moving bedsheet. My experiment will involve throwing a ball the next time I see the bedsheet move and observing the outcome.
2 things can happen. Either the ball falls to the ground or the ball shoots off in some direction.
I throw the ball and it bounces off as in the 2nd scenario.
Here’s the key. I have to decide what I think that deflection represents. What is it we are measuring? Nature does not give up her secrets directly, so I must interpret what I am seeing.
I decide to interpret the bounce as being caused by an animal, and not the wind. This explanation is reasonable, since if it were wind I would expect the ball to drop down (no hard surface to hit) whereas an animal’s body would presumably deflect the ball.
We are not calling this a “fact”, since I can continue these experiments and will undoubtedly see the ball bound off in different directions. I might even see it fall to the ground on a few attempts. But we would still agree I have provided a reasonable connection between my interpretation and the results of my measurement.
If you think this example is contrived, consider that it is analogous to how the nucleus of the atom was discovered. After a piece of gold foil was bombarded with a beam of alpha particles, the deflections were interpreted as evidence for the modern structure of the atom.
To reiterate, there must exist a connection between measurement and explanation such that we can construct a reasonable story about how the conversion takes place inside nature’s black box.
Measurement as Proxy
As our previous example shows, measurements in science are never direct. We cannot directly toggle the function inside nature’s black box; we can only look at how inputs get converted into outputs and reason about what causes that conversion to happen the way it does.
This means measurements in science are proxies for whatever underlying function is driving the patterns we observe. In our bedsheet example we observed the deflected ball; we didn’t observe whatever was behind the bedsheet.
But what if we were measuring tree growth? Isn’t measuring the height of a tree a direct measurement? But tree growth is an observed phenomenon, involving the addition of living tissue on top of older layers, changes in shape, extending root systems and the transport of carbohydrates, all requiring the complicated process of photosynthesis. Thus even measuring the height of a tree in an attempt to explain tree growth is hardly direct.
Understanding measurement as proxy is important because the “distance” between our measurement and the thing we think we are measuring (the underlying function driving the phenomenon) increases as the system of study becomes more complex. Complexity is a core subject of this article so we need to understand what we mean by this term before diving deeper.
We hear the word “complexity” all the time, but what exactly does it mean? While the term gets used in all kinds of contexts there is a simple yet operative definition. We face complexity whenever we look at systems with a large number of components whose properties are determined by the interaction of those components.
This means that what we observe when looking at complex systems is only seen by virtue of the system’s aggregate behavior, not by its individual components.
Look at the following starling “murmuration”:
Those patterns only exist when thousands of individual starlings interact. We cannot describe these patterns by inspecting the individual behavior of starlings. The same applies to the shimmering defense of giant honeybees. The patterns we observe are not seen in individual bees. Only in their aggregate behavior:
Other examples of complex systems include the climate, the power grid, communication systems, and living cells. Each of these exhibits properties that can only be understood in a non-reductionist fashion: not via inspection of its components but rather by studying “from the top.”
But isn’t it true we can describe the individual interactions of each starling and honeybee? After all, we can outline the flocking rules of starlings in terms of their separation, alignment and cohesion. And we can observe individual giant honeybees flipping their abdomens in the air just prior to the group’s shimmering defense. Yes, but this isn’t the same as describing an unbroken logical chain of cause and effect between individual components and group behavior. To understand this distinction we need to look at the defining characteristics of complexity.
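To make the flocking rules concrete, here is a minimal sketch of a single update step built from separation, alignment and cohesion. Everything in it is an assumption for illustration: the function name `step`, the weights, the radii and the random starting flock are hypothetical and are not taken from any study of real starlings. Notice that the code only ever describes what one bird does relative to its neighbors; nothing in it describes the shape of the flock.

```python
import math
import random

def step(boids, radius=5.0, min_dist=1.0, max_speed=2.0,
         w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """Advance every boid one tick using only its local neighbors.

    Each boid is a tuple (x, y, vx, vy). All parameter values are
    illustrative guesses, not measured quantities.
    """
    updated = []
    for i, (x, y, vx, vy) in enumerate(boids):
        sep_x = sep_y = ali_x = ali_y = coh_x = coh_y = 0.0
        neighbors = 0
        for j, (ox, oy, ovx, ovy) in enumerate(boids):
            if i == j:
                continue
            dist = math.hypot(ox - x, oy - y)
            if dist > radius:
                continue
            neighbors += 1
            # Alignment: accumulate the neighbors' velocities.
            ali_x += ovx
            ali_y += ovy
            # Cohesion: accumulate the neighbors' positions.
            coh_x += ox
            coh_y += oy
            # Separation: push away from neighbors that are too close.
            if dist < min_dist:
                sep_x += x - ox
                sep_y += y - oy
        if neighbors:
            vx += w_ali * (ali_x / neighbors - vx)
            vy += w_ali * (ali_y / neighbors - vy)
            vx += w_coh * (coh_x / neighbors - x)
            vy += w_coh * (coh_y / neighbors - y)
            vx += w_sep * sep_x
            vy += w_sep * sep_y
        # Cap the speed so the toy simulation stays bounded.
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            vx, vy = vx / speed * max_speed, vy / speed * max_speed
        updated.append((x + vx, y + vy, vx, vy))
    return updated

rng = random.Random(0)  # fixed seed so the run is repeatable
flock = [(rng.uniform(0, 20), rng.uniform(0, 20),
          rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(30)]
for _ in range(50):
    flock = step(flock)
```

Running the loop moves every bird, but whatever pattern the flock forms has to be observed in the output. It cannot be read off the 3 rules themselves.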
The Hallmarks of Complexity
Complex systems have distinct hallmarks that differentiate them from simpler systems. 3 core attributes of complexity are nonlinearity, opacity and emergence:
Nonlinearity means the change we observe in the output is not proportional to the change in the input (a small change in a single variable can lead to a massive change in the entire system, or vice versa). Linear systems usually have clear and simple solutions, allowing for obvious predictability, whereas nonlinear systems either have no solutions or require we redefine what we mean by “solution.”
Thus the distinction between linear and nonlinear defines a boundary between strictly knowable and frustratingly elusive. Take as an example 2 versions of an oscillator, the single and double pendulum. On the left we have the simple behavior of a single pendulum and on the right we have the not-so-simple behavior of a double pendulum.
The single pendulum is an example of a simple linear system (for small displacements). The math describing the single pendulum allows us to determine the exact location of the mass at any given time. We can arrive at that math by breaking an equation into pieces, solving each piece, and then combining the partial solutions into one (a technique called superposition). Linear systems have solutions that are the sum of their parts.
Nonlinear systems like the double pendulum exhibit wild, surprising behavior that cannot be described using simple equations. Double pendulums exhibit complicated dynamics that are impervious to reductionist descriptions (we cannot break and recombine math to arrive at its solution). In nonlinear systems the whole is greater than the sum of its parts.
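The superposition property can be checked numerically. The sketch below compares the small-angle (linearized) pendulum with the full pendulum, whose restoring acceleration depends on the sine of the angle. The function names and the value of g / L are my own choices for illustration.

```python
import math

G_OVER_L = 9.81  # g / L for a 1 m pendulum (illustrative value)

def linear_restoring(theta):
    """Small-angle pendulum: acceleration is proportional to the angle."""
    return -G_OVER_L * theta

def pendulum_restoring(theta):
    """Full pendulum: acceleration depends on sin(theta), which is nonlinear."""
    return -G_OVER_L * math.sin(theta)

def superposition_gap(f, a, b):
    """Distance between f(a + b) and f(a) + f(b); zero for a linear f."""
    return abs(f(a + b) - (f(a) + f(b)))

a, b = 0.8, 1.1  # angles in radians, deliberately not small
linear_gap = superposition_gap(linear_restoring, a, b)
nonlinear_gap = superposition_gap(pendulum_restoring, a, b)
print(linear_gap, nonlinear_gap)
```

For the linear version the response to the sum equals the sum of the responses (up to floating-point rounding), which is what lets us solve the problem in pieces. For the full pendulum the gap is large, so the pieces no longer add up.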
What is the relationship between this example and complexity? The double pendulum is an example of a chaotic system, and its nonlinearity appears as soon as we move from the single pendulum to the double. Chaotic systems have few pieces, yet those pieces interact to produce intricate dynamics. While chaotic systems are not the same as complex systems they give us a sense of where the nonlinearity in complex systems comes from.
Like chaotic systems complex systems exhibit sensitive dependence on initial conditions (commonly referred to as the Butterfly Effect). This means small changes to initial conditions can lead to dramatically different outcomes, making complexity extremely hard to model.
While chaotic systems are sometimes referred to as “predictable” the reality is they are only predictable out to very short horizons. The inability to know the initial state of a system with perfect accuracy precludes long-range prediction in chaotic systems.
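The double pendulum takes a fair amount of code to simulate, so this sketch swaps in the logistic map, a standard textbook chaotic system, to show the same sensitivity to initial conditions. The starting value, the perturbation of 1e-10 and the step counts are arbitrary choices of mine.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), which is chaotic for r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by one part in ten billion.
a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)

gaps = [abs(x - y) for x, y in zip(a, b)]
early_gap = max(gaps[:10])   # largest separation over the first 10 values
late_gap = max(gaps[30:])    # largest separation from step 30 onward
print(early_gap, late_gap)
```

Early on the 2 trajectories are practically indistinguishable, which is the short horizon over which prediction works. Later they bear no resemblance to each other, even though the rule generating them is fully known.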
Where complexity differs relative to chaotic systems is in its large number of interacting parts. Those interactions make the system non-deterministic, making prediction exceedingly difficult if not impossible.
Those interactions also dictate the properties and behavior of the system as a whole. Importantly, these properties are independent of its microscopic details. We know this because complex systems are “multiply realizable” (54). For example, many different configurations of the same substance can generate the same temperature.
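Multiple realizability is easy to show with a toy calculation. The sketch below assumes an idealized gas in arbitrary units, where temperature is simply proportional to the mean kinetic energy of the particles. The function name and the lists of speeds are made up for illustration.

```python
import math

def mean_kinetic_energy(speeds, mass=1.0):
    """Average of (1/2) m v^2 over all particles (arbitrary units)."""
    return sum(0.5 * mass * v * v for v in speeds) / len(speeds)

# Two very different microscopic configurations of 4 particles.
config_a = [1.0, 1.0, 1.0, 1.0]   # every particle moves at the same speed
config_b = [0.0, 0.0, 0.0, 2.0]   # one fast particle, three at rest

energy_a = mean_kinetic_energy(config_a)
energy_b = mean_kinetic_energy(config_b)
print(energy_a, energy_b)
```

Both configurations give the same mean kinetic energy, so under this toy assumption they would register the same temperature. Knowing the macroscopic reading tells us nothing about which configuration produced it.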
It’s important to realize that nonlinearity doesn’t make complex systems harder to solve using linear approaches; it makes them impossible to solve using linear approaches.
Opacity relates to our inability to know how individual components lead to complex behavior. Again, this doesn’t mean we cannot describe how components in a complex system interact. It means we cannot construct an unbroken logical chain of cause and effect between individual components and group behavior; we cannot see the story behind how complex behavior arises.
This is a consequence of the fuzziness of nonlinear solutions described above. As the number of interacting components increases we lose the ability to explain observed phenomena in simple reductionist terms. As with nonlinearity, this opacity is fundamental.
Emergence occurs when traits of a system result from its interactions, which are not apparent from its components in isolation.
Take ant colonies as an example. A single ant has limited ability to reason or accomplish complex tasks. But the colony can work in concert to migrate, allocate labor, move cohesively and sense its environment. Ant colonies can even make collective decisions about what to do next.
Individual water molecules are not wet; wetness is an aggregate property that emerges after many water molecules coalesce. Herd behavior only occurs when many people group together. The properties we observe in schools of fish, hurricanes, crystal symmetry, etc. are all byproducts of emergence.
These 3 hallmarks of complexity set limits on how well we can describe change and infer knowledge regarding a complex system. We will see the implications of these limits throughout this article.
These hallmarks do not represent an exhaustive list. Complex systems also exhibit self-organization, openness, feedback and adaptation.
Complexity and the Burden of Interpretation
With an operative definition of complexity we can now look at what happens to measurement and interpretation as the complexity of phenomena increases.
Proxy Distance and Complexity
We already discussed how measurements in science are proxies to some underlying function driving the patterns we observe. But the level of indirectness in our measurements is not the same between simple and complex systems. For example, if we decided to measure the dimensions of a dinosaur fossil we could be fairly confident we were quantifying some aspect of this species’ morphology. This is because we have a strong understanding of morphology in general, across many different ranks in the animal kingdom.
But what about temperature? What does temperature measure? Obviously how hot or cold something is, but what exactly are hot and cold? We can present reasonable arguments to suggest temperature measures the average kinetic energy of particles inside an object. Our arguments will be supported by supplemental theories such as atomic theory and kinetics. But is this as direct as taking a set of calipers to a fossil to measure its shape, structure and size?
What about climate? Climate includes temperature, but also humidity, pressure, wind, and rainfall. This presents a dramatic increase in complexity relative to the amount of energy present in a single object, let alone a static fossil. How might we measure climate? We can obviously determine rainfall, air pressure, wind speed etc. but how direct would any of these measurements be to the phenomenon?
In science, the “distance” between our measurement and the underlying phenomenon we are attempting to measure increases with complexity.
If your system of interest is considered complex you must assume any measurement made in the pursuit of understanding that system is highly indirect. This is a consequence of the complexity hallmarks we looked at above.
Another way to think about proxy distance is in terms of the number of competing explanations for a given phenomenon. The bedsheet example towards the beginning could only have so many explanations. Same with the particle beam experiment. But what if we were looking to explain how ant colonies or herd behavior work? How many explanations could we come up with to explain our observations?
When the complexity of the system we are interested in grows our measurements become increasingly indirect to the phenomenon of interest.
Recall that complex systems are “multiply realizable” meaning many different configurations of the system can produce the same emergent property. Multiple competing explanations are thus fully expected when looking to describe complex systems. Proxy distance increases because our measurement’s tether to nature’s underlying function becomes unstable under complexity.
The Burden of Interpretation
With a larger proxy distance comes a corresponding increase in the due diligence we must apply to any proposed interpretation. This is because with complexity we cannot assume our measure is anchored to the underlying function that drives what we observe. We must work (much) harder to support any proposed explanation.
A critical point to realize is that interpretation cannot come from the measurement itself. The morphology example above was reasonable because we already had a good understanding of the biological situation.
At this point we have discussed the scientific method and the relationship between measurement and interpretation, reviewed the hallmarks of complexity, framed measurement in terms of proxy distance, and seen how the burden of interpretation increases with complexity. Let’s now move on to the topic of intelligence.
Intelligence and Complexity
The Complexity Spectrum
We’ve seen how studied phenomena differ in terms of complexity. We could arrange these phenomena as such, on a “spectrum of complexity.” This wouldn’t be a rigid, well-defined ordering but it would find fairly good agreement among the scientific community.
So, where does intelligence sit on the complexity spectrum? Most would agree that human intelligence would sit at the extreme far end as shown in Figure 14.
In fact, most scientists tend to agree that intelligence, or whatever substrate we assign to its origin (e.g. the brain), is THE most complex phenomenon we know of, period. While this sounds like a bold statement it actually makes perfect sense. As discussed earlier, complexity occurs in systems with a large number of components that interact, and whose properties are dictated primarily by those interactions. The human brain is composed of ~100 billion neurons, which interact in highly complex patterns (23). Its position on the complexity spectrum is easily justified.
The brain’s complexity begets its mystery. Like any other complex system intelligence bears the hallmarks of complexity. But the puzzle of intelligence isn’t some insurmountable marvel impervious to scientific inquiry. Like any complex system it is worthy of study. But for intelligence to give up its secrets it must condescend to measurement, and this means it must be quantified.
Complexity Lives in Higher Dimensions
We know complexity occurs in systems with a large number of interacting components. But what does this mean when it comes to modeling complexity? Specifically, how can complexity be quantified into something that lends itself to interpretation and prediction?
To build a model of something is to create a simplified approximation. Models are abstractions that anchor our ideas about how we think the underlying function inside nature’s black box works. But as discussed in the previous section not all phenomena can be described equally. Higher complexity demands more intricate descriptions, and when it comes to modeling this means the use of more dimensions.
Dimensionality can be thought of as the number of axes we need to describe a system. We are all familiar with thinking about systems in terms of 2 and 3 dimensions:
2 and 3 dimensions are easy to visualize. Let me rephrase that. They are possible to visualize. Everyday common sense shows us that we can move left-right, up-down, and forward-back (3 dimensions). This is how we perceive the world. It’s also how we tend to measure things. If I want to know where something is, or how fast it’s moving, or how big it is, I can plot my measurements across 3 axes.
This isn’t just for spatial dimensions. Any system we are interested in can be described by pretending the measurements occupy space. For example, if I wanted to measure the flavor of my cappuccino I could plot its bitterness level, sweetness level, and creaminess level:
My cappuccino could thus be described in “flavor space” as occupying some position dictated by the levels of its measured flavors.
But what if we wanted to study something involving more than 3 dimensions? Say we wanted to model the housing market, looking to explain why some homes are more expensive than others? We immediately know house prices are determined by the age of the home, the location in the city, the number of rooms, and inspection reports. That’s already more than 3 dimensions. This doesn’t take into account interest rates, GDP, and possible government subsidies.
The underlying function that determines the price of a home is more complex than anything we can capture using some low-dimensional description. But if high complexity demands more intricate descriptions, and those descriptions require more dimensionality than we can visualize, how do we keep science in the game when things become complex?
We build models using mathematics because math allows us to understand quantity, structure, space, and change. Pretty useful. While not perfect, math has proven massively beneficial to the description of reality. A critical aspect of math is that it can operate in more than 3 dimensions. This means math provides a window into a reality we otherwise wouldn’t have access to.
One way to use math in higher dimensions is to construct a feature space. A feature space is what we created when we made our “flavor space” for my cappuccino. It’s the “space” defined by the axes used to describe our system.
We can think of a feature space as being constructed from the columns of a dataset. So if we’re interested in car accidents we would gather data on say vehicle speed and number of accidents:
So a good way to think about the dimensionality of any system is the number of columns we must have in our data to represent the problem. If we wanted to consider more attributes we would add more columns. Our feature space for car accidents is wanting; I imagine accidents depend on more than just speed. Let’s add a number for traffic congestion:
We can continue this approach, adding columns to our dataset to try and capture the complexity of the situation. We can no longer draw ourselves the feature space, but rest assured it exists.
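The idea that each column is one axis of the feature space can be written down directly. In this sketch the column names, the helper `add_column` and every number are made-up values for illustration, not real accident data.

```python
# Made-up illustrative values; each tuple is one row of the dataset.
columns = ["vehicle_speed", "accidents"]
rows = [(40, 1), (60, 3), (80, 6)]

def add_column(columns, rows, name, values):
    """Return a new dataset with one more axis in its feature space."""
    if len(values) != len(rows):
        raise ValueError("need one value per row")
    new_columns = columns + [name]
    new_rows = [row + (value,) for row, value in zip(rows, values)]
    return new_columns, new_rows

dims_before = len(columns)  # dimensionality = number of columns
columns3, rows3 = add_column(columns, rows, "traffic_congestion",
                             [0.2, 0.5, 0.9])
dims_after = len(columns3)

# One observation is one point in the feature space.
point = dict(zip(columns3, rows3[0]))
print(dims_before, dims_after, point)
```

Nothing stops us from calling `add_column` again and again. The code handles 50 dimensions as easily as 3, even though we stopped being able to picture the space long before that.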
Since mathematics must operate inside this feature space it means our analysis will be limited by the number of measurements we can make. If we don’t express our problem with enough dimensionality the analytical approach won’t matter. Math cannot tap into higher-dimensional information without having a feature space that is commensurate with the complexity of the problem.
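A small constructed case shows what a missing dimension costs. Here I assume a hypothetical system whose outcome depends on the interaction of 2 binary inputs (an exclusive-or), and ask how well any rule could do when it is only given some of the columns. The function name `best_accuracy` and the system itself are illustrative assumptions.

```python
from itertools import product

# Hypothetical system: the outcome is 1 only when the two inputs differ.
# Each row is (x1, x2, outcome).
observations = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def best_accuracy(feature_indices):
    """Best accuracy achievable by ANY rule that sees only these columns.

    Rows that look identical on the visible columns must receive the same
    prediction, so the best a rule can do is guess the majority outcome
    within each group of look-alike rows.
    """
    groups = {}
    for row in observations:
        key = tuple(row[i] for i in feature_indices)
        groups.setdefault(key, []).append(row[2])
    correct = sum(max(ys.count(0), ys.count(1)) for ys in groups.values())
    return correct / len(observations)

one_dim = best_accuracy([0])       # only x1 was recorded
two_dim = best_accuracy([0, 1])    # both x1 and x2 were recorded
print(one_dim, two_dim)
```

With only 1 of the 2 columns, no method of any sophistication can beat a coin flip on this system, because the information simply is not in the feature space. With both columns the outcome is fully determined.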
This now raises the obvious question. How many columns of data would you need to construct a feature space to represent human intelligence?
Keep in mind the complexity spectrum in Figure 14. Human intelligence is at the extreme far end. To even come close to capturing something this complex would seem to demand a near-infinite number of dimensions. Otherwise any mathematical approach used to model this phenomenon wouldn’t have access to the dimensionality needed to create its approximation.
But hold on. Does this make modeling intelligence impossible? Surely we must be able to build some kind of model, even if it makes drastic approximations, and has limited access to the dimensionality of the problem. Is it true math cannot tap into any higher-dimensional information unless it has access to an exceedingly rich feature space? Haven’t I seen simple linear models of certain human behavior? And what about AI researchers? They are obviously tapping into some kind of intelligence mimicry with their models.
This brings us to a critical point in understanding how models are created in science.
Forcing versus Finding the Narrative
When it comes to anything in life we can either force a narrative onto reality or we can let that narrative emerge from reality. For example, I can hold opinions about money based on how I was raised, or I can fashion my thoughts from real-world experience in investing. Similarly, I can write a book by adhering to a pre-constructed outline or I can let the outline emerge from ad hoc writing.
This dual route to working out situations appears in science as well, by way of how we choose to model a phenomenon. Figure 21 shows these 2 approaches:
Note the difference. Both are attempting to use data to confirm a model. But they operate in different directions and make very different assumptions. On the left we are starting with a pre-constructed model and attempting to use data to confirm its validity. On the right we are instead starting with data and attempting to predict what we will see next. Any errors in our prediction are used to adjust our model.
The approach on the right is based on trial-and-error and is led by data, not the model. Since our model represents our guess of the underlying function (inside nature’s black box), failed predictions force us to change that function until we produce outputs that align with observations. The approach on the left is led by the model. The objective here is to show that data we observe has been generated by something akin to our notion of the situation.
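The predict, compare, adjust loop on the right can be sketched in a few lines. This is a deliberately tiny toy under my own assumptions: `black_box` stands in for nature's hidden function, the learning rate is arbitrary, and the model is restricted to a straight line, which real learning algorithms would not assume.

```python
def black_box(x):
    """Stand-in for nature's hidden function (unknown to the modeler)."""
    return 2.0 * x + 1.0

# Start from an uninformed guess and let prediction errors reshape it.
w, b = 0.0, 0.0
rate = 0.05  # how strongly each error adjusts the model (arbitrary)

inputs = [0.0, 1.0, 2.0, 3.0] * 500
for x in inputs:
    predicted = w * x + b
    error = black_box(x) - predicted  # failed prediction
    w += rate * error * x             # adjust the model by the error
    b += rate * error

print(w, b)
```

The modeler never looks inside `black_box`. The parameters drift toward values that reproduce its outputs only because each wrong prediction pushes them there.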
At first blush both approaches appear equally valid. As long as we are willing to change the model when data doesn’t confirm its validity we are in line with the scientific method. But there is an unfortunate truth about using data to validate models that can creep into the modeling process: any model can be “validated” with the use of data.
This unrigorous statement may seem surprising, but less so when we think about how biases creep into our daily lives. It’s easy to “confirm” what we believe if we look hard enough. In statistics we call this “torturing the data until they confess” meaning we can repeatedly gather and interpret source data until they align with whatever pre-conceived model we have in possession.
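Here is a small simulation of what torturing the data looks like. It assumes a target made of pure noise and 200 candidate "explanations" that are also pure noise, then keeps whichever candidate correlates best. The sample sizes, the seed and the helper `pearson` are my own choices for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
n_samples, n_candidates = 20, 200

# The target and every candidate predictor are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
candidates = [[rng.gauss(0, 1) for _ in range(n_samples)]
              for _ in range(n_candidates)]

correlations = sorted(abs(pearson(c, target)) for c in candidates)
typical = correlations[len(correlations) // 2]  # median candidate
best = correlations[-1]                         # the one we would report
print(typical, best)
```

None of the candidates has any real relationship to the target, yet the best of 200 looks far stronger than a typical one. Report only the winner and the data appear to have confessed.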
In most areas of science there are mechanisms in place to guard against these kinds of problems. But these mechanisms are much more common in sciences that either explicitly (and properly) control the design of their study, or base new models off of established work that has been extensively reproduced.
But much research relies on observational data that was collected some time in the past. This precludes the possibility of designing the study, and such research rarely has an extensive foundation of reproduced work to stand on (e.g. novel data-driven applications being developed for the first time).
Without the chance to design the study, and with proof of reproducibility yet to come (hopefully), how do those building statistical models based off observational data validate their work? The answer to this question brings us to a deep divide between 2 camps that use data and modeling very differently.
The 2 Cultures
The distinction between the 2 approaches shown in Figure 21 has caused a rift within the scientific community. There are now 2 different cultures of statistical modeling. One culture assumes data we observe from nature’s black box are generated by something akin to their pre-constructed model, while the other uses learning algorithms to arrive at models via prediction / trial-and-error.
If you want to know which camp you or anyone else falls into simply look at the type of models you use in your work. The 2 options are data models and algorithmic models. Figure 22 shows the same picture as Figure 21, except the predictions on the right-side approach are now being done by a learning algorithm.
The statistics community has been committed almost exclusively to data models, while algorithmic modeling developed in fields outside statistics, by computer scientists, physicists, some engineers, and a few aging statisticians (21). Machine learning is the most prominent form of algorithmic modeling today.
Algorithmic models were developed because they can be used on large complex data sets, and usually lead to more accurate models. Data models on the other hand fail to approximate problems with an appreciable level of complexity.
Relying on data models to approximate complex systems leads to irrelevant theory and questionable conclusions. — Leo Breiman
Techniques like linear regression and logistic regression fall into the data modeling camp. Validating these models relies on “goodness of fit” rather than on the prediction of unseen data. If you’ve ever fitted a line inside a scatter plot and found its R-squared value then you’re familiar with the data modeling approach.
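To make the difference between “goodness of fit” and prediction concrete, here is a minimal sketch. It assumes a made-up nonlinear mechanism (the `nature` function, y = x² plus noise) standing in for nature’s black box; the names and numbers are hypothetical and chosen only for illustration. A straight line is fitted to one range of inputs, then scored both on the data it was fitted to and on inputs it has never seen.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """R-squared (1 - SS_res / SS_tot) of a given line on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)

def nature(x):
    # Hypothetical nonlinear mechanism (an assumption for this sketch)
    return x ** 2 + rng.gauss(0, 0.1)

# Data the model is fitted to, and data it never saw (a different input range)
seen_x = [rng.uniform(0, 2) for _ in range(200)]
seen_y = [nature(x) for x in seen_x]
unseen_x = [rng.uniform(2, 4) for _ in range(200)]
unseen_y = [nature(x) for x in unseen_x]

intercept, slope = fit_line(seen_x, seen_y)
fit_r2 = r_squared(seen_x, seen_y, intercept, slope)
unseen_r2 = r_squared(unseen_x, unseen_y, intercept, slope)

print("R-squared on fitted data:", round(fit_r2, 3))
print("R-squared on unseen data:", round(unseen_r2, 3))
```

The line earns a high R-squared on the data it was fitted to even though the mechanism is not linear, and a negative R-squared on the unseen range, meaning it predicts worse than simply guessing the average. Goodness of fit alone would never have revealed this.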
This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. — Leo Breiman
The algorithmic modeling camp instead uses its learning algorithms to adjust parameters and continually change the model until it predicts. These are algorithms like decision trees and neural networks. These algorithms use massive iteration and parameter tweaking to arrive at a function that approximates the one inside nature’s black box.
An important distinction between these 2 cultures is that the algorithmic modeling camp treats the data mechanism as unknown.
One Culture Works with Complexity
We already discussed the hallmarks of complexity and how these engender phenomena with nonlinearity, opacity and emergence. How likely is it a pre-constructed model of a complex system will approximate that system’s behavior, particularly if you’re unwilling to change it?
There are 2 problems with trying to use data models to model the complex:
- it’s easier to torture data with a data model;
- data models are too simple to approximate complex behavior.
Let’s start with the first point. This relates to the susceptibility of data models to false narratives, confirmation bias, and convenience over truth. Extremely simple problems might justify data models because simple problems have much less mystery about how they work. If we already have a good understanding of the situation (remember above) then we can construct a reasonable story about our model’s validity. But as complexity increases this is no longer the case.
The problem with data models is that the conclusions are about the model’s mechanism, not about nature’s mechanism. We can see the circularity here. It is understood in science that if the model is a poor estimate of nature’s mechanism it is to be updated or discarded, but the use of data models encourages the opposite behavior.
The belief in the infallibility of data models [is] almost religious. It is a strange phenomenon-once a model is made, then it becomes truth and the conclusions from it are infallible. — Leo Breiman
With data modeling we can use all kinds of elegant tests of hypotheses, confidence intervals and distributions of the residual sums-of-squares. This will make the model attractive in terms of the mathematics involved, but it gives little consideration to whether the data on hand could have been generated by a linear model.
Thousands of articles have been published simply because they claim “proof” via the infamous 5% significance level. Side note, there is currently a movement to remove the term “statistically significant” from science because of the way it has been abused (55).
“The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties”. — Mosteller and Tukey
Unfortunately there are few published critiques of the uncritical use of data models. It is known that standard tests of goodness-of-fit in regression analysis do not reject linearity until the nonlinearity is extreme. The use of residual analysis to check lack of fit should be confined to data sets with only two or three variables. Remember our discussion on how complexity lives in higher dimensions?
“Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.” — Leo Breiman
The 2nd point from our above list is more fundamental. Data models are far too simple to capture the complexity of highly intricate situations. Just as our feature space needs to capture complexity through its dimensionality, our model must approximate the complexity of our system by itself being complex.
This is precisely why the algorithmic modeling camp was born. If we look at the most complex problems being solved in AI (e.g. facial recognition) these models are highly complex, and bear themselves the hallmarks of complexity; AI models are nonlinear, opaque, and produce outputs one could argue are bordering on “emergent.”
“With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis.” — Leo Breiman
The truth is there is little to no justification for the use of techniques like linear regression to model complex phenomena. If you are using data models to approximate complexity you are doing so out of convenience, not scientific rigor.
One “argument” in favor of data models is that simpler models should be favored because they are more interpretable. For example, linear regression might tell us more about how inputs are related to outputs since linear regression is easy to understand. But choosing a model due to its convenient transparency does absolutely nothing to defend its validity. A model’s purpose is to reflect reality, not provide convenient interpretation.
Occam’s Razor Can’t Guide You Through Complexity
The history of science is based largely on reductionism. To say the scientific enterprise has benefited from thinking in terms of elegance and simplicity would be a massive understatement. This has placed the famous principle known as Occam’s razor front and center as a guide to doing research. Occam’s razor is formally defined as “entities should not be multiplied without necessity” and more colloquially as “the simplest solution is most likely the right one.”
This is a philosophical principle, not a scientific one. It’s useful when the systems we study are simple. But the science of true complexity rarely benefits from simple descriptions. Importantly, we know the hallmarks of complexity don’t allow complex systems to be described in a purely reductionist fashion.
Simpler models lack the expressive richness required to approximate intricate systems, and suffer from the data model issues outlined above. While the search for simple solutions is a nice guide for those modeling relatively straightforward systems, those looking to model complex phenomena should assume complex models are more appropriate. History shows us that Occam’s razor quickly dulls under complexity.
The Curse, and Blessing, of Dimensionality
The extra dimensionality used to capture the description of complex systems comes at a cost. Known as “the curse of dimensionality” this refers to the dramatic increase in “volume” that occurs when our feature space is constructed from a large number of dimensions. It’s a curse because data inside our feature space become sparse relative to the volume they live in.
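A rough way to see this sparsity is to chop every axis of the feature space into 10 bins and ask what fraction of the resulting cells contain at least one data point. The sketch below does this for 1,000 uniformly random points; the function name `occupied_fraction`, the bin count, and the point count are all assumptions made for illustration.

```python
import random

def occupied_fraction(n_points, n_dims, bins_per_axis=10, seed=0):
    """Fraction of grid cells that contain at least one random point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Picking a random bin on each axis is equivalent to binning
        # a point drawn uniformly from the unit hypercube.
        cell = tuple(rng.randrange(bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    total_cells = bins_per_axis ** n_dims
    return len(occupied) / total_cells

fractions = {dims: occupied_fraction(1000, dims) for dims in (1, 2, 3, 14)}
for dims, fraction in fractions.items():
    print(dims, "dimensions:", fraction)
```

The same 1,000 points that fill every cell on a line leave the overwhelming majority of cells empty once there are 14 axes, because the number of cells grows as 10 to the power of the number of dimensions while the data stay fixed.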
If data are sparse it makes it more difficult to describe the system. But it’s not correct to say the thinning out of data equates to a loss of information. The sparsity of data is itself valuable information. Take cluster analysis as an example. In clustering we look to find groups inside a high-dimensional feature space. It’s exceedingly difficult for a group to form under sparsity, but if it does form, it can mean the cluster is important.
Take our housing example above. Finding any 2 homes close together in feature space suggests those homes are similar across many dimensions. In other words, they are very similar in potentially important ways. If it were easy for 2 homes to be found in the same space this information would be less valuable.
This means dimensionality is both a curse and a blessing. While high-dimensional spaces make analyses difficult they also provide much richer information when something is found. Complexity is hard, and we shouldn’t expect it to reveal its secrets in low dimensions.
What does IQ Measure?
We now have the background needed to have a proper conversation about whether or not IQ should be considered a valid science. In this final section of PART I we discuss whether or not IQ stands up to measurement, interpretation and complexity as they relate to drawing conclusions from data.
Quantifying Trends in Data
Part of drawing conclusions in science involves observing trends. In our bedsheet experiment towards the beginning we observed balls deflecting, and this trend helped confirm our proposed interpretation. There was a connection between our interpretation and our measurements because our proposed explanation could be used to describe the behavior of deflecting balls.
So how are the measurements used by IQ proponents connected to the underlying function that drives intelligence? Specifically, how do those studying IQ attempt to align their measurements (IQ tests) to their explanation (test scores reflect innate cognitive ability)?
IQ studies rest squarely on correlation, one of the more common ways to quantify trends in data. Correlation is defined as any statistical relationship between two random variables. If A moves when B moves then we say A and B are correlated.
Correlation is quantified using the correlation coefficient. The correlation coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). We can think of the correlation coefficient as measuring the shape of an ellipse after plotting points on a graph.
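For readers who prefer to see the arithmetic, here is a small sketch of the standard (Pearson) correlation coefficient written out by hand. The function name `pearson_r` and the toy lists are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two lists vary together, and how each varies on its own
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

a = [1, 2, 3, 4]
r_together = pearson_r(a, [2, 4, 6, 8])    # B rises whenever A rises
r_opposite = pearson_r(a, [8, 6, 4, 2])    # B falls whenever A rises
r_no_trend = pearson_r(a, [1, -1, -1, 1])  # no linear trend at all

print(r_together, r_opposite, r_no_trend)
```

Note that nothing in this calculation knows or cares why the two lists move together. The coefficient only summarizes the shape of the point cloud.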
As we all know correlation does not imply causation. While trends in data can encourage further analysis correlation alone cannot provide an interpretation to the cause. Ever.
We know this because there are countless examples of things being correlated that are entirely meaningless. We call these spurious correlations. My age correlates with the distance between earth and Halley’s Comet. It also correlates with the increase in tuition across college campuses. Cheese consumption correlates with the number of engineering doctorates awarded, and drownings tend to correlate with the marriage rate in Kentucky (check out the spurious correlations website for more examples).
Not only are spurious correlations common, they represent almost all correlations we can measure. The informational value of any single correlation is somewhere between negligible and non-existent. Despite its lack of informational value correlation is used in all kinds of studies. The reason is that correlation can encourage us to look deeper into a phenomenon. If 2 things are moving together then perhaps it is worthy of increased scrutiny.
The IQ Feature Space
We discussed the construction of feature spaces previously. Collecting columns of data is our chance to try and capture the dimensionality of the problem, and construct a space that can be analyzed.
So what are the dimensions for IQ? We can look at the most popular IQ test as an example. The Wechsler Adult Intelligence Scale is the most commonly used “intelligence” test, comprising the following dimensions:
Imagine each of these dimensions administered as an individual test. What would our dataset look like? Figure 26 depicts 14 intelligence tests for 5 individuals. We can envision many more rows representing additional testees.
In this case our IQ feature space would be constructed out of 14 measurements.
We already know we cannot visualize this feature space but we also know we can treat it mathematically.
Higher Dimensions and Correlation
With a high-dimensional feature space in hand, and the mathematical rules of correlation at their disposal, researchers studying IQ can attempt to find correlation inside the 14-dimensional space created via their IQ test measurements.
This is where factor analysis comes in. Factor analysis is a mathematical technique for reducing a high-dimensional system into fewer dimensions. The primary reason for using dimensionality reduction techniques is to make difficult computations more tractable. When there are too many dimensions (columns in our dataset) it can demand more computational resources than what we have available. Fewer columns mean less work.
The idea behind factor analysis is simple: attempt to find the directions inside a feature space where things change the most. In other words, find where data points are exhibiting the most variation. This makes intuitive sense. If we wanted to discover data trends inside a high-dimensional space we should use math to discover where in that space data are trending.
Recall from Figure 24 that correlation captures the variation seen in data (the ellipse around data points). We can represent this variation with an arrow positioned in the same direction as the ellipse.
But this ellipse only shows variation in one direction. We can draw another arrow that cuts through our ellipse at 90 degrees to the first arrow:
This captures even more variation in our data. Factor analysis provides the mathematical utility to “draw” arrows (factors) inside high-dimensional spaces finding where variation exists. Here is my attempt to convey this visually using a “hypercube” to depict a high-dimensional IQ feature space:
The “longest” arrow captures the most variation, with all other arrows ranked accordingly. The arrow that captures the most variation is called the first principal component.
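Here is a minimal sketch of how the “longest arrow” can be found numerically. To keep it short it uses principal component analysis computed by power iteration, a close relative of factor analysis rather than Spearman’s exact procedure. The 14 columns of scores are entirely made up and deliberately built to be positively correlated; no real test data are involved, and all function and variable names are hypothetical.

```python
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a dataset given as a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_principal_component(cov, iterations=500):
    """Power iteration: direction of greatest variation and its variance."""
    d = len(cov)
    v = [1.0 / d ** 0.5] * d  # arbitrary unit-length starting arrow
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j]
                   for i in range(d) for j in range(d))
    return v, variance

rng = random.Random(1)
n_people, n_tests = 300, 14

# Made-up scores: each column mixes one shared ingredient with its own noise,
# so the columns correlate by construction (an assumption of this sketch).
scores = []
for _ in range(n_people):
    shared = rng.gauss(0, 1)
    scores.append([0.7 * shared + rng.gauss(0, 1) for _ in range(n_tests)])

cov = covariance_matrix(scores)
direction, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(n_tests))
share = variance / total_variance

print("share of variation along the longest arrow:", round(share, 3))
```

The arrow is nothing more than a direction in the 14-dimensional space, and the “share” is nothing more than a summary of how strongly the columns correlate. The calculation returns an arrow for any set of correlated columns, whatever those columns happen to measure.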
So what does all this have to do with trying to validate IQ?
She’s a Seductress, and her Name is Reification
We have arrived at the pivotal section for the supposed scientific validation of IQ. When people are defending the notion that IQ can truly measure intelligence this is the foundation of their argument (whether they know it or not).
Remember, PART I is not about predictive utility. We will address the utility issue in PART II. Here we are looking at whether or not IQ tests measure the phenomenon of human intelligence.
As we discussed in A Short History of IQ the psychologist Charles Spearman invented factor analysis in an attempt to reduce a high-dimensional problem down to something simple. I referred to this as dimensionality reduction and said the primary reason for using these techniques was to make difficult computations tractable. But when it comes to how Spearman used his factor analysis this definition isn’t quite fitting.
Spearman knew his factors (those arrows we looked at above) did something more than just help summarize high-dimensional data. They also gave psychologists a way to point at a condensed version of the data. Just as someone might point to an average number in order to summarize information, Spearman could point to his factors. But Spearman didn’t stop there. He decided to “reify” his factors as though they were something real. Specifically, Spearman pointed to his principal component (arrow with the most variation) calling it “g” for “the general factor in human intelligence.”
By using factor analysis to “discover” the correlational structure inside high-dimensional spaces carved out by intelligence tests Spearman had supposedly identified the elusive underlying function that drives intelligence. At last, just as physicists could measure particles and use rigorous descriptive models to describe phenomena, psychologists too had achieved the status of a real science.
But hold on (record scratch). Factor analysis is based on correlation. Didn’t we say correlation cannot be used to show causality? If correlation cannot show causality how can it be used to help connect IQ tests to some proposed interpretation of intelligence?
Is factor analysis somehow superior to the use of correlation in lower dimensions? Nope. Factor analysis is correlation. Whether you’re looking for the correlational structure in low or high dimensions the same lack of causal implications applies.
To reify something is to treat an abstraction (belief or hypothetical construct) as if it were a concrete real event or physical entity. This is a fallacy, and something to be avoided in science at all costs. We have mechanisms in place to avoid reification, all stemming from adherence to the scientific method.
The foremost of those mechanisms is the connection between a proposed explanation and its measurements. We don’t just measure things in science, we must see if measurements align with our interpretations of how nature works. We must offer a reasonable explanation of the phenomenon that could produce the trends we observe in data.
So, does IQ research satisfy the above? Can you see the connection between the researcher’s proposed explanation of intelligence and the correlation they discovered using factor analysis? If you’re trying really hard and feel confused there’s a reason. There is no proposed interpretation.
The “reason” or “cause” for tests correlating is … correlation.
Imagine I showed you a graph of cheese consumption correlating with degrees awarded. You then ask why might this correlation exist, and I tell you I have an explanation. The explanation this correlation exists is because the correlation exists. Satisfied?
The use of factor analysis doesn’t involve any proposed explanation for intelligence, it merely discovers some apparent correlation inside a feature space created using intelligence tests, and points at that correlation as intelligence. An explanation isn’t being connected to a trend in data, the explanation is the trend in data.
It shouldn’t come as a surprise that science tends to frown on circular reasoning. But how could something so blatantly irrational pass the safeguards of science?
Keep in mind I have taken the time to describe the issue pedagogically. You benefit from seeing this analysis stripped of its pretense. To understand the deception more formally we can look at how arguments degrade when muddled with imprecise language. If I tell you whatever is less dense than water will float, because whatever is less dense than water will float, you’ll rightfully give me a dumb look. Suppose I instead said whatever is less dense than water will float, because such objects won’t sink in water. This statement begins to mask its circularity despite being equally fallacious. If I use my conclusion as a stated or unstated premise we should all agree this cannot constitute rational thought, let alone proper science.
Dressing a story in elegant mathematics has a way of masking drastic assumptions and even downright deception. We have seen this recently with the use of elaborate financial instruments that conceal their dangerous assumptions about the market. If we don’t know better, we tend to accept things that appear rigorous, assuming real “experts” will filter the BS.
Psychology has never been coupled to the hard sciences. You won’t find many physicists walking with psychologists down the halls of academia discussing joint research projects. Psychology grew of its own accord. Growing in isolation, with a deep desire to be accepted as a quantified discipline, and a dressed-up mathematical approach that hides a fundamental fallacy in basic reasoning is the unfortunate but true history of psychology.
But Wait, What about Those Genetics Studies?
In the previous section we discussed the inherent circularity of using correlation to “explain” correlation. The problem with IQ isn’t that it rests on shaky ground, it’s that it arguably rests on no ground. But what if there was a way to show a biological basis for intelligence? What if researchers could point to some genetic cause to explain the trends in their IQ data?
We could enter the debate as to whether a “gene” as defined really exists. Some have argued this alone is another example of reifying an abstract construct. But dismantling defective science doesn’t require philosophical debates. We can simply use the same adherence to proper scientific convention we’ve been using throughout this article.
What comes to mind when I say “science has discovered which genes influence IQ?”
There’s a good chance, perhaps without realizing it, you took this to mean science has discovered the genes that influence intelligence. But this assumes IQ measures intelligence, and we’ve already seen how unjustified that notion is. IQ is a test score, whose tether to some underlying driver of cognition is feeble at best (remember the relationship between proxy distance and complexity).
When papers publish results saying they’ve “discovered” some genetic cause for a given trait what they’re saying is they found correlations between one or more genes and that trait. If I were researching the CFTR gene and found that its mutations correlated with instances of cystic fibrosis I might report these findings as an apparent determination of the underlying cause of cystic fibrosis.
We already know the problems with relying solely on correlation. There might be all kinds of reasons the mutated CFTR gene correlates with the disease. Correlations cannot make a theory; they can only encourage further investigation. For this reason, proper study design and extensive supporting evidence must be brought to bear on the claim that correlations between genes and traits mean something.
Remember above when we said any interpretation we apply must come from outside the measurements themselves. There must exist an understanding of the biological situation to argue mutated genes can cause observed trends in disease, and thus be correlated. Otherwise there is no reason to accept the proposed explanation.
Prior to CFTR experiments gene mutation had already been studied. Add to this the reasonable suggestion that a permanent alteration in a DNA sequence would lead to altered information on the other end (gene expression). The explanation for the underlying cause of cystic fibrosis is accepted because that explanation is reasonable, and supported by additional and accepted understandings of the biological situation.
Observing a trend in an established disease is not the same as observing trends in an invented metric that quite possibly points to nothing. Disease has an accepted biological definition. IQ does not. Yes, researchers have found genuine correlations between genes and IQ, but what IQ itself represents is not established science.
This doesn’t even touch on the issue that identifying genetic “locations” as a cause is highly problematic. Most traits are not caused by individual genes. Genetic influences on traits occur in complicated ways involving many interactions. Genes work in combination, and likely involve complex interactions between each other and with the environment.
The use of factor analysis by psychology to reduce the ephemeral notion of intelligence down to something quantifiable is perhaps the most famous case of physics envy. Physics envy occurs when a field outside physics attempts to treat its phenomena the way physics does; by coming up with fundamental models that explain nature in terms of simplifying laws and particles.
It would be great if all fields of science could have such models, but the reason they don’t is obvious. There are way (way!) more variables to consider when we move from physics to chemistry, chemistry to biology, and biology to psychology. In other words, the complexity of phenomena increases dramatically when sciences become “softer.”
An XKCD comic shows this idea well.
In Figure 32 math is considered the most “pure” since the other sciences use math to greater or lesser extents depending on their field of study. XKCD didn’t quite get this right since saying one science is more “pure” than another suggests they are all diluted versions of the same ultimate science. In reality, the demarcation between sciences is due to the increasing complexity of the phenomena each field studies. The drop-off in math as we move from physics to sociology is a direct consequence of this increased complexity.
In short, biologists don’t use less math out of some distaste for numerics and computation. They use less math because its utility in explaining and predicting the complexity seen in living organisms has historically been limited. If biologists were to rest their theories solely on mathematics they would stifle much of their progress in the study of living systems.
This doesn’t mean math hasn’t or won’t continue to add value to the field of biology. It merely shows us that “simpler” physical systems are more readily modeled using “purely” mathematical descriptions.
The issue is actually more nuanced than the previous statement. It’s less about math losing utility as complexity increases and more about using the right math. As we saw earlier, math used to explain simple linear systems doesn’t work to describe chaotic ones. That first simple jump from simple to chaotic demands we redefine what “solution” even means.
We can see this redefining of “solution” in the first step after physics in Figure 32. While physicists can find exact solutions to describe the hydrogen atom, no such solution has ever been found for the helium atom (only one additional electron).
The simplifying laws of physics already begin to degrade the moment one additional electron is added to the simplest atom. Let’s continue moving left on Figure 32. Quantum chemistry is still deeply mathematical but it is tasked with modeling entire molecules and their interactions. The math used to describe these systems must leverage approximations and heuristics to reach solutions. Chemists cannot rely on “pure” physics-style solutions because they don’t work. It would be great if they could. It might even be tempting to try. But it fails.
Continuing our trek through the sciences we can look to biology. It’s one thing to find “solutions” to interacting molecules, but what about to life itself? How many more variables are we adding to the system when moving from chemistry to biology? Would you expect the optimization techniques used on molecules to work well on dynamical diseases? How about using those models to describe anxiety as a topic in psychology? Can you see the inherent problem with physics envy? More variables demand different methods.
But what if someone found a solution from physics that could be applied to psychology? After all, doesn’t science benefit from the cross-fertilization of ideas? Absolutely. But finding useful links between disparate areas of study cannot circumvent the burden of interpretation. If you choose to study the complex you must contend with the corresponding increase in proxy distance.
Physics envy causes softer sciences to choose a model that is too simple. Simple models are clean, easier to describe, and often convenient. Their concreteness adds a sense of authority to a practitioner’s analysis. But that concreteness is misplaced when the system of study is too complex to be described in simple terms.
We began PART I by asking What are We Measuring? We discussed the role of measurement and interpretation, defined complexity, and considered the burden of interpretation. We saw where intelligence sat on the complexity spectrum, and the high-dimensionality required to describe complex systems. We also reviewed the 2 cultures of modeling and highlighted the problems associated with using simple approaches to approximate the complex. Finally, we looked at how reification and circular reasoning were and are used to defend the notion that IQ truly measures human intelligence.
In this final section of PART I we summarize the transgressions against science precipitated by IQ.
Circular Reasoning instead of a Proposed Explanation
Science depends on the ability to refute any proposed explanation of a phenomenon. If our conclusions appear in our stated or unstated premises then there is nothing to refute. Using correlation alone, regardless of how you choose to use it, cannot stand on its own merits to defend a causal explanation. Any “argument” using the correlations found in IQ studies as “evidence” of something other than correlation is by definition circular.
Reifying a Mathematical Abstraction as Something “Real”
Mathematical abstractions cannot be treated as real things unless there are considerable evidential reasons to do so. IQ departs from science at a fundamental level because it “identifies” intelligence by pointing to mere correlation found across IQ tests. Dressing up correlation in high-dimensional analysis is still correlation. It cannot be used to distill an idea into existence.
Not Adhering to the Burden of Interpretation
Recall our discussion on proxy distance and complexity. The burden of interpretation increases dramatically when we attempt to model intricate systems with an enormous number of possible explanations. In addition, the hallmarks of complexity (nonlinearity, opacity, emergence) preclude trivial solutions. Proponents of IQ withdraw from this responsibility by investing all their efforts in simplistic, convenient models that have no justifiable reason to be used on complex phenomena.
Analyzing the Complex using Deficient Feature Spaces
Complexity is captured through dimensionality. Recall Figure 20. How many columns would we need in our dataset to describe human intelligence? The factor analysis used to express intelligence as a unitary “thing” loses information the moment it strikes its components through the IQ feature space. Whatever feature space exists prior to dimensionality reduction is expected to be a rich representation of our phenomena, since our analyses cannot create information out of nothing.
Making a Ruler out of the Same Stuff You’re Measuring
When we make measurements in science we must be careful to use a measuring instrument that is divorced from the phenomenon we are measuring. Failing to do this would make it impossible to decouple the measurement from our subject of study. IQ tests are constructed to match the cognitive processes we believe occur in our scholastic and professional settings. Finding correlations between IQ tests is entirely unsurprising, and a properly trained scientist would be expected to take this as scientifically meaningless.
Generate 2 lists of numbers randomly and see if they correlate. No? Do it again. Again. Eventually those lists will correlate. This little trick manifests itself in more insidious ways throughout science, usually in the form of “p-hacking.” We discussed in the Forcing versus Finding the Narrative section an unfortunate truth about using data to validate models. If we torture data long enough it will confess. If we are unwilling to change our model in the face of contrary evidence the data will eventually agree with our model.
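The “do it again” trick is easy to try for yourself. The sketch below keeps drawing 2 unrelated lists of 10 random numbers until their correlation clears the cutoff conventionally treated as significant at the 5% level for 10 pairs (roughly 0.632). The seed, list length, and attempt cap are arbitrary assumptions, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

rng = random.Random(42)
threshold = 0.632    # approximate two-sided 5% cutoff for |r| with 10 pairs
max_attempts = 10_000  # safety cap so the loop always terminates

attempts = 0
r = 0.0
while abs(r) <= threshold and attempts < max_attempts:
    # Two lists with no relationship whatsoever
    xs = [rng.random() for _ in range(10)]
    ys = [rng.random() for _ in range(10)]
    r = pearson_r(xs, ys)
    attempts += 1

print("attempts needed:", attempts)
print("'significant' correlation found:", round(r, 3))
```

Since about 1 in 20 unrelated pairs clears this cutoff by chance, persistence alone is enough to produce a “finding.” Only the final, successful attempt would ever be reported.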
This same problem can occur in high-dimensional spaces when our “model” is a simple correlation cutting through feature space. We saw how factor analysis can resolve variation inside high-dimensional spaces using components. But factor analysis does not guarantee the best cuts. As depicted in Figure 33 we can choose any arrangement of components:
This means there is more than one way to skin a high-dimensional cat. The arrangement we choose can drastically alter the way we interpret what we are seeing. For example, methods have been known since the 1930s that enable rotating the axes of our feature space. It can be shown that these rotations can easily make the “general factor” of intelligence disappear altogether, with no loss of information. The orientation that does resolve “g” holds no privileged status out of the many possible ways to position axes.
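The rotation argument can be checked with a few lines of arithmetic. The sketch below uses a made-up table of loadings for 4 tests on 2 factors, where the first column plays the role of a “general factor” loading equally on every test. Rotating the axes by 45 degrees produces a different table of loadings that reproduces exactly the same correlations between tests. The numbers are hypothetical and chosen only to make the effect easy to see.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical loadings: column 0 is a "general factor" (0.6 on every test),
# column 1 contrasts tests 1-2 with tests 3-4.
loadings = [[0.6, 0.4],
            [0.6, 0.4],
            [0.6, -0.4],
            [0.6, -0.4]]

# Rotate the 2 factor axes by 45 degrees
theta = math.pi / 4
rotation = [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]
rotated = matmul(loadings, rotation)

# The shared (common) part of the correlations implied by each table
implied_before = matmul(loadings, transpose(loadings))
implied_after = matmul(rotated, transpose(rotated))
max_diff = max(abs(implied_before[i][j] - implied_after[i][j])
               for i in range(4) for j in range(4))

for row in rotated:
    print([round(x, 3) for x in row])
print("largest change in implied correlations:", max_diff)
```

After rotation no column loads equally on all 4 tests: one factor loads mainly on tests 3 and 4, the other mainly on tests 1 and 2. Both tables describe the data equally well, so the data alone cannot tell us which set of axes, if any, is “real.”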
Proposing Interventions Under Complexity
Psychologists propose changes to policy. This represents an intervention into a complex system. The burden of interpretation should always be taken into account whenever interventions are proposed. This makes intuitive sense. If the proxy distance between my measurements and the underlying function is large I should be exceedingly wary of turning my findings into a recommended policy or promoted piece of technology.
We already discussed how psychologists have been proposing the use of intelligence ranking in our schools, businesses, and various policies since the invention of IQ. Today, we are seeing companies promote technology that permits parents to screen embryos for intelligence during the process of in vitro fertilization (56). This introduces an unnatural filtering mechanism into society based on defective “science.”
If we are to enforce anything it should be the scientific training required by those with the ear of government, and the power to bring life-changing products to market.
PART II: WHAT ARE WE PREDICTING?
In Part I we looked at whether IQ truly measures human intelligence. That question relates to the scientific validity of IQ, relying on the ability to form a reasonable explanation of observed data. But providing explanations is only half of science. The other half relates to prediction. Prediction has the final say when it comes to a model’s validity because there is no way you can consistently predict something with a completely wrong model. Predicting the future state of some complex system, however poorly, means you are in possession of some non-random approximation of that system.
We can even put science aside and argue prediction alone is enough to justify the use of a tool. The entire AI industry is based off practitioners predicting faces, voices, and a plethora of other things companies find important. You might argue these predictions are too narrow to justify the hype, but that’s irrelevant to this discussion. The point here is that IF prediction is indeed possible we can argue its usefulness justifies the work.
For this reason we need to look at prediction for its own sake. IQ proponents will argue their tests are predictive of future success, specifically educational achievement, occupational level and job performance. If these predictions are indeed possible then they could find genuine utility in today’s society.
In Figure 21 we saw the 2 approaches to building models with data. We can either pre-construct a model and use data to “confirm” it, or we can attempt to predict unseen observations. The latter involves tweaking our model until it predicts, and searching for a different one if it doesn’t. We also discussed how the former approach has fundamental problems when it comes to modeling complexity.
Those using traditional (data modeling) methods typically define “prediction” very differently than those using algorithmic models. As long as a data model “fits” existing data it is deemed predictive. With algorithmic modeling, prediction only occurs when a model can demonstrably guess what comes next, confirmed by predicting the majority of values in a test set the trained model has never seen.
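The two meanings of "prediction" can be contrasted in a few lines. This is a minimal sketch on made-up data: the outcome is pure noise, the split sizes are arbitrary, and `fit_line` and `r_squared` are illustrative helper names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """1 - SSE/SST for the given line on the given data."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


rng = random.Random(1)
# The outcome is pure noise, so there is nothing real to find.
xs = [rng.gauss(0, 1) for _ in range(40)]
ys = [rng.gauss(0, 1) for _ in range(40)]

train_x, test_x = xs[:20], xs[20:]
train_y, test_y = ys[:20], ys[20:]

intercept, slope = fit_line(train_x, train_y)
fit_r2 = r_squared(train_x, train_y, intercept, slope)    # data the model has seen
holdout_r2 = r_squared(test_x, test_y, intercept, slope)  # data it has never seen
print(f"R^2 on seen data: {fit_r2:.3f}, on held-out data: {holdout_r2:.3f}")
```

For a least-squares line with an intercept, the R² on the fitted data can never fall below zero, so some "fit" is always available. The held-out R² has no such floor and can go negative, which is why only the second number says anything about prediction.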
As we discussed earlier, studies involving IQ rest squarely on correlation. Whereas PART I looked at the misuse of correlation in an attempt to validate IQ, here correlation is being used to “predict” important outcomes. In this scenario, the discovery of correlation between IQ and some outcome underlies the statement that “IQ can predict certain aspects of success.”
We already know the problems with choosing convenience over rigor, and how there is little to no justification for using simplistic models under complexity. But to move forward with our discussion we need to give IQ proponents the benefit of the doubt. Let’s assume simple correlation really does predict future success and that the outstanding question is this: how strong are these correlations?
For reasons we will discuss later proponents of IQ rest their prediction arguments primarily on the correlations between IQ and job performance. We will therefore focus our conversation there.
It’s important to realize that the correlations reported in defense of IQ and job performance are not the original correlations found. These are increased by the use of “corrections” using an analysis technique called meta-analysis (3).
Meta-analyses combine the results of multiple studies in order to estimate the “true” value of the overall population. Since individual studies are mere samples of the overall population, combining them into a meta-analysis is done to reduce the effects of errors found in any one study. By combining studies we achieve a pooled (and weighted) estimate of a value, meant to reflect the actual value for the entire population. These pooled correlations are the cornerstone of almost all claims in defense of IQ being useful for predicting future success (10).
Meta-analysis isn’t lacking criticism. It’s been called the “statistical alchemy for the 21st century” (57) and suffers from a range of issues related to publication bias (58). But the majority of criticisms aren’t directed at the meta-analysis technique itself, rather its misuse. For this reason we will assume meta-analysis, when done properly, is a valid approach to estimating some value of the true population.
The majority of original studies (those that get pooled) used in the meta-analyses for IQ are pre-1970s, with hundreds of them reporting original correlations between IQ and job performance around 0.2–0.3 (24). Recall from earlier that correlations range from -1 to +1, with 0 meaning no correlation. As per meta-analysis protocol, those studying IQ interpret these as attenuated due to statistical artifacts such as sampling error, instrument unreliability, and range restriction.
Applying meta-analysis to IQ studies roughly doubles the originally reported correlations, to around 0.5–0.6.
Let’s take a closer look at how meta-analysis works, and how closely IQ studies adhere to its criteria.
When we do statistical studies we rarely have access to the entire population. For example, if I wanted to know the average height of people in the U.S. I would have to take a sample of the entire U.S. populace. It makes intuitive sense that I should expect errors in my sample. There is no reason to believe my sample average would have the exact same value as the true population average. Sampling errors inevitably occur whenever the statistical characteristics of a population are estimated from a subset of that population.
The same holds for studies involving correlation. We should thus expect sample correlations to deviate from the true population correlation by some unknown degree. When researchers use meta-analyses they are trying to recover the true correlation by correcting for these sampling errors.
In addition to pushing individual sample correlations away from the true value, sampling error also inflates the variance around the mean of the correlation estimated in meta-analysis. Think of variance as the uncertainty in the measurement.
Meta-analysis attempts to estimate the variance in the “true” population by correcting the variance in the samples using the sampling error. The intuition is that the total variance across our sample studies comes from whatever original variance existed in the true population plus the variance that came from the sampling error. By subtracting the sampling error variance from the total variance across our sample studies meta-analysis produces an estimate of the true population variance. By filtering out the variance originating from sampling errors the only variance left is expected to be the “true” population variance.
If your head hurts from keeping track of which variance belongs to what don’t worry. The take-home message is that meta-analysis is being used to calculate the “true” correlation of the entire population, by removing the influence sampling errors have on arriving at that value.
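For readers who prefer arithmetic to prose, the variance subtraction can be sketched as follows. This assumes the "bare-bones" estimator commonly associated with the Hunter and Schmidt framework, in which sampling-error variance is approximated by (1 - r̄²)² / (N̄ - 1). The three studies are invented numbers and `bare_bones_meta` is an illustrative name.

```python
def bare_bones_meta(studies):
    """Pool (observed_r, sample_size) pairs and subtract sampling-error variance."""
    total_n = sum(n for _, n in studies)
    mean_n = total_n / len(studies)
    # Sample-size-weighted mean correlation across studies.
    r_bar = sum(r * n for r, n in studies) / total_n
    # Sample-size-weighted variance of the observed correlations.
    var_observed = sum(n * (r - r_bar) ** 2 for r, n in studies) / total_n
    # Variance expected from sampling error alone (bare-bones approximation).
    var_sampling = (1 - r_bar ** 2) ** 2 / (mean_n - 1)
    # Whatever is left over is attributed to the "true" population.
    var_true = max(var_observed - var_sampling, 0.0)
    return {
        "r_bar": r_bar,
        "var_observed": var_observed,
        "var_sampling": var_sampling,
        "var_true": var_true,
        "share_sampling": var_sampling / var_observed,
    }


# Hypothetical primary studies, not real IQ data.
studies = [(0.05, 60), (0.45, 80), (0.25, 60)]
result = bare_bones_meta(studies)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Every step assumes the studies are samples of one and the same population. The subtraction is only as meaningful as that assumption.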
The most widely cited meta-analyses in support of IQ-job performance prediction estimate that 70% of the variance consisted of sampling error variance (10, 11).
So what’s the problem?
As mentioned above, the statistical procedure of meta-analysis has a number of challenges and is only accepted when strict research criteria are met. It’s the meeting of those criteria where IQ studies become highly suspect.
There are 3 core assumptions made in meta-analyses, which IQ studies can hardly be expected to defend:
- all study samples are assumed to come from the same reference population;
- it is assumed that individual primary studies are random samples from this hypothetical population;
- it is assumed the reference population exists.
Meta-analysis of IQ studies involves estimating the true correlation by combining individual studies and using their sampling errors to recover the correlation of the overall population. But if those individual studies don’t come from the same reference population then obviously they cannot be used to recover a value for a population they never came from.
How likely is it that individual IQ studies come from the same population? How many employers are willing to have their employees tested, let alone supervisors willing to rate them? This bias will undoubtedly happen with some jobs more than others (19). For the pooled estimate of some true correlation to be valid the individual studies would either need to represent the entire universe of jobs, or represent one very specific type of job; neither of which can be defended in these studies.
Keep in mind, sampling error does not account for error that comes from doing the sample incorrectly. It accounts for the discrepancy between observed and true values caused by a lack of access to the whole population. If samples are coming from different “true” populations then no amount of meta-analysis can reflect a true value.
This further assumes the primary studies that are combined in meta-analysis are random samples. But we know IQ studies are conducted on an as-available basis, rather than with carefully planned random designs (38).
Finally, a deeper yet more obvious assumption here is there must exist a general population with that true value we are estimating. This assumes there is a single true underlying IQ-job performance correlation in the overall population.
In addition to the above assumptions there is also the issue of when the IQ meta-analysis corrections are done. Meta-analysis involves calculating the estimated true correlation by computing the average of the individual observed study correlations. Corrections for sampling error should be done before this averaging occurs. The criterion is that meta-analyses should be done on fully corrected samples.
The most widely-cited IQ studies do their corrections after the averaging. This is known to introduce inaccuracies such as reduced observed variance and exaggerated sampling error variance (13, 14).
Measurement error is an inherent part of measurement and is due to the unreliability of measuring. Just like sampling error, this causes discrepancies between a sampled value and a true value. These discrepancies grow as the measurement becomes less reliable.
In IQ testing, measurement error can affect both the test and the job performance assessment. Just as sampling error can distort a correlation, measurement error can depress the correlation between IQ and job performance in a given study. This encourages the correction of measurement error in meta-analyses.
The way correlations get corrected across multiple studies is by increasing the observed correlations in proportion to the unreliability of the measure. The more unreliable the measure the bigger the upward correction. As long as reliability is well-established attenuation can be corrected in advance of the meta-analysis.
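The upward correction described here is usually written as the classical correction for attenuation: the observed correlation divided by the square root of the product of the two reliabilities. A minimal sketch, with made-up reliabilities and an illustrative function name:

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation: r / sqrt(r_xx * r_yy)."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: a fairly reliable test and an unreliable supervisor rating.
corrected = disattenuate(0.25, reliability_x=0.8, reliability_y=0.5)
print(f"observed 0.25 -> corrected {corrected:.3f}")
```

The lower the assumed reliability, the larger the boost, so the corrected figure is only as trustworthy as the reliability estimates fed into it.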
So what’s the problem?
Here we have 3 major issues with respect to IQ studies correcting for measurement error:
- reliabilities of the measures used are only sporadically available, and done on different occasions;
- systematic differences are not accounted for in the statistical model used for corrections;
- correcting for measurement error affects the variances of the observed correlation coefficients.
Widely-cited studies supporting IQ-job correlation only gathered reliability information for a subset of studies for which that information was available (15). Reporting all available reliability information should be practiced for proper meta-analysis. Additionally, reliabilities are often estimated on different occasions (25). The problem is we know intra-individual variation can be greater than inter-individual variation in job performance (16). The differences in estimates could easily be due to performance differences rather than measurement error.
“…there are numerous theoretical reasons for urging caution when correcting the magnitude of the correlation coefficients for measurement error” … and “it is of dubious merit in many situations.” — Richard DeShon
There is also the issue of the statistical model used for the corrections. There are 2 types of models that can be used in meta-analysis: fixed-effects models and random-effects models (59). The corrections in IQ studies use a random-effects model, but this may not take into account the systematic errors that arise from the unreliability of supervisor ratings (17).
Systematic error is non-random error, and always affects the results of an experiment in a predictable direction.
Examples of systematic differences among testees are gender, ethnic background, social class and self-confidence. The assumption that these would have no impact on the correlations between IQ and job performance is an extreme one.
It is also known that correcting for measurement error after averaging can exaggerate sampling error variance (3). Recall how sampling error variance is used in meta-analysis corrections. The higher the sampling error variance the lower the estimated variance of the “true” population correlation. Widely-cited IQ studies improperly correct for measurement error after averaging, which artificially reduces the variance of the estimated correlation.
One final note about measurement error. We looked at the concept of proxy distance in PART I. Any measure applied to a complex phenomenon will be weakly tethered to that phenomenon at best. Correcting for measurement error assumes the measurement itself is sound. If the measure is meaningless so are its corrections.
Sample correlations can also vary due to range restriction in the samples. Specifically, correlations are attenuated due to reduced variability in the subset of data relative to the entire population. Figure 39 shows how selecting a smaller range of data lowers the measured correlation coefficient.
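The attenuation from range restriction is easy to simulate. The sketch below assumes a true correlation of 0.5 and a crude selection rule (keep only cases with an above-average score on the first variable). Both are arbitrary choices for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


RHO = 0.5    # assumed population correlation
N = 20_000   # simulated "applicants"
rng = random.Random(42)

xs = [rng.gauss(0, 1) for _ in range(N)]
ys = [RHO * x + math.sqrt(1 - RHO ** 2) * rng.gauss(0, 1) for x in xs]

r_full = pearson(xs, ys)

# Keep only the cases that pass the selection rule (the "hired" group).
selected = [(x, y) for x, y in zip(xs, ys) if x > 0]
r_restricted = pearson([x for x, _ in selected], [y for _, y in selected])

print(f"full range: r = {r_full:.2f}, restricted range: r = {r_restricted:.2f}")
```

Note that the simulation knows the full population, which is exactly what a real study lacks.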
Range restriction arises in IQ studies since job performance ratings can only be provided for those who are actually in the job and have been IQ tested. Ratings are not available for all possible workers, including applicants who did not get the job.
Further restriction occurs since only certain people bother to apply for the job. Those who applied are likely to have certain experience, abilities, levels of confidence, etc. This self-selecting would be another source of deviation between sampled values and true correlations.
The statistical approach to correct for range restriction involves taking the ratio of the observed standard deviation in the restricted sample to that in the unrestricted population. A ratio of 0.5 would roughly double the sample correlation.
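One common way to apply this ratio is the formula usually called Thorndike's Case II. Whether a given IQ meta-analysis used exactly this variant is an assumption here, and the function name is illustrative.

```python
import math


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction.

    sd_ratio is the restricted SD divided by the unrestricted SD (0 < sd_ratio <= 1).
    """
    if not 0 < sd_ratio <= 1:
        raise ValueError("sd_ratio must lie in (0, 1]")
    big_u = 1 / sd_ratio
    return r_restricted * big_u / math.sqrt(1 + r_restricted ** 2 * (big_u ** 2 - 1))


# Effect of an SD ratio of 0.5 on a few hypothetical observed correlations.
for r in (0.10, 0.25, 0.50):
    print(f"observed {r:.2f} -> corrected {correct_range_restriction(r, 0.5):.3f}")
```

Under this formula a ratio of 0.5 comes close to doubling small correlations and boosts larger ones proportionally less. In every case the size of the boost is driven entirely by the assumed unrestricted standard deviation.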
So what’s the problem?
3 major issues relate to correcting for range restriction in IQ studies:
- few primary studies report their range restrictions, and instead rely on extrapolations;
- there’s no way to identify the variance for the appropriate reference population;
- non-normal data with outliers can mean corrections actually decrease rather than increase the correlation.
Few primary studies used in IQ meta-analyses report their range restrictions (3). This is likely because knowing to what degree a sample is restricted would be near-impossible when it comes to intelligence testing. The self-selection problem alluded to above means it’s exceedingly difficult if not impossible to assess whether applicants belong to a random applicant pool. This problem becomes much worse with smaller samples. How small are the sample sizes in IQ studies? The average sample size in a highly-cited study was 68 (27).
Proper corrections for range restriction also depend on accurate estimates of both sample and population variances. We already discussed how ill-defined the hypothetical reference population is for IQ-job performance. The true reference population would be ALL applicants for the job, all of whom would have to be IQ tested.
One of the most-cited methods for correcting range restriction in IQ studies assumes 515 jobs represent the entire universe of job applicants. 515 jobs used to represent the application pool for each and every job (26).
Finally, it has been shown that non-normal data with outliers can actually make corrections decrease rather than increase the correlation (18).
We looked at the corrections used to increase reported IQ-job performance correlations. We did this because this is where the “IQ predicts job performance” argument stems from. While meta-analysis can be criticized for its own lack of scientific rigor, we pushed forward assuming meta-analysis done properly is a legitimate way to estimate a more accurate population correlation. But that legitimacy comes by way of a set of criteria that must be met. We’ve seen how the reported correlations touted by IQ proponents rest on an analysis littered with bad practice and massive, unjustified assumptions.
Are there any recent studies that have tried to address the deficiencies discussed in this section? Sure. 264 newer studies use larger average sample sizes, which lowers sampling error, lowers variance, lowers range restriction, and thus requires less correction. These studies show a diminution in the reported correlations. In short, recent studies find the ability to predict job performance with IQ is falling (31).
Using an Ambiguous Measured Variable
An unstated assumption in all arguments supporting the supposed IQ-job performance correlation is that job performance is well-defined. But what exactly is job performance? If you asked 3 different workplace supervisors to rate the same 3 workers you would get different ideas of what constitutes “performance.” Supervisor ratings are highly subjective, even within the same domain (43). Treating job performance like some unambiguous metric is unjustified.
Add to this the systematic biases known to be present in the assessment of job performance, which we discussed briefly in the previous section. For example, age (44), halo effects (45), height (46), facial attractiveness (47) and unconscious ethnic bias (48) are all known to influence supervisor ratings of work performance.
Imagine we were attempting to predict something like house prices. While our model may require its own complexity, the number being predicted is unambiguous. We all know precisely what a price is. Building models related to human intelligence already comes with a huge burden of interpretation. Predicting something like job performance, which lacks a concrete definition and is demonstrably prone to a number of biases, only introduces further ambiguity into what is already a massively complex issue.
Some researchers have attempted to “predict” more objective measures of job performance using IQ, such as actual sales. These studies show very low correlations (49, 50), and of course suffer no less from the previously discussed problems with intelligence research.
A common claim made by IQ proponents is that IQ-job performance correlations are stronger for more complex jobs, such as attorneys, medical doctors, and pilots (27, 60, 61).
Notwithstanding the issues already discussed regarding poor correlations and ambiguous measures, the claim that jobs with higher complexity show higher correlations is also suspect. The same approach used to support the job complexity claim has been used to reveal correlations as uniform and small across all job complexity categories, with values in the range 0.06–0.07 (3).
In addition to poor correlations between apparent cognitive ability and work performance, it has been shown that job knowledge is more indicative of workplace competence (51). This result isn’t surprising. How many “talent myth” books need to be published before we accept the obvious: deep domain familiarity and deliberate practice trump innate ability?
Further studies suggest as much as 67% of the abilities deemed essential for effective performance are emotional (52). This result held true across ALL categories of jobs, and across ALL types of organizations.
The work relating to 67% has also been criticized, hence the use of the word “suggest.” This section is about the complexity of job performance and how different views exist regarding what makes employees good at their job (multiple proposed explanations). Not agreeing with an emotional definition of intelligence (perfectly fine) doesn’t warrant falling back to a simpler unjustified definition of intelligence. The point is, job performance is complex and should be handled as such.
Job performance is clearly a complex topic. And I use the word complex as I’ve been using it throughout this entire article. Even the most mundane jobs involve working with different personalities, adapting to new situations, and handling a wide variety of tasks. Like any complex system people’s outputs are “multiply realizable.” There are many different ways one can find solutions to the same task. The idea that simplistic correlations are going to accompany something as real-world as job performance, and permit the “prediction” of such, shows the same ignorance of complexity we see in all IQ studies.
The Self-Fulfilling “Predictions” of IQ
We’ve been focusing on the correlation between IQ and job performance since these studies are regularly cited as THE basis for the predictive validity of IQ. But what about correlations related to educational achievement and occupational level? (These correlations usually come in at around 0.5.)
If these seem less interesting it’s because the construction of IQ tests obviously attempts to match the cognitive processes we believe occur in schooling, and by relation occupational level. In other words, the “discovered” correlations between IQ testing and any kind of scholastic-like achievement should surprise no one.
Towards the end of PART I we briefly looked at the problem of making a ruler out of the same stuff you’re measuring. Measurements in science are expected to be divorced from the phenomenon they look to quantify. Looking to predict educational achievement or occupational level using tests constructed with the same cognitive challenges that occur in school is, once again, circular.
“From the very way in which the tests were assembled [such correlation] could hardly be otherwise.” — Thorndike and Hagen
We also know correlations between IQ and school achievement tend to increase with age (33) and that parental encouragement with school learning increases children’s IQ (34, 35). This further suggests IQ tests are not an independent measure of schooling.
The Flynn Effect
Something interesting happens when new IQ test subjects take older IQ tests. In almost every case their average IQ scores are better than people who took the older test when it first came out. In fact average IQ test scores have been increasing since the beginning of IQ. What’s happening here? Are people getting smarter?
For example, British children’s average scores went up 14 IQ points from 1942 to 2008 (39), with similar gains being observed across many other countries. The increasing test performance occurs for every major test, every age range, at every ability level, and in every modern industrialized country.
But some studies show different results. For example, one study focusing on Finland revealed a “Reverse Flynn Effect”, with IQ scores declining over time for those with high IQs (41).
The Flynn Effect (or its reverse) is in need of an explanation. It suggests that either something fundamental is changing with respect to human intelligence (e.g. brain structure?) or something environmental is affecting IQ scores. In 2017, 75 “experts” in the field of intelligence research suggested four key causes of the Flynn Effect: better health, nutrition, education, and rising standards of living (42), so the consensus appears to be on the side of environmental causes.
But what I want to highlight here is not whether something fundamental or environmental is the cause of the Flynn Effect. What I want to stress is how unsurprising the Flynn Effect is given what we know about complexity.
Is wine good for you? Many studies say yes, so sure. But then again a recent global study says no amount of alcohol is good for your health. What about MSG? It’s been associated with weight gain and liver injury, but other studies show no adverse effects. Chocolate? Cow’s Milk? Fruit Juice?
Conflicting nutrition news is nothing new, and I’m here to tell you that isn’t going to change anytime soon. Why is nutrition so hard to pin down? Sure, scientists debate all the time, but generally-accepted theories are commonplace in the core sciences. Why does nutrition advice “flip” around so haphazardly from year to year?
A core message throughout this article has been the increase in proxy distance with complexity. There are large “distances” between a measurement and the thing we think we’re measuring when we choose to study complex phenomena.
Choosing to frame nutrition as a science places an enormous burden of interpretability on the researcher. It also means they require a modeling approach that is commensurate with approximating the complexity observed. As I’ve argued throughout this article, simplistic linear measures fail under complexity because they cannot adequately capture the complexity being studied. This is why algorithmic modeling tackles its most complex challenges using high-dimensional feature spaces and highly intricate, black-box models.
Nutrition researchers use simple, traditional statistical measures that are interpretable. This is the data modeling approach discussed earlier in PART I. But these simple measures are not stable under complexity. Traditional statistics play out inside low-dimensional feature spaces, leveraging simplistic pre-constructed models that are incommensurate with high complexity. The consequence is a measure that is not anchored well to the phenomena being explained.
Just like nutrition, IQ studies will forever flip around haphazardly because the choice of measure (IQ tests) is constructed from simplistic statistical methods that are not fixed to the swirling underlying complexity of the phenomenon. IQ tests are unanchored to the phenomena they purport to explain and predict.
If You’re Surprised by the Flynn Effect you Don’t Understand Complexity
This brings us full circle to my original tweet around complexity and IQ. Forcing a statistically convenient model onto something overtly complex is not science. As long as you use simplistic models to measure the complex you can expect your world to remain unmoored from its underlying truth.
A Thought Experiment
Imagine the history of IQ never happened. Imagine intelligence research was only beginning, and that you knew nothing about the current approaches used to explain and predict human intelligence. I now tell you I am going to introduce you to the researchers who have decided to tackle this problem, and who claim to be able to explain and predict human intelligence.
Given what you now know about science, measurement, complexity, the burden of interpretation, and where human intelligence sits on the complexity spectrum, what would your opinion be about these scientists you are about to meet? Remember, these are real researchers, working and teaching inside today’s universities.
The only reasonable opinion would be that these scientists must be the best in the world. After all, they are tackling THE most complex phenomenon around, bringing with it the maximum burden of interpretation. If their claims about explaining and predicting human intelligence are true they must be exceedingly well-trained in statistics and probability, not to mention insanely rigorous in their approach to the scientific method.
They must be constructing tremendously rich feature spaces using an immense amount of data, and approximating intelligence with highly intricate models. They likely have tight cross-discipline collaborations with fields like Artificial Intelligence, and have an uncanny ability to embrace complexity and its many implications. They must be heralded as scientists capable of navigating their research through the nonlinear, opaque, and emergent world of the most complex phenomenon known. These are the scientists I would want to learn from.
Let’s open the door.
It would be altogether impossible to overstate the sheer weight of disappointment. IQ research is literally diametrically opposed to everything we know about how to handle the modeling of complex phenomena.
But perhaps we are instead witnessing some paradigm-shifting work by a group of yet-to-be-accepted scientists. Perhaps IQ will yet see its light of day, ushered in by pioneers of meticulous cognitive science and an unbridled adherence to the quest of explaining and predicting human intelligence. Perhaps only these misunderstood researchers can see the simple elegance of their prescient work, entirely ahead of its time.
Perhaps. Or perhaps the last 100+ years of intelligence testing is simply the worst case of physics envy in the history of science. Perhaps what started as a simple wish to be accepted snowballed into an insidious misuse of statistics and complete evasion of scientific responsibility. Perhaps IQ is our best example of how not to do science, particularly when attempting to explain and predict something as complex as intelligence.
The truth is IQ represents a perfect violation of the scientific method. Its faults were dressed up in mathematical elegance from the beginning, obscuring its deep inadequacies from those unable to spot its untruths. The litany of problems associated with IQ studies is more than enough to place it into the dustbin of failed ideas.
The Future of Social Science?
While this article has focused on IQ testing, the various disciplines in social science are all concerned with some aspect of society and the relationships among individuals. Whether social scientists realize it or not this places their research at the extreme end of the complexity spectrum.
I anticipate certain subjects in the social sciences will begin to be wrestled away from those adhering to simplistic, traditional statistics. Scientists who understand complexity, coming from areas outside conventional fields of social science, will create better models, offering superior explanatory and predictive power.
If you are part of the next generation looking to do research in social science I highly recommend you learn to question conventional thinking and embrace the approaches used by others studying complexity. Your field has the chance to define itself as a leader of complexity thinking, tackling some of the toughest challenges in science. But that cannot happen by choosing convenience and simplistic interpretability over validity and scientific obligation.
Those who embrace complexity will usher in our next paradigm shift. Those who don’t will be left on the wrong side of history.
— — — — — — — — — — — — — — — — — — — — —
*Under the Condition you Know how to Craft an Argument
I welcome legitimate rebuttals to my article, but I will not respond to faulty reasoning or non-arguments. Keep in mind that a conclusion is not an argument, and that the premises used to support your conclusion may be weaker than you think (are your premises facts or opinions?). Also, check your unstated assumptions. What are you assuming must be true in order for your conclusion to be true?
It’s important to realize that you can be logically consistent while using false premises. This is a source of confusion for many people who think they are making good arguments when they are not. If your premises are faulty, no amount of logical consistency can save your argument.
Check to see if your argument is circular. This is by far the most common fallacy among IQ proponents (a consequence of the inherent circularity in IQ studies).
Finally, I will not engage with those who aren’t even wrong. If your premises can neither be proven correct nor falsified, then there is no possibility of discussion in a rigorous and scientific sense.
If you don’t receive a response from me, you’re likely making one of the above mistakes. Otherwise, I welcome fruitful discussion.