The man who tried to catalog humanity

Luigi Luca Cavalli-Sforza chased Darwin’s dream of a tree of humankind

Luigi Luca Cavalli-Sforza, known simply as “Luca” to generations of human geneticists, died this week at age 96. More than any other human geneticist, Cavalli-Sforza believed in the potential of genes and culture together to trace humanity’s origins. In the course of his work, he pioneered new ideas and models that brought together these two distinct areas of science.

Luigi Luca Cavalli-Sforza at a meeting in 2010. Photo: Luca Giarelli CC-BY-SA 3.0

Like those of most scientists, many of his ideas would turn out to be wrong in the details. But his work helped form the foundation of our current knowledge of human genome variation across the world.

In 1991, Cavalli-Sforza wrote an essay for Scientific American that explained the course of his life’s work to that point. He recollected a time as a young man when he worked in the Cambridge laboratory of Ronald A. Fisher, one of the founders of modern evolutionary theory.

“I started thinking about a project so ambitious it seemed almost crazy: the reconstruction of where human populations originated and the paths by which they spread throughout the world.”

From his start working with microbes, this idea was quite a massive leap. But he chose a lucky moment to enter the field of human genetics. During the 1950s, the nascent field was starving for data on how human variation connected to inheritance. New approaches were about to provide such data, along with new opportunities to understand the evolution of recent populations.

Anthropologists understood human variation by looking at traits like the shape of the skull. Such traits could be examined with complicated math, but geneticists needed simpler systems to start to unlock how human genes might vary. Some of the earliest-known examples of Mendelian inheritance were genetic disorders, and while these were very important, they were also very rare, meaning that they could not be broadly informative about normal human variation.

But a handful of traits, many of them invisible variations like blood types, likewise showed a Mendelian inheritance pattern. Postwar geneticists developed ways to test people for these traits, making it feasible to sample distant populations, at first by typing blood or carrying out simple tests like the ability to taste the bitter chemical phenylthiocarbamide, and later with electrophoresis of proteins in the laboratory.

These variations became known as “classical markers”. All of them obeyed Mendel’s laws of inheritance, making it possible for geneticists to use mathematics to understand how their frequencies might change over time. Geneticists traveled to the four corners of the globe, gradually building maps of the frequencies of blood types and other classical markers. No one really knew how old the blood groups were, or how long ago the differences between human populations might have arisen. But they could see big differences: Some populations had almost no type B blood, for example, while other populations had quite a lot of it. Until the 1980s, classical markers would remain the state-of-the-art evidence of human genetic variation.


Cavalli-Sforza first made his mark in his native Italy, traveling to villages in the Parma Valley to sample blood. He worked to understand how inbreeding within these small towns was connected to the slight differences in frequency of blood groups. With several coworkers, he scoured church records of marriages and births, tracing the times when people moved between villages as well as the number of children they had. Tracing these multiple lines of evidence, he could show that consanguineous marriages, or inbreeding, were the main drivers of genetic differences between these small towns. In doing so, he provided some of the earliest evidence that humans were still being affected by genetic drift, the random change in gene frequencies that happens in small populations.

Cavalli-Sforza realized that if genetic drift could explain the gene frequencies in small Italian towns, it might have affected humanity over a much deeper past. Genetic drift was a force that, over long periods of time, tended to drive populations slowly apart, their gene frequencies inexorably diverging. Applied to a group of populations over long periods of time, genetic drift would form a tree.
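A small simulation can make drift concrete. What follows is only a minimal sketch, not a reconstruction of Cavalli-Sforza’s Parma Valley analysis: it assumes a simple Wright-Fisher style model with no selection and no migration, and the village sizes, number of generations, starting frequency, and function names are all invented for illustration.

```python
import random
import statistics

def drift_trajectory(p0, n_people, generations, rng):
    """Allele frequency over time under random drift alone (Wright-Fisher style)."""
    n_copies = 2 * n_people  # each person carries two copies of the gene
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Every copy in the next generation is drawn at random from the current pool.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

def final_frequencies(n_people, n_villages=20, generations=40, seed=1):
    """Final allele frequency in each of several isolated, equally sized villages."""
    rng = random.Random(seed)
    return [drift_trajectory(0.5, n_people, generations, rng)[-1]
            for _ in range(n_villages)]

# Hypothetical sizes: small villages versus much larger towns.
small = final_frequencies(n_people=50)
large = final_frequencies(n_people=1000)

print("spread among small villages:", round(statistics.pvariance(small), 4))
print("spread among large towns:   ", round(statistics.pvariance(large), 4))
```

Every village starts at the same frequency, yet the small ones end up scattered much more widely than the large ones. That scatter, accumulated over many generations, is the divergence that a tree of populations tries to summarize.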

It was during this period that Cavalli-Sforza began collaborating with the statistical geneticist A. W. F. Edwards, developing ways to reconstruct evolutionary trees from gene frequencies. The statistical methods used measures of distance, computed from the frequencies of several genes across populations, and they generated a new picture of human origins.
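The idea of a distance computed from gene frequencies can be sketched in a few lines. This is an illustration under stated assumptions, not the published method: the distance below is a chord-style measure patterned on the one associated with Cavalli-Sforza and Edwards, with scaling constants omitted, and the populations, loci, and frequencies are invented. Finding the closest pair is only a stand-in for the first step of a tree-building procedure.

```python
import math
from itertools import combinations

# Hypothetical allele frequencies (invented for illustration, not real data).
# Each population maps a locus name to allele frequencies that sum to 1.
freqs = {
    "pop_A": {"locus1": [0.30, 0.10, 0.60], "locus2": [0.40, 0.60]},
    "pop_B": {"locus1": [0.28, 0.12, 0.60], "locus2": [0.45, 0.55]},
    "pop_C": {"locus1": [0.20, 0.30, 0.50], "locus2": [0.90, 0.10]},
}

def chord_distance(pop_x, pop_y):
    """Average chord-style distance across loci (scaling constants omitted)."""
    per_locus = []
    for locus in pop_x:
        # Treat square-rooted frequencies as points on a sphere; this sum is
        # the cosine of the angle between the two populations.
        cos_theta = sum(math.sqrt(x * y)
                        for x, y in zip(pop_x[locus], pop_y[locus]))
        cos_theta = min(1.0, cos_theta)  # guard against floating-point overshoot
        per_locus.append(math.sqrt(2.0 * (1.0 - cos_theta)))
    return sum(per_locus) / len(per_locus)

# Distance for every pair of populations.
distances = {
    (a, b): chord_distance(freqs[a], freqs[b])
    for a, b in combinations(freqs, 2)
}

# The closest pair would be joined first in a simple agglomerative tree.
closest = min(distances, key=distances.get)
for pair, d in distances.items():
    print(pair, round(d, 3))
print("closest pair:", closest)
```

With these made-up numbers, the two populations whose frequencies differ least are paired first, and the third joins them at a greater distance.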

From Cavalli-Sforza 1966, “Population structure and human evolution.”

Here, the branches of humanity came into focus: American Indians, Asians, and Oceanians on one broad branch, Europeans and Africans on the other.

The tree looks very different from our understanding today, which places African populations as the most diverse elements of humanity, not a minor twig. It is worth noting why Cavalli-Sforza’s early trees turned out to be wrong. Blood groups were first discovered and studied in people of European descent, meaning that African variation was not fully included by looking at the traits that vary in Europe. These five loci in particular include several that reflect natural selection, especially the Fy, or Duffy, locus, which approaches fixation in many sub-Saharan populations. Today, using whole genome sequences, it is clear that the deepest branches of human population trees are African.

But more important, the tree illustrates an enormous limitation of the classical markers. The frequencies of a few genes simply do not provide enough information to tell when and how much mixture may have happened among the populations. Cavalli-Sforza, drawing upon his work in the Parma Valley, and later work with Pygmies in central Africa, was willing to assume that migration and mixture were rare. In his model, genetic drift, not gene flow, was the main force driving human evolution. Natural selection happened, too, but with patterns that might be recognized by comparing to the predictions of genetic drift alone.

A tree for visualizing genetic differences was a powerful tool. But Cavalli-Sforza and Edwards went a step further. They used an early computer to take the genetic differences between populations and transform them into principal components, which reflected the common correlations among the gene frequencies.

The first principal component of genetic variation across Europe, from Cavalli-Sforza 1997, Proc. Natl. Acad. Sci. USA.

With this approach, they could not only show how gene frequencies changed on a map; they could now show how the common correlation of many gene frequencies changed. In Cavalli-Sforza’s vision, these maps provided a view of the historical forces that caused people to vary. A gradient across all the classical markers could show the possible pathways of movement and migration in the past. What the maps couldn’t show was how and why those movements had happened.
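To show what a first principal component summarizes, here is a minimal sketch in plain Python. It is not the procedure Cavalli-Sforza and his colleagues ran: the frequency table is invented, the populations are imagined as lying along a single transect, and the leading component is found by simple power iteration on the covariance matrix.

```python
import math

# Hypothetical frequencies of four markers in five populations, ordered along
# an imagined transect from southeast to northwest (invented, not real data).
table = [
    [0.10, 0.80, 0.30, 0.55],
    [0.15, 0.72, 0.35, 0.50],
    [0.22, 0.66, 0.41, 0.52],
    [0.30, 0.58, 0.44, 0.47],
    [0.36, 0.50, 0.52, 0.49],
]

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the leading principal component."""
    n, m = len(rows), len(rows[0])
    # Center each marker on its mean frequency.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance between every pair of markers.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration: repeated multiplication converges on the direction
    # of greatest shared variation (assuming the start is not orthogonal to it).
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each population's score is its position along that direction.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores

loadings, scores = first_principal_component(table)
print("loadings:", [round(x, 2) for x in loadings])
print("scores:  ", [round(s, 3) for s in scores])
```

In this toy table, three of the markers change steadily along the transect, so a single component captures their shared trend and the population scores line up in geographic order. Plotting such scores on a map is what produces a gradient like the one in the figure, though the map by itself says nothing about what caused it.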

“The results of principal components analyses looked very good, but there was nothing to compare them with because the questions they helped to answer had never been asked.”

He set out to find other sources of data that could make the genetic distances meaningful. With the archaeologist A. J. Ammerman, Cavalli-Sforza turned his attention to the Neolithic. This was an epochal archaeological change: the time that agriculture first spread from the Near East into Europe, taking with it pottery and stone implements that were ground and polished rather than chipped and flaked into shape. If events of the past had been powerful enough to sculpt gene frequencies across Europe, it seemed that the Neolithic should have been the strongest of them all.

By the early 1970s, the radiocarbon revolution had taken hold across European archaeological sites. The earliest signs of Neolithic traditions in various regions of Europe, from Greece to Ireland, had been dated with the new method. And they formed a striking pattern: it appeared that the Neolithic had spread slowly, around one kilometer a year, from the southeast to the northwest.

For Cavalli-Sforza, this picture had a clear implication: No migrating horde of farmers had colonized Europe. Instead, farming spread as farming populations gradually increased in size, carrying their new way of life to the next small region or hamlet. This process seemed almost to ignore variations in environment such as forest or hills; it seemed to have been inexorable. It was not the mere diffusion of ideas; instead, it was a diffusion of culture together with genes. It was, as Ammerman and Cavalli-Sforza would name it, a process of demic diffusion.

“All evolutionary processes are basically similar, whichever the objects that evolve.”

During the 1980s and 1990s, demic diffusion became the dominant model of demographic change for the Neolithic. The idea underlay Colin Renfrew’s influential theory that Indo-European languages also spread with the Neolithic into Europe, replacing earlier languages spoken by Mesolithic or earlier peoples. It appeared that the processes of culture change and dispersal could be linked to the growth and expansion of human groups, if only geneticists could fill in the gaps in their data. Cavalli-Sforza worked more and more to understand how cultural and biological changes were linked, establishing a long-lasting collaboration with Marc Feldman to examine how cultures evolve.

We know today from ancient DNA data that the details of this vision of the Neolithic were wrong. The expansion of agriculture was important, but it was not alone. Much later movements of people, some of them quite rapid, transformed the genetic makeup of European populations. Today it appears that Indo-European languages invaded Europe during the Bronze Age, and that early farmers were genetically most like today’s Sardinians, a linguistic isolate that Cavalli-Sforza knew well.

But those facts learned from ancient DNA have come to most geneticists as a surprise. The synthetic view promoted by Cavalli-Sforza was so compelling, linking economic, demographic, and genetic change, that it would take a new data revolution — still underway today — to overturn.


During the 1980s, Cavalli-Sforza worked to synthesize the growing evidence base of human genetic variation around the world. He side-stepped the brouhaha over mitochondrial Eve, pointing out that while the new DNA sequence-based approaches could trace the history of a single gene with great accuracy, they were not yet capable of combining information from many genes. He went to work building trees that could summarize the relationships of entire populations, not just individuals.

In doing so, he struck upon a collaboration with a group of maverick linguists, who believed they could build a universal tree of human languages. The intellectual leader of these linguists was Joseph Greenberg. Greenberg eschewed the approach followed by most historical linguists to reconstruct language relationships. Where most other specialists tried to reconstruct a full history of sound changes and shared grammatical constructs, Greenberg instead took the much simpler approach of comparing common words. In his view, this approach would enable recognition of much deeper, more ancient relationships among languages — most controversially, he suggested that all Native American languages could be grouped into three large families instead of the dozens recognized by other linguists.

Greenberg’s colleague Merritt Ruhlen took this a step further, proposing that every language in the world might be placed into a single tree. This idea resonated with Cavalli-Sforza. If humans had really originated from a small population within the last hundred thousand years, they must all have divided from an ur-population that spoke a single language. Just as a tree of genes could link all populations, so could a tree of languages.

And if Cavalli-Sforza was right about cultural and genetic change, those trees of languages and genes would be the same tree. All that remained was to draw it.

The tree of genetics and languages. From Cavalli-Sforza 1991, Scientific American.

“It was exciting to discover that we had confirmed a conjecture made by no less a pioneer than Charles Darwin…that if the tree of genetic evolution were known, it would enable scholars to predict that of linguistic evolution.”

It was a remarkable vision. People who spoke similar languages did seem to be more genetically similar than those who came from different language families. Of course, there were many exceptions. Some of the apparent similarities between the genetic and language trees might, after all, just reflect the fact that genetically similar people tend to live near each other, and so also are more likely to borrow or adopt common languages. But then some of the differences between the trees also might reflect proximity: For example, the genetic similarity of north and south Indian people despite their different language families could reflect thousands of years of mixing between people with different origins.

In short, language and genes don’t reflect identical histories. Each is influenced by its own evolutionary forces, and language experiences extensive horizontal transfer as people adopt new languages and change their old ones. But the connections are tantalizing. To pursue them, Cavalli-Sforza catalogued gene frequencies, building larger and larger synthetic maps. He wrote popular books and articles on the idea. Meanwhile, traditional linguists continued to dismiss the idea that they could ever discover a “mother tongue” for all humanity, and archaeologists were divided about whether a truly comprehensive view of human migrations was even possible.

With Paolo Menozzi and Alberto Piazza in 1994 he published his magnum opus, The History and Geography of Human Genes. This book — as thick as the Boston phone directory! — included an enormous number of charts and pages of raw data on gene frequencies. It was meant to serve not only as a summary of the state of the field but also as a sourcebook for further research. It was not completely successful in its aims; by the mid-1990s the entire field of human genetics was rapidly moving on from classical markers to microsatellites and DNA sequence data. Meanwhile, the internet made it possible to share data online, meaning that a monograph with tables of gene frequencies was no longer a boon for new research. Nonetheless, that book served as a major statement of one era of human genetics, and most scientists in the field in the mid-1990s saw it as an essential source.


I first met Cavalli-Sforza at a symposium on human evolution that he organized at the Cold Spring Harbor Laboratory in 1997. Then, he was already 75 years old, a formidable white-haired figure who completely dominated his surroundings. The world was changing — at that meeting, I saw for the first time a demonstration of microarray data for genotyping dozens of loci in parallel. But Cavalli-Sforza was still at the center of the field, his lab and collaborators at the forefront of applying new technologies.

At the time, he was still building support for his initiative to sample the genetics of small indigenous populations around the world. The “Human Genome Diversity Project”, conceived at Stanford during the early 1990s, aimed to collect genetic samples from more than a thousand people from small-scale societies around the world. Scientists would use a new technique to transform these samples into immortal cell lines, which would then be distributed to laboratories for research.

Cavalli-Sforza and others saw the HGDP as a necessary corrective for the Human Genome Project, which at the time was spending some 2.7 billion dollars to develop a first draft of the sequences of all 23 pairs of human chromosomes. The Human Genome Project did encompass a sprinkling of variation — with individuals of European, African, and Asian ancestry examined by various laboratories, the resulting genome draft would be a demographic mosaic of the powerful nations where most genetic labs were located. But this scheme left out hundreds of populations around the world who represented the breadth of human diversity. Cavalli-Sforza and other project leaders wanted to sample linguistic isolates, hoping to capture their unique genetic variation before their children or grandchildren merged into the larger populations of nation-states.

Many other scientists saw the HGDP as a boondoggle, or, worse, as a threat to mine the DNA of indigenous people for medically valuable (and potentially patentable) traits. Funders who were initially receptive to the idea decided during the 1990s that they could not support it after all. The National Institutes of Health would move toward a broader sampling of variation, first in the International HapMap and later in the 1000 Genomes Project, but it focused on the largest populations in nation-states around the world, not small-scale societies.

Yet, the HGDP trundled on, with collaborators around the world contributing samples and cell lines. The Centre d’Etude du Polymorphisme Humain-Fondation Jean Dausset (CEPH) in Paris stepped forward to maintain and distribute the cell lines to researchers, and ultimately hundreds of laboratories would apply the data to problems of human origins and diversity.

Tree of population relationships based on HGDP-CEPH panel data, from Li et al., 2008, Science.

As they did so, those laboratories would rely upon Cavalli-Sforza’s basic analytical approaches. They built trees of relationship among the populations, this time correctly placing African populations near the base of the tree, and all non-African populations on a single, fairly recent branch reflecting their common history. They computed principal components from the gene frequency data, yielding a picture of how humans varied across space for hundreds of thousands of genetic markers. The tight correlation between genetic variation and geography would repeatedly be remarked upon, a consequence of genetic drift, selection and migration among past human populations.

Many new approaches to examine DNA variation had emerged through the 1970s, 1980s, and 1990s, making the classical markers a thing of the past. But remarkably, nobody improved much upon the basic analytical approaches introduced by Cavalli-Sforza and Edwards until the advent of genome-scale data in the early 2000s.

The past fifteen years have seen rapid methodological development, a resurgence of population genetic theory, and with it new ways of understanding human evolution. New statistical approaches like STRUCTURE and D4 statistics enabled geneticists to finally progress beyond the binary trees where populations could only branch away from each other, to consider reticulation and links among the branches. Into the last decade, Cavalli-Sforza remained conversant with these new approaches, sometimes pointing out that what might seem to be improvement instead could be a dead end. In a 2010 interview, he noted his admiration of Jonathan Pritchard’s STRUCTURE approach, while thinking ahead toward a vision of leveraging genetic information to improve human health. At the same time, he prickled at upstarts like John Novembre, who had started to pick away at weaknesses of the venerable principal components approach.

I was especially struck with a comment that showed Cavalli-Sforza was thinking about my own work, and that of many others, that had begun to demonstrate the scope of human evolution in the recent past.

“I am especially interested by one issue: It is possible, maybe even likely, that an important fraction of the genetic medical variation that we find in our species arose in the last 10,000 years. Such variation/adaptation may be the result of the intense cultural development that led to agropastoral economies, a lifestyle that introduced major differences in the lives of large sections of humanity in almost every part of the world.”

In my early career, I found myself disliking Cavalli-Sforza much of the time. His pronouncements about human origins overstretched the bounds of the simple statistical models and data that were available. He seemed to brush aside objections, rather than engage with them. He dismissed as meddling politics what seemed to be reasonable objections to the HGDP.

Today I see more clearly how human genetics in the 1990s was at a crossroads, needing badly to include evolution in its models of variation and change, but still unable to grapple with the traits and observations that obsessed anthropologists. Rereading his work, I see how perspicacious Cavalli-Sforza could be, not only about where advances were to be made, but also about where the methods fell short.

Many geneticists today are building upon this foundation — putting together datasets of past and present variation, finding more and more resolved trees, reaching out for connections with archaeology. A few even want to replicate his work at genome-scale, dreaming of an “atlas” of human genome variation.

It is a monumental task, and, as Cavalli-Sforza’s massive book poignantly shows, any such atlas is likely to be obsolete the moment it is published. What matters is not the monumental tome, but the methods, and the ideas of connecting different forms of evidence about the human past, to understand where we came from and how we are connected together. It is a worthy legacy.


More reading:

Cavalli-Sforza was a prolific public writer and popularizer, and several of his books are accessible to laypeople. Now more than 20 years old, most of his books no longer reflect today’s state of the art. But they do give a rich account of the history of human geneticists’ attempts to build the picture of human origins using our genes. I especially recommend Genes, Peoples, and Languages, and The Great Human Diasporas, written with his son Francesco.

An abstract of this work was published by Scientific American in 1991, titled “Genes, Peoples, and Languages,” and is the source of the gene-language tree I’ve excerpted above.

I highly recommend the 2010 interview between Cavalli-Sforza and Franz Manni, published in Human Biology, “Interview with Luigi Luca Cavalli-Sforza: Past Research and Directions for Future”.

A. W. F. Edwards in 2009 published a remarkable remembrance of the origins of his statistical work with Cavalli-Sforza, “Statistical methods for evolutionary trees,” which includes personal stories of his collaboration and encounters with Motoo Kimura.