Alex Szalay: Big data a big step forward for science
There have been three major eras in the history of science, as Johns Hopkins University astrophysicist and computer scientist Alex Szalay describes it. The first, which lasted for millennia, was empirical, involving mostly the recording of data: Chinese star charts, Leonardo da Vinci’s codices on how turbulent water flows, Tycho Brahe’s recording of the motions of planets. The second era, which he calls the theoretical paradigm, launched when Brahe gave his observations to Johannes Kepler, who came up with the laws of planetary motion.
That lasted for centuries, says Szalay, until the mid-20th century, when Manhattan Project scientists used computers to solve fluid dynamics equations, ushering in the computational paradigm.
And now, just a few decades later, the fourth era of science is upon us, Szalay says: it’s the big data paradigm, and he’s playing a major role in shaping it. Szalay is a Bloomberg Distinguished Professor of Physics, Astronomy and Computer Science at Johns Hopkins University, where he serves, among other capacities, as founding director of the Institute for Data Intensive Engineering and Science. He will discuss the history of scientific discovery and how big data is shaking things up at Brain Bar Budapest 2016.
Szalay, born and raised in Debrecen, Hungary, earned his PhD in astrophysics at Budapest’s Eötvös Loránd University, where he returned to become a full professor after postdoctoral research fellowships at the University of California, Berkeley, the University of Chicago and Fermilab. He moved to Johns Hopkins in 1989. By 2001, the astrophysics professor had become so adept in computer science that he took on an appointment in that department.
Szalay’s professional interest in computers centered on big datasets, and his work on the Sloan Digital Sky Survey solidified the astrophysicist’s unlikely reputation as a big-data leader. That reputation, Szalay says, came about by accident.
“In 1992, a couple of us got together with the goal of building a new instrument, running it and making the data available to everyone,” Szalay says. “Nobody wanted to do the database. So I figured I’d jump in.”
He had help: in particular, the database pioneer Jim Gray, by then with Microsoft Research, became a close collaborator.
“I was teaching Jim astronomy and he was teaching me computer science,” says Szalay.
Since then, Szalay has led the development of massive databases for areas far removed from astrophysics — radiation oncology and high-throughput genomics being two examples. The commonalities of the data science across these wildly different disciplines inform Szalay’s contention that we’ve entered a new era.
We now have instruments capable of generating enormous amounts of scientific data: astronomical tools such as the Sloan Digital Sky Survey telescopes and the Hubble Space Telescope; particle colliders such as the Large Hadron Collider; high-speed gene sequencers and so on. We’re talking about databases with hundreds of billions of data points. Finding the needles in such haystacks demands a new synthesis of statistics and computer science that promises to break down the silos between what have been discrete fields — at least in terms of how they make sense of the outputs of increasingly powerful scientific tools.
“The synthesis of statistics, computer science and basic sciences will become the fundamental language used by the next generation of scientists,” Szalay says.
It wouldn’t be the first time specialists expanded their breadth of expertise out of necessity. Two hundred years ago, scientists were poorly versed in mathematics.
“It became clear that the physicists who were formulating solutions couldn’t run to their mathematician friends every day,” Szalay says. “So to figure out how to solve problems, they had to have an internalized view of mathematics. Now we take that for granted.”
With big data, basic scientists (or, more probably, teams of them) will need to combine domain-specific expertise, statistical skills and the computer-science chops to ensure that scientifically relevant algorithms run efficiently on multi-billion-field databases across millions of machines in the cloud. He sees scientific discovery in the future hinging on teams of three to five people — the size of an original hunter-gatherer pack, he notes — working their magic on these massive datasets.
With time, Szalay says, the software will advance to the point that it assists in the discovery process itself. It could well involve narrowing the focus of analysis down to manageable chunks, he says.
“In life sciences, it may be you have 4,000 mice, but you can only sequence 100 of them,” Szalay says. “Which 100 do you sequence? Right now, scientists make these choices based on instinct. I think computers will enter this game and help us do a better job.”
Still, there’s going to be room for the human touch.
“I think the Nobel Prize-winning discoveries will be made by humans,” Szalay says. “But they’ll be able to focus on those discoveries, and not on the mundane massaging of data 90 percent of the time.”