Opinion: The Rise of the Data Physicist

In the search for new physics, a new kind of scientist is bridging the gap between theory and experiment.

By Benjamin Nachman | October 13, 2023

Traditionally, many physicists have divided themselves into two tussling camps: the theorists and the experimentalists. Albert Einstein theorized general relativity, and Arthur Eddington observed it in action as “bending” starlight; Murray Gell-Mann and George Zweig thought up the idea of quarks, and Henry Kendall, Richard Taylor, Jerome Friedman, and their teams detected them.

In particle physics especially, the divide is stark. Consider the Higgs boson, proposed in 1964 and discovered in 2012. Since then, physicists have sought to scrutinize its properties, but theorists and experimentalists don’t share Higgs data directly, and they’ve spent years arguing over what to share and how to format it. (There’s now some consensus, although the going was rough.)

But there’s a missing player in this dichotomy. Who, exactly, is facilitating the flow of data between theory and experiment?

Traditionally, the experimentalists filled this role, running the machines and looking at the data — but in high-energy physics and many other subfields, there’s too much data for this to be feasible. Researchers can’t just eyeball a few events in the accelerator and come to conclusions; at the Large Hadron Collider, for instance, about a billion particle collisions happen per second, which sensors detect, process, and store in vast computing systems. And it’s not just quantity. All this data is outrageously complex, made more so by simulation.

In other words, these experiments produce more data than anyone could possibly analyze with traditional tools. And those tools are imperfect anyway, requiring researchers to boil down many complex events into just a handful of attributes — say, the number of photons at a given energy. A lot of science gets left out.

In response to this conundrum, a growing movement in high-energy physics and other subfields, like nuclear physics and astrophysics, seeks to analyze data in its full complexity — to let the data speak for itself. Experts in this area are using cutting-edge data science tools to decide which data to keep and which to discard, and to sniff out subtle patterns.

Machine learning, in particular, has allowed scientists to do what they couldn’t before. For example, in the hunt for new particles, like those that might comprise dark matter, physicists don’t look for single, impossible events. Instead, they look for events that happen more often than they should. This is a much harder task, requiring data-parsing at herculean scales, and machine learning has given physicists an edge.
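To make "events that happen more often than they should" concrete, here is a minimal sketch of the simplest version of that idea: a single counting experiment, where one asks how surprising an observed event count is if only known background processes contribute. It is an illustration under stated assumptions, not the method used by any particular experiment. The function names and the example counts (100 expected, 130 observed) are hypothetical, and real searches rely on far richer statistical models and on machine learning to decide which events to count in the first place.

```python
import math
from statistics import NormalDist


def poisson_tail_p_value(n_observed: int, expected_background: float) -> float:
    """Return P(N >= n_observed) for N ~ Poisson(expected_background).

    This is the chance that background alone fluctuates up to at least
    the observed count. Computed as 1 minus the lower cumulative sum, so
    it loses precision for extremely small p-values (fine for a sketch).
    """
    if n_observed <= 0:
        return 1.0
    term = math.exp(-expected_background)  # P(N = 0)
    cumulative = term
    for k in range(1, n_observed):
        term *= expected_background / k  # P(N = k) from P(N = k - 1)
        cumulative += term
    return max(0.0, 1.0 - cumulative)


def significance_in_sigma(p_value: float) -> float:
    """Convert a one-sided p-value into the equivalent Gaussian 'sigma'."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    return -NormalDist().inv_cdf(p_value)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    expected_background = 100.0
    n_observed = 130
    p = poisson_tail_p_value(n_observed, expected_background)
    print(f"p-value: {p:.3g}, significance: {significance_in_sigma(p):.2f} sigma")
```

The sketch assumes the expected background is known exactly. In practice that expectation carries its own uncertainty, which is one reason the statistical side of this work demands specialists.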

Nowadays, the experimentalists who manage the control rooms of particle accelerators are seldom the ones developing the tools of machine learning. The former are certainly experts; they run colliders, after all. But in projects of such monumental scale, nobody can do it all, and specialization reigns. After the machines run, the data people step in.

The data people aren’t traditional theorists, and they’re not traditional experimentalists (though many identify as one or the other). But they’re here already, straddling different camps and fields, proving themselves invaluable to physics.

For now, this scrappy group has no clear name. They are data scientists or specialized physicists or statisticians, and they are chronically interdisciplinary. It’s high time we recognize this group as distinct, with its own approaches, training regimens, and skills. (It’s worth noting, too, that data physics is distinct from computational physics. In computational physics, scientists use computing to cope with resource limitations; in data physics, scientists deal with randomness in data, making statistics, or what you might call “phystatistics,” a more vital piece of the equation.)

Naming delivers clout and legitimacy, and it shapes how future physicists are educated and funded. Many fields have fought to earn this recognition, like biological physics, sidelined for decades as an awkward meeting of two unlike sciences — and now a full-fledged and vibrant subfield.

It’s the data wranglers’ turn. I propose that we give these specialists a clear identity: the “data physicists.” Unlike a traditional experimentalist, a data physicist probably won’t have much hands-on experience with instrumentation. They probably won’t spend time soldering together detector parts, a typical experience for experimentalists-in-training. And unlike a theorist, they may not have much experience with first-principles physics calculations, outside of coursework.

But the data physicist does have the core skills to understand and interrogate data — complete with a strong foundation in data science, statistics, and machine learning — as well as the computational and theoretical background to relate this data to underlying physical properties.

The data physicists have their work cut out for them, given the enormous amount of data being churned out by experiments in and beyond high-energy physics. Their efforts will, in turn, improve the development of new experimentation methods, which are today often developed from simpler, synthetic datasets that don’t map perfectly to the real world.

But this data will go underutilized without a skilled cohort of scientists who can deftly handle it with new tools, like machine learning. In this sense, I’m not merely arguing for name recognition. We need to identify and then train the next generation, to tackle the data we have right now.

How? First, we need the right degrees: Universities should develop programs explicitly for data physicists in graduate school. I expect the data physicist to have a strong physics background and extensive training in statistics, data science, and machine learning. Take my own path as a starting point: I studied computational aspects of particle theory as a master’s student and took many courses in statistics as a PhD student, which led to naturally interdisciplinary research between physics and statistics/machine learning — and between theorists and experimentalists.

The right education is a start, but the field also needs tenure-track positions and funding. There are promising signs, including new federal funding to help institutions launch “Artificial Intelligence Institutes” dedicated to advancing this research. But while investments like this fuel interdisciplinary research, they don’t support new faculty — not directly, at least. And if you’re not at one of the big institutions that receive these funds, you’re out of luck.

This is where small-scale funding must step in, including money for individual research groups, rather than for particular experiments. This is easier said than done, because a typical group grant, which a PI uses to fund themselves and a student or postdoc, forces applicants to adhere to the traditional divide: theory or experiment, or hogwash. The same goes for the Department of Energy’s prestigious Early Career Award — there is no box to check for “interdisciplinary data physics.”

As tall an order as this funding is, it could be easier to achieve than a change in attitude. Physicists might well be famous for many of humanity’s greatest discoveries, but they’re also notorious for their exclusionary, if not outright purist, suspicion of interdisciplinary science. Physics that borrows tools and draws inspiration from other fields — from cells in biological physics, say, or from machine learning in data science — is often dressed down as “not real physics.” This is wrong, of course, but it’s also a bad strategy: A great way to lose brilliant physicists is to scoff at them.

Not all are skeptical; far more, in fact, are excited. Within APS, the Topical Group on Data Science (GDS) is growing rapidly and might soon become a Division on Data Science, a reflection of the field’s growing role in physics. My own excitement about working directly with data inspired me to become an “experimentalist” myself, although I realize now how restrictive that label was.

As available data grows, so does our need for data physicists. Let’s start by calling them what they are. But then let’s do the hard work: educating, training, and funding this brilliant new generation.

Benjamin Nachman is a Staff Scientist at Berkeley Lab, where he leads the Machine Learning for Fundamental Physics Group, and a Research Affiliate at the UC Berkeley Institute for Data Science. He is also a Secretary of the APS Topical Group on Data Science.

The author wishes to thank the Editor, Taryn MacKinney, for her work on this article, and David Shih for coining the term 'data physicist' at a recent Particle Physics Community Planning Exercise.

The views expressed in interviews and in opinion pieces, like the Back Page, are not necessarily those of APS. APS News welcomes letters responding to these and other issues.

©1995 - 2024, AMERICAN PHYSICAL SOCIETY
APS encourages the redistribution of the materials included in this newspaper provided that attribution to the source is noted and the materials are not truncated or changed.

Editor: Taryn MacKinney