There really is too much information. For example, let’s say you’re an astronomer searching the cosmos for black holes or a climate scientist modeling the next century of global temperature change. After just a few days of recording observations or running simulations on the most modern equipment, you can end up with millions of gigabytes of data. Some of it contains the things that interest you, but a lot of it doesn’t. It is too much to analyze, too much to even store.
“We’re drowning in data,” says Rafael Hiriart, a computer scientist at the National Radio Astronomy Observatory in New Mexico, which will soon be home to the next-generation Very Large Array radio telescope. (Jodie Foster’s character uses its forerunner, the original Very Large Array, in the film Contact.) When it goes online in a few years, the telescope’s antennas will collect 20 million gigabytes of night-sky observations every month. Handling that much data requires a computer that can perform 100 quadrillion floating point operations per second; only two supercomputers on Earth are that fast.
And it’s not just astronomers who are drowning. “I would argue that almost any scientific field would face it,” says Bill Spotz, a program manager in the U.S. Department of Energy’s Advanced Scientific Computing Research program, which manages many of the country’s supercomputers, including Summit, the world’s second-fastest machine.
From climate modeling to genomics to nuclear physics, ever more precise sensors and more powerful computers are delivering data to scientists at breakneck speeds. In 2018, Summit performed the very first exascale calculation, on a set of poplar tree genomes, computing in an hour what would take an ordinary laptop about 30 years. (An exabyte is a billion gigabytes, enough to hold a video call that lasts more than 200,000 years. An exascale calculation is one that runs at a quintillion, or a billion billion, floating point operations per second.) Supercomputers still in the works, such as Frontier at Oak Ridge National Laboratory, will go even faster and generate even more data.
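The scale comparisons above can be checked with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted in this article; the decimal (SI) unit sizes, the 365.25-day year, and all variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions: decimal (SI) units and a 365.25-day year.

GIGABYTE = 10**9   # bytes
EXABYTE = 10**18   # bytes
HOURS_PER_YEAR = 365.25 * 24
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600

# One exabyte expressed in gigabytes.
gigabytes_per_exabyte = EXABYTE // GIGABYTE

# Speedup implied by "one hour versus about 30 years on a laptop".
implied_speedup = 30 * HOURS_PER_YEAR / 1

# Average data rate of a video call that fills one exabyte in 200,000 years.
call_bytes_per_second = EXABYTE / (200_000 * SECONDS_PER_YEAR)

print(f"{gigabytes_per_exabyte:,} GB per exabyte")
print(f"implied speedup: about {implied_speedup:,.0f}x")
print(f"video call rate: about {call_bytes_per_second / 1000:.0f} kB/s")
```

The implied video-call rate works out to a bit over a megabit per second, which is in the range of an ordinary video stream.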
These vast amounts of data and incredible speeds allow scientists to make progress on all sorts of problems, from developing more efficient engines to probing the relationship between cancer and genetics to studying gravity at the center of the galaxy. But the sheer volume of data can also become unwieldy: big data that’s too big.
For this reason, the Department of Energy convened a (virtual) meeting in January with hundreds of scientists and data experts to discuss what to do with all this data and the even greater flood to come. The DOE has since allocated $13.7 million to research ways to get rid of some of this data without getting rid of the useful parts. In September, it awarded funding to nine such data reduction efforts, including research teams from several national laboratories and universities. “We’re trying to put our arms around exabytes of data,” says Spotz.
“That’s definitely something we need,” says Jackie Chen, a mechanical engineer at Sandia National Laboratories who uses supercomputers to simulate turbulence-chemistry interactions in internal combustion engines, with the aim of creating more efficient engines that burn carbon-neutral fuels. “We have the ability to generate data that gives us unprecedented insights into complex processes, but what do we do with all of this data? And how do you extract meaningful scientific information from this data? And how do you reduce it to a form that can be used by someone who actually develops practical devices like engines?”
Another area that could benefit from better data reduction is bioinformatics. Although it is currently less data-intensive than climate science or particle physics, faster and cheaper DNA sequencing means the deluge of biological data will keep growing, says Cenk Sahinalp, a computational biologist at the National Cancer Institute. “Storage costs are becoming an issue, and analysis costs are a big, big problem,” he says. Data reduction could help with data-intensive omics problems like these. For example, it could make it feasible to sequence and analyze the genomes of thousands of individual tumor cells in order to target and destroy specific groups of cells.
However, reducing data is a particular challenge with scientific problems because the reduction has to be sensitive to the anomalies and outliers that are so often the source of discovery. For example, attempts to explain anomalous observations of the light emitted by hot, black objects eventually led to quantum mechanics. Data reduction that cuts out unexpected or infrequent events and smooths every curve would be unacceptable. “If you’re trying to answer a question that has never been answered before, you may not know what data is useful,” says Spotz. “You don’t want to throw away the interesting part.”
DOE-funded researchers will work on several strategies to solve the problem, including improving compression algorithms so that scientific teams have more control over what is lost in compression; reducing the number of dimensions represented in a data set; building data reduction into the instruments themselves; and developing better ways to trigger instruments to start recording data only when a phenomenon of interest occurs. For example, an astronomer looking for exoplanets may want a telescope to record data only when it senses the slight dimming that occurs when a planet passes in front of a star. All of these strategies will involve machine learning to some extent.
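To make the first of these strategies concrete, here is a minimal sketch of lossy compression with a user-chosen error bound. It is not any DOE team’s method, and real scientific compressors are far more sophisticated. The function names, the synthetic signal, and the choice to quantize to 32-bit integers and then apply general-purpose `zlib` compression are all assumptions made for illustration.

```python
import math
import struct
import zlib


def compress_with_error_bound(values, max_error):
    """Snap each value to a grid of spacing 2 * max_error, then deflate.

    Rounding to the nearest grid point moves a value by at most half the
    grid spacing, so the reconstruction error stays within max_error
    (up to floating point rounding). Assumes the grid indices fit in a
    signed 32-bit integer.
    """
    step = 2.0 * max_error
    codes = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(codes)}i", *codes)
    return zlib.compress(raw)


def decompress(blob, max_error):
    """Invert compress_with_error_bound, given the same max_error."""
    step = 2.0 * max_error
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [c * step for c in codes]


# Hypothetical, slowly varying signal standing in for instrument readings.
signal = [1.0 + 0.001 * math.sin(i / 50) for i in range(10_000)]
bound = 1e-4

blob = compress_with_error_bound(signal, bound)
restored = decompress(blob, bound)
worst_error = max(abs(a - b) for a, b in zip(signal, restored))

original_bytes = len(signal) * 8  # size if stored as 64-bit floats
print(f"original: {original_bytes} bytes, compressed: {len(blob)} bytes")
print(f"worst-case error: {worst_error:.2e} (bound {bound:.0e})")
```

The point of the design is that the scientist, not the algorithm, picks `max_error`: loosening the bound shrinks the output, and tightening it preserves finer detail.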
Byung-Jun Yoon, an applied mathematician at Brookhaven National Laboratory, leads one of the data reduction teams. On a Zoom call, fittingly plagued by bandwidth issues, he explained that scientists already often reduce data out of necessity, but that “it’s more of a combination of art and science.” In other words, it is imperfect and forces scientists to be less systematic than they would like. “And that’s not even taking into account that a lot of the data generated is simply dumped because it cannot be stored,” he says.
Yoon’s approach is to develop ways to quantify the effect of a data reduction algorithm on signals in a data set that scientists have precisely defined in advance, such as a planet crossing a star or a mutation in a particular gene. Quantifying this effect will allow Yoon to tune the algorithm so that it maintains acceptable resolution for those signals of interest while removing as much of the irrelevant data as possible. “We want to be more confident when it comes to data reduction,” he says. “And that’s only possible if we can quantify the impact on things that really interest us.”
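The general idea of measuring a reduction’s effect on a predefined signal can be illustrated with a toy example. The sketch below is not Yoon’s method. It assumes a hypothetical noise-free light curve with one box-shaped dip, uses block averaging as a stand-in reduction algorithm, and treats the depth of the dip as the quantity of interest; all names and numbers are invented for illustration.

```python
def make_light_curve(n=2000, dip_start=900, dip_len=40, depth=0.01):
    """Synthetic star brightness: flat at 1.0 with one box-shaped dip."""
    return [1.0 - depth if dip_start <= i < dip_start + dip_len else 1.0
            for i in range(n)]


def block_average(samples, factor):
    """Reduce data by replacing each run of `factor` samples with its mean."""
    reduced = []
    for i in range(0, len(samples), factor):
        block = samples[i:i + factor]
        reduced.append(sum(block) / len(block))
    return reduced


def transit_depth(samples):
    """Quantity of interest: gap between the brightest and dimmest sample."""
    return max(samples) - min(samples)


def strongest_acceptable_reduction(samples, factors, tolerance):
    """Return the largest factor whose relative depth error is within tolerance.

    Returns 1 (no reduction) if none of the candidate factors qualifies.
    """
    true_depth = transit_depth(samples)
    best = 1
    for factor in sorted(factors):
        reduced = block_average(samples, factor)
        error = abs(transit_depth(reduced) - true_depth) / true_depth
        if error <= tolerance:
            best = factor
    return best


curve = make_light_curve()
chosen = strongest_acceptable_reduction(
    curve, factors=[2, 4, 10, 20, 50, 100], tolerance=0.05
)
print(f"largest reduction factor that keeps the dip intact: {chosen}")
```

Once the averaging window grows wider than the dip, the dip gets diluted by ordinary samples and its measured depth shrinks, which is exactly the kind of damage such a metric is meant to catch.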
Yoon wants his method to be applicable in every scientific field, but he will start with data sets from cryo-electron microscopy as well as from particle accelerators and light sources, which are among the largest producers of data in science and are expected to soon be regularly generating exabytes of data that will likewise need to be reduced quickly. If we learn nothing else from our exabytes, at least we can be sure that less is more.
Future Tense is a partnership between Slate, New America, and Arizona State University studying emerging technologies, public policy, and society.