"Open Data" Policy a Cause for Optimism and Concern

By Michael Lucibella

Plans are moving ahead slowly for making public the raw data obtained by federally funded scientists, though how that ultimately might take shape is still unclear. Experts expressed both excitement and apprehension about the final form the new policy might take.

On February 22, 2013, the administration’s Office of Science and Technology Policy (OSTP) released a memorandum stipulating that all federal agencies that fund more than $100 million in research come up with a plan to open up peer-reviewed results and raw data to the public.

“Most of the noise has been around the literature, not the data, but the data is likely going to have the longest term impact,” said John Wilbanks, the chief commons officer at Sage Bionetworks, who previously ran Creative Commons’ Science Commons project.

In March 2014, OSTP collected, reviewed and returned proposals from 23 agencies. The proposals haven’t been released to the public. Over the next several months, OSTP will meet with agency representatives to continue to refine proposals.

“I certainly expect that by the end of this year we’ll see the plans,” Wilbanks said.

More than a year after the memo was first issued, there has been no official word as to how the federal agencies plan on implementing the opening of scientists’ datasets. However, data experts are not worried and have applauded the administration for its deliberative pace.

“They recognize that this is a very difficult problem, far more difficult even than open access [for publications],” said Michael Lubell, director of public affairs for APS.

There are several outstanding questions about the policy, including what kind of data is covered and where it will be stored.

The memorandum defines data generally as “digital recorded factual material commonly accepted in the scientific community” and goes on to say that items like notebooks, physical objects, peer review reports and preliminary drafts and analyses wouldn’t be included. However, pinning down precisely what might be included and what might not be could prove to be tricky.

“The agencies are struggling with how to deal with datasets,” said Bonnie Carroll, CEO of Information International Associates. “The problem with datasets is people don’t really have a good definition of what is included and what isn’t.”

Carroll added that because there is so much variety in the kinds of data collected by scientists, it’s almost inevitable that exceptions will pop up that haven’t been covered by whatever policy a funding agency adopts.

Where data are to be stored presents its own potential issues. Some experiments, especially those in high energy and astrophysics, can collect terabytes or even exabytes of data. It’s not yet clear whether the mandates will require agencies to set up a single central database or link to data stored on outside servers.

“I don’t think they’ve even thought about that at this stage,” Lubell said. “These are huge huge datasets…. I think the magnitude of the storage problem for a central repository would be extremely large.”

Though high energy and astrophysics datasets are large, there are relatively few of them because there are a limited number of large experiments. Most already are open to the public in some way.

“I think it will change physics less than other fields where data starts from a more ‘artisanal’ source,” Wilbanks said. “If biology got to where physics is, I think that we will all declare that a victory.”

Right now, scientists are waiting to hear what the administration and the agencies ultimately decide. OSTP did not respond to interview requests.

Lubell cautions about the potential sweep of the requirements. He said that if the mandate on data ends up being overly broad, it could include calibration data that could be inherently misleading, or that could even be abused by individuals who want to misrepresent results.

“If you force people to post all their data, including the stuff you rejected…then who in the public knows enough to say that I shouldn’t look at something because it’s garbage?” Lubell said. “A large amount of data that you take, especially at the beginning of an experiment…is probably not correct.”

The reason cited in the mandate for opening the data is to let others use the raw information to innovate and ultimately spur the economy. However, Carroll said that she hopes that this will produce tools and algorithms that can help interpret the data, rather than just links to reams of raw data.

“What ultimately becomes important is the metadata and the linkage to the documentation,” Carroll said. “More and more we’re linking to the data sets so the boundary between publication and data becomes more and more porous.”

