APS News

Digital Archives Grow in Size and Usefulness

By Michael Lucibella

Photo: Einstein Papers Project

Einstein Online: The Einstein Papers Project has posted scans, transcriptions, and translations of its collection.

Historians, scientists, and the public now have more access to digitized raw materials than ever before. In the last few months, two large libraries of historical science documents were posted online, freely accessible to the public. Though online archives like these are becoming more common, the challenge of digitizing tens or hundreds of thousands of documents has kept the pace of uploads relatively slow.

In September, CERN began posting its massive photo archive to the lab’s online documents server. The group has already posted nearly 40,000 of its more than 120,000 black-and-white photo negatives from the 1950s through the early 1980s.

Then in December, the Einstein Papers Project, located at Caltech, started publishing digitized versions of Albert Einstein’s correspondence up through 1927. “What we put online are the existing scholarly annotated papers that have been collected,” said Diana Kormos Buchwald, a Caltech historian and director of the Einstein Papers Project. “We are not just putting up copies or facsimiles of known documents written by Einstein.”

The Einstein collection in particular required a lot of additional scholarship prior to its release, including transcription and translation of the original documents. “Throwing up scans or copies of manuscripts is not sufficient in this day and age. You want to explain when they were written, why they were written, and to whom they were written,” Buchwald said. And there is so much scholarship and documentation surrounding the life of Albert Einstein that sifting through and picking the relevant documents to upload and providing context for each is a major undertaking.

In a way, CERN’s photo archive poses the opposite challenge for the library team. Many of the photographs have little or no documentation, leaving the team in the dark about the events, people, and equipment they depict. “The old database that we have isn’t as good as it could be,” said Alex Brown, the assistant multimedia librarian at CERN who’s helping digitize the photos.

To fill in some of the blank spots in the record, the team has been reaching out for help in identifying people and items in the photographs. Any member of the public or scientific community who recognizes someone or something in the photographs can post a comment to the document server.

The team wanted to concentrate on the older trove of black-and-white photos first because of concern for the longevity of both the negatives and the individuals who could help identify the people and items in them.

For historical researchers, these kinds of big online repositories have been a major boon. Digital tools like text search are letting researchers scan through collections of documents more efficiently than ever before, while the Internet brings the archives to people around the world.

“It allows me to do a lot of research that I otherwise wouldn’t be doing because I don’t have the time or the money,” said Alex Wellerstein, a historian at the Stevens Institute of Technology. His own research is focused primarily on the history of American nuclear weapons and technology. He also said that traveling to and staying in different cities to access physical archives is one of the biggest expenses in historical research.

Faster computers and ever-cheaper hard drive space have allowed more archives to put large portions of their collections online for public access. Archives have embraced these online repositories because they reduce wear and tear on the documents themselves, as fewer people handle them. Preservation was what prompted the archivists at the Niels Bohr Library & Archives at the American Institute of Physics to digitize their most popular collection, the Samuel A. Goudsmit Papers, in 2011.

“One of the interesting things is that archives have moved very quickly in a very short period of time towards online archives and digitization,” said Joe Anderson, director of the Bohr Library.

However, the process of scanning potentially millions of pages of documents is time-consuming and labor-intensive. In 2012, comedian Seth MacFarlane made a donation that enabled the Library of Congress to acquire the collected papers of Carl Sagan. The library posted on its website about 110 selected items from the more than 1,700 document boxes, but has no plan to digitize the whole collection.

“It’s basically an online exhibit that was created to commemorate the acquisition of the papers,” said Trevor Owens, a digital archivist at the Library of Congress. “The idea of that project is to situate Carl Sagan’s papers within the broader collection of the L.O.C.”

The library is home to enormous troves of books and documents extending over hundreds of years, and the archivists have to conserve their digitization resources. Other collections like presidential or congressional records take precedence, largely because a collection like Sagan’s poses more challenges than older ones.

“The biggest challenge in doing modern collections like the Sagan papers … [is that] there’s a lot of rights issues to consider,” Owens said. He added that this becomes especially difficult as the collections grow in size.

However, one problem researchers have run into is that online archives are not inherently permanent fixtures. In late 2013, two big online archives maintained by the Department of Energy (DOE) went dark. A collection of documents and photographs about the agency’s Hanford Site was turned off around November, as was its Marshall Islands document collection. The Hanford archive was likely taken down because of the outdated infrastructure it used. A number of the photographs once available through the site have migrated onto the DOE’s Flickr account.

The Marshall Islands archive hosted about 14,000 individual documents, largely related to nuclear testing in the Pacific and the health effects on the Marshall Islanders. When Wellerstein inquired about the status of the archive, the department informed him that it would be a temporary outage, but more than a year later the archive still has not returned.

“The problem with the government hosting these archives … [is] the government may have a million reasons to take them down. There’s no mandate for keeping them online,” Wellerstein said. “The real scary thing about digital is that [while] it’s easy to put things out there, it’s really easy to turn it off again.”

However, disappearing archives are the exception rather than the rule, and interest from the public has helped encourage these kinds of projects. The CERN photo project was made possible by a fund personally authorized by the lab’s director general. “We were really impressed by the interest it generated,” said Jens Vigen, the head librarian at CERN. “Across European media, it has been all over the place.”

Part of the reason for the archive’s popularity is that the public has been encouraged not only to help identify the photos but also to use them. CERN allows anyone to reuse its photos as long as they credit the lab.

Brown also said that the artistic quality of many of the photographs caught the public’s eye. “You had to make sure you were going to get a good shot and that you were taking pictures that were going to be enduring. The artistic value of some of these pictures is quite high,” Brown said. “This kind of old-school cool is a bit of a trend at the moment.”

APS encourages the redistribution of the materials included in this newspaper provided that attribution to the source is noted and the materials are not truncated or changed.

Editor: David Voss
Staff Science Writer: Michael Lucibella
Art Director and Special Publications Manager: Kerry G. Johnson
Publication Designer and Production: Nancy Bennett-Karasik