Long-term preservation of digitized newspapers

Long-term preservation of digital objects is a challenging problem, and especially so for newspaper digitization projects. Many newspaper projects are very large, with huge numbers of objects to store and preserve. The digital objects themselves also tend to be large: it’s not uncommon for a scanned image of one newspaper page to be 50 MB or more. As a result, large newspaper digitization projects often end up with tens or even hundreds of terabytes of data.
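The arithmetic behind those figures is easy to sketch. The short Python example below is only an illustration: the page counts, the number of copies, and the function name `estimate_storage_tb` are assumptions chosen for the example, not figures from any particular project. It uses decimal units (1 TB = 1,000,000 MB).

```python
def estimate_storage_tb(pages: int, mb_per_page: float, copies: int = 1) -> float:
    """Estimate storage in decimal terabytes (1 TB = 1,000,000 MB).

    pages        -- number of scanned newspaper pages (hypothetical input)
    mb_per_page  -- average size of one master image in megabytes
    copies       -- how many full copies of the collection are kept
    """
    return pages * mb_per_page * copies / 1_000_000


if __name__ == "__main__":
    # Hypothetical collection sizes, using the 50 MB per page figure as a baseline.
    for pages in (100_000, 1_000_000, 2_000_000):
        single = estimate_storage_tb(pages, 50)
        triple = estimate_storage_tb(pages, 50, copies=3)
        print(f"{pages:>9,} pages: {single:6.1f} TB for one copy, "
              f"{triple:6.1f} TB for three copies")
```

On these assumptions, a collection of one million pages at 50 MB per page needs about 50 TB for a single copy, before any derivative images, OCR text, or backup copies are counted.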

In addition to the logistical challenges of securely storing huge amounts of data, there are challenges related to technological change, data formats becoming obsolete, and more.

Our Veridian-based services are traditionally focused on digitization, discovery, and delivery, and we don’t pretend to offer a complete solution for ensuring digital objects are preserved and usable for decades or centuries. We do offer services that make this easier, however: we can back up and preserve large amounts of data and keep it safe in the medium term (i.e. years, as opposed to centuries). We can also make recommendations based on our experience with other projects.

Options and recommendations

  • Create digital objects in a standardized data format, preferably METS/ALTO. The long-term benefit of adopting the same standard used by other projects is that if the standard ever becomes obsolete, you won’t be the only project needing to solve the problem. Hundreds of projects have digitized hundreds of millions of newspaper pages as METS/ALTO objects, so if the format ever becomes obsolete a suitable migration path is almost certain to be developed.
  • The industry standard, as recommended by the Library of Congress for the National Digital Newspaper Program (NDNP), is still to archive uncompressed TIFF master images of each newspaper page. These images are very large (often 50-100 MB), so large collections require a huge amount of storage. Some projects choose not to archive these very large images and instead store JPEG 2000 images. The JPEG 2000 images have the same resolution and quality as the TIFFs, but use lossless compression, resulting in much smaller files. The decision whether to retain or discard the uncompressed TIFFs depends on the practices of the institution digitizing the newspapers, on the budget and infrastructure it has available for long-term preservation, and on how comfortable it is with departing from the accepted “best practice”.
  • While storage space is constantly becoming less expensive, it is still relatively difficult and expensive to securely store tens of terabytes of data. Costs depend, of course, on the quality of the “preservation” on offer. For example, simply storing all the data on commodity hard drives costs relatively little, but hard drives and other media do eventually degrade and fail. A simple LOCKSS (Lots of Copies Keep Stuff Safe) approach is much better, but is usually more complex and costly. There are of course many other options to consider.
  • For those of our customers who want help preserving large quantities of digitized data, we’ve developed a service built on the Amazon Glacier storage product from Amazon Web Services (AWS). With this service we can’t, of course, guarantee that the data will be available and usable next decade or next century. We can, however, be sure (within the 99.999999999% durability that Amazon Glacier was designed to offer) that the objects will be preserved without any degradation from one year to the next. And if the customer eventually chooses to migrate their archive to an alternative platform, if and when better options become available, it is relatively easy to do so.
  • There is also a growing number of non-profit, non-commercial digital preservation options developed by the digital heritage industry for libraries, archives, and other digital memory organizations. One we have worked with in the past is MetaArchive, and we can provide help and advice for institutions planning to use such a service.
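Whichever storage option is chosen, the point made above about media degrading over time suggests checking periodically that archived files are still bit-for-bit identical to the originals. One common way to do this is to record a checksum for every file and recompute it later. The Python sketch below illustrates that general idea only. It is not a description of how our service or any of the products mentioned above work internally, and the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large master images do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems: files that are missing or whose
    contents no longer match the recorded checksum."""
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(path) != expected:
            problems.append(f"changed: {rel}")
    return problems
```

In this sketch, a manifest would be built once when the collection is archived and stored alongside (and separately from) the data. Running `verify_manifest` against each copy at intervals then shows whether any file has been lost or silently altered, so it can be restored from another copy.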