What sort of “digital objects” should I produce for my newspaper digitization project?

Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
Date: 2014-06-22

For our purposes we consider a “digital object” to be all the files that describe a single newspaper issue. A typical “digital object” produced by most modern newspaper digitization projects includes the following files.

  • One METS XML file for the entire newspaper issue. This file describes the issue and usually includes links to all the other files that make up the complete digital object.
  • One ALTO XML file for each page of the newspaper issue. These files contain the actual text produced from the Optical Character Recognition (OCR) process, as well as page layout information.
  • One uncompressed TIFF image (usually 300 DPI) for each page of the newspaper issue. Uncompressed TIFFs are quite large (often 40 MB to 80 MB per newspaper page), but they are still the industry standard for long-term digital preservation. Many of the projects we work on still choose to archive these large files, though some choose to discard them and keep only derivative files (usually in JPEG 2000 format).
  • One JPEG 2000 image file for each page of the newspaper issue. These images should preferably comply with the National Digital Newspaper Program (NDNP) JPEG 2000 profile (see http://www.digitalpreservation.gov/formats/fdd/fdd000192.shtml). The JPEG 2000 images are derivatives created from the original TIFFs, and are usually used instead of the TIFFs for online display.
  • One multi-page PDF file for the entire newspaper issue, often containing a text layer with OCR text. These files are optional and don’t contain any information that isn’t already stored in the various files listed above. Many projects still choose to produce them, however, because PDFs remain a useful alternative mechanism for viewing and printing a digitized newspaper.
  • One single-page PDF file for each page of the newspaper issue. These too are optional, but many projects choose to produce them because they make it easy for patrons to print individual newspaper pages.
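
The TIFF sizes quoted in the list above follow from simple arithmetic: an uncompressed image needs roughly width × height × bytes per pixel, plus a small header. The sketch below illustrates this. The function name and the 15 x 22 inch page size are hypothetical and chosen only for illustration; real page dimensions and bit depths vary from project to project.

```python
def uncompressed_tiff_bytes(width_in: float, height_in: float,
                            dpi: int = 300, bits_per_pixel: int = 8) -> int:
    """Approximate size in bytes of an uncompressed page image.

    Ignores the TIFF header and metadata, which are small compared
    with the pixel data.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Hypothetical 15 x 22 inch page scanned at 300 DPI.
    for label, bits in (("8-bit grayscale", 8), ("24-bit color", 24)):
        size_mb = uncompressed_tiff_bytes(15, 22, 300, bits) / 1_000_000
        print(f"{label}: about {size_mb:.1f} MB")
```

Multiplying the per-page figure by the number of pages in a collection gives a rough storage budget for the archival masters.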

Usually all the files that make up a single digital object are stored together in a consistent folder/directory structure. Different projects use different file naming schemes, but a commonly used one is as follows.

<batch-name>/<publication-code>/<year>/<month>/<day>/<files>

Using that scheme, a single digital object for a three-page issue of the New York Times from July 1, 1940 might look as follows:

BATCH1/NYT/1940/07/01/NYT_19400701_mets.xml
BATCH1/NYT/1940/07/01/NYT_19400701_issue.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0001.xml
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0002.xml
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0003.xml
BATCH1/NYT/1940/07/01/NYT_19400701_0001.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0002.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0003.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0001.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0002.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0003.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0001.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_0002.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_0003.pdf
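
Because the naming scheme is regular, the list of files expected for any issue can be generated and compared with what a digitization vendor actually delivers. The following is a minimal sketch, assuming the exact scheme shown above. The function names expected_files and missing_files are hypothetical, and a project with a different naming scheme would need to adapt the patterns.

```python
from datetime import date
from pathlib import PurePosixPath


def expected_files(batch: str, code: str, issue_date: date, pages: int) -> list[str]:
    """Return the relative paths expected for one issue (one digital object)."""
    folder = PurePosixPath(
        batch, code,
        f"{issue_date:%Y}", f"{issue_date:%m}", f"{issue_date:%d}",
    )
    stem = f"{code}_{issue_date:%Y%m%d}"
    page_numbers = range(1, pages + 1)

    # Issue-level files: the METS file and the multi-page PDF.
    names = [f"{stem}_mets.xml", f"{stem}_issue.pdf"]
    # Page-level files: ALTO, TIFF, JPEG 2000 and single-page PDF.
    names += [f"{stem}_ALTO_{n:04d}.xml" for n in page_numbers]
    for ext in ("tif", "jp2", "pdf"):
        names += [f"{stem}_{n:04d}.{ext}" for n in page_numbers]

    return [str(folder / name) for name in names]


def missing_files(expected: list[str], present: set[str]) -> list[str]:
    """Return the expected paths that are absent from a delivered set."""
    return [path for path in expected if path not in present]


if __name__ == "__main__":
    for path in expected_files("BATCH1", "NYT", date(1940, 7, 1), 3):
        print(path)
```

Run as a script, this prints the same fourteen paths listed above for the three-page example issue. Passing the result to missing_files together with the set of paths actually found on disk gives a simple completeness check for each digital object.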