The Process of Scanning Newspapers

Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
Date: 2014-06-20

The first step in most newspaper digitization projects is to scan the original newspaper pages (or microfilm of the originals) to produce digital images.

Newspapers are uniquely challenging to scan, so it’s important to select a vendor with experience in scanning newspapers for the library/heritage market. There are many vendors capable of scanning your newspapers, but far fewer are capable of doing it properly. We provide scanning services through partnerships with several carefully selected vendors in the United States and elsewhere, and can help you select a suitable scanning vendor. Likewise, if your library plans to scan newspapers in-house, we are happy to provide advice to ensure the scanned images are as good as they can be.

The remainder of this page contains information and recommendations from the many newspaper digitization projects we’ve been involved with. Every project is different, but we’ve attempted to capture here some of the key considerations when scanning newspapers.

Different types of “original” newspapers

There are three main types of “originals”, as follows.

  1. Newspapers in their original (paper) form.
  2. Analog microfilms taken from original paper newspapers. Many newspaper titles have been microfilmed over the past 50 years, for preservation purposes.
  3. Digital versions of newspapers, usually in PDF format. Typically these are newer (mostly 21st century) newspapers, for which digital PDFs were created and archived during the original printing process. No scanning is required for these of course, but the rest of the digitization process (creation of METS/ALTO objects, access systems, and digital preservation) remains very similar.

Notes, considerations, and recommendations

  • Microfilm quality varies greatly, depending on when it was created, how it was created, and how it has been stored. Some microfilmed newspapers are unreadable, and it isn’t possible to produce good digital images from poor microfilm.
  • Digital images created by scanning original paper newspapers are often better quality than those created by scanning microfilm.
  • It is usually more difficult, more logistically challenging, and more expensive to scan paper newspapers than it is to scan microfilmed newspapers. For example, microfilm is relatively easy to copy and ship to a scanning facility, but large collections of newsprint are difficult and expensive to transport. Large-format (e.g. broadsheet) newspapers also require very specialized and expensive scanners/cameras to digitize directly. For this reason many libraries choose to digitize from microfilm (if they have it), in preference to digitizing from paper originals.
  • For the reasons stated above, it is often preferable to find a local scanning vendor if scanning from paper originals. If scanning from microfilm, the location of the scanning vendor is less important.
  • Normally the scanning process should produce 300 DPI, uncompressed, grayscale TIFF images. Some projects choose to produce color images, especially if scanning original paper newspapers, but most produce grayscale images. In some cases higher-resolution images (either 400 DPI or 600 DPI) are produced, but 300 DPI is the most common. Images in alternative formats like JPEG 2000 are often produced later in the process, but the scanned master images should normally be TIFFs.
  • If different vendors handle the scanning and the OCR and METS/ALTO creation, it is often best to leave any necessary image cleanup (e.g. splitting two-up images, cropping borders, deskewing, despeckling, etc.) to the OCR vendor. Some automated image cleanup treatments can adversely affect OCR quality, so it’s best if the scanning vendor delivers the best unprocessed images they can produce, straight from their scanner or camera. The OCR vendor can then perform any necessary image cleanup.
  • Ensure your scanning vendor has an appropriate facility with suitable security and fire prevention systems, especially if you intend to send them irreplaceable source documents.
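
The resolution and color choices described above have a direct effect on storage requirements, because uncompressed master images grow with the square of the resolution. The following is a minimal Python sketch of that arithmetic, not a description of any vendor's tooling. The page dimensions (15 x 22 inches, a rough stand-in for a broadsheet page), the 8 bits per channel, and the function name are all assumptions made for illustration, and the estimate covers pixel data only, ignoring TIFF header and metadata overhead.

```python
# Rough storage estimate for uncompressed scans (pixel data only).
# PAGE_WIDTH_IN and PAGE_HEIGHT_IN are assumed values, not from the article.

PAGE_WIDTH_IN = 15.0   # assumed page width in inches
PAGE_HEIGHT_IN = 22.0  # assumed page height in inches


def uncompressed_image_bytes(width_in, height_in, dpi,
                             channels=1, bits_per_channel=8):
    """Return the approximate size in bytes of an uncompressed image.

    channels=1 models grayscale and channels=3 models RGB color.
    TIFF header and metadata overhead are ignored.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * channels * bits_per_channel
    return total_bits // 8


if __name__ == "__main__":
    for dpi in (300, 400, 600):
        for label, channels in (("grayscale", 1), ("color", 3)):
            size = uncompressed_image_bytes(
                PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, channels=channels
            )
            print(f"{dpi} DPI {label}: {size / 1_000_000:.1f} MB per page")
```

Under these assumptions, doubling the resolution from 300 DPI to 600 DPI quadruples the size of each page image, and color triples it compared with grayscale. Multiplying the per-page figure by the number of pages in a collection gives a first approximation of the storage needed for master images.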