Newspaper article segmentation — the pros and cons

Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
Date: 2014-06-19

When digitizing newspapers, it is possible to identify, segment, and categorize individual articles as they appear on the page. Whether to perform this “article segmentation” is one of the key decisions in planning a newspaper digitization project.

As discussed in other articles on this site (see the Related Reading section below), we strongly recommend the METS/ALTO standard for digitizing newspapers. METS/ALTO supports both digitized newspapers without article segmentation (sometimes called “page level” METS/ALTO) and those with article segmentation (sometimes called “article level” METS/ALTO). Our Veridian discovery and delivery platform likewise supports both types of METS/ALTO data, and some projects even mix the two in the same collection on the same site.

Examples

Example newspaper projects using METS/ALTO with article segmentation

Cambridge Public Library
Historic Cambridge newspapers, 1846 – 1923
59,070 pages

Vassar College
Vassar College student newspapers, 1872 – present
55,426 pages

Example newspaper projects using METS/ALTO without article segmentation

Library of Virginia
Virginia newspapers, 1841 – 1999
350,000+ pages

Indiana State Library
Indiana newspapers, 1840 – 1922
90,000+ pages

The advantages of article segmentation

Article segmentation identifies the individual newspaper articles within each digitized page. With this type of METS/ALTO data the smallest unit is a newspaper article rather than a newspaper page. When users search a collection at the article level, they retrieve a list of articles matching their search terms instead of a list of matching pages. That makes search results much more useful: for example, headlines, bylines, and other article-level metadata can be displayed in the results.
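The difference between the two result types can be sketched in a few lines of Python. This is only an illustration, not how Veridian or any particular search engine is implemented: the `TextBlock` record, its field names, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One block of OCR text (hypothetical record, for illustration only)."""
    page: int          # page number the block appears on
    article_id: str    # ID assigned by article segmentation (assumed field)
    headline: str      # headline of the article the block belongs to
    text: str          # OCR text of the block


def search_pages(blocks, term):
    """Page-level search: return the sorted page numbers containing the term."""
    term = term.lower()
    return sorted({b.page for b in blocks if term in b.text.lower()})


def search_articles(blocks, term):
    """Article-level search: return (article_id, headline) pairs in first-seen order."""
    term = term.lower()
    hits = {}
    for b in blocks:
        if term in b.text.lower():
            # An article that continues on another page is still one result.
            hits.setdefault(b.article_id, b.headline)
    return list(hits.items())


# Invented sample data: article "a1" starts on page 1 and continues on page 2.
blocks = [
    TextBlock(1, "a1", "Harbor Bridge Opens", "The new bridge opened to traffic on Monday."),
    TextBlock(1, "a2", "Council Meeting", "The council debated bridge tolls at length."),
    TextBlock(2, "a1", "Harbor Bridge Opens", "Crowds gathered at the bridge from early morning."),
    TextBlock(2, "a3", "Weather", "Rain is expected later this week."),
]

print(search_pages(blocks, "bridge"))     # [1, 2]
print(search_articles(blocks, "bridge"))  # [('a1', 'Harbor Bridge Opens'), ('a2', 'Council Meeting')]
```

The page-level search can only say that pages 1 and 2 contain the term, while the article-level search returns two distinct articles, each with a headline that can be shown in the result list.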

With article-level data, each article can also be categorized, for example as an “advertisement”, an “illustration”, or a “family notice”. Those categories allow much richer search options: users can restrict a search to just family notices, or exclude advertisements from their results.
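As a rough sketch of how such category filters behave, the following Python function narrows a list of article records by category. The record layout, the category strings, and the `filter_articles` helper are assumptions for illustration, not part of METS/ALTO or of any specific platform.

```python
def filter_articles(articles, include=None, exclude=()):
    """Keep articles whose category is in `include` (if given) and not in `exclude`."""
    include = set(include) if include is not None else None
    exclude = set(exclude)
    return [
        a for a in articles
        if (include is None or a["category"] in include)
        and a["category"] not in exclude
    ]


# Invented sample records; the category labels are hypothetical.
articles = [
    {"id": "a1", "category": "article", "headline": "Harbor Bridge Opens"},
    {"id": "a2", "category": "advertisement", "headline": "Fine Boots for Sale"},
    {"id": "a3", "category": "family notice", "headline": "Births and Marriages"},
    {"id": "a4", "category": "illustration", "headline": "View of the Harbor"},
]

# Restrict a search to family notices only.
notices = filter_articles(articles, include=["family notice"])
print([a["id"] for a in notices])   # ['a3']

# Exclude advertisements from the results.
no_ads = filter_articles(articles, exclude=["advertisement"])
print([a["id"] for a in no_ads])    # ['a1', 'a3', 'a4']
```

Neither filter is possible with page-level data, because a single page typically mixes news articles, advertisements, and notices.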

Article-level data also allows individual articles to be identified and highlighted when viewing a newspaper using Veridian’s discovery and delivery platform. That has a number of benefits, including allowing articles to be “clipped out” from the original page image, for saving or printing.
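One way to picture the “clipping” step: ALTO records the position and size of each text block on the page (the HPOS, VPOS, WIDTH, and HEIGHT attributes), so the region to clip for an article can be taken as the smallest rectangle enclosing all of its blocks. The sketch below assumes the blocks have already been read into plain dictionaries with hypothetical lowercase keys. It is not Veridian's implementation, and it ignores articles that span several columns or pages.

```python
def clip_region(blocks):
    """Return (left, top, right, bottom) of the rectangle enclosing all blocks."""
    left = min(b["hpos"] for b in blocks)
    top = min(b["vpos"] for b in blocks)
    right = max(b["hpos"] + b["width"] for b in blocks)
    bottom = max(b["vpos"] + b["height"] for b in blocks)
    return (left, top, right, bottom)


# Invented coordinates for a headline block and the body block beneath it.
article_blocks = [
    {"hpos": 100, "vpos": 200, "width": 400, "height": 50},
    {"hpos": 100, "vpos": 260, "width": 380, "height": 600},
]

print(clip_region(article_blocks))  # (100, 200, 500, 860)
```

The resulting rectangle could then be passed to an image library to crop the page image for saving or printing.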

The disadvantage of article segmentation

Cost! All the advantages listed above unfortunately come at a price. There are some solid software products available for automatically identifying articles on digitized newspaper pages, but those systems still need considerable input from human operators to produce high-quality output. As a result, the article segmentation process takes a lot of time, which adds significantly to the per-page digitization cost. Producing METS/ALTO data with article segmentation may cost $0.80 or more per newspaper page, which is two to five times the cost of producing METS/ALTO without article segmentation.
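To make those figures concrete, the short calculation below works through a hypothetical 100,000-page project. It assumes the $0.80 per-page figure quoted above (the text says “or more”, so this is a floor) and reads “two to five times” as a ratio between article-level and page-level cost. The project size is invented for the example.

```python
ARTICLE_LEVEL_CENTS = 80        # $0.80 per page, from the text ("or more")
RATIO_LOW, RATIO_HIGH = 2, 5    # article-level cost is 2x to 5x page-level cost
PAGES = 100_000                 # hypothetical project size

# Implied page-level cost per page, in cents.
page_level_high = ARTICLE_LEVEL_CENTS / RATIO_LOW    # 40.0 cents
page_level_low = ARTICLE_LEVEL_CENTS / RATIO_HIGH    # 16.0 cents

# Project totals, in dollars.
article_total = PAGES * ARTICLE_LEVEL_CENTS / 100    # 80000.0
page_total_low = PAGES * page_level_low / 100        # 16000.0
page_total_high = PAGES * page_level_high / 100      # 40000.0

print(f"Article-level total: ${article_total:,.0f}")
print(f"Page-level total:    ${page_total_low:,.0f} to ${page_total_high:,.0f}")
print(f"Extra for segmentation: ${article_total - page_total_high:,.0f} "
      f"to ${article_total - page_total_low:,.0f}")
```

Under these assumptions, page-level METS/ALTO works out to roughly $0.16 to $0.40 per page, and article segmentation adds somewhere between $40,000 and $64,000 to a 100,000-page project.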