Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
What is ALTO?
The Analyzed Layout and Text Object (ALTO) is an open XML standard maintained by the Library of Congress.
ALTO is a schema for capturing the word content, styles, and layout elements on a digitized textual page, including the spatial coordinates of text elements like columns and lines. It is often used in tandem with METS XML, which provides descriptive and administrative metadata about the object to which the ALTO XML file belongs.
What ALTO XML contains
An ALTO XML document describes the physical layout, composition, and page content of a digital object. ALTO files generally have three sections.
Section 1 – Description
The Description section contains descriptive information pertaining to the ALTO file itself, including measurement units, source file information, processing software and creator, and OCR information.
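The measurement unit recorded in the Description section determines how every coordinate elsewhere in the file should be read. As a minimal sketch, assuming the unit names defined by the ALTO schema (pixel, mm10 for tenths of a millimeter, and inch1200 for 1/1200 of an inch), the hypothetical helper below converts a raw coordinate value to millimeters. The function name and the dpi parameter are illustrative and not part of ALTO; pixel values can only be converted if the scan resolution is known from elsewhere.

```python
MM_PER_INCH = 25.4


def to_millimeters(value, unit, dpi=None):
    """Convert an ALTO coordinate value to millimeters.

    `unit` is the measurement unit named in the Description section.
    `dpi` is only needed for pixel units (hypothetical parameter; the
    resolution has to be supplied by the caller).
    """
    if unit == "mm10":
        # Tenths of a millimeter.
        return value / 10
    if unit == "inch1200":
        # 1/1200 of an inch.
        return value / 1200 * MM_PER_INCH
    if unit == "pixel":
        if dpi is None:
            raise ValueError("pixel units need a scan resolution (dpi)")
        return value / dpi * MM_PER_INCH
    raise ValueError(f"unknown measurement unit: {unit!r}")


if __name__ == "__main__":
    # A WIDTH of 4198 read as inch1200 units, expressed in millimeters.
    print(round(to_millimeters(4198, "inch1200"), 2))
```

Keeping the conversion in one place means the rest of a processing script can work in a single physical unit regardless of how a particular file was produced.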
Section 2 – Styles
The Styles section contains descriptions of fonts and paragraphs. Common information includes the font-family and size, font styling, and paragraph alignments and line spacing.
Section 3 – Layout
<TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="4516" HEIGHT="323"/>
<LeftMargin ID="p1_LM00001" HPOS="0" VPOS="323" WIDTH="133" HEIGHT="5981"/>
<TextLine ID="p1_TL00001" HPOS="163" VPOS="1909" WIDTH="4198" HEIGHT="23"/>
The Layout section is where the actual text content (String) and its position and dimensions (HPOS, VPOS, WIDTH, and HEIGHT) are located. Each block of text is listed and absolutely positioned, measured from the top-left corner of the page in units that are typically fractions of an inch or of a millimeter. Further detail and positioning are provided for every line and every word on the page. The Layout section also describes and positions any other objects that may be on the page, such as pictures, tables, and formulas.
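To show how these positions can be read programmatically, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It wraps the three sample elements in a bare Page element so that they form one parseable document; that wrapper is a simplification for illustration, since in a real ALTO file text lines sit inside further container elements. The function names are hypothetical. Each element's HPOS, VPOS, WIDTH, and HEIGHT are turned into a bounding box (left, top, right, bottom) in the file's own measurement units.

```python
import xml.etree.ElementTree as ET

# The three sample elements, wrapped in a bare <Page> for illustration only.
SAMPLE = """<Page>
  <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="4516" HEIGHT="323"/>
  <LeftMargin ID="p1_LM00001" HPOS="0" VPOS="323" WIDTH="133" HEIGHT="5981"/>
  <TextLine ID="p1_TL00001" HPOS="163" VPOS="1909" WIDTH="4198" HEIGHT="23"/>
</Page>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def bounding_boxes(xml_text):
    """Return (element name, ID, left, top, right, bottom) for each positioned element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for element in root.iter():
        # Only elements that carry a position are of interest here.
        if "HPOS" not in element.attrib or "VPOS" not in element.attrib:
            continue
        left = float(element.get("HPOS"))
        top = float(element.get("VPOS"))
        width = float(element.get("WIDTH", 0))
        height = float(element.get("HEIGHT", 0))
        boxes.append(
            (local_name(element.tag), element.get("ID"),
             left, top, left + width, top + height)
        )
    return boxes


if __name__ == "__main__":
    for box in bounding_boxes(SAMPLE):
        print(box)
```

Because positions are measured from the top-left corner of the page, the right and bottom edges are simply HPOS plus WIDTH and VPOS plus HEIGHT. Stripping the namespace prefix lets the same function work on files that declare an ALTO namespace as well as on bare fragments like the sample.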