Veridian data ingestion guide

Table of Contents

1. Introduction
1.1. Documents
1.2. Publications
1.3. Storing the source data
1.4. The import directory structure
2. Document ingestion
2.1. Linux
2.2. Windows
3. Document removal
3.1. Linux
3.2. Windows


1. Introduction

This section introduces the key concepts of a Veridian™ “document” and a Veridian™ “publication”, and describes how to organise documents into publications in a Veridian™ collection.

1.1. Documents

In Veridian™, a document may be:

  • A newspaper issue: a METS XML file for the issue, an ALTO XML file for each page of the issue, a JPEG/JP2/TIFF image file for each page of the issue, and optionally PDF files for each page and the entire issue.

  • A book: a METS XML file for the book, an ALTO XML file for each page of the book, a JPEG/JP2/TIFF image file for each page of the book, and optionally PDF files for each page and the entire book.

  • A periodical (or journal issue): a METS XML file for the periodical, an ALTO XML file for each page of the periodical, a JPEG/JP2/TIFF image file for each page of the periodical, and optionally PDF files for each page and the entire periodical.

  • A series of images: JPEG/JP2/TIFF image files only (no METS/ALTO XML files). Although these documents have no article information and no text content (meaning they cannot be searched), they can still be viewed in the Veridian™ interface. For some collections, adding the image data for documents while they are being OCRed can be a helpful intermediate step. The image files are processed in alphabetical order, so they must be named accordingly (see the example at the end of this section).

  • A single image: one JPEG/JP2/TIFF image file. These documents will have no text content.

  • A video file: one FLV/F4V Flash video file. These documents will have no text content.

  • An audio file: one F4A Flash audio file. These documents will have no text content.

In all of these cases, each Veridian™ document must belong to a publication and must sit in its own directory, following the directory structure described below.
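
For example, an image-only document (the “series of images” case above) might consist of nothing more than sequentially named page images, stored using the directory structure described in section 1.4:

BATCH1/CODE/1940/07/01/0001.jp2
BATCH1/CODE/1940/07/01/0002.jp2
BATCH1/CODE/1940/07/01/0003.jp2

The zero-padded names ensure the pages sort, and are therefore processed, in the correct order.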

1.2. Publications

The publication is a key Veridian™ concept that must be understood from the outset. Publications affect all parts of Veridian™, from the source data directory structure through to the delivery system interface.

Publications are used to group individual Veridian™ documents together. For example, all the issues of a newspaper would be grouped together in one publication, and all the issues of a journal would be grouped together in another publication. Each book usually becomes a publication on its own, unless it is part of a series. There is no limit to the number of publications allowed in a Veridian™ collection.

Each publication has a unique code, which must contain only letters (spaces, punctuation and numbers are not permitted). Veridian™ documents with the same publication code are grouped together into the same publication internally, and in the delivery system.

The publication code for a Veridian™ document is extracted from its directory structure, usually from the first subdirectory. In other words, the directory structure of the Veridian™ data specifies which publications are in the collection and how the Veridian™ documents are grouped into those publications. For more information, see section 1.4, “The import directory structure”.

Once the data has been ingested (described below), it is necessary to define metadata about the publications in the collection. To do this, edit the veridian/etc/publication-metadata.xml file and add a new <Publication> entry containing appropriate information for the new publication. Entries look like:

<Publication code="CODE">
  <Title>Publication title</Title>
  <Title-alternate>Alternate publication title</Title-alternate>
  <URL-icon>URL of masthead image goes here</URL-icon>
  <Description language="en">Short English description goes here.</Description>
  <Description language="de">Short German description goes here.</Description>
  <Coverage-geographic>Area</Coverage-geographic>
  <Place-of-publication>City</Place-of-publication>
  <Digitisation-range>incomplete</Digitisation-range>
  <Essay language="en">Long English description goes here.</Essay>
  <Essay language="de">Long German description goes here.</Essay>
  <Acknowledgement language="en">English acknowledgement goes here.</Acknowledgement>
  <Acknowledgement language="de">German acknowledgement goes here.</Acknowledgement>
</Publication>

The only required values are the publication code (“CODE”) and the Title field. Values from the other fields, if present, are displayed on the publication “about” page.
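
For example, a minimal entry containing only the required values would be:

<Publication code="CODE">
  <Title>Publication title</Title>
</Publication>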

After entering the publication metadata, run (Linux):

cd veridian
./bin/shell/Load-Publication-Metadata.sh

Or (Windows):

cd veridian
.\bin\shell\Load-Publication-Metadata.bat

1.3. Storing the source data

The source data files must be accessible from the machine running the Veridian™ software system. For best performance these files should be on local disk, but if necessary they can be network mounted from another machine. The source data files must have their permissions set so that they are readable by all, and the directories must be executable by all; an example of setting suitable permissions is shown below.
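
On Linux, for example, one way to set suitable permissions across the whole source data tree (a sketch, assuming the data is stored under /data/Veridian) is:

chmod -R a+rX /data/Veridian

The capital “X” grants execute permission to directories (making them traversable) without making ordinary data files executable.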

The source data should be broken up into parts (e.g. by publication or by digitisation batch). This allows the data to be ingested in stages, and new batches of data to be ingested without re-ingesting the earlier data. This is described further below.

1.4. The import directory structure

The directory structure of the source Veridian™ data is very important because it controls how the documents are grouped into publications (described above) and specifies the dates of the documents (and optionally, the document edition).

The Veridian™ ingestion plugins can be customised to handle almost any directory structure for the source data, but for simplicity we recommend:

<batch-name>/<publication-code>/<year>/<month>/<day>/<files>

For example, for METS/ALTO documents from 1 July 1940 and 2 July 1940 with publication code “CODE”, the structure might look like:

BATCH1/CODE/1940/07/01/mets.xml
BATCH1/CODE/1940/07/01/0001.jp2
BATCH1/CODE/1940/07/01/0001.xml
BATCH1/CODE/1940/07/01/… etc.

BATCH1/CODE/1940/07/02/mets.xml
BATCH1/CODE/1940/07/02/0001.jp2
BATCH1/CODE/1940/07/02/0001.xml
BATCH1/CODE/1940/07/02/… etc.

If a book from 1980 were added to the collection, it would become a new publication (e.g. with publication code “BOOK”), so the structure might look like:

BATCH2/BOOK/1980/mets.xml
BATCH2/BOOK/1980/0001.jp2
BATCH2/BOOK/1980/0001.xml
BATCH2/BOOK/1980/… etc.

The directory structures accepted by the standard Veridian™ ingestion plugins are:

  • Documents with a full date:

    <batch-name>/<publication-code>/<YYYY><MM><DD>/<files>
    <batch-name>/<publication-code>/<YYYY>-<MM>-<DD>/<files>
    <batch-name>/<publication-code>/<YYYY>/<MM>/<DD>/<files>

    <batch-name>/<publication-code>/<YYYY><MM><DD>_<edition>/<files>
    <batch-name>/<publication-code>/<YYYY>-<MM>-<DD>_<edition>/<files>
    <batch-name>/<publication-code>/<YYYY>/<MM>/<DD>_<edition>/<files>

  • Documents with a month-level date:

    <batch-name>/<publication-code>/<YYYY><MM>/<files>
    <batch-name>/<publication-code>/<YYYY>-<MM>/<files>
    <batch-name>/<publication-code>/<YYYY>/<MM>/<files>

    <batch-name>/<publication-code>/<YYYY><MM>_<edition>/<files>
    <batch-name>/<publication-code>/<YYYY>-<MM>_<edition>/<files>
    <batch-name>/<publication-code>/<YYYY>/<MM>_<edition>/<files>

  • Documents with a year-level date:

    <batch-name>/<publication-code>/<YYYY>/<files>

    <batch-name>/<publication-code>/<YYYY>_<edition>/<files>

  • Documents with no date:

    <batch-name>/<publication-code>/<files>

    <batch-name>/<publication-code>/_<edition>/<files>
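
For example, if two editions of the 1 July 1940 issue of publication “CODE” were digitised on the same day, they might be stored as follows (the edition labels “MORNING” and “EVENING” are purely illustrative):

BATCH1/CODE/1940/07/01_MORNING/mets.xml
BATCH1/CODE/1940/07/01_MORNING/… etc.

BATCH1/CODE/1940/07/01_EVENING/mets.xml
BATCH1/CODE/1940/07/01_EVENING/… etc.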


2. Document ingestion

2.1. Linux

To add a batch of data to your Veridian collection on Linux, follow these steps:

  1. Put the batch of data (one or more documents) in a new veridian/import/BATCH1 directory, following the directory structure described above.

    On Linux, symbolic links can be used to avoid duplicating the data (particularly if the data is network mounted from another computer). For example, if the source Veridian™ data was stored in /data/Veridian/BATCH1, you would run:

    cd veridian
    ln -s /data/Veridian/BATCH1 import/BATCH1
  2. If the batch of data has previously been ingested, you will first need to delete the old veridian/archives/BATCH1 directory:

    cd veridian
    rm -rf archives/BATCH1
  3. Ingest the batch of data by running:

    cd veridian
    ./bin/shell/Import-Batch.sh BATCH1

    Once the Import-Batch.sh process has finished, a veridian/archives/BATCH1 directory for this batch will have been created.

  4. Check the veridian/logs/import.BATCH1.<DATE>.log file for warnings and errors (an error message is logged for each document that was rejected). For METS/ALTO data, the most common cause is an XML parsing failure due to METS/ALTO XML files that are not well-formed; in that case, ask the OCR data provider for a corrected version of the document. A quick way to scan the log files is shown after these steps.

  5. Build the batch of data into the live index by running:

    cd veridian
    ./bin/shell/Build-Batch.sh BATCH1
  6. Check the veridian/logs/build.BATCH1.<DATE>.log file for errors and exceptions, and contact DL Consulting for assistance if necessary.
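
As a quick sanity check, the log files can be scanned for problems from the command line (a sketch; the actual log filenames include the date of each run):

cd veridian
grep -iE 'warning|error|exception' logs/import.BATCH1.*.log
grep -iE 'warning|error|exception' logs/build.BATCH1.*.log

If a METS/ALTO document was rejected because its XML is not well-formed, a tool such as xmllint (from libxml2) can confirm the problem, e.g. xmllint --noout import/BATCH1/CODE/1940/07/01/mets.xml.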

When adding further batches of data, create a new import directory (e.g. “import/BATCH2”) and repeat these steps.

2.2. Windows

To add a batch of data to your Veridian collection on Windows, follow these steps:

  1. Put the batch of data (one or more documents) in a new veridian\import\BATCH1 directory, following the directory structure described above.

  2. If the batch of data has previously been ingested, you will first need to delete the old veridian\archives\BATCH1 directory:

    cd veridian
    rmdir /S /Q archives\BATCH1
  3. Ingest the batch of data by running:

    cd veridian
    .\bin\shell\Import-Batch.bat BATCH1

    Once the Import-Batch.bat process has finished, a veridian\archives\BATCH1 directory for this batch will have been created.

  4. Check the veridian\logs\import.BATCH1.<DATE>.log file for warnings and errors (an error message is logged for each document that was rejected). For METS/ALTO data, the most common cause is an XML parsing failure due to METS/ALTO XML files that are not well-formed; in that case, ask the OCR data provider for a corrected version of the document. A quick way to scan the log files is shown after these steps.

  5. Build the batch of data into the live index by running:

    cd veridian
    .\bin\shell\Build-Batch.bat BATCH1
  6. Check the veridian\logs\build.BATCH1.<DATE>.log file for errors and exceptions, and contact DL Consulting for assistance if necessary.
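
As on Linux, the log files can be scanned for problems from the command line (a sketch; the actual log filenames include the date of each run):

cd veridian
findstr /I "warning error exception" logs\import.BATCH1.*.log
findstr /I "warning error exception" logs\build.BATCH1.*.log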

When adding further batches of data, create a new import directory (e.g. “import\BATCH2”) and repeat these steps.


3. Document removal

3.1. Linux

To remove a complete batch of data from your Veridian collection index on Linux, follow these steps:

  1. Remove the batch of data from the index (the archives are not affected) by running:

    cd veridian
    ./bin/shell/Remove-Batch.sh BATCH1
  2. Check the veridian/logs/remove.BATCH1.<DATE>.log file for errors, and contact DL Consulting for assistance if necessary.

You can also remove specific documents from the index by running:

cd veridian
./bin/shell/Remove-Documents.sh <OIDs>

where “<OIDs>” is a comma-separated list of Veridian document OIDs.
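
For example, to remove two documents in one run (the OIDs here are purely illustrative; the actual OID format depends on your collection):

cd veridian
./bin/shell/Remove-Documents.sh OID1,OID2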

3.2. Windows

To remove a complete batch of data from your Veridian collection index on Windows, follow these steps:

  1. Remove the batch of data from the index (the archives are not affected) by running:

    cd veridian
    .\bin\shell\Remove-Batch.bat BATCH1
  2. Check the veridian\logs\remove.BATCH1.<DATE>.log file for errors, and contact DL Consulting for assistance if necessary.

You can also remove specific documents from the index by running:

cd veridian
.\bin\shell\Remove-Documents.bat <OIDs>

where “<OIDs>” is a comma-separated list of Veridian document OIDs.