User Text Corruption! What are the risks of Crowdsourced text correction?

Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
Date: 2014-11-21


Veridian’s Crowdsourced User Text Correction (UTC) feature allows users to correct OCR errors that commonly occur in digitized newspapers (and other digitized text). It has been adopted by many newspaper digitization projects, and has been very successful. For example, as I write this article the California Digital Newspaper Collection has 2,778 registered users whose combined contribution is more than 3.3 million corrections to OCR errors!

The idea behind UTC is to create a community of people working together on a resource that becomes more valuable because of their contributions. We are often asked, though, what prevents users from maliciously vandalizing or corrupting the text in these collections. The fact is, there’s nothing preventing them from doing that!

Since we’re asked about user text vandalism so often, we thought we’d write down some thoughts on the subject. Here they are:

  • Two of the largest and oldest newspaper digitization projects to include crowdsourced user text correction, Trove from the National Library of Australia and the California Digital Newspaper Collection (CDNC), have both carried out studies to attempt to detect vandalism. Both found occurrences of vandalism to be extremely rare, and in the case of the CDNC they couldn’t find any occurrences at all! This doesn’t prove much of course, as subtle acts of vandalism can be hard to detect in a huge collection of newspapers, and some cases have been detected in Trove (though they are a minuscule proportion compared with the many millions of OCR errors corrected by that project). What it suggests, though, is that for most newspaper digitization projects there is only a low risk that users will use crowdsourcing tools to vandalize text.
  • Users can of course “correct the corrections” if they choose to do so. So vandalism by one user can be corrected by another.
  • Veridian doesn’t modify the original source data produced with the OCR process, so it’s always possible to roll back to the original text if it’s necessary to do so.
  • CDNC and all the other Veridian-based collections with a crowdsourced UTC feature require users to register with a legitimate email address before making corrections. Users can of course use a throw-away Gmail or Yahoo address, but the enforced registration process probably does reduce occurrences of vandalism. Having said that, Trove allows anonymous corrections (and Veridian can be configured that way too), and vandalism there is still extremely rare.
  • If necessary it’s possible to disable UTC for specific newspaper titles, or for specific time periods, while still using it for the remainder of a large collection. This is an option for those collections containing specific content that may be more likely to attract vandalism.
  • If we detected that one or more users had been vandalizing the text in a collection, we could delete the corrections made by just those users, without deleting corrections by other users.
  • We’ve been asked many times about creating some sort of review/moderation process so corrections can be “checked” by an administrator prior to going live. While it’s technically possible, we haven’t done it because we don’t think it’s practical: it would take a huge amount of time for a moderator to check all the corrections that are made on a big site like the CDNC. A better solution, we think, is to crowdsource the moderation process by allowing a set of users to check the work performed by other users. That’s an interesting idea, and we welcome discussion with any project wishing to pursue it. To date, though, vandalism simply hasn’t been a significant problem for any of the projects we work with, so this type of “crowdsourced moderation module” has never become a priority.
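
To make the rollback and per-user deletion points above a little more concrete, here is a minimal sketch in Python. It is not Veridian’s actual implementation or data model. The names CorrectionStore, correct, current_text, roll_back and delete_corrections_by are hypothetical, chosen only for illustration, and the sketch assumes corrections are kept as a separate history layered over OCR text that is never modified.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Correction:
    line_id: str  # hypothetical identifier for a line of OCR text
    user: str     # registered user who made the correction
    text: str     # the corrected text


@dataclass
class CorrectionStore:
    # Original OCR output keyed by line id; never modified after loading.
    original: dict[str, str]
    # Append-only history of user corrections, oldest first.
    history: list[Correction] = field(default_factory=list)

    def correct(self, line_id: str, user: str, text: str) -> None:
        """Record a correction without touching the original OCR text."""
        if line_id not in self.original:
            raise KeyError(line_id)
        self.history.append(Correction(line_id, user, text))

    def current_text(self, line_id: str) -> str:
        """Return the most recent correction, or the OCR text if none exists."""
        for c in reversed(self.history):
            if c.line_id == line_id:
                return c.text
        return self.original[line_id]

    def roll_back(self, line_id: str) -> None:
        """Discard every correction for one line, restoring the OCR text."""
        self.history = [c for c in self.history if c.line_id != line_id]

    def delete_corrections_by(self, users) -> None:
        """Remove corrections made by the given users and keep all others."""
        banned = set(users)
        self.history = [c for c in self.history if c.user not in banned]


store = CorrectionStore({"p1.l1": "Tlie quick brovvn fox"})
store.correct("p1.l1", "alice", "The quick brown fox")
store.correct("p1.l1", "mallory", "VANDALIZED")

# Deleting one user's corrections leaves the earlier good correction in place.
store.delete_corrections_by(["mallory"])
print(store.current_text("p1.l1"))  # The quick brown fox
```

Because the OCR text is never overwritten in this sketch, both recovery paths are simple filters over the history: rolling back a line drops all of its corrections, and removing a vandal’s work drops only that user’s entries.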