Search Engine Optimization (SEO) for digitized newspapers

Author: Stefan Boddie, Managing Director, DL Consulting Ltd.
Date: 2014-06-11

April 15th, 2012 marked the 100th anniversary of the sinking of the Titanic. Fittingly, only a few weeks before the anniversary, James Cameron, the director of the blockbuster movie Titanic, piloted a submersible to the bottom of the Mariana Trench. Cameron described the bottom of the ocean as, “Very lunar, a very desolate place, very isolated.”

We went searching Google for information about the Titanic sinking. What we found were the general entries on Wikipedia, a lot of information about the 1997 movie, and a slew of news releases about the 100th anniversary. Of course, if you search for Titanic in a digital newspaper collection like Chronicling America, the California Digital Newspaper Collection, or Papers Past, you find many high-quality newspaper articles written at the time. Instead of the clinical account of the tragedy’s timeline that you find on Wikipedia, these sources reveal harrowing personal accounts of what happened, which provide an immensely better understanding of just how tragic an event this was (I draw this subjective conclusion simply from the fact that while reading the examples below aloud to my wife, we both found ourselves teary-eyed; and neither the film nor the Wikipedia articles elicited a similar response… out of me at least).

For example, you learn of Major Archibald Butt, whose last human exchange was helping a young lady board a lifeboat, smiling despite his imminent death, hoping she’d remember him kindly. Or of two toddlers who survived the sinking and were reunited with their mother, though their father did not survive.

Of course you already know about the richness of information held and shared by libraries. Librarians have been pleading their case as the original search engines since the advent of search engines. Unfortunately, too many people who might benefit from these collections simply aren’t aware of them; and Google certainly isn’t helping them out. It’s as if your digital collection were at the bottom of the sea next to the ship’s wreckage.

So how do we raise our sunken treasure from the cold dark depths of the web?

In some ways, the solution is simply about marketing our collections better. This of course is a more daunting task than it seems!

There are many different aspects to marketing digital collections, but there’s a certain technological component, certain “Google-limiting” factors inherent to digital collections, that we’ll discuss here.

Whether we like it or not, for many internet users their favorite search engine (Google, Bing, Yahoo, or the like) is the only way they know to find information. So unless your digitized content is properly indexed and visible via those sources, it might as well not exist, at least so far as a large portion of internet users is concerned. Search Engine Optimization (SEO) has a slightly bad name in some circles, due to unscrupulous SEO “experts” attempting to raise the search rankings of their pages through dubious means. The concept of SEO is important though, and can have an enormous impact on the number of users visiting your digitized content. In our experience with Veridian we’ve routinely seen visitor numbers increase 20 times over once appropriate SEO is configured and the major search engines have indexed a digital collection!

How Google and other search engines index and rank web content is a closely guarded and much-guessed-at secret that many would-be wizards try to master to increase traffic to websites. In fact, an entire marketing/technology SEO crossover industry thrives today. The grand guiding principle behind most SEO efforts these days involves creating exceptional content that indexes well and attracts quality backlinks.

Developing “great content” may be the most repeated suggestion in the SEO world. Yet, despite its clichéd status, appealing, useful content is critical to search engine optimization. Every search performed at the engines comes with an intent – to find, learn, solve, buy, fix, treat, or understand. Search engines place web pages in their results in order to satisfy that intent in the best possible way, and crafting the most fulfilling, thorough content that addresses a searcher’s needs provides an excellent chance to earn top rankings.

~seomoz.org

But if this were all that went into rankings, why do digital collections not fare better in search results?

First off, let us concede that there is a huge difference between optimizing a typical website to attract visitors and optimizing a digital collection. The sheer scale of most digital collections rules out the usual approach. SEO campaigns typically target a few specific keywords and visitor personas, but digital collections contain information about an immense range of subjects… how would you even begin to keyword optimize a large collection of digitized newspapers?

That said, there are several things that can be done to help Google better index and deliver the content of your collections in its search results.  Here are the three steps every digital collection should implement as a minimum treatment for indexing well in Google.

1. Create a Sitemap

The most important thing you should be doing is creating an XML sitemap. This tells Google’s crawler robots which pages of the collection you want them to index, and acts as a guide as they crawl your site. This is important because of how information gets stored and linked in software systems like Veridian, used for presenting digitized content on the internet. It’s quite possible that without a good sitemap, your collection will send the crawlers on a never-ending, circuitous loop of links, indexing and reindexing the same content through different pathways. This not only earns “ill will” from the Google algorithms, but can also place considerable load on your servers.

For more information about sitemaps and how to create one, visit Sitemaps.org. Or if you use the Veridian software to host your digitized newspapers, you don’t need to worry: Veridian creates suitable sitemaps automatically.
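
To give a feel for what a sitemap contains, here is a minimal sketch in Python that writes one in the XML format described at Sitemaps.org. It is only an illustration, not how Veridian generates its sitemaps: the function name build_sitemap and the example.org URLs are made up for this example, and a real collection would normally split its pages across several sitemap files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for a list of (url, last_modified) pairs.

    last_modified is a date string such as "2012-04-15", or None to omit it.
    (Hypothetical helper written for this article.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical article pages; list only the canonical URL of each page.
pages = [
    ("https://example.org/newspapers/article-1", "2012-04-15"),
    ("https://example.org/newspapers/article-2", None),
]
print(build_sitemap(pages))
```

Listing one canonical URL per article is what keeps crawlers away from the looping pathways described above.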

2. Include a robots.txt file

A robots.txt file is typically used to indicate which parts of a digital collection the crawlers should not index. This can help keep them from indexing administration pages, search results pages, duplicate content, or any other parts of a digital collection that you don’t want turning up in Google search results. We use a robots.txt file, along with a sitemap, to ensure web crawlers index everything they should, and nothing they shouldn’t.

In Veridian, we also include code so the OCR text is what’s indexed and shown in Google results pages, yet when the results link is clicked and the searcher is directed to a Veridian website, the newspaper image files get displayed instead of the text.

Learn more about how to create a robots.txt file. As with sitemaps, though, if you use Veridian, a suitable robots.txt file is created automatically.
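
As a rough illustration, the Python sketch below holds a small robots.txt file and checks it with the robots.txt parser from Python’s standard library. The paths (/admin/, /search) and the example.org addresses are assumptions chosen for this example; they are not the rules Veridian generates, and your own collection’s paths will differ.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: keep crawlers out of admin and search-result
# pages, and point them at the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Sitemap: https://example.org/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Check a few hypothetical URLs the way a well-behaved crawler would.
for url in (
    "https://example.org/admin/login",
    "https://example.org/search?q=titanic",
    "https://example.org/newspapers/article-1",
):
    verdict = "allowed" if rules.can_fetch("Googlebot", url) else "blocked"
    print(verdict, url)
```

Running a check like this before publishing a robots.txt file is a cheap way to confirm you have not accidentally blocked the article pages themselves.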

3. Mind your page titles

Most SEOs agree that the search algorithms emphasize page title and heading content. The information inserted into these HTML tags in your collection can affect how your site is ranked. If your software allows some flexibility in its configuration, put the most descriptive information in the title and heading tags. An example would be:

Article Title | Publication | Date | Collection Name

or

Heroes of Titanic Faced Death Smiling and Aiding Others | San Francisco Call | 20 April 1912 | California Digital Newspaper Collection

Keep in mind, Google will only display the first 70 characters of a page title, so put the information most relevant to the likely search terms first. Doing so not only improves search ranking, but also increases the likelihood of a searcher clicking your link.
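
The title pattern above is easy to sketch in Python. The function names here are hypothetical, and the 70-character figure is simply the display limit mentioned above, used to preview what a searcher is likely to see.

```python
MAX_VISIBLE_CHARS = 70  # display limit quoted above


def build_page_title(article, publication, date, collection):
    """Join the parts in order of relevance: article title first, collection last."""
    return " | ".join([article, publication, date, collection])


def visible_part(title, limit=MAX_VISIBLE_CHARS):
    """Return the portion of the title that fits within the display limit."""
    return title[:limit]


title = build_page_title(
    "Heroes of Titanic Faced Death Smiling and Aiding Others",
    "San Francisco Call",
    "20 April 1912",
    "California Digital Newspaper Collection",
)
print(title)
print(visible_part(title))
```

Because the article headline comes first, it survives truncation intact; the collection name, which matters least to a searcher, is what gets cut off.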

There are many more ways to make your digital collection friendlier to search engines, in addition to the basics noted above. We’ve expended a lot of effort to ensure collections hosted with Veridian are as well optimized as possible for visibility in Google and other search engines.