Why not plain text?

By now you’ve noticed that all the old articles on this website are displayed as embedded graphics. In order to fit on screen, many are unfortunately resized too small to easily read in context. But I want you to read them, so many articles also have a link to a large PDF which is easy to read. Some of these articles are interesting but so long that you might want to save them to Instapaper and read later on your iPad, Kindle, nook, or other portable device. I’d like that, too. Some of those devices offer kludgy PDF support, but for now a better reading experience is just not an option.

In order for that to be an option, I need plain text versions of the articles. I tried running them through a program that converts images to text (OCR) but it didn’t do a very good job because so many letters are faded or otherwise hard for a computer to recognize. I’d consider typing them out myself but that would take too long. And it would be redundant, because that work is already being done. In fact, there’s a good chance you’ve done part of that work.

Thousands of blogs, including this one, use a service called reCaptcha to help stop automated “spambots” from leaving comments. Before you’re allowed to post a comment, you’re asked to type in a pair of words distorted so that computer programs can’t read them. One of those words is a test to prove that you’re a human. The other word is actually from an old book, newspaper, or magazine that OCR was unable to automatically convert to text. When you enter the word, you help convert the document from an image to plain text. In essence, millions of people are doing the work that the computer can’t do, one word at a time.
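For the curious, here is a rough Python sketch of how a two-word check like this could work. It is only an illustration of the idea described above, not reCaptcha's actual code: the class name `TwoWordChallenge`, the `VOTES_NEEDED` threshold, the rule that an empty second answer is rejected, and the idea of accepting a word once several people agree on it are all my own assumptions.

```python
from collections import Counter, defaultdict

# Assumed threshold: how many matching answers we want before trusting a word.
VOTES_NEEDED = 3


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


class TwoWordChallenge:
    """Hypothetical sketch: one word is a known test, the other is unread."""

    def __init__(self, control_answers):
        # Maps a control image id to the text we already know it shows.
        self.control_answers = control_answers
        # Maps an unknown image id to a tally of what people typed for it.
        self.votes = defaultdict(Counter)

    def submit(self, control_id, control_guess, unknown_id, unknown_guess):
        """Return True if the visitor passes; record their unknown-word answer."""
        expected = normalize(self.control_answers[control_id])
        if normalize(control_guess) != expected:
            return False  # failed the test word, so ignore the other answer
        guess = normalize(unknown_guess)
        if not guess:
            return False  # assumption: a blank second word is not accepted
        self.votes[unknown_id][guess] += 1
        return True

    def transcription(self, unknown_id):
        """Return the agreed-upon word, or None if there is no agreement yet."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= VOTES_NEEDED else None
```

In this sketch a single person's answer for the unread word is never trusted on its own. It only becomes part of the converted text once enough people who passed the test word have typed the same thing.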

In September 2009, reCaptcha was bought by Google, and its current focus is on converting the New York Times archives to plain text. It’s projected that the entire archive will be converted some time in 2010. I hope that the plain text generated by this project will be made publicly available, at least for the public-domain articles, so that I can integrate it into this website.

You can try out reCaptcha for yourself at right without needing to leave a comment. To learn more about reCaptcha and how they are helping to digitize old books and newspapers, visit their website.

3 responses to “Why not plain text?”

  1. Hi David,

    I think this is a really great project. Like super great, and I can’t wait for the next batch of articles.

    It will be awesome when Google make the archives available in plain text. In the meantime, I’m putting time aside to read all the PDFs.



  2. Great idea! I love reading old newspaper articles 🙂
    One question though: An internet surfer types in two words to prove he is not a spambot. One of the words is the actual test, the other comes from a backlog of words OCR failed to convert to text. Wouldn’t it then be enough to type the first word to prove that one is not a spambot? Because the other word cannot be known to the computer doing the verification, or else someone at reCaptcha would have had to read and type the word beforehand, thus making the internet surfer’s contribution unnecessary.

    To test this I’ll type in only the first word of the captcha.
    Well, it doesn’t work. Neither just the first word nor the first word and some token letters (so the absence of the second word won’t be noticed) got my comment through. I’ll try typing the second word now.


  3. The only trouble with a project like this is that people know about it. Some people have made a point of replacing the “other”, non-test word with a “dirty” word so that all these old documents are riddled with foul language. It’s amusing, but not very helpful. There’s even an infographic floating around that helps you figure out which word is the test word and which is the one that can be replaced.


