By now you’ve noticed that all the old articles on this website are displayed as embedded graphics. In order to fit on screen, many are unfortunately resized too small to easily read in context. But I want you to read them, so many articles also have a link to a large PDF which is easy to read. Some of these articles are interesting but so long that you might want to save them to Instapaper and read later on your iPad, Kindle, nook, or other portable device. I’d like that, too. Some of those devices offer kludgy PDF support, but for now a better reading experience is just not an option.
In order for that to be an option, I need plain text versions of the articles. I tried running them through a program that converts images to text (OCR) but it didn’t do a very good job because so many letters are faded or otherwise hard for a computer to recognize. I’d consider typing them out myself but that would take too long. And it would be redundant, because that work is already being done. In fact, there’s a good chance you’ve done part of that work.
Thousands of blogs, including this one, use a service called reCaptcha to help stop automated “spambots” from leaving comments. Before you’re allowed to post a comment, you’re asked to type in a pair of words distorted so that computer programs can’t read them. One of those words is a test to prove that you’re a human. The other word is actually from an old book, newspaper, or magazine that OCR was unable to automatically convert to text. When you enter the word, you help convert the document from an image to plain text. In essence, millions of people are doing the work that the computer can’t do, one word at a time.
In September 2009, reCaptcha was bought by Google, and its current focus is on converting the New York Times archives to plain text. It’s projected that the entire archive will be converted some time in 2010. I hope that the plain text generated by this project will be made publicly available, at least for the public-domain articles, so that I can integrate it into this website.
You can try out reCaptcha for yourself at right without needing to leave a comment. To learn more about reCaptcha and how they are helping to digitize old books and newspapers, visit their website.