Widely used CAPTCHAs, those mashed-up letter sequences you must type to “prove you’re not a robot,” are not just annoying; they’re also useless.
The company behind the software built it to think and learn like humans do. It could eventually have applications in robotics and image and video search.
To some, a web site like Craigslist asking you to verify that you are indeed a human by retyping distorted, nonsensical words is irritating. But the next time you do it, you could be helping to fill in some historical blanks.
NPR ran a story yesterday on Luis von Ahn, assistant professor of computer science at Carnegie Mellon University and one of the guys who helped develop the CAPTCHA technology. The short version: Efforts to digitize (really) old books and newspapers were being hampered by faded ink that confounded OCR software. Von Ahn’s solution was to take the words that the software couldn’t recognize, insert them into so-called reCAPTCHAs, and use the power of human brains to decipher them. A reCAPTCHA serves up two words: one is the security word, and the other goes toward the book digitization effort. It sounded interesting, so I called von Ahn to find out more.
Here’s how it works. The New York Times is working to digitize all of its issues starting way back in 1851. The paper starts by scanning every single page as an image. That’s where reCAPTCHA comes in. It runs two optical character recognition (OCR) programs to turn all of those images of pages into text. Different OCR programs tend to make different mistakes. When the two programs disagree on a word, that word is plucked out and distributed among CAPTCHA security programs spread out across 45,000 web sites like Craigslist and Ticketmaster.
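For the programmers in the audience, the idea can be sketched in a few lines of Python. This is only an illustration of the process described above, not reCAPTCHA's actual code: the function names, the `Challenge` class, and the assumption that both OCR programs produce the same number of words in the same order are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def disputed_words(ocr_a: List[str], ocr_b: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, reading_a, reading_b) wherever the two OCR outputs differ.

    Assumption: both programs emit the same number of words in the same
    order, so position i in one list lines up with position i in the other.
    """
    if len(ocr_a) != len(ocr_b):
        raise ValueError("this sketch assumes word-aligned OCR outputs")
    return [(i, a, b) for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


@dataclass
class Challenge:
    """Hypothetical two-word challenge: one known word, one disputed word."""
    control_answer: str    # the security word, whose text is already known
    unknown_position: int  # which scanned word the second image came from


def grade(challenge: Challenge, typed_control: str, typed_unknown: str) -> Optional[str]:
    """Return the human's reading of the disputed word, or None on failure.

    The reading is only trusted if the security word was typed correctly
    (compared case-insensitively here, which is an assumption).
    """
    if typed_control.strip().lower() != challenge.control_answer.lower():
        return None
    return typed_unknown.strip()


# Two made-up OCR readings of the same scanned line, each with its own mistake.
program_a = "tbe morning edition arrived".split()
program_b = "the morning editiou arrived".split()

for position, a, b in disputed_words(program_a, program_b):
    print(f"word {position}: {a!r} vs {b!r} -> send to humans")

challenge = Challenge(control_answer="overlooks", unknown_position=2)
print(grade(challenge, "Overlooks", "edition"))
```

Real OCR output would rarely line up word for word, so a production system would need to align the two texts before comparing them, and it would presumably collect several human answers for each disputed word before accepting one.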