Friday 5 October 2007

reCAPTCHA

A CAPTCHA image:


I read about this amazing bit of technology today. It's to do with CAPTCHA, the challenge-response test used in computing to determine whether the user is human. Typically the text in the CAPTCHA image has to be typed into a form when signing up for a website (to prevent spammers, etc., using computers to sign up millions of accounts automatically).

A CAPTCHA image used when signing up for Google:


Essentially there's a problem in the academic world with converting existing physical texts (books, papers, scrolls, etc.) into digital media. The technology used to read scanned text is called Optical Character Recognition (OCR), but even the most advanced OCR software can only accurately read about 90% of the words, so around one word in ten is misread. As there are around 100 million books waiting to be digitized, and the only way to accurately decipher the misread words is with human intervention, this is a big problem.

OCR misreading words:


The ingenious solution that researchers at Carnegie Mellon University have come up with is called reCAPTCHA. It puts the words that OCR can't decipher into CAPTCHA images for website sign-ups. To double-check each answer, the same image is put out to two different people, and if both responses match - bingo.
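
The matching step is simple enough to sketch in a few lines of Python. This is only an illustration of the idea as described here, not the real reCAPTCHA code: the function names are made up, and the rule of ignoring capital letters and stray spaces when comparing answers is an assumption.

```python
from typing import Optional


def normalize(response: str) -> str:
    """Lower-case the answer and collapse extra whitespace (assumed rule)."""
    return " ".join(response.strip().lower().split())


def agreed_reading(first: str, second: str) -> Optional[str]:
    """Return the word if two independent answers match, otherwise None."""
    a, b = normalize(first), normalize(second)
    # Two empty answers "match" too, so reject blanks explicitly.
    if a and a == b:
        return a
    return None


# Hypothetical answers from two different people looking at the same image.
print(agreed_reading("Morning ", "morning"))   # both agree, word is accepted
print(agreed_reading("morning", "mourning"))   # disagreement, still unknown
```

When the two answers disagree the function returns None, which in this sketch just means the word would have to be shown to more people.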

A reCAPTCHA image:


With this solution the people digitizing these texts are able to decipher around one million words per day, saving around three thousand man-hours per day. And that's awesome.
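
As a quick sanity check, those two figures can be divided to see how much human time each word is assumed to take. The numbers are just the ones quoted above, taken at face value, and the variable names are only for illustration.

```python
# Figures as quoted in the post (approximate).
words_per_day = 1_000_000
hours_saved_per_day = 3_000

# Convert the hours to seconds and spread them over the words.
seconds_per_word = hours_saved_per_day * 3600 / words_per_day

print(f"{seconds_per_word:.1f} seconds per word")  # prints "10.8 seconds per word"
```

So the two numbers are consistent with each word taking somebody roughly ten seconds to type in.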

Original article:
http://news.bbc.co.uk/2/hi/technology/7023627.stm
CAPTCHA on Wikipedia:
http://en.wikipedia.org/wiki/Captcha