A method for soliciting help from the general public to assist large digitization projects that convert thousands of old books into computer text. CAPTCHAs are the distorted words found on Web sites that users must type back in to show that they are humans and not computers. Every day, tens of millions of CAPTCHAs are entered, creating a huge pool of human effort to draw on.
In a reCAPTCHA system, the images of words that the optical character recognition (OCR) scanner cannot decipher are dispersed to several people in the form of a CAPTCHA to get a consensus. For more information or to get reCAPTCHA code, visit
reCAPTCHA is a system originally developed at Carnegie Mellon University's main Pittsburgh campus. It uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas.[1] On September 16, 2009, Google acquired reCAPTCHA.[2] reCAPTCHA is currently digitizing the archives of The New York Times and books from Google Books.[3] As of 2009, twenty years of The New York Times had been digitized and the project planned to have completed the remaining years by the end of 2010.[4]
reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.
The system is reported to display over 100 million CAPTCHAs every day,[5] and among its subscribers are such popular sites as Facebook, TicketMaster, Twitter, 4chan, CNN.com, and StumbleUpon.[6] Craigslist began using reCAPTCHA in June 2008.[7] The U.S. National Telecommunications and Information Administration also used reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.[8]
The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles".[9]
Scanned text is analyzed by two different optical character recognition programs. Their outputs are aligned with each other by standard string-matching algorithms and compared both to each other and to an English dictionary. Any word that the two OCR programs decipher differently, or that is not in the English dictionary, is marked as "suspicious" and converted into a CAPTCHA. The suspicious word is displayed along with a control word whose reading is already known. The system assumes that if the human types the control word correctly, the answer given for the suspicious word is also correct. This assumption can fail: if a user correctly types the control word "gone" but mistypes the word that OCR failed to recognize, the digital version of the document can end up containing the incorrect word. For example, because people confused the word Internet with the French name Infernet, references to Captain Infernet have occasionally become Captain Internet.[10]
The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered resolved. Words that human judges consistently identify the same way are recycled as control words.[11]
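The flagging and scoring steps described above can be illustrated with a short Python sketch. It is not the reCAPTCHA implementation: the function names, the use of `difflib` for alignment, and the whitespace-based word splitting are assumptions made for illustration. Only the weights (0.5 per OCR program, 1 per human) and the 2.5-point threshold come from the description above.

```python
from collections import defaultdict
from difflib import SequenceMatcher

OCR_POINTS = 0.5    # weight of each OCR program's reading (from the text)
HUMAN_POINTS = 1.0  # weight of each human reading (from the text)
ACCEPT_AT = 2.5     # score at which a reading is accepted (from the text)


def suspicious_words(ocr_a, ocr_b, dictionary):
    """Align two OCR outputs word by word and collect doubtful readings.

    Hypothetical helper: returns (reading_a, reading_b) pairs for every
    stretch where the programs disagree, plus (word, word) for agreed
    words that are missing from the dictionary.
    """
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            # Both programs agree; still doubtful if not a dictionary word.
            for word in words_a[a0:a1]:
                if word.lower() not in dictionary:
                    flagged.append((word, word))
        else:
            # The programs disagree on this stretch of text.
            flagged.append((" ".join(words_a[a0:a1]),
                            " ".join(words_b[b0:b1])))
    return flagged


def resolve(ocr_readings, human_answers):
    """Return the first reading to reach ACCEPT_AT points, or None.

    human_answers is a list of (control_ok, typed_word) pairs, where
    control_ok says whether that person typed the control word correctly.
    Answers from people who failed the control word are ignored.
    """
    scores = defaultdict(float)
    for reading in ocr_readings:
        scores[reading] += OCR_POINTS
    for control_ok, typed_word in human_answers:
        if not control_ok:
            continue
        scores[typed_word] += HUMAN_POINTS
        if scores[typed_word] >= ACCEPT_AT:
            return typed_word
    return None
```

Under these weights, a reading proposed by one OCR program needs two agreeing human answers (0.5 + 1 + 1 = 2.5), while a reading that neither program produced needs three. The sketch also shows why the Infernet error can occur: the vote counts agreement, not correctness.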
reCAPTCHA tests are taken from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment),[12] but the reCAPTCHA software itself is not open source.
reCAPTCHA offers plugins for several web-application platforms and languages, such as ASP.NET, Ruby, and PHP, to ease the implementation of the service.
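The server-side callback described above can be sketched as follows in Python. This is an illustrative outline only: the form field names (`challenge_id`, `typed_text`) and the `ask_service` callable are hypothetical stand-ins, and the sketch does not reproduce the actual reCAPTCHA API or any of its libraries.

```python
def check_submission(form, remote_ip, ask_service):
    """Decide whether a submitted form passed the CAPTCHA.

    form        -- dict of submitted fields (field names are hypothetical)
    remote_ip   -- address of the visitor who submitted the form
    ask_service -- hypothetical callable standing in for the callback to
                   the CAPTCHA service; returns True when the answer is
                   accepted
    """
    challenge_id = form.get("challenge_id", "")
    typed_text = form.get("typed_text", "").strip()
    if not challenge_id or not typed_text:
        # Nothing to verify, so no callback is made.
        return False
    return ask_service(challenge_id, typed_text, remote_ip) is True


def fake_service(challenge_id, typed_text, remote_ip):
    """Stand-in service for local experiments: accepts one fixed answer."""
    return challenge_id == "abc123" and typed_text == "gone upon"
```

Passing the callback in as a parameter keeps the website's own form handling separate from the transport details, which in practice are handled by the plugins and libraries mentioned above.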
The purpose of a CAPTCHA system is to prevent automated access to a system by computer programs, or "bots". On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed an automated solve rate of 18%.[13][14][15]
On August 1, 2010, Chad Houck gave a presentation at the DEF CON 18 hacking conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time.[16][17] The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck then adapted his method to what he described as an "easier" CAPTCHA and determined a valid response 31.8% of the time. Houck also mentioned security defenses in the system, such as a high-security lockout if a valid response is not given 32 times in a row.[18]
On May 26, 2012, Adam, C-P, and Jeffball of DC949 gave a presentation at the LayerOne hacker conference detailing how they were able to achieve an automated solution with an accuracy rate of 99.1%.[19] Their tactic was to use machine learning, a form of artificial intelligence, to analyze the audio version of reCAPTCHA, which is available for the visually impaired. Google released a new version of reCAPTCHA just hours before their talk, making major changes to both the audio and visual versions of the service. In this release, the audio version was lengthened from 8 seconds to 30 seconds and made much more difficult to understand, for both humans and bots.
reCAPTCHA frequently modifies its system, so the author of any such program must update the decoding method just as often, which may frustrate potential abusers.
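A lockout after 32 consecutive invalid responses, as Houck described, could work roughly like the following Python sketch. Only the figure of 32 comes from the text. The class name, the per-client bookkeeping, and the choice to keep a client locked once the limit is reached are assumptions, since the actual mechanism is not documented here.

```python
class FailureLockout:
    """Track consecutive failed CAPTCHA attempts per client (hypothetical)."""

    def __init__(self, limit=32):
        self.limit = limit   # consecutive failures that trigger a lockout
        self.streaks = {}    # client id -> current run of failures

    def is_locked(self, client_id):
        return self.streaks.get(client_id, 0) >= self.limit

    def record(self, client_id, solved):
        """Record one attempt and return True if the client is locked out."""
        if self.is_locked(client_id):
            # Assumption: once locked, further attempts change nothing.
            return True
        if solved:
            # A valid response ends the run of failures.
            self.streaks[client_id] = 0
        else:
            self.streaks[client_id] = self.streaks.get(client_id, 0) + 1
        return self.is_locked(client_id)
```

A defense of this kind matters for the solve rates quoted above: a program that succeeds 10% of the time fails 32 times in a row only about 3% of the time (0.9 raised to the 32nd power), so a lockout alone would rarely stop it.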
reCAPTCHA has also created project Mailhide, which protects email addresses on web pages from being harvested by spammers.[20] By default, the email address is converted into a format that does not allow a crawler to see the full email address. For example, "mailme@example.com" would be converted to "mai...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address. One can also edit the popup code so that none of the address is visible.
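The masking step can be sketched in a few lines of Python. The function name and its `visible` parameter are hypothetical, and the sketch covers only the display format shown in the example above, not the CAPTCHA-protected reveal that Mailhide performs.

```python
def mask_email(address, visible=3):
    """Return the address with most of the local part replaced by '...'.

    Hypothetical helper: keeps at most `visible` leading characters of the
    part before the '@', and always hides at least one character.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % address)
    shown = min(visible, len(local) - 1)
    return local[:shown] + "..." + "@" + domain
```

With the default setting, `mask_email("mailme@example.com")` returns `"mai...@example.com"`, matching the example above. Passing `visible=0` hides the whole local part.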
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)