Early CAPTCHAs such as these, generated by the EZ-Gimpy program, were used on Yahoo. However, technology was developed to read
this type of CAPTCHA.[1]
A modern CAPTCHA. Rather than attempting to create a distorted background and high levels of warping on the text, this CAPTCHA
focuses on making segmentation difficult by adding an angled line.
Another way of making segmentation difficult. Crowded symbols can be easily read by humans but cannot be segmented by
bots.
A CAPTCHA (IPA: /ˈkæptʃə/) is a type
of challenge-response test used in computing to determine whether the user is human. "CAPTCHA" is a
contrived acronym for "Completely Automated Public
Turing test to tell Computers and Humans Apart", trademarked by
Carnegie Mellon University. A CAPTCHA involves one computer (a
server) which asks a user to complete a test. While the computer is able to generate
and grade the test, it is not able to solve the test on its own. Because computers are unable to solve the CAPTCHA, any user
entering a correct solution is presumed to be human. The term CAPTCHA was coined in 2000 by Luis
von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University), and
John Langford (then of IBM). A common
type of CAPTCHA requires that the user type the letters of a distorted image, sometimes with the addition of an obscured sequence
of letters or digits that appears on the screen.
A CAPTCHA is sometimes described as a reverse Turing test, because it is
administered by a machine and targeted to a human, in contrast to the standard Turing test
that is typically administered by a human and targeted to a machine.
Characteristics
A CAPTCHA system is a means of generating new challenges which:
- Current computers are unable to solve accurately.
- Most humans can solve.[2]
- Do not rely on the attacker never having seen the given type of CAPTCHA before. For example, although a checkbox "check
here if you are not a bot" might serve to distinguish between humans and computers, it is not a CAPTCHA because it relies on the
fact that an attacker has not spent effort to break that specific form.
- Can be generated automatically and require artificial intelligence techniques to solve.
In practice, the algorithm used to create the CAPTCHA does not need to be made public, though it may be covered by a patent.
Although publication can help demonstrate that breaking it requires the solution to a difficult problem in the field of
artificial intelligence, deliberate withholding of the algorithm can increase
the integrity of a limited set of systems (see security through obscurity).
The most important factor in deciding whether an algorithm should be made open or restricted is the size of the system. Although
an algorithm which survives scrutiny by security experts may be assumed to be more conceptually secure than an unevaluated
algorithm, an unevaluated algorithm specific to a very limited set of systems is usually of less interest to those engaging in
automated abuse. Breaking a CAPTCHA generally requires some effort specific to that particular CAPTCHA implementation, and an
abuser may decide that the benefit granted by automated bypass is negated by the effort required to engage in abuse of that
system in the first place.
Origin
The potential difficulty of differentiating humans from computers pretending to be humans was addressed at least as early as
1950, when Alan Turing described his now-famous Turing
test. (His test was not automated.) The first discussion of automated tests which distinguish humans from computers for
the purpose of controlling access to web services appears in a 1996 manuscript of Moni Naor
from the Weizmann Institute of Science, entitled "Verification of a human
in the loop, or Identification via the Turing Test".
A simple CAPTCHA had been developed in 1995 by Anton Lam of The Chinese
University of Hong Kong, in a voting application written for Radio Television
Hong Kong. The public were able to vote for their favorite singers and songs online for the first time in the annual "Top
Ten Chinese Songs Award". To prevent robotic submissions, users were required to input a 6-digit number, which was displayed in
an image, correctly.
Other primitive CAPTCHAs seem to have been developed later, in 1997, at AltaVista by
Andrei Broder and his colleagues to prevent bots
from adding URLs to their search engine.
In order to make the images resistant to OCR (Optical Character
Recognition), the team simulated situations that scanner manuals claimed resulted in bad OCR. In 2000, von Ahn and Blum developed
and publicized the notion of a CAPTCHA, which included any program that can distinguish humans from computers. They invented
multiple examples of CAPTCHAs, including the first CAPTCHAs to be widely used (at Yahoo!).
Applications
CAPTCHAs are used to prevent automated software from performing actions which degrade the quality of service of a given
system, whether due to abuse or resource expenditure. Although CAPTCHAs are most often deployed as a response to encroachment by
commercial interests, the notion that they exist to stop only spammers is mistaken.
CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the
webmail services of Gmail, Hotmail, and Yahoo!. CAPTCHAs have also found active use in
stopping automated posting to blogs or forums, whether as a
result of commercial promotion, or harassment and
vandalism. CAPTCHAs also serve an important function in rate limiting, as automated usage of a service might be desirable
until such usage is done in excess, and to the detriment of human users. In such a case, a CAPTCHA can enforce automated usage
policies as set by the administrator when certain usage metrics exceed a given threshold. An example of a system with
vulnerabilities that could easily have been prevented using a CAPTCHA is presented in [3].
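The threshold-based policy described above can be sketched roughly as follows. This is a minimal illustration, not any particular site's mechanism: the class name `CaptchaGate`, its parameters, and the choice of a sliding time window are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class CaptchaGate:
    """Ask for a CAPTCHA once a client exceeds max_requests within window_seconds.

    Hypothetical helper: the name, defaults, and in-memory storage are
    illustrative assumptions, not part of any real CAPTCHA product.
    """

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's recent requests
        self._history = defaultdict(deque)

    def needs_captcha(self, client_id, now=None):
        """Record one request and report whether a CAPTCHA should be shown."""
        now = time.monotonic() if now is None else now
        stamps = self._history[client_id]
        # Forget requests that have fallen out of the sliding window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) > self.max_requests
```

Under these assumptions, automated clients that stay below the threshold are never challenged, while a client that exceeds it is challenged until its request rate drops again.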
Accessibility
- See also: Web accessibility
Because CAPTCHAs rely on perception, users unable to perceive a CAPTCHA (for example, due to a disability or because it is
difficult to read) will be unable to perform the task protected by a CAPTCHA. As such, sites implementing CAPTCHAs should provide
an audio version of the CAPTCHA in addition to the visual method. The official CAPTCHA site [4] recommends providing an audio CAPTCHA for accessibility reasons.
Attempts at more accessible CAPTCHAs
Even an audio and visual CAPTCHA will require manual intervention for some users, such as those who are both deaf and blind.
There have been various attempts at creating CAPTCHAs that are more accessible. Attempts include the use of JavaScript[5], mathematical questions ("what is 1+1"), or "common sense"
questions ("what color is the sky"). These attempts violate one or both of the principles of CAPTCHAs: either they cannot be
automatically generated or they can be easily cracked given the state of artificial intelligence. As such, the only security
these CAPTCHAs provide is security through obscurity; an attacker is unlikely
to have encountered the formulation of the CAPTCHA in question, and unlikely to find it worth spending the resources to
break the CAPTCHA of a small site.
Due to the lack of security provided by these text-based CAPTCHAs, most sites choose to use an audio and visual CAPTCHA as a way of
balancing accessibility and security. Often, email support is used to manually provide access to users who are unable to solve a
CAPTCHA.
Circumvention
There are a few approaches to defeating CAPTCHAs: using cheap human labor to
recognize them, exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, and finally
improving character recognition software.
Human solvers
CAPTCHAs are vulnerable to a relay attack that uses humans to solve the puzzles. One
approach involves relaying the puzzles to a sweatshop of human operators who can solve
CAPTCHAs. In this scheme, a computer fills out a form and, when it reaches a CAPTCHA, passes the CAPTCHA to a human operator
to solve. If the humans are dedicated employees who receive minimum wage, this is not likely
to be viable.[6] Another variation of this technique
involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. With enough
traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site.[7]
Insecure implementation
As with any security system, design flaws in an implementation can prevent the theoretical security from being realized.
Many CAPTCHA implementations, especially those which have not been designed and reviewed by experts in the field of security,
are prone to common attacks.
Some CAPTCHA protection systems can be bypassed without using OCR
simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does
not allow multiple solution attempts at one CAPTCHA. This prevents both the reuse of a correct CAPTCHA solution and a second
guess after an incorrect OCR attempt.[8] Other CAPTCHA
implementations use a hash (such as an MD5 hash) of the solution as a key passed to the client to
validate the CAPTCHA. Often the set of possible solutions is small enough that this hash could be cracked.[9] Further, the hash could assist an OCR-based attempt. A more secure scheme would
use an HMAC. Finally, some implementations use only a small fixed pool of CAPTCHA images.
Eventually, when enough CAPTCHA image solutions have been collected by an attacker over a period of time, the CAPTCHA can be
broken by simply looking up solutions in a table, based on a hash of the challenge image.
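As a rough sketch of two of the points above (a keyed HMAC in place of a bare hash, and at most one attempt per challenge), the following uses Python's standard `hmac` module. The function names, the `SERVER_KEY` secret, the in-memory set of used nonces, and the case-insensitive comparison are assumptions made for illustration; they are not taken from any specific CAPTCHA implementation.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-server secret; it is never sent to the client.
SERVER_KEY = secrets.token_bytes(32)
# In-memory record of spent challenges (an assumption for this sketch;
# a real deployment would need shared storage with expiry).
_used_nonces = set()


def issue_token(solution):
    """Return (nonce, tag) to send to the client alongside the challenge image."""
    nonce = secrets.token_hex(16)
    msg = f"{nonce}:{solution.lower()}".encode()
    tag = hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()
    return nonce, tag


def check_answer(nonce, tag, answer):
    """Grade one attempt. Each nonce is accepted at most once, right or wrong."""
    if nonce in _used_nonces:
        return False
    _used_nonces.add(nonce)
    msg = f"{nonce}:{answer.lower()}".encode()
    expected = hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()
    # Constant-time comparison of the recomputed tag with the submitted one.
    return hmac.compare_digest(expected, tag)
```

Because the tag depends on a key the client never sees, an attacker cannot test candidate solutions offline against it, and marking the nonce as used on the first attempt rules out both replaying a correct answer and guessing again after a miss.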
Computer character recognition
A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain
the following functionality:
1. Extraction of the image from the web page.
2. Removal of background clutter, for example with color filters and detection of thin lines.
3. Segmentation, i.e. splitting the image into segments containing a single letter.
4. Identifying the letter for each segment.
Steps 1, 2, and 4 are easy tasks for computers.[10] The
only part where humans still outperform computers is segmentation. If the background clutter consists of shapes similar to letter
shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence,
an effective CAPTCHA should focus on the segmentation.
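To illustrate why connecting the letters matters, here is a deliberately naive segmentation sketch. It assumes the image has already been reduced to rows of 0/1 pixels (1 = ink) and splits it wherever a column contains no ink. The function name and the tiny test images are hypothetical, and real attacks use far more capable methods than this.

```python
def segment_columns(image):
    """Split a binary image (rows of 0/1, 1 = ink) into column ranges.

    Returns a list of (start, end) pairs, end exclusive. A segment is a
    maximal run of adjacent columns that each contain at least one ink pixel.
    """
    width = len(image[0]) if image else 0
    has_ink = [any(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a new segment begins here
        elif not ink and start is not None:
            segments.append((start, x))  # a blank column ends the segment
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two glyphs separated by a blank column.
clean = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
# The same glyphs with a line drawn through the middle row.
struck = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

print(segment_columns(clean))   # [(0, 2), (3, 5)]
print(segment_columns(struck))  # [(0, 5)]
```

A single line through the text is enough to merge both glyphs into one segment for this simple splitter, which is the effect the angled-line and crowded-symbol designs aim for.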
Several research projects have broken real-world CAPTCHAs, including one of Yahoo's early CAPTCHAs called "EZ-Gimpy"[11] and the CAPTCHAs used by popular sites such as PayPal and
LiveJournal as well as open source software such as phpBB.[12][13]
Image-recognition CAPTCHAs
Some researchers promote image recognition CAPTCHAs as a possible alternative to text-based CAPTCHAs. To date, no major
website has made use of an image-based CAPTCHA. As such, the technology would be best described as in the stage of theoretical
research. Image recognition CAPTCHAs face many potential problems which have not been fully studied:
- It is difficult for a small site to acquire a large dictionary of images which an attacker does not have access to. Without a
means of automatically acquiring new labelled images, an image based challenge does not meet the definition of a CAPTCHA.
- Some current image recognition CAPTCHAs ask the user to make a binary choice (is this a cat or a dog?[14]). Even with 16 images, a bot guessing at random has a 1 in 65536 (= 2^16) chance of
getting every image right. For the scheme to be effective against a botnet attack, the user would be
forced to solve a prohibitively large number of images.
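The arithmetic behind the binary-choice figure can be checked with a short calculation. The sketch below assumes each image is an independent choice with equally likely options and that the bot guesses uniformly at random. The function names and the figure of one million attempts are hypothetical, chosen only to show the scale involved.

```python
from fractions import Fraction


def guess_all_probability(num_images, choices=2):
    """Chance of answering every image correctly by uniform random guessing."""
    return Fraction(1, choices) ** num_images


def expected_passes(attempts, num_images, choices=2):
    """Expected number of challenges passed over independent random attempts."""
    return attempts * guess_all_probability(num_images, choices)


print(guess_all_probability(16))              # 1/65536
print(float(expected_passes(1_000_000, 16)))  # 15.2587890625
```

So under these assumptions a botnet making a million blind attempts would still be expected to pass about fifteen 16-image challenges, and each additional image only halves that number.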
Collateral benefits
Some of the original inventors of the CAPTCHA system have implemented a means by which some of the effort and time spent by
people who are responding to CAPTCHA challenges can be harnessed as a distributed work system. This works by including "solved"
and "unrecognized" elements (images which were not successfully recognized via OCR) in each challenge. The respondent thus answers both elements and roughly half of his
or her effort validates the challenge while the other half is captured as work.
This reCAPTCHA system is being used to aid in the conversion of printed works (scanned
images) into digital text. The approach is similar to one of the techniques by which CAPTCHA systems can be circumvented (in that
the respondents apply human intelligence to accomplish small amounts of work in a highly distributed way).
The reCAPTCHA maintainers estimate that existing CAPTCHA systems represent approximately 150,000 hours of labor per day that
could be transparently tapped into via their revised system. This would be equivalent to nearly 19,000 people working 8 hours per
day on correcting OCR.[15]
References