CAPTCHA: What? Why? Build. Break.
Love them or hate them, CAPTCHAs are a part of everyday life on the web. In this blog we will dig a little deeper into the technology behind CAPTCHAs to find out what they are, why they are used, and how they are created, implemented, bypassed, and broken.
##What are these things, and why are they everywhere?
First and foremost, let’s strip away the first layer of obfuscation and figure out exactly what CAPTCHA means.
CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
Their basic function is to stop spammers, bots, and brute-force attacks by means of an automated test designed to block other, unwanted automation. In theory, a properly designed CAPTCHA has a human success rate over seventy percent and a bot success rate below one percent. That bot threshold may seem drastic, but consider that while a 1/100 chance might be terrible odds for a human, it is hardly an issue at all for automated programs with the ability to send thousands of requests a minute.
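To put some numbers on that, here is a quick back-of-the-envelope sketch. The request rates are made-up values for illustration, and the function names are my own; the only figure taken from above is the one percent bot success rate.

```python
def expected_bypasses(requests_per_minute: int, bot_success_rate: float) -> float:
    """Average number of CAPTCHAs a bot gets past per minute."""
    return requests_per_minute * bot_success_rate


def chance_of_at_least_one(attempts: int, bot_success_rate: float) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - bot_success_rate) ** attempts


if __name__ == "__main__":
    rate = 0.01  # the "below one percent" ceiling, taken at its worst case
    for rpm in (100, 1000, 5000):  # hypothetical request rates
        print(
            f"{rpm} requests/min: ~{expected_bypasses(rpm, rate):.0f} bypasses/min, "
            f"P(at least one) = {chance_of_at_least_one(rpm, rate):.4f}"
        )
```

Even at a modest hundred attempts, the odds of at least one success are already better than a coin flip, which is why the bot success rate has to sit so far below the human one.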
The most commonly utilized CAPTCHAs are text variations wherein the image containing the text is highly distorted via scrunching, squeezing, pulling, twisting, color, and transparency like the ones seen below.
(Image from: http://www.extremetech.com)
Those with visual impairments are likely to encounter CAPTCHAs in the form of voice recordings in which each character is spoken separately and oftentimes distorted using heavy background noise.
##Further down the rabbit hole
More advanced forms of CAPTCHA can include word and image matching, slide-show style .GIFs each containing their own sub CAPTCHA, reading comprehension tests, etc.
(Image from: http://random.irb.hr/signup.php)
While all of these implementations can be very complex, the sad fact is they can all be easily bypassed (Look at me getting ahead of myself). For now, let’s hold off on that and delve into standard implementations and methods behind creating a successful CAPTCHA.
##But what if I want to build one of my own?
There are quite a few ways to design a CAPTCHA, all of which require a data set in which each challenge is matched with a verified answer. Some of the more complex CAPTCHA implementations utilize images scanned into a database manually and then distorted. These distorted images are then verified once a critical mass of individuals agree on what the image contains. A specific example of this can be found in one of the many iterations of reCAPTCHA.
(Image from: http://developers.google.com)
In this iteration, reCAPTCHA has the user read and answer the CAPTCHA challenge, which contains two separate images placed side by side. The first image in the CAPTCHA is a known or “verified” image, meaning that the correct answer is already known by the system. The second image in the CAPTCHA does not have an associated “verified” answer as it is currently being “vetted”. What this means to the everyday user is that as long as they enter the answer to the first image correctly they will pass the CAPTCHA regardless of the answer to the second.
The second image in the CAPTCHA must reach a critical mass of identical answers given by a large number of individuals who have passed the test of the first image. The system assumes that if you are able to give the correct answer for the first image you are in fact human; therefore, it takes your answer for the second image at face value. Once the second image has reached the threshold I mentioned earlier, the newly “vetted” secondary image moves to a primary spot and the cycle begins anew.
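The vetting cycle described above can be sketched in a few lines. This is only an illustration of the idea, not how reCAPTCHA is actually implemented: the class name, the `threshold` value, and the image IDs are all hypothetical, and a real system would need far more than a simple vote count.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return answer.strip().lower()


class VettingPool:
    """Toy model of a two-image CAPTCHA: one verified image, one being vetted."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # assumed "critical mass" of matching answers
        self.verified = {}          # image_id -> known correct answer
        self.pending = {}           # image_id -> Counter of submitted answers

    def add_verified(self, image_id: str, answer: str) -> None:
        self.verified[image_id] = normalize(answer)

    def add_pending(self, image_id: str) -> None:
        self.pending[image_id] = Counter()

    def submit(self, verified_id: str, verified_answer: str,
               pending_id: str, pending_answer: str) -> bool:
        """Return True if the user passes; record their vote on the pending image."""
        # Only the verified image decides whether the user passes.
        if normalize(verified_answer) != self.verified[verified_id]:
            return False
        votes = self.pending.get(pending_id)
        if votes is not None:
            guess = normalize(pending_answer)
            votes[guess] += 1
            # Enough passing users agree: promote the image to the verified set.
            if votes[guess] >= self.threshold:
                self.verified[pending_id] = guess
                del self.pending[pending_id]
        return True


if __name__ == "__main__":
    pool = VettingPool(threshold=2)
    pool.add_verified("img-1", "morning")
    pool.add_pending("img-2")
    print(pool.submit("img-1", "morning", "img-2", "upon"))  # passes, one vote
    print(pool.submit("img-1", "mourning", "img-2", "upon"))  # fails, vote ignored
    print(pool.submit("img-1", "Morning", "img-2", "Upon"))  # passes, img-2 promoted
    print(pool.verified)
```

Note that answers from users who fail the verified image are thrown away entirely, which is what keeps bots from voting on the unvetted image.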
While this can get a bit complex in terms of designing a CAPTCHA from scratch, in most cases developers can simply plug an existing CAPTCHA service into their application. You should, however, do some research on the CAPTCHA service you are using as there have been instances in which the provider itself is vulnerable to bypass attacks through impersonation. If you would like to read more on impersonating a CAPTCHA provider take a look here.
##No but really, I want to hack something
All CAPTCHAs, regardless of their complexity, can and will be bypassed if the end goal justifies the means. Now that I have your attention, let’s discuss why that is. There are many ways to bypass a CAPTCHA depending on the implementation itself; however, the two main ways are:
- OCR to automatically read and answer CAPTCHAs
- Human CAPTCHA breaking “farms”
First, let’s cover the basics of OCR and what it does. Optical Character Recognition, or OCR, is a technology that was designed primarily to read characters from scanned documents so that machines can read, manipulate, search, and store the data more efficiently. As with most other technology, OCR has since been adapted for a variety of other purposes. In the case of CAPTCHAs, OCR software has been trained to attempt pattern recognition similar to that of the human brain (Looking for free OCR for CAPTCHAs? Look no further! http://skipinput.com). What this means is both fascinating and slightly frightening; however, I would rather not touch on the philosophical debate of artificial intelligence in this blog so we are moving right along.
Another frequently utilized method for bypassing CAPTCHAs on a large scale is the human “farm”.
Human farms push the work of solving CAPTCHAs to other people. This is done by isolating the CAPTCHA in the application, passing it to a third party via framing or other methods, and then passing the correct answer back to the application once it has been solved. Attackers have used this same relay method by creating image galleries or the like (oftentimes pornographic) in which a user must solve a CAPTCHA to view the image. These “farms” typically charge anywhere from a few cents to a couple of dollars per 1,000 CAPTCHAs solved correctly.
(Image from: http://cheapcaptcha.com/pricing)
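To see just how cheap that is, here is a small arithmetic sketch. The two prices are hypothetical stand-ins for "a few cents" and "a couple of dollars" per 1,000, not quotes from any real provider, and the volume is made up as well.

```python
def solving_cost(captchas: int, price_per_thousand: float) -> float:
    """Total cost in dollars for a farm to solve `captchas` challenges."""
    return captchas / 1000 * price_per_thousand


if __name__ == "__main__":
    volume = 10_000  # hypothetical number of CAPTCHAs an attacker wants solved
    for price in (0.05, 2.00):  # assumed low and high prices per 1,000
        print(f"${price:.2f} per 1,000: ${solving_cost(volume, price):.2f} for {volume:,}")
```

At these assumed prices, ten thousand solved CAPTCHAs cost somewhere between pocket change and the price of lunch.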
##Wrap it up
The conclusion here is that ultimately, if someone wants to bypass your CAPTCHA badly enough they have more than a few cheap and reliable solutions due to the inherent flaw: they are designed to be solved. I don’t mean to say that CAPTCHAs are completely useless, but rather, almost completely useless. The best you can do is try to prevent the masses from getting in. The most efficient way to do that is by using an existing, reputable implementation. A good example is the new “noCAPTCHA reCAPTCHA” which I plan to go over further in a future post.