ReBreakCaptcha: Breaking Google’s ReCaptcha v2 using.. Google

TL;DR: ReBreakCaptcha is a logic vulnerability that lets you easily bypass Google’s ReCaptcha v2 anywhere on the web.

Overview

Back in 2016, I started poking around to see how hard it would be for a threat actor to find a new method that bypasses Google’s ReCaptcha v2. It would be ideal if it worked in any environment, rather than being tailored to fit a specific use case.

I would like to introduce you to ReBreakCaptcha – a brand-new bypass technique for Google’s ReCaptcha v2.

ReBreakCaptcha works in three stages:

  1. Audio Challenge – Getting the correct challenge type.
  2. Recognition – Converting the challenge audio and sending it to Google’s Speech Recognition API.
  3. Verification – Verifying the Speech Recognition result and bypassing the ReCaptcha.

As of the time of posting, it is confirmed that this vulnerability still works.

ReBreakCaptcha Stage 1: Audio Challenge

There are three types of ReCaptcha v2 challenges:

  • Image Challenge – The challenge contains a description and an image which consists of 9 sub-images. The user is requested to select those sub-images that best match the given description.
  • Audio Challenge – The challenge contains an audio recording. The user is requested to enter the digits that are heard.
  • Text Challenge – The challenge contains a category and 5 candidate phrases. The user is requested to select those phrases which best match the given category.

ReBreakCaptcha knows how to solve ReCaptcha v2 audio challenges. Therefore, we need a reliable way to get an audio challenge every time.

When clicking the “I’m not a robot” checkbox of ReCaptcha v2, we are often presented with the following challenge type:

Figure 1: Image Challenge

To get an audio challenge we need to click the following button:

Figure 2: The Audio Challenge Button

Then we are presented with an audio challenge that can be easily bypassed:

Figure 3: Audio Challenge

Some of you may notice that instead of an audio challenge, sometimes you get a text challenge like so:

Figure 4: Text Challenge

To get past it and reach an audio challenge, you simply click the ‘Reload Challenge’ button until you get the correct type. The ‘Reload Challenge’ button:

Figure 5: Get New Challenge Button
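
The reload step is essentially a bounded retry loop. Here is a minimal sketch, not taken from the POC: `get_challenge_type` and `reload_challenge` are hypothetical callables standing in for whatever browser automation inspects the widget and clicks the button, and the limit of 10 reloads is an arbitrary assumption.

```python
def reload_until(get_challenge_type, reload_challenge, wanted="audio", max_reloads=10):
    """Reload the challenge until the wanted type shows up.

    Both callables are hypothetical stand-ins for browser automation.
    Returns the number of reloads used, or None if the limit was reached.
    """
    for reloads in range(max_reloads + 1):
        if get_challenge_type() == wanted:
            return reloads
        if reloads < max_reloads:
            reload_challenge()
    return None


# Simulated widget: a text challenge twice, then an audio challenge.
sequence = ["text", "text", "audio"]
position = {"index": 0}


def fake_type():
    return sequence[position["index"]]


def fake_reload():
    position["index"] += 1


print(reload_until(fake_type, fake_reload))  # prints 2
```

Capping the number of reloads keeps the loop from spinning forever if the widget never offers the wanted type.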

What was our goal? To bypass the ReCaptcha. Can we do this? Yes. How? Google Speech Recognition API!

ReBreakCaptcha Stage 2: Recognition

Now comes the fun part: taking advantage of one Google service to beat another Google service!
Let’s get back to the audio challenge (Figure 3).
As you can see, the controls on this challenge page are:
1. A play button – to hear the challenge.
2. A textbox – for user input.
3. A download button – to download the audio challenge.

Let’s download the audio file and send it to the Google Speech Recognition API. Before doing so, we will convert it to the ‘wav’ format, which the Speech Recognition API expects.
Now we have the audio challenge file and are ready to send it to Google Speech Recognition.
How can this be done? Using their API.
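
The conversion step can be sketched as follows. This is only an illustration under assumptions: it calls the `ffmpeg` command-line tool directly and assumes it is installed and on the PATH, whereas the traceback in the comments suggests the POC goes through pydub. The helper names and the file name `challenge.mp3` are made up for the example.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst):
    """Build the ffmpeg invocation; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]


def convert_to_wav(src):
    """Convert an audio file to WAV next to the original and return the new path.

    ffmpeg picks the output format from the .wav extension.
    Raises subprocess.CalledProcessError if ffmpeg fails (e.g. a truncated download).
    """
    src = Path(src)
    dst = src.with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(src, dst), check=True, capture_output=True)
    return dst


# Hypothetical usage, assuming a downloaded file named challenge.mp3:
# wav_path = convert_to_wav("challenge.mp3")
```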

There is a great Python library named SpeechRecognition for performing speech recognition, with support for several engines and APIs, online and offline.
We will use this library’s implementation of the Google Speech Recognition API.

We will send the ‘wav’ audio file, and the Speech Recognition API will return the result as a string (e.g. ‘25143’).

This result will be the solution to our audio challenge.
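
A rough sketch of this stage with the SpeechRecognition library is shown below. It is not the POC's code: `recognize_digits` is a hypothetical helper, and while the name `string_to_digits` appears in a traceback in the comments, the clean-up logic here is my own assumption about what such a step might do (keep digits, map the words zero to nine, drop everything else).

```python
import re

# Spelled-out digits that a transcript may contain instead of numerals.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def string_to_digits(transcript):
    """Reduce a transcript to its digits (assumed clean-up, not the POC's logic)."""
    digits = []
    for token in re.findall(r"[a-z]+|\d+", transcript.lower()):
        if token.isdigit():
            digits.append(token)
        elif token in NUMBER_WORDS:
            digits.append(NUMBER_WORDS[token])
    return "".join(digits)


def recognize_digits(wav_path):
    """Transcribe a WAV file with Google Speech Recognition and keep the digits."""
    # Imported here so string_to_digits works without the third-party package
    # (pip install SpeechRecognition).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file
    try:
        transcript = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # the service could not understand the audio
    return string_to_digits(transcript)


print(string_to_digits("two 5 one four three"))  # prints 25143
```

Filtering the transcript matters because the service returns free text, so stray letters or words can end up mixed in with the numbers.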

ReBreakCaptcha Stage 3: Verification

This stage is fairly short. All we need to do now is to copy-paste the output string from Stage 2 into the textbox, and click ‘Verify’ on the ReCaptcha widget.

That’s right, we have now semi-automatically used one of Google’s services to bypass another of its own.

ReBreakCaptcha Complete Proof-Of-Concept

I went ahead and wrote a complete POC script in Python.

It utilizes all of the presented stages of the technique for a fully-automated bypass of ReCaptcha v2.

Link to the GitHub repository: https://github.com/eastee/rebreakcaptcha

3/2/2017 – Update:

It has come to my attention that a lot of people encounter a harder version of the audio challenge. Therefore, I have committed a workaround to the GitHub repo that should overcome this situation, though at a lower success rate compared to the original, easier audio challenges.
It is still not fully clear how this harder version is triggered, but the prime suspect is an IP address that Google considers suspicious.
This is usually the case when one uses a public proxy or VPN, as their IPs are flagged in the Google system as suspicious (harder ReCaptchas and more ReCaptcha encounters).

3/3/2017 – Update #2:

It seems that Google has fully patched this by raising the number of digits from 4-5 to 10-12 and introducing new digit recordings that are harder for speech recognition to handle, as well as background noise. The POC has stopped working as a result. It was fun while it lasted 🙂

69 thoughts on “ReBreakCaptcha: Breaking Google’s ReCaptcha v2 using.. Google”

  1. Hi, to start, thanks for your research on this. I’ve tested 3 examples, and none got the correct answer: the first only detected 3 out of 6 numbers, the second had 10 digits, one of them wrong, and the third couldn’t be recognised.

    Also, it seems that Google implements a maximum number of retries for the audio challenge.


    1. Hello David,
      Thank you for reading my post.
      It seems that Google considered your IP suspicious, so you got a much harder version of the audio challenge than normal (the easy ones are 4-5 digits long).
      First, please update your code from the GitHub repo, as I committed a workaround for this problem (though the success rate is lower than for the easier challenges).
      Then, may I ask you to try changing your IP and not using any proxy or VPN? (VPN IPs are known to be more suspicious, which means harder challenges, more captchas than normal, etc.)


  2. The hard part is writing a script that can jump through the many random extra hurdles that are required, and completely automating it without any user intervention and solving it in a reasonable amount of time.

    First thing I encountered in this approach:
    “Multiple correct solutions required – please solve more.”

    Great that you’ve coded this case in, but the audio is garbled enough that after more than 10 tries, it still hasn’t solved it.

    [1] Clicking on audio challenge
    [1] Google Speech Recognition: cl350 6178
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition: 3156 7711
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition: 297 bh2433
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition: nice ring to 129 route 27
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition could not understand audio
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition could not understand audio
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition could not understand audio
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition could not understand audio
    [1] Need to solve more. Let’s do this!
    [1] Google Speech Recognition could not understand audio
    [1] Need to solve more. Let’s do this!

    Eventually the code broke.

    CouldntDecodeError("Decoding failed. ffmpeg returned error code: {0}\n\nOutput from ffmpeg/avlib:\n\n{1}".format(p.returncode, p_err))
    pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
    [mp3 @ 0000000000706e40] Failed to read frame size: Could not seek to 1154.
    c:\users\dave\appdata\local\temp\tmpaelusk: Invalid argument

    So, this is not a working solution.


    1. Hello Dave,
      Thank you for reading my post.
      It seems that Google considered your IP suspicious, so you got a much harder version of the audio challenge than normal (the easy ones are 4-5 digits long).
      First, please update your code from the GitHub repo, as I committed a workaround for this problem (though the success rate is lower than for the easier challenges).
      Then, may I ask you to try changing your IP and not using any proxy or VPN? (VPN IPs are known to be more suspicious, which means harder challenges, more captchas than normal, etc.)
      As for the traceback, it should be filed as an issue in the GitHub repo, not here.
      I can add a catch for that exception, but it looks like a low-memory issue or an unfinished audio download.


      1. Thanks for your reply. Unfortunately, your script still does not work; it stops after multiple tries with these errors:

        best_hypothesis = max(actual_result["alternative"], key=lambda alternative: alternative["confidence"])
        KeyError: 'confidence'

        and

        audio_output = self.string_to_digits(recognizer.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
        File "C:\Python27\lib\site-packages\speech_recognition\__init__.py", line 960, in recognize_houndify
        base64.urlsafe_b64decode(client_key),
        File "C:\Python27\lib\base64.py", line 119, in urlsafe_b64decode
        return b64decode(s.translate(_urlsafe_decode_translation))
        File "C:\Python27\lib\base64.py", line 78, in b64decode
        raise TypeError(msg)
        TypeError: Incorrect padding

        I’m using my home IP, and it should not be ‘suspicious’ to Google at all. The script is not guessing the numbers correctly; in fact, some of the guesses even contain words rather than numbers.

        Are you saying this only works on ‘easy’ short captchas, with a low rate of success (in my case a 0% success rate), after multiple tries and a few minutes? I’m wondering if you can provide a video of it solving 10 captchas correctly in a row. That would be enough to prove that it is an ‘easy bypass’. As it stands, this is not a feasible solution; it didn’t solve the captcha even once.


    1. Hello AppSec 🙂
      This is the first I have heard that something like this has been around for so long, though it is quite different from this blog post (audio analysis using self-trained neural networks vs. speech recognition services with years of experience that improve themselves using user data).
      2012... 5 years later and we’re back to square one, although this is a completely new product (ReCaptcha V2 + some more revisions).
      First, please see the updated code in the GitHub repo, as I committed a workaround for the complex audio problem (though the success rate is lower than for the easier audio challenges).
      Second, your POC is nice, but it is manual and takes a long time to complete, which makes it not useful in a bot scenario.
      My POC is fully automatic, with no human interaction at all, so it defeats ReCaptcha’s most important purpose: blocking non-human site interaction.


    1. Hello Nikita,
      Thank you for reading my post.
      There are many Pingbacks to this blog post, so I’d say it works as expected.
      If you encounter any problem, try asking for help in the official WordPress forums.


  3. Hi, I work on this audio captcha too, but I didn’t use the Google Speech API; I use another method.
    I use Nuance Dragon NaturallySpeaking to recognize the audio, and I think this software works well for that. It can’t give the exact number 100% of the time, but it helps with typing the number (I have to reload the audio when only 2 digits appear). I had another challenge, because I want to run this method automatically; could you help with that? (Maybe some script in iMacros + Java.)

    thank you in advance

