This is a simple project that demonstrates how to use Puppeteer to extract data from a website protected by reCaptcha. It uses a headless browser provided by Puppeteer to load the page. To solve the reCaptcha, it calls the API of a third-party solving service, 2Captcha.
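As a rough illustration of the 2Captcha round trip mentioned above, the sketch below submits a reCaptcha task and polls for the resulting token. It assumes 2Captcha's classic `in.php` / `res.php` endpoints and a base URL such as `https://2captcha.com`. The function name `solveRecaptcha` and its parameters are illustrative and not taken from this repository.

```javascript
// Sketch only: assumes apiUrl is a 2Captcha base URL (e.g. "https://2captcha.com")
// and that the classic in.php / res.php endpoints are used.
async function solveRecaptcha({
  apiUrl, apiKey, siteKey, pageUrl,
  fetchFn = fetch, // injectable for testing; global fetch needs Node 18+
  delayMs = 5000,
  maxTries = 24,
}) {
  // 1. Submit the task; 2Captcha answers with a task id.
  const submitUrl = new URL('/in.php', apiUrl);
  submitUrl.search = new URLSearchParams({
    key: apiKey, method: 'userrecaptcha', googlekey: siteKey, pageurl: pageUrl, json: '1',
  }).toString();
  const created = await (await fetchFn(submitUrl)).json();
  if (created.status !== 1) {
    throw new Error(`2Captcha rejected the task: ${created.request}`);
  }

  // 2. Poll for the result until a token is ready or we give up.
  const pollUrl = new URL('/res.php', apiUrl);
  pollUrl.search = new URLSearchParams({
    key: apiKey, action: 'get', id: created.request, json: '1',
  }).toString();
  for (let attempt = 0; attempt < maxTries; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    const result = await (await fetchFn(pollUrl)).json();
    if (result.status === 1) return result.request; // the reCaptcha response token
    if (result.request !== 'CAPCHA_NOT_READY') {
      throw new Error(`2Captcha failed: ${result.request}`);
    }
  }
  throw new Error('Timed out waiting for 2Captcha');
}
```

The `fetchFn` parameter exists only so the function can be exercised without network access; in normal use the default global `fetch` is enough.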
The project requires the following to be installed:
It also requires a 2Captcha API key.
After installing the requirements, clone the repository and install the dependencies:
git clone https://github.com/arman-bd/capsy-the-puppeteer.git
cd capsy-the-puppeteer
npm install
The configuration is done in the .env file. The following environment variables are required:
- CAPTCHA_API_KEY: The API key for 2Captcha
- CAPTCHA_API_URL: The API URL for 2Captcha
Copy the .env.example file to .env by running the following command:
cp .env.example .env
Now, you can edit the .env file and fill in the values.
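A minimal sketch of how the two variables could be read and validated at startup is shown below. The helper name `loadCaptchaConfig` is hypothetical, and how the project actually loads the .env file is not covered here; the sketch only assumes the values end up in an environment object such as `process.env`.

```javascript
// Hypothetical helper: validates the two variables described in this README.
// It does not parse the .env file itself; it only reads an environment object.
function loadCaptchaConfig(env = process.env) {
  const required = ['CAPTCHA_API_KEY', 'CAPTCHA_API_URL'];
  const missing = required.filter((name) => !env[name] || env[name].trim() === '');
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return {
    apiKey: env.CAPTCHA_API_KEY.trim(),
    apiUrl: env.CAPTCHA_API_URL.trim(),
  };
}
```

Failing fast on a missing key gives a clearer error at startup than a rejected 2Captcha request later on.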
You can run the project in development mode by running the following command:
npm run dev
This will start the server on port 8800. You can access the API at http://localhost:8800.
You can also run the project using Docker. To do so, run the following command:
docker-compose up --build -d
The command will build the Docker image and start the container in detached mode.
You can access the API at http://localhost:8800.
The application currently exposes the following API endpoints:
- GET /: Returns a simple message to indicate that the API is working.
- GET /ping: Returns a ping response with a timestamp.
- GET /task/screenshot?url={url}: Returns a screenshot of the given URL.
- GET /track/caru?id={id}: Returns the tracking information for the given Caru Container ID.
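Because the screenshot endpoint takes a full URL as a query parameter, that value needs to be percent-encoded. The sketch below builds request URLs for the two parameterized endpoints, assuming the default address `http://localhost:8800`. The helper names are hypothetical, and the response format is not assumed.

```javascript
const BASE_URL = 'http://localhost:8800'; // default address from this README

// Builds the URL for GET /task/screenshot?url={url}.
function screenshotUrl(targetUrl, base = BASE_URL) {
  const url = new URL('/task/screenshot', base);
  url.searchParams.set('url', targetUrl); // percent-encodes the nested URL
  return url.toString();
}

// Builds the URL for GET /track/caru?id={id}.
function caruTrackingUrl(containerId, base = BASE_URL) {
  const url = new URL('/track/caru', base);
  url.searchParams.set('id', containerId);
  return url.toString();
}
```

Either result can be passed to `fetch` or pasted into a browser while the server is running.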
The project uses Puppeteer to load the page and 2Captcha to solve the reCaptcha. The following steps are performed for the Caru Container Tracking API:
- Loads the page using Puppeteer.
- Waits for the page to load.
- Checks if the reCaptcha is present.
- Asks 2Captcha to solve the reCaptcha.
- Places the solution in the reCaptcha field.
- Clicks the "Continue" button.
- Waits for the next page to load.
- Places the Container ID in the field.
- Submits the form.
- Waits for the page to load.
- Extracts the tracking information.
- Returns the tracking information via the API.
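The steps above can be sketched roughly as follows. This is not the project's actual implementation: every selector is a hypothetical placeholder, the tracking page URL is passed in rather than hard-coded, and `solveCaptcha` stands for whatever function wraps the 2Captcha API. Only standard Puppeteer `Page` methods are used (`goto`, `$`, `$eval`, `type`, `click`, `waitForNavigation`, `waitForSelector`).

```javascript
// Hypothetical selectors: the real Caru page will differ.
const SELECTORS = {
  recaptcha: '.g-recaptcha',
  recaptchaResponse: '#g-recaptcha-response',
  continueButton: '#continue',
  containerInput: '#container-id',
  submitButton: '#submit',
  results: '#results',
};

async function trackCaruContainer(page, solveCaptcha, trackingPageUrl, containerId) {
  // Load the page and wait for network activity to settle.
  await page.goto(trackingPageUrl, { waitUntil: 'networkidle2' });

  // Check whether a reCaptcha widget is present.
  const widget = await page.$(SELECTORS.recaptcha);
  if (widget) {
    const siteKey = await page.$eval(
      SELECTORS.recaptcha, (el) => el.getAttribute('data-sitekey'));
    // Ask the solver (e.g. 2Captcha) for a token.
    const token = await solveCaptcha({ siteKey, pageUrl: trackingPageUrl });
    // Place the solution in the response field.
    await page.$eval(
      SELECTORS.recaptchaResponse, (el, value) => { el.value = value; }, token);
    // Click "Continue" and wait for the next page to load.
    await Promise.all([page.waitForNavigation(), page.click(SELECTORS.continueButton)]);
  }

  // Enter the Container ID, submit the form, and wait for the result page.
  await page.type(SELECTORS.containerInput, containerId);
  await Promise.all([page.waitForNavigation(), page.click(SELECTORS.submitButton)]);

  // Extract the tracking information as plain text.
  await page.waitForSelector(SELECTORS.results);
  return page.$eval(SELECTORS.results, (el) => el.textContent.trim());
}
```

In a real run, `page` would come from `puppeteer.launch()` followed by `browser.newPage()`. Starting `waitForNavigation()` together with the click avoids missing a navigation that begins before the wait is registered.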
Normally, Puppeteer would run in headless mode. However, for demonstration purposes, it is run in non-headless mode here, which allows you to see the browser in action. The following video shows the project in action:
The video shows the following:
- The API is called to get the tracking information for the given Caru Container ID.
- Puppeteer loads the page.
- The reCaptcha is solved.
- The tracking information is extracted.
API Output
After extracting the tracking information, the API returns the output to the original requester.
Note: Some of the information is hidden in the screenshot for privacy reasons.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is for educational purposes only. It is not intended for production use. The author is not responsible for any misuse or damage caused by this project. Use it at your own risk.