Commit

move the instruction of readme and container provisioner to separate readme files
algo7 committed Jan 21, 2024
1 parent eee4ad5 commit 2039560
Showing 3 changed files with 45 additions and 52 deletions.
53 changes: 1 addition & 52 deletions README.md
@@ -21,15 +21,7 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
- [Run Using Docker CLI](#run-using-docker-cli)
- [Known Issues](#known-issues)
- [Container Provisioner](#container-provisioner)
- [Pull the latest scraper Docker image](#pull-the-latest-scraper-docker-image)
- [Credentials Configuration](#credentials-configuration)
- [R2 Bucket Credentials](#r2-bucket-credentials)
- [R2 Bucket URL](#r2-bucket-url)
- [Run the container provisioner](#run-the-container-provisioner)
- [Visit the UI](#visit-the-ui)
- [Live Demo](#live-demo)
- [Proxy Pool](#proxy-pool)
- [Running the Proxy Pool](#running-the-proxy-pool)

## How to Install Docker:
1. [Windows](https://docs.docker.com/desktop/windows/install/)
@@ -67,47 +59,4 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
3. The hotel scraper uses date of review instead of date of stay as the date because the date of stay is not always available.

# Container Provisioner
Container Provisioner is a tool written in [Go](https://go.dev/) that provides a UI for users to interact with the scraper. It uses the [Docker API](https://docs.docker.com/engine/api/) to provision containers and run the scraper. The UI is written in raw HTML and JavaScript, while the backend web framework is [Fiber](https://docs.gofiber.io/).

The scraped reviews are uploaded to [Cloudflare R2 Buckets](https://www.cloudflare.com/lp/pg-r2/) for storage. R2 is S3-compatible, so technically AWS S3 can also be used to store the scraped reviews.

## Pull the latest scraper Docker image
```bash
docker pull ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
```
## Credentials Configuration
### R2 Bucket Credentials
You will need to create a folder called `credentials` in the `container_provisioner` directory of the project; it will contain the credentials for the R2 bucket. The credentials file must be named `creds.json` and follow this format:
```json
{
  "bucketName": "<R2_Bucket_Name>",
  "accountId": "<Cloudflare_Account_Id>",
  "accessKeyId": "<R2_Bucket_AccessKey_ID>",
  "accessKeySecret": "<R2_Bucket_AccessKey_Secret>"
}
```
### R2 Bucket URL
You will also need to set the `R2_URL` environment variable in the `docker-compose.yml` file to the URL of the R2 bucket. The URL must end with a `/`.

## Run the container provisioner
The `docker-compose.yml` for the provisioner is located in the `container_provisioner` folder.

## Visit the UI
The UI is accessible at `http://localhost:3000`.

## Live Demo
A live demo of the container provisioner is available at [https://algo7.tools](https://algo7.tools).

# Proxy Pool
Proxy Pool is a Docker image that runs both HTTP and SOCKS5 proxies over OpenVPN (the OpenVPN config is to be provided by the user via Docker bind mounts). The `sockd`, `squid`, and `openvpn` client processes are managed by `supervisord` inside the container. The service integrates with the Container Provisioner to provide a pool of proxies for the scraper to use. The provisioner uses `docker-compose` labels to distinguish between proxies; at the moment, it only supports connecting to the Proxy Pool via HTTP proxies. Each service in the `docker-compose.yml` file represents a single proxy in the pool. The `docker-compose.yml` file for the proxy pool is located in the `proxy_pool` folder.

The Proxy Pool service can also be used directly with the scraper. Just make sure that the `PROXY_ADDRESS` environment variable is set in the `docker-compose.yml` file for the scraper.

## Running the Proxy Pool
1. Pull the latest proxy pool (`vpn_worker`) Docker image
```bash
docker pull ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
```
2. Create a `docker-compose.yml` file containing the configuration for each proxy (see the `docker-compose.yml` provided in the `proxy_pool` folder).
3. Place the OpenVPN config file of each proxy in the corresponding bind-mount folder specified in the `docker-compose.yml` file.
4. Run `docker-compose up` to start the containers.
# Proxy Pool
31 changes: 31 additions & 0 deletions container_provisioner/README.md
@@ -0,0 +1,31 @@
# Container Provisioner
Container Provisioner is a tool written in [Go](https://go.dev/) that provides a UI for users to interact with the scraper. It uses the [Docker API](https://docs.docker.com/engine/api/) to provision containers and run the scraper. The UI is written in raw HTML and JavaScript, while the backend web framework is [Fiber](https://docs.gofiber.io/).

The scraped reviews are uploaded to [Cloudflare R2 Buckets](https://www.cloudflare.com/lp/pg-r2/) for storage. R2 is S3-compatible, so technically AWS S3 can also be used to store the scraped reviews.

## Pull the latest scraper Docker image
```bash
docker pull ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
```
## Credentials Configuration
### R2 Bucket Credentials
You will need to create a folder called `credentials` in the `container_provisioner` directory of the project; it will contain the credentials for the R2 bucket. The credentials file must be named `creds.json` and follow this format:
```json
{
  "bucketName": "<R2_Bucket_Name>",
  "accountId": "<Cloudflare_Account_Id>",
  "accessKeyId": "<R2_Bucket_AccessKey_ID>",
  "accessKeySecret": "<R2_Bucket_AccessKey_Secret>"
}
```
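The setup above can be scripted. A minimal sketch, assuming the repository layout described here — the values are placeholders to replace with your real R2 credentials:

```shell
# Create the credentials folder and a creds.json skeleton for the provisioner.
mkdir -p container_provisioner/credentials

# Write the placeholder credentials file; substitute real R2 values before use.
cat > container_provisioner/credentials/creds.json <<'EOF'
{
  "bucketName": "<R2_Bucket_Name>",
  "accountId": "<Cloudflare_Account_Id>",
  "accessKeyId": "<R2_Bucket_AccessKey_ID>",
  "accessKeySecret": "<R2_Bucket_AccessKey_Secret>"
}
EOF
```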
### R2 Bucket URL
You will also need to set the `R2_URL` environment variable in the `docker-compose.yml` file to the URL of the R2 bucket. The URL must end with a `/`.
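For illustration, the environment entry could look like this in the provisioner's `docker-compose.yml` — the service name and endpoint URL below are placeholders, not taken from the repository:

```yaml
# Hypothetical fragment of container_provisioner/docker-compose.yml;
# service name and account ID are placeholders.
services:
  provisioner:
    environment:
      # Note the trailing slash, which the provisioner expects.
      - R2_URL=https://<Cloudflare_Account_Id>.r2.cloudflarestorage.com/
```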

## Run the container provisioner
The `docker-compose.yml` for the provisioner is located in the `container_provisioner` folder.

## Visit the UI
The UI is accessible at `http://localhost:3000`.

## Live Demo
A live demo of the container provisioner is available at [https://algo7.tools](https://algo7.tools).
13 changes: 13 additions & 0 deletions proxy_pool/README.MD
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Proxy Pool
Proxy Pool is a Docker image that runs both HTTP and SOCKS5 proxies over OpenVPN (the OpenVPN config is to be provided by the user via Docker bind mounts). The `sockd`, `squid`, and `openvpn` client processes are managed by `supervisord` inside the container. The service integrates with the Container Provisioner to provide a pool of proxies for the scraper to use. The provisioner uses `docker-compose` labels to distinguish between proxies; at the moment, it only supports connecting to the Proxy Pool via HTTP proxies. Each service in the `docker-compose.yml` file represents a single proxy in the pool. The `docker-compose.yml` file for the proxy pool is located in the `proxy_pool` folder.

The Proxy Pool service can also be used directly with the scraper. Just make sure that the `PROXY_ADDRESS` environment variable is set in the `docker-compose.yml` file for the scraper.

## Running the Proxy Pool
1. Pull the latest proxy pool (`vpn_worker`) Docker image
```bash
docker pull ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
```
2. Create a `docker-compose.yml` file containing the configuration for each proxy (see the `docker-compose.yml` provided in the `proxy_pool` folder).
3. Place the OpenVPN config file of each proxy in the corresponding bind-mount folder specified in the `docker-compose.yml` file.
4. Run `docker-compose up` to start the containers.
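The steps above might be sketched as a compose file like the following. This is a hypothetical illustration, not the file shipped in `proxy_pool/`: the service name, label key, and mount path are assumptions, and the `NET_ADMIN`/`tun` settings reflect the usual requirements for running an OpenVPN client in a container:

```yaml
# Hypothetical proxy_pool/docker-compose.yml sketch — service name, label
# key, and paths are placeholders; see the file provided in proxy_pool/.
services:
  proxy-1:
    image: ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
    labels:
      - "proxy.pool=true"      # assumed label key used by the provisioner
    volumes:
      - ./ovpn/proxy-1:/vpn    # bind mount holding this proxy's OpenVPN config
    cap_add:
      - NET_ADMIN              # OpenVPN client needs to manage network devices
    devices:
      - /dev/net/tun
```

Each additional proxy would be another service block with its own bind-mount folder and OpenVPN config.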
