diff --git a/README.md b/README.md
index 8ad8cae..53f274a 100644
--- a/README.md
+++ b/README.md
@@ -21,15 +21,7 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
 - [Run Using Docker CLI](#run-using-docker-cli)
 - [Known Issues](#known-issues)
 - [Container Provisioner](#container-provisioner)
-  - [Pull the latest scraper Docker image](#pull-the-latest-scraper-docker-image)
-  - [Credentials Configuration](#credentials-configuration)
-    - [R2 Bucket Credentials](#r2-bucket-credentials)
-    - [R2 Bucket URL](#r2-bucket-url)
-  - [Run the container provisioner](#run-the-container-provisioner)
-  - [Visit the UI](#visit-the-ui)
-  - [Live Demo](#live-demo)
 - [Proxy Pool](#proxy-pool)
-  - [Running the Proxy Pool](#running-the-proxy-pool)
 
 ## How to Install Docker:
 1. [Windows](https://docs.docker.com/desktop/windows/install/)
@@ -67,47 +59,4 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
 3. The hotel scraper uses date of review instead of date of stay as the date because the date of stay is not always available.
 
 # Container Provisioner
-Container Provisioner is a tool written in [Go](https://go.dev/) that provides a UI for the users to interact with the scraper. It uses [Docker API](https://docs.docker.com/engine/api/) to provision the containers and run the scraper. The UI is written in raw HTML and JavaScript while the backend web framwork is [Fiber](https://docs.gofiber.io/).
-
-The scraped reviews will be uploaded to [Cloudflare R2 Buckets](https://www.cloudflare.com/lp/pg-r2/) for storing. R2 is S3-Compatible; therefore, technically, one can also use AWS S3 for storing the scraped reviews.
-
-## Pull the latest scraper Docker image
-```bash
-docker pull ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
-```
-## Credentials Configuration
-### R2 Bucket Credentials
-You will need to create a folder called `credentials` in the `container_provisioner` directory of the project. The `credentials` folder will contain the credentials for the R2 bucket. The credentials file should be named `creds.json` and should be in the following format:
-```json
-{
-    "bucketName": "",
-    "accountId": "",
-    "accessKeyId": "",
-    "accessKeySecret": ""
-}
-```
-### R2 Bucket URL
-You will also have to set the `R2_URL` environment variable in the `docker-compose.yml` file to the URL of the R2 bucket. The URL should end with a `/`.
-
-## Run the container provisioner
-The `docker-compose.yml` for the provisioner is located in the `container_provisioner` folder.
-
-## Visit the UI
-The UI is accessible at `http://localhost:3000`.
-
-## Live Demo
-A live demo of the container provisioner is available at [https://algo7.tools](https://algo7.tools).
-
-# Proxy Pool
-Proxy Pool is a docker image that runs both HTTP and SOCKS5 Proxies over OpenVPN (config to be provided by the user via docker bind mounts). `sockd`, `squid`, and `openvpn` client are managed by `supervisord` in the container. The service integrates with the Container Provisioner to provide a pool of proxies for the scraper to use. The container provisioner uses `docker-compose labels` to distinguish between different proxies. At this moment, the container provisioner only supports connecting to the Proxy Pool using HTTP proxies. Each service in the `docker-compose.yml` file represents a single proxy in the pool. The `docker-compose.yml` file for the proxy pool is located in the `proxy_pool` folder.
-
-The Proxy Pool service can also be used directly with the scraper. Just make sure that the `PROXY_ADDRESS` environment variable is in the `docker-compose.yml` file for the scraper.
-
-## Running the Proxy Pool
-1. Pull the latest scraper Docker image
-```bash
-docker pull ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
-```
-2. Create a docker-compose.yml file containing the configurations for each proxy (see the docker-compose.yml provided in the proxy_pool folder).
-3. Place the OpenVPN config file of each proxy in the corresponding bind mount folder speicified in the docker-compose.yml file.
-4. Run `docker-compose up` to start the container.
\ No newline at end of file
+# Proxy Pool
\ No newline at end of file
diff --git a/container_provisioner/README.md b/container_provisioner/README.md
new file mode 100644
index 0000000..34ea7ba
--- /dev/null
+++ b/container_provisioner/README.md
@@ -0,0 +1,31 @@
+# Container Provisioner
+Container Provisioner is a tool written in [Go](https://go.dev/) that provides a UI for users to interact with the scraper. It uses the [Docker API](https://docs.docker.com/engine/api/) to provision the containers and run the scraper. The UI is written in raw HTML and JavaScript, while the backend web framework is [Fiber](https://docs.gofiber.io/).
+
+The scraped reviews are uploaded to [Cloudflare R2 Buckets](https://www.cloudflare.com/lp/pg-r2/) for storage. R2 is S3-compatible; therefore, one can technically also use AWS S3 to store the scraped reviews.
+
+## Pull the latest scraper Docker image
+```bash
+docker pull ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
+```
+## Credentials Configuration
+### R2 Bucket Credentials
+You will need to create a folder called `credentials` in the `container_provisioner` directory of the project. The `credentials` folder will contain the credentials for the R2 bucket. The credentials file should be named `creds.json` and should be in the following format:
+```json
+{
+    "bucketName": "",
+    "accountId": "",
+    "accessKeyId": "",
+    "accessKeySecret": ""
+}
+```
+### R2 Bucket URL
+You will also have to set the `R2_URL` environment variable in the `docker-compose.yml` file to the URL of the R2 bucket. The URL should end with a `/`.
+
+## Run the container provisioner
+The `docker-compose.yml` for the provisioner is located in the `container_provisioner` folder.
+
+## Visit the UI
+The UI is accessible at `http://localhost:3000`.
+
+## Live Demo
+A live demo of the container provisioner is available at [https://algo7.tools](https://algo7.tools).
diff --git a/proxy_pool/README.MD b/proxy_pool/README.MD
new file mode 100644
index 0000000..a33832e
--- /dev/null
+++ b/proxy_pool/README.MD
@@ -0,0 +1,13 @@
+# Proxy Pool
+Proxy Pool is a Docker image that runs both an HTTP and a SOCKS5 proxy over OpenVPN (the OpenVPN config is to be provided by the user via Docker bind mounts). `sockd`, `squid`, and the `openvpn` client are managed by `supervisord` inside the container. The service integrates with the Container Provisioner to provide a pool of proxies for the scraper to use. The container provisioner uses `docker-compose` labels to distinguish between different proxies. At the moment, the container provisioner only supports connecting to the Proxy Pool via HTTP proxies. Each service in the `docker-compose.yml` file represents a single proxy in the pool. The `docker-compose.yml` file for the proxy pool is located in the `proxy_pool` folder.
+
+The Proxy Pool service can also be used directly with the scraper; just make sure the `PROXY_ADDRESS` environment variable is set in the scraper's `docker-compose.yml` file.
+
+## Running the Proxy Pool
+1. Pull the latest proxy pool Docker image:
+```bash
+docker pull ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
+```
+2. Create a `docker-compose.yml` file containing the configuration for each proxy (see the `docker-compose.yml` provided in the `proxy_pool` folder).
+3. Place the OpenVPN config file of each proxy in the corresponding bind mount folder specified in the `docker-compose.yml` file.
+4. Run `docker-compose up` to start the containers.
\ No newline at end of file
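The `PROXY_ADDRESS` wiring described in the proxy pool README could be sketched in Compose form roughly as follows. This is only an illustrative sketch, not content from the repository's actual `docker-compose.yml`: the service names, label key, bind mount path, and port are all assumptions.

```yaml
# Hypothetical docker-compose.yml sketch: one proxy pool worker plus a scraper
# pointed at it via PROXY_ADDRESS. Service names, the label key, the mount
# path, and the port are assumed values, not taken from this repository.
services:
  proxy_1:
    image: ghcr.io/algo7/tripadvisor-review-scraper/vpn_worker:latest
    cap_add:
      - NET_ADMIN                  # OpenVPN clients generally need this capability
    volumes:
      - ./ovpn/proxy_1:/ovpn       # bind mount holding this proxy's OpenVPN config (assumed path)
    labels:
      - "proxy_pool=true"          # label the provisioner could use to find pool members (assumed key)

  scraper:
    image: ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
    environment:
      # HTTP proxy only; 3128 is Squid's default listening port (assumed here)
      - PROXY_ADDRESS=http://proxy_1:3128
    depends_on:
      - proxy_1
```

Each additional `proxy_N` service would repeat the same pattern, matching the statement that one Compose service corresponds to one proxy in the pool.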