Commit 38fc626

Merge pull request #176 from algo7/feature/api_scraper
Feature/api scraper
algo7 authored Jan 23, 2024
2 parents 0042380 + 3b8243b commit 38fc626
Showing 40 changed files with 766 additions and 140 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/ci_scraper.yml
@@ -29,6 +29,16 @@ jobs:
      - name: Check Out Repo
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: 'go.mod'
          cache-dependency-path: 'go.sum'
      - run: go version

      - name: Build Go Application
        run: make build

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
1 change: 0 additions & 1 deletion .gitignore
@@ -111,7 +111,6 @@ review.csv
**source
**Default
**DevToolsActivePort
**pkg
**Project_Files
**tmp
**.DS_Store
66 changes: 26 additions & 40 deletions README.md
@@ -16,54 +16,40 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
- [TripAdvisor-Review-Scraper](#tripadvisor-review-scraper)
- [Current Issues](#current-issues)
- [Table of Contents](#table-of-contents)
- [How to Install Docker:](#how-to-install-docker)
- [Run Using Docker Compose](#run-using-docker-compose)
- [Run Using Docker CLI](#run-using-docker-cli)
- [Known Issues](#known-issues)
- [Container Provisioner](#container-provisioner)
- [Proxy Pool](#proxy-pool)

## How to Install Docker:
- [Requirements](#requirements)
- [How to Install Docker:](#how-to-install-docker)
- [Project Layout](#project-layout)
- [Scraper](#scraper)
- [Container Provisioner](#container-provisioner)
- [Proxy Pool](#proxy-pool)

## Requirements
1. Go v1.21+
2. Make [Optional]
3. Docker [Optional]
4. Docker Compose [Optional]
5. Node.js 18+ [Optional. Only required for the deprecated Node.js scraper.]

### How to Install Docker:
1. [Windows](https://docs.docker.com/desktop/windows/install/)
2. [Mac](https://docs.docker.com/desktop/mac/install/)
3. [Linux](https://docs.docker.com/engine/install/ubuntu/)

## Run Using Docker Compose
1. Download the repository.
2. Create a folder called `reviews` and a folder called `source` in the `scraper` directory of the project.
3. The `reviews` folder will contain the scraped reviews.
4. Place the source file in the `source` folder.
   - The source file is a CSV file containing a list of hotels/restaurants to scrape.
   - Examples of the source file are provided in the `examples` folder.
   - The source file for hotels should be named `hotels.csv`, and the source file for restaurants should be named `restos.csv`.
5. Edit the `SCRAPE_MODE` variable (`RESTO` for restaurants, `HOTEL` for hotels) in the `docker-compose.yml` file to scrape either restaurant or hotel reviews (see the sketch after this list).
6. Edit the `CONCURRENCY` variable in the `docker-compose.yml` file to set the number of concurrent requests.
   - A high concurrency number might cause the program to hang, depending on your internet connection and the resources available on your computer.
7. Edit the `LANGUAGE` variable in the `docker-compose.yml` file to the language of the reviews you want to scrape.
   - This option is only supported in RESTO mode.
   - Available options are `fr` and `en`, both of which will actually scrape all the reviews.
8. Run `docker-compose up` to start the container.
9. Once the scraping process is finished, check the `reviews` folder for the results.
10. Samples of the results are included in the `samples` folder.
11. Please remember to empty the `reviews` folder before running the scraper again.
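
For orientation, here is a minimal sketch of what the relevant part of `docker-compose.yml` might look like. The service name is illustrative, the image and mount paths follow the Docker CLI example below, and the environment values are the ones described in steps 5-7; match it against the `docker-compose.yml` shipped in the repo.

```yaml
# Illustrative sketch only, not the repo's exact compose file.
services:
  scraper:
    image: ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
    environment:
      - SCRAPE_MODE=RESTO # RESTO for restaurants, HOTEL for hotels
      - CONCURRENCY=5     # lower this if the scraper hangs
      - LANGUAGE=en       # only honored in RESTO mode
    volumes:
      - ./reviews:/puppeteer/reviews
      - ./source:/puppeteer/source
```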

## Run Using Docker CLI
1. Download the repository.
2. Replace the `-e SCRAPE_MODE`, `-e CONCURRENCY`, and `-e LANGUAGE` values with your own.
3. Run `docker run --mount type=bind,src="$(pwd)"/reviews,target=/puppeteer/reviews --mount type=bind,src="$(pwd)"/source,target=/puppeteer/source -e SCRAPE_MODE=HOTEL -e CONCURRENCY=5 -e LANGUAGE=en ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest` in the terminal at the root directory of the project.


## Known Issues
1. The hotel scraper works for English reviews only.
2. The restaurant scraper can only scrape English or French reviews.
3. The hotel scraper uses the date of the review instead of the date of stay, because the date of stay is not always available.

# Container Provisioner
## Project Layout
### Scraper
There are 2 scrapers available:
1. [Scraper](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/scraper) written in Go
2. [Scraper](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/scraperjs) written in Node.js [Deprecated]

The Go scraper is preferred: it calls the API directly and is much faster than the Node.js scraper, which parses HTML the traditional way. Instructions for using each scraper are located in its own folder.
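
As a purely illustrative contrast, the sketch below shows the API-first style in Go: fetch structured JSON and decode it, rather than rendering and parsing HTML. The endpoint, types, and fields are hypothetical stand-ins, not TripAdvisor's actual API or the repo's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Hypothetical review shape for illustration only.
type review struct {
	Title string `json:"title"`
	Text  string `json:"text"`
}

func main() {
	// Hypothetical endpoint; a real API-first scraper hits the site's own JSON API.
	resp, err := http.Get("https://api.example.com/reviews?location=123")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode structured JSON directly, with no HTML rendering or parsing step.
	var reviews []review
	if err := json.NewDecoder(resp.Body).Decode(&reviews); err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d reviews\n", len(reviews))
}
```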


### Container Provisioner
Automates the process of provisioning containers for the scraper.

Please read more about the container provisioner [here](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/container_provisioner)

# Proxy Pool
### Proxy Pool
Provides a pool of proxies for the scraper to use.

Please read more about the proxy pool [here](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/proxy_pool)
19 changes: 15 additions & 4 deletions container_provisioner/api/controllers.go
@@ -139,21 +139,32 @@ func postProvision(c *fiber.Ctx) error {
		})
	}

	// Get the scrape target name from the URL
	scrapeTargetName := utils.GetScrapeTargetNameFromURL(url, scrapeMode)
	// Get the location name
	locationName := utils.GetLocationNameFromURL(url, scrapeMode)
	if locationName == "" {
		return c.Render("submission", fiber.Map{
			"Title":      "Algo7 TripAdvisor Scraper",
			"Message1":   "Invalid URL",
			"ReturnHome": true,
		})
	}

	// Get the proxy container info
	proxyContainers := containers.AcquireProxyContainer()

	// Generate the container config
	scrapeConfig := containers.ContainerConfigGenerator(scrapeMode, scrapeTargetName, url, uploadIdentifier, proxyContainers.ProxyAddress, proxyContainers.VPNRegion)
	scrapeConfig := containers.ContainerConfigGenerator(
		url,
		uploadIdentifier,
		proxyContainers.ProxyAddress,
		proxyContainers.VPNRegion)

	// Create the container
	containerID := containers.CreateContainer(scrapeConfig)

	// Start the scraping container via goroutine
	go func() {
		containers.Scrape(uploadIdentifier, scrapeTargetName, containerID)
		containers.Scrape(uploadIdentifier, locationName, containerID)
		containers.ReleaseProxyContainer(proxyContainers.ContainerID)
	}()

34 changes: 5 additions & 29 deletions container_provisioner/containers/helper.go
@@ -59,44 +59,20 @@ func RemoveContainer(containerID string) {

// ContainerConfigGenerator generates the container config depending on the scrape target
func ContainerConfigGenerator(
	scrapeTarget string,
	scrapeTargetName string,
	scrapeURL string, uploadIdentifier string,
	locationURL string, uploadIdentifier string,
	proxyAddress string, vpnRegion string) *container.Config {

	var scrapeContainerURL string
	var targetName string

	switch scrapeTarget {
	case "HOTEL":
		scrapeContainerURL = fmt.Sprintf("HOTEL_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("HOTEL_NAME=%s", scrapeTargetName)
	case "RESTO":
		scrapeContainerURL = fmt.Sprintf("RESTO_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("RESTO_NAME=%s", scrapeTargetName)
	case "AIRLINE":
		scrapeContainerURL = fmt.Sprintf("AIRLINE_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("AIRLINE_NAME=%s", scrapeTargetName)
	}

	scrapeMode := fmt.Sprintf("SCRAPE_MODE=%s", scrapeTarget)
	proxySettings := fmt.Sprintf("PROXY_ADDRESS=%s", proxyAddress)

	return &container.Config{
		Image: containerImage,
		Labels: map[string]string{
			"TaskOwner":  uploadIdentifier,
			"Target":     scrapeTargetName,
			"Target":     locationURL,
			"vpn.region": vpnRegion,
		},
		// Env vars required by the js scraper containers
		Env: []string{
			"CONCURRENCY=2",
			"IS_PROVISIONER=true",
			scrapeMode,
			scrapeContainerURL,
			targetName,
			proxySettings,
			fmt.Sprintf("LOCATION_URL=%s", locationURL),
			fmt.Sprintf("PROXY_HOST=%s", proxyAddress),
		},
		Tty: true,
	}
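
With the new `Env` entries, the scraper container picks up its work from `LOCATION_URL` and `PROXY_HOST`. Below is a minimal sketch of what the consuming side might look like; the variable names come from the config above, but the reading code itself is an assumption, not the repo's actual scraper.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Env var names match those set by ContainerConfigGenerator above.
	locationURL := os.Getenv("LOCATION_URL")
	proxyHost := os.Getenv("PROXY_HOST")
	if locationURL == "" || proxyHost == "" {
		fmt.Fprintln(os.Stderr, "LOCATION_URL and PROXY_HOST must be set")
		os.Exit(1)
	}
	fmt.Printf("scraping %s via proxy %s\n", locationURL, proxyHost)
}
```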
@@ -275,7 +251,7 @@ func ReleaseProxyContainer(containerID string) {
	database.ReleaseLock(lockKey)
}

// GetResultCSVSizeInContainer gets the size of the result csv file in the container
// getResultCSVSizeInContainer gets the size of the result csv file in the container
func getResultCSVSizeInContainer(containerID, filePathInContainer string) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	utils.ErrorHandler(err)
2 changes: 1 addition & 1 deletion container_provisioner/containers/provisioner.go
@@ -38,7 +38,7 @@ func Scrape(uploadIdentifier string, targetName string, containerID string) {
	}

	// The file path in the container
	filePathInContainer := "/puppeteer/reviews/All.csv"
	filePathInContainer := "reviews.csv"

	// Get the file size in the container
	getResultCSVSizeInContainer(containerID, filePathInContainer)
21 changes: 8 additions & 13 deletions container_provisioner/utils/utils.go
@@ -108,19 +108,17 @@ func ParseCredsFromJSON(fileName string) Creds {
	return creds
}

// GetScrapeTargetNameFromURL get the scrape target name from the given URL
func GetScrapeTargetNameFromURL(url string, scrapOption string) string {
// GetLocationNameFromURL gets the location name from the given URL
func GetLocationNameFromURL(url string, scrapOption string) string {

	// Split the url by "-"
	splitURL := strings.Split(url, "-")
	splitURL := strings.Split(url, "_")

	switch scrapOption {
	case "HOTEL", "RESTO":
		return splitURL[4]
	case "AIRLINE":
		if len(splitURL) > 4 {
			return fmt.Sprintf("%s-%s", splitURL[3], splitURL[4])
		}
		return splitURL[3]
		return strings.Join(splitURL[3:], "_")
	default:
		return ""
	}
@@ -130,14 +128,11 @@ func GetScrapeTargetNameFromURL(url string, scrapOption string) string {
func ValidateTripAdvisorURL(url string, scrapOption string) bool {
	switch scrapOption {
	case "HOTEL":
		match, _ := regexp.MatchString(tripAdvisorHotelURLRegexp.String(), url)
		return match
		return tripAdvisorHotelURLRegexp.MatchString(url)
	case "RESTO":
		match, _ := regexp.MatchString(tripAdvisorRestaurantRegexp.String(), url)
		return match
		return tripAdvisorRestaurantRegexp.MatchString(url)
	case "AIRLINE":
		match, _ := regexp.MatchString(tripAdvisorAirlineRegexp.String(), url)
		return match
		return tripAdvisorAirlineRegexp.MatchString(url)
	default:
		return false
	}
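
This refactor swaps the package-level `regexp.MatchString`, which recompiles the pattern on every call and returns an error the caller must discard, for the `MatchString` method on the precompiled `*regexp.Regexp`. A self-contained sketch of the difference; the pattern below is a hypothetical stand-in for the repo's actual regexps:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical pattern for illustration; the repo's actual regexps differ.
var hotelURL = regexp.MustCompile(`^https://www\.tripadvisor\.com/Hotel_Review-.+\.html$`)

func main() {
	url := "https://www.tripadvisor.com/Hotel_Review-g123-d456-Reviews-Example.html"

	// Old style: recompiles the already-compiled pattern on every call and
	// discards a compile error that cannot occur for a valid pattern.
	match, _ := regexp.MatchString(hotelURL.String(), url)

	// New style: reuses the compiled program directly; no error to ignore.
	fmt.Println(match, hotelURL.MatchString(url)) // true true
}
```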
7 changes: 5 additions & 2 deletions go.work
@@ -1,3 +1,6 @@
go 1.21.1
go 1.21.4

use ./container_provisioner
use (
	./container_provisioner
	./scraper
)
47 changes: 29 additions & 18 deletions proxy_pool/docker-compose.yml
@@ -19,18 +19,18 @@ services:
    devices:
      - '/dev/net/tun:/dev/net/tun'
    # NOTE: These port mappings are only used when you are connecting to the proxy pool from outside of the Docker network without the container_provisioner (e.g. from your host machine).
    ports:
      # Squid proxy port
      - target: 8888
        published: 8888
        protocol: tcp
      # Dante proxy port
      - target: 8881
        published: 8881
        protocol: tcp
      - target: 8881
        published: 8881
        protocol: udp
    # ports:
    #   # Squid proxy port
    #   - target: 8888
    #     published: 8888
    #     protocol: tcp
    #   # Dante proxy port
    #   - target: 8881
    #     published: 8881
    #     protocol: tcp
    #   - target: 8881
    #     published: 8881
    #     protocol: udp
    # WARNING: Do not change the name of this network. It is used by the scraper to connect to the proxies.
    # At this moment, it is hardcoded in the container_provisioner when creating the containers.
    networks:
@@ -69,7 +69,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH14
        target: /VPN
        bind:
          create_host_path: true
@@ -92,15 +92,26 @@
    # Devices required to run OpenVPN
    devices:
      - '/dev/net/tun:/dev/net/tun'

    ports:
      # Squid proxy port
      - target: 8888
        published: 8888
        protocol: tcp
      # Dante proxy port
      - target: 8881
        published: 8881
        protocol: tcp
      - target: 8881
        published: 8881
        protocol: udp
    # WARNING: Do not change the name of this network. It is used by the scraper to connect to the proxies.
    # At this moment, it is hardcoded in the container_provisioner when creating the containers.
    networks:
      - scraper_vpn
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH46
        target: /VPN
        bind:
          create_host_path: true
@@ -131,7 +142,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH58
        target: /VPN
        bind:
          create_host_path: true
@@ -162,7 +173,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH66
        target: /VPN
        bind:
          create_host_path: true
@@ -192,7 +203,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH70
        target: /VPN
        bind:
          create_host_path: true