Commit 38fc626

Merge pull request #176 from algo7/feature/api_scraper
Feature/api scraper
algo7 authored Jan 23, 2024
2 parents 0042380 + 3b8243b commit 38fc626
Showing 40 changed files with 766 additions and 140 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/ci_scraper.yml
@@ -29,6 +29,16 @@ jobs:
      - name: Check Out Repo
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: 'go.mod'
          cache-dependency-path: 'go.sum'
      - run: go version

      - name: Build Go Application
        run: make build

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
1 change: 0 additions & 1 deletion .gitignore
@@ -111,7 +111,6 @@ review.csv
**source
**Default
**DevToolsActivePort
**pkg
**Project_Files
**tmp
**.DS_Store
66 changes: 26 additions & 40 deletions README.md
@@ -16,54 +16,40 @@ A simple scraper for TripAdvisor (Hotel, Restaurant, Airline) reviews.
- [TripAdvisor-Review-Scraper](#tripadvisor-review-scraper)
- [Current Issues](#current-issues)
- [Table of Contents](#table-of-contents)
- [How to Install Docker:](#how-to-install-docker)
- [Run Using Docker Compose](#run-using-docker-compose)
- [Run Using Docker CLI](#run-using-docker-cli)
- [Known Issues](#known-issues)
- [Container Provisioner](#container-provisioner)
- [Proxy Pool](#proxy-pool)

## How to Install Docker:
- [Requirements](#requirements)
- [How to Install Docker:](#how-to-install-docker)
- [Project Layout](#project-layout)
- [Scraper](#scraper)
- [Container Provisioner](#container-provisioner)
- [Proxy Pool](#proxy-pool)

## Requirements
1. Go v1.21+
2. Make [Optional]
3. Docker [Optional]
4. Docker Compose [Optional]
5. Node.js 18+ [Optional. Only required for the deprecated Node.js scraper.]

### How to Install Docker:
1. [Windows](https://docs.docker.com/desktop/windows/install/)
2. [Mac](https://docs.docker.com/desktop/mac/install/)
3. [Linux](https://docs.docker.com/engine/install/ubuntu/)

## Run Using Docker Compose
1. Download the repository.
2. Create a folder called `reviews` and a folder called `source` in the `scraper` directory of the project.
3. The `reviews` folder will contain the scraped reviews.
4. Place the source file in the `source` folder.
   - The source file is a CSV file containing a list of hotels/restaurants to scrape.
   - Examples of the source file are provided in the `examples` folder.
   - The source file for hotels should be named `hotels.csv`, and the source file for restaurants should be named `restos.csv`.
5. Edit the `SCRAPE_MODE` variable (`RESTO` for restaurants, `HOTEL` for hotels) in the `docker-compose.yml` file to scrape either restaurant or hotel reviews (see the sketch after this list).
6. Edit the `CONCURRENCY` variable in the `docker-compose.yml` file to set the number of concurrent requests.
   - A high concurrency number might cause the program to hang, depending on your internet connection and the resources available on your computer.
7. Edit the `LANGUAGE` variable in the `docker-compose.yml` file to the language of the reviews you want to scrape.
   - This option is only supported in RESTO mode.
   - Available options are `fr` and `en`, both of which will actually scrape all the reviews.
8. Run `docker-compose up` to start the container.
9. Once the scraping process is finished, check the `reviews` folder for the results.
10. Samples of the results are included in the `samples` folder.
11. Please remember to empty the `reviews` folder before running the scraper again.
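
For orientation, here is a minimal sketch of what the relevant part of `docker-compose.yml` might look like. The service name is illustrative, the image and mount paths follow the Docker CLI example below, and the environment values are the ones described in steps 5-7; match it against the `docker-compose.yml` shipped in the repo.

```yaml
# Illustrative sketch only, not the repo's exact compose file.
services:
  scraper:
    image: ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest
    environment:
      - SCRAPE_MODE=RESTO # RESTO for restaurants, HOTEL for hotels
      - CONCURRENCY=5     # lower this if the scraper hangs
      - LANGUAGE=en       # only honored in RESTO mode
    volumes:
      - ./reviews:/puppeteer/reviews
      - ./source:/puppeteer/source
```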

## Run Using Docker CLI
1. Download the repository.
2. Replace the `-e SCRAPE_MODE`, `-e CONCURRENCY`, and `-e LANGUAGE` values with your own.
3. Run `docker run --mount type=bind,src="$(pwd)"/reviews,target=/puppeteer/reviews --mount type=bind,src="$(pwd)"/source,target=/puppeteer/source -e SCRAPE_MODE=HOTEL -e CONCURRENCY=5 -e LANGUAGE=en ghcr.io/algo7/tripadvisor-review-scraper/scraper:latest` in the terminal at the root directory of the project.


## Known Issues
1. The hotel scraper works for English reviews only.
2. The restaurant scraper can only scrape English or French reviews.
3. The hotel scraper uses the date of the review instead of the date of stay, because the date of stay is not always available.

# Container Provisioner
## Project Layout
### Scraper
There are 2 scrapers available:
1. [Scraper](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/scraper) written in Go
2. [Scraper](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/scraperjs) written in Node.js [Deprecated]

The Go scraper is preferred: it calls the API directly and is much faster than the Node.js scraper, which parses HTML the traditional way. Instructions for using each scraper are located in its own folder.
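
As a purely illustrative contrast, the sketch below shows the API-first style in Go: fetch structured JSON and decode it, rather than rendering and parsing HTML. The endpoint, types, and fields are hypothetical stand-ins, not TripAdvisor's actual API or the repo's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Hypothetical review shape for illustration only.
type review struct {
	Title string `json:"title"`
	Text  string `json:"text"`
}

func main() {
	// Hypothetical endpoint; a real API-first scraper hits the site's own JSON API.
	resp, err := http.Get("https://api.example.com/reviews?location=123")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode structured JSON directly, with no HTML rendering or parsing step.
	var reviews []review
	if err := json.NewDecoder(resp.Body).Decode(&reviews); err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d reviews\n", len(reviews))
}
```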


### Container Provisioner
Automates the process of provisioning containers for the scraper.

Please read more about the container provisioner [here](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/container_provisioner)

# Proxy Pool
### Proxy Pool
Provides a pool of proxies for the scraper to use.

Please read more about the proxy pool [here](https://github.com/algo7/TripAdvisor-Review-Scraper/tree/main/proxy_pool)
19 changes: 15 additions & 4 deletions container_provisioner/api/controllers.go
@@ -139,21 +139,32 @@ func postProvision(c *fiber.Ctx) error {
		})
	}

	// Get the scrape target name from the URL
	scrapeTargetName := utils.GetScrapeTargetNameFromURL(url, scrapeMode)
	// Get the location name
	locationName := utils.GetLocationNameFromURL(url, scrapeMode)
	if locationName == "" {
		return c.Render("submission", fiber.Map{
			"Title":      "Algo7 TripAdvisor Scraper",
			"Message1":   "Invalid URL",
			"ReturnHome": true,
		})
	}

	// Get the proxy container info
	proxyContainers := containers.AcquireProxyContainer()

	// Generate the container config
	scrapeConfig := containers.ContainerConfigGenerator(scrapeMode, scrapeTargetName, url, uploadIdentifier, proxyContainers.ProxyAddress, proxyContainers.VPNRegion)
	scrapeConfig := containers.ContainerConfigGenerator(
		url,
		uploadIdentifier,
		proxyContainers.ProxyAddress,
		proxyContainers.VPNRegion)

	// Create the container
	containerID := containers.CreateContainer(scrapeConfig)

	// Start the scraping container via goroutine
	go func() {
		containers.Scrape(uploadIdentifier, scrapeTargetName, containerID)
		containers.Scrape(uploadIdentifier, locationName, containerID)
		containers.ReleaseProxyContainer(proxyContainers.ContainerID)
	}()

34 changes: 5 additions & 29 deletions container_provisioner/containers/helper.go
@@ -59,44 +59,20 @@ func RemoveContainer(containerID string) {

// ContainerConfigGenerator generates the container config depending on the scrape target
func ContainerConfigGenerator(
	scrapeTarget string,
	scrapeTargetName string,
	scrapeURL string, uploadIdentifier string,
	locationURL string, uploadIdentifier string,
	proxyAddress string, vpnRegion string) *container.Config {

	var scrapeContainerURL string
	var targetName string

	switch scrapeTarget {
	case "HOTEL":
		scrapeContainerURL = fmt.Sprintf("HOTEL_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("HOTEL_NAME=%s", scrapeTargetName)
	case "RESTO":
		scrapeContainerURL = fmt.Sprintf("RESTO_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("RESTO_NAME=%s", scrapeTargetName)
	case "AIRLINE":
		scrapeContainerURL = fmt.Sprintf("AIRLINE_URL=%s", scrapeURL)
		targetName = fmt.Sprintf("AIRLINE_NAME=%s", scrapeTargetName)
	}

	scrapeMode := fmt.Sprintf("SCRAPE_MODE=%s", scrapeTarget)
	proxySettings := fmt.Sprintf("PROXY_ADDRESS=%s", proxyAddress)

	return &container.Config{
		Image: containerImage,
		Labels: map[string]string{
			"TaskOwner":  uploadIdentifier,
			"Target":     scrapeTargetName,
			"Target":     locationURL,
			"vpn.region": vpnRegion,
		},
		// Env vars required by the js scraper containers
		Env: []string{
			"CONCURRENCY=2",
			"IS_PROVISIONER=true",
			scrapeMode,
			scrapeContainerURL,
			targetName,
			proxySettings,
			fmt.Sprintf("LOCATION_URL=%s", locationURL),
			fmt.Sprintf("PROXY_HOST=%s", proxyAddress),
		},
		Tty: true,
	}
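
With the new `Env` entries, the scraper container picks up its work from `LOCATION_URL` and `PROXY_HOST`. Below is a minimal sketch of what the consuming side might look like; the variable names come from the config above, but the reading code itself is an assumption, not the repo's actual scraper.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Env var names match those set by ContainerConfigGenerator above.
	locationURL := os.Getenv("LOCATION_URL")
	proxyHost := os.Getenv("PROXY_HOST")
	if locationURL == "" || proxyHost == "" {
		fmt.Fprintln(os.Stderr, "LOCATION_URL and PROXY_HOST must be set")
		os.Exit(1)
	}
	fmt.Printf("scraping %s via proxy %s\n", locationURL, proxyHost)
}
```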
@@ -275,7 +251,7 @@ func ReleaseProxyContainer(containerID string) {
	database.ReleaseLock(lockKey)
}

// GetResultCSVSizeInContainer gets the size of the result csv file in the container
// getResultCSVSizeInContainer gets the size of the result csv file in the container
func getResultCSVSizeInContainer(containerID, filePathInContainer string) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	utils.ErrorHandler(err)
2 changes: 1 addition & 1 deletion container_provisioner/containers/provisioner.go
@@ -38,7 +38,7 @@ func Scrape(uploadIdentifier string, targetName string, containerID string) {
	}

	// The file path in the container
	filePathInContainer := "/puppeteer/reviews/All.csv"
	filePathInContainer := "reviews.csv"

	// Get the file size in the container
	getResultCSVSizeInContainer(containerID, filePathInContainer)
21 changes: 8 additions & 13 deletions container_provisioner/utils/utils.go
@@ -108,19 +108,17 @@ func ParseCredsFromJSON(fileName string) Creds {
	return creds
}

// GetScrapeTargetNameFromURL get the scrape target name from the given URL
func GetScrapeTargetNameFromURL(url string, scrapOption string) string {
// GetLocationNameFromURL gets the location name from the given URL
func GetLocationNameFromURL(url string, scrapOption string) string {

	// Split the url by "-"
	splitURL := strings.Split(url, "-")
	splitURL := strings.Split(url, "_")

	switch scrapOption {
	case "HOTEL", "RESTO":
		return splitURL[4]
	case "AIRLINE":
		if len(splitURL) > 4 {
			return fmt.Sprintf("%s-%s", splitURL[3], splitURL[4])
		}
		return splitURL[3]
		return strings.Join(splitURL[3:], "_")
	default:
		return ""
	}
@@ -130,14 +128,11 @@ func GetScrapeTargetNameFromURL(url string, scrapOption string) string {
func ValidateTripAdvisorURL(url string, scrapOption string) bool {
	switch scrapOption {
	case "HOTEL":
		match, _ := regexp.MatchString(tripAdvisorHotelURLRegexp.String(), url)
		return match
		return tripAdvisorHotelURLRegexp.MatchString(url)
	case "RESTO":
		match, _ := regexp.MatchString(tripAdvisorRestaurantRegexp.String(), url)
		return match
		return tripAdvisorRestaurantRegexp.MatchString(url)
	case "AIRLINE":
		match, _ := regexp.MatchString(tripAdvisorAirlineRegexp.String(), url)
		return match
		return tripAdvisorAirlineRegexp.MatchString(url)
	default:
		return false
	}
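
This refactor swaps the package-level `regexp.MatchString`, which recompiles the pattern on every call and returns an error the caller must discard, for the `MatchString` method on the precompiled `*regexp.Regexp`. A self-contained sketch of the difference; the pattern below is a hypothetical stand-in for the repo's actual regexps:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical pattern for illustration; the repo's actual regexps differ.
var hotelURL = regexp.MustCompile(`^https://www\.tripadvisor\.com/Hotel_Review-.+\.html$`)

func main() {
	url := "https://www.tripadvisor.com/Hotel_Review-g123-d456-Reviews-Example.html"

	// Old style: recompiles the already-compiled pattern on every call and
	// discards a compile error that cannot occur for a valid pattern.
	match, _ := regexp.MatchString(hotelURL.String(), url)

	// New style: reuses the compiled program directly; no error to ignore.
	fmt.Println(match, hotelURL.MatchString(url)) // true true
}
```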
7 changes: 5 additions & 2 deletions go.work
@@ -1,3 +1,6 @@
go 1.21.1
go 1.21.4

use ./container_provisioner
use (
	./container_provisioner
	./scraper
)
47 changes: 29 additions & 18 deletions proxy_pool/docker-compose.yml
@@ -19,18 +19,18 @@ services:
    devices:
      - '/dev/net/tun:/dev/net/tun'
    # NOTE: These port mappings are only used when you are connecting to the proxy pool from outside of the Docker network without the container_provisioner (e.g. from your host machine).
    ports:
      # Squid proxy port
      - target: 8888
        published: 8888
        protocol: tcp
      # Dante proxy port
      - target: 8881
        published: 8881
        protocol: tcp
      - target: 8881
        published: 8881
        protocol: udp
    # ports:
    #   # Squid proxy port
    #   - target: 8888
    #     published: 8888
    #     protocol: tcp
    #   # Dante proxy port
    #   - target: 8881
    #     published: 8881
    #     protocol: tcp
    #   - target: 8881
    #     published: 8881
    #     protocol: udp
    # WARNING: Do not change the name of this network. It is used by the scraper to connect to the proxies.
    # At this moment, it is hardcoded in the container_provisioner when creating the containers.
    networks:
@@ -69,7 +69,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH14
        target: /VPN
        bind:
          create_host_path: true
@@ -92,15 +92,26 @@
    # Devices required to run OpenVPN
    devices:
      - '/dev/net/tun:/dev/net/tun'

    ports:
      # Squid proxy port
      - target: 8888
        published: 8888
        protocol: tcp
      # Dante proxy port
      - target: 8881
        published: 8881
        protocol: tcp
      - target: 8881
        published: 8881
        protocol: udp
    # WARNING: Do not change the name of this network. It is used by the scraper to connect to the proxies.
    # At this moment, it is hardcoded in the container_provisioner when creating the containers.
    networks:
      - scraper_vpn
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH46
        target: /VPN
        bind:
          create_host_path: true
@@ -131,7 +142,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH58
        target: /VPN
        bind:
          create_host_path: true
@@ -162,7 +173,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH66
        target: /VPN
        bind:
          create_host_path: true
@@ -192,7 +203,7 @@ services:
    volumes:
      # OpenVPN credentials and config (config.vpn and pass.txt)
      - type: bind
        source: ./VPN/CH76
        source: ./VPN/CH70
        target: /VPN
        bind:
          create_host_path: true