Member | Task 1 | Task 2
---|---|---
albertparedandan | Basic 4 | Basic 6 |
hanifdean | Basic 1 | Basic 5 |
nwihardjo | Basic 2 | Basic 3 |
- amazon.com is used as the additional reselling portal
- Prices are shown in USD. A price of 0 is used when no price information is available
- The posted date timezone is HKT (Hong Kong Time) (TODO: check what is shown in the posted date when it is null)
- Pagination of the amazon portal is not handled
- Keywords that return a whole new sub-section on amazon (e.g. book) are not handled, since they are not specific enough and do not return a plain list of available items in the portal. In this case the result will be empty
- Items listed without any title / name are not scraped, as they are not valid items
- The main price of an amazon item is used, not the 'more buying options' or 'offer' price (usually a cheaper price for the same item listed by a different seller). When the main price is a range between two prices (usually due to different sizes, colours, etc.), the average of the range is used. When no main price is available, the cheapest 'more buying options' or 'offer' price is used as a rough estimate of the item's price (see the sketch after this list)
- The posted date of an amazon item is scraped from the date on which the item was first listed
- Service listings on the amazon portal (which are not items) are handled as well
- If results are found but all prices are 0, the average selling price and lowest selling price are displayed as 0.0 rather than "-". "-" is only displayed when no results are found
- Functions without an explicit access modifier are purposely left package-private to allow unit testing
- As craigslist scraping is handled concurrently, the console output will only be
[int] page(s) of craigslist are being scraped in parallel ...
instead of how many pages have been scraped so far, since multiple pages are scraped at the same time / in parallel.
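The price rules above can be summarised as in the sketch below. It is illustrative only; `resolvePrice` and its parameters are hypothetical names, not the project's actual API.

```java
public class PriceRuleSketch {

    // Hypothetical helper mirroring the price rules described in the list above.
    static double resolvePrice(String mainPriceText, Double cheapestOffer) {
        if (mainPriceText != null && !mainPriceText.trim().isEmpty()) {
            // A main price given as a range, e.g. "$10.00 - $15.00", is averaged.
            String[] bounds = mainPriceText.replace("$", "").split("-");
            if (bounds.length == 2) {
                double low = Double.parseDouble(bounds[0].trim());
                double high = Double.parseDouble(bounds[1].trim());
                return (low + high) / 2.0;
            }
            return Double.parseDouble(bounds[0].trim());
        }
        // No main price: fall back to the cheapest 'more buying options' / 'offer' price.
        if (cheapestOffer != null) {
            return cheapestOffer;
        }
        // No price information at all: treat the price as 0.
        return 0.0;
    }

    public static void main(String[] args) {
        System.out.println(resolvePrice("$10.00 - $15.00", null)); // 12.5
        System.out.println(resolvePrice(null, 8.99));              // 8.99
        System.out.println(resolvePrice(null, null));              // 0.0
    }
}
```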
A WebScraper that scrapes both the amazon and New York craigslist portals based on the specified keyword. It utilises multi-threading to support concurrency on craigslist pagination and the retrieval of amazon items' posted dates, which significantly improves performance.
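The sketch below illustrates one way such parallel scraping can be structured with a thread pool, including the single progress message mentioned in the notes above; `scrapePage` is a placeholder, not the project's actual method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScrapeSketch {

    public static void main(String[] args) throws Exception {
        int pages = 5;
        // Single progress message, as described in the notes above.
        System.out.println(pages + " page(s) of craigslist are being scraped in parallel ...");

        ExecutorService pool = Executors.newFixedThreadPool(pages);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int p = 0; p < pages; p++) {
            final int page = p;
            futures.add(pool.submit(() -> scrapePage(page)));
        }

        List<String> items = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            items.addAll(f.get()); // wait for each page to finish
        }
        pool.shutdown();
        System.out.println("Scraped " + items.size() + " item(s) in total");
    }

    // Placeholder for the real page scraper, which would fetch and parse one
    // craigslist result page.
    static List<String> scrapePage(int page) {
        return Collections.emptyList();
    }
}
```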
- Java 8 JDK with Gradle
- JavaFX for GUI framework
- JUnit 4.12 for testing suite
- Jacoco for test coverage measurement
We configure the project with Gradle. Gradle can be thought of as a Makefile-like tool that streamlines compilation for you.
- Go to your project root folder
- Type
gradlew run
. This will build and run the project.
If you want to just rerun the project without rebuilding it,
- Go to the folder
build\jar\
- Double-click the jar file (e.g.
webscraper-0.1.0.jar
). Yes, you need a GUI screen to run it.
- Go to your project root folder
- Type
./gradlew build
. This will build the project. Then type
./gradlew run
to run the application.
If you want to just rerun the project without rebuilding it,
- Go to the folder
build/jar/
- Double-click the jar file (e.g.
webscraper-0.1.0.jar
) or simply run
./gradlew run
- Go to the project root directory
- Run
./gradlew test jacocoTestReport
to generate the test report and coverage. It will run all unit tests and generate the coverage report
- The Jacoco coverage report can be accessed at
./build/jacocoHTML/index.html
- The unit test report is at
./build/reports/tests/test/index.html
Some of the unit tests use cached pages from both portals. Testing utilises Reflection to unit-test private functions (not a good practice, I know).
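A minimal sketch of that reflection approach is shown below; `PriceParser` and `parsePrice` are hypothetical stand-ins, not the project's actual classes.

```java
import static org.junit.Assert.assertEquals;

import java.lang.reflect.Method;
import org.junit.Test;

public class ReflectionTestSketch {

    // Stand-in for a production class with a private helper method.
    static class PriceParser {
        private double parsePrice(String raw) {
            return Double.parseDouble(raw.replace("$", ""));
        }
    }

    @Test
    public void invokesPrivateMethodViaReflection() throws Exception {
        PriceParser parser = new PriceParser();
        Method m = PriceParser.class.getDeclaredMethod("parsePrice", String.class);
        m.setAccessible(true); // bypass the private modifier
        double price = (double) m.invoke(parser, "$12.50");
        assertEquals(12.5, price, 1e-6);
    }
}
```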
Here is the latest javadoc. Or, if you prefer to compile it yourself:
- In the project root directory, run
./gradlew javadoc
to generate the javadoc
- The documentation is then available at
./build/docs/javadoc/index.html
.