Use the Scrapy framework to scrape data about men's hoodies and their reviews.
Website: https://www.dresslily.com/hoodies-c-181-page-1.html
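The listing URL above ends in a page number, so further listing pages can presumably be reached by changing that number. A minimal sketch, assuming the pattern holds for every page; `LISTING_URL` and `listing_page_url` are hypothetical names, not taken from this repo:

```python
# Assumption: only the trailing page number changes between listing pages.
LISTING_URL = "https://www.dresslily.com/hoodies-c-181-page-{page}.html"


def listing_page_url(page: int) -> str:
    """Build the URL of one listing page (pages are numbered from 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return LISTING_URL.format(page=page)
```

A spider could yield one request per page URL until a page returns no products.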
Items to parse:
- Men Hoodies
  - product_id
  - product_url
  - name
  - discount (%)
  - discounted_price (0 if no sale)
  - original_price
  - total_reviews
  - product_info (formatted string, e.g. “Occasion:Daily;Style:Fashion”)
- Reviews
  - product_id
  - rating
  - timestamp (convert review date to Unix timestamp)
  - text
  - size
  - color
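Three of the fields above need a small transformation: `product_info` is a joined string, `discounted_price` falls back to 0, and the review date becomes a Unix timestamp. The helpers below are a sketch of those rules, not the repo's actual code. The function names are hypothetical, and the default date format is an assumption that has to be adjusted to whatever the site really renders. Dates are treated as UTC, which is also an assumption.

```python
from datetime import datetime, timezone


def format_product_info(pairs):
    """Join (key, value) pairs into 'Key:Value;Key:Value'."""
    return ";".join(f"{key.strip()}:{value.strip()}" for key, value in pairs)


def discounted_price_or_zero(original_price, sale_price):
    """Return the sale price, or 0 when the product is not on sale."""
    if sale_price is None or sale_price >= original_price:
        return 0
    return sale_price


def to_unix_timestamp(date_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a review date and return seconds since the Unix epoch.

    Assumption: `fmt` is a placeholder format and the date is in UTC.
    """
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

For example, `format_product_info([("Occasion", "Daily"), ("Style", "Fashion")])` produces the string shown in the `product_info` item above.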
Requirements:
- Scrapy framework
- Splash
Usage:
- Clone this repo and install dependencies (Poetry recommended)
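- Run Splash. Because Splash leaks memory, set a memory limit and a restart policy. These are `docker run` options, so they must come before the image name:
docker run -it -p 8050:8050 --memory=3g --restart=always scrapinghub/splash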
- Start the crawler
cd dresslily
scrapy crawl men_hoodies -s JOBDIR=crawls/men_hoodies
or
chmod +x start_crawler.sh
./start_crawler.sh
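Once the container is running, Splash listens on port 8050 and renders pages through its HTTP API. The sketch below shows how a page URL could be routed through Splash's `render.html` endpoint with its `url` and `wait` arguments. It is an illustration only: the spider in this repo may use a different integration (for example the scrapy-splash plugin), and `SPLASH_URL` and `splash_render_url` are hypothetical names.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable on the port published by the docker command.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=1.0):
    """Return a Splash render.html URL that loads `page_url` and waits
    `wait` seconds for JavaScript before returning the rendered HTML."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"
```

Requesting the returned URL with any HTTP client yields the rendered HTML of the target page.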
After scraping finishes, the data is available as CSV files (sample data):
- simporter-task-scrapy/dresslily/scraped_data/hoodies.csv
- simporter-task-scrapy/dresslily/scraped_data/reviews.csv
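Both files share the `product_id` field, so reviews can be matched to their hoodie. A minimal sketch using only the standard library, assuming the CSV header names match the field names listed earlier; `group_reviews_by_product` is a hypothetical helper:

```python
import csv
from collections import defaultdict


def group_reviews_by_product(reviews_file):
    """Read review rows from an open CSV file and group them by product_id."""
    grouped = defaultdict(list)
    for row in csv.DictReader(reviews_file):
        # Assumption: the CSV has a 'product_id' column.
        grouped[row["product_id"]].append(row)
    return dict(grouped)
```

To use it, open `reviews.csv` with `open(path, newline="", encoding="utf-8")` and pass the file object to the function. All values come back as strings, as is usual for `csv.DictReader`.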