This repo contains a Python script that uses Scrapy to scrape motorcycle attributes from a Polish website and Selenium to enter them into an online import cost estimation tool.
First, scrape motorcycle attributes from this website. Specifically, we want these fields: Name, Year, Km driven, CC, Horsepower, Price, Image Link, and Listing Link. The screenshot below shows an example of a motorcycle listing.
After crawling the motorcycle attributes, we want to use this website to calculate the total price of each motorcycle in Norwegian Krone (NOK), including VAT, duties, and fees. This website has several fields that must be populated to display the full import price of the motorcycle.
Finally, we want to combine the data crawled from the motorcycle website with the price from the calculator into one JSON/CSV file.
To scrape the data from the Polish website, I used the Scrapy framework with ScraperAPI. ScraperAPI is a proxy solution for web crawling that is designed to make scraping the web at scale as simple as possible. It does that by removing the hassle of finding high-quality proxies, rotating proxy pools, detecting bans, solving CAPTCHAs, managing geotargeting, and rendering JavaScript. I explained how to integrate ScraperAPI with Scrapy in several other project descriptions. You can check any of these projects to get an overview of how to do that...
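As a rough illustration of the integration mentioned above, the sketch below wraps a target URL in a ScraperAPI request URL, which a Scrapy spider could then request instead of the raw URL. The endpoint format and the `api_key` and `url` query parameters reflect how ScraperAPI is commonly used, but treat them as assumptions and check the current ScraperAPI docs. `API_KEY`, `get_scraperapi_url`, and the example URL are placeholders, not names taken from this repo.

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key


def get_scraperapi_url(url, api_key=API_KEY):
    """Wrap a target URL so the request is routed through ScraperAPI.

    Assumes the endpoint accepts `api_key` and `url` query parameters.
    """
    payload = {"api_key": api_key, "url": url}
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return "http://api.scraperapi.com/?" + urlencode(payload)


if __name__ == "__main__":
    print(get_scraperapi_url("https://example.com/listing?page=2"))
```

Encoding the target URL matters here: without it, the target's own query string would be read as extra parameters of the proxy request.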
The Scrapy code is located in motorcycle_crawling\motorcycle_crawling\spiders. It is composed of two Python files, motorcycle_listing_page.py and motorcycle_product_page.py. The first script scrapes the listing page for the motorcycle name and product page URL. The second script takes the output of the first script and crawls all the other fields mentioned above in section 1. An excerpt of the scraped dataset is shown below...
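One way the handoff between the two scripts could work is sketched below: the listing output is read back from disk and reduced to a list of product page URLs. This is only a guess at the mechanism, not the repo's actual code. The helper name `load_product_urls` and the `"name"` and `"url"` keys are hypothetical, and the listing file is assumed to be a JSON array.

```python
import json


def load_product_urls(listing_path):
    """Read the listing spider's JSON output and return product page URLs.

    Assumes a JSON array of objects with hypothetical "name" and "url" keys.
    """
    with open(listing_path, encoding="utf-8") as f:
        listings = json.load(f)
    # Skip entries without a URL so the second spider never gets an empty one.
    return [item["url"] for item in listings if item.get("url")]
```

In a Scrapy spider, a list like this could be assigned to `start_urls` so that each product page is requested in turn.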
For this part, I chose Python Selenium, which is a powerful tool for controlling web browsers via code. The steps I had to automate are shown in the GIF below.
The Selenium code is located in norwegian_calculator.py. The input to this script is the dataset scraped from the Polish website. In addition, the motorcycle prices in PLN need to be converted to NOK. To do this, we can call the Frankfurter API each time we want to calculate the import price. The API returns the converted price in JSON format:
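import requests

def convert_currency(amount, from_currency, to_currency):
    # Ask the Frankfurter API to convert the amount at the latest available rate.
    response = requests.get(f"https://api.frankfurter.app/latest?amount={float(amount)}&from={from_currency}&to={to_currency}")
    # The converted amount sits under "rates", keyed by the target currency code.
    converted = response.json()['rates'][to_currency]
    return converted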
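To make the JSON shape concrete, here is an offline sketch that parses a response body of the kind the function above reads, with the converted amount under `rates` and keyed by the target currency. The sample payload is invented for illustration: its numbers are not real exchange rates, and the `amount`, `base`, and `date` keys are an assumption about what the API returns alongside `rates`. `extract_converted` is a hypothetical helper.

```python
import json

# Illustrative payload only: the values are invented, not real rates.
SAMPLE_RESPONSE = (
    '{"amount": 1000.0, "base": "PLN", "date": "2000-01-01", '
    '"rates": {"NOK": 2500.0}}'
)


def extract_converted(response_text, to_currency):
    """Return the converted amount from a Frankfurter-style JSON body."""
    data = json.loads(response_text)
    rates = data.get("rates", {})
    # Return None instead of raising if the target currency is absent.
    return rates.get(to_currency)


if __name__ == "__main__":
    print(extract_converted(SAMPLE_RESPONSE, "NOK"))
```

Returning None for a missing currency, rather than raising a KeyError, would let the caller treat a failed conversion like any other missing attribute.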
After the Selenium code terminates, a column called calculated_price is appended to the data frame containing the scraped data. This column shows the import price in NOK. An excerpt of the full dataset is shown below...
Note: Whenever one of the attributes required to calculate the price is missing, calculated_price is set to None. For the sake of demonstration, I chose to run the script for the first 8 entries only. For that reason, final_dataset.json does not contain the calculated_price for all scraped motorcycles. However, the code is configured to loop over all the entries in product_page.json.
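The None handling described in the note can be sketched in plain Python as follows. This is a simplified stand-in, not the repo's code: the actual script works on a data frame and gets the price from the calculator website through Selenium, whereas here `calculate` is any callable supplied by the caller. `REQUIRED_FIELDS`, `add_calculated_price`, `to_csv`, and the sample records are all hypothetical.

```python
import csv
import io
import json

# Hypothetical: the attributes the calculator needs before it can price a bike.
REQUIRED_FIELDS = ("year", "cc", "price")


def add_calculated_price(records, calculate):
    """Return copies of records with a calculated_price key added.

    `calculate` stands in for the Selenium step. Records with a missing
    required field get calculated_price = None.
    """
    out = []
    for rec in records:
        row = dict(rec)  # copy, so the scraped data is left untouched
        complete = all(rec.get(f) not in (None, "") for f in REQUIRED_FIELDS)
        row["calculated_price"] = calculate(rec) if complete else None
        out.append(row)
    return out


def to_csv(rows):
    """Serialize rows to CSV text, using the first row's keys as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    bikes = [
        {"name": "Bike A", "year": 2015, "cc": 650, "price": 20000},
        {"name": "Bike B", "year": 2018, "cc": None, "price": 30000},
    ]
    # Dummy calculator for demonstration; the real value comes from the website.
    rows = add_calculated_price(bikes, calculate=lambda r: r["price"] * 2)
    print(json.dumps(rows, indent=2))
    print(to_csv(rows))
```

Passing a slice such as `records[:8]` is one way to restrict a demonstration run to the first 8 entries.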
If you have any questions or wish to build a scraper for a particular use case (e.g., competitive intelligence or price comparison), feel free to contact me on LinkedIn.