Web Crawler for gathering financial news for scientific research purposes.
The stack below was chosen mainly for its ease of installation and configuration, as well as its efficiency and portability.
- Language: Python (3.7.4)
- Core crawler engine: Scrapy (1.8.0)
This system was developed on Ubuntu 16.04 but should work properly on any other operating system (OS X, Windows, etc.).
However, this guide only covers plugins and packages that are not already installed on this OS. For this reason, we assume that a Python interpreter and pip are ready for use and outside the scope of this document.
- Now install the pipenv dependency manager:
$ pip install --user pipenv
Now we'll start setting up the project.
- Clone the repo from GitHub and change to the project root directory. Then install the project dependencies and activate the Python virtual environment by running:
$ pipenv install
$ pipenv shell
Now we will run the project and get the query results.
- The project will perform the following actions:
  - (i) Download raw data (JSON or HTML) in a controlled, non-abusive manner, respecting the instructions in `robots.txt` files (see the settings sketch below)
  - (ii) Save the data to a specific folder according to the webpage source and the selected stock or asset
  - (iii) Parse the HTML or JSON and extract only the preselected fields into a CSV file
- To run all the steps above:

$ python scraper
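In Scrapy, polite crawling and `robots.txt` compliance are driven by the project settings. Below is a minimal sketch of the kind of `settings.py` values such a crawler relies on; the specific numbers and choices here are illustrative assumptions, not the project's actual configuration:

```python
# settings.py -- polite-crawling settings (values are illustrative assumptions)

BOT_NAME = "scraper"

# Respect the rules published in each site's robots.txt file
ROBOTSTXT_OBEY = True

# Keep request volume low so the crawl stays non-abusive
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 2  # seconds between requests to the same domain

# Let Scrapy back off automatically when the server slows down
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
```

With `ROBOTSTXT_OBEY = True`, Scrapy's built-in RobotsTxtMiddleware filters out any request disallowed by the target site's `robots.txt`.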
- To run only step (iii) above, pass the argument `--step=gen_csv`:

$ python scraper --step=gen_csv
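For illustration, the `gen_csv` step boils down to reading the raw files saved in step (ii) and writing only the preselected fields. Here is a minimal sketch of that idea, assuming hypothetical field names and folder layout (the project's real field list and paths may differ):

```python
import csv
import json
from pathlib import Path

# Hypothetical field selection; the real project defines its own list.
FIELDS = ["timestamp", "headline", "url"]

def json_to_csv(raw_dir: str, out_file: str) -> None:
    """Read raw JSON article dumps and keep only the preselected fields."""
    with open(out_file, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for path in Path(raw_dir).glob("*.json"):
            article = json.loads(path.read_text(encoding="utf-8"))
            writer.writerow({field: article.get(field, "") for field in FIELDS})

if __name__ == "__main__":
    # Hypothetical paths following the "folder per source and stock" layout
    json_to_csv("data/wsj_news/btcusd", "data/wsj_news/btcusd.csv")
```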
- Only stocks present in `companies_list.csv` are allowed to run when `--website=wsj_news`. If necessary, add new entries to this file, following the existing entries as examples.
- Required params:
  - `--crawl_type`: allowed values are `api` and `website`
  - `--stock`: a single stock ticker present in the file `companies_list.csv`, or the value `list` to run for all stocks in `companies_list.csv`
- When no params are passed, the scraper runs a crawler with the following default characteristics:
  - `--website=wsj_news`: the Wall Street Journal API
  - `--mode=default`: default operation mode
  - `--step=all`: run all the steps mentioned before
  - `--max_requests=2`: maximum number of consecutive requests
  - `--max_elements=100`: maximum number of elements to fetch
  - `--start_time`: the current time
  - `--end_time`: unused, as the system defaults to the maximum number of consecutive requests
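These options map naturally onto a standard `argparse` parser. The sketch below shows how they could be declared with the defaults documented above; only the flag names and default values come from this README, while the parsing code itself is an assumption about how the scraper is wired:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options documented in this README."""
    parser = argparse.ArgumentParser(prog="scraper")
    parser.add_argument("--crawl_type", choices=["api", "website"])
    parser.add_argument("--website", default="wsj_news")
    parser.add_argument("--stock",
                        help="ticker from companies_list.csv, or 'list' for all stocks")
    parser.add_argument("--mode", choices=["default", "greedy"], default="default")
    parser.add_argument("--step", default="all")
    parser.add_argument("--max_requests", type=int, default=2)
    parser.add_argument("--max_elements", type=int, default=100)
    parser.add_argument("--start_time")  # defaults to the current time at runtime
    parser.add_argument("--end_time")    # ignored in default mode
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```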
- Example to run with all possible fields:
$ python scraper --crawl_type=api --website=wsj_news --stock=btcusd --mode=default --step=all --max_requests=2 --end_time=2019-10-18T18:59:08 --start_time=2019-11-12T22:30:00
- Example to run in greedy mode:
  Beware: greedy mode runs until exhaustion and may lead to blockage if not used properly. This mode ignores `end_time` and `max_requests`, but it can consider `start_time`.
$ python scraper --website=wsj_news --stock=btcusd --mode=greedy
Tests
Unit and integration tests will guarantee consistency and good maintainability.
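As an example of the direction, a unit test for the CSV step could exercise the field-selection logic in isolation. Here is a pytest sketch built around the hypothetical `json_to_csv` helper sketched earlier (names and import path are assumptions, not project code):

```python
import csv
import json

# Hypothetical import path; json_to_csv is the helper sketched above.
from scraper.gen_csv import json_to_csv

def test_json_to_csv_keeps_only_selected_fields(tmp_path):
    # Arrange: one raw JSON article with an extra field that should be dropped
    raw = tmp_path / "raw"
    raw.mkdir()
    (raw / "article.json").write_text(json.dumps({
        "timestamp": "2019-10-18T18:59:08",
        "headline": "Bitcoin rallies",
        "url": "https://example.com/article",
        "body": "should be dropped",
    }))
    out = tmp_path / "out.csv"

    # Act: generate the CSV from the raw folder
    json_to_csv(str(raw), str(out))

    # Assert: only the preselected fields survive
    rows = list(csv.DictReader(out.open()))
    assert rows[0]["headline"] == "Bitcoin rallies"
    assert "body" not in rows[0]
```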
Job scheduler
Adding a feature for scheduling each job separately will improve performance.
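A lightweight way to get there is to launch one scraper process per stock on a timer. The standard-library sketch below illustrates that idea; the per-stock job split, the stock list, and the interval are assumptions about how the scheduler would be organized:

```python
import subprocess
import time

# Hypothetical job list; the real list would come from companies_list.csv.
STOCKS = ["btcusd", "aapl"]
INTERVAL_SECONDS = 60 * 60  # run the whole batch every hour

def run_jobs() -> None:
    """Launch one scraper run per stock so each job can be tracked separately."""
    for stock in STOCKS:
        subprocess.run(
            ["python", "scraper", "--website=wsj_news", f"--stock={stock}"],
            check=False,
        )

if __name__ == "__main__":
    while True:
        run_jobs()
        time.sleep(INTERVAL_SECONDS)
```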
Work for any site
Right now the code only works for a specific news website; it will be adjusted to support other candidate websites (see the design sketch after this list):
- Yahoo Finance
- Business Insider
- News BTC
- Be In Crypto
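One way to support several sites without touching the crawl loop is to key each `--website` value to its own spider. A possible design sketch with hypothetical class names (none of these exist in the project yet):

```python
# Hypothetical registry mapping the --website value to a spider class.
# The classes are placeholders that only illustrate the design.

class WsjNewsSpider: ...
class YahooFinanceSpider: ...
class BusinessInsiderSpider: ...

SPIDERS = {
    "wsj_news": WsjNewsSpider,
    "yahoo_finance": YahooFinanceSpider,
    "business_insider": BusinessInsiderSpider,
}

def spider_for(website: str):
    """Return the spider class registered for a given --website value."""
    try:
        return SPIDERS[website]
    except KeyError:
        raise ValueError(f"Unsupported website: {website}")
```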
Design improvements
- It seems Scrapy expects all crawl loops within the same website to be self-contained
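In Scrapy's model, each crawl loop lives inside the spider itself: pagination and follow-up requests are yielded as callbacks rather than driven from outside. A minimal illustration of that pattern; the selectors and URL below are placeholders, not the project's:

```python
import scrapy

class NewsSpider(scrapy.Spider):
    """Illustrative only: the whole loop is expressed through yielded callbacks."""
    name = "news_example"
    start_urls = ["https://example.com/news"]  # placeholder URL

    def parse(self, response):
        # Emit one item per article link on the current page
        for href in response.css("a.article::attr(href)").getall():
            yield {"url": response.urljoin(href)}

        # Continue the loop by following the next page, if any
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```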
Known issues
- See the open issues on GitHub
- Go play around! =)