You must use Node.js version 0.10.x.
Download the dependencies:
npm install
Install MongoDB:
sudo apt-get install mongodb
node --debug-brk crawler.js
Then open http://localhost:8080/debug?port=5858 in your browser.
node --max-old-space-size=8192 --expose-gc crawler.js
The script's memory consumption grows with its execution time. In the first minute of execution, about 150 records are downloaded; nevertheless, this rate drops as memory consumption increases.
A real fix for this problem would be to study how the V8 garbage collector works and to take care to release closure variables so that less memory is consumed.
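To see how memory grows while the crawler runs, a small helper like the one below can be useful. This is only a sketch: heapUsedMb and collectIfAbove are hypothetical names that do not exist in the crawler, and the threshold is up to you. It relies on process.memoryUsage() and on global.gc, which is only defined when node is started with --expose-gc.

```javascript
// Hypothetical helpers; none of these names come from the crawler's source.

// Return the current V8 heap usage in megabytes.
function heapUsedMb() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

// Request a garbage collection when the heap exceeds thresholdMb.
// global.gc only exists when node is started with --expose-gc.
// Returns true if a collection was requested, false otherwise.
function collectIfAbove(thresholdMb) {
  if (typeof global.gc !== 'function') {
    return false;
  }
  if (heapUsedMb() <= thresholdMb) {
    return false;
  }
  global.gc();
  return true;
}

// Possible usage inside the crawler (assumed values, tune as needed):
// setInterval(function () {
//   console.log('heap: ' + heapUsedMb().toFixed(1) + ' MB');
//   collectIfAbove(4096);
// }, 10000);
```

Forcing a collection does not free objects that are still referenced from closures, so this helps to observe the problem more than to solve it.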
A workaround is to kill and restart the script at a fixed interval using cron. To do that, follow these steps:
Create the file /etc/cron.d/crawler (without the '.sh' extension) with the following content:
pkill node
cd "<PATH_OF_SOURCE>/node_scrap/"
<YOUR_NODE_PATH>/node --max-old-space-size=8192 index.js > /tmp/crawler.log &
crontab -e
Add this as the last line of the file:
*/2 * * * * /bin/sh /etc/cron.d/crawler
This runs the script /etc/cron.d/crawler automatically every 2 minutes: the running crawler is killed and started again.
First, you need to change the function "initialize" in index.js, whose content is something like this:
The first step is to execute the function processStates().
This function downloads the URLs of all the entities so that the process can run synchronously, and it keeps track of which records have already been downloaded.
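The idea of processing one item at a time while keeping track of what is already done can be sketched as follows. This is not the code of processStates(): processSequentially, worker and done are hypothetical names used only for illustration, and in the real crawler the worker would be the asynchronous download of one URL.

```javascript
// Hypothetical sketch; these names do not come from the crawler's source.

// Run worker(item, callback) on one item at a time, in order.
// done(err, completed) receives the items that finished successfully.
function processSequentially(items, worker, done) {
  var completed = [];

  function next(index) {
    if (index >= items.length) {
      return done(null, completed);
    }
    worker(items[index], function (err) {
      if (err) {
        // Stop at the first failure; completed shows where to resume.
        return done(err, completed);
      }
      completed.push(items[index]);
      next(index + 1);
    });
  }

  next(0);
}
```

Because each item starts only after the previous callback fires, a restart (for example by the cron job) can resume from the first item that is not marked as completed.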
More than 300,000 URLs will be downloaded. You can check them in the database:
use cnes2015
show collections
You must back up the collection entityurls to entityurls_bak by using the following command:
Then change the function initialize so that it calls the function that downloads the entity details:
You can now check your log with tail: tail -f /tmp/crawler.log
cd output
mongoexport --db cnes --collection entities --csv --fieldFile entities_fields.txt --out entities.csv
rm -rf output/*.csv && node crawler/exporter.js
To generate the dump:
mongodump -d cnes -o output
To restore:
mongorestore cnes