We need to scrape a large number of pages. On 100 URLs the scrape success rate is about 60%, which is fine for us. But as the number of URLs grows, the rate drops: for 400 URLs it is about 32%. At the start we create one browser instance, and then for each page we do the following (for context: every page must go through a proxy, picked at random from a list):
1. create a context with the `proxyServer` parameter
2. create a page in that context (so it uses the proxy)
3. filter out Stylesheet, Image, Font, and Media requests
4. wait until the page is fully loaded by repeatedly checking `page.network.wait_for_idle`
5. get the body
6. close the page
7. dispose of the context
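For reference, the per-page flow above can be sketched roughly like this. This is a sketch, not our exact code: the method name `scrape_page`, the `proxy_server:` keyword passed to `contexts.create`, and the 30-second timeout are placeholders for however the proxy and limits are actually wired in; the browser object is assumed to be a `Ferrum::Browser`, and `wait_for_idle` is treated as returning a boolean, as described above.

```ruby
# Resource types we abort so the page loads faster.
SKIPPED_TYPES = %w[Stylesheet Image Font Media].freeze

# Sketch of one page scrape. Everything on `browser` is duck-typed
# (contexts/create_page/network/...), so the flow can be tested without Chrome.
def scrape_page(browser, url, proxy)
  # The post passes the proxy via the context's proxyServer parameter;
  # the exact keyword name here is an assumption.
  context = browser.contexts.create(proxy_server: proxy)
  page = context.create_page
  page.network.intercept
  page.on(:request) do |request|
    SKIPPED_TYPES.include?(request.resource_type) ? request.abort : request.continue
  end
  page.go_to(url)
  # Treat a timed-out wait_for_idle as an unsuccessful scrape.
  return nil unless page.network.wait_for_idle(timeout: 30)
  page.body
ensure
  page&.close      # always close the page,
  context&.dispose # and always dispose of the context, even on errors
end
```

The `ensure` block is the important part: if an exception (e.g. `Ferrum::StatusError`) fires mid-flow and the page/context are not cleaned up, targets accumulate in the browser.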
If any of the following errors occurs during this flow, the scrape counts as unsuccessful: `Ferrum::StatusError`, `Ferrum::PendingConnectionsError`, or `Ferrum::NoSuchTargetError`. It also counts as unsuccessful if the page never fully loads (we wait too long for `page.network.wait_for_idle` to succeed).
This flow runs for every page. What can you suggest to improve the success rate? Maybe we should create a context for each proxy in the list, then pick one random context, open a page in it, scrape, and close the page? Or should we even create a browser per proxy, pick one random browser, open a page in it, scrape, and close the page? Any suggestions? The last idea that comes to mind is to quit the browser after a certain number of scraped pages and create a new one, so the success rate stays at one level.
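That last idea (recycling the browser every N pages) combines naturally with proxy rotation, and the bookkeeping for it is small enough to sketch. The class name, `restart_every: 50` default, and method names below are all made up for illustration; the actual browser calls stay outside the class so the policy itself is testable without Chrome.

```ruby
# Bookkeeping only: picks a random proxy and tells the caller when the
# browser should be recycled. No Ferrum calls in here by design.
class ScrapeRotation
  def initialize(proxies, restart_every: 50)
    @proxies = proxies
    @restart_every = restart_every
    @scraped = 0
  end

  # Random proxy for the next page, as in the original flow.
  def next_proxy
    @proxies.sample
  end

  # Call once per finished page (successful or not).
  def record_scrape!
    @scraped += 1
  end

  # True every `restart_every` pages.
  def restart_browser?
    @scraped > 0 && (@scraped % @restart_every).zero?
  end
end
```

In the scraping loop you would then call `browser.restart` (Ferrum's `Browser#restart` spawns a fresh Chrome) whenever `restart_browser?` returns true, which caps how much state and memory a single Chrome instance can accumulate.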
Also worth mentioning: after each run we have a lot of Chrome processes left hanging. Maybe this is the problem and the code is just spawning Chrome processes? About RAM: for 100 URLs peak usage was 5.6 GB; for 400 URLs peak usage was 15.7 GB.
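Those hanging Chrome processes are usually browsers that were never quit because an exception aborted the run before cleanup. A guard like the one below (the `with_browser` name is made up; the quit call is duck-typed so the guard itself is testable) makes sure `browser.quit`, which tears down the Chrome process, runs even when the scrape loop crashes:

```ruby
# Run a scraping block and always quit the browser afterwards,
# even if the block raises. Pass a Ferrum::Browser in real use.
def with_browser(browser)
  yield browser
ensure
  # Swallow quit errors: a browser that already died must not
  # mask the original exception from the block.
  browser.quit rescue nil
end
```

Leftover processes from previous crashed runs still need to be killed once by hand, but with the guard in place new runs should no longer leak them, which may also explain part of the RAM growth.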
Thanks.
@route