We need to scrape a large number of pages. On 100 URLs the scrape success rate is about 60%, which is fine for us. But as the number of URLs grows, the rate drops: for 400 URLs it is about 32%. At the start we create one browser instance, and then for each page we do the following (for context: every page must go through a proxy, picked at random from a list):
1. create a context with the `proxyServer` parameter
2. create a page in that context (so it uses the proxy)
3. filter out Stylesheet, Image, Font, and Media requests
4. wait until the page is fully loaded by repeatedly checking `page.network.wait_for_idle`
5. get the body
6. close the page
7. dispose of the context
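For reference, the per-page flow above can be sketched roughly like this. This is a sketch, not our exact code: the method name `scrape_page`, the `proxy_server:` keyword passed to `contexts.create`, and the 30-second timeout are placeholders for however the proxy and limits are actually wired in; the browser object is assumed to be a `Ferrum::Browser`, and `wait_for_idle` is treated as returning a boolean, as described above.

```ruby
# Resource types we abort so the page loads faster.
SKIPPED_TYPES = %w[Stylesheet Image Font Media].freeze

# Sketch of one page scrape. Everything on `browser` is duck-typed
# (contexts/create_page/network/...), so the flow can be tested without Chrome.
def scrape_page(browser, url, proxy)
  # The post passes the proxy via the context's proxyServer parameter;
  # the exact keyword name here is an assumption.
  context = browser.contexts.create(proxy_server: proxy)
  page = context.create_page
  page.network.intercept
  page.on(:request) do |request|
    SKIPPED_TYPES.include?(request.resource_type) ? request.abort : request.continue
  end
  page.go_to(url)
  # Treat a timed-out wait_for_idle as an unsuccessful scrape.
  return nil unless page.network.wait_for_idle(timeout: 30)
  page.body
ensure
  page&.close      # always close the page,
  context&.dispose # and always dispose of the context, even on errors
end
```

The `ensure` block is the important part: if an exception (e.g. `Ferrum::StatusError`) fires mid-flow and the page/context are not cleaned up, targets accumulate in the browser.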
If any of the following errors occurs during this flow, the scrape counts as unsuccessful: `Ferrum::StatusError`, `Ferrum::PendingConnectionsError`, or `Ferrum::NoSuchTargetError`. It also counts as unsuccessful if the page never fully loads (we wait too long for `page.network.wait_for_idle` to succeed).
This flow runs for every page. What can you suggest to improve the success rate? Maybe we should create a context for each proxy in the list, then pick one random context, open a page in it, scrape, and close the page? Or should we even create a browser per proxy, pick one random browser, open a page in it, scrape, and close the page? Any suggestions? The last idea that comes to mind is to quit the browser after a certain number of scraped pages and create a new one, so the success rate stays at one level.
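That last idea (recycling the browser every N pages) combines naturally with proxy rotation, and the bookkeeping for it is small enough to sketch. The class name, `restart_every: 50` default, and method names below are all made up for illustration; the actual browser calls stay outside the class so the policy itself is testable without Chrome.

```ruby
# Bookkeeping only: picks a random proxy and tells the caller when the
# browser should be recycled. No Ferrum calls in here by design.
class ScrapeRotation
  def initialize(proxies, restart_every: 50)
    @proxies = proxies
    @restart_every = restart_every
    @scraped = 0
  end

  # Random proxy for the next page, as in the original flow.
  def next_proxy
    @proxies.sample
  end

  # Call once per finished page (successful or not).
  def record_scrape!
    @scraped += 1
  end

  # True every `restart_every` pages.
  def restart_browser?
    @scraped > 0 && (@scraped % @restart_every).zero?
  end
end
```

In the scraping loop you would then call `browser.restart` (Ferrum's `Browser#restart` spawns a fresh Chrome) whenever `restart_browser?` returns true, which caps how much state and memory a single Chrome instance can accumulate.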
Also worth mentioning: after each run we have a lot of Chrome processes left hanging. Maybe this is the problem and the code is just spawning Chrome processes? About RAM: for 100 URLs peak usage was 5.6 GB; for 400 URLs peak usage was 15.7 GB.
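Those hanging Chrome processes are usually browsers that were never quit because an exception aborted the run before cleanup. A guard like the one below (the `with_browser` name is made up; the quit call is duck-typed so the guard itself is testable) makes sure `browser.quit`, which tears down the Chrome process, runs even when the scrape loop crashes:

```ruby
# Run a scraping block and always quit the browser afterwards,
# even if the block raises. Pass a Ferrum::Browser in real use.
def with_browser(browser)
  yield browser
ensure
  # Swallow quit errors: a browser that already died must not
  # mask the original exception from the block.
  browser.quit rescue nil
end
```

Leftover processes from previous crashed runs still need to be killed once by hand, but with the guard in place new runs should no longer leak them, which may also explain part of the RAM growth.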
Thanks.
@route