Address deadlock when fetching pages. #37

Open
wants to merge 1 commit into base: master
Conversation

kwadkore
Contributor

The default size for a channel is 0, so anything sending on a default-sized channel will block until something else reads from the other side. Since workers read from the Jobs channel, but also send on the channel in the event of a failure, if all of them end up failing on a page and attempting to send the failed page through the same channel, there'll be no workers left to read from the channel and the whole process will hang. Increase the channel size to the expected number of pages to reduce the chance of workers blocking themselves.

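For reference, here is a minimal, self-contained sketch of the buffered-channel approach described above. The worker loop, the counts, and the names (`jobs`, `done`, `failedOnce`, `numPages`, `numWorkers`) are illustrative stand-ins, not the scraper's actual code; each page is made to "fail" once so the re-queue path is exercised while the program still terminates.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const numWorkers = 5
	const numPages = 10

	// Buffered to the expected number of pages: a worker can put a failed
	// page back without waiting for another goroutine to receive it.
	jobs := make(chan int, numPages)
	done := make(chan int, numPages)
	var failedOnce sync.Map

	for w := 0; w < numWorkers; w++ {
		go func() {
			for page := range jobs {
				if _, seen := failedOnce.LoadOrStore(page, true); !seen {
					// Simulated failure on the first attempt: re-queue the page.
					// With this buffer size, the send can never leave every
					// worker blocked at once.
					jobs <- page
					continue
				}
				fmt.Println("fetched page", page)
				done <- page
			}
		}()
	}

	for p := 1; p <= numPages; p++ {
		jobs <- p
	}
	for i := 0; i < numPages; i++ {
		<-done // exit once every page has eventually succeeded
	}
}
```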
@Akenaide
Owner

This is the expected behaviour. I don't think it's good practice to spam requests as a scraper. If you really want this feature, you can add it as an option.

@kwadkore
Contributor Author

Hanging indefinitely does not sound like expected behavior. As the code stands right now, if a page fails, a worker can end up hanging indefinitely; in the worst case, all workers hang and nothing proceeds further. Say, for example, there are 5 workers and 10 pages, and the first five pages all fail. All 5 workers will block at line 170, and because nothing else reads from the Jobs channel, pages 6-10 won't even be attempted. If the intention is to not retry a failed page, then the solution is to take out line 170 instead. If the intention is to stop the scraper altogether when any page fails, then code should be added to stop all the workers when a failed page is encountered.

The code in this pull request doesn't increase the rate at which pages are tried. It just prevents the hang by letting workers move past line 170 and try another page.
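To make the failure mode concrete, here is a reduced, standalone reproduction of the scenario above. The worker loop and counts are made up for illustration (only "line 170" refers to the real source); every page is treated as failed and re-queued on the same unbuffered channel, so all five workers eventually block on the send and the Go runtime aborts with "all goroutines are asleep - deadlock!".

```go
package main

import "fmt"

func main() {
	const numWorkers = 5

	// Default-sized (unbuffered) channel: every send waits for a receiver.
	jobs := make(chan int)

	for w := 0; w < numWorkers; w++ {
		go func() {
			for page := range jobs {
				// Pretend the fetch failed and push the page back onto the
				// same channel, mirroring the send at line 170.
				fmt.Println("re-queueing page", page)
				jobs <- page // once every worker sits here, no one is left to receive
			}
		}()
	}

	for p := 1; p <= 10; p++ {
		// The sixth send can never complete: by then all five workers hold a
		// page and are blocked sending it back, so the runtime reports
		// "all goroutines are asleep - deadlock!".
		jobs <- p
	}
}
```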

@Akenaide
Owner

Akenaide commented Aug 1, 2024

At some point there was a retry functionality. I understand your point and you seem to understand mine, but I'm quite busy, so I haven't tested it. I will do it ASAP.

Right now, it does not seem to be an issue. The main bottleneck is when there is no proxy available, or maybe the two are linked.
