Address deadlock when fetching pages. #37

Open
wants to merge 1 commit into base: master
Conversation

kwadkore
Contributor

The default size for a channel is 0, so anything sending on a default-sized channel will block until something else reads from the other side. Since workers read from the Jobs channel, but also send on the channel in the event of a failure, if all of them end up failing on a page and attempting to send the failed page through the same channel, there'll be no workers left to read from the channel and the whole process will hang. Increase the channel size to the expected number of pages to reduce the chance of workers blocking themselves.

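For reference, here is a minimal, self-contained sketch of the buffered-channel approach described above. The worker loop, the counts, and the names (`jobs`, `done`, `failedOnce`, `numPages`, `numWorkers`) are illustrative stand-ins, not the scraper's actual code; each page is made to "fail" once so the re-queue path is exercised while the program still terminates.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const numWorkers = 5
	const numPages = 10

	// Buffered to the expected number of pages: a worker can put a failed
	// page back without waiting for another goroutine to receive it.
	jobs := make(chan int, numPages)
	done := make(chan int, numPages)
	var failedOnce sync.Map

	for w := 0; w < numWorkers; w++ {
		go func() {
			for page := range jobs {
				if _, seen := failedOnce.LoadOrStore(page, true); !seen {
					// Simulated failure on the first attempt: re-queue the page.
					// With this buffer size, the send can never leave every
					// worker blocked at once.
					jobs <- page
					continue
				}
				fmt.Println("fetched page", page)
				done <- page
			}
		}()
	}

	for p := 1; p <= numPages; p++ {
		jobs <- p
	}
	for i := 0; i < numPages; i++ {
		<-done // exit once every page has eventually succeeded
	}
}
```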
@Akenaide
Owner

This is the expected behaviour. I don't think it's good practice to spam requests as a scraper. If you really want this feature, you can add it as an option.

@kwadkore
Contributor Author

Hanging indefinitely does not sound like expected behavior. As the code stands right now, if a page fails, a worker can end up hanging indefinitely; in the worst case, all workers hang and nothing proceeds further. Say, for example, there are 5 workers and 10 pages, and the first five pages all fail. All 5 workers will block at line 170, and because nothing else reads from the Jobs channel, pages 6-10 won't even be attempted. If the intention is to not retry a failed page, then the solution is to take out line 170 instead. If the intention is to stop the scraper altogether when any page fails, then code should be added to stop all the workers when a failed page is encountered.

The code in this pull request doesn't increase the rate at which pages are tried. It just prevents the hang by letting workers move past line 170 and try another page.
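To make the failure mode concrete, here is a reduced, standalone reproduction of the scenario above. The worker loop and counts are made up for illustration (only "line 170" refers to the real source); every page is treated as failed and re-queued on the same unbuffered channel, so all five workers eventually block on the send and the Go runtime aborts with "all goroutines are asleep - deadlock!".

```go
package main

import "fmt"

func main() {
	const numWorkers = 5

	// Default-sized (unbuffered) channel: every send waits for a receiver.
	jobs := make(chan int)

	for w := 0; w < numWorkers; w++ {
		go func() {
			for page := range jobs {
				// Pretend the fetch failed and push the page back onto the
				// same channel, mirroring the send at line 170.
				fmt.Println("re-queueing page", page)
				jobs <- page // once every worker sits here, no one is left to receive
			}
		}()
	}

	for p := 1; p <= 10; p++ {
		// The sixth send can never complete: by then all five workers hold a
		// page and are blocked sending it back, so the runtime reports
		// "all goroutines are asleep - deadlock!".
		jobs <- p
	}
}
```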

@Akenaide
Owner

Akenaide commented Aug 1, 2024

At some point there was a retry functionality. I understand your point and you seem to understand mine, but I'm quite busy, so I haven't tested it. I will do it ASAP.

Right now, it does not seem to be an issue. The main bottleneck is when there is no proxy available, or maybe the two are linked.
