
Timeout Issue When Scraping Emails from Multiple URLs Using Python requests #6862

Closed
mominurr opened this issue Dec 24, 2024 · 1 comment
Labels: actions/autoclose-qa (Used for automation to auto-close an issue), Question/Not a bug

Comments

@mominurr

I am working on a web scraping project using Python's requests library. The goal is to scrape emails from numerous URLs. To handle network delays, I set the timeout parameter as timeout=(10, 10).

However, when I run the script for multiple URLs, I encounter an issue where the program gets stuck on a request and does not respect the timeout settings. This results in the script hanging indefinitely, especially when scraping a large number of URLs.

Here’s the code snippet I’m using:

import requests

urls = [
    "http://example.com",
    "http://anotherexample.com",
    # ... more URLs
]
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}

for url in urls:
    try:
        response = requests.get(url, headers=HEADERS, timeout=(10, 10))
        if response.status_code == 200:
            # Extract emails (simplified for demonstration)
            print(f"Emails from {url}: ", response.text)
    except requests.exceptions.Timeout:
        print(f"Timeout occurred for {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred for {url}: {e}")

Despite using the timeout parameter, the script sometimes gets stuck indefinitely and doesn’t proceed to the next URL.
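
My working theory (please correct me if this is wrong) is that the read timeout only bounds each individual socket read, not the total response time, so a server that keeps trickling bytes can hold a request open far longer than 10 seconds. Below is a rough sketch of a workaround I have been experimenting with: streaming the body with stream=True and enforcing an overall wall-clock cutoff. The 30-second limit and the fetch_with_deadline name are just placeholders I picked for illustration.

import time

import requests

OVERALL_DEADLINE = 30  # seconds; arbitrary value chosen for illustration

def fetch_with_deadline(url, headers, deadline=OVERALL_DEADLINE):
    # timeout=(10, 10) still guards the connect and each individual socket
    # read; the deadline below additionally bounds the total download time.
    start = time.monotonic()
    with requests.get(url, headers=headers, timeout=(10, 10), stream=True) as response:
        response.raise_for_status()
        chunks = []
        for chunk in response.iter_content(chunk_size=8192):
            chunks.append(chunk)
            if time.monotonic() - start > deadline:
                # Treat a slow-trickling server the same as a timeout.
                raise requests.exceptions.Timeout(
                    f"Exceeded {deadline}s total while downloading {url}"
                )
        return b"".join(chunks).decode(response.encoding or "utf-8", errors="replace")

In the loop above, this would take the place of the plain requests.get(...) call, and the existing except requests.exceptions.Timeout handler would still catch the exception it raises.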

Steps Taken:

  1. Tried reducing the timeout values to (5, 5) but encountered the same issue.
  2. Ensured that the URLs are valid and accessible.

My Questions:

  1. Why might the timeout not work as expected in this case?

  2. How can I ensure that the script doesn't hang indefinitely when scraping a large number of URLs?

Any help or suggestions to resolve this issue would be greatly appreciated.
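
Regarding question 2, I am also considering a fallback that runs each request in a worker thread and simply stops waiting for it after a hard limit. A rough sketch is below (it reuses urls and HEADERS from the snippet above; the 30-second HARD_LIMIT is an arbitrary value I chose). Note that this only unblocks the main loop: a hung request keeps its worker thread busy until it eventually finishes, so it is not a real cancellation.

import concurrent.futures

import requests

HARD_LIMIT = 30  # seconds; arbitrary value chosen for illustration

def fetch(url, headers):
    return requests.get(url, headers=headers, timeout=(10, 10))

# max_workers > 1 so a single stuck request does not block the next submission.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for url in urls:  # urls and HEADERS as defined in the snippet above
        future = executor.submit(fetch, url, HEADERS)
        try:
            # Stop waiting after HARD_LIMIT seconds even if the underlying
            # request has not timed out on its own.
            response = future.result(timeout=HARD_LIMIT)
            if response.status_code == 200:
                print(f"Emails from {url}: ", response.text)
        except concurrent.futures.TimeoutError:
            print(f"Hard limit exceeded for {url}")
        except requests.exceptions.RequestException as e:
            print(f"Error occurred for {url}: {e}")

One caveat I am aware of: the executor's shutdown at the end of the with block still waits for any hung worker threads, so the process may not exit promptly if the last few requests are stuck.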

Environment:

Python version: 3.10.10

requests version: 2.32.3

@sigmavirus24
Contributor

@sigmavirus24 closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 24, 2024
@sigmavirus24 added the Question/Not a bug and actions/autoclose-qa (Used for automation to auto-close an issue) labels Dec 24, 2024
@psf psf locked and limited conversation to collaborators Dec 24, 2024