
Timeout Issue When Scraping Emails from Multiple URLs Using Python requests #6862

Closed
mominurr opened this issue Dec 24, 2024 · 1 comment
Labels: actions/autoclose-qa (Used for automation to auto-close an issue), Question/Not a bug

Comments

@mominurr

I am working on a web scraping project using Python's requests library. The goal is to scrape emails from numerous URLs. To handle network delays, I set the timeout parameter as timeout=(10, 10).

However, when I run the script for multiple URLs, I encounter an issue where the program gets stuck on a request and does not respect the timeout settings. This results in the script hanging indefinitely, especially when scraping a large number of URLs.

Here’s the code snippet I’m using:

import requests

urls = [
    "http://example.com",
    "http://anotherexample.com",
    # ... more URLs
]
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}

for url in urls:
    try:
        response = requests.get(url, headers=HEADERS, timeout=(10, 10))
        if response.status_code == 200:
            # Extract emails (simplified for demonstration)
            print(f"Emails from {url}: ", response.text)
    except requests.exceptions.Timeout:
        print(f"Timeout occurred for {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred for {url}: {e}")

Despite using the timeout parameter, the script sometimes gets stuck indefinitely and doesn’t proceed to the next URL.
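
My working theory (please correct me if this is wrong) is that the read timeout only bounds each individual socket read, not the total response time, so a server that keeps trickling bytes can hold a request open far longer than 10 seconds. Below is a rough sketch of a workaround I have been experimenting with: streaming the body with stream=True and enforcing an overall wall-clock cutoff. The 30-second limit and the fetch_with_deadline name are just placeholders I picked for illustration.

import time

import requests

OVERALL_DEADLINE = 30  # seconds; arbitrary value chosen for illustration

def fetch_with_deadline(url, headers, deadline=OVERALL_DEADLINE):
    # timeout=(10, 10) still guards the connect and each individual socket
    # read; the deadline below additionally bounds the total download time.
    start = time.monotonic()
    with requests.get(url, headers=headers, timeout=(10, 10), stream=True) as response:
        response.raise_for_status()
        chunks = []
        for chunk in response.iter_content(chunk_size=8192):
            chunks.append(chunk)
            if time.monotonic() - start > deadline:
                # Treat a slow-trickling server the same as a timeout.
                raise requests.exceptions.Timeout(
                    f"Exceeded {deadline}s total while downloading {url}"
                )
        return b"".join(chunks).decode(response.encoding or "utf-8", errors="replace")

In the loop above, this would take the place of the plain requests.get(...) call, and the existing except requests.exceptions.Timeout handler would still catch the exception it raises.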

Steps Taken:

  1. Tried reducing the timeout values to (5, 5) but encountered the same issue.
  2. Ensured that the URLs are valid and accessible.

My Questions:

  1. Why might the timeout not work as expected in this case?

  2. How can I ensure that the script doesn't hang indefinitely when scraping a large number of URLs?

Any help or suggestions to resolve this issue would be greatly appreciated.
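
Regarding question 2, I am also considering a fallback that runs each request in a worker thread and simply stops waiting for it after a hard limit. A rough sketch is below (it reuses urls and HEADERS from the snippet above; the 30-second HARD_LIMIT is an arbitrary value I chose). Note that this only unblocks the main loop: a hung request keeps its worker thread busy until it eventually finishes, so it is not a real cancellation.

import concurrent.futures

import requests

HARD_LIMIT = 30  # seconds; arbitrary value chosen for illustration

def fetch(url, headers):
    return requests.get(url, headers=headers, timeout=(10, 10))

# max_workers > 1 so a single stuck request does not block the next submission.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for url in urls:  # urls and HEADERS as defined in the snippet above
        future = executor.submit(fetch, url, HEADERS)
        try:
            # Stop waiting after HARD_LIMIT seconds even if the underlying
            # request has not timed out on its own.
            response = future.result(timeout=HARD_LIMIT)
            if response.status_code == 200:
                print(f"Emails from {url}: ", response.text)
        except concurrent.futures.TimeoutError:
            print(f"Hard limit exceeded for {url}")
        except requests.exceptions.RequestException as e:
            print(f"Error occurred for {url}: {e}")

One caveat I am aware of: the executor's shutdown at the end of the with block still waits for any hung worker threads, so the process may not exit promptly if the last few requests are stuck.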

Environment:

Python version: 3.10.10

requests version: 2.32.3

@sigmavirus24
Contributor

@sigmavirus24 closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 24, 2024
@sigmavirus24 added the Question/Not a bug and actions/autoclose-qa (Used for automation to auto-close an issue) labels Dec 24, 2024
@psf psf locked and limited conversation to collaborators Dec 24, 2024