I am working on a web scraping project using Python's `requests` library. The goal is to scrape emails from numerous URLs. To handle network delays, I set the timeout parameter to `timeout=(10, 10)`.

However, when I run the script over multiple URLs, it sometimes gets stuck on a single request and does not respect the timeout settings, so the whole run hangs indefinitely, especially when scraping a large number of URLs.
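For context, my understanding from the requests documentation is that the tuple form means (connect timeout, read timeout), and that the read timeout only bounds the gap between individual socket reads, not the total time for the whole response. I have also seen reports that DNS resolution can block outside these timeouts entirely. If that reading is right, a server that trickles out data slowly would never trip either limit:

```python
import requests

# (connect timeout, read timeout): up to 10 s to establish the connection,
# then up to 10 s between successive reads from the socket. A server that
# keeps sending a byte every few seconds never exceeds either limit, so
# the total request time is effectively unbounded.
response = requests.get("http://example.com", timeout=(10, 10))
```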
Here’s the code snippet I’m using:
```python
import requests

urls = [
    "http://example.com",
    "http://anotherexample.com",
    # ... more URLs
]

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}

for url in urls:
    try:
        response = requests.get(url, headers=HEADERS, timeout=(10, 10))
        if response.status_code == 200:
            # Extract emails (simplified for demonstration)
            print(f"Emails from {url}: ", response.text)
    except requests.exceptions.Timeout:
        print(f"Timeout occurred for {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred for {url}: {e}")
```
Despite using the timeout parameter, the script sometimes gets stuck indefinitely and doesn’t proceed to the next URL.
Steps Taken:
- Tried reducing the timeout values to (5, 5), but encountered the same issue.
- Ensured that the URLs are valid and accessible.
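As a possible mitigation, I am also considering wrapping each request in a worker thread so I can enforce a hard wall-clock deadline on top of requests' own timeouts. A rough, untested sketch (the `fetch` helper and the 30-second budget are placeholders of mine; `urls` and `HEADERS` are as defined above):

```python
import requests
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def fetch(url):
    # Same call as in the loop above; uses HEADERS defined earlier.
    return requests.get(url, headers=HEADERS, timeout=(10, 10))

with ThreadPoolExecutor(max_workers=4) as pool:
    for url in urls:
        future = pool.submit(fetch, url)
        try:
            # Hard wall-clock budget per URL; 30 s is an arbitrary choice.
            response = future.result(timeout=30)
            if response.status_code == 200:
                print(f"Emails from {url}: ", response.text)
        except FutureTimeout:
            # This only stops *waiting*: the worker thread may stay blocked
            # inside requests, and the executor's shutdown at the end of the
            # `with` block will still wait for it, so this is not a full fix.
            print(f"Hard deadline exceeded for {url}")
        except requests.exceptions.RequestException as e:
            print(f"Error occurred for {url}: {e}")
```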
My Questions:
1. Why might the timeout not work as expected in this case?
2. How can I ensure that the script doesn't hang indefinitely when scraping a large number of URLs? (One streaming-based idea is sketched below.)
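Regarding the second question, one workaround I have seen suggested is to stream the body and enforce my own total-time budget on top of the per-read timeout. Would something like this (untested sketch; the helper name and the 30-second budget are arbitrary values of mine, and `HEADERS` is as defined above) be the recommended approach?

```python
import time
import requests

TOTAL_BUDGET = 30  # seconds per URL; arbitrary illustrative value

def fetch_with_budget(url):
    # stream=True defers the body download, so we can abandon servers that
    # drip-feed data forever; timeout=(10, 10) still bounds the connect
    # phase and each individual read.
    with requests.get(url, headers=HEADERS, stream=True, timeout=(10, 10)) as response:
        deadline = time.monotonic() + TOTAL_BUDGET
        body = bytearray()
        for chunk in response.iter_content(chunk_size=8192):
            body.extend(chunk)
            if time.monotonic() > deadline:
                raise requests.exceptions.Timeout(f"total budget exceeded for {url}")
        return body.decode(response.encoding or "utf-8", errors="replace")
```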
Any help or suggestions to resolve this issue would be greatly appreciated.
Environment:
- Python version: 3.10.10
- requests version: 2.32.3