Document how to handle action failures #157

Open · wants to merge 10 commits into main
1 change: 1 addition & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
usage/retry
usage/scrapy-poet
usage/stats
usage/actions
usage/fingerprint
usage/proxy

111 changes: 111 additions & 0 deletions docs/usage/actions.rst
@@ -0,0 +1,111 @@
.. _actions:

======================
Handling action errors
======================
Member:
There is a lot of code in this section; I wonder if we should make it part of the library itself instead.

It seems like a general theme: if an action fails, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get the old behavior back), as it allows detecting more issues.

Contributor Author:

deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?


Zyte API responses are considered successful :ref:`even if some browser actions
fail <zyte-api-successful-responses>`.
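
For example, here is a minimal sketch (the spider and callback names are
illustrative) of how a callback could log failed actions; each action result
is a dict, and failed actions include an ``"error"`` key:

.. code-block:: python

    from scrapy import Spider


    class MySpider(Spider):
        name = "my_spider"

        def parse(self, response):
            # The "actions" key is only present if actions were requested.
            for action in response.raw_api_response.get("actions", []):
                if "error" in action:
                    self.logger.warning(f"Failed action: {action}")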

If you wish to retry requests whose responses contain failed actions, you can
define a custom Scrapy downloader middleware as follows:

.. code-block:: python
    :caption: myproject/middlewares.py

    from typing import Optional, Type, Union

    from scrapy import Request, Spider
    from scrapy.crawler import Crawler
    from scrapy.downloadermiddlewares.retry import get_retry_request
    from scrapy.exceptions import NotConfigured
    from scrapy.http import Response
    from scrapy.settings import BaseSettings

    from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


    class ZyteAPIFailedActionsRetryMiddleware:

        @classmethod
        def from_crawler(cls, crawler: Crawler):
            return cls(crawler.settings)

        def __init__(self, settings: BaseSettings):
            if not settings.getbool("RETRY_ENABLED"):
                raise NotConfigured
            self.max_retry_times = settings.getint("RETRY_TIMES")
            self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")

        def process_response(
            self, request: Request, response: Response, spider: Spider
        ) -> Union[Request, Response]:
            if not isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)):
                return response
            if request.meta.get("dont_retry", False):
                return response
            # The "actions" key is only present if actions were requested.
            actions = response.raw_api_response.get("actions", [])
            if any("error" in action for action in actions):
                reason = "An action failed"
                new_request = self._retry(request, reason, spider)
                if new_request:
                    return new_request
                # Note: If you prefer requests that exceed all retries to
                # be dropped, raise scrapy.exceptions.IgnoreRequest here,
                # instead of returning the response.
                return response
            return response

        def _retry(
            self,
            request: Request,
            reason: Union[str, Exception, Type[Exception]],
            spider: Spider,
        ) -> Optional[Request]:
            max_retry_times = request.meta.get("max_retry_times", self.max_retry_times)
            priority_adjust = request.meta.get("priority_adjust", self.priority_adjust)
            return get_retry_request(
                request,
                reason=reason,
                spider=spider,
                max_retry_times=max_retry_times,
                priority_adjust=priority_adjust,
            )

And enable it in your settings:

.. code-block:: python
    :caption: myproject/settings.py

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.ZyteAPIFailedActionsRetryMiddleware": 525,
    }
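
The priority value of ``525`` is a suggestion; it places the middleware
between the built-in ``UserAgentMiddleware`` (``500``) and
``RetryMiddleware`` (``550``) of Scrapy's default downloader middlewares.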

If you use
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to
prevent responses with failed actions (e.g. responses that have exceeded all
retries) from being cached:

.. code-block:: python
    :caption: myproject/extensions.py

    from scrapy import Request
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.http import Response

    from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


    class ZyteAPIFailedActionsPolicy(DummyPolicy):

        def should_cache_response(self, response: Response, request: Request):
            if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)):
                # The "actions" key is only present if actions were requested.
                actions = response.raw_api_response.get("actions", [])
                if any("error" in action for action in actions):
                    return False
            return super().should_cache_response(response, request)

And enable it in your settings:

.. code-block:: python
    :caption: myproject/settings.py

    HTTPCACHE_POLICY = "myproject.extensions.ZyteAPIFailedActionsPolicy"
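
The policy above extends :class:`~scrapy.extensions.httpcache.DummyPolicy`,
the default cache policy. If your project uses
:class:`~scrapy.extensions.httpcache.RFC2616Policy` instead, the same
``should_cache_response`` override should work there as well.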