
Upload cancellation causes long retry loop #3268

Open
andy-slac opened this issue Oct 1, 2024 · 4 comments
Labels
bug · needs-review · p2 · s3transfer


@andy-slac

Describe the bug

I'm not exactly sure whether this is a botocore or an s3transfer issue; more probably it is some inconsistency between the two. What I observe is that when an asynchronous exception occurs during an S3 upload (Ctrl-C, or a similar "random" exception that we use to terminate the application), the upload does not stop immediately; instead it hangs for longer than a minute. This happens with multithreaded uploads; single-threaded uploads do not show the same issue.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

I would expect Ctrl-C, or any other such exception, to stop the upload almost immediately.

Current Behavior

Running with debug logging enabled, I think I can see what happens:

  • When the exception is raised, the S3Transfer context manager's __exit__ method is called, which calls TransferManager.__exit__().
  • The latter calls TransferManager._shutdown with exception type FatalError or CancelledError, depending on the type of the original exception.
  • That in turn calls each transfer coordinator's cancel() method, passing the exception type to it.
  • Later, when the client tries to read from the input file, the exception stored in the transfer coordinator is checked, and if it is not None, that exception is raised.
  • I think the intention is for this exception to cancel the upload in that thread; instead, it looks like the retry handler intercepts the exception and tries to restart the upload in each thread. That runs into the same issue: the exception is re-raised and intercepted again. The handler gives up only after multiple iterations, and that takes time. (A sketch of the read path follows this list.)
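
For reference, here is a minimal sketch of that read path, paraphrased from the reader wrapper in s3transfer/upload.py (its read() appears at upload.py:90 in the traceback below); the class name and surrounding details are simplified, not copied from the source:

# Paraphrased sketch of the reader wrapper in s3transfer/upload.py; names
# are simplified and should not be read as the exact s3transfer API.
class CancellableReader:
    def __init__(self, fileobj, transfer_coordinator):
        self._fileobj = fileobj
        self._transfer_coordinator = transfer_coordinator

    def read(self, amount=None):
        # Once cancel() has stored an exception on the coordinator, every
        # subsequent read of the upload body re-raises that same exception.
        if self._transfer_coordinator.exception:
            raise self._transfer_coordinator.exception
        return self._fileobj.read(amount)

Because the request body is read again on every retry attempt, each retry immediately hits the same raise, so the loop only ends once the retry budget is exhausted.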

Here is an example of a traceback dumped by boto from one of the threads when it goes into the retry loop:

2024-09-30 19:21:01,572 ThreadPoolExecutor-0_7 DEBUG botocore.httpsession (<...>/lib/python3.11/site-packages/botocore/httpsession.py:508) :: Exception received when sending urllib3 HTTP request
Traceback (most recent call last):
  File "<...>/lib/python3.11/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
                      ^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
s3transfer.exceptions.FatalError: Received signal SIGINT.

Reproduction Steps

Here is the code I used to debug the issue; it is essentially a trivial call to the upload_file method. The trick is to trigger upload cancellation by hitting Ctrl-C at the right moment. In our setup I used a ~100 MB file, which gave me a reasonable time window during the upload. If you are lucky, after hitting Ctrl-C you should see an enormous number of tracebacks printed, like the one included above.

#!/usr/bin/env python

import argparse
import logging
import signal
import traceback

import botocore
import boto3

# Configure logging to use DEBUG; this shows tracebacks from exceptions
# handled by boto internally.
fmt = "%(asctime)s %(threadName)s %(levelname)s %(name)s (%(pathname)s:%(lineno)d) :: %(message)s"
logging.basicConfig(level=logging.DEBUG, format=fmt)

_log = logging.getLogger()

# In our case we try to gracefully terminate transfers when a signal is sent
# by the scheduler which executes all processing jobs. This is done by raising
# a special exception which inherits from BaseException so that it is not
# swallowed by a regular `except Exception`, i.e. it behaves more like
# KeyboardInterrupt.
class GracefulShutdownInterrupt(BaseException):
    pass


def _signal_handler(signum, stack_frame):
    signame = signal.Signals(signum).name
    _log.info(f"************* Signal {signame} detected, cleaning up and shutting down. ************\n")
    traceback.print_stack(stack_frame)
    raise GracefulShutdownInterrupt(f"Received signal {signame}.")


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="File to upload to S3.")
    parser.add_argument("bucket", help="Name S3 bucket.")
    parser.add_argument("path", help="Name of the object in the bucket.")
    parser.add_argument("-p", "--profile", default=None, help="AWS profile name.")
    parser.add_argument("--endpoint-url", default=None, help="Endpoint URL.")
    parser.add_argument("--access-key-id", default=None, help="Access key ID.")
    parser.add_argument("--secret-access-key", default=None, help="Secret access key.")
    args = parser.parse_args()

    session = boto3.Session(profile_name=args.profile)

    # This configuration is what we use by default in our setup.
    config = botocore.config.Config(read_timeout=180, retries={"mode": "adaptive", "max_attempts": 10})
    client = session.client(
        "s3",
        endpoint_url=args.endpoint_url,
        aws_access_key_id=args.access_key_id,
        aws_secret_access_key=args.secret_access_key,
        config=config,
    )

    # Installing the handlers is optional; if KeyboardInterrupt is raised
    # instead of GracefulShutdownInterrupt, the exception raised by the
    # coordinators will be different (CancelledError vs. FatalError).
    signal.signal(signal.SIGINT, _signal_handler)
    signal.signal(signal.SIGTERM, _signal_handler)
    client.upload_file(args.file, args.bucket, args.path)


if __name__ == "__main__":
    main()

Possible Solution

Maybe the retry handler should understand that certain kinds of exceptions should result in immediate failure rather than a retry? A rough sketch of what such a check could look like follows.
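
For illustration only: the should_retry() helper below and the place where it would hook into botocore's retry machinery are hypothetical; only CancelledError and FatalError are real s3transfer exception classes.

# Hypothetical sketch: should_retry() and its integration point are invented
# for illustration; CancelledError and FatalError are real s3transfer classes.
from s3transfer.exceptions import CancelledError, FatalError

# Exceptions that signal deliberate cancellation; retrying them cannot succeed.
NON_RETRYABLE = (CancelledError, FatalError)

def should_retry(exception, attempt, max_attempts):
    if isinstance(exception, NON_RETRYABLE):
        return False  # fail immediately instead of burning through retries
    return attempt < max_attempts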

Additional Information/Context

Package versions:

boto3                         1.34.131
botocore                      1.34.131
s3transfer                    0.10.2
urllib3                       1.26.19

SDK version used

1.34.131

Environment details (OS name and version, etc.)

Red Hat Enterprise Linux release 8.6 (Ootpa), kernel 4.18.0-372.32.1.el8_6.x86_64

@andy-slac andy-slac added the bug and needs-triage labels Oct 1, 2024
@andy-slac
Author

I can attach full logs from an actual execution of that script, with and without the signal handler, each about 10 MB in size.

@tim-finnigan tim-finnigan self-assigned this Oct 2, 2024
@tim-finnigan
Contributor

Thanks for reaching out. Can you share the full logs (with any sensitive info redacted) by adding boto3.set_stream_logger('') to your script? When you say "The trick is to trigger upload cancellation by hitting Ctrl-C at a correct moment", can you elaborate on that a bit more? Is this issue only reproducible when cancelling the upload at a certain point in the process?

I'll link the Boto3 docs on retries and timeouts for Config for reference, although I see you already have those applied in your snippet. I'm wondering if specifying a lower timeout and max_attempts would reduce the delay you're seeing.
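
Something like this minimal snippet, for example (the specific numbers are arbitrary):

import boto3
import botocore.config

# Send all boto3/botocore debug logs to the console.
boto3.set_stream_logger('')

# Tighter limits than the config in the report; values are only examples.
config = botocore.config.Config(
    read_timeout=30,
    retries={"mode": "standard", "max_attempts": 2},
)
client = boto3.client("s3", config=config)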

@tim-finnigan tim-finnigan added the response-requested, s3transfer, and p2 labels and removed the needs-triage label Oct 2, 2024
@andy-slac
Author

Attaching the log with debug output: log-exc5-handler.log.gz

Reducing the delay is not my goal; the retries are configured reasonably in our case and there is no reason to change them. What I want is for the transfer to stop instantly, without any delay, after the signal is received and the exception is raised; there is no reason to retry in that case.

Hitting Ctrl-C at the correct moment means that the upload has started but has not yet finished, i.e. the threads are actively pushing data. That window may be short depending on the size of the file. I used a ~100 MB file in my case, which gave me a relatively good chance.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Oct 4, 2024
@tim-finnigan
Contributor

Thanks for following up. It looks like boto/s3transfer#249 may be related. We can plan to bring this issue up with the team for further review.

@tim-finnigan tim-finnigan added the needs-review label Oct 7, 2024
@tim-finnigan tim-finnigan removed their assignment Oct 7, 2024