Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-128131: Completely support random read access of uncompressed unencrypted files in ZipFile #128143

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

5ec1cff
Copy link

@5ec1cff 5ec1cff commented Dec 21, 2024

@bedevere-app
Copy link

bedevere-app bot commented Dec 21, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@picnixz
Copy link
Contributor

picnixz commented Dec 21, 2024

Closing as a duplicate of #128132. Please do not copy PRs made by others. Thanks.

EDIT: It appears that this is either the same person or two people collaborating with each other. Anyway, please do not update two identical PRs. Thanks

@picnixz picnixz closed this Dec 21, 2024
@5ec1cff 5ec1cff deleted the 5ec1cff-patch-1 branch December 21, 2024 01:59
@5ec1cff 5ec1cff restored the 5ec1cff-patch-1 branch December 21, 2024 03:19
@5ec1cff
Copy link
Author

5ec1cff commented Dec 21, 2024

Closing as a duplicate of #128132. Please do not copy PRs made by others. Thanks.

EDIT: It appears that this is either the same person or two people collaborating with each other. Anyway, please do not update two identical PRs. Thanks

We discussed about this issue and decide that I submit the pr. vvb2060 has already closed his pr. Please reopen my pr, sorry for inconvenience.

@picnixz picnixz reopened this Dec 21, 2024
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have tests for that please?

@5ec1cff 5ec1cff changed the title GH-128131: Support forward seek in uncompressed unencrypted ZipExtFile GH-128131: Support fast forward seek in uncompressed unencrypted ZipExtFile Dec 21, 2024
@5ec1cff
Copy link
Author

5ec1cff commented Dec 21, 2024

Can we have tests for that please?

Added.

Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the actual implementation until the tests are (hopefully) simpler.

Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
b = fp.read(100)
self.assertEqual(b, txt[:100])

# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
Copy link
Contributor

@picnixz picnixz Dec 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I would add something like self.assertLess(ZipExtFile.MIN_READ_SIZE, 5000) to be sure that we update the tests if needed.

EDIT: check this outside the "with".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ZipExtFile.MIN_READ_SIZE directly ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other tests that modify this value to make it much smaller:

# Make sure that the second read after seeking back beyond
# _readbuffer returns the same content (ie. rewind to the start of
# the file to read forward to the required position).
old_read_size = fp.MIN_READ_SIZE
fp.MIN_READ_SIZE = 1
fp._readbuffer = b''
fp._offset = 0
fp.seek(0, os.SEEK_SET)
self.assertEqual(fp.tell(), 0)
fp.seek(bloc, os.SEEK_CUR)
self.assertEqual(fp.read(bloc_len), txt[bloc:bloc+bloc_len])
fp.MIN_READ_SIZE = old_read_size

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean we can also modify this value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep (but it's best to only change it on the instance so it doesn't leak to the rest of the tests :p )

Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
@5ec1cff 5ec1cff changed the title GH-128131: Support fast forward seek in uncompressed unencrypted ZipExtFile GH-128131: Completely support random access of uncompressed unencrypted ZipExtFile Dec 22, 2024
@5ec1cff 5ec1cff changed the title GH-128131: Completely support random access of uncompressed unencrypted ZipExtFile GH-128131: Completely support random access of uncompressed unencrypted files inside ZipFile Dec 22, 2024
@5ec1cff 5ec1cff changed the title GH-128131: Completely support random access of uncompressed unencrypted files inside ZipFile GH-128131: Completely support random read access of uncompressed unencrypted files in ZipFile Dec 22, 2024
@5ec1cff 5ec1cff requested a review from picnixz December 22, 2024 14:59
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
Lib/test/test_zipfile/test_core.py Outdated Show resolved Hide resolved
# disable CRC checking after first seeking - it would be invalid
self._expected_crc = None
# seek actual file taking already buffered data into account
read_offset -= len(self._readbuffer) - self._offset
self._fileobj.seek(read_offset, os.SEEK_CUR)
self._left -= read_offset
self._compress_left -= read_offset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch that was a bug with read_offset > 0. Are you able to make a test for this too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

@danifus danifus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed these before

Comment on lines +3486 to +3487
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to think about why we were doing this and a comment would here would help. Feel free to change this if this wasn't the intent of this test :)

Suggested change
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)
# Test this optimized read hasn't rewound and read from the
# start of the file (as in the case of the unoptimized path)
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)

Comment on lines +3498 to +3499
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

Suggested change
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)
# Test this optimized read hasn't rewound and read from the
# start of the file (as in the case of the unoptimized path)
d = sio.bytes_read - old_count
self.assertLessEqual(d, min_size)

# 20000 bytes
txt = b'0123456789' * 2000

# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# as `ZipExtFile._read2()` reads in blocks of this size and we need to
# seek out of the buffered data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that comment wasn't very good. This one is better?

Suggested change
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# as `ZipExtFile._read2()` reads in blocks of this size and this test relies
# on reads remaining within one of these blocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, it was my suggested change that was not very good. It didn’t describe what was happening in the code as well as it could :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first one is better

@5ec1cff 5ec1cff requested a review from danifus December 24, 2024 06:17
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is: should we treat this as a bug or a feature? it's half-half but I'm not a maintainer of this module so I don't really know. @jaraco WDYT?

Comment on lines +3467 to +3469
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# as `ZipExtFile._read2()` reads in blocks of this size and we need to
# seek out of the buffered data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
# as `ZipExtFile._read2()` reads in blocks of this size and we need to
# seek out of the buffered data
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE
# as `ZipExtFile._read2()` reads in blocks of this size and we
# need to seek out of the buffered data

Comment on lines +3471 to +3472
self.assertGreaterEqual(10002, min_size)
self.assertGreaterEqual(5003, min_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.assertGreaterEqual(10002, min_size)
self.assertGreaterEqual(5003, min_size)
self.assertGreaterEqual(10002, min_size) # for forward seek test
self.assertGreaterEqual(5003, min_size) # for backward seek test

self.assertGreaterEqual(5003, min_size)
# The read length must be less than MIN_READ_SIZE, since we assume that
# only 1 block is read in the test.
self.assertGreaterEqual(min_size, 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.assertGreaterEqual(min_size, 100)
self.assertGreaterEqual(min_size, 100) # for read() calls

# backward seek
old_count = sio.bytes_read
fp.seek(-5003, os.SEEK_CUR)
self.assertEqual(fp.tell(), 5099)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.assertEqual(fp.tell(), 5099)
self.assertEqual(fp.tell(), 5099) # 5099 = 10102 - 5003

Comment on lines +3452 to +3461
from _pyio import BytesIO
class StatIO(BytesIO):
def __init__(self):
super().__init__()
self.bytes_read = 0

def read(self, size=-1):
bs = super().read(size)
self.bytes_read += len(bs)
return bs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be made a non-local class so that you can re-use it in future tests.

@@ -3447,6 +3447,71 @@ def test_too_short(self):
self.assertEqual(
b"zzz", zipfile._Extra.strip(b"zzz", (self.ZIP64_EXTRA,)))

class StoredZipExtFileRandomReadTest(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate classes by 2 blank lines instead of 1.

@picnixz picnixz dismissed their stale review December 24, 2024 15:55

The PR addressed the requested changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants