-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-128131: Completely support random read access of uncompressed unencrypted files in ZipFile #128143
base: main
Are you sure you want to change the base?
Conversation
5ec1cff
commented
Dec 21, 2024
•
edited by bedevere-app
bot
Loading
edited by bedevere-app
bot
- Issue: random access uncompressed unencrypted ZipExtFile #128131
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Closing as a duplicate of #128132. Please do not copy PRs made by others. Thanks. EDIT: It appears that this is either the same person or two people collaborating with each other. Anyway, please do not update two identical PRs. Thanks |
We discussed about this issue and decide that I submit the pr. vvb2060 has already closed his pr. Please reopen my pr, sorry for inconvenience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have tests for that please?
Misc/NEWS.d/next/Library/2024-12-21-03-20-12.gh-issue-128131.QpPmNt.rst
Outdated
Show resolved
Hide resolved
Added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the actual implementation until the tests are (hopefully) simpler.
…pPmNt.rst Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
add unittest
2c34891
to
57cb51c
Compare
Lib/test/test_zipfile/test_core.py
Outdated
b = fp.read(100) | ||
self.assertEqual(b, txt[:100]) | ||
|
||
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I would add something like self.assertLess(ZipExtFile.MIN_READ_SIZE, 5000)
to be sure that we update the tests if needed.
EDIT: check this outside the "with".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use ZipExtFile.MIN_READ_SIZE
directly ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other tests that modify this value to make it much smaller:
cpython/Lib/test/test_zipfile/test_core.py
Lines 2591 to 2602 in 831b6de
# Make sure that the second read after seeking back beyond | |
# _readbuffer returns the same content (ie. rewind to the start of | |
# the file to read forward to the required position). | |
old_read_size = fp.MIN_READ_SIZE | |
fp.MIN_READ_SIZE = 1 | |
fp._readbuffer = b'' | |
fp._offset = 0 | |
fp.seek(0, os.SEEK_SET) | |
self.assertEqual(fp.tell(), 0) | |
fp.seek(bloc, os.SEEK_CUR) | |
self.assertEqual(fp.read(bloc_len), txt[bloc:bloc+bloc_len]) | |
fp.MIN_READ_SIZE = old_read_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean we can also modify this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep (but it's best to only change it on the instance so it doesn't leak to the rest of the tests :p )
Misc/NEWS.d/next/Library/2024-12-21-03-20-12.gh-issue-128131.QpPmNt.rst
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address this comment: https://github.com/python/cpython/pull/128143/files#r1894920324.
# disable CRC checking after first seeking - it would be invalid | ||
self._expected_crc = None | ||
# seek actual file taking already buffered data into account | ||
read_offset -= len(self._readbuffer) - self._offset | ||
self._fileobj.seek(read_offset, os.SEEK_CUR) | ||
self._left -= read_offset | ||
self._compress_left -= read_offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch that was a bug with read_offset > 0
. Are you able to make a test for this too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
692109a
to
67f05de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed these before
d = sio.bytes_read - old_count | ||
self.assertLessEqual(d, min_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to think about why we were doing this and a comment would here would help. Feel free to change this if this wasn't the intent of this test :)
d = sio.bytes_read - old_count | |
self.assertLessEqual(d, min_size) | |
# Test this optimized read hasn't rewound and read from the | |
# start of the file (as in the case of the unoptimized path) | |
d = sio.bytes_read - old_count | |
self.assertLessEqual(d, min_size) |
d = sio.bytes_read - old_count | ||
self.assertLessEqual(d, min_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
d = sio.bytes_read - old_count | |
self.assertLessEqual(d, min_size) | |
# Test this optimized read hasn't rewound and read from the | |
# start of the file (as in the case of the unoptimized path) | |
d = sio.bytes_read - old_count | |
self.assertLessEqual(d, min_size) |
Lib/test/test_zipfile/test_core.py
Outdated
# 20000 bytes | ||
txt = b'0123456789' * 2000 | ||
|
||
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | |
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | |
# as `ZipExtFile._read2()` reads in blocks of this size and we need to | |
# seek out of the buffered data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that comment wasn't very good. This one is better?
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | |
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | |
# as `ZipExtFile._read2()` reads in blocks of this size and this test relies | |
# on reads remaining within one of these blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, it was my suggested change that was not very good. It didn’t describe what was happening in the code as well as it could :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the first one is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is: should we treat this as a bug or a feature? it's half-half but I'm not a maintainer of this module so I don't really know. @jaraco WDYT?
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | ||
# as `ZipExtFile._read2()` reads in blocks of this size and we need to | ||
# seek out of the buffered data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096) | |
# as `ZipExtFile._read2()` reads in blocks of this size and we need to | |
# seek out of the buffered data | |
# The seek length must be greater than ZipExtFile.MIN_READ_SIZE | |
# as `ZipExtFile._read2()` reads in blocks of this size and we | |
# need to seek out of the buffered data |
self.assertGreaterEqual(10002, min_size) | ||
self.assertGreaterEqual(5003, min_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.assertGreaterEqual(10002, min_size) | |
self.assertGreaterEqual(5003, min_size) | |
self.assertGreaterEqual(10002, min_size) # for forward seek test | |
self.assertGreaterEqual(5003, min_size) # for backward seek test |
self.assertGreaterEqual(5003, min_size) | ||
# The read length must be less than MIN_READ_SIZE, since we assume that | ||
# only 1 block is read in the test. | ||
self.assertGreaterEqual(min_size, 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.assertGreaterEqual(min_size, 100) | |
self.assertGreaterEqual(min_size, 100) # for read() calls |
# backward seek | ||
old_count = sio.bytes_read | ||
fp.seek(-5003, os.SEEK_CUR) | ||
self.assertEqual(fp.tell(), 5099) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.assertEqual(fp.tell(), 5099) | |
self.assertEqual(fp.tell(), 5099) # 5099 = 10102 - 5003 |
from _pyio import BytesIO | ||
class StatIO(BytesIO): | ||
def __init__(self): | ||
super().__init__() | ||
self.bytes_read = 0 | ||
|
||
def read(self, size=-1): | ||
bs = super().read(size) | ||
self.bytes_read += len(bs) | ||
return bs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be made a non-local class so that you can re-use it in future tests.
@@ -3447,6 +3447,71 @@ def test_too_short(self): | |||
self.assertEqual( | |||
b"zzz", zipfile._Extra.strip(b"zzz", (self.ZIP64_EXTRA,))) | |||
|
|||
class StoredZipExtFileRandomReadTest(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate classes by 2 blank lines instead of 1.
The PR addressed the requested changes.