-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prototype dak.from_text
#7
Conversation
src/dask_awkward/lib/io/io.py
Outdated
if offsets == 0 and length is None: | ||
bytestring = f.read() | ||
else: | ||
bytestring = read_block(f, offsets, length, delimiter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the upstream dask function which will find the next delimited after the start and stop of the block, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea these lines of code are from upstream dask (https://github.com/dask/dask/blob/4178feb58a7e708345ce4e41e018a746b0d1fd06/dask/bytes/core.py#L190)
src/dask_awkward/lib/io/io.py
Outdated
buffer = np.frombuffer(bytestring, dtype=np.uint8) | ||
array = ak.from_numpy(buffer) | ||
array = ak.unflatten(array, len(array)) | ||
array = ak.enforce_type(array, "string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ask our friends: it feels like we should be able to pass the buffer/bytestring directly to awkward and declare it a string rather than take four lines to do it. OTOH, I don't suppose these lines cost anything.
src/dask_awkward/lib/io/io.py
Outdated
array = ak.unflatten(array, len(array)) | ||
array = ak.enforce_type(array, "string") | ||
array_split = ak.str.split_pattern(array, "\n") | ||
lines = array_split[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> list[string]? but what was array_split, then, why the [0]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array_split
ends up being of type 1 * var * string
; so by grabbing [0]
we get an array of strings N * string
(used the awkward docs https://awkward-array.org/doc/main/user-guide/how-to-strings-read-binary.html)
src/dask_awkward/lib/io/io.py
Outdated
delimiter, | ||
False, | ||
blocksize, | ||
"128 KiB", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= sample size? It's a reasonable value, but user might need to change it. We don't actually need it at all, since we know the form of the output is array-of-string (if delimiter is None) or array-of-list-of-string (otherwise). Do we allow for delimiter=None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea it's the sample size; definitely planning on making it user definable! On delimiter- right now I'm strictly passing in b"\n" (temporary just to get the ball rolling), how to handle delimiters (a sensible default and what to supprt in general) was something I had in mind to discuss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we make no use of it, so the value should be None/False (whatever it takes not to read it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delimiter=b"\n"
is a reasonable default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something I'm perhaps over complicating but I'm getting caught up wrapping my head around the difference between the bytes reading delimiter and then the "awkward-level" delimiter. At the bytes reading delimiter we use b"\n"
as a default; but then at the awkward level I currently have hardcoded in
ak.str.split_pattern(array, "\n")
. Is this something we should make user configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two should be the same: a chunk should end on an element-ending delimiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK great that was my feeling but was worried I wasn't accounting for some extra case
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more. @@ Coverage Diff @@
## more-json-dev #7 +/- ##
=================================================
+ Coverage 91.94% 91.97% +0.03%
=================================================
Files 15 20 +5
Lines 1626 2667 +1041
=================================================
+ Hits 1495 2453 +958
- Misses 131 214 +83
|
closing for dask-contrib#358 |
No description provided.