Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prototype dak.from_text #7

Closed
wants to merge 38 commits into from
Closed

prototype dak.from_text #7

wants to merge 38 commits into from

Conversation

douglasdavis
Copy link
Owner

No description provided.

douglasdavis and others added 30 commits March 7, 2023 16:15
if offsets == 0 and length is None:
bytestring = f.read()
else:
bytestring = read_block(f, offsets, length, delimiter)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the upstream dask function which will find the next delimited after the start and stop of the block, right?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer = np.frombuffer(bytestring, dtype=np.uint8)
array = ak.from_numpy(buffer)
array = ak.unflatten(array, len(array))
array = ak.enforce_type(array, "string")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ask our friends: it feels like we should be able to pass the buffer/bytestring directly to awkward and declare it a string rather than take four lines to do it. OTOH, I don't suppose these lines cost anything.

array = ak.unflatten(array, len(array))
array = ak.enforce_type(array, "string")
array_split = ak.str.split_pattern(array, "\n")
lines = array_split[0]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> list[string]? but what was array_split, then, why the [0]?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array_split ends up being of type 1 * var * string; so by grabbing [0] we get an array of strings N * string (used the awkward docs https://awkward-array.org/doc/main/user-guide/how-to-strings-read-binary.html)

delimiter,
False,
blocksize,
"128 KiB",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= sample size? It's a reasonable value, but user might need to change it. We don't actually need it at all, since we know the form of the output is array-of-string (if delimiter is None) or array-of-list-of-string (otherwise). Do we allow for delimiter=None?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it's the sample size; definitely planning on making it user definable! On delimiter- right now I'm strictly passing in b"\n" (temporary just to get the ball rolling), how to handle delimiters (a sensible default and what to supprt in general) was something I had in mind to discuss

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we make no use of it, so the value should be None/False (whatever it takes not to read it)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delimiter=b"\n" is a reasonable default

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something I'm perhaps over complicating but I'm getting caught up wrapping my head around the difference between the bytes reading delimiter and then the "awkward-level" delimiter. At the bytes reading delimiter we use b"\n" as a default; but then at the awkward level I currently have hardcoded in
ak.str.split_pattern(array, "\n"). Is this something we should make user configurable?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two should be the same: a chunk should end on an element-ending delimiter.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK great that was my feeling but was worried I wasn't accounting for some extra case

@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2023

Codecov Report

Merging #7 (85316c3) into more-json-dev (2d89334) will increase coverage by 0.03%.
Report is 336 commits behind head on more-json-dev.
The diff coverage is 89.43%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more.

@@                Coverage Diff                @@
##           more-json-dev       #7      +/-   ##
=================================================
+ Coverage          91.94%   91.97%   +0.03%     
=================================================
  Files                 15       20       +5     
  Lines               1626     2667    +1041     
=================================================
+ Hits                1495     2453     +958     
- Misses               131      214      +83     
Files Changed Coverage Δ
src/dask_awkward/typing.py 100.00% <ø> (ø)
src/dask_awkward/lib/io/io.py 83.52% <69.71%> (-16.48%) ⬇️
src/dask_awkward/lib/io/json.py 83.47% <79.65%> (+14.17%) ⬆️
src/dask_awkward/lib/_utils.py 88.23% <88.23%> (ø)
src/dask_awkward/lib/core.py 91.92% <90.45%> (-2.37%) ⬇️
src/dask_awkward/layers/layers.py 93.00% <91.46%> (-7.00%) ⬇️
src/dask_awkward/lib/structure.py 93.70% <92.77%> (-1.67%) ⬇️
src/dask_awkward/lib/reducers.py 96.80% <93.02%> (+1.51%) ⬆️
src/dask_awkward/lib/io/parquet.py 92.07% <93.10%> (+1.92%) ⬆️
src/dask_awkward/lib/inspect.py 93.54% <93.54%> (ø)
... and 9 more

@douglasdavis douglasdavis changed the title prototype dak.read_text prototype dak.from_text Sep 1, 2023
@douglasdavis douglasdavis changed the base branch from more-json-dev to main September 5, 2023 23:37
@douglasdavis
Copy link
Owner Author

closing for dask-contrib#358

@douglasdavis douglasdavis deleted the read-text branch September 5, 2023 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants