prototype `dak.from_text` #7

douglasdavis · 2023-09-01T16:56:22Z

No description provided.

martindurant · 2023-09-01T16:59:42Z

src/dask_awkward/lib/io/io.py

+            if offsets == 0 and length is None:
+                bytestring = f.read()
+            else:
+                bytestring = read_block(f, offsets, length, delimiter)


This is the upstream dask function which will find the next delimiter after both the start and the stop of the block, right?

Yeah, these lines of code are from upstream dask (https://github.com/dask/dask/blob/4178feb58a7e708345ce4e41e018a746b0d1fd06/dask/bytes/core.py#L190)
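For readers following along, here is a minimal standard-library sketch of the delimiter-aligned block read discussed above. It is not the upstream dask code; the names `_next_boundary` and `read_delimited_block` are hypothetical, and the sketch assumes a seekable binary file object.

```python
import io


def _next_boundary(f, pos, delimiter, chunk=4096):
    """Return the offset just past the first delimiter at or after pos (or EOF)."""
    if pos == 0:
        return 0  # the start of the file always counts as a boundary
    f.seek(pos)
    seen = b""
    while True:
        data = f.read(chunk)
        if not data:
            return pos + len(seen)  # reached EOF without finding a delimiter
        seen += data
        idx = seen.find(delimiter)
        if idx != -1:
            return pos + idx + len(delimiter)


def read_delimited_block(f, offset, length, delimiter=b"\n"):
    """Read roughly [offset, offset + length), with both ends moved forward
    to the next delimiter so that no record is cut in half."""
    start = _next_boundary(f, offset, delimiter)
    stop = _next_boundary(f, offset + length, delimiter)
    f.seek(start)
    return f.read(stop - start)


data = b"alpha\nbeta\ngamma\ndelta\n"
f = io.BytesIO(data)
# Nominal 8-byte blocks; each returned block ends on a delimiter.
blocks = [read_delimited_block(f, off, 8) for off in (0, 8, 16)]
```

Because the start and the stop of neighbouring blocks are shifted by the same rule, the blocks tile the file with no gaps and no overlaps.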

martindurant · 2023-09-01T17:01:22Z

src/dask_awkward/lib/io/io.py

+            buffer = np.frombuffer(bytestring, dtype=np.uint8)
+            array = ak.from_numpy(buffer)
+            array = ak.unflatten(array, len(array))
+            array = ak.enforce_type(array, "string")


To ask our friends: it feels like we should be able to pass the buffer/bytestring directly to awkward and declare it a string rather than take four lines to do it. OTOH, I don't suppose these lines cost anything.

martindurant · 2023-09-01T17:02:05Z

src/dask_awkward/lib/io/io.py

+            array = ak.unflatten(array, len(array))
+            array = ak.enforce_type(array, "string")
+            array_split = ak.str.split_pattern(array, "\n")
+            lines = array_split[0]


-> list[string]? But what was array_split, then, and why the [0]?

array_split ends up being of type 1 * var * string, so grabbing [0] gives us an array of strings, N * string (I followed the awkward docs: https://awkward-array.org/doc/main/user-guide/how-to-strings-read-binary.html)
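The shape bookkeeping behind that answer can be mimicked with plain Python lists. This is only an analogy for the awkward types named above, not awkward code: a length-1 list stands in for the single-string array, and a list comprehension stands in for the element-wise split.

```python
bytestring = b"first line\nsecond line\nthird line"

# Stand-in for an awkward array of type `1 * string`:
# a length-1 list holding the whole block as one string.
array = [bytestring.decode("utf-8")]

# Splitting is applied element-wise, so each string becomes a list of
# strings (the analogue of `1 * var * string`).
array_split = [s.split("\n") for s in array]

# The outer dimension still has length 1; indexing [0] drops it and leaves
# one entry per line (the analogue of `N * string`).
lines = array_split[0]
```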

martindurant · 2023-09-01T17:03:55Z

src/dask_awkward/lib/io/io.py

+        delimiter,
+        False,
+        blocksize,
+        "128 KiB",


= sample size? It's a reasonable value, but the user might need to change it. We don't actually need it at all, since we know the form of the output is array-of-string (if delimiter is None) or array-of-list-of-string (otherwise). Do we allow for delimiter=None?

Yeah, it's the sample size; definitely planning on making it user definable! On the delimiter: right now I'm strictly passing in b"\n" (temporary, just to get the ball rolling). How to handle delimiters (a sensible default and what to support in general) was something I had in mind to discuss.

Actually we make no use of it, so the value should be None/False (whatever it takes not to read it)

delimiter=b"\n" is a reasonable default

Something I'm perhaps overcomplicating: I'm getting caught up on the difference between the bytes-reading delimiter and the "awkward-level" delimiter. At the bytes-reading level we use b"\n" as a default, but at the awkward level I currently have ak.str.split_pattern(array, "\n") hardcoded. Is this something we should make user configurable?

The two should be the same: a chunk should end on an element-ending delimiter.

OK great that was my feeling but was worried I wasn't accounting for some extra case
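A small sketch of why a single delimiter is enough, under the assumption that every chunk ends on an element-ending delimiter as described above. The helper `split_chunk` is hypothetical and uses plain `bytes`/`str` operations rather than `ak.str.split_pattern`.

```python
def split_chunk(chunk: bytes, delimiter: bytes = b"\n") -> list[str]:
    """Split one delimiter-terminated chunk into its elements."""
    # Drop the single trailing delimiter so the split does not produce
    # an empty final element.
    if chunk.endswith(delimiter):
        chunk = chunk[: -len(delimiter)]
    # Use the same delimiter that defined the chunk boundaries.
    return chunk.decode("utf-8").split(delimiter.decode("utf-8"))


# Chunks as a delimiter-aware bytes reader would hand them over.
chunks = [b"alpha\nbeta\n", b"gamma\n", b"delta\n"]
lines = [line for chunk in chunks for line in split_chunk(chunk)]
```

If the split used a different delimiter than the reader, a chunk boundary could fall in the middle of an element and the pieces would end up in different partitions.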

codecov-commenter · 2023-09-01T17:39:50Z

Codecov Report

Merging #7 (85316c3) into more-json-dev (2d89334) will increase coverage by 0.03%.
Report is 336 commits behind head on more-json-dev.
The diff coverage is 89.43%.

@@                Coverage Diff                @@
##           more-json-dev       #7      +/-   ##
=================================================
+ Coverage          91.94%   91.97%   +0.03%     
=================================================
  Files                 15       20       +5     
  Lines               1626     2667    +1041     
=================================================
+ Hits                1495     2453     +958     
- Misses               131      214      +83

Files Changed	Coverage Δ
src/dask_awkward/typing.py	`100.00% <ø> (ø)`
src/dask_awkward/lib/io/io.py	`83.52% <69.71%> (-16.48%)`	⬇️
src/dask_awkward/lib/io/json.py	`83.47% <79.65%> (+14.17%)`	⬆️
src/dask_awkward/lib/_utils.py	`88.23% <88.23%> (ø)`
src/dask_awkward/lib/core.py	`91.92% <90.45%> (-2.37%)`	⬇️
src/dask_awkward/layers/layers.py	`93.00% <91.46%> (-7.00%)`	⬇️
src/dask_awkward/lib/structure.py	`93.70% <92.77%> (-1.67%)`	⬇️
src/dask_awkward/lib/reducers.py	`96.80% <93.02%> (+1.51%)`	⬆️
src/dask_awkward/lib/io/parquet.py	`92.07% <93.10%> (+1.92%)`	⬆️
src/dask_awkward/lib/inspect.py	`93.54% <93.54%> (ø)`
... and 9 more

douglasdavis · 2023-09-05T23:42:52Z

closing for dask-contrib#358

douglasdavis and others added 30 commits March 7, 2023 16:15

schema <--> form/layout

7b00694

json IO dev

df3db18

lint

d2b7a67

Merge remote-tracking branch 'upstream/main' into more-json-dev

30e7748

Merge remote-tracking branch 'upstream/main' into more-json-dev

8826cf8

Merge remote-tracking branch 'upstream/main' into more-json-dev

5aa5d01

update w.r.t. upstream awkward changes

4e83fd2

Merge remote-tracking branch 'upstream/main' into more-json-dev

4e5e883

unpolished column projection works

bc686d2

Merge remote-tracking branch 'upstream/main' into more-json-dev

a235067

passing tests

31802a9

Merge branch 'main' into more-json-dev

8471d93

Merge remote-tracking branch 'upstream/main' into more-json-dev

c030a9b

Merge branch 'main' into more-json-dev

4ed1aff

Merge remote-tracking branch 'upstream/main' into more-json-dev

b197665

to_json correct function args; some typing

7a97c02

Merge remote-tracking branch 'upstream/main' into more-json-dev

a49c1cd

Merge branch 'main' into more-json-dev

ad97c61

more layout type supported

c8bc1ec

handle minItems maxItems for awkward's regular arrays

d1dcc72

Merge branch 'main' into more-json-dev

84a0dee

Merge branch 'main' into more-json-dev

d7957be

add tests (and move json specific tests)

f2cb69c

generic paths

bd65e00

typing

5a07892

handle unknown type; add layout_to_jsonschema to top level API

8764ec6

Merge remote-tracking branch 'upstream/main' into more-json-dev

1345d27

Merge branch 'main' into more-json-dev

ad3cb35

rework meta determination; remove single obj per file support

8ec7acb

Merge remote-tracking branch 'upstream/main' into more-json-dev

74c0d79

douglasdavis added 4 commits August 30, 2023 16:05

single obj per file back

66cf018

json byte chunks project_columns compatible

cd6168f

abstract out bytes reading ingredients

f9bbec1

rough outline for read_text API

4dc3832

martindurant reviewed Sep 1, 2023

typing, some sensible defaults

e3bff17

delimiter use

85316c3

douglasdavis changed the title ~~prototype dak.read_text~~ prototype dak.from_text Sep 1, 2023

support compressed files; drop trailing newline

94de75d

douglasdavis mentioned this pull request Sep 5, 2023

feat: add bytes_with_sample for creating bytes reading instructions dask-contrib/dask-awkward#357

Merged

Merge remote-tracking branch 'upstream/main' into read-text

50d4aaa

douglasdavis changed the base branch from more-json-dev to main September 5, 2023 23:37

douglasdavis closed this Sep 5, 2023

douglasdavis deleted the read-text branch September 5, 2023 23:47
