Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support byte ranges #306

Merged
merged 15 commits into from
Jul 11, 2023
Merged

feat: support byte ranges #306

merged 15 commits into from
Jul 11, 2023

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jun 9, 2023

Initial exploration work; it seems like this might not be too tricky after all.

@rvagg rvagg changed the title feat: support verifying byte range CARs feat: support byte ranges Jun 9, 2023
@rvagg
Copy link
Member Author

rvagg commented Jun 9, 2023

The failures in here are:

    --- FAIL: TestHttpFetch/graphsync_nested_large_sharded_file,_with_path,_dag-scope_entity (0.23s)
    --- FAIL: TestHttpFetch/bitswap_nested_large_sharded_file,_with_path,_dag-scope_entity (0.21s)
    --- FAIL: TestHttpFetch/http_nested_large_sharded_file,_with_path,_dag-scope_entity (0.13s)

The overlap they have with this is that they use a unixfs-preload with a matcher, and our verifier now does something with matches (read all bytes from a LargeBytesNode). Since dag-scope=entity is doing this already via unixfs-preload, so these failures should just be that we're seeing the same blocks twice because we read+discard them in two places. That'll need some unifying (or maybe just removing unixfs-preload when we're doing entity + byte ranges?).

@hannahhoward
Copy link
Collaborator

@rvagg completely forgot to ever push this -- I started a long while ago. Yes, I think what we need to do is turn off pre-load on range requests. The only question then is what to do about UnixFSDirectories. I don' t know how that is supposed to behave with dag-scope=entity & entity-bytes. #307

@hannahhoward
Copy link
Collaborator

That code is super old you don't need to use

@hannahhoward
Copy link
Collaborator

Looking at the IPIP-402 proposal, this is gonna be a bit tricky, cause when the node is a directory, entity-bytes is supposed to have no effect, i.e. send the whole thing. So then we need preload for directories, non-preload for files. Eek.

@hannahhoward
Copy link
Collaborator

One suggestion: maybe preload should only do directories. That's all you actually need it for.

Then if there is no byte range selector, we simply match the whole file.

That would be a breaking change to preload, so it'd need to get to boost for the purposes of GraphSync.

@rvagg
Copy link
Member Author

rvagg commented Jun 28, 2023

This now depends on ipfs/go-unixfsnode#52 and also ipld/go-ipld-prime#529; those are what are being pointed to in the current go.mod.

The test fixtures now include gobs of ranges reading from a file, including lots of boundary cases.

Things still to do:

  • Wire up the whole pipeline to handle these, currently it's just the verifier I'm testing but the other traversal(s) need to do this and http handler and cmdline need to accept byte ranges too.
  • Negative ranges — I could defer this feature to a separate batch of work, it's nontrivial
  • Handle the case where the entity at the path terminus is not a file and can't be read as a byte range — this is also potentially very nontrivial since selector's don't really have an "or" for this case and the slice matcher fails hard when it's not interpreting bytes.

@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2023

Codecov Report

Merging #306 (4b3e0fb) into main (013521e) will increase coverage by 0.42%.
The diff coverage is 64.37%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   76.24%   76.67%   +0.42%     
==========================================
  Files          85       86       +1     
  Lines        6378     6674     +296     
==========================================
+ Hits         4863     5117     +254     
- Misses       1245     1274      +29     
- Partials      270      283      +13     
Impacted Files Coverage Δ
pkg/internal/itest/testpeer/backedstore.go 22.00% <22.00%> (ø)
cmd/lassie/fetch.go 32.57% <28.57%> (-0.15%) ⬇️
pkg/types/types.go 73.38% <59.18%> (-6.88%) ⬇️
pkg/internal/itest/testpeer/peerhttpserver.go 68.93% <64.94%> (-11.07%) ⬇️
pkg/server/http/util.go 52.17% <66.66%> (+1.62%) ⬆️
pkg/verifiedcar/verifiedcar.go 80.62% <75.00%> (-2.61%) ⬇️
pkg/server/http/ipfs.go 67.29% <82.60%> (+1.32%) ⬆️
pkg/internal/itest/testpeer/generator.go 73.91% <100.00%> (+2.57%) ⬆️
pkg/internal/testutil/toblocks.go 75.67% <100.00%> (-0.65%) ⬇️
pkg/retriever/bitswapretriever.go 93.83% <100.00%> (-0.06%) ⬇️
... and 2 more

... and 9 files with indirect coverage changes

@rvagg
Copy link
Member Author

rvagg commented Jun 29, 2023

Added support for the "or" part of 402 - when specifying entity-bytes and the terminus isn't a file, it should behave as if you didn't specify entity-bytes. This required removing the hard error in the slice matcher; see ipld/go-ipld-prime#529.

@rvagg
Copy link
Member Author

rvagg commented Jun 30, 2023

Read aside from needing these merged:

The missing piece here is negative byte ranges. Those will be rejected as unsupported at the moment. Further work in a later PR.

Important to note that only Bitswap requests (and some HTTP!) are going to work with this at the moment:

  • HTTP is going to need eipfs to support this, we'll reject their CAR streams in the verifier without it.
  • Graphsync needs the go-ipld-prime and go-unixfsnode changes, or there's a disagreement between the peers about what the blocks to exchange are.

BUT ipld/frisbii#9 uses this, and I have a Frisbii node online with that, serving wikipedia and also birb.mp4 (bafybeic56z3yccnla3cutmvqsn5zy3g24muupcsjtoyp3pu5pm5amurjx4) and the fixture file in here (bafybeifrrglx2issn2had5rtstn3xltla6vxmpjfwfz7o3hapvkynh4zoq) so all of the queries in the file unixfs_20m_variety.md in this PR will work too.

@rvagg rvagg force-pushed the rvagg/byte-range branch 2 times, most recently from ed5ecc5 to ae84940 Compare June 30, 2023 06:22
rvagg added a commit to ipld/frisbii that referenced this pull request Jun 30, 2023
@rvagg rvagg marked this pull request as ready for review June 30, 2023 23:35
cmd/lassie/fetch.go Outdated Show resolved Hide resolved
pkg/internal/itest/testpeer/backedstore.go Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need want to be putting a 20meg test fixture into this repo? it'll make the download quite a bit more for anyone including lassie

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking this was going to be handled by the go pkg install tooling and ignored since it's in testdata, but I confirmed that it's in my ~/go/pkg when used as a dependency, which is .. unfortunate. I don't really want this to be a burden for downloaders.

We could git submodule it and put it somewhere else (I have more to make for this). Or do you have a better suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you have code for regenerating it, can you put the fixture itself in gitignore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit bespoke; I might be able to regenerate it exactly as is but it was made to be fixed so I could inspect it and manually make fixtures that point to it.

I've been thinking that maybe putting this into ipld/ipld along with other fixtures might be the way to deal with this. Then either export a golang package from there or submodule it like we do with go-ipld-prime. We can have a section about paths traversal and unixfs and the IPLD specifics of the trustless gateway spec as it relates to these matters. I was toying with the idea of an ipld/ipld-fixtures repo, but that's really what ipld/ipld was meant to be so maybe try and use that. I'll experiment a bit and see.

Copy link
Collaborator

@hannahhoward hannahhoward Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment in my review: my question is: how would we add to this fixture set? It seems like if we can't find a way to regenerate, that makes the work a bit harder.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I think we can safely assume there will be more test cases we haven't thought of yet coming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I wanted to build a corpus of actual data and descriptive tests that can be run against them—so they can be run any place, including Frisbii and others. We currently have a lot of random-generated on-the-fly data (which I keep on having to squash flakies on because of edge cases) but the intent here is to have something fixed that can be reasoned about and inspected separately to the code that generated it.

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I don't see any big red flags, but I think we need to talk through the implications of deploying this (esp in Saturn) before we merge.

@@ -0,0 +1,985 @@
# unixfs_20m_variety (test fixture)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the test cases results generated with your unixfs utility?

how would someone who wanted to make a change reproduce this or modify it? I'm concerned a little bit about lock in.

pkg/server/http/util.go Outdated Show resolved Hide resolved
@@ -55,8 +55,6 @@ type Config struct {
MaxBlocks uint64 // set a budget for the traversal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eek so, I just realized that we're gonna need to figure out what to do with .storage, as they are not yet supporting ranges. I can pop over and try to add it but... in the meantime, maybe we need to figure out some content negotiation in the http retriever.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise we're looking at 100% failure here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100%. Bitswap will work, Frisbii will work, but yeah the rest will bork if we start pushing them through.

I had some thoughts about negotiation over in 412: ipfs/specs#412 (comment)

If we were to request a byte range but then be told the server doesn't support it, at least we could switch modes to parsing their response. What we do on the way out of Lassie is another matter—do we give the user more than they asked for and just give our own negotiation signal that we don't support range cause we couldn't find anyone else that does? Either way this is really hard.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically if we knew that we were getting a full response, we could modify the verifier to just skip blocks till you got to the expected blocks in a range request (since technically a full request contains all the blocks for a range request in the correct order, just with other blocks in between). In fact, turning this on when entity-bytes is present by default might be the hack needed to get this over the line.

pkg/verifiedcar/verifiedcar.go Outdated Show resolved Hide resolved
@rvagg
Copy link
Member Author

rvagg commented Jul 5, 2023

Rebased to main and I extracted the large fixture file to ipld/ipld#290 (open for discussion!). I removed and squashed it out of the commits on this branch so it shouldn't weigh it down:

$ git rev-list --disk-usage --objects HEAD...main
391083

@rvagg
Copy link
Member Author

rvagg commented Jul 7, 2023

Ready for deeper review, current blockers for merge are these:

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hannahhoward hannahhoward merged commit 6c8a3dd into main Jul 11, 2023
@hannahhoward hannahhoward deleted the rvagg/byte-range branch July 11, 2023 12:51
rvagg added a commit to ipld/frisbii that referenced this pull request Sep 2, 2023
rvagg added a commit to ipld/frisbii that referenced this pull request Sep 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants