Resumable downloads #274

Open
rvagg opened this issue May 31, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

rvagg (Member) commented May 31, 2023

We currently have no option to resume a download, which makes lassie pretty fussy and problematic for large downloads: if a fetch fails, you have to start from scratch. At least with Kubo you have the data in a blockstore, so it can resume from there.

Challenges to be solved:

  • If you "resume" from an existing CAR, do you have to run a traversal over it to verify that the DAG it contains is correct up to the point where it ends (presumably prematurely)? (A minimal verification sketch follows this list.)
  • Can you "resume with a bundle of blocks", where you supply a CAR (or several?) containing blocks that may be needed in your traversal, but the output CAR is still new?
  • What do we do about HTTP retrievals in this case, since HTTP gives us no "I already have this" facility? Do we just document this behaviour and suggest removing the HTTP retriever?
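
As a concrete illustration of the first bullet, here's a minimal sketch of what "verify before resuming" could look like: walk every section of a partial CAR with go-car/v2 and check that each block's bytes actually hash to its CID. The go-car/v2 calls are real, but the overall flow is an assumption about how a resume check might work, not lassie's implementation; note it only checks block integrity, not that the blocks form a valid prefix of the DAG traversal (that would need a real traversal backed by the CAR).

```go
package main

import (
	"fmt"
	"io"
	"os"

	carv2 "github.com/ipld/go-car/v2"
)

// verifyPartialCAR walks a (possibly truncated) CAR file and re-hashes every
// block against its CID, reporting how far the file is intact.
func verifyPartialCAR(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	br, err := carv2.NewBlockReader(f)
	if err != nil {
		return err
	}
	n := 0
	for {
		blk, err := br.Next()
		if err == io.EOF {
			break // clean end of file; a prematurely cut CAR usually errors instead
		}
		if err != nil {
			return fmt.Errorf("after %d intact blocks: %w", n, err)
		}
		// Re-hash the payload using the CID's own multihash prefix and compare.
		sum, err := blk.Cid().Prefix().Sum(blk.RawData())
		if err != nil {
			return err
		}
		if !sum.Equals(blk.Cid()) {
			return fmt.Errorf("block %d: data does not match CID %s", n, blk.Cid())
		}
		n++
	}
	fmt.Printf("%d blocks verified in %s\n", n, path)
	return nil
}

func main() {
	if err := verifyPartialCAR(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```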

As an experiment I've been trying to download a copy of Wikipedia (bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze) and can't get more than ~500MB in with lassie before I hit timeouts or other errors, with no way of resuming. Kubo gets much further, although it slows to a crawl for me at a certain point; at least I know I can cancel it and start again and it'll still have what it already fetched in its blockstore.

There's a general "large data" problem set that I don't think lassie is up to solving yet.

rvagg added the enhancement label on May 31, 2023
SgtPooki commented Jun 13, 2023

I just started fetching the .zim file used for that wikipedia root with Lassie (./lassie fetch bafybeibkzwf3ffl44yfej6ak44i7aly7rb4udhz5taskreec7qemmw5jiu). It passed 500MB for me with only some minor griping (multiple intermittent error messages in the console: 2023-06-12T18:47:39.837-0700 ERROR dt_graphsync graphsync/graphsync.go:203 normal shutdown of state machine).

It's going almost half as fast as the only web2 mirror I found hosting that file: https://mirror.netcologne.de/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2021-02.zim

-rw-r--r--   1 sgtpooki  staff   3.6G Jun 12 19:01 bafybeibkzwf3ffl44yfej6ak44i7aly7rb4udhz5taskreec7qemmw5jiu.car

vs

[image: screenshot showing the web2 mirror download's progress at the same point in time]

The mirror download was started at 6:43pm PST, lassie at 6:46pm PST.

rvagg (Member, Author) commented Jun 13, 2023

What's happening with the graphsync errors is that lassie attempts multiple protocols but eventually gives up on the ones that aren't yielding results. Because this content is stored by multiple Filecoin providers, it tries each of them at the same time as fetching over bitswap; as they all fail for various reasons, that leaves only bitswap. But I keep getting context cancelled after some period of time on large downloads over bitswap too, regardless of the --global-timeout and --provider-timeout values; I haven't worked out what's going on there yet.
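
For illustration, this is the general shape of that racing pattern in Go: several attempts run concurrently under one cancellable context, failures drop out with a log line (the analogue of the "normal shutdown of state machine" messages above), and the first success cancels the rest. This is a sketch of the behaviour described, not lassie's actual code; all names here are made up.

```go
package retrieval

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// attempt stands in for one retrieval candidate (a graphsync provider,
// bitswap, HTTP, ...).
type attempt struct {
	name string
	run  func(ctx context.Context) error
}

// race runs all attempts in parallel. The first to finish successfully
// cancels the others; attempts that error are simply abandoned while the
// survivors keep going.
func race(ctx context.Context, attempts []attempt) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var wg sync.WaitGroup
	results := make(chan error, len(attempts))
	for _, a := range attempts {
		wg.Add(1)
		go func(a attempt) {
			defer wg.Done()
			err := a.run(ctx)
			if err != nil {
				fmt.Printf("%s: giving up: %v\n", a.name, err)
			} else {
				cancel() // success: shut down the remaining attempts
			}
			results <- err
		}(a)
	}
	wg.Wait()
	close(results)
	for err := range results {
		if err == nil {
			return nil
		}
	}
	return errors.New("all retrieval attempts failed")
}
```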

rvagg (Member, Author) commented Jun 13, 2023

@hannahhoward had the idea of a --blockstore flag for lassie fetch. I imagine that when you use this mode, it doesn't bother trying to do the nicely-ordered CAR thing: it takes an existing CAR (if one exists) under the name it's using ({cid}.car, or whatever -o specifies) and uses that as the LinkSystem to start from, so blocks that already exist in it are skipped during graphsync and bitswap traversals. For HTTP it would have to re-fetch them, but it shouldn't bother writing them into the CAR output again. We have all the mechanics for this internally, so it really shouldn't be hard to do. I think this is probably the easiest path to some level of resilience: I want to recover from fatal fetches without starting from scratch, especially when I have a multi-GB file sitting in front of me (I'm experiencing this today).
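
A hedged sketch of how that could look, built on go-car/v2's read-write CAR blockstore (its OpenReadWrite/Has/Put/Finalize API is real and supports resuming a partially written file): consult Has() before every network fetch, so blocks written on a previous attempt are skipped. fetchBlock and the flat "wanted" list are hypothetical stand-ins for lassie's traversal and retrievers, not anything in the codebase.

```go
package resume

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
	"github.com/ipld/go-car/v2/blockstore"
)

// fetchBlock is a hypothetical stand-in for a network retrieval (graphsync,
// bitswap, or HTTP) of a single block.
func fetchBlock(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	return nil, fmt.Errorf("not implemented: fetch %s over the network", c)
}

// resumeInto reopens a partially written CARv2 for resumption (or creates a
// new one; an already-finalized file would need different handling) and only
// fetches blocks it doesn't already hold.
func resumeInto(ctx context.Context, carPath string, root cid.Cid, wanted []cid.Cid) error {
	bs, err := blockstore.OpenReadWrite(carPath, []cid.Cid{root})
	if err != nil {
		return err
	}
	for _, c := range wanted {
		have, err := bs.Has(ctx, c)
		if err != nil {
			return err
		}
		if have {
			continue // already on disk from the previous attempt; skip the fetch
		}
		blk, err := fetchBlock(ctx, c)
		if err != nil {
			return err
		}
		if err := bs.Put(ctx, blk); err != nil {
			return err
		}
	}
	// Finalize writes the index so the CAR is readable on its own.
	return bs.Finalize()
}
```

In this mode the nicely-ordered CAR guarantee is naturally lost, since block order now depends on what was already on disk from the earlier attempt.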
