
Error: Network connection lost. #69

Open · pierre818181 opened this issue Oct 21, 2024 · 12 comments · May be fixed by #78

Comments

pierre818181 commented Oct 21, 2024

I get this error once all the PATCH requests are done and finishUpload is called. I can consistently reproduce this at the same stage during the push.

Is this something to do with my Cloudflare plan?

Stack:

error error Error: Network connection lost.: undefined: Error: Network connection lost.
    at async R2Registry.finishUpload (index.js:6416:7)
    at async index.js:5246:22
    at async fetch (index.js:13:27)
    at async Object.fetch (index.js:13:27)
    at async Object.fetch (index.js:6482:19)

This is the object that is there when it errors:

obj {"storageClass":"Standard","range":{"offset":0,"length":6134971208},"customMetadata":{},"httpMetadata":{},"uploaded":"2024-10-21T21:43:08.398Z","checksums":{},"httpEtag":"\"3cc9dd57fa02f8e14c383472bfabc35a-62\"","etag":"3cc9dd57fa02f8e14c383472bfabc35a-62","size":6134971208,"version":"7e6d4efe355af1ee966e8cd0f43699e2","key":"7e97b421-1689-4691-83c0-aeef01faf2b4"}

Is it because the size is too big?
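
For background, the push flow described here follows the OCI distribution spec's chunked upload: a POST opens an upload session, each chunk goes up in a PATCH request, and a final PUT with the blob digest finishes the upload (the finishUpload step above). A minimal sketch of that flow; pushBlob, registryUrl, and the chunk size are illustrative placeholders, not names from this repo's push script:

```ts
// Minimal sketch of the OCI chunked-upload flow (illustrative names, not the
// actual push script in this repo). The final PUT is the "finishUpload" step
// that fails with "Network connection lost" above.
async function pushBlob(registryUrl: string, repo: string, blob: Uint8Array, digest: string) {
  // 1. Open an upload session; the registry answers with its Location.
  const start = await fetch(`${registryUrl}/v2/${repo}/blobs/uploads/`, { method: "POST" });
  let location = start.headers.get("location")!;

  // 2. Upload the blob in PATCH-ed chunks.
  const CHUNK = 10 * 1024 * 1024; // 10 MiB per chunk
  for (let off = 0; off < blob.length; off += CHUNK) {
    const chunk = blob.subarray(off, Math.min(off + CHUNK, blob.length));
    const res = await fetch(location, {
      method: "PATCH",
      headers: {
        "content-range": `${off}-${off + chunk.length - 1}`,
        "content-type": "application/octet-stream",
      },
      body: chunk,
    });
    location = res.headers.get("location") ?? location;
  }

  // 3. Finish the upload with a PUT carrying the expected digest.
  const sep = location.includes("?") ? "&" : "?";
  await fetch(`${location}${sep}digest=${encodeURIComponent(digest)}`, { method: "PUT" });
}
```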

gabivlj (Collaborator) commented Oct 22, 2024

Hello! Thank you for the issue. I'm curious: how are you pushing the layers? Docker push? What's your container image like?

gabivlj (Collaborator) commented Oct 22, 2024

I found the issue: the layer you are trying to push is larger than 5 GiB, which exceeds R2's limits, because we perform a copy to change the object's path. I will try to get a fix out for this.
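
(For reference, the object dump above shows size 6134971208 bytes ≈ 5.71 GiB, just over that limit.) One way around a too-large single copy is to re-upload the object in parts with R2's multipart API. A minimal sketch under that assumption; copyLargeObject and the 64 MiB part size are illustrative choices, not the actual fix in #78:

```ts
// Sketch: "copy" an R2 object larger than the single-copy limit by reading
// it back in ranges and re-uploading it as a multipart upload.
// copyLargeObject and PART_SIZE are illustrative, not the fix in #78.
const PART_SIZE = 64 * 1024 * 1024; // 64 MiB: stays under Worker memory limits

async function copyLargeObject(bucket: R2Bucket, srcKey: string, dstKey: string) {
  const head = await bucket.head(srcKey);
  if (head === null) throw new Error(`source object ${srcKey} not found`);

  const upload = await bucket.createMultipartUpload(dstKey);
  const parts: R2UploadedPart[] = [];

  for (let offset = 0, n = 1; offset < head.size; offset += PART_SIZE, n++) {
    const length = Math.min(PART_SIZE, head.size - offset);
    const chunk = await bucket.get(srcKey, { range: { offset, length } });
    if (chunk === null) throw new Error(`range read at offset ${offset} failed`);
    // Buffer each part in memory before uploading it under its part number.
    parts.push(await upload.uploadPart(n, await chunk.arrayBuffer()));
  }

  await upload.complete(parts);
}
```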

pierre818181 (Author) commented:

Sure @gabivlj

pierre818181 (Author) commented:

> Hello! Thank you for the issue. I'm curious: how are you pushing the layers? Docker push? What's your container image like?

We are using the index.ts inside the push folder to push.

pierre818181 (Author) commented:

@gabivlj any chance you have been able to look into this?

gabivlj (Collaborator) commented Nov 15, 2024

Hi @pierre818181, I looked into this and I think I have a possible solution; I just haven't gotten around to implementing it yet! Sorry for the delay, I will get to this when I have some time.

gabivlj linked pull request #78 on Nov 25, 2024 that will close this issue
gabivlj (Collaborator) commented Nov 25, 2024

Hi @pierre818181, can you check if #78 works for you?

pierre818181 (Author) commented Nov 25, 2024

Hi @gabivlj, I just tried it and I see the same issue in the Cloudflare logs. Here are the upload logs:

2024-11-25 14:39:33
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7434927308 upload bytes left.
2024-11-25 14:39:45
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7334927308 upload bytes left.
2024-11-25 14:39:53
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7234927308 upload bytes left.
2024-11-25 14:40:05
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7134927308 upload bytes left.
2024-11-25 14:40:17
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7034927308 upload bytes left.
2024-11-25 14:40:31
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6934927308 upload bytes left.
2024-11-25 14:40:42
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6834927308 upload bytes left.
2024-11-25 14:40:51
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6734927308 upload bytes left.
2024-11-25 14:41:02
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6634927308 upload bytes left.
2024-11-25 14:41:15
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6534927308 upload bytes left.
2024-11-25 14:41:25
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6434927308 upload bytes left.
2024-11-25 14:41:37
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 6334927308 upload bytes left.
[INFO]
sha256:09efbcfccf416036d57e24719b0884df9b5c1d381b87d53b82757abb26252899: 7434927308 upload bytes left.

From the logs we can see that once it gets to 6334927308 bytes left, it goes back to uploading everything again, presumably because a chunk upload failed?

Here are the cloudflare logs:

"Uploading chunk: error Error: Network connection lost.: undefined: Error: Network connection lost. at async appendStreamKnownLength (index.js:6383:22) at async R2Registry.uploadChunk (index.js:6452:17) at async index.js:5586:22 at async fetch (index.js:13:27) at async Object.fetch (index.js:13:27) at async Object.fetch (index.js:6610:19)",

Previously it used to fail only once all the bytes had been uploaded; now it's failing much earlier, and I'm not sure why.
If it helps, you can fork this repo (https://github.com/runpod-workers/worker-vllm) and try pushing it to your CF registry; it produces one ~8.5 GB layer.

gabivlj (Collaborator) commented Nov 26, 2024

Hello @pierre818181, I added a WIP commit with retry logic to that same branch and tested it. Is this transient issue still occurring after that?

Where is this image being pushed from?
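
For readers following along, retry logic for a transient error like this would presumably look something like the sketch below: wrap each chunk upload and retry failures with backoff. The names and constants are assumptions, not the code from the WIP commit:

```ts
// Illustrative retry wrapper for transient "Network connection lost" errors;
// uploadChunkWithRetry and maxAttempts are assumptions, not the actual commit.
async function uploadChunkWithRetry(
  doUpload: () => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await doUpload();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff before the next attempt: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
}
```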

gabivlj (Collaborator) commented Nov 26, 2024

I just noticed that the Worker might throw a Workers CPU exception when calculating the digest; I will try another approach later.
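
Hashing a multi-gigabyte layer in JavaScript is a plausible way to hit the CPU limit. One possible mitigation (not necessarily the approach taken here) is the Workers runtime's non-standard crypto.DigestStream, which hashes natively as the body streams through instead of buffering it for crypto.subtle.digest. A sketch, assuming an R2 bucket binding:

```ts
// Sketch: stream an R2 object's body through the non-standard Workers
// crypto.DigestStream so hashing happens natively as bytes flow through.
async function sha256OfObject(bucket: R2Bucket, key: string): Promise<string> {
  const obj = await bucket.get(key);
  if (obj === null) throw new Error(`object ${key} not found`);

  const digestStream = new crypto.DigestStream("SHA-256");
  await obj.body.pipeTo(digestStream);
  const digest = await digestStream.digest; // resolves to an ArrayBuffer

  // Hex-encode to the usual sha256:<hex> digest form.
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `sha256:${hex}`;
}
```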

gabivlj (Collaborator) commented Nov 27, 2024

Hello. I added another commit that should let you circumvent this issue (sorry for all the debugging in this thread!)

Dramelac (Contributor) commented Jan 6, 2025

Hello @gabivlj,
I've reproduced the same error as @pierre818181 on the main branch, and yes, I can confirm that the fix on the gv/69 branch works (tested with a ~10 GB layer)!
When can we hope for a merge to main?
