Hi! I'm Internet Archive staff helping out with this project. Hope it is alright for me to leave some direct code review here.
These are comments on the existing state of the repo (commit ad108a0), plus peeking at PR #14, and looking at the uploaded test item: https://archive.org/details/osf-io-p8deu.
IA_consume_logs.py
The current behavior is to fetch JSON log entries via API pagination, and write each page as a separate JSON file. Combining all the log entries into a single JSON file would result in a simpler output file structure. However, this isn't important or blocking.
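As a rough sketch of what combining the pages could look like (the `results` key and the function name are assumptions, not taken from the current code):

```python
import json

def write_combined_log(pages, out_path):
    """Merge paginated log responses into a single JSON file."""
    entries = []
    for page in pages:
        # Assumes each page is a dict with its entries under "results";
        # adjust the key to match the actual API response shape.
        entries.extend(page.get("results", []))
    with open(out_path, "w") as f:
        json.dump(entries, f, indent=2)
```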
IA_upload.py
Overall, the upload implementation uses low-level routines in the boto3 S3 library to implement a partially async version of multi-part uploads. At a high level, I would recommend using either the internetarchive python library or boto3's higher-level upload routines to do file transfers. While I don't think either of these is natively async, they can be wrapped in an executor at the item or file level. Implementing chunked uploading yourself (including all the corner cases of retries and partial success) seems fragile and error-prone. Having two upload paths also seems error-prone; the code to set headers is currently duplicated. With boto3, the multipart_threshold config value can control whether any particular upload goes directly or multipart. This is our recommendation, but not a requirement.
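For illustration, a minimal sketch of the internetarchive-library route (the identifier, file paths, and metadata here are placeholders; the library handles multipart, checksums, and retries internally):

```python
from internetarchive import upload

def upload_bag(identifier, filepaths, metadata, headers):
    # upload() is synchronous and returns one requests.Response per file.
    responses = upload(
        identifier,
        files=filepaths,
        metadata=metadata,
        headers=headers,  # any extra x-archive-* headers
        retries=3,
    )
    return all(r.status_code == 200 for r in responses)
```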
The upload function is async. For IA buckets in particular (compared to AWS S3), the boto3 call conn.create_bucket() will probably block for a few seconds. I think it needs to be wrapped in an executor, though I'm not very experienced with the Python 3 async/await patterns, so I may be wrong.
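Something along these lines might work, assuming a boto3 S3 client is in scope (names here are illustrative):

```python
import asyncio
import functools

async def ensure_bucket(s3_client, bucket_name):
    # create_bucket blocks while the IA bucket is created, so run it in the
    # default thread pool executor to keep the event loop responsive.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(
        None,
        functools.partial(s3_client.create_bucket, Bucket=bucket_name),
    )
```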
When using the Authentication header, it is especially important to use transport encryption. The current code uses http://, but we strongly recommend using https:// instead.
CHUNK_SIZE defaults to 1000 bytes, which I believe is too small for production use (though it is probably helpful in development). A default of around 15 MB is common; anywhere from 5 MB to 200 MB is probably reasonable.
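For example (the exact value is a judgment call):

```python
CHUNK_SIZE = 15 * 1024 * 1024  # ~15 MB per part; 5 MB is a practical lower bound
```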
You are checking for HTTP 503 status and retrying; great!
It looks like .DS_Store OS X files got uploaded as part of the Bag. The directory walking code should probably filter these and any other OS-specific files. I think the internetarchive library may do this automatically.
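A possible filter during the directory walk (the exclusion list is just a starting point):

```python
import os

EXCLUDE_NAMES = {".DS_Store", "Thumbs.db", "desktop.ini"}

def iter_bag_files(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip OS-specific junk files and AppleDouble "._*" sidecars.
            if name in EXCLUDE_NAMES or name.startswith("._"):
                continue
            yield os.path.join(dirpath, name)
```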
You are setting some upload metadata headers; great! Here are some more you might want to break out as config flags (see the sketch after this list):
x-archive-queue-derive (0 or 1): controls whether "derivative" files (like OCR and fulltext search for PDFs) should be generated for the uploaded files. The IA default is 1.
x-archive-keep-old-version (0 or 1): controls whether old copies of files are kept if you re-upload. 1 is safer, but consumes more storage space.
x-archive-size-hint (size of file in bytes): hints to our server how large the file you are uploading is. If you upload files larger than 50 GB or so, this is helpful for us to pre-allocate disk space and ensure the upload is successful.
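A sketch of how those could be surfaced as config flags (names and defaults are illustrative, not part of the current code):

```python
# Hypothetical config values; adjust defaults to taste.
QUEUE_DERIVE = 0        # skip OCR/derivative jobs for archival-only uploads
KEEP_OLD_VERSIONS = 1   # safer, at the cost of extra storage

def build_ia_headers(file_size_bytes):
    return {
        "x-archive-queue-derive": str(QUEUE_DERIVE),
        "x-archive-keep-old-version": str(KEEP_OLD_VERSIONS),
        # Helps the server pre-allocate space for very large uploads.
        "x-archive-size-hint": str(file_size_bytes),
    }
```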
Chunked Upload Stubs
I noticed that the osf-io-p8deu item had a bunch of chunked upload files left around:
https://archive.org/download/osf-io-p8deu/__ia_spool/
https://ia601401.us.archive.org/21/items/osf-io-p8deu/__ia_spool/2c044e9b-afc0-47f2-a452-d859fbfe83ca_ias3_part_meta.txt
The __ia_spool directory is used by our servers to store partial upload state.
Because these happened a day before the final uploads, and all the files seem to be in the correct place now, I'm guessing that these were due to a bug in an early version of IA_upload.py (for example, the upload completion didn't succeed), and our server code kept the chunks around "just in case". Is that safe to assume, or should I check if there was a problem on our end?
BagPack/BagIt Structure
My memory is that in BagPack, metadata files like datacite.xml should end up in a top-level metadata/ directory. In the current implementation they are ending up under data/.
Some of the nesting of directories might also be removable; e.g., instead of data/p8deu/files/Archive of OSF Storage/originally published e08245.pdf, it could just be data/files/originally published e08245.pdf.
In both of these cases I'm just commenting. We are mostly agnostic to the layout and would defer to you or any consulted librarians.
@bnewbold Thank you! We'll go through this and make tickets to update the code. I'm not sure yet if we're going to do these updates independently, or wait for the completion of the surveys of potential users so we can just do a single round. However, this is all great.
Also, I think your hypothesis on the chunked uploads being due to a bug is likely correct. Once we are more settled in with the code, we'll keep better notes on failed executions so we can ask about any future situations, but for now, I think it should be fine.