
IA Code Review (2019-12-31) #15

Open
bnewbold opened this issue Dec 31, 2019 · 2 comments

@bnewbold

Hi! I'm Internet Archive staff helping out with this project. Hope it is alright for me to leave some direct code review here.

These are comments on the existing state of the repo (commit ad108a0), plus peeking at PR #14, and looking at the uploaded test item: https://archive.org/details/osf-io-p8deu.

IA_consume_logs.py

The current behavior is to fetch JSON log entries via API pagination, and write each page as a separate JSON file. Combining all the log entries into a single JSON file would result in a simpler output file structure. However, this isn't important or blocking.
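
If it helps, here is a minimal sketch of merging the per-page output into one file. It assumes each page file contains a JSON list of entries; the file names are placeholders, not the actual output of IA_consume_logs.py:

```python
import glob
import json

# Collect entries from every per-page file (names are hypothetical).
entries = []
for path in sorted(glob.glob("logs_page_*.json")):
    with open(path) as f:
        entries.extend(json.load(f))

# Write a single combined log file.
with open("logs_combined.json", "w") as f:
    json.dump(entries, f, indent=2)
```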

IA_upload.py

Overall, the upload implementation uses low-level routines in the boto3 S3 library to implement a partially async version of multi-part uploads. At a high level, I would recommend using either the internetarchive python library or boto3's higher-level upload routines to do file transfers. While I don't think either of these is natively async, they can be wrapped in an executor at the item or file level. Implementing chunked uploading yourself (including all the corner cases of retries and partial success) seems fragile and error prone. Having two upload paths also seems error prone; the code to set headers is currently duplicated. With boto3, the multipart_threshold config value can control whether any particular upload goes directly or multipart. This is our recommendation, but not a requirement.
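
A rough sketch of the boto3 higher-level path; the endpoint, credentials, bucket, and file names below are placeholders, not values from IA_upload.py, and behavior against our S3-like API would still need testing:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder endpoint and credentials for IA's S3-like API.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us.archive.org",
    aws_access_key_id="IA_ACCESS_KEY",
    aws_secret_access_key="IA_SECRET_KEY",
)

# Files below multipart_threshold go up as a single PUT; larger files
# are split into multipart_chunksize parts, with retries handled by boto3.
config = TransferConfig(
    multipart_threshold=15 * 1024 * 1024,
    multipart_chunksize=15 * 1024 * 1024,
)

s3.upload_file("data/example.pdf", "osf-io-example", "example.pdf", Config=config)
```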

The upload function is async. For IA buckets in particular (compared to AWS S3), the boto3 call conn.create_bucket() will probably block for a few seconds. I think it needs to be wrapped in an executor? I'm not very experienced with the python3 async/await patterns, so I may be wrong.
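
For example, a minimal sketch of what I mean, assuming conn is the boto3 client used in IA_upload.py (the wrapper name is just illustrative):

```python
import asyncio
import functools

# Run the blocking boto3 call in a thread pool so it doesn't stall the
# event loop; "conn" stands in for whatever client IA_upload.py creates.
async def create_bucket_async(conn, bucket_name):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, functools.partial(conn.create_bucket, Bucket=bucket_name)
    )
```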

When using the Authorization header, it is especially important to use transport encryption. The current code uses http://, but we strongly recommend https:// instead.

CHUNK_SIZE defaults to 1000 bytes, which I believe is too small for production use (though it is probably helpful in development). A default of around 15 MB is common; anywhere from 5 MB to 200 MB is probably reasonable.

You are checking for HTTP 503 status and retrying; great!

It looks like .DS_Store OS X files got uploaded as part of the Bag. The directory walking code should probably filter these and any other OS-specific files. I think the internetarchive library may do this automatically.
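
A possible filter during the directory walk; the exclusion list here is just a suggestion:

```python
import os

# OS-specific junk files to skip when walking the bag (illustrative list).
EXCLUDE_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}

def walk_bag(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Also skip macOS AppleDouble files like "._example.pdf".
            if name in EXCLUDE_FILES or name.startswith("._"):
                continue
            yield os.path.join(dirpath, name)
```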

You are setting some upload metadata headers; great! Here are some more you might want to break out as config flags (a sketch of passing them follows the list):

  • x-archive-queue-derive (0 or 1): controls whether "derivative" files (like OCR and fulltext search for PDFs) should be generated from the files uploaded. IA default is 1.
  • x-archive-keep-old-version (0 or 1): controls whether old copies of files are kept if you re-upload. 1 is safer, but consumes more storage space.
  • x-archive-size-hint (size of file in bytes): hints to our server how large the file you are uploading is. If you upload files larger than 50 GB or so, this helps us pre-allocate disk space and ensure the upload is successful.
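
For example, with the internetarchive library these can go through the headers argument of upload(); the identifier, path, and values below are placeholders:

```python
import os
from internetarchive import upload

# Placeholder headers; wire these to config flags as appropriate.
headers = {
    "x-archive-queue-derive": "0",
    "x-archive-keep-old-version": "1",
    "x-archive-size-hint": str(os.path.getsize("data/example.pdf")),
}

upload("osf-io-example", files=["data/example.pdf"], headers=headers)
```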

Chunked Upload Stubs

I noticed that the osf-io-p8deu item had a bunch of chunked upload files left around:

https://archive.org/download/osf-io-p8deu/__ia_spool/
https://ia601401.us.archive.org/21/items/osf-io-p8deu/__ia_spool/2c044e9b-afc0-47f2-a452-d859fbfe83ca_ias3_part_meta.txt

The __ia_spool directory is used by our servers to store partial upload state.

Because these happened a day before the final uploads, and all the files seem to be in the correct place now, I'm guessing that these were due to a bug in an early version of IA_upload.py (for example, the upload completion step didn't succeed) and our server code kept the chunks around "just in case". Is that safe to assume, or should I check whether there was a problem on our end?

BagPack/BagIt Structure

My memory is that in bagpack, metadata files like the datacite.xml should
end up in a top-level metadata/ directory. In the current implementation
they are ending up under data/.

Some of the nesting of directories might also be removable; e.g. instead of
data/p8deu/files/Archive of OSF Storage/originally published e08245.pdf,
it could just be data/files/originally published e08245.pdf.

In both of these cases I'm just commenting; we are mostly agnostic to the
layout and would defer to you or any consulted librarians.

@brianjgeiger
Collaborator

@bnewbold Thank you! We'll go through this and make tickets to update the code. I'm not sure yet if we're going to do these updates independently, or wait for the completion of the surveys of potential users so we can just do a single round. However, this is all great.

@brianjgeiger
Collaborator

Also, I think your hypothesis that the chunked uploads were due to a bug is likely correct. Once we are more settled in with the code, we'll keep better notes on failed executions so we can ask about any future situations, but for now, I think it should be fine.
