File names generated via Nuxeo harvest include unneeded characters #1655

Open
elopatin-uc3 opened this issue Oct 30, 2023 · 7 comments

@elopatin-uc3
Contributor

The harvester appears to derive the filename from the href value in the feed rather than from the dc:title field.

For example:

<link href="https://nuxeo.cdlib.org/Nuxeo/nxfile/default/273d436f-7036-4354-a1f8-0ba5c7855a2b/file:content/MS-F044_accn2022_005_bag.zip?changeToken=4-0" rel="alternate" title="Main content file">
  <opensearch:checksum algorithm="MD5">210648d666d520d81a9c5725aed79106</opensearch:checksum>
</link>
<dc:creator>Parham, Thomas A.</dc:creator>
<dc:title>MS-F044_accn2022_005_bag</dc:title>

Note the changeToken=4-0 query parameter at the end of the zip file URL.

Unfortunately, the .zip extension is not present in dc:title, so dc:title alone cannot supply the full filename.
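
One possible way to get a clean filename that still carries the extension is to take the last path segment of the href and ignore the query string. The following is only a minimal Python sketch of that idea, not the harvester's actual code; the function name filename_from_href is hypothetical.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def filename_from_href(href: str) -> str:
    """Return the last path segment of href, ignoring query and fragment.

    Hypothetical helper for illustration; not taken from the harvester.
    """
    # urlsplit separates the path from "?changeToken=..." and any "#fragment".
    path = urlsplit(href).path
    # Decode percent-escapes so the name matches what the depositor uploaded.
    return unquote(basename(path))


href = (
    "https://nuxeo.cdlib.org/Nuxeo/nxfile/default/"
    "273d436f-7036-4354-a1f8-0ba5c7855a2b/file:content/"
    "MS-F044_accn2022_005_bag.zip?changeToken=4-0"
)
print(filename_from_href(href))  # MS-F044_accn2022_005_bag.zip
```

Working from the href keeps the .zip extension that dc:title lacks while dropping the changeToken parameter.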
@elopatin-uc3
Contributor Author

elopatin-uc3 commented Nov 6, 2023

@elopatin-uc3 to file a separate issue for a different Nuxeo harvest filename problem, in which the ARK is included and the file extension is omitted.

https://github.com/CDLUC3/mrt-doc-private/issues/66

elopatin-uc3 self-assigned this Nov 13, 2023
@terrywbrady
Contributor

We will investigate whether the compose plugin can be installed on Linux.

@dloy
Contributor

dloy commented Dec 5, 2023

Comment:
The S3 key used by these files includes the query portion of the URL.

<key>ark:/13030/m5r89gbn|1|producer/nuxeo.cdlib.org/Nuxeo/nxfile/default/cede1001-e4f1-4223-9a34-243da2296bcd/file:content/highlander_19731004_027.tif?changeToken=6-0</key>

Changing the ingest pathname handling to strip the query portion of the URL will store new content under the correct pathname/key, but the original key and its content will remain with the earlier version.
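
To make the difference between the current key and the corrected key concrete, here is a small Python sketch. It assumes the key layout is ark|version|pathname, which is inferred only from the example above; strip_query_from_key is a hypothetical name, not an ingest function.

```python
def strip_query_from_key(key: str) -> str:
    """Return the key with any "?query" removed from its pathname part.

    Assumes the layout "<ark>|<version>|<pathname>" seen in the example key.
    """
    ark, version, pathname = key.split("|", 2)
    # Everything from the first "?" onward is treated as the URL query.
    clean_pathname, _, _query = pathname.partition("?")
    return "|".join((ark, version, clean_pathname))


old_key = (
    "ark:/13030/m5r89gbn|1|producer/nuxeo.cdlib.org/Nuxeo/nxfile/default/"
    "cede1001-e4f1-4223-9a34-243da2296bcd/file:content/"
    "highlander_19731004_027.tif?changeToken=6-0"
)
new_key = strip_query_from_key(old_key)
print(new_key)
print(new_key != old_key)  # True: the two are distinct S3 keys
```

Because the two strings differ, content written after the change lands under a new key, and the object stored under the old key is left untouched unless it is cleaned up separately.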

@elopatin-uc3
Contributor Author

@elopatin-uc3 should set up a separate meeting to discuss. Consider inviting AT for Nuxeo details.

@elopatin-uc3
Contributor Author

elopatin-uc3 commented Jan 16, 2024

Discussed on 1/12. A subsequent meeting is to be scheduled to talk about possible solutions. Initial work should include fixing the harvester so that it disregards URL parameters and excludes them from file names; we should also consider using the Update endpoint instead of Add.

@mreyescdl
Contributor

Work is being scheduled to eliminate the changeToken data currently in S3.
In tandem, we will need to stop creating new data with query parameters.

We will modify the Nuxeo client to strip any query parameters from the filename.
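
For the cleanup of existing data, one possible first step is to list which stored keys carry a query string and which cleaned key each would map to. The Python sketch below only works on a list of key strings and makes no S3 calls; plan_key_cleanup is a hypothetical name, and the second key in the example is invented to show a possible collision.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def plan_key_cleanup(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cleaned key to the existing query-bearing keys behind it.

    Keys without a "?" are already clean and are left out of the plan.
    """
    plan: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        clean, sep, _query = key.partition("?")
        if sep:  # a "?" was found, so this key needs attention
            plan[clean].append(key)
    return dict(plan)


keys = [
    "ark:/13030/m5r89gbn|1|producer/highlander_19731004_027.tif?changeToken=6-0",
    # Hypothetical second token for the same file, to illustrate a collision.
    "ark:/13030/m5r89gbn|1|producer/highlander_19731004_027.tif?changeToken=7-0",
    "ark:/13030/m5r89gbn|1|producer/already_clean.tif",
]
plan = plan_key_cleanup(keys)
collisions = {clean: olds for clean, olds in plan.items() if len(olds) > 1}
print(len(plan), len(collisions))
```

Any cleaned key with more than one source key would need a decision about which content to keep, which is the kind of case worth reviewing before anything is renamed or deleted.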

mreyescdl added a commit to CDLUC3/mrt-atom that referenced this issue Mar 15, 2024
@mreyescdl
Contributor

Nuxeo client change to eliminate changeToken
CDLUC3/mrt-atom#8
