Identify and remediate published items that are missing public cocina in the purl-fetcher db #5181

Open
andrewjbtw opened this issue Oct 1, 2024 · 8 comments
@andrewjbtw

This comes out of the discussion in this Slack thread about an item that was failing publish because of missing public Cocina JSON in the purl-fetcher DB. There are probably more items with a similar problem.

More specifically, items that meet these criteria are likely to fail publish until remediated:

  1. Item is closed (Accessioned or in accessioning)
  2. Item has been previously published, but prior to the versioning work
  3. Item has rights that are not dark (i.e. there should be a purl)
  4. Item does not have public cocina in the purl-fetcher DB (but may have public cocina as a static file on purl)
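
As a rough illustration, the four criteria above could be expressed as a single predicate. This is only a sketch: the `Item` struct, its field names, and the druids are hypothetical stand-ins for illustration, not the actual SDR data model or real identifiers.

```ruby
# Hypothetical record; the field names are illustrative, not the SDR schema.
Item = Struct.new(
  :druid, :closed, :published_before_versioning, :dark, :public_cocina_in_db,
  keyword_init: true
)

# True when an item meets all four criteria and is therefore likely to fail publish.
def needs_remediation?(item)
  item.closed &&
    item.published_before_versioning &&
    !item.dark &&
    !item.public_cocina_in_db
end

# Made-up sample data.
items = [
  Item.new(druid: 'druid:aa111aa1111', closed: true, published_before_versioning: true,
           dark: false, public_cocina_in_db: false),
  # Still "Opened", so excluded from this remediation.
  Item.new(druid: 'druid:bb222bb2222', closed: false, published_before_versioning: true,
           dark: false, public_cocina_in_db: false),
  # Dark, so there should be no purl at all.
  Item.new(druid: 'druid:cc333cc3333', closed: true, published_before_versioning: true,
           dark: true, public_cocina_in_db: false)
]

candidates = items.select { |item| needs_remediation?(item) }.map(&:druid)
```

Only the first sample item is selected, since the other two fail the "closed" and "not dark" criteria respectively.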

The publish problem seems to be that the publish step tries to diff the cocina for the currently accessioning version with the cocina for the previously-published version. But if there's no cocina in the purl-fetcher DB, the diff can't proceed.

We think it's likely that items that lack public cocina in purl-fetcher ran into a migration problem when we were populating the purl-fetcher db during the versioning work.

Additional background

There are also "Opened" items that don't have public cocina in the purl-fetcher DB even though they have purls. We should treat those differently and not include them in this remediation. Those items will fail a standalone publish (triggered from Argo) but if they are closed, they should successfully be accessioned.

We had to treat "Opened" items differently when migrating to the new versioning model for reasons that I'm finding difficult to summarize concisely in this issue. I may file a separate issue about them but need to gather more information. Ideally, we can solve the issue by closing them but they may contain unfinished in-progress changes.

@lwrubel
Contributor

lwrubel commented Oct 2, 2024

I haven't yet figured out how to determine whether an item has been previously published, since I'm running into some problems querying the workflow service. I'm also still working out the best way to tell whether that previous publishing happened prior to the versioning work.

But in the meantime, here is a report on the druids that don't have public json in the purl-fetcher DB. It includes:

  • closed?
  • publishable? Meaning it has a last closed version and there's cocina for that last closed version, i.e. it has been closed at least once since we moved to the new version model.
  • not dark?
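
For reference, a minimal sketch of how rows with these three flags might be assembled into a CSV. The column names, the `report_row` inputs, and the sample druid are assumptions made for illustration; they are not taken from the attached report or from the code that produced it.

```ruby
require 'csv'

# Assumed column names for illustration only.
HEADERS = %w[druid closed publishable not_dark].freeze

# Builds one report row. `last_closed_version_cocina` is nil when there is no
# cocina for the last closed version (hypothetical input shape).
def report_row(druid:, closed:, last_closed_version_cocina:, dark:)
  {
    'druid' => druid,
    'closed' => closed,
    'publishable' => !last_closed_version_cocina.nil?,
    'not_dark' => !dark
  }
end

# Serializes rows to CSV text with a header line.
def build_report(rows)
  CSV.generate do |csv|
    csv << HEADERS
    rows.each { |row| csv << row.values_at(*HEADERS) }
  end
end

rows = [
  report_row(druid: 'druid:aa111aa1111', closed: true,
             last_closed_version_cocina: nil, dark: false)
]
report = build_report(rows)
```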

publish_status_report.csv

@lwrubel lwrubel self-assigned this Oct 2, 2024
@lwrubel
Contributor

lwrubel commented Oct 2, 2024

I spot-checked some druids and every one I checked had the problem with releaseTags being in the cocina on purl-fetcher. Correction: I thought that data should have been removed when app/services/migrators/remove_release_tags.rb was run, but that only updated the ActiveRecord in DSA, not the cocina on purl-fetcher.
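
A spot check like this could be scripted with a small recursive search for the `releaseTags` key in a parsed cocina document. This is a sketch under assumptions: the sample JSON below is invented and deliberately tiny, and the helper makes no claim about where `releaseTags` sits in real cocina, which is why it searches the whole structure.

```ruby
require 'json'

# Recursively reports whether a nested Hash/Array structure contains the given key.
def contains_key?(node, key)
  case node
  when Hash
    node.key?(key) || node.values.any? { |value| contains_key?(value, key) }
  when Array
    node.any? { |value| contains_key?(value, key) }
  else
    false
  end
end

# Invented sample documents, not real cocina.
stale = JSON.parse('{"externalIdentifier":"druid:aa111aa1111",' \
                   '"administrative":{"releaseTags":[{"release":true}]}}')
clean = JSON.parse('{"externalIdentifier":"druid:bb222bb2222","administrative":{}}')

stale_has_tags = contains_key?(stale, 'releaseTags')
clean_has_tags = contains_key?(clean, 'releaseTags')
```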

@lwrubel
Contributor

lwrubel commented Oct 3, 2024

@andrewjbtw I've updated the report to include each druid's version 1 "published" milestone date. It looks like all of these were published at least once before migration (the latest "first published" date is 2023-07-06).

publish_status_report.csv

https://github.com/sul-dlss/dor-services-app/pull/5182/files

@lwrubel
Contributor

lwrubel commented Oct 4, 2024

To actually remediate these (or a subset, such as those that are closed) we need to come up with a migration strategy for migrating public cocina in the future. Since we unmounted /purl and /stacks from DSA, we need to go through purl-fetcher. It will need an endpoint that will rewrite the cocina.json and create a new PublicJson record. It's probably straightforward-ish for those druids not using the versioned layout, but if we need to handle versions of cocina, that's an issue to figure out.

@lwrubel
Contributor

lwrubel commented Oct 11, 2024

Rather than migrating, we will allow these to proceed without erroring via sul-dlss/purl-fetcher#932, pending testing on stage.
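
The general idea of tolerating a missing previous record, instead of failing the diff, can be sketched as follows. What the linked purl-fetcher change actually does is not shown in this thread, so `changed_keys`, its `strict:` flag, and `MissingPreviousCocina` are all hypothetical names used only to illustrate the two behaviors.

```ruby
# Hypothetical error raised when there is nothing to diff against.
class MissingPreviousCocina < StandardError; end

# Returns the sorted top-level keys whose values differ between two cocina hashes.
# With strict: true, a missing previous record raises; with strict: false, the
# missing record is treated as empty, so every current key counts as changed.
def changed_keys(previous, current, strict: true)
  if previous.nil?
    raise MissingPreviousCocina, 'no previous public cocina to diff against' if strict

    return current.keys.sort
  end

  (previous.keys | current.keys).reject { |key| previous[key] == current[key] }.sort
end

# Invented sample data.
current = { 'label' => 'New title', 'access' => { 'view' => 'world' } }
lenient_result = changed_keys(nil, current, strict: false)
```

In the lenient mode the publish can proceed and a fresh record can be written, at the cost of not knowing what actually changed relative to the earlier published version.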

@lwrubel
Contributor

lwrubel commented Oct 15, 2024

@andrewjbtw these items can now be republished without raising an error. Republishing will create a public JSON record and update the cocina.json on purl.

@andrewjbtw
Author

@lwrubel sorry for the delay in getting back to this, but I finally went through the druids in the report and something seems off. There are 11,686 druids in the list and all but 9 of them are Google scans, and those 9 were all in an Opened state until I closed them today (which led immediately to publish errors).

I'm going to republish the items anyway, since that could turn up issues. But I don't recall Google items being among the items that have been problems. Maybe I don't understand what creates a "problem" druid. I'm likely to end up republishing literally every accessioned item in the next few months, as that appears to be the only way to approach certainty in SDR.

@lwrubel
Contributor

lwrubel commented Oct 18, 2024

I agree it's odd that these are all Google Books, which have not typically had problems. I suspect these got into the state they're in on purl-fetcher not through any previous publishing problem but from a cocina.json migration problem that was undetected. But I'm not completely sure.

Projects
Status: In Progress (Not Ordered)

2 participants