S3 location already has content. Perhaps a failed replication was pruned from the database, but was not removed from the cloud. If so, prune this content from the database again...

Johnathan Martin edited this page Oct 20, 2023 · 1 revision

Full alert message:

WARNING: S3 location already has content. Perhaps a failed replication was pruned from the database, but was not removed from the cloud. If so, prune this content from the database again (there's a rake task), ask ops to delete the bad replicated content, and try again.

This has happened in the past in the following situations:

  1. QA or stage was reset, some important druids were re-used (e.g. Hydrus Ur-APO), and such seed content was re-accessioned before the cloud archives for the reset environment got purged. Hopefully this is the more common cause of this error going forward.
  2. Replication jobs were inadvertently retried (because of Sidekiq weirdness or, before the switch, Resque weirdness). In the distant past, duplicate replications were also attempted for a druid version, because duplicate rows for a given endpoint could be created before a particular database constraint was tightened. ¯\_(ツ)_/¯

The procedure below fleshes out the suggestion given in the Honeybadger alert. It was used successfully for cleanup after QA and stage were recently both reset and cloud archive cleanup lagged the rest of the reset by a week or two.
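
For the JSON wrangling step, here is a minimal Ruby sketch as an alternative to a text editor. It assumes each line of the exported file is a JSON object with a `context` hash that carries `druid` and `version` keys. That layout is a guess, not something confirmed by Honeybadger or by this alert, so inspect a real line from your export and adjust the key names. `druid_versions_from_jsonl` is a hypothetical helper name.

```ruby
require 'json'

# Hypothetical helper: turns exported JSONL lines into the druid/version
# hashes used by the console snippets on this page.
# ASSUMPTION: each line looks like {"context": {"druid": "...", "version": ...}}.
def druid_versions_from_jsonl(lines)
  lines.filter_map do |line|
    next if line.strip.empty? # skip blank lines

    context = JSON.parse(line)['context'] || {}
    druid = context['druid']
    version = context['version']
    next unless druid && version # skip errors without the assumed keys

    # normalize to a bare druid and an integer version
    { druid: druid.delete_prefix('druid:'), version: Integer(version) }
  end.uniq # the same druid version may have raised the alert more than once
end
```

To read the emailed file directly, something like `Zlib::GzipReader.open('export.jsonl.gz') { |gz| druid_versions_from_jsonl(gz.read.lines) }` should work after `require 'zlib'`; the file name here is a placeholder.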

# Export the errors from the Honeybadger alert. You'll be emailed a .jsonl.gz file.
# Wrangle that JSON into a list of druid/version hashes using your favorite text editor or combo of *nix CLI tools.
druid_version_list = [
  { druid: 'bb866bz4708', version: 24 },
  { druid: 'bc060pc0851', version: 21 },
  # ...
  { druid: 'zz365wj9396', version: 1 },
  { druid: 'zz550wm7530', version: 1 }
]

druid_version_list.map do |dv|
  next unless PreservedObject.exists?(druid: dv[:druid])
  # verify_expiration is false because the DB rows are likely younger than the expiry period,
  # unless you're doing this cleanup way after the fact
  Replication::FailureRemediator.prune_replication_failures(druid: dv[:druid], version: dv[:version], verify_expiration: false)
end
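
Because the loop above silently skips any druid without a `PreservedObject`, it can help to see up front which entries will be skipped. The following is a small sketch, not part of the original procedure. `split_by_presence` is a hypothetical helper, and the existence check is passed in as a block so that the helper itself does not depend on Rails.

```ruby
# Hypothetical helper: splits the list into entries whose druid passes the
# given check and entries whose druid does not.
def split_by_presence(druid_version_list)
  druid_version_list.partition { |dv| yield(dv[:druid]) }
end
```

In Rails console this could be used as `present, missing = split_by_presence(druid_version_list) { |druid| PreservedObject.exists?(druid: druid) }`, after which `missing` lists the entries that the pruning loop will pass over.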

If the results of that look good, you can make sure the current objects get replicated by doing:

druid_version_list.map do |dv|
  next unless PreservedObject.exists?(druid: dv[:druid])
  PreservedObject.find_by!(druid: dv[:druid]).create_zipped_moab_versions!
end

If you watch Sidekiq and see that whatever replication was kicked off by the above has finished, you can audit for good measure:

druid_version_list.map do |dv|
  next unless PreservedObject.exists?(druid: dv[:druid])
  # runs the audit synchronously instead of queuing jobs, so may take a minute, but you'll get results back in terminal once it finishes
  Audit::ReplicationSupport.zip_part_debug_info(dv[:druid])
end

If you also need to delete actual cloud content, see the README's reset instructions for a method (and some helper methods it needs) that you can define in Rails console to purge cloud content for a QA or stage druid. If you need to purge cloud content in prod, you'll need to get ops to do it, as prod has more restricted access for the pres cat credentials (allows pres cat to write under a given key initially, but not to update or delete).
