Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip errors during harvest and report to Sentry #551

Merged
merged 5 commits into from
Nov 20, 2023

Conversation

ghukill
Copy link
Contributor

@ghukill ghukill commented Nov 16, 2023

What does this PR do?

Allows graceful skip of errors during method=get / GetRecords style harvest.

If errors are encountered, and the count is below a new config value MAX_ALLOWED_ERRORS, a message is sent to Sentry that looks roughly like:

Screenshot 2023-11-15 at 4 37 15 PM

However, if more errors are encountered than defined by MAX_ALLOWED_ERRORS, the harvested is stopped and a message like the following will be sent to Sentry:

Screenshot 2023-11-15 at 4 42 21 PM

In both scenarios, an additional data point failed_records will be present in the Sentry message that shows which records failed, e.g.:

Screenshot 2023-11-15 at 4 37 40 PM

Helpful background context

As noted in both TIMX-257 and TIMX-258 (linked below), until now an external "skip list" has been used to skip records that were consistently throwing errors. This was functional, but has some overhead for maintaining that list; more detail can be found in the tickets.

How can a reviewer manually see the effects of these changes?

The most straightforward way is to perform a local harvest that targets some known (at time of writing) problematic records:

from harvester.oai import OAIClient

oai_client = OAIClient(
    "https://dspace.mit.edu/oai/request",
    metadata_format="oai_dc",
    retry_status_codes=tuple() # sets to empty tuple to avoid numerous retries
)
identifiers = [
    "oai:dspace.mit.edu:1721.1/152958",
    "oai:dspace.mit.edu:1721.1/152786", # BAD
    "oai:dspace.mit.edu:1721.1/147573", # BAD
    "oai:dspace.mit.edu:1721.1/152939",
]
records = list(oai_client.get_records(identifiers))

This should complete, but records should only be length 2 for the two that completed successfully.

To confirm that messages are sent to Sentry, you can do the following:

  • set .env
WORKSPACE=dev
SENTRY_DSN=<ADD HERE>

And then prefix the following to python code above:

from harvester.config import configure_sentry
configure_sentry()

# rest of code...

After re-running, there should be a new message here in Sentry.

Includes new or updated dependencies?

YES

NOTE: both vcrpy and urllib3 was pinned until work is completed for this issue: #550

What are the relevant tickets?

Developer checklist

  • All new ENV is documented in README
  • Stakeholder approval has been confirmed (or is not needed)

Code reviewer checklist

  • The commit message is clear and follows our guidelines (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
vcrpy 5.x has breaking changes with our test suite and pipenv setup. This will
be addressed during maintenance and not current work.

How this addresses that need:
* pins vcrpy and urllib3

Side effects of this change:
* pinned versions until updates and maintenance

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-258
Why these changes are being introduced:
During a harvest that uses the GetRecords approach, if a particular record throws
and error the harvest fails.  The former workaround was an external skip list, but this
had overhead to maintain.

By gracefully skipping records and reporting them to Sentry, the harvest can continue
even for known problematic records.

How this addresses that need:
* the GetRecords harvest skips errors and sends logs to Sentry
* a new MAX_ALLOWED_ERRORS config value is added to set the upper limit of failed records allowed

Side effects of this change:
* Harvests will complete WITHOUT an external skip list even with problematic records
* DataEng is responsible for monitoring Sentry and triaging skipped records
* We can stop maintaining an external skip list as an SSM parameter

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-258
* https://mitlibraries.atlassian.net/browse/TIMX-257
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works as expected, one suggestion!

harvester/utils.py Show resolved Hide resolved
Copy link

github-actions bot commented Nov 17, 2023

Pull Request Test Coverage Report for Build 6931383261

  • 32 of 32 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 4670459934: 0.0%
Covered Lines: 179
Relevant Lines: 179

💛 - Coveralls

harvester/utils.py Outdated Show resolved Hide resolved
@ghukill ghukill merged commit 0b5c2ab into main Nov 20, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants