Handling of DateTimeParseException in WARCSpout #1140
Merged
Hello all,
while loading and parsing external WARC files with the WARCSpout, StormCrawler crashed repeatedly and the topology on the Storm cluster was restarted each time. The cause was an unhandled DateTimeParseException in the WARCSpout, which is thrown when the WARC-Date of a WARC record is invalid (e.g. some random number instead of a proper year). DateTimeParseException is a RuntimeException, so I assume this is why StormCrawler shuts down as soon as the exception is thrown and not caught.
I had not encountered it before, so the error seems to be rather rare. However, this WARC file from Common Crawl's 2016 robots.txt dumps does contain an unparsable WARC-Date.
With the parsing of WARC-Date wrapped in a try-catch block, a record with an invalid date is simply skipped and the crawler continues with the next record without a restart.
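The idea can be illustrated with a minimal, standalone sketch. It is not the actual WARCSpout code: the class `WarcDateSketch` and the helpers `parseWarcDate` and `parseAll` are hypothetical names, and it assumes the date is parsed with `java.time.Instant.parse`, which may differ from what the spout really calls.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not taken from WARCSpout.
class WarcDateSketch {

    // Returns the parsed date, or an empty Optional if the value is not a
    // valid ISO-8601 instant. The unchecked DateTimeParseException is caught
    // here so that it cannot propagate to the caller.
    static Optional<Instant> parseWarcDate(String rawDate) {
        try {
            return Optional.of(Instant.parse(rawDate));
        } catch (DateTimeParseException e) {
            // Assumption: a real spout would log the bad value at this point.
            return Optional.empty();
        }
    }

    // Processes a batch of raw WARC-Date values, skipping the invalid ones
    // instead of aborting on the first failure.
    static List<Instant> parseAll(List<String> rawDates) {
        List<Instant> parsed = new ArrayList<>();
        for (String raw : rawDates) {
            Optional<Instant> date = parseWarcDate(raw);
            if (!date.isPresent()) {
                continue; // skip this record, move on to the next one
            }
            parsed.add(date.get());
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList(
                "2016-02-08T12:00:00Z", // valid
                "not-a-date",           // invalid, skipped
                "2016-02-09T08:30:00Z"  // valid
        );
        System.out.println("Parsed " + parseAll(dates).size() + " of " + dates.size() + " dates");
    }
}
```

Returning an `Optional` keeps the skip decision in the calling loop, which roughly mirrors the "skip the record and continue" behaviour described above. A missing (null) date is a separate case that this sketch does not handle.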
Example of an invalid WARC-Date: