Global deduplication for specific URLs #443
Comments
An alternative solution would be to dedupe based on data type or size, but that would require a new download every time and might slow down some crawls massively. If we go down this road, we should write revisit records for those. wpull already has support for that; it would just have to be activated, and the remote calls would have to be implemented through a custom URLTable.
Alternative for a proper global dedupe (which would likely require changes in wpull because the URLTable methods aren't async): special ignores that send the URL to a logger. Then we regularly dedupe what the logger receives and run those URLs separately in
This igset is only intended as a temporary workaround until ArchiveTeam#443 is implemented properly. Does not include the JW Player customer videos as those are not as frequent as the FastCo ones.
While global deduplication for everything in ArchiveBot is not feasible, we should consider adding something for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely, but are regrabbed needlessly and repeatedly. Two examples come to mind:
^https?://mp3\.cbc\.ca/ and ^https?://podcast-a\.akamaihd\.net/mp3/ (pending further investigation whether the latter also has non-CBC content)
^https?://content\.jwplatform\.com/videos/
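As a rough illustration of how a pipeline might test URLs against this list, here is a minimal Python sketch. The three patterns are the ones given above; the names `DEDUPE_PATTERNS` and `is_dedupe_candidate` are hypothetical and not part of ArchiveBot or wpull.

```python
import re

# Patterns copied from the issue text. The tuple name is hypothetical.
DEDUPE_PATTERNS = tuple(
    re.compile(pattern)
    for pattern in (
        r"^https?://mp3\.cbc\.ca/",
        r"^https?://podcast-a\.akamaihd\.net/mp3/",
        r"^https?://content\.jwplatform\.com/videos/",
    )
)


def is_dedupe_candidate(url: str) -> bool:
    """Return True if the URL matches any pattern subject to global dedupe."""
    return any(pattern.search(url) for pattern in DEDUPE_PATTERNS)
```

URLs that do not match any pattern would bypass the dedupe check entirely, so ordinary crawl traffic would never need a lookup.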
Currently, these ignores are typically added manually whenever someone notices such URLs in a job. I know we've grabbed some of those URLs thousands of times, but others were never covered before. Because the contents on these hosts don't change with time, ignoring them if they've ever been grabbed before by some AB job should be fine. However, job starting URLs should not be checked against the dedupe list so that they can be saved again if needed; specifically, this means that URL table entries with level = 0 would always be retrieved.

An implementation would probably keep the dedupe DB and the list of URL patterns to be checked against it on the control node. The pattern list is pushed to the pipelines (and updated if it changes), then the pipeline queries the DB on encountering a matching URL. TBD is whether the pipeline should be able to directly add entries to the DB or whether they should come from the CDXs in the AB collection. The CDX route is more trustworthy (and also covers the unfortunate case when archives are lost between retrieval and IA upload) but adds a delay which can still lead to repeated retrieval. Alternatively, pipelines could add a temporary entry which gets dropped after a few days if it isn't confirmed by the CDXs.
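The decision logic described above could look roughly like the following sketch. Everything in it is an assumption for illustration: the in-memory `DedupeDB` class stands in for whatever store the control node would actually use, the three-day TTL is one reading of "a few days", and `should_fetch` is a hypothetical helper rather than an existing wpull or ArchiveBot hook.

```python
import re
import time


class DedupeDB:
    """Hypothetical in-memory stand-in for the control node's dedupe DB."""

    def __init__(self, ttl_seconds=3 * 24 * 3600):
        self.ttl = ttl_seconds   # assumed lifetime of unconfirmed entries
        self.confirmed = set()   # URLs confirmed by the CDXs
        self.pending = {}        # URL -> time a pipeline reported it

    def add_pending(self, url, now=None):
        """Record a temporary entry reported directly by a pipeline."""
        self.pending[url] = time.time() if now is None else now

    def confirm(self, url):
        """Promote a URL once it shows up in the CDXs."""
        self.pending.pop(url, None)
        self.confirmed.add(url)

    def seen(self, url, now=None):
        """Return True if the URL is confirmed or has an unexpired entry."""
        now = time.time() if now is None else now
        if url in self.confirmed:
            return True
        added = self.pending.get(url)
        if added is None:
            return False
        if now - added > self.ttl:
            del self.pending[url]  # never confirmed, so drop it
            return False
        return True


def should_fetch(url, level, patterns, db, now=None):
    """Decide whether a pipeline should retrieve the URL."""
    if level == 0:
        return True  # job starting URLs are always retrieved
    if not any(p.search(url) for p in patterns):
        return True  # not covered by the dedupe patterns
    return not db.seen(url, now)
```

A pipeline would call `add_pending` after a successful retrieval, and a separate job reading the CDXs would call `confirm`. An entry that is never confirmed expires, so the URL becomes eligible for retrieval again.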