Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update kingfisher process URLs and use metadata endpoint #343

Merged
merged 32 commits into from
Apr 25, 2024

Conversation

yolile
Copy link
Member

@yolile yolile commented Apr 22, 2024

ref #291
closes #345

@jpmckinney
Copy link
Member

Ah, because of my refactoring we'll also have to do #345 as part of this.

@jpmckinney
Copy link
Member

I have to switch tasks for a bit, so let me know if you want to do that @yolile.

@yolile
Copy link
Member Author

yolile commented Apr 23, 2024

I have to switch tasks for a bit, so let me know if you want to do that @yolile.

Sure, If you have pushed all your changes I can take it from here.

- Collect: Don't try to request a null URL if scrapy_log is not set
- Collect: Don't set process_id if not found in crawl log (task fails later, in any case)
- Process: Task should not start if process_id is not set
- Pelican: pelican_dataset_name should not be set if spider or process_data_version are not set
- Pelican: If "id" is in the response, it is guaranteed to not be None

Cleanup:

- Collect: Clarify variables names
- Visually group as: request preparation, request, request parsing, response usage
- Don't use "json" as a variable name, as it can shadow the json library (if used)
- Rewrite code to avoid potential refactoring errors, like:

    x = mydict.get("x")
    if not x:
        # Oops! Should have set `x = calculate()` and then `mydict["x"] = x`.
        mydict["x"] = calculate()
    return x
@jpmckinney jpmckinney force-pushed the update-kingfisher-process-metadata branch from 2e18bae to 2f3b057 Compare April 23, 2024 18:10
@jpmckinney
Copy link
Member

All pushed now

Also check for procces_id_pelican and process_data_version before wiping, as they are not set
if Kingfisher Collect failed
@yolile yolile requested a review from jpmckinney April 23, 2024 19:22
…n in dependency injections style, without any actual injection). Extract scrapyd_url method.
…run()

- Save the pelican_dataset_name to the job context only if the API request succeeded
- pelican_dataset_name is guaranteed to be set by the time get_status() runs
- Check whether run() was called before attempting to wipe()
- Merge get_pelican_dataset_name() into __init__(), for readability
… docs: Update TaskManager class. Add comments about where job context fields are set.
… task manager with a task. Clarify variable names.
This also sets note="" when a job completes successfully. This handles another case like #168 (i.e. the task recovers in a completed status, not a waiting/running status).
@jpmckinney jpmckinney force-pushed the update-kingfisher-process-metadata branch from f5f7241 to 96baf3d Compare April 24, 2024 22:30
Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, should be easier to make changes to this next time

@yolile
Copy link
Member Author

yolile commented Apr 25, 2024

What a refactoring! The code looks more consistent and easier to maintain indeed!

@yolile yolile merged commit a601fae into main Apr 25, 2024
14 checks passed
@yolile yolile deleted the update-kingfisher-process-metadata branch April 25, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move update_collection_availability() to Pelican task
2 participants