Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In 1105 finalize command #80

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

In 1105 finalize command #80

wants to merge 7 commits into from

Conversation

ehanson8
Copy link
Contributor

Purpose and background context

Refactor clients, config, and CLI functionality in addition adding a finalize CLI command. The changes have been logically grouped into several commits for easier review.

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

* Rename recipient_email_address > recipient_email_addresses and update type hint to list[str] to reflect expected usage
* Update unit tests to account for these changes
* Remove __getattr__ method
* Add workspace, sentry_dsn, aws_region_name, dss_input_queue, and dsc_source_email properties and update calls across repo
* Remove duplicate addHandler call in configure_logger method
Why these changes are being introduced:
* A CLI command is need to process the submissions results that are sent to a DSS output queue

How this addresses that need:
* Rename stream > log_stream
* Add finalize CLI command and corresponding CLI test
* Add process_results and send_logs methods to Workflow class and corresponding unit tests
* Add Config call in Workflow module
* Remove SimpleCSV.process_deposit_results method as the class will default to Workflow.process_results
* Reorder fixtures

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1105
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All config changes looking great to me, thanks for pushing on that and leading a discussion.

I did leave a couple of comments about the CLI / Workflow relationship, and the approach of capturing logs as the primary output. I can remain open to it, but would be remiss without a bit more discussion either in this PR or a call about the approach.

dsc/config.py Show resolved Hide resolved
dsc/config.py Show resolved Hide resolved
dsc/utilities/aws/ses.py Show resolved Hide resolved
def process_deposit_results(self) -> list[str]:
"""Process results generated by the deposit according to the workflow subclass.
def process_results(self) -> dict[str, Any]:
"""Process results in the workflow's output queue.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expand on this docstring a bit? Maybe an opportunity to update the run() method as well. I think it could be helpful if the docstring touches on two things:

  1. what this method acheives for the Workflow, very high level, for humans
  2. a tiny bit of how it accomplishes this technically (e.g. mentioning SQS queues)

For this PR, I'm looking at the base Workflow class from kind of a 30ft view. My understanding is that:

  • CLI command deposit --> Workflow method run()
  • CLI command finalize --> Workflow method process_results()

I'm not strictly opposed to this naming mismatch, but I think for someone who is not actively in the codebase (e.g. myself) I wonder why this naming disconnect. I had interpreted run() to be the primary method for the Workflow in previous PRs, but I think this seemingly equally important method process_results() pushes on that a bit.

I'm not sure renaming is required (but open to it), but at the very least I think the docstring could hold your hand a little more when you land here from the CLI commands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this could use more detail, I'll add that. @jonavellecuerdo and I were also discussing possibly updating run > submit_items and the deposit CLI command to submit to be more accurate. deposit was another holdover from wiley-deposits that I've been moving away from but hadn't updated the CLI command yet

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to note that at some point @ehanson8 mentioned maybe changing the CLI command to submit, but there was some hesitancy in that DSS also uses submit -> upload submissions to DSpace@MIT.

From my recollection, these CLI commands were taken from Wiley deposits. I'm in support of alignment in names, if anything:

  • CLI command submit -> Workflow.submit
  • CLI command process-results -> Workflow.process_results

Also agree comments re: Expanding info docstrings!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo Jinx 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: I just finished reading @ghukill 's comment on how the CLI command finalize calls two methods from Workflow and now am not sure about the second bullet point suggesting renaming finalize -> process_results. 🤔

Comment on lines +137 to +138
workflow.process_results()
workflow.send_logs(ctx.obj["log_stream"])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to locate this comment here, but it certainly has tendrils beyond these two lines. It may have connection to another comment about the CLI / Workflow class naming disconnect as well.

I suppose I'm still feeling a bit uneasy about setting up a StringIO logger, capturing virtually all logs while the Workflow works, and then shipping them off via email as the primary output of this app.

A few concerns come to mind:

  • it feels like it would be easy for someone not super familiar with this approach, to add a logging statement to code, that would start showing up in emails, given that logging is conventionally not directly shipped by an application
  • the order of logging matters: if we ever wanted to reorganize or cleanup the email sent out, we're kind of at the mercy of when and how things are logged

I suppose this feels like it has connection to the CLI/Workflow method naming, as why wouldn't we have a Workflow method like finalize(), that could call methods like process_results() and send_logs() (or maybe that gets reworked into something like report()).

From my POV, it would arguably be easier to capture messages -- e.g. on the Workflow instances -- as things happen, and then prepare a final "report" that gets emailed out. If this report generation were encapsulated in a method, you then have the ability to very easily modify it over time as requirements (invariably) change.

Happy to discuss more. I almost proposed it could wait, maybe be a ticket for returning to later, but it feels kind of fundamental to this last major action the Workflow takes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear your concerns and it needs to be made clear in the eventual README.md how the logs are used, but let's huddle later today to discuss this in more detail

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review counts as an initial pass! I have one minor change request in addition to the 1-2 potential change requests proposed by @ghukill (updating Config to handle env vars without defaults and aligning names for CLI commands + Workflow methods).

I will also wait until we have a discussion re: how to capture emailed logs!

@@ -52,6 +52,7 @@ SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring
WORKSPACE=### Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
AWS_REGION_NAME=### Default AWS region.
DSS_INPUT_QUEUE=### The DSS SQS input queue to which submission messages are sent.
DSC_SOURCE_EMAIL=### The email address from which SES emails are sent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The email address(es) to which SES sends emails."

Copy link
Contributor Author

@ehanson8 ehanson8 Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually refers to what is in From: which needs to be specified for SES, not To: which is populated by --email_recipients

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants