-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added parameters and exception behaviour to pqdm #792
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this PR @Sherwin-14!
There are still several places where you have to expose the new fail_fast
parameter. So far, you have added it only to "protected" methods, so it is not yet exposed to the user in the "public" function and methods. You need to keep following the call chains up to the top (public) API so that users can use the parameter, if they want to choose False
instead of the default True
.
For example, you added fail_fast
to Store._open_files
, but you have not added it to all places that call Store._open_files
. You did add it to Store._open_urls
, which calls _open_files
, but you did not yet add it to Store._open_granules
, which also calls _open_files
.
Further, both _open_files
and _open_granules
are dispatched when Store._open
is called, so you must find all places where Store._open
is called. In this case, it is called from Store.open
, which is a public method, so you must add fail_fast
to Store.open
so the user can override the default, if they wish.
Further still, Store.open
is called by the open
top-level function in the api
module, so you must add fail_fast
there as well.
You must follow all such call chains from all points where pqdm
is called, all the way up to the top of the call chains (up to the public functions and methods), adding fail_fast
to each function and method along all such call chains.
@mfisher87, I see lots of Ruff error messages regarding a number of files that @Sherwin-14 has not touched in this PR. Is there something already on |
Not yet, but that should be an easy fix. I'll take care of it now :) #788 |
Excellent! Thank you for the quick fix. @Sherwin-14, Matt's fix is now merged into |
@Sherwin-14, do you have enough information to proceed with the changes I requested? |
Yeah I have the info I needed, I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap. |
Would it be possible and in scope of this PR to use # code in api.py and something similar in store.py
# in this example we leave threads out to ensure backwards compatibility and will use it in the store class if present
def open(granules, threads, provider, pqdm_opts):
default_pqdm_opts = {
n_jobs=threads,
exception_behaviour='immediate'
colour="green",
disable=False # True means no output including the progress bar.
}
if not pqdm_opts;
pqdm_opts = default_pqdm_opts
earthaccess.__store__.open(granules, treads, pqdm_opts) and actually we should do the same with |
@Sherwin-14 No rush! We really appreciate any time you can spare, and don't want to ask for more than that :) |
@betolink I really like that idea. Basically "Allow customizing all pqdm behaviors" (and separately, "Allow customizing all fsspec behaviors")? I'm not sure how I feel about including that as part of this PR. I lean towards separate out of respect for @Sherwin-14 's time, but I don't want to speak for you, Sherwin! |
We decided during hack day today that we want to allow users to customize pqdm behaviors like so: def open(something, threads=16, *, pqdm_kwargs={...}, ...):
pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
... We considered the |
How should we now proceed with the changes we discussed during the hackday? |
Do the same thing you did in the PR you already started for this, but instead of pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
}
...
pqdm(<other args>, **pqdm_kwargs) |
Great point! Much clearer with the double splat too ;) |
@Sherwin-14, do you need some help with this? |
I will try to accomodate the change asap, I was caught up in something else up untill now |
No rush. I just wanted to check in to see if you need any help or if something else was requiring you to postpone your continuation. |
def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: dict = {}
) -> List[str]:
"""Retrieves data granules from a remote storage system.
* If we run this in the cloud, we will be using S3 to move data to `local_path`.
* If we run it outside AWS (us-west-2 region) and the dataset is cloud hosted,
we'll use HTTP links.
Parameters:
granules: a granule, list of granules, a granule link (HTTP), or a list of granule links (HTTP)
local_path: local directory to store the remote data granules
provider: if we download a list of URLs, we need to specify the provider.
threads: parallel number of threads to use to download the files, adjust as necessary, default = 8
Returns:
List of downloaded files
Raises:
Exception: A file download failed.
"""
provider = _normalize_location(provider)
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
}
if isinstance(granules, DataGranule):
granules = [granules]
elif isinstance(granules, str):
granules = [granules]
try:
results = earthaccess.__store__.get(
granules, local_path, provider, threads, pqdm_kwargs
)
except AttributeError as err:
logger.error(
f"{err}: You must call earthaccess.login() before you can download data"
)
return []
return results Just for clarification, the public API function download would look something like this. The |
Correct, but you almost never want to use mutable values for default values, and this is one of those cases (among other references, see https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments). So, do NOT do this: def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: dict = {}
) -> List[str]:
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
} Instead, DO THIS: def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: Optional[Mapping[str, Any]] = None,
) -> List[str]:
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**(pqdm_kwargs or {}),
} |
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files@@ Coverage Diff @@
## main #792 +/- ##
==========================================
- Coverage 73.88% 68.78% -5.11%
==========================================
Files 31 32 +1
Lines 2003 2153 +150
==========================================
+ Hits 1480 1481 +1
- Misses 523 672 +149 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Sherwin-14. In addition to my other comments, please add at least 1 unit test. I don't think you need to test passing all possible pqdm kwargs, but perhaps just test that you can produce an error for downloading and have it raise an error (now that the default exception behavior is "immediate") instead of returning with a list containing errors.
earthaccess/store.py
Outdated
access = "direct" | ||
provider = granules[0]["meta"]["provider-id"] | ||
# if the data has its own S3 credentials endpoint, we will use it | ||
endpoint = self._own_s3_credentials(granules[0]["umm"]["RelatedUrls"]) | ||
if endpoint is not None: | ||
logger.info(f"using endpoint: {endpoint}") | ||
s3_fs = self.get_s3_filesystem(endpoint=endpoint) | ||
else: | ||
logger.info(f"using provider: {provider}") | ||
s3_fs = self.get_s3_filesystem(provider=provider) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sherwin-14, it seems you've marked many suggestions as resolved, but I don't see the changes. Perhaps I'm not looking in the right spot in the GitHub UI, but are you sure you pushed your changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sherwin-14, I suspect you marked many suggestions as "resolved," but forgot to push the changes that resolve the suggestions.
Also, please remember to go back to the Conversations tab of your PR and, in the upper-right area of the tab, click the circular double-arrows next to a reviewer's name to request a re-review after the reviewer has requested changes. Otherwise, the reviewer doesn't know when you're finished making changes and ready for a follow-up review.
Clicking that icon will send another email indicating that you want another review.
earthaccess/store.py
Outdated
access = "direct" | ||
provider = granules[0]["meta"]["provider-id"] | ||
# if the data has its own S3 credentials endpoint, we will use it | ||
endpoint = self._own_s3_credentials(granules[0]["umm"]["RelatedUrls"]) | ||
if endpoint is not None: | ||
logger.info(f"using endpoint: {endpoint}") | ||
s3_fs = self.get_s3_filesystem(endpoint=endpoint) | ||
else: | ||
logger.info(f"using provider: {provider}") | ||
s3_fs = self.get_s3_filesystem(provider=provider) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sherwin-14, it seems you've marked many suggestions as resolved, but I don't see the changes. Perhaps I'm not looking in the right spot in the GitHub UI, but are you sure you pushed your changes?
earthaccess/store.py
Outdated
fail_fast: if set to True, the download process will stop immediately | ||
upon encountering the first error. If set to False, errors will be | ||
deferred, allowing the download of remaining files to continue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a change. Unresolving.
@Sherwin-14, several other PRs have landed on |
@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed |
Sounds good, but please wait to mark any suggestions as resolved until you push the corresponding code changes, otherwise it is confusing to others. |
class TestEarthAccessDownload(unittest.TestCase):
def test_download(self):
earthaccess.login()
results = earthaccess.search_data(
short_name='ATL06', # ATLAS/ICESat-2 L3A Land Ice Height
bounding_box=(-10, 20, 10, 50), # Only include files in area of interest...
temporal=("1999-02", "2019-03"), # ...and time period of interest.
count=10
)
with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
with patch.object(mock_store, 'get') as mock_get:
mock_get.side_effect = Exception('Download failed')
files = earthaccess.download(results, "/home/download-folder")
self.assertEqual(files, [])
if __name__ == '__main__':
unittest.main() This is the test function I came up with and this is giving me an error log that looks something like this. Is this the thing we want to achieve at this point? Also if the test is correct then where should I place it should I make a completely new file for this like |
Hi @Sherwin-14. Thanks for working on the unit tests for this PR. The first thing I suggest you do is merge the latest from Regarding the test you've written, it looks like you're generally heading in the right direction. I have some suggestions:
|
I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency. |
You can still use unittest Mocks, but there's no need for creating a subclass of unittest.TestCase. |
👈 Launch a binder notebook on this branch for commit d22f2ac I will automatically update this comment whenever this PR is modified 👈 Launch a binder notebook on this branch for commit 374f303 👈 Launch a binder notebook on this branch for commit de8df1c 👈 Launch a binder notebook on this branch for commit 767e119 👈 Launch a binder notebook on this branch for commit 2d485b3 👈 Launch a binder notebook on this branch for commit 1061743 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to my comments, please also make corrections to fix the pre-commit issues: https://results.pre-commit.ci/run/github/399867529/1730125212.zyU74SY_QNaQKq_PZmUrIQ
It looks like you perhaps have not installed pre-commit hooks locally, otherwise your attempt to commit your changes should have produced those same pre-commit issues locally, thus preventing your commit from completing.
Please install pre-commit
, if you haven't already done so, an then run the following from your forked/cloned repo to install the hooks configured in the repo: pre-commit install --install-hooks
. Then run pre-commit run -a
to run the pre-commit checks against all the files, which should then help you fix the pre-commit errors.
tests/unit/test_api.py
Outdated
from earthaccess import api,download | ||
from unittest.mock import Mock | ||
|
||
def test_download(monkeypatch): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_download(monkeypatch): | |
def test_download_immediate_failure(monkeypatch): |
Co-authored-by: Chuck Daniels <cjdaniels4@gmail.com>
Co-authored-by: Chuck Daniels <cjdaniels4@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you have not correctly merged/rebased from main
. I see changes that I should not see. For example, in CHANGELOG.md, I see a new Fixed section, which is not new. It is already on main
, so should not show up as something new on your branch.
I think there are some other spots that seem out of sync too, such as in the earthacces/store.py
method _open_files
, where the type of the threads
parameter seems to have been clobbered. It should be threads: int = 8
(a recent change), instead of threads: Optional[int] = 8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Sherwin-14.
I'm approving, but do @mfisher87, @betolink, or @jhkennedy want to take a pass over this before it gets merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks amazing, @Sherwin-14 ! Thank you! A little nitpick, and let's merge this!
…diate_failure Co-authored-by: Matt Fisher <3608264+mfisher87@users.noreply.github.com>
User pre-commit-ci[bot] does not have permission to run integration tests. A maintainer must perform a security review of the code changes in this pull request and re-run the failed integration tests jobs, if the code is deemed safe. |
I did all the steps outlined in the issue #581, we can now build upon this.
📚 Documentation preview 📚: https://earthaccess--792.org.readthedocs.build/en/792/