
Factorize client interface and provide "download all in range" #64

Open
wants to merge 25 commits into base: master

Conversation

@LucHermitte LucHermitte commented Sep 10, 2024

This PR is quite big; I had to make some choices while trying to leave the current API as-is. More on that below.

The main changes I propose are:

  • a new method to query the list of all the files that intersect a time range -- the current API only allows requesting a set of files given a finite and exhaustive list of datetimes and missions
  • a harmonization of ASFClient and DataspaceClient by:
    • introducing a root class Client that factorizes a common interface for querying and authenticating
  • a new design:
    • client instances are not authenticated, but they can query what is available on their respective servers
    • calling the authenticate() method on each client yields a session object that provides the single download_all method
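The split described above could be sketched roughly as follows; the class and method names mirror the description but simplify the real signatures, so treat them as illustrative:

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import List, Sequence


class Session(ABC):
    """Authenticated handle returned by Client.authenticate()."""

    @abstractmethod
    def download_all(self, references: Sequence[str], output_dir: str) -> List[str]:
        ...


class Client(ABC):
    """Common, unauthenticated query interface shared by both clients."""

    @abstractmethod
    def query_orbits_by_dt_range(
        self, t0: datetime, t1: datetime, missions: Sequence[str]
    ) -> List[str]:
        """List all products intersecting [t0, t1] for the given missions."""
        ...

    @abstractmethod
    def authenticate(self) -> Session:
        """Return a session object able to download the queried references."""
        ...
```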

A few smaller changes are included:

  • the .netrc file is now either explicitly given as a parameter, obtained from the $NETRC environment variable (a name already used by other tools), or defaulted to ~/.netrc
  • I've added typing information in quite a few places (to silence my LSP client, and to remove possible errors)
  • Tests of the ASFClient API avoid querying the list of available files again and again. Instead, a cached baseline is used, and only one test queries the list of available files and makes sure the new list contains the cached baseline. EDIT: at this point I had missed the usage of VCR. Still, I think we don't need multiple redundant slow requests when recording the cassettes.
  • Cached EOFs in ASFClient are no longer static and global, but local to each client instance -- this way we know exactly what to expect from each test, and we can compare the cached EOF list against the list of remote EOFs available on the ASF server
  • I've introduced an OrbitType enum to avoid typos, and assertions to silence pylint
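The .netrc resolution order from the first bullet above can be sketched like this (resolve_netrc is a hypothetical helper name, not the actual function in the PR):

```python
import os
from pathlib import Path
from typing import Optional


def resolve_netrc(explicit: Optional[str] = None) -> Path:
    """Resolution order: explicit parameter, then $NETRC, then ~/.netrc."""
    if explicit is not None:
        return Path(explicit)
    env = os.environ.get("NETRC")
    if env:
        return Path(env)
    return Path.home() / ".netrc"
```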

Finally, there are things that are only half done, as I didn't want to completely modify the current API in case you were explicitly relying on it.

  • Query methods return objects of different types: List[str] in the ASF case and List[dict] in the Copernicus Dataspace case. Ideally, we would return a List[SentinelOrbit], which all download*() methods would know how to convert into something that makes sense at their level
  • I have the feeling the two clients do not handle the T0/T1 offsets in the same way. I'd recommend centralizing this at the Client level, but I'd prefer to know exactly how you want it done
  • The implementation of the parallel download isn't the same in both cases. As I don't know which approach you prefer, I haven't tried to factorize anything. IMO, download_all should live at the Session level and call a download_one specialized for each client
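For the last bullet, a Session-level download_all delegating to a per-client download_one could look like the sketch below (names and the thread-pool approach are hypothetical; the real per-client implementations differ):

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


class Session(ABC):
    """Authenticated session: download_one is client-specific,
    download_all is the shared parallel driver."""

    @abstractmethod
    def download_one(self, reference: str, output_dir: str) -> str:
        """Download a single product; specialized per client."""
        ...

    def download_all(
        self, references: Sequence[str], output_dir: str, max_workers: int = 4
    ) -> List[str]:
        # pool.map preserves the input order of references in its results.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(
                pool.map(lambda ref: self.download_one(ref, output_dir), references)
            )
```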

Luc Hermitte added 22 commits September 6, 2024 13:57
There is no need to authenticate in order to search for available products
So far, `query_orbit` and `query_orbit_by_dt` (on Dataspace side) return only
one orbit file: the one that contains the requested time range.

The new `query_orbits_by_dt_range` method returns all orbit files contained in
the requested time range.

In order to factorize code, internal `_QueryOrbitFile` classes are defined.
EOF files on the ASF server are listed only once in tests. Ideally this should
have been mocked, but the interrogation is still kept.

Other tests will rely on a snapshot of precise orbit files that is cached.
A test dedicated to testing the use of the cached baseline has been defined.
...instead of global and shared among all ASFClient instances.
And factorize out the creation of the CDSE access token in order to be able to
inject the 2FA token from the command line.
In the end, no token nor secrets are recorded.
This will concern tests in test_eof.py. DB searching is tested in
test_asf_client.py
@LucHermitte
Contributor Author

Also, I've stripped all authentication-related secrets from the cassettes.

The strange design in test_eof around cdse_access_token is to support recording and using cassettes with or without 2FA activated on the CDSE account.

@scottstanie
Owner

This is quite a PR @LucHermitte! It might take a little while to get through it all, but it looks like a lot of improvements. My only initial questions are:

  • I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)
  • I've been reminded by ASF that they now have easier, password-free storage of orbits on S3:
 aws --no-sign-request s3 ls s3://s1-orbits/AUX_POEORB/ | head -3
2024-07-26 22:04:55    4409614 S1A_OPER_AUX_POEORB_OPOD_20210203T122423_V20210113T225942_20210115T005942.EOF
2024-07-26 22:05:06    4409601 S1A_OPER_AUX_POEORB_OPOD_20210204T122413_V20210114T225942_20210116T005942.EOF
2024-07-26 22:06:36    4409874 S1A_OPER_AUX_POEORB_OPOD_20210205T122416_V20210115T225942_20210117T005942.EOF

Do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

@LucHermitte
Contributor Author

it might take a little while to get through it all

Of course. I know. :(

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)

In the tests, definitely. I guess set should be enough. In the code, however, it's quite practical, as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also, maybe there is a way to sort on one criterion and apply a kind of uniq function on another criterion (given a priority order). If you know of a standard way of doing it, we could easily remove the dependency here. Any ideas?

I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3 [...] do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

Quite certainly. The idea behind the changes was to keep a single and unique interface which is quite simple:

  1. search according to one criterion or another
  2. download the references found

It should even be possible to add a new S3ASFClient that has the same interface and which doesn't use a cache. This could be a wrapper around boto3.
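Since every Sentinel-1 orbit filename encodes its validity window, such an S3 client could answer time-range queries purely from the unauthenticated listing. A small sketch of the filename filtering (the helper names and regex are illustrative, not from the code base):

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Validity window encoded in each orbit filename, e.g.
# S1A_OPER_AUX_POEORB_OPOD_20210203T122423_V20210113T225942_20210115T005942.EOF
_VALIDITY = re.compile(r"_V(\d{8}T\d{6})_(\d{8}T\d{6})\.EOF$")
_FMT = "%Y%m%dT%H%M%S"


def validity(name: str) -> Optional[Tuple[datetime, datetime]]:
    """Extract the (start, stop) validity window from an orbit filename."""
    m = _VALIDITY.search(name)
    if not m:
        return None
    return (
        datetime.strptime(m.group(1), _FMT),
        datetime.strptime(m.group(2), _FMT),
    )


def intersects(name: str, t0: datetime, t1: datetime) -> bool:
    """True if the file's validity window overlaps [t0, t1]."""
    v = validity(name)
    return v is not None and v[0] <= t1 and v[1] >= t0
```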

Also, IIRC, I've seen another REST-based interface for requesting filename references on the earthdata server. As I didn't want to introduce more changes, I didn't try to replace the current approach, and I didn't have much time to investigate whether there is indeed another way to search for files by arbitrary time- and mission-based criteria.

@scottstanie
Owner

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used once in the code (and once in a test)

In the tests, definitely. I guess set should be enough. In the code, however, it's quite practical, as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also, maybe there is a way to sort on one criterion and apply a kind of uniq function on another criterion (given a priority order). If you know of a standard way of doing it, we could easily remove the dependency here. Any ideas?

What about something like this, using groupby (not fully tested)?

    from itertools import groupby

    candidates = [
        item
        for item in data
        if item.start_time <= (t0 - margin0) and item.stop_time >= (t1 + margin1)
    ]

    if not candidates:
        raise ValidityError(
            "none of the input products completely covers the requested "
            "time interval: [t0={}, t1={}]".format(t0, t1)
        )

    # Sort candidates by all attributes except created_time
    candidates.sort(key=lambda x: (x.start_time, x.stop_time, x.id))

    # Group by everything except created_time and select the one with the latest created_time
    result = []
    for _, group in groupby(candidates, key=lambda x: (x.start_time, x.stop_time, x.id)):
        result.append(max(group, key=lambda x: x.created_time))

    return result

(unsure if that's the exact key we want to group on)

@LucHermitte
Contributor Author

I've removed the dependency on sortedcontainers. Using sorted() instead of SortedList is a bit slower on my machine: 40 ms, which is negligible.

Regarding my manual implementation of uniq, it does its job in even less time (2 ms). As I'm not used to groupby, and as there is no id field in SentinelOrbit, I've left it as it is for now.
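For reference, a dict-based uniq needs neither sortedcontainers nor groupby: keep, for each validity window, only the entry with the latest creation time. A sketch with a stand-in record type (the real SentinelOrbit attributes may differ):

```python
from collections import namedtuple
from typing import Iterable, List

# Illustrative stand-in for SentinelOrbit; attribute names are assumptions.
Orbit = namedtuple("Orbit", "start_time stop_time created_time")


def dedupe_latest(orbits: Iterable[Orbit]) -> List[Orbit]:
    """For each (start_time, stop_time) validity window, keep the orbit
    with the latest created_time; return the result sorted by window."""
    best = {}
    for o in orbits:
        key = (o.start_time, o.stop_time)
        if key not in best or o.created_time > best[key].created_time:
            best[key] = o
    return sorted(best.values(), key=lambda o: (o.start_time, o.stop_time))
```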

(I can attach my flawed benchmarking test if you're interested -- I don't think it belongs in the sentineleof code base.)
