Unify IOC/COOPS API #125

Merged: 15 commits from dev into master on Jun 26, 2024

Conversation

@pmav99 (Member) commented Jan 26, 2024

Addresses #99

@pmav99 pmav99 requested review from brey and tomsail January 26, 2024 17:51
@pmav99 pmav99 marked this pull request as draft January 26, 2024 17:52
@pmav99 (Member, Author) commented Jan 26, 2024

@SorooshMani-NOAA I pushed to the dev branch. Have a look.
I added the code in ioc.py but implemented the tests in a different file so that it is easier to run them.
Feel free to change anything. Don't worry too much about commit messages; we will rebase before we merge.

@brey & @tomsail, if you find the time to check out the dev branch and try it out, I would appreciate any feedback.

codecov bot commented Jan 29, 2024

Codecov Report

Attention: Patch coverage is 91.27182% with 35 lines in your changes missing coverage. Please review.

Project coverage is 89.89%. Comparing base (13ba6be) to head (1495264).
Report is 12 commits behind head on master.

Current head 1495264 differs from pull request most recent head 536ddbc

Please upload reports for the commit 536ddbc to get more accurate results.

Files                    Patch %   Lines
searvey/_coops_api.py    86.75%    13 Missing and 7 partials ⚠️
searvey/_ioc_api.py      93.04%    4 Missing and 4 partials ⚠️
searvey/utils.py         37.50%    5 Missing ⚠️
searvey/coops.py         96.42%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #125      +/-   ##
==========================================
- Coverage   90.18%   89.89%   -0.29%     
==========================================
  Files          13       16       +3     
  Lines        1049     1366     +317     
  Branches      155      190      +35     
==========================================
+ Hits          946     1228     +282     
- Misses         56       78      +22     
- Partials       47       60      +13     


[Resolved review thread on searvey/ioc.py]
@SorooshMani-NOAA (Contributor):

The COOPS web API uses redirects, so we need to make sure we always follow redirects for COOPS:
https://www.python-httpx.org/compatibility/
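
For reference, a minimal sketch of opting in to redirects with httpx (the endpoint URL and query parameters below are illustrative, not searvey's actual request code):

import httpx

# httpx does not follow redirects by default (unlike requests),
# so opt in explicitly or redirecting endpoints return a 3xx response.
response = httpx.get(
    "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter",
    params={
        "product": "water_level",
        "station": "8518750",
        "date": "latest",
        "datum": "MSL",
        "units": "metric",
        "time_zone": "gmt",
        "format": "json",
    },
    follow_redirects=True,
)
response.raise_for_status()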

@SorooshMani-NOAA (Contributor):

@pmav99 do you have any suggestions for debugging issues in the multithreaded/multiprocess calls with a breakpoint?

@pmav99 (Member, Author) commented Jan 31, 2024

Off the top of my head, no, not really. Do the functions work when you run them directly?

@SorooshMani-NOAA (Contributor):

I wanted to attach the debugger and step through the code when testing different inputs. For now I've gotten by with print when I see issues, but I wanted to know if there's a better way of doing it.

@pmav99 (Member, Author) commented Feb 1, 2024

The problem with using breakpoint() or pdb.set_trace() is that the breakpoint will get executed as many times as you call the function. I doubt this is what you want.

But, TBH, this question feels a bit strange. Just write and test the functions that retrieve and parse the data without any concurrency. As soon as the retrieve_data and parse_data functions are correct, you can plug in multifutures (or whatever) to add concurrency and make the whole thing faster. Trying to integrate concurrency from the get-go is probably overcomplicating things.

That being said, what can help with debugging multiprocessing/multithreading code is using the logging module, which if configured correctly, can print information about the current thread/process. For example:

import logging
import time

import multifutures as mf

logging.basicConfig(
    level=10,  # 10 == logging.DEBUG
    style="{",
    format="{levelname:8s}; {process:6d}; {threadName:8s}; {asctime:s}; {name:<15s} {lineno:4d}; {message:s}",
    force=True,
)
logger = logging.getLogger(__name__)


def foo(a):
    time.sleep(0.3)
    logger.info("The value of a=%d", a)


if __name__ == "__main__":  # guard is needed for multiprocessing on spawn platforms
    logger.info("Start Multiprocessing")
    results = mf.multiprocess(foo, [{"a": i} for i in range(5)])
    logger.info("Start Multithreading")
    results = mf.multithread(foo, [{"a": i} for i in range(5)])
    logger.info("Done")

[Four resolved review threads on searvey/coops.py]
@pmav99 (Member, Author) left a comment:

Looks good.

If there are going to be so many functions that are shared between COOPS and IOC, we should extract them to a separate module and import them from there.
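
For instance, a hypothetical extraction might look like this (the module name _common.py and the helper are assumptions for illustration, not the actual searvey layout):

# searvey/_common.py -- hypothetical shared module
import pandas as pd

def merge_dataframes(dataframes: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-request dataframes into a single, time-sorted one."""
    if not dataframes:
        return pd.DataFrame()
    return pd.concat(dataframes).sort_index()

# searvey/_ioc_api.py and searvey/_coops_api.py would then both do:
# from searvey._common import merge_dataframes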

[Seven resolved review threads on searvey/coops.py]
@SorooshMani-NOAA (Contributor):

... we should extract them ...

I agree. I tried to stay as close to the updated IOC code as possible; that's why I basically copy-pasted everything and then started modifying. Some of these can easily be put in utils to be used by all the data-source APIs, but I also don't want to start abstracting too soon.

@pmav99 (Member, Author) commented Feb 5, 2024

WRT the names mapping: I haven't tried this, but I was wondering whether it is actually simpler to do something like this:

  1. Run the code we have in order to generate the final dictionaries.
  2. Hardcode the final dictionaries in the file.
     Any time a product gets added/removed, we add/remove a line from the respective dictionary.

Yes, the code will be longer and repetitive, but we have no code generation, no .update(), no ugly comprehensions, no **, etc. Just declarative code.
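
A minimal sketch of the declarative style (the dictionary name and entries here are illustrative, not the actual generated mapping):

# Hardcoded, declarative mapping: one line per product, no generation logic.
COOPS_PRODUCT_FIELD_NAMES = {
    "water_level": {"v": "value", "s": "sigma", "f": "flags", "q": "quality"},
    "hourly_height": {"v": "value", "s": "sigma", "f": "flags"},
    "predictions": {"v": "value"},
}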

@SorooshMani-NOAA (Contributor):

@pmav99 it's a good idea, I'll do that!

@SorooshMani-NOAA (Contributor):

@pmav99 I'm trying to decide what to do with defaults for datetimes. The issue is that products differ:

  • For some products (e.g. high_low, hourly_height, and monthly_mean) the default range must start more than one month before the current time, otherwise nothing is fetched.
  • For others (e.g. one_minute_water_level) only roughly the last month of data is available, so the default range must not reach further back than that.

To resolve the first issue I can just create a COOPS-specific _resolve_start_date and set the default to 60 days. For example, as of today (February 6th, 2024), for monthly_mean I should specify 2023/12/31 to 2024/02/06 so that I get data. If instead I specify 2024/01/01, it fails!

But specifying 60 days causes problems for one_minute_water_level. The max interval for this product is 4 days, and there's no data older than about one month (at least for the station I'm testing). So if I specify 2024/01/09 to 2024/02/06, it fails because the chunk from the 9th to the 13th doesn't have data.

I'm not sure what's best. Should I just leave it to the user and say the default might not work for some products? Or should we add more dictionaries with info about the default start-date range (relative to now) for each product? I'm not even sure these limits hold for all stations! Should I first consult with COOPS?

If it's the latter, I think it makes sense to go with "the default might not work" for now and amend later.

Actually, another question: should I raise an error if there's no data, or just warn the user? COOPS gives me an error message when there's no data.

searvey/coops.py (outdated review thread):

if "error" in result.result:
    msg = json.loads(result.result)["error"]["message"]
    logger.error("Encountered an error response!")
    raise ValueError(msg)
@SorooshMani-NOAA (Contributor):

@pmav99 should I raise or ignore? What do you think?

@pmav99 (Member, Author) commented Feb 7, 2024

I think ignore is the correct approach. When dealing with historical data it's not that uncommon to have big gaps. E.g. a station might have been offline for a few months (broken instrument or whatever). This does not mean that the final dataframe will be empty; other requests/URLs might return data. Even if the final dataframe is empty, it is not difficult for the user to deal with it.

I guess we could add a raise_if_empty argument to fetch_*() but I don't think it offers that much value.
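
In code, the "ignore" approach might look something like this (a sketch based on the snippet above; the function shape and the empty-dataframe fallback are assumptions, not the merged implementation):

import json
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def parse_response(raw: str) -> pd.DataFrame:
    if "error" in raw:
        msg = json.loads(raw)["error"]["message"]
        # Warn and move on: other chunks/stations may still return data,
        # and an empty dataframe is easy for the caller to handle.
        logger.warning("No data for this request: %s", msg)
        return pd.DataFrame()
    return pd.DataFrame(json.loads(raw)["data"])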

@SorooshMani-NOAA (Contributor):

@pmav99 except for the two questions above, I'm done with the COOPS API. I know that the metadata API needs some adjustment, but I'm assuming we want to keep that separate from the data API update, right?

@pmav99 (Member, Author) commented Feb 7, 2024

WRT the default dates: I have to admit that I haven't really looked into this, but couldn't we have a mapping with the default durations? If the data are not too big, we could even return the maximum interval by default. Am I missing something?
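
Something along these lines, perhaps (the dictionary, the durations, and the helper are hypothetical; _resolve_start_date is only mentioned above as an idea):

import datetime

# Hypothetical per-product default durations; the values are illustrative.
COOPS_DEFAULT_DURATION = {
    "water_level": datetime.timedelta(days=30),
    "one_minute_water_level": datetime.timedelta(days=4),
    "monthly_mean": datetime.timedelta(days=365),
}

def _resolve_start_date(product: str, end_date: datetime.datetime) -> datetime.datetime:
    return end_date - COOPS_DEFAULT_DURATION[product]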

@SorooshMani-NOAA (Contributor):

I agree that we might be able to cover most cases with dictionaries, but I don't want to add too many dicts: one for how far back I should go before data starts (e.g. for mean, hi/lo, etc.), one for how far back I can go before data disappears (old data gets removed, e.g. one-minute heights), and so on. I just don't want to add a dict for each of these limitations, especially the ones that aren't documented anywhere and that I only found by trial and error! For now I think I'll just go with letting the user figure it out; let's address this later. This PR is already a lot of changes and dictionaries!

@SorooshMani-NOAA SorooshMani-NOAA linked an issue Feb 14, 2024 that may be closed by this pull request
@SorooshMani-NOAA (Contributor) commented Jun 5, 2024

@pmav99 I lumped all of the COOPS changes together:

  • adding the new API for COOPS
  • refactoring the API (and the tests you added there)
  • adding COOPS tests for the new & refactored API

Note that in this case the tests will fail after the first and second commits, since all the COOPS-related tests are updated in the third commit. Since the refactoring requires rewriting the tests, I cannot combine the 3rd and 1st commits either; it still fails. Unless you prefer that I split the 3rd commit and add each part to the 1st and 2nd commits?

Please let me know what to do.

@SorooshMani-NOAA (Contributor):

@pmav99 please let me know if I need to change anything else

@SorooshMani-NOAA SorooshMani-NOAA mentioned this pull request Jun 12, 2024
@pmav99 pmav99 force-pushed the dev branch 2 times, most recently from 9357962 to 18b6a92, on June 19, 2024 10:17
@pmav99 (Member, Author) commented Jun 19, 2024

@SorooshMani-NOAA I disabled the progress bar by default on the new API. Let me know what you think about this.

@pmav99 pmav99 marked this pull request as ready for review June 19, 2024 10:43
@SorooshMani-NOAA (Contributor):

@pmav99 looks good to me. Please go ahead and merge if you're happy with everything as well, thank you!

@pmav99 pmav99 merged commit e29e879 into master Jun 26, 2024
8 checks passed
Development

Successfully merging this pull request may close these issues:

  • COOPS Metadata for All Product Types
  • Adapt COOPS to have an IOC-like API