Configuration to retrieve selected forecast times for model data #811

yt87 · 2023-11-05T19:12:14Z

yt87
Nov 5, 2023

The documentation is not very clear about what is allowed in the subtopic value. All the examples have only . == /, * == anything and # == anything as wildcard symbols. BTW, what is the difference between * and #?
In my use case, I want to retrieve GEM regional every 3 hours. So far, I managed to achieve this by adding a reject line, but that means I have to process all incoming messages and immediately reject two-thirds of them.
Here is what I think might work (to be tested soon):

acceptUnmatched False

subtopic model_gem_regional.10km.grib2.#
# FIXME: change this
directory /tmp/metpx/reg
reject .*/[01][0-9]/0([036][^0369]|[147][^258]|[258][^147])/
reject .*_[A-Z]{3,4}_ISBL_(50|[3-5]50|[1-2][27]5)_
accept .*_(TMP|SPFH|DEPR|UGRD|VGRD)_(ISBL|TGL)_
accept .*_(ABSV|VVEL|HGT)_ISBL_
accept .*_CWAT_EATM_
accept .*_PRMSL_
accept .*(4LFTX|SHOWA)_SFC
accept .*(CAPE|HLCY)_ETAL
accept .*_(PRES|ICEC|SNOD|WTMP|AC?PCP|SPFH|TCDC|PRATE|WEA..)_SFC

Is there a more efficient way?
A minor quirk: the documentation (Command Line Guide) states that the default for acceptUnmatched is False, which seems to be false.
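
Since the configuration above is still untested, here is a minimal sketch of how the first reject pattern could be checked offline against a plain hour % 3 test. The helper names and the sample file name are made up for illustration, and it assumes the pattern is applied from the start of the path and that forecast hours run from 000 to 084.

```python
import re

# First reject pattern from the configuration above, copied verbatim.
REJECT_HOURS = re.compile(r".*/[01][0-9]/0([036][^0369]|[147][^258]|[258][^147])/")


def sample_path(run: str, hour: int) -> str:
    """Build a path shaped like the datamart layout (the file name is illustrative)."""
    return (f"/model_gem_regional/10km/grib2/{run}/{hour:03d}/"
            f"CMC_reg_TMP_TGL_2_ps10km_20231105{run}_P{hour:03d}.grib2")


def mismatches(max_hour: int, run: str = "00") -> list[int]:
    """Return the forecast hours where the regex and 'hour % 3 != 0' disagree."""
    bad = []
    for hour in range(max_hour + 1):
        rejected = REJECT_HOURS.match(sample_path(run, hour)) is not None
        if rejected != (hour % 3 != 0):
            bad.append(hour)
    return bad


if __name__ == "__main__":
    print("disagreements for hours 000-084:", mismatches(84))
    print("disagreements for hours 000-099:", mismatches(99))
```

The pattern has no alternative for a tens digit of 9, so hours 090 to 099 would never be rejected by it. That only matters if the forecast range ever extends that far.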

petersilva · 2023-11-05T20:25:54Z

petersilva
Nov 5, 2023
Maintainer

Topic and wildcard grammar is defined by AMQP (e.g. discussion here: https://www.rabbitmq.com/tutorials/tutorial-five-python.html )

Each topic is a list of words separated by the period character (the topic separator.)
The asterisk (*) matches exactly one word in the pattern.
The hash mark (#) matches zero or more words, i.e. the rest of the topic.

Also described in the command line guide ( https://metpx.github.io/sarracenia/Explanation/CommandLineGuide.html#subtopic-amqp-pattern-default ):


 where:  
       *                matches a single directory name 
       #                matches any remaining tree of directories.
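
To make the difference between * and # concrete, here is a small sketch of the matching rules in plain Python. It only illustrates the semantics; the real matching happens inside the broker, and the function name is made up for this example.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Illustrate AMQP topic wildcards: '*' is exactly one word, '#' is zero or more."""

    def match(p: list[str], t: list[str]) -> bool:
        if not p:
            return not t          # pattern used up: match only if topic is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero, one, or many of the remaining words
            return any(match(rest, t[i:]) for i in range(len(t) + 1))
        if not t:
            return False          # pattern still needs a word, topic has none
        if head == "*" or head == t[0]:
            return match(rest, t[1:])
        return False

    return match(pattern.split("."), topic.split("."))


if __name__ == "__main__":
    topic = "model_gem_regional.10km.grib2.00.003"
    for pattern in ("model_gem_regional.10km.grib2.#",
                    "model_gem_regional.*.grib2.#",
                    "model_gem_regional.*"):
        print(pattern, "->", topic_matches(pattern, topic))
```

So model_gem_regional.* would only match topics with exactly one more word, while model_gem_regional.# matches everything below that directory.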

The server-side / AMQP topic filtering only provides coarse filtering, to minimize the amount of filtering done client side, but it doesn't eliminate it. It is expected that users need to do some finer-grained filtering on the client side. It sounds like the filtering you are doing is correct. Don't worry about rejecting two-thirds of the messages; worry when you are rejecting 99.9%. The messages are not that big, especially compared to the grib files you are downloading.

For acceptUnmatched: yup, that's an error. Indeed, in v2 the default for acceptUnmatched depended on which component was in use, so there was no global default. In sr3 it was changed to default to True everywhere to make it easier to understand, reason about, and explain. Some of the documentation was updated, but apparently not all. Thanks for mentioning it.

0 replies

petersilva · 2023-11-06T03:10:56Z

petersilva
Nov 6, 2023
Maintainer

Not recommending, just putting out options. For someone comfortable with python, and who is hitting the limits of regular expressions, one has the option of going ... um... full python

You can place the python module below in ~/.config/sr3/plugins/grib3rdhour.py

import logging
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)


class Grib3rdhour(FlowCB):
    """
        Selecting only every third hour of GRIB regional model outputs for download
    """

    def after_accept(self, worklist):

        new_incoming = []
        for message in worklist.incoming:

            # arbitrary logic here to accept/reject using full python on the message fields. Sample:
            # baseUrl=https://dd.weather.gc.ca/
            # relPath=/model_gem_regional/10km/grib2/00/000/CMC_reg_ABSV_ISBL_250_ps10km_2023110500_P000.grib2
            # element         1             2     3    4   5
            relative_path = message['relPath'].split('/')

            logger.info(f" received: {relative_path} ")

            o = 1
            # on hpfx.collab, set o=3 because the directory is two levels deeper.
            # baseUrl=https://hpfx.collab.science.gc.ca/
            # relPath=/20231105/WXO-DD/model_gem_regional/10km/grib2/12/006/CMC_reg_ABSV_ISBL_250_ps10km_2023110512_P006.grib2

            # guard: the path must be deep enough to hold the run and forecast hour directories.
            if len(relative_path) <= o + 4:
                message.setReport(201, 'Grib3rdhour: path is too short to be a regional model output.')
                worklist.rejected.append(message)
                continue

            if (relative_path[o] != 'model_gem_regional') or (relative_path[o+1] != '10km') or (relative_path[o+2] != 'grib2'):
                message.setReport(201, 'Grib3rdhour: only want 10km resolution regional model in grib 2 format.')
                worklist.rejected.append(message)
                continue

            # only want 00Z or 12Z run.
            if relative_path[o+3] not in ['12', '00']:
                message.setReport(201, 'Grib3rdhour: only want regional model outputs from 00Z or 12Z runs.')
                worklist.rejected.append(message)
                continue

            # the forecast hour directory (e.g. 036) must be a number.
            try:
                hour = int(relative_path[o+4])
            except ValueError:
                message.setReport(201, f'Grib3rdhour: field {o+4}: {relative_path[o+4]} should be an integer.')
                worklist.rejected.append(message)
                continue

            if (hour % 3) != 0:
                message.setReport(201, f'Grib3rdhour: field {o+4}: want reports every 3rd hour')
                worklist.rejected.append(message)
                continue

            new_incoming.append(message)

        # only the messages that passed every test remain queued for download.
        worklist.incoming = new_incoming

And add:


logreject on
callback grib3rdhour

to your subscriber configuration file.

It will then call the python callback module entry point (called after_accept) after a message has gone through the accept & reject clauses, but before the file download is initiated.
The python callback module rewrites the list of accepted messages (called worklist.incoming), removing some entries and adding them to the list of rejected ones (worklist.rejected.)
With logreject on, it will log a message giving the reason for the rejection of each message.

This provides a much more fine-grained and powerful, and perhaps much clearer, way of doing selection, but it requires the use of python code, as opposed to what is usually considered a simpler configuration Domain Specific Language (DSL.) Which is easier is perhaps a matter of taste.

Is that easier to understand than the regex/configuration-file-based configuration... I don't know.

0 replies

petersilva · 2023-11-06T03:21:05Z

petersilva
Nov 6, 2023
Maintainer

some sample logs from the above (they are different snippets, unrelated to each other, just to illustrate the sorts
of messages that would be in the log.)


2023-11-05 22:00:57,727 [INFO] grib3rdhour after_accept  received: ['', 'model_gem_regional', '10km', 'grib2', '00', '031', 'CMC_reg_VGRD_ISBL_985_ps10km_2023110600_P031.grib2'] 
2023-11-05 22:00:57,727 [INFO] grib3rdhour after_accept  received: ['', 'model_gem_regional', '10km', 'grib2', '00', '036', 'CMC_reg_WDIR_ISBL_200_ps10km_2023110600_P036.grib2'] 
2023-11-05 22:00:57,727 [INFO] grib3rdhour after_accept  received: ['', 'model_gem_regional', '10km', 'grib2', '00', '036', 'CMC_reg_UGRD_ISBL_450_ps10km_2023110600_P036.grib2'] 
2023-11-05 22:00:57,727 [INFO] grib3rdhour after_accept  received: ['', 'model_gem_regional', '10km', 'grib2', '00', '035', 'CMC_reg_SHTFL_SFC_0_ps10km_2023110600_P035.grib2'] 
2023-11-05 22:00:57,727 [INFO] grib3rdhour after_accept  received: ['', 'model_gem_regional', '10km', 'grib2', '00', '031', 'CMC_reg_TSOIL_SFC_0_ps10km_2023110600_P031.grib2'] 
.
.
.
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_WIND_ISBL_600_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/031/CMC_reg_WIND_ISBL_10_ps10km_2023110600_P031.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_SPFH_ISBL_100_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_WDIR_ISBL_275_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_WDIR_ISBL_400_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_PRES_PVU_1.5_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_VGRD_ISBL_1015_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
2023-11-05 22:00:51,907 [INFO] sarracenia.flowcb.log after_accept /model_gem_regional/10km/grib2/00/032/CMC_reg_WEASN_SFC_0_ps10km_2023110600_P032.grib2 rejected: 201 Grib3rdhour: field 5: want reports every 3rd hour 
.
.
.

2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_WDIR_ISBL_550_ps10km_2023110600_P036.grib2 
2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_TMP_ISBL_900_ps10km_2023110600_P033.grib2 
2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_WIND_ISBL_200_ps10km_2023110600_P033.grib2 
2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_UGRD_ISBL_10_ps10km_2023110600_P036.grib2 
2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_TMP_ISBL_20_ps10km_2023110600_P036.grib2 
2023-11-05 22:00:57,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/dd_rdps/CMC_reg_VGRD_TGL_40_ps10km_2023110600_P036.grib2

0 replies

petersilva · 2023-11-06T06:00:44Z

petersilva
Nov 6, 2023
Maintainer

I forgot to address the original question! Efficiency-wise, each regex has a compute complexity of O(n) (assuming this is right: https://stackoverflow.com/questions/5892115/whats-the-time-complexity-of-average-regex-algorithms ), the average URL is, say, 100 characters in length (n), and there are 9 of them in your sample configuration, so that's on the order of 900 operations per message. If you have a plain accept .* as your only accept statement, it actually short circuits, so no regexes are evaluated, and processing falls through to the callback.
The callback is a for loop that does a single split and some if statements; I suspect it is more efficient than 9 regexes as described above.
It is certainly not more costly, but one would have to instrument it to compare...

otoh, for the volume of files in question, it doesn't make a significant difference... microseconds perhaps?
The whole thing is going to be dominated by transfer time. If the interest is speed, then add instances.
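
For anyone who does want to instrument it, a rough sketch with timeit could look like the following. It assumes the patterns are tried in configuration order with re.match, that the first match wins, and that unmatched URLs are rejected (acceptUnmatched False); by_regex and by_split are made-up names. The two filters do not select exactly the same files, so this only compares cost per message.

```python
import re
import timeit

# (pattern, accept?) pairs copied from the sample configuration, in order.
RULES = [(re.compile(p), accept) for p, accept in [
    (r".*/[01][0-9]/0([036][^0369]|[147][^258]|[258][^147])/", False),
    (r".*_[A-Z]{3,4}_ISBL_(50|[3-5]50|[1-2][27]5)_", False),
    (r".*_(TMP|SPFH|DEPR|UGRD|VGRD)_(ISBL|TGL)_", True),
    (r".*_(ABSV|VVEL|HGT)_ISBL_", True),
    (r".*_CWAT_EATM_", True),
    (r".*_PRMSL_", True),
    (r".*(4LFTX|SHOWA)_SFC", True),
    (r".*(CAPE|HLCY)_ETAL", True),
    (r".*_(PRES|ICEC|SNOD|WTMP|AC?PCP|SPFH|TCDC|PRATE|WEA..)_SFC", True),
]]


def by_regex(url: str) -> bool:
    """Walk the accept/reject list; the first matching pattern decides."""
    for pattern, accept in RULES:
        if pattern.match(url):
            return accept
    return False  # acceptUnmatched False


def by_split(rel_path: str, o: int = 1) -> bool:
    """Same tests as the Grib3rdhour callback: one split plus a few comparisons."""
    parts = rel_path.split("/")
    if len(parts) <= o + 4 or parts[o:o + 3] != ["model_gem_regional", "10km", "grib2"]:
        return False
    if parts[o + 3] not in ("00", "12"):
        return False
    try:
        return int(parts[o + 4]) % 3 == 0
    except ValueError:
        return False


if __name__ == "__main__":
    # This path falls through all nine patterns, the worst case for by_regex.
    url = "/model_gem_regional/10km/grib2/00/036/CMC_reg_SHTFL_SFC_0_ps10km_2023110600_P036.grib2"
    n = 20_000
    for fn in (by_regex, by_split):
        seconds = timeit.timeit(lambda: fn(url), number=n)
        print(f"{fn.__name__}: {seconds / n * 1e6:.2f} microseconds per message")
```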

0 replies

yt87 · 2023-11-08T04:03:16Z

yt87
Nov 8, 2023
Author

Well, I tried the other server, retrieved 3560 files, which is slightly above 60% of those requested. NWS Eastern Region is also trying, maybe they will have more luck. The distance from Dorval to Long Island, NY is much shorter than to Anchorage.

1 reply

petersilva Nov 8, 2023
Maintainer

I didn't realize you were with NWS... We could make some arrangements to get you non-anonymous access, and provide higher queueing limits.

yt87 · 2023-11-08T04:17:23Z

yt87
Nov 8, 2023
Author

I wonder whether it would make sense to put the onus on the client to handle the queue. What I mean is that the client would store the received messages on a disk, with something like litequeue, and then proceed to retrieve the files. One thread is busy with receiving the messages and putting them into a queue, the other reads the queue and schedules retrievals.
I have no idea whether this is feasible without major changes to the code.
BTW, I am pushing for NWS to centralise retrieval of the CMC models.

2 replies

yt87 Nov 8, 2023
Author

Thanks, that would be great. Although it would likely be Eastern Region who would pull the data from CMC and then distribute to other sites.

petersilva Nov 8, 2023
Maintainer

We've never wanted to do that, because we get automatic self-levelling load distribution by having each instance fetch more messages only after it finishes downloading the ones it has. There are many ways to approach this. First method:

No Code, Just Add Broker.

If you have the ability to spin up containers, we can help you set up a broker on the other side, so the message transfer will be to your local broker, and then have the subscribers subscribe to that broker. No code change required at all.

You would have a layer of subscribers (aka shovels) that just download the messages from the Canadian broker and post them on the NWS broker. Actual downloaders would subscribe to the NWS broker. Message traffic is relatively lightweight, so the shovels are unlikely to fall too far behind. You preserve the auto load balancing among your download instances, and reduce the queueing load on the Canadian broker. Within ECCC, we have regional storm prediction centres from Vancouver to Gander, and each one has a local broker for that purpose. Second method:

Prefetch

If we don't care about auto-levelling, then as @reidsunderland suggested, we can set:


prefetch 10000

I'm not sure how high it can go before it becomes ineffective, but the idea is that each instance will pull up to prefetch messages from the broker at once. This means that if the files have wildly different sizes, or i/o rates, then some instances may be much slower than others, and the load balancing is much more coarse grained... but it definitely gets the messages downstream quickly. With four instances, all 40000 messages will be pulled off the broker pretty much immediately, and the instances then download the files at their leisure.
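
The coarser load balancing can be illustrated with a toy model. It is not how the broker actually schedules deliveries: it simply assumes that whichever instance becomes free first grabs the next prefetch messages and works through them, and the function name and durations are invented for the example.

```python
import heapq


def makespan(durations: list[float], instances: int, prefetch: int) -> float:
    """Time until the last instance finishes, in a toy batch-assignment model."""
    # Heap of (time at which the instance becomes free, instance id).
    free_at = [(0.0, i) for i in range(instances)]
    heapq.heapify(free_at)
    finish = 0.0
    for start in range(0, len(durations), prefetch):
        batch = durations[start:start + prefetch]
        t, i = heapq.heappop(free_at)      # earliest free instance takes the batch
        t += sum(batch)                    # and downloads it sequentially
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


if __name__ == "__main__":
    # Four slow downloads (10 s each) followed by 36 quick ones (1 s each).
    durations = [10.0] * 4 + [1.0] * 36
    for prefetch in (1, 10):
        print(f"prefetch={prefetch}: all done after {makespan(durations, 4, prefetch)} s")
```

In this model a small prefetch spreads the slow files across instances, while a large one can leave a single instance holding all of them while the others sit idle.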

Not currently working: write the messages to a file.

In v2, we had a means of writing messages to a file, and then reading them back.
I remember part of that functionality was kind of dropped on the floor for sr3, so I would have to review and see about picking it up again. Once it is re-enabled, one could decouple fetching messages from the source from downloading the files that way.

This is on the to-do list to restore; it just hasn't been something we have ended up using much, so priority is relatively low.

There are probably other ways, but those are the ones that spring immediately to mind.

yt87 · 2023-11-08T04:57:07Z

yt87
Nov 8, 2023
Author

The third option is what I had in mind, but with a file replaced by a disk-backed FIFO. Python module litequeue could be a candidate.
Configuring a local broker might be a good option for us.

0 replies

petersilva · 2023-11-08T05:08:22Z

petersilva
Nov 8, 2023
Maintainer

I think that's an easy thing to do... but it's code that would have to be written. I think we could have two flow callback plugins:

sarracenia/flowcb/post/liteq.py --- pushes stuff onto the FIFO.
sarracenia/flowcb/gather/liteq.py --- pops stuff off the FIFO.

The name of the file could be an option supported by both of them, and you have two flows, one that reads from the Canadian broker and posts to the FIFO, and a second subscriber, that reads from the FIFO.

I think it's pretty straight-forward... but it doesn't exist. If that sounds useful to you, we can put it on our todo list (aka create an issue for it.)

1 reply

petersilva Nov 8, 2023
Maintainer

or... a contribution would be even more welcome ;-)

yt87 · 2023-11-08T05:14:37Z

yt87
Nov 8, 2023
Author

I'll have a look.

2 replies

petersilva Nov 8, 2023
Maintainer

hints:

Instead of a subscriber, you would have a flow/ file, and the flow/get_messages_from_CMC.conf
config would have something like this in it:

broker amqps://dd.weather.gc.ca
exchange xpublic
topicPrefix v02.post
subtopic  as before... 

liteqFile /tmp/the_queue_name

callback gather.message
callback post.liteq

The above would be the config for the flow that downloads messages from the broker. The second, flow/download_files_from_cmc.conf:


liteqFile /tmp/the_queue_name

callback gather.liteq
flowmain download

The flowmain refers to a third needed file: flow/download.py

from sarracenia.flow import Flow
import logging

logger = logging.getLogger(__name__)

default_options = {'acceptUnmatched': True, 'download': True, 'mirror': False}

class Download(Flow):
    """
       * download files (gathering and posting is someone else's problem.)
    """

    def do(self):

        if self.o.download:
            self.do_download()
        else:
            # mark all remaining messages as done.
            self.worklist.ok = self.worklist.incoming
            self.worklist.incoming = []

        logger.debug('processing %d messages worked!' % len(self.worklist.ok))

in the flowcb plugins, the messages can be turned into json like so: jsonpickle.encode(message) + '\n'
(examples available in sarracenia/diskqueue.py and redisqueue.py)

petersilva Nov 8, 2023
Maintainer

another hint:

worklist.ok -> list of messages to push onto the FIFO.
worklist.incoming <- list of messages to pop from the FIFO.
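
As a starting point, here is a minimal sketch of the disk-backed FIFO itself, independent of sarracenia. It uses the standard library sqlite3 module as a stand-in for litequeue and plain json on dict-like messages instead of jsonpickle; the DiskFifo class and its methods are hypothetical, and the FlowCB wiring is left out.

```python
import json
import sqlite3


class DiskFifo:
    """A tiny persistent FIFO: one row per message, popped in insertion order."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS fifo "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)")
        self.db.commit()

    def push(self, messages: list[dict]) -> None:
        """Append messages; a post-side plugin would pass worklist.ok here."""
        with self.db:  # commits on success, rolls back on error
            self.db.executemany(
                "INSERT INTO fifo (body) VALUES (?)",
                [(json.dumps(m),) for m in messages])

    def pop(self, max_count: int) -> list[dict]:
        """Remove and return up to max_count of the oldest messages.

        A gather-side plugin would hand the result over as worklist.incoming.
        Only safe with a single consumer: the read and the delete are not
        protected against another reader taking the same rows.
        """
        with self.db:
            rows = self.db.execute(
                "SELECT id, body FROM fifo ORDER BY id LIMIT ?",
                (max_count,)).fetchall()
            self.db.executemany(
                "DELETE FROM fifo WHERE id = ?", [(row[0],) for row in rows])
        return [json.loads(row[1]) for row in rows]
```

The file path would come from the liteqFile option in both flows, so that the flow reading from the Canadian broker and the downloading flow share the same queue.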

yt87 · 2023-11-09T21:32:03Z

yt87
Nov 9, 2023
Author

Testing close with comment

0 replies
