-
Topics and the wildcard grammar are defined by AMQP (e.g. discussion here: https://www.rabbitmq.com/tutorials/tutorial-five-python.html ).
Also described in the command line guide: https://metpx.github.io/sarracenia/Explanation/CommandLineGuide.html#subtopic-amqp-pattern-default

The server-side / AMQP topic filtering only provides coarse filtering, to reduce the amount of filtering done client side, but it doesn't eliminate it. It is expected that users will need to do some finer-grained filtering on the client side. It sounds like the filtering you are doing is correct. Don't worry about filtering two-thirds of the messages, worry about filtering 99.9%. The messages are not that big, especially compared to the grib files you are downloading.

For acceptUnmatched: yup, that's an error. In v2 the default for acceptUnmatched depended on which component was in use, so there was no global default. In sr3 it was changed to default to true everywhere to make it easier to understand, reason about, and explain. Some of the documentation was updated, but apparently not all. Thanks for mentioning it.
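On the `*` versus `#` question raised in this thread: in the AMQP topic grammar described in the RabbitMQ tutorial linked above, `*` stands for exactly one dot-separated word and `#` stands for zero or more words. Here is a minimal sketch of those two rules in plain Python. It is only an illustration, not broker or Sarracenia code; `topic_matches` is a hypothetical helper, and the topic strings are made up from the directory names used in this discussion (no topic prefix included).

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Return True if an AMQP-style binding pattern matches a dot-separated topic.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    def match(p, t):
        if not p:
            # Pattern exhausted: it matches only if the topic is exhausted too.
            return not t
        head, rest = p[0], p[1:]
        if head == '#':
            # '#' either absorbs nothing, or absorbs one word and tries again.
            return match(rest, t) or (bool(t) and match(p, t[1:]))
        if not t:
            return False
        if head == '*' or head == t[0]:
            return match(rest, t[1:])
        return False

    return match(pattern.split('.'), topic.split('.'))


if __name__ == "__main__":
    topic = "model_gem_regional.10km.grib2.00.003"
    for pattern in ("model_gem_regional.10km.grib2.*",
                    "model_gem_regional.10km.grib2.*.*",
                    "model_gem_regional.#"):
        print(pattern, topic_matches(pattern, topic))
```

The first pattern fails because a single `*` cannot cover the two trailing words, while `#` covers any number of them.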
-
Not recommending, just putting out options. For someone comfortable with python, and who is hitting the limits of regular expressions, one has the option of going ... um... full python. You can place the python module below in ~/.config/sr3/plugins/grib3rdhour.py

    import logging

    from sarracenia.flowcb import FlowCB

    logger = logging.getLogger(__name__)


    class Grib3rdhour(FlowCB):
        """
        Selecting only every third hour of GRIB regional model outputs for download
        """

        def after_accept(self, worklist):
            new_incoming = []
            for message in worklist.incoming:
                # arbitrary logic here to accept/reject using full python on the message body. Sample:
                #   baseUrl=https://dd.weather.gc.ca/
                #   relPath=/model_gem_regional/10km/grib2/00/000/CMC_reg_ABSV_ISBL_250_ps10km_2023110500_P000.grib2
                #   element  1                  2    3     4  5
                relative_path = message['relPath'].split('/')
                logger.info(f" received: {relative_path} ")

                # o is the index of the 'model_gem_regional' element in the split path.
                o = 1
                # on hpfx.collab, set o=3 because the directory is two levels deeper.
                #   baseUrl=https://hpfx.collab.science.gc.ca/
                #   relPath=/20231105/WXO-DD/model_gem_regional/10km/grib2/12/006/CMC_reg_ABSV_ISBL_250_ps10km_2023110512_P006.grib2

                if (relative_path[o] != 'model_gem_regional') or (relative_path[o+1] != '10km') or (relative_path[o+2] != 'grib2'):
                    message.setReport(201, 'Grib3rdhour: only want 10km resolution regional model in grib 2 format.')
                    worklist.rejected.append(message)
                    continue

                # only want 0z or 12z run.
                if relative_path[o+3] not in ['12', '00']:
                    message.setReport(201, 'Grib3rdhour: only want regional model outputs from 00Z or 12Z runs.')
                    worklist.rejected.append(message)
                    continue

                # the forecast hour directory must be numeric.
                try:
                    hour = int(relative_path[o+4])
                except ValueError:
                    message.setReport(201, f'Grib3rdhour: field {o+4}: {relative_path[o+4]} should be an integer.')
                    worklist.rejected.append(message)
                    continue

                # keep only forecast hours that are multiples of 3.
                if (hour % 3) != 0:
                    message.setReport(201, f'Grib3rdhour: field {o+4}: want reports every 3rd hour')
                    worklist.rejected.append(message)
                    continue

                new_incoming.append(message)

            worklist.incoming = new_incoming

And add:
to your subscriber configuration file.
This provides a much more fine-grained and powerful, and perhaps much clearer, way of doing selection, but it requires python code, as opposed to what is usually considered simpler: a configuration Domain Specific Language (DSL). Which is easier is perhaps a matter of taste. Is it easier to understand than the regex/configuration-file-based approach? I don't know.
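One practical upside of the python route is that the selection rule can be tried out without a broker. Below is a minimal sketch that restates the same path checks as a standalone function; `wanted` and its `offset` parameter are hypothetical names introduced here for illustration, not part of Sarracenia, and the paths are the two samples quoted in the plugin comments.

```python
def wanted(rel_path: str, offset: int = 1) -> bool:
    """Mirror the plugin's checks: 10km regional grib2, 00/12Z run, every 3rd hour.

    offset is the index of 'model_gem_regional' after splitting on '/'
    (1 on dd.weather.gc.ca, 3 on hpfx.collab in the samples from this thread).
    """
    parts = rel_path.split('/')
    if len(parts) <= offset + 4:
        return False  # path too short to hold run and forecast-hour directories
    if parts[offset:offset + 3] != ['model_gem_regional', '10km', 'grib2']:
        return False
    if parts[offset + 3] not in ('00', '12'):
        return False
    try:
        hour = int(parts[offset + 4])
    except ValueError:
        return False
    return hour % 3 == 0


if __name__ == "__main__":
    dd = "/model_gem_regional/10km/grib2/00/000/CMC_reg_ABSV_ISBL_250_ps10km_2023110500_P000.grib2"
    hpfx = "/20231105/WXO-DD/model_gem_regional/10km/grib2/12/006/CMC_reg_ABSV_ISBL_250_ps10km_2023110512_P006.grib2"
    print(wanted(dd), wanted(hpfx, offset=3))
```

Keeping the rule in a plain function like this would also let the plugin's `after_accept` shrink to a loop that calls it, though that refactoring is a matter of taste.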
-
some sample logs from the above (they are different snippets, unrelated to each other, just to illustrate the sorts
-
I forgot to address the original question! Efficiency-wise, each regex evaluation has a compute complexity of O(n) (assuming this is right: https://stackoverflow.com/questions/5892115/whats-the-time-complexity-of-average-regex-algorithms ). If the average url is, say, 100 characters in length (n) and there are 9 regexes in your sample configuration, that's on the order of 900 operations per message. If you have a plain accept .* as your only accept statement, it actually short-circuits, so no regexes are evaluated and processing falls through to the callback. OTOH, for the volume of files in question, it doesn't make a significant difference... microseconds perhaps?
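Rather than guessing at the cost, it can be measured. The sketch below times nine regular expressions against a URL of about 100 characters using only the Python standard library. The nine patterns are invented for the illustration (they are not the configuration from this thread), and this measures Python's `re` module in isolation, not Sarracenia's own accept/reject processing.

```python
import re
import timeit

# Nine made-up patterns standing in for the accept/reject lines of a configuration:
# forecast hours 000, 003, ..., 024 of the 00Z or 12Z run.
patterns = [re.compile(rf".*/grib2/(00|12)/{h:03d}/.*") for h in range(0, 27, 3)]

# Forecast hour 001 matches none of the patterns, so all nine are evaluated.
url = ("https://dd.weather.gc.ca/model_gem_regional/10km/grib2/00/001/"
       "CMC_reg_ABSV_ISBL_250_ps10km_2023110500_P001.grib2")


def first_match(u: str):
    """Return the index of the first pattern matching u, or None if none match."""
    for i, p in enumerate(patterns):
        if p.match(u):
            return i
    return None


def seconds_per_message(repeats: int = 10000) -> float:
    """Average wall-clock time to run every pattern against the sample URL."""
    return timeit.timeit(lambda: first_match(url), number=repeats) / repeats


if __name__ == "__main__":
    print(f"{seconds_per_message() * 1e6:.2f} microseconds per message")
```

The number printed depends on the machine, so treat it as an order-of-magnitude check to set against the time spent downloading each file.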
-
Well, I tried the other server, retrieved 3560 files, which is slightly above 60% of those requested. NWS Eastern Region is also trying, maybe they will have more luck. The distance from Dorval to Long Island, NY is much shorter than to Anchorage.
-
I wonder whether it would make sense to put the onus on the client to handle the queue. What I mean is that the client would store the received messages on disk, with something like litequeue, and then proceed to retrieve the files. One thread is busy receiving the messages and putting them into a queue, while the other reads the queue and schedules retrievals.
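To make the two-thread idea concrete, here is a minimal sketch using only the Python standard library. It uses `sqlite3` as a stand-in for a disk-backed queue and does not use or imitate litequeue's API. `DiskQueue` and `run_demo` are hypothetical names, the message bodies are made-up paths, and the design assumes a single consumer thread.

```python
import os
import sqlite3
import tempfile
import threading
import time


class DiskQueue:
    """A tiny FIFO persisted in SQLite. Every call opens its own connection,
    so one instance can be shared between threads. Assumes a single consumer."""

    def __init__(self, path):
        self.path = path
        db = sqlite3.connect(self.path, timeout=30)
        try:
            with db:
                db.execute("CREATE TABLE IF NOT EXISTS q "
                           "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)")
        finally:
            db.close()

    def put(self, body):
        db = sqlite3.connect(self.path, timeout=30)
        try:
            with db:  # commits on success
                db.execute("INSERT INTO q (body) VALUES (?)", (body,))
        finally:
            db.close()

    def get(self):
        """Remove and return the oldest body, or None if the queue is empty."""
        db = sqlite3.connect(self.path, timeout=30)
        try:
            with db:
                row = db.execute("SELECT id, body FROM q ORDER BY id LIMIT 1").fetchone()
                if row is None:
                    return None
                db.execute("DELETE FROM q WHERE id = ?", (row[0],))
                return row[1]
        finally:
            db.close()


def run_demo(n=5):
    queue = DiskQueue(os.path.join(tempfile.mkdtemp(), "messages.db"))
    retrieved = []
    done = threading.Event()

    def receiver():
        # Stands in for the thread that receives notification messages.
        for i in range(n):
            queue.put(f"/model_gem_regional/10km/grib2/00/{3 * i:03d}/file{i}.grib2")
        done.set()

    def downloader():
        # Stands in for the thread that schedules retrievals.
        while True:
            finished = done.is_set()  # read before get() so no late message is missed
            body = queue.get()
            if body is not None:
                retrieved.append(body)  # a real client would fetch the file here
            elif finished:
                break
            else:
                time.sleep(0.01)

    threads = [threading.Thread(target=receiver), threading.Thread(target=downloader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return retrieved


if __name__ == "__main__":
    print(run_demo())
```

Because the queue lives in a file, messages that were received but not yet retrieved would survive a restart of the downloader, which is the point of putting it on disk.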
-
The third option is what I had in mind, but with a file replaced by a disk-backed FIFO. Python module |
-
I think that's an easy thing to do... but it's code that would have to be written. I think one could have two flowcallback plugins:
The name of the file could be an option supported by both of them, and you have two flows: one that reads from the Canadian broker and posts to the FIFO, and a second subscriber that reads from the FIFO. I think it's pretty straightforward... but it doesn't exist. If that sounds useful to you, we can put it on our todo list (aka create an issue for it.)
-
I'll have a look.
-
Testing close with comment
-
The documentation is not very clear about what is allowed in the `subtopic` value. All the examples have only `.` == `/`, `*` == anything and `#` == anything as wildcard symbols. BTW, what is the difference between `*` and `#`?

In my use case, I want to retrieve GEM regional every 3 hours. So far, I managed to achieve this by adding a `reject` line, but that means I have to process all incoming messages and immediately reject two-thirds of them.

Here is what I think might work (to be tested soon):

Is there a more efficient way?

A minor quirk: the documentation (Command Line Guide) states that the default for `acceptUnmatched` is `False`, which seems to be false.