Add Images #606
@addie9800 Thanks for pushing this idea 👍 I think this would be an awesome addition to Fundus! 🚀 It looks like you accidentally pushed a lot of unrelated files to the draft, making it harder for me to focus on the core idea. Would you mind removing those files so we can talk about the changes?
Ah well, sorry about that. I only intended to push one extra file :)... I have cleaned up a bit now, in case you want to have a look, but I haven't reached a real milestone since you last had a peek at it. The issue I am struggling with most at the moment is the dynamic rescaling of images: some publishers change the path of the URL according to the resolution required for the given screen, making it difficult to come up with a selector for the corresponding
Update: As of now, I have verified the functionality for TheNamibian, DerStandard, ORF, NineNews and CBCNews. They can be used to get an impression of the intended functionality.
Update `FreiePresse`
src/fundus/parser/utility.py
Outdated
    # parse description
    description = nodes_to_text(alt_selector(node))

    # parse authors
    authors = []
    if isinstance(author_selector, Pattern):
        # author is part of the caption
        if caption and (match := re.search(author_selector, caption)):
            authors = [match.group("credits")]
            caption = re.sub(author_selector, "", caption).strip() or None
        elif description and (match := re.search(author_selector, description)):
            authors = [match.group("credits")]
            description = re.sub(author_selector, "", description).strip() or None
I would suggest leaving description as is and not applying filters here. If I remember correctly, we stated in the documentation that it's the parsed alt attribute of the image, so I would argue one would expect the raw data.
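One way to read this suggestion is sketched below: credits are still looked up in both the caption and the description, but only the caption is rewritten, so the description stays the raw alt text. The helper name `extract_credits`, its signature, and the example pattern are hypothetical and not part of the PR; only the `credits` group name is taken from the snippet above.

```python
from __future__ import annotations

import re
from typing import List, Optional, Tuple


def extract_credits(
    caption: Optional[str],
    description: Optional[str],
    author_pattern: re.Pattern[str],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Hypothetical helper: pull credits out of the caption, never alter the description."""
    if caption and (match := author_pattern.search(caption)):
        # Credits found in the caption: record them and strip them from the caption.
        authors = [match.group("credits")]
        caption = author_pattern.sub("", caption).strip() or None
        return caption, description, authors
    if description and (match := author_pattern.search(description)):
        # Credits found in the alt text: record them, but keep the alt text raw.
        return caption, description, [match.group("credits")]
    return caption, description, []


# Example pattern (an assumption for illustration, not a publisher's real format).
pattern = re.compile(r"\s*Photo: (?P<credits>.+)$")
print(extract_credits("A bridge at dusk. Photo: Jane Doe", "Bridge", pattern))
```

Whether authors should still be read from the description at all is a separate design question; this sketch only shows that reading and rewriting can be decoupled.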
src/fundus/parser/utility.py
Outdated
    try:
        width = float(source.get("width") or 0) or None
    except ValueError:
        width = None
    try:
        height = float(source.get("height") or 0) or None
    except ValueError:
        height = None
Do you know which case leads to a ValueError? In general, I would like to avoid try ... except blocks, especially in frequently executed code.
Yes, it happens for Taipei Times. They use '100%' as the width, which cannot be parsed as a float, and I think there is no good way to derive a proper width for the size parameter from this without inspecting the actual image. But I guess there's no need to rely on a try ... except. I have seen some approaches that rely on regex or string replacement followed by a call to isdigit().
Update: I checked Stack Overflow, where someone ran a benchmark, and I used their recommended solution as an alternative: https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-represents-a-number-float-or-int
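For illustration, a minimal sketch of the string-replacement plus isdigit() idea mentioned above. It is not necessarily the solution that ended up in the PR; the helper name `parse_dimension` is hypothetical, and the `isascii()` guard is an added assumption to keep non-ASCII digit characters away from `float()`.

```python
from typing import Optional


def parse_dimension(raw: Optional[str]) -> Optional[float]:
    """Hypothetical helper: return a positive dimension as float, else None."""
    if raw is None:
        return None
    candidate = raw.strip()
    # Removing at most one "." and checking isdigit() accepts "640" and "640.5"
    # but rejects "100%", "auto", "" and negative values without raising.
    if not (candidate.isascii() and candidate.replace(".", "", 1).isdigit()):
        return None
    # Mirror the original `float(...) or None`: a dimension of 0 counts as missing.
    return float(candidate) or None


for value in ("640", "12.5", "100%", "", None):
    print(repr(value), "->", parse_dimension(value))
```

With this shape, `width = parse_dimension(source.get("width"))` would replace each try ... except block, at the cost of silently dropping relative values such as '100%'.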