Add Images #606

addie9800 · 2024-09-04T09:58:51Z

No description provided.

MaxDall · 2024-09-13T11:35:12Z

@addie9800 Thanks for pushing this idea 👍 I think this would be an awesome addition to Fundus! 🚀

It looks like you accidentally pushed a lot of unrelated files to the draft, making it harder for me to focus on the core idea. Would you mind removing those files so we can talk about the changes?

addie9800 · 2024-09-14T15:35:11Z

Ah well, sorry about that. I only intended to push one extra file :) I have cleaned up a bit now, in case you want to have a look, but I haven't reached a real milestone since you last had a peek. The issue I am struggling with most at the moment is the dynamic rescaling of images: some publishers change the URL path according to the resolution required for the given screen, which makes it difficult to come up with a selector for the corresponding img element. If you have any ideas, shoot ;)

addie9800 · 2024-09-17T14:56:26Z

Here are some examples of problematic cases:

The file name is changed to reflect the resolution and width of an image. The article https://www.spiegel.de/netzwelt/apps/instagram-beschraenkt-accounts-von-teenagern-a-3b65f364-e98b-4a31-8cfc-2d7ec4fc19bf#ref=rss has three versions of the same image in its JSON: https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.778_fpx70_fpy50.99.jpg , https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.33_fpx70_fpy50.99.jpg and https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1_fpx70_fpy50.99.jpg . Yet none of these links actually appears in an img element on the website. In my browser the actual link is: https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w960_r1.778_fpx70_fpy50.99.jpg

Similarly, some publishers change the path to the image based on the image resolution. In the article https://www.welt.de/politik/deutschland/article253552426/Christian-Lindners-Steuerplaene-entlasten-Gutverdiener-am-staerksten.html there is one image in the JSON: https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci16x9-w1200/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg but in my browser this image is used: https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci23x11-w20/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg
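
One way to picture the matching problem in the two cases above: strip the size-dependent tokens from both URLs and compare what is left. The following is only a minimal sketch under that assumption, not the approach taken in this PR. The names `SIZE_TOKEN`, `normalize_image_url` and `is_same_image` are hypothetical, and the pattern only covers the token shapes seen in the Spiegel and Welt URLs quoted here.

```python
import re

# Hypothetical pattern: matches size tokens like "_w1200", "_r1.778",
# "-w20" or "-ci16x9" as seen in the example URLs. Other publishers
# will likely need other patterns.
SIZE_TOKEN = re.compile(r"[_-](?:w\d+|r\d+(?:\.\d+)?|ci\d+x\d+)")


def normalize_image_url(url: str) -> str:
    """Remove resolution-dependent tokens so variants of one image compare equal."""
    return SIZE_TOKEN.sub("", url)


def is_same_image(url_a: str, url_b: str) -> bool:
    """Treat two URLs as the same image if they only differ in size tokens."""
    return normalize_image_url(url_a) == normalize_image_url(url_b)


spiegel_json = "https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.778_fpx70_fpy50.99.jpg"
spiegel_html = "https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w960_r1.778_fpx70_fpy50.99.jpg"
welt_json = "https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci16x9-w1200/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg"
welt_html = "https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci23x11-w20/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg"

print(is_same_image(spiegel_json, spiegel_html))
print(is_same_image(welt_json, welt_html))
print(is_same_image(spiegel_json, welt_json))
```

A fixed token pattern is brittle across publishers, so a real implementation would probably need a fuzzier similarity measure or publisher-specific rules.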

addie9800 · 2024-09-24T21:57:59Z

Update: As of now, I have verified the functionality for TheNamibian, DerStandard, ORF, NineNews and CBCNews. They can be used to get an impression of the intended functionality.

MaxDall · 2024-12-21T17:48:24Z

src/fundus/parser/utility.py

+        # parse description
+        description = nodes_to_text(alt_selector(node))
+
        # parse authors
        authors = []
        if isinstance(author_selector, Pattern):
            # author is part of the caption
            if caption and (match := re.search(author_selector, caption)):
                authors = [match.group("credits")]
                caption = re.sub(author_selector, "", caption).strip() or None
+            elif description and (match := re.search(author_selector, description)):
+                authors = [match.group("credits")]
+                description = re.sub(author_selector, "", description).strip() or None


I would suggest leaving description as is and not applying filters here. If I remember correctly, we stated in the documentation that it's the parsed alt attribute of the image, so I would argue one would expect the raw data.
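
To illustrate the suggestion, here is a small sketch in which credits are stripped from the caption only, while the description (the alt text) is passed through untouched. `CREDIT_PATTERN` and `split_credits` are hypothetical names, and the pattern is an invented example. The only thing taken from the diff is the use of a named group called `credits`.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical credit pattern with the named group "credits", e.g. "(Photo: Jane Doe)".
CREDIT_PATTERN = re.compile(r"\s*\((?:Photo|Foto):\s*(?P<credits>[^)]+)\)\s*$")


def split_credits(caption: Optional[str]) -> Tuple[Optional[str], List[str]]:
    """Return the caption without its credit suffix, plus the extracted credits."""
    if caption and (match := CREDIT_PATTERN.search(caption)):
        cleaned = CREDIT_PATTERN.sub("", caption).strip() or None
        return cleaned, [match.group("credits").strip()]
    return caption, []


# The description is kept exactly as parsed from the alt attribute.
description = "Bundestag session (Photo: Jane Doe)"
caption, authors = split_credits("Bundestag session (Photo: Jane Doe)")
print(caption, authors, description)
```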

MaxDall · 2024-12-21T17:55:20Z

src/fundus/parser/utility.py

+    try:
+        width = float(source.get("width") or 0) or None
+    except ValueError:
+        width = None
+    try:
+        height = float(source.get("height") or 0) or None
+    except ValueError:
+        height = None


Do you know which case leads to a ValueError? In general, I would like to avoid try ... except blocks, especially in frequently executed code.

Yes, it happens for Taipei Times. They use '100%' as the width, which cannot be parsed as a float, and I think there is no good way to derive a proper width for the size parameter from this without inspecting the actual image. But I guess there is no need to rely on try ... except. I have seen some approaches that rely on regex or string replacement followed by isdigit().

Update: I checked Stack Overflow, where someone ran a benchmark, and I used their recommended solution as an alternative: https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-represents-a-number-float-or-int
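
For reference, the string-replacement plus isdigit() idea mentioned above could look roughly like the sketch below. This is an assumption about the general technique, not necessarily the exact solution committed in this PR. `parse_dimension` is a hypothetical helper name.

```python
from typing import Optional


def parse_dimension(value: Optional[str]) -> Optional[float]:
    """Parse a width/height attribute without try/except.

    Returns None for missing, zero, or non-numeric values such as '100%'.
    """
    if value is None:
        return None
    text = value.strip()
    # Drop at most one decimal point, then require plain ASCII digits.
    # isascii() guards against characters like '²' that pass isdigit()
    # but cannot be converted by float().
    if text.isascii() and text.replace(".", "", 1).isdigit():
        return float(text) or None  # mirror the diff: 0 is treated as unknown
    return None


print(parse_dimension("640"))
print(parse_dimension("12.5"))
print(parse_dimension("100%"))
```

Note that this check rejects negative numbers and exponent notation, which seems acceptable for pixel dimensions.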

addie9800 added 6 commits September 3, 2024 15:41

add Image datatype

673627b

parse first images

02348df

simplify images

c2ad68d

first working version

95136f0

first working version for all publishers with ld json

3fcc108

format comment

1bf8344

addie9800 marked this pull request as draft September 4, 2024 09:58

addie9800 and others added 4 commits September 11, 2024 13:09

simplify image parsing to only support standard formats

ab34220

add cover property

01f862f

save

0905b7a

Update documentation from @ 7aa4a47

00272e5

addie9800 added 2 commits September 14, 2024 17:26

remove bloat files

9c3082f

remove bloat files

754f38d

addie9800 added 9 commits September 17, 2024 18:35

identify img element by url similarity

ccb619a

data extraction from html

35ac88e

restrict to Article and Blog JSON objects

1d5d53c

rework utility methods for more flexibility

0edab02

add functionality to merge image objects

bc8f7c0

implement images for the namibian

6559f1d

documentation

f7fdf9a

implement images for at, na

4c398a1

cbc

a9b6e07

addie9800 added 4 commits September 25, 2024 14:48

finish ca

00be4de

documentation

ad9dc96

remove json extraction

de231a7

author cleaning

d1cbfce

MaxDall and others added 17 commits December 16, 2024 12:12

Merge pull request #663 from flairNLP/update-freie-presse

dd2ddfb

Update `FreiePresse`

add image extraction for ElPais

c540cbc

some improvements regarding printouts and documentation

8bb273d

fix FreeBeacon

fc40318

fix FrankfurterRundschau

1d519db

JSON reordering, clean image authors

97e7bb6

update Merkur

a2255bc

json reordering

e8aa4d8

json reordering

6a7ffcd

improve image author parsing

3343e1f

remove selected image author bloat

1baee60

update WDR

cf34efc

Merge remote-tracking branch 'origin/images' into images

25f79d6

remove author_filter

3799401

remove author_filter

4b40a3a

black

27a6766

fix pytest

3d64b09

addie9800 marked this pull request as ready for review December 16, 2024 21:55

addie9800 added 4 commits December 16, 2024 23:04

simplify credit_keywords

ab846d7

Merge branch 'master' into images

fdc74ac

update metro tests

95f5424

catch invalid width and height values

c7c5d33

MaxDall requested changes Dec 21, 2024

addie9800 added 6 commits December 21, 2024 19:53

remove author replacement in description

489a6aa

Merge remote-tracking branch 'origin/images' into images

c64ec68

remove try - except from float parsing

a123bce

Merge branch 'master' into images

12d2895

update test data

8957e9b

mypy

c5c9c98

addie9800 requested a review from MaxDall December 21, 2024 19:21
