Add Images #606

Open
wants to merge 154 commits into
base: master
Changes from 148 commits
673627b
add Image datatype
addie9800 Sep 3, 2024
02348df
parse first images
addie9800 Sep 3, 2024
c2ad68d
simplify images
addie9800 Sep 3, 2024
95136f0
first working version
addie9800 Sep 3, 2024
3fcc108
first working version for all publishers with ld json
addie9800 Sep 3, 2024
1bf8344
format comment
addie9800 Sep 3, 2024
ab34220
simplify image parsing to only support standard formats
addie9800 Sep 11, 2024
01f862f
add cover property
addie9800 Sep 11, 2024
0905b7a
save
addie9800 Sep 12, 2024
00272e5
Update documentation from @ 7aa4a47284cfbe5f02cc78a61b5173ddaae07665
addie9800 Sep 12, 2024
9c3082f
remove bloat files
addie9800 Sep 14, 2024
754f38d
remove bloat files
addie9800 Sep 14, 2024
ccb619a
identify img element by url similarity
addie9800 Sep 17, 2024
35ac88e
data extraction from html
addie9800 Sep 17, 2024
1d5d53c
restrict to Article and Blog JSON objects
addie9800 Sep 19, 2024
0edab02
rework utility methods for more flexibility
addie9800 Sep 23, 2024
bc8f7c0
add functionality to merge image objects
addie9800 Sep 23, 2024
6559f1d
implement images for the namibian
addie9800 Sep 24, 2024
f7fdf9a
documentation
addie9800 Sep 24, 2024
4c398a1
implement images for at, na
addie9800 Sep 24, 2024
a9b6e07
cbc
addie9800 Sep 24, 2024
00be4de
finish ca
addie9800 Sep 25, 2024
ad9dc96
documentation
addie9800 Sep 25, 2024
de231a7
remove json extraction
addie9800 Oct 1, 2024
d1cbfce
author cleaning
addie9800 Oct 1, 2024
32dbcc0
fix default images
addie9800 Oct 1, 2024
c81be6c
add default author selector and remove author from caption
addie9800 Oct 1, 2024
cb43637
br
addie9800 Oct 1, 2024
d5700ea
add author filter
addie9800 Oct 1, 2024
4235520
funke
addie9800 Oct 1, 2024
b745eae
publisher bis BSZ
addie9800 Oct 1, 2024
b2fccdd
ch
addie9800 Oct 1, 2024
ab7c5ca
cn
addie9800 Oct 1, 2024
27b15c7
bi_de
addie9800 Oct 1, 2024
44645f1
remove url parameter
addie9800 Oct 5, 2024
d512c47
no
addie9800 Oct 13, 2024
6fbe3dd
rewrite core logic
MaxDall Oct 15, 2024
2adfa3a
add `images` attribute to guidelines and `Article`
MaxDall Oct 15, 2024
b460866
add serialization for `Image` class
MaxDall Oct 15, 2024
b2d298d
fix image extraction for `TheNamibian`
MaxDall Oct 15, 2024
d5acc61
add `images` to unit tests
MaxDall Oct 15, 2024
f26d7c8
Update documentation from @ d512c4791f40706a86e02c0f85519e789f8f8cf2
MaxDall Oct 15, 2024
2a0967e
Merge branch 'images' into images-suggestions
MaxDall Oct 15, 2024
b5b6444
add test cases for `no` publishers
MaxDall Oct 15, 2024
977bd66
Update src/fundus/parser/utility.py
MaxDall Oct 17, 2024
83affdc
rename `parse_image_node` -> `parse_image_nodes`
MaxDall Oct 17, 2024
cfcc480
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall Oct 17, 2024
b9b1e49
Merge pull request #640 from flairNLP/images-suggestions
MaxDall Oct 17, 2024
6b6cac1
boersenzeitung
addie9800 Oct 22, 2024
f4c5397
add images to dw - focus
addie9800 Oct 23, 2024
c01f9e2
strip urls
addie9800 Oct 23, 2024
da1bae1
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 Oct 23, 2024
b54914a
FAZ
addie9800 Oct 23, 2024
8fad8d3
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 Oct 23, 2024
9fa1a32
add comment about images in v1
addie9800 Oct 28, 2024
6168c02
fr
addie9800 Oct 28, 2024
63f98ae
minor changes to images utility
addie9800 Oct 29, 2024
d34725f
add images to `FreiePresse` - `MitteldeutscheZeitung`
addie9800 Oct 29, 2024
c8db15f
Update documentation from @ 6168c02013124257d4b7d0007b6c5c00354bdbc1
addie9800 Oct 29, 2024
2752bee
Add images to `MDR` - `RuhrNachrichten`
addie9800 Oct 30, 2024
79c49f3
add images for `UK` publishers
MaxDall Nov 4, 2024
a926cf3
apply patch
MaxDall Nov 5, 2024
a8617fc
Update documentation from @ 79c49f345a2976fea052852d55b85ea8550280e8
MaxDall Nov 5, 2024
e82b325
simplify kicker image extraction
MaxDall Nov 5, 2024
779e62b
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall Nov 5, 2024
738af53
Merge pull request #653 from flairNLP/images-suggestions
MaxDall Nov 5, 2024
1308d2f
finish `at`, `ca`, `fr` and `ind`
MaxDall Nov 5, 2024
be17481
Update documentation from @ db2d4c594a3d139b2bb71634b2fe20b6cef6c8a2
MaxDall Nov 5, 2024
7fe1fdd
add `lt`, `my` and `tr`
MaxDall Nov 5, 2024
14718ff
`People`
addie9800 Nov 7, 2024
5290449
`People`
addie9800 Nov 7, 2024
8e2786f
`RuhrNachrichten` - `WDR`
addie9800 Nov 7, 2024
7449c7c
Update documentation from @ f06969f1c0ead73f7a8b2ec2bbbe73e79df42c66
addie9800 Nov 7, 2024
a36a0db
Finish `DE`
addie9800 Nov 8, 2024
8f4c727
`JungeWelt`, `Merkur` - `RheinischePost`
addie9800 Nov 8, 2024
4aa680e
`NDR`
addie9800 Nov 8, 2024
3125fcf
`APNews`, `BusinessInsider`
addie9800 Nov 8, 2024
5d7d913
remove video preview images from `Welt`
MaxDall Nov 12, 2024
2f60495
adjust image selector for `TheIndependent`
MaxDall Nov 12, 2024
9e46bd9
`TheNewYorker` - `Wired`
addie9800 Nov 12, 2024
bcc3d2d
add version parsing
MaxDall Nov 13, 2024
62a57c6
`TheNation`
addie9800 Nov 13, 2024
3f5f9ac
`FoxNews` - `TheIntercept`
addie9800 Nov 14, 2024
5d390bb
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
addie9800 Nov 14, 2024
3f66e49
fix typo
MaxDall Nov 15, 2024
246e74c
parse `max-width` and rename `min-width` -> `query-width`
MaxDall Nov 15, 2024
f8338ae
Merge branch 'images' into add-version-parsing
addie9800 Nov 18, 2024
3296f17
Update utility.py
addie9800 Nov 18, 2024
978c7c3
Update utility.py
addie9800 Nov 18, 2024
3a7fed7
resolve forwarded types
MaxDall Nov 19, 2024
5cd8439
Merge pull request #661 from flairNLP/add-version-parsing
MaxDall Nov 19, 2024
f69217f
Merge branch 'master' into images
MaxDall Nov 19, 2024
272d840
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
MaxDall Nov 19, 2024
2a51169
remove leftover test case
MaxDall Nov 19, 2024
5966036
bug fixes
MaxDall Nov 19, 2024
4e50454
fix `__lt__` for `ImageVersion`
MaxDall Nov 20, 2024
e6e4ef4
fix a bug in `src` and `srcset` parsing
MaxDall Nov 21, 2024
4b656d8
Fix `WDR`, add testcase for `ORF`
addie9800 Nov 22, 2024
c473901
Overwrite test-case for `TheIndependent`
addie9800 Nov 22, 2024
c938b33
`LeFigaro` test case overwrite
addie9800 Nov 22, 2024
f3f8f54
`MalayMail` testcase update
addie9800 Nov 22, 2024
a673c71
fix image extraction for `APNews` and `TheNation`
MaxDall Nov 22, 2024
5b4ca5d
fix a bug with sorting test jsons
MaxDall Nov 22, 2024
c8e2cc5
add image extraction for `WestAustralian`
MaxDall Nov 26, 2024
35986ac
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall Nov 26, 2024
162cd0a
Update `FreiePresse`
addie9800 Nov 26, 2024
0243a9e
remove duplicate selectors
addie9800 Nov 26, 2024
f64587b
remove test files
addie9800 Nov 26, 2024
f80bce6
beatify list comprehension
addie9800 Nov 26, 2024
79a5cbf
add test file for `FreiePresse` version `V1_1`
MaxDall Nov 29, 2024
4605e15
add immage extraction for `TagesAnzeiger`
MaxDall Nov 29, 2024
fddaee7
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall Nov 29, 2024
5eb93ab
fix v1 images parsing
addie9800 Nov 29, 2024
eda8323
overwrite json
addie9800 Nov 29, 2024
8cbe5f5
Add image extraction for `Bhaskar`
addie9800 Dec 2, 2024
645441e
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 Dec 2, 2024
b80b829
Add image extraction for `TheJapanNews`
addie9800 Dec 3, 2024
27f9261
Add image extraction for `YomiuriShimbun`
addie9800 Dec 3, 2024
113e1df
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 Dec 3, 2024
2a85504
image_extraction documentation
addie9800 Dec 15, 2024
a5e5ce1
add image example to README.md
addie9800 Dec 15, 2024
960288d
update image example in README.md
addie9800 Dec 15, 2024
dc7c14f
add images to article documentation
addie9800 Dec 15, 2024
2a31d36
update `TechCrunch`
addie9800 Dec 15, 2024
d9e83ed
remove author_filter usage
addie9800 Dec 15, 2024
bcd9f8d
remove image author bloat
addie9800 Dec 15, 2024
de31f6f
guard `Optional[str]` for mypy
MaxDall Dec 16, 2024
dd2ddfb
Merge pull request #663 from flairNLP/update-freie-presse
MaxDall Dec 16, 2024
c540cbc
add image extraction for `ElPais`
MaxDall Dec 16, 2024
8bb273d
some improvements regarding printouts and documentation
MaxDall Dec 16, 2024
fc40318
fix FreeBeacon
addie9800 Dec 16, 2024
1d519db
fix FrankfurterRundschau
addie9800 Dec 16, 2024
97e7bb6
JSON reordering, clean image authors
addie9800 Dec 16, 2024
a2255bc
update Merkur
addie9800 Dec 16, 2024
e8aa4d8
json reordering
addie9800 Dec 16, 2024
6a7ffcd
json reordering
addie9800 Dec 16, 2024
3343e1f
improve image author parsing
addie9800 Dec 16, 2024
1baee60
remove selected image author bloat
addie9800 Dec 16, 2024
cf34efc
update WDR
addie9800 Dec 16, 2024
25f79d6
Merge remote-tracking branch 'origin/images' into images
addie9800 Dec 16, 2024
3799401
remove author_filter
addie9800 Dec 16, 2024
4b40a3a
remove author_filter
addie9800 Dec 16, 2024
27a6766
black
addie9800 Dec 16, 2024
3d64b09
fix pytest
addie9800 Dec 16, 2024
ab846d7
simplify credit_keywords
addie9800 Dec 16, 2024
fdc74ac
Merge branch 'master' into images
addie9800 Dec 16, 2024
95f5424
update metro tests
addie9800 Dec 16, 2024
c7c5d33
catch invalid width and height values
addie9800 Dec 17, 2024
489a6aa
remove author replacement in description
addie9800 Dec 21, 2024
c64ec68
Merge remote-tracking branch 'origin/images' into images
addie9800 Dec 21, 2024
a123bce
remove try - except from float parsing
addie9800 Dec 21, 2024
12d2895
Merge branch 'master' into images
addie9800 Dec 21, 2024
8957e9b
update test data
addie9800 Dec 21, 2024
c5c9c98
mypy
addie9800 Dec 21, 2024
64 changes: 58 additions & 6 deletions README.md
@@ -68,24 +68,25 @@ That's already it!
If you run this code, it should print out something like this:

```console
Fundus-Article:
Fundus-Article including 1 image(s):
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- Text: "89-year-old California senator arrived hour late to Judiciary Committee hearing
to advance President Biden's stalled nominations Democrats [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
- From: The Washington Free Beacon (2023-05-11 18:41)

Fundus-Article:
Fundus-Article including 3 image(s):
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
- From: Fox News (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the number of images included in the article
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
@@ -146,6 +147,57 @@ for article in crawler.crawl(max_articles=1000000):
````


## Example 4: Crawl some images

By default, Fundus tries to parse the images included in every crawled article.
Let's crawl an article and print its images to see what is extracted.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The LA Times
crawler = Crawler(PublisherCollection.us.LATimes)

# crawl 1 article and print the images
for article in crawler.crawl(max_articles=1):
    for image in article.images:
        print(image)
```

For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:

```console
Fundus-Article Cover-Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
-Description: 'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
-Caption: 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
-Description: 'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
-Caption: 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
-Description: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
-Caption: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
-Authors: ['Gina Ferazzi / Los Angeles Times']
-Versions: [320x218, 568x387, 768x524, 1024x698, 1200x818]
```

For each image, the printout details:
- The cover image designation (if applicable).
- The URL for the highest-resolution version of the image.
- A description of the image.
- The image's caption.
- The authors of the image, i.e. the credited photographers or copyright holders.
- A list of all available versions of the image.


## Tutorials

We provide **quick tutorials** to get you started with the library:
17 changes: 17 additions & 0 deletions docs/3_the_article_class.md
@@ -4,6 +4,7 @@
* [What is an `Article`](#what-is-an-article)
* [The articles' body](#the-articles-body)
* [HTML](#html)
* [Images](#images)
* [Language detection](#language-detection)
* [Saving an Article](#saving-an-article)

@@ -117,6 +118,22 @@ Here you have access to the following information:
4. `crawl_date: datetime`: The exact timestamp the article was crawled.
5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purpose.

## Images

Some publishers provide images with their articles.
To bundle all relevant information, the article's `images` attribute returns a list of custom `Image` objects.
Each `Image` object contains the following attributes:
- `url`: the URL of the image with the largest dimensions.
- `versions`: a list of custom `ImageVersion` objects, each containing the following attributes:
  - `url`: the URL of this specific version of the image.
  - `size`: a `Dimension` object with attributes `width` and `height`.
  - `type`: the image format (e.g. `jpeg`, `png`).
- `is_cover`: a boolean indicating whether the image is the cover image of the article.
- `description`: a string describing the image (usually the alt-text).
- `caption`: the image caption as used in the article.
- `authors`: a list of strings representing the authors of the image.
- `position`: an integer describing the position of the image in the DOM-tree.
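
As a rough illustration of how these attributes fit together, the following sketch picks the smallest version of an image that is at least a requested width. The `Dimension`, `ImageVersion` and `Image` classes below are simplified stand-ins written for this example; only the attribute names are taken from the list above (the `url` attribute of `Image` is omitted for brevity), and `pick_version` is a hypothetical helper, not part of Fundus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Simplified stand-ins that mirror only the attribute names documented above.
# They are NOT the Fundus implementations.
@dataclass
class Dimension:
    width: int
    height: int


@dataclass
class ImageVersion:
    url: str
    size: Optional[Dimension] = None
    type: Optional[str] = None


@dataclass
class Image:
    versions: List[ImageVersion]
    is_cover: bool = False
    description: Optional[str] = None
    caption: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    position: int = 0


def pick_version(image: Image, min_width: int) -> Optional[ImageVersion]:
    """Hypothetical helper: return the smallest version whose width is >= min_width."""
    candidates = [v for v in image.versions if v.size is not None and v.size.width >= min_width]
    return min(candidates, key=lambda v: v.size.width, default=None)


# Made-up example data; the URLs are placeholders.
image = Image(
    versions=[
        ImageVersion("https://example.com/a-320.jpg", Dimension(320, 213), "jpeg"),
        ImageVersion("https://example.com/a-768.jpg", Dimension(768, 512), "jpeg"),
        ImageVersion("https://example.com/a-1200.jpg", Dimension(1200, 800), "jpeg"),
    ],
    is_cover=True,
    authors=["Jane Doe / Example Agency"],
)

chosen = pick_version(image, min_width=600)
print(chosen.url if chosen else "no version is wide enough")
```

With real articles, the same idea would apply to the objects in `article.images`, assuming their versions carry size information.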

## Language detection

Sometimes publishers support articles in different languages.
7 changes: 7 additions & 0 deletions docs/attribute_guidelines.md
@@ -66,4 +66,11 @@ Those attributes will be validated with unit tests when used.
<td><code>bool</code></td>
<td></td>
</tr>
<tr>
<td>images</td>
<td>A list of `Image` objects (Fundus' own data type for representing images) included in the article.
Each `Image` includes metadata such as caption, authors, and position, if available.</td>
<td><code>List[Image]</code></td>
<td><code>image_extraction</code></td>
</tr>
</table>
38 changes: 36 additions & 2 deletions docs/how_to_add_a_publisher.md
@@ -17,7 +17,8 @@
* [Working with `lxml`](#working-with-lxml)
* [CSS-Select](#css-select)
* [XPath](#xpath)
* [Extract the ArticleBody](#extract-the-articlebody)
* [Extracting the ArticleBody](#extracting-the-articlebody)
* [Extracting the Images](#extracting-the-images)
* [Checking the free_access attribute](#checking-the-free_access-attribute)
* [Finishing the Parser](#finishing-the-parser)
* [6. Generate unit tests and update tables](#6-generate-unit-tests-and-update-tables)
@@ -533,7 +534,7 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.

### Extract the ArticleBody
### Extracting the ArticleBody

In the context of Fundus, an article's body typically includes multiple paragraphs, and optionally, a summary and several subheadings.
It's important to note that article layouts can vary significantly between publishers, with the most common layouts being:
@@ -546,6 +547,39 @@ To accurately extract the body of an article, use the `extract_article_body_with
This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
For practical examples, refer to existing parser implementations to understand how everything integrates.

### Extracting the images

Fundus offers a utility function `image_extraction` to extract images from the article.
It requires only the `doc` element of the article and the `_paragraph_selector` of the parser; further optional parameters can be supplied if necessary.
The skeleton of the function looks like this:

```python
from typing import List

from fundus.parser import Image, attribute
from fundus.parser.utility import image_extraction


# defined inside your parser class
@attribute
def images(self) -> List[Image]:
    return image_extraction(
        doc=self.precomputed.doc,
        paragraph_selector=self._paragraph_selector,
    )
```

Once you have implemented this, you can try to extract your first images from the article body!
At this point you may run into an `IndexError`.
This is caused by the `upper_boundary_selector` not selecting any element.
Adjust it so that it selects an element above the cover image; all images that lie before this upper boundary are discarded.
Once you get your first images, you can further fine-tune your results:

- `image_selector`: This selector is used to filter which image elements are selected.
- `lower_boundary_selector`: By default, all images after the last paragraph are discarded. With this selector, you can define your custom boundary.
- `caption_selector`: This selector is used to extract the caption of the image and should usually be of the form `XPath("./ancestor::...")`.
- `alt_selector`: This selector selects the alt text (description) of the image.
- `author_selector`: You have two options when selecting the author of the image:
  - Preferably, the credits are within their own HTML element and can be directly addressed using an XPath selector.
  - Alternatively, a `re.Pattern` object can be passed to select the authors from the caption. In this case, a group named `credits` is saved as the author, while the entire `Match` is removed from the caption.
- `relative_urls`: If set, an attempt will be made to complete relative URLs.
- `size_pattern`: A `re.Pattern` object that can be used to extract the image sizes.
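
To make the `re.Pattern` option for `author_selector` more concrete, here is a minimal standalone sketch of the behaviour described above, using only Python's `re` module. The pattern, the sample caption, and the helper `split_credits` are assumptions made up for illustration; the actual handling inside `image_extraction` may differ in detail.

```python
import re
from typing import List, Tuple

# Example pattern (an assumption): matches a trailing "(Photo: <name>)" and
# captures the name in a group called `credits`.
author_pattern = re.compile(r"\s*\(Photo:\s*(?P<credits>[^)]+)\)\s*$")


def split_credits(caption: str, pattern: re.Pattern) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (caption without the match, list of authors)."""
    match = pattern.search(caption)
    if match is None:
        return caption, []
    # the named group becomes the author ...
    authors = [match.group("credits").strip()]
    # ... while the entire match is removed from the caption
    cleaned = (caption[: match.start()] + caption[match.end():]).strip()
    return cleaned, authors


caption = "A view of the harbour at sunrise. (Photo: Jane Doe)"
cleaned_caption, authors = split_credits(caption, author_pattern)
print(cleaned_caption)
print(authors)
```

Keeping the surrounding "(Photo: ...)" text inside the match but outside the `credits` group is what lets the whole credit line disappear from the caption while only the name is kept as author.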

### Checking the free_access attribute

18 changes: 14 additions & 4 deletions docs/supported_publishers.md
@@ -393,7 +393,9 @@
<span>www.dw.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1665,7 +1667,9 @@
<span>www.cnbc.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>
<code>key_points</code>
</td>
@@ -1716,7 +1720,9 @@
<span>occupydemocrats.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>
<code>description</code>
</td>
@@ -1735,7 +1741,9 @@
<span>www.reuters.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1865,6 +1873,7 @@
</a>
</td>
<td>
<code>images</code>
<code>topics</code>
</td>
<td>&#160;</td>
@@ -1899,6 +1908,7 @@
</a>
</td>
<td>
<code>images</code>
<code>topics</code>
</td>
<td>&#160;</td>
6 changes: 5 additions & 1 deletion scripts/generate_parser_test_files.py
@@ -143,7 +143,11 @@ def main() -> None:
test_data[type(versioned_parser).__name__] = new
else:
entry.update(new)
test_data[type(versioned_parser).__name__] = dict(sorted(entry.items()))

# sort entries
test_data[type(versioned_parser).__name__] = dict(
sorted(test_data[type(versioned_parser).__name__].items())
)

test_data_file.write(test_data)
bar.update()
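
As far as it can be read from the hunk above (its indentation was lost in this view), the intent of the change seems to be that entries are sorted regardless of whether they were newly created or merged into an existing entry. A minimal standalone sketch of that pattern follows; `merge_and_sort`, the parser name, and the keys are hypothetical and not taken from the script.

```python
from typing import Any, Dict


def merge_and_sort(test_data: Dict[str, Dict[str, Any]], name: str, new: Dict[str, Any]) -> None:
    """Store `new` under `name`, merging into an existing entry, then sort the keys."""
    entry = test_data.get(name)
    if entry is None:
        test_data[name] = new
    else:
        entry.update(new)
    # sort in both cases, so a freshly created entry is ordered as well
    test_data[name] = dict(sorted(test_data[name].items()))


data: Dict[str, Dict[str, Any]] = {}
merge_and_sort(data, "ExampleParser", {"V2": {"title": "b"}, "V1": {"title": "a"}})
merge_and_sort(data, "ExampleParser", {"V1_1": {"title": "c"}})
print(list(data["ExampleParser"]))
```

Because Python dictionaries preserve insertion order, rebuilding the dictionary from its sorted items is enough to get a stable key order when the data is written out.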
4 changes: 2 additions & 2 deletions src/fundus/parser/__init__.py
@@ -1,4 +1,4 @@
from .base_parser import BaseParser, ParserProxy, attribute, function
from .data import ArticleBody
from .data import ArticleBody, Image

__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody"]
__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody", "Image"]