[html] display title/aria-label/caption/summary #2146

midichef · 2023-11-30T07:46:19Z

This PR changes how HTML tables are parsed. It inspects more fields that may occasionally be used to describe a table. The guidance I used in choosing these fields was this GitHub issue, which lists these:

<caption>, summary attribute, or title attribute) or ARIA (i.e., aria-label

The summary attribute has been obsolete for well over 10 years, according to a 2013 answer on Stack Overflow (edited to add: but it is currently used in new documents at the SEC's EDGAR database). But people may use VisiData on HTML documents older than that, so I think it's worth including.

A note on parsing caption elements: this document from w3 shows that <caption> elements can contain other HTML elements, which is why I use normalize-space() to flatten these elements into one string.
Here is a table to test out the parsing: table-described.html.txt
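
A minimal sketch of the flattening described above, assuming lxml is the parser (the thread does not show the loader code, so the markup and variable names here are purely illustrative). It reads the three attributes directly and uses normalize-space() to collapse a caption with nested elements into one string:

```python
from lxml import html

# Hypothetical test markup: a caption that contains nested inline elements.
doc = html.fromstring("""
<div>
<table summary="Revenue by quarter" title="Revenue" aria-label="Revenue table">
  <caption>
    Quarterly <b>revenue</b>
    (in <i>USD</i>)
  </caption>
  <tr><td>Q1</td><td>10</td></tr>
</table>
</div>
""")

table = doc.xpath("//table")[0]

# Attributes are plain lookups; .get() returns None when one is absent.
summary = table.get("summary")
title = table.get("title")
aria_label = table.get("aria-label")

# normalize-space() takes the string value of <caption> (all descendant
# text), trims it, and collapses each run of whitespace to a single space.
caption = table.xpath("normalize-space(./caption)")

print(repr(caption))
```

The string value of an element includes the text of every descendant, which is why the nested tags need no special handling.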

Thoughts?

Also, I don't know much about xpath, so the xpath expression I used to flatten <caption> could use more eyes on it.

saulpw · 2023-11-30T19:11:19Z

Hey @midichef, thanks for putting some effort into this. I think it's useful to provide more avenues for describing tables; when I open an .html file, I typically just look to see which table has the most rows and/or columns, which is not ideal. But let me ask: have you yourself found any of these fields useful in determining which tables have the content you're looking for? If so, great, I'm happy to add some more columns to this sheet. (We might also consider making some of the rare/obsolete ones hidden by default, so they don't clutter the screen but are still available anytime with gv.)

One small point, I would add a cache=True for the caption column, since xpath can be expensive. ChatGPT also suggests that it could be combined into a single xpath: if (./caption) then normalize-space(./caption) else (), which is likely faster than running two simpler xpaths. But the cache should be sufficient.
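
On the single-XPath idea, a hedged sketch assuming lxml: lxml implements XPath 1.0, and the if/then/else form is XPath 2.0 syntax, so it is not expected to evaluate there. In XPath 1.0 the plain normalize-space(./caption) already returns an empty string when there is no caption, so one expression covers both cases. The markup and names below are illustrative only.

```python
from lxml import etree, html

# Hypothetical markup: one table with a caption, one without.
doc = html.fromstring("""
<div>
  <table id="with-caption"><caption> Net   sales </caption><tr><td>1</td></tr></table>
  <table id="no-caption"><tr><td>2</td></tr></table>
</div>
""")
with_caption, no_caption = doc.xpath("//table")

CAPTION_XPATH = "normalize-space(./caption)"

# An empty node-set converts to "", so no existence check is needed.
caption_present = with_caption.xpath(CAPTION_XPATH)
caption_missing = no_caption.xpath(CAPTION_XPATH)

# The XPath 2.0 conditional is rejected by an XPath 1.0 engine.
try:
    no_caption.xpath("if (./caption) then normalize-space(./caption) else ()")
    conditional_supported = True
except etree.XPathError:
    conditional_supported = False

print(repr(caption_present), repr(caption_missing), conditional_supported)
```

Caching the column is independent of this choice and still avoids re-running the expression on every redraw.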

midichef · 2023-12-01T22:16:24Z

I have run across one of these, the summary attribute. It's in active use at the SEC's EDGAR database of financial filings, including high-profile documents created this year.

But that's the only one of the new columns that I've seen in a real document. It's not great to clutter up the screen with columns that will most likely be empty. What do you think about a column option that makes it start out invisible if all its entries are null?
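
The "start out invisible if all entries are null" idea could look roughly like the sketch below. This is plain Python, not VisiData's column API; is_null, columns_to_hide, and the sample rows are hypothetical names and data made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Set


def is_null(value: Any) -> bool:
    """Treat None and empty or whitespace-only strings as null."""
    return value is None or (isinstance(value, str) and not value.strip())


def columns_to_hide(rows: List[Dict[str, Any]], optional_columns: Iterable[str]) -> Set[str]:
    """Return the optional columns whose value is null in every row.

    With no rows at all, every optional column is reported as hideable.
    """
    return {
        name
        for name in optional_columns
        if all(is_null(row.get(name)) for row in rows)
    }


# Hypothetical per-table metadata: only `summary` is ever filled in.
tables = [
    {"caption": "", "summary": "Balance sheet", "title": None, "aria_label": None},
    {"caption": "", "summary": "Cash flows", "title": None, "aria_label": None},
]

hidden = columns_to_hide(tables, ["caption", "summary", "title", "aria_label"])
print(sorted(hidden))
```

The check needs every row to be loaded first, so it would have to run after the sheet finishes loading rather than when the columns are declared.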

Another way to add information would be to show each table's first row.

anjakefala · 2023-12-02T06:21:38Z

(I updated the branch with develop so we could run the CI on the fixed tests.)

frosencrantz · 2023-12-03T18:00:41Z

I really like this, since it is closer to the official guidelines for labeling HTML tables. In practice, however, it feels like people use HTML headers (H1...H6), and it would be great if VisiData could surface the closest header just before a table.

For example, https://peps.python.org/ is a page where I wish VisiData would label tables with the headers, since the tables all look the same and the headers provide the context.

saulpw · 2023-12-04T01:39:55Z

> It would be great if VisiData could surface the closest header just before a table.

I agree, I've wanted this too. What's the xpath to do this?

midichef · 2023-12-07T11:24:34Z

Here's an attempt at the XPath. It finds the closest sibling heading that comes before a table, i.e. the tags h1 through h6.

I've only tested it on a couple of Wikipedia pages. It seems to work okay there.
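
The expression itself is not reproduced in this thread, so the following is only one plausible way to write it, assuming lxml and XPath 1.0. It walks the preceding-sibling axis, keeps h1 through h6 elements, and takes the first match; on a reverse axis, position 1 is the sibling nearest to the table. The markup is a made-up example.

```python
from lxml import html

# Hypothetical page: tables separated by sibling headings.
doc = html.fromstring("""
<div>
  <table id="t0"><tr><td>no heading before me</td></tr></table>
  <h2>Index by Category</h2>
  <p>Some introductory text.</p>
  <table id="t1"><tr><td>a</td></tr></table>
  <h2>Numerical <em>Index</em></h2>
  <table id="t2"><tr><td>b</td></tr></table>
</div>
""")

# [1] on the reverse axis selects the closest preceding heading;
# normalize-space() flattens any nested markup inside it.
HEADING_XPATH = (
    "normalize-space("
    "preceding-sibling::*"
    "[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]"
    "[1])"
)

headings = {t.get("id"): t.xpath(HEADING_XPATH) for t in doc.xpath("//table")}
print(headings)
```

A heading that is not a sibling of the table (for example, one level up, outside a wrapping div) is not found by this form; that limitation follows from using the sibling axis.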

saulpw

If this works, we'll take it! Let's move it out of draft.

frosencrantz · 2023-12-08T22:35:10Z

@midichef thank you! This is a great addition. It helps me make sense of which VisiData sheets are connected to the tables in an html file. It is possible to understand the tables in https://peps.python.org/ without looking at a browser.

Now I have to figure out a good way to add those labels as a value in a column so all the sheets can be stacked together, so I know the source of each row.
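
One way to think about that stacking step, as a plain-Python sketch rather than anything VisiData-specific: tag each row with the label of the table it came from before concatenating. stack_with_source, the "source" column name, and the sample data are all hypothetical.

```python
from typing import Any, Dict, List


def stack_with_source(sheets: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Concatenate rows from several tables, adding a `source` field
    that records the label (e.g. nearest heading) of the originating table."""
    stacked = []
    for label, rows in sheets.items():
        for row in rows:
            # Put the label first so it shows up as the leftmost column.
            stacked.append({"source": label, **row})
    return stacked


# Hypothetical tables keyed by the heading that precedes each one.
sheets = {
    "First heading": [{"number": 1, "title": "alpha"}, {"number": 2, "title": "beta"}],
    "Second heading": [{"number": 3, "title": "gamma"}],
}

stacked = stack_with_source(sheets)
for row in stacked:
    print(row)
```

If a table already has a column named source, the row's own value wins in this sketch, so a less common column name may be safer.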

midichef and others added 2 commits November 29, 2023 23:44

[html] display title/aria-label/caption/summary

77cc3fa

[tests] update html test with new columns

891f7aa

[html] cache the caption column

6ab0def

Merge branch 'develop' into html_captions

38e2e40

midichef force-pushed the html_captions branch from 221c70f to 29d91c8 Compare December 7, 2023 11:13

[html] show table's sibling h1-h6 tags

243db37

midichef force-pushed the html_captions branch from 29d91c8 to 243db37 Compare December 7, 2023 11:21

saulpw approved these changes Dec 7, 2023

anjakefala marked this pull request as ready for review December 7, 2023 20:43

anjakefala merged commit 6bf20c8 into saulpw:develop Dec 7, 2023
13 checks passed

midichef deleted the html_captions branch December 22, 2023 01:14
