[html] display title/aria-label/caption/summary #2146

Merged
merged 5 commits into from
Dec 7, 2023

midichef
Contributor

@midichef midichef commented Nov 30, 2023

This PR changes how HTML tables are parsed: it now inspects additional fields that are sometimes used to describe a table. I chose the fields based on this GitHub issue, which lists these:

<caption>, summary attribute, or title attribute) or ARIA (i.e., aria-label

The summary attribute has been obsolete for well over 10 years, according to a 2013 answer on Stack Overflow (edited to add: but it is currently used in new documents at the SEC's EDGAR database). But people may use visidata on HTML documents that may be older than that, so I think it's worth including.

A note on parsing caption elements: this document from w3 shows that <caption> elements can contain other HTML elements, which is why I use normalize-space() to flatten these elements into one string.
Here is a table to test out the parsing: table-described.html.txt
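
As a rough illustration of the fields described above, here is a minimal sketch that collects the caption, summary, title, and aria-label of each table. It is not the PR's code: it uses only the standard library on well-formed sample markup, and it emulates what normalize-space() does to a caption in plain Python. The names SAMPLE, flatten_text, and describe_tables are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the PR's table-described.html.txt).
SAMPLE = """
<html><body>
  <table title="Quarterly totals" aria-label="Totals by quarter" summary="Revenue per quarter">
    <caption>Revenue <b>by
       quarter</b> (USD)</caption>
    <tr><th>Q</th><th>Revenue</th></tr>
    <tr><td>Q1</td><td>10</td></tr>
  </table>
  <table>
    <tr><td>no description</td></tr>
  </table>
</body></html>
"""

def flatten_text(element):
    """Join all nested text and collapse whitespace runs, similar to normalize-space()."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())

def describe_tables(markup):
    """Return one dict of descriptive fields per <table> in the markup."""
    root = ET.fromstring(markup)
    described = []
    for table in root.iter("table"):
        described.append({
            "caption": flatten_text(table.find("caption")),  # direct child only
            "summary": table.get("summary", ""),
            "title": table.get("title", ""),
            "aria_label": table.get("aria-label", ""),
        })
    return described

tables = describe_tables(SAMPLE)
```

The nested <b> element inside the caption is why flattening is needed: reading only the caption's own text would stop at "Revenue ".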

Thoughts?

Also, I don't know much about xpath, so the xpath expression I used to flatten <caption> could use more eyes on it.

@saulpw
Owner

saulpw commented Nov 30, 2023

Hey @midichef, thanks for putting some effort into this. I think it's useful to provide more avenues for describing tables; when I open .html, I typically just look to see which table has the most rows and/or columns, which is not ideal. But let me ask, have you yourself found any of these fields to be useful in determining which tables have the content you're looking for? If so, great, I'm happy to add some more columns to this sheet. (We might also consider making some of the rare/obsolete ones hidden by default, so they don't clutter the screen, but are available anytime with gv).

One small point, I would add a cache=True for the caption column, since xpath can be expensive. ChatGPT also suggests that it could be combined into a single xpath: if (./caption) then normalize-space(./caption) else (), which is likely faster than running two simpler xpaths. But the cache should be sufficient.

@midichef
Contributor Author

midichef commented Dec 1, 2023

I have run across one of these, the summary attribute. It's in active use at the SEC's EDGAR database of financial filings, including high-profile documents created this year.

But that's the only one of the new columns that I've seen in a real document. It's not great to clutter up the screen with columns that will most likely be empty. What do you think about a column option that makes it start out invisible if all its entries are null?
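
The "invisible if all its entries are null" idea could look roughly like the sketch below. This is only an illustration of the rule, not a VisiData option or API: the function initially_hidden and the sample rows are hypothetical, and both None and the empty string are treated as null here.

```python
def initially_hidden(rows, column_names):
    """Return the names of columns whose value is empty in every row."""
    hidden = []
    for name in column_names:
        # Assumption: None and "" both count as "null" for this purpose.
        if all(row.get(name) in (None, "") for row in rows):
            hidden.append(name)
    return hidden

# Hypothetical rows, one per HTML table found on a page.
rows = [
    {"caption": "", "summary": "Balance sheet", "title": None},
    {"caption": "", "summary": "", "title": None},
]
hidden = initially_hidden(rows, ["caption", "summary", "title"])
```

With these rows, summary stays visible because one table has a value for it, while caption and title would start out hidden.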

Another way to add information would be to show each table's first row.

@anjakefala
Collaborator

(I updated the branch with develop so we could run the CI on the fixed tests.)

@frosencrantz
Contributor

I really like this, since it is closer to the official guidelines on how to label HTML tables. In practice, however, it feels like people use HTML headings (h1 through h6), and it would be great if VisiData could surface the closest heading just before a table.

For example, https://peps.python.org/ is a page where I wish VisiData would label the tables with their headings, since the tables all look the same and the headings provide the context.

@saulpw
Owner

saulpw commented Dec 4, 2023

It would be great if VisiData could surface the closest header just before a table.

I agree, I've wanted this too. What's the xpath to do this?

@midichef
Contributor Author

midichef commented Dec 7, 2023

Here's an attempt at the XPath. It finds the closest sibling heading (one of the tags h1 through h6) that comes before a table.

I've only tested it on a couple of Wikipedia pages. It seems to work okay there.
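
For readers who want to see the "closest preceding sibling heading" logic spelled out, here is a minimal sketch in plain Python. It is not the expression from this PR: it walks siblings with the standard library instead of evaluating XPath, and PAGE, HEADING_TAGS, and preceding_heading are hypothetical names. In XPath 1.0, one possible way to express the same idea is preceding-sibling::*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6][1], though the merged code may differ.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <h1>Index</h1>
  <h2>Open items</h2>
  <p>Still under discussion.</p>
  <table><tr><td>first</td></tr></table>
  <h2>Closed items</h2>
  <table><tr><td>second</td></tr></table>
  <div><table><tr><td>nested, no sibling heading</td></tr></table></div>
</body></html>
"""

def preceding_heading(table, parent_of):
    """Return the text of the closest heading sibling before `table`, or ""."""
    parent = parent_of.get(table)
    if parent is None:
        return ""
    nearest = ""
    for sibling in parent:
        if sibling is table:
            break  # only siblings that come before the table count
        if sibling.tag in HEADING_TAGS:
            # Later headings overwrite earlier ones, so the last one seen is the closest.
            nearest = " ".join("".join(sibling.itertext()).split())
    return nearest

root = ET.fromstring(PAGE)
# ElementTree has no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}
headings = [preceding_heading(t, parent_of) for t in root.iter("table")]
```

Note the limitation this shares with any sibling-only approach: a table wrapped in a div gets no label, because the heading is a sibling of the wrapper rather than of the table itself.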

Owner

@saulpw saulpw left a comment

If this works, we'll take it! Let's move it out of draft.

@anjakefala anjakefala marked this pull request as ready for review December 7, 2023 20:43
@anjakefala anjakefala merged commit 6bf20c8 into saulpw:develop Dec 7, 2023
13 checks passed
@frosencrantz
Contributor

@midichef thank you! This is a great addition. It helps me work out which VisiData sheet corresponds to which table in an HTML file. It is now possible to understand the tables in https://peps.python.org/ without looking at a browser.

Now I have to figure out a good way to add those labels as a value in a column so all the sheets can be stacked together, so I know the source of each row.
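
One way to think about the stacking problem is sketched below, outside of VisiData: tag every row with the label of the table it came from before concatenating. The function stack_with_source, the "source" column name, and the sample data are all hypothetical, and this does not use any VisiData command or API.

```python
def stack_with_source(labelled_tables):
    """Flatten {label: rows} into one list, tagging each row with its label."""
    stacked = []
    for label, rows in labelled_tables.items():
        for row in rows:
            # Assumption: rows have no "source" key of their own;
            # if one did, the row's value would win over the label.
            stacked.append({"source": label, **row})
    return stacked

# Hypothetical tables keyed by the heading found before each one.
labelled_tables = {
    "Open items": [{"id": 1, "name": "alpha"}],
    "Closed items": [{"id": 2, "name": "beta"}, {"id": 3, "name": "gamma"}],
}
stacked = stack_with_source(labelled_tables)
```

After stacking, every row still records which table it came from, so the combined data can be filtered or grouped by source.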

@midichef midichef deleted the html_captions branch December 22, 2023 01:14
4 participants