[html] display title/aria-label/caption/summary #2146
Hey @midichef, thanks for putting some effort into this. I think it's useful to provide more avenues for describing tables; when I open .html, I typically just look to see which table has the most rows and/or columns, which is not ideal. But let me ask, have you yourself found any of these fields to be useful in determining which tables have the content you're looking for? If so, great, I'm happy to add some more columns to this sheet. (We might also consider making some of the rare/obsolete ones hidden by default, so they don't clutter the screen, but are available anytime with One small point, I would add a
I have run across one of these, the summary attribute. It's in active use at the SEC's EDGAR database of financial filings, including high-profile documents created this year. But that's the only one of the new columns that I've seen in a real document. It's not great to clutter up the screen with columns that will most likely be empty. What do you think about a column option that makes it start out invisible if all its entries are null? Another way to add information would be to show each table's first row.
(I updated the branch with develop so we could run the CI on the fixed tests.)
I really like this, since it is closer to the official guidelines on how to label HTML tables. In practice, however, it feels like people use HTML headings (h1 through h6) instead, and it would be great if VisiData could surface the closest heading just before a table. For example, https://peps.python.org/ is a page where I wish VisiData would label tables with the headings: the tables all look the same, and the headings provide the context.
I agree, I've wanted this too. What's the XPath to do this?
Here's an attempt at the XPath. For each table, it finds the closest sibling heading that comes before it, i.e. one of the tags h1 through h6. I've only tested it on a couple of Wikipedia pages, where it seems to work okay.
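For anyone following along, here is a rough standard-library sketch of the same idea. It is not the XPath expression from this PR: it walks each parent's children in order and remembers the last h1 to h6 seen before a table, which approximates a "nearest preceding sibling heading" lookup. The function name `nearest_heading_per_table` and the sample markup are made up for illustration, and `xml.etree.ElementTree` only accepts well-formed markup, so real-world HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def nearest_heading_per_table(root):
    """Return (table_element, heading_text_or_None) pairs.

    Hypothetical helper: for every <table>, report the text of the
    closest heading that is an earlier sibling under the same parent.
    """
    results = []
    for parent in root.iter():
        last_heading = None  # reset for each group of siblings
        for child in parent:
            if child.tag in HEADINGS:
                # Flatten nested markup inside the heading to one string.
                last_heading = " ".join("".join(child.itertext()).split())
            elif child.tag == "table":
                results.append((child, last_heading))
    return results


# Made-up, well-formed sample document (an assumption for this sketch).
doc = ET.fromstring(
    "<body>"
    "<table><tr><td>0</td></tr></table>"
    "<h2>Open PEPs</h2><p>intro</p><table><tr><td>1</td></tr></table>"
    "<h2>Final <em>PEPs</em></h2><table><tr><td>2</td></tr></table>"
    "</body>"
)
labels = [heading for _table, heading in nearest_heading_per_table(doc)]
print(labels)
```

A table with no earlier sibling heading gets `None`, so a caller can fall back to another label such as the caption.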
If this works, we'll take it! Let's move it out of draft.
@midichef thank you! This is a great addition. It helps me make sense of which VisiData sheets are connected to which tables in an HTML file. It is now possible to understand the tables in https://peps.python.org/ without looking at a browser. Now I have to figure out a good way to add those labels as a value in a column so all the sheets can be stacked together and I still know the source of each row.
This PR changes how HTML tables are parsed. It inspects more fields that may occasionally be used to describe a table. The guidance I used in choosing these fields was this GitHub issue, which lists these:
The summary attribute has been obsolete for well over 10 years, according to a 2013 answer on Stack Overflow (edited to add: but it is currently used in new documents at the SEC's EDGAR database). But people may use VisiData on HTML documents that are older than that, so I think it's worth including.
A note on parsing caption elements: this document from w3 shows that <caption> elements can contain other HTML elements, which is why I use normalize-space() to flatten these elements into one string.
Here is a table to test out the parsing: table-described.html.txt
Thoughts?
Also, I don't know much about XPath, so the XPath expression I used to flatten <caption> could use more eyes on it.
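To make the flattening easier to review, here is a small standard-library sketch of what it is meant to achieve. It does not use the XPath from this PR; instead it imitates normalize-space() in plain Python by joining all text nodes under <caption> and collapsing runs of whitespace. The helper name `flatten_text` and the sample table are assumptions for illustration. One known difference: `str.split()` treats any Unicode whitespace as a separator, while XPath's normalize-space() only handles space, tab, carriage return, and newline.

```python
import xml.etree.ElementTree as ET


def flatten_text(element):
    """Join all nested text under `element` and collapse whitespace.

    Hypothetical helper that mimics the effect of XPath's
    normalize-space() applied to the element's string value.
    """
    joined = "".join(element.itertext())
    return " ".join(joined.split())


# Made-up sample table with nested markup inside <caption>.
table = ET.fromstring(
    "<table summary='Quarterly totals'>"
    "<caption>\n  Revenue <b>by\n quarter</b>\n  <i>(unaudited)</i>\n</caption>"
    "<tr><td>1</td></tr>"
    "</table>"
)

caption = table.find("caption")
caption_text = flatten_text(caption) if caption is not None else None
summary_text = table.get("summary")  # plain attribute, no flattening needed

print(caption_text)
print(summary_text)
```

The caption comes out as a single line regardless of how the nested elements and line breaks were laid out in the source.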