`_check_toc_parents` should consider only the descendants of root_doc and n… #13038

khanxmetu · 2024-10-18T15:27:57Z

…ot the whole toctree_includes graph

Purpose

Skips the document referenced in multiple toctree warning/log when the said toctrees are not the descendants of root_doc toctree.
The warning/log output now aligns with the current implementation of global_toctree_for_doc and _get_toctree_ancestors (i.e root_doc being hardcoded as the root ancestor for every document).

Relates

Removes the inconsistent logged output: "document is referenced in multiple toctrees..." in 12999#discussioncomment-10909269
Solution based on my comment 12999#discussioncomment-10921512

…ot the whole toctree_includes graph

sphinx/environment/__init__.py

jayaddison · 2024-10-19T10:17:37Z

sphinx/environment/__init__.py

    toc_parents: dict[str, list[str]] = {}
-    for parent, children in toctree_includes.items():
-        for child in children:
-            toc_parents.setdefault(child, []).append(parent)

+    def _find_toc_parents_dfs(node: str) -> None:
+        for child in toctree_includes.get(node, []):
+            toc_parents.setdefault(child, []).append(node)
+            is_child_already_visited = len(toc_parents[child]) > 1
+            if not is_child_already_visited:
+                _find_toc_parents_dfs(child)
+
+    _find_toc_parents_dfs(root_doc)
    for doc, parents in sorted(toc_parents.items()):


A personal opinion: the code might be easier to understand if the iteration source of the subsequent for loop -- toc_parents -- is assigned-to from the result of the _find_toc_parents_dfs function.

Explaining why: to me, function calls that have side-effects that affect outer-scoped variables are slightly hard to follow.

I think that another potential benefit could be that it'd be easier to write test coverage for the helper function (although I admit that it's a small one, and that perhaps the enclosing function is a better candidate for testing here).

A personal opinion: the code might be easier to understand if the iteration source of the subsequent for loop -- toc_parents -- is assigned-to from the result of the _find_toc_parents_dfs function.

Explaining why: to me, function calls that have side-effects that affect outer-scoped variables are slightly hard to follow.

I think that another potential benefit could be that it'd be easier to write test coverage for the helper function (although I admit that it's a small one, and that perhaps the enclosing function is a better candidate for testing here).

I didn't worry too much about side-effects as it being more simplistic this way.

Here is the DFS without side-effect:

def _find_toc_parents_dfs(node: str, toc_parents: dict[str, list[str]] = {}) -> dict[str, list[str]]: for child in toctree_includes.get(node, []): already_visited = child in toc_parents toc_parents.setdefault(child, []).append(node) if already_visited: continue _find_toc_parents_dfs(child, toc_parents) return toc_parents

Personally I found it slightly more complicated than needed because of toc_parents being propogated down the tree as a parameter but also being returned. Note that return toc_parents will only be used by the external caller of the helper function and not elsewhere.
Anyways I'm fine with this implementation too if you think so.

Edit: There exists another DFS implementation, without taking toc_parents dict as a parameter but only relying on return values, however I believe that would require combining the returned dicts from each subtree at each node which would be expensive.

jayaddison · 2024-10-19T10:19:21Z

sphinx/environment/__init__.py

    for doc, parents in sorted(toc_parents.items()):
        if len(parents) > 1:
            logger.info(
                __(
                    'document is referenced in multiple toctrees: %s, selecting: %s <- %s'
                ),
-                parents,
+                sorted(parents),


Was the sorting added here for debugging/investigation purposes? (and should we include it with these changes?)

The helper function uses preorder traversal which does not guarantee sorted parents as was before. Sorting is kept for consistency reasons (independent of the helper function traversal order) in the logged output, this way it is also easier to write the corresponding tests instead of dry running the traversal order and depending on the helper functions implementation. Further, it also makes it easier for the user to spot the pattern that the lexicographically greatest parent is being selected.

My initial sense here is that I'm not too keen on the practice of modifying application code in order to make test expectations easier to write.

I do understand that it helps in this case, but I think that unit test coverage of different tree/graph structures would be more robust over time.

(apologies for taking a while to add further review commentary)

My initial sense here is that I'm not too keen on the practice of modifying application code

I do not understand how the application code is modified since the function _check_toc_parents() does not produce any side-effects other than the output. In fact I think the idea of applying sorted is quite the opposite.
I'd like to clarify again that the mentioned benefits/reasons in my previous comment of having sorted parents aren't enforced by this PR, instead the parents were already implicitly sorted previously due to the node-wise traversal and the sorted nature of values in toctree_includes. Since the traversal order is now changed to inorder which doesn't inherently guarantee parents being collected in sorted order, sorted function is now applied post-traversal, to keep it consistent with the previous behavior. You could argue that guaranteeing the order of parents should come from the nested helper function itself and the outer body of check_toc_parents() shall not be modified, however given the recursive nature of the helper function, I feel like the current approach is much simpler.

I do understand that it helps in this case, but I think that unit test coverage of different tree/graph structures would be more robust over time.

I don’t have a strong opinion on writing unittests for helper functions, in my opinion we should be testing based on functionality and not the implementation of a function which in this current case would mean that the tests should only care about consistent warning/logging and not about whatever method of traversal is used internally to achieve so. For example the helper method which used node-wise traversal previously, and now inorder traversal in this PR, should ideally NOT break the existing tests and hence having a determined order of parents regardless of the traversal algorithm helps achieve it.

I’d like to know more about what you think of this. If you still believe that we shouldn’t guarantee sorted order of parents anymore, I’d happily remove it.

My initial sense here is that I'm not too keen on the practice of modifying application code

I do not understand how the application code is modified since the function _check_toc_parents() does not produce any side-effects other than the output.

Logging is sometime more than a side-effect, and it can either express or hide internal application state. Sometimes that's risky, and sometimes it's useful. In the context of trying to improve the consistency of build output, I think it's likely to be useful to log the iteration order of the parent document names without sorting applied to them.

As I understand it, commit 8351936 does sort the keys of the toctree includes -- but it doesn't sort the values (the parent document list).

Thanks for the feedback, I'll remove sorting from here then.

As I understand it, commit 8351936 does sort the keys of the toctree includes -- but it doesn't sort the values (the parent document list).

Note that the toctree_includes is a parent -> children mapping. Essentially by sorting the keys (which are parent names) and later iterating over the dictionary, it is guaranteed that parents are processed in lexicographical order. Therefore when the toctree_includes is reversed to obtain toc_parents i.e child -> parents mapping in check_toc_parents, the parents list for each child key would already be sorted.

Note that the toctree_includes is a parent -> children mapping. Essentially by sorting the keys (which are parent names) and later iterating over the dictionary, it is guaranteed that parents are processed in lexicographical order. Therefore when the toctree_includes is reversed to obtain toc_parents i.e child -> parents mapping in check_toc_parents, the parents list for each child key would already be sorted.

Ok, this makes sense, thank you. I didn't understand this until I began attempting to write some unit tests locally. What I'd started on was a test to provide three or four permutations of the same graph -- sometimes with child value lists randomized, sometimes with parent keys randomized -- with the intent of asserting on consistent inverted-graph structure as output.

(in particular, refreshing my memory about the lexicographic sorting took me some time -- and now I understand from that plus your message here, that the check code relies on the parent-key order, and that without a larger refactoring, it wouldn't be valid to write test cases that randomize that ordering, because the application code itself shouldn't allow that to occur)

that the check code relies on the parent-key order, and that without a larger refactoring, it wouldn't be valid to write test cases that randomize that ordering, because the application code itself shouldn't allow that to occur

While it is true that the order of nodes visited during the reversal of the graph edges ~~rely~~ relied on the parent-key order from toctree_includes which is ensured to be deterministic from my previous related PR. However I cannot understand the motivation behind testing the randomized ordering, because in formal definitions a graph consists of set of edge pairs, this unordering implies that any permutation of parents lists in toc_parents values would correspond to the same graph by definition although different internal representations.

Was your idea about testing whether the toc_parents dictionary order obtained from the dfs is equivalent for any randomized order of key values of a specific graph of toctree_includes?

Edit: The dfs algorithm presented here relies on children-values ordering of toctree_includes instead which I don't think is always sorted.

Co-authored-by: James Addison <55152140+jayaddison@users.noreply.github.com>

jayaddison · 2024-11-01T00:13:02Z

@khanxmetu a few things about this are still bothering me:

I continue to believe we should be unit testing the results of tree/graph processing, not checking for presence/absence of log messages.
Following on from my suggestion to remove one sorted(...) call -- I'm beginning to wonder whether this one is equally dubious.
The _check_toc_parents method seems to be a duplicate implementation of the toc-parent resolution elsewhere in the code -- so they risk diverging in future (in fact, it makes it difficult to confirm that they haven't diverged already).

And perhaps a larger design question/concern about the bugreport:

This behavior seems like a bug since for documents included in the orphaned tree but not in master_doc, the sidebar navigation tree on that html document page still shows the toctree for master_doc and is unable to expand to the current document.

How to Reproduce

Toctree Graph
  index      orphan_root         
      \       /     \
       \    /        \ 
      alpha        bravo
       /              /
    delta       charlie

Do you mean that on the page for bravo, we display a toctree that presents content related to the index / alpha / etc side of the tree? If so, I think we should confirm whether that's what we want. I would personally expect that we'd present a toctree grounded in the part of the graph that we're currently in.

jayaddison · 2024-11-01T00:29:15Z

And perhaps a larger design question/concern about the bugreport:

Sorry, I didn't phrase this well - my worry there isn't about your bugreport - I'm grateful you're helping to investigate this. My worry is that the application itself may already be doing something we don't want -- and if so, adding more code that reinforces that unexpected behaviour could make it more difficult to repair properly in future. So my concern is to try to check that we're choosing the correct path before we make additional changes.

khanxmetu · 2024-11-01T10:13:03Z

I continue to believe we should be unit testing the results of tree/graph processing, not checking for presence/absence of log messages.

Initially I used test_circular_toctree() as a reference to write tests for this as the warning/checks was of somewhat similar nature. Since you asked, I'll make sure to change to a unittest testing the traversal only instead.

Following on from my suggestion to remove one sorted(...) call -- I'm beginning to wonder whether this one is equally dubious.

The idea was similar, yes. Since many parts of the codebase would sort the documents before processing or logging, I added this here too. I'll remove it due to the reasons you mentioned.

The _check_toc_parents method seems to be a duplicate implementation of the toc-parent resolution elsewhere in the code -- so they risk diverging in future (in fact, it makes it difficult to confirm that they haven't diverged already).

Could you elaborate on this please?

khanxmetu · 2024-11-01T11:47:57Z

Do you mean that on the page for bravo, we display a toctree that presents content related to the index / alpha / etc side of the tree? If so, I think we should confirm whether that's what we want. I would personally expect that we'd present a toctree grounded in the part of the graph that we're currently in.

In a perfect scenario, I believe that on bravo page the global toctree resolution in the side bar should begin from orphan_root and expanded to bravo, since it's more intuitive to believe that the tree would rooted at orphan_root and has no connection with index.

However this presents more complications.

Firstly, due to the hardcoded master_doc (index) in the existing codebase which would require some parts/functions to be rewritten to remove that dependency.

Secondly, if master_doc isn't set to be fixed as the root ancestor for any toctree, this PR would not have been possible as it would be virtually impossible to disregard the multiple toctree references from nodes that are not descendants of master_doc.

In conclusion, the mentioned bug isn't intended to be resolved by this PR, rather I documented it for the sake of record/documenting the issue.

jayaddison · 2024-11-05T01:42:08Z

Thank you again @khanxmetu for your responses. I think I may need to take a break soon from working on Sphinx -- I will try to review this again, but if I do not, then my apologies in advance and best of luck progressing it to merge readiness.

khanxmetu changed the title ~~_check_toc_parents considers only the descendants of root_doc and n…~~ _check_toc_parents should consider only the descendants of root_doc and n… Oct 18, 2024

khanxmetu force-pushed the discussioncomment-10909269 branch 2 times, most recently from 07080c5 to f371bbb Compare October 18, 2024 15:36

_check_toc_parents considers only the descendants of root_doc and n…

f371bbb

…ot the whole toctree_includes graph

jayaddison reviewed Oct 19, 2024

View reviewed changes

sphinx/environment/__init__.py Outdated Show resolved Hide resolved

jayaddison reviewed Oct 19, 2024

View reviewed changes

jayaddison added builder:html internals:toctree labels Oct 19, 2024

khanxmetu and others added 2 commits October 19, 2024 15:54

Code refactoring suggestion

cc5f3f8

Co-authored-by: James Addison <55152140+jayaddison@users.noreply.github.com>

Keep parents list as unsorted

26b216e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`_check_toc_parents` should consider only the descendants of root_doc and n… #13038

`_check_toc_parents` should consider only the descendants of root_doc and n… #13038

khanxmetu commented Oct 18, 2024 •

edited

Loading

jayaddison Oct 19, 2024

khanxmetu Oct 19, 2024 •

edited

Loading

jayaddison Oct 19, 2024

khanxmetu Oct 19, 2024 •

edited

Loading

jayaddison Oct 21, 2024

khanxmetu Oct 22, 2024 •

edited

Loading

jayaddison Oct 25, 2024

khanxmetu Oct 25, 2024 •

edited

Loading

jayaddison Nov 1, 2024

khanxmetu Nov 1, 2024 •

edited

Loading

jayaddison commented Nov 1, 2024

How to Reproduce

Toctree Graph

jayaddison commented Nov 1, 2024

khanxmetu commented Nov 1, 2024

khanxmetu commented Nov 1, 2024 •

edited

Loading

jayaddison commented Nov 5, 2024

_check_toc_parents should consider only the descendants of root_doc and n… #13038

Are you sure you want to change the base?

_check_toc_parents should consider only the descendants of root_doc and n… #13038

Conversation

khanxmetu commented Oct 18, 2024 • edited Loading

Purpose

Relates

jayaddison Oct 19, 2024

Choose a reason for hiding this comment

khanxmetu Oct 19, 2024 • edited Loading

Choose a reason for hiding this comment

jayaddison Oct 19, 2024

Choose a reason for hiding this comment

khanxmetu Oct 19, 2024 • edited Loading

Choose a reason for hiding this comment

jayaddison Oct 21, 2024

Choose a reason for hiding this comment

khanxmetu Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

jayaddison Oct 25, 2024

Choose a reason for hiding this comment

khanxmetu Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

jayaddison Nov 1, 2024

Choose a reason for hiding this comment

khanxmetu Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

jayaddison commented Nov 1, 2024

How to Reproduce

Toctree Graph

jayaddison commented Nov 1, 2024

khanxmetu commented Nov 1, 2024

khanxmetu commented Nov 1, 2024 • edited Loading

jayaddison commented Nov 5, 2024

`_check_toc_parents` should consider only the descendants of root_doc and n… #13038

`_check_toc_parents` should consider only the descendants of root_doc and n… #13038

khanxmetu commented Oct 18, 2024 •

edited

Loading

khanxmetu Oct 19, 2024 •

edited

Loading

khanxmetu Oct 19, 2024 •

edited

Loading

khanxmetu Oct 22, 2024 •

edited

Loading

khanxmetu Oct 25, 2024 •

edited

Loading

khanxmetu Nov 1, 2024 •

edited

Loading

khanxmetu commented Nov 1, 2024 •

edited

Loading