-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50
Conversation
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
…(); Add docstrign to RegexDFAStatePair
…NFA; Made add to nfa functions const
…reflect functionality change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the major renaming issue is about this new graph we introduced:
root --(pos_tagged_start_transition)--> capture_group_start_state --> [inner capture group NFA] --(neg_tagged_transition)--> neg_state --> capture_group_end_state --(pos_tagged_end_transition)--> end_state
Things we introduced:
- pos_tagged_start_transition (edge)
- pos_tagged_end_transition (edge)
- capture_group_start_state (state)
- capture_group_end_state (state)
IMO the naming of these four edges/states is important for people who first read this code. I'm not sure if the current naming is already optimized to its best state:
- positive "tagged" start/end transition and "capture group" states using different terms. Is it possible to stick to one? Either
tagged
orcapture group
- If we use the transition to name a new state, i.e.,
new_state_with_positive_tagged_end_transition
isn't clear which state it creates, as it has two states connected. And for people who first read the code, it could be either the transition points to the new state or the new state has a new transition pointing to other state. Is it more easier to understand if we explicitly name thenew
function by its state, i.e.,new_capture_group_end_state
, that matches the diagram above and can visually tell people which state it creates. (that is why I suggestednew_positive_tagged_transition_end_state
as it explicitly speaks out which new state it creates, but we may need to come up a better name rather thanpositive_tagged_transition_end_state
).
It's my mistake that I should realize some of the problems above in the previous PR. When this PR introduces more complexity to the tagged transitions, I realized more clarifications might be required.
I think the action item would be to formalize the diagram above, and then we can apply to the related code and future PRs. Maybe we can send an issue to track and get this PR merged first.
@kirkrodrigues Can u help take a look as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)
899-901
: Update diagram comment to match current variable names.The diagram should use
capture_start_state
andcapture_end_state
to match the implementation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/RegexAST.hpp
(2 hunks)
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
696-698
: LGTM! Parameter name change improves clarity.
The rename from end_state
to dest_state
better reflects the parameter's role as a destination state rather than implying it's the final state.
902-911
: LGTM! Well-structured implementation with clear state management.
The implementation:
- Correctly splits positive tags into start and end transitions
- Uses RAII-style state creation through the factory method
- Properly manages root state to prevent side effects
- Follows the documented NFA structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)
52-55
: Add parameter validation for tag and dest_stateConsider adding null checks for the parameters to prevent potential issues.
auto add_positive_tagged_start_transition(Tag const* tag, RegexNFAState const* dest_state) -> void { + if (tag == nullptr || dest_state == nullptr) { + throw std::invalid_argument("tag and dest_state must not be null"); + } m_positive_tagged_start_transitions.emplace_back(tag, dest_state); }
Line range hint
190-234
: Fix indentation in the error handling blockThe indentation of the error handling block at line 206 is inconsistent with the surrounding code.
if (m_positive_tagged_end_transition.has_value()) { auto const optional_serialized_positive_end_transition = m_positive_tagged_end_transition.value().serialize(state_ids); if (false == optional_serialized_positive_end_transition.has_value()) { - return std::nullopt; + return std::nullopt; } serialized_positive_tagged_end_transition = optional_serialized_positive_end_transition.value(); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
src/log_surgeon/finite_automata/RegexAST.hpp
(2 hunks)src/log_surgeon/finite_automata/RegexNFAState.hpp
(5 hunks)
🧰 Additional context used
📓 Learnings (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
🔇 Additional comments (4)
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)
35-35
: LGTM: Clean implementation of tag splitting
The implementation correctly separates positive tags into start and end transitions, using appropriate container types (vector for multiple start transitions, optional for single end transition).
Also applies to: 113-114
Line range hint 35-234
: Verify the impact on transition consumers
Let's verify that all code using positive tagged transitions has been updated to handle the split between start and end transitions.
✅ Verification successful
All code has been properly updated to handle the split between positive tagged start and end transitions
The verification shows that all consumers of the positive tagged transitions have been properly updated to handle the split between start and end transitions. This is evidenced by:
- The test cases in
test-NFA.cpp
have been updated to use the new format with separatepositive_tagged_start_transitions
andpositive_tagged_end_transitions
- The
RegexNFA
class has been updated to handle both types:- Uses
add_positive_tagged_start_transition
for start transitions - Uses
new_state_with_positive_tagged_end_transition
for end transitions
- Uses
- The traversal code in
get_bfs_traversal_order
correctly handles both transition types separately
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for direct usage of the old positive tagged transitions
rg -l "get_positive_tagged_transitions|m_positive_tagged_transitions" --type cpp
# Search for potential transition consumers
ast-grep --pattern 'class $_ {
$$$
auto $_($$$) -> void {
$$$
positive_tagged_$$$
$$$
}
$$$
}'
Length of output: 199
Script:
#!/bin/bash
# Search for any references to positive tagged transitions with more context
rg -A 5 -B 5 "positive_tagged" --type cpp
# Search for any files that might be handling transitions
rg -l "RegexNFAState|TaggedTransition" --type cpp
Length of output: 21775
src/log_surgeon/finite_automata/RegexAST.hpp (2)
696-698
: LGTM! Documentation updated consistently.
The parameter name in the documentation has been updated to match the method signature change from end_state
to dest_state
.
895-910
: LGTM! Implementation follows RAII principles and maintains NFA state.
The changes effectively:
- Split positive tags into start and end tags
- Use RAII-style initialization for state management
- Properly handle root state preservation
- Match the documented NFA structure in the comment
This diagram uses concise variable names for all transitions, states, and the sub-NFA. In terms of incoming vs outgoing edges, in general LS has states store out-going edges, therefore any edge associated with a state is always out-going. Maybe this can be made more clear in the docstring for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)
899-901
: Consider enhancing the comment with ASCII art.The comment clearly explains the NFA structure, but could be enhanced with ASCII art for better visualization:
/* * NFA structure: * * root --(`m_tag` start)--> capture_start_state --> [`m_group_regex_ast` NFA] * --(`m_negative_tags`)--> capture_end_state --(`m_tag` end)--> dest_state * * Example ASCII art: * [root] ---tag_start---> [capture_start] ---group_ast---> [capture_end] ---tag_end---> [dest] * ^ | * | | * +--negative_tags---+ */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
src/log_surgeon/finite_automata/RegexAST.hpp
(2 hunks)src/log_surgeon/finite_automata/RegexNFA.hpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/log_surgeon/finite_automata/RegexNFA.hpp
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
696-698
: LGTM! Documentation is clear and consistent.
The parameter name change from end_state
to dest_state
in the documentation matches the method signature update.
895-910
: Verify usage of the new factory method.
The implementation looks good and follows RAII principles. Let's verify that all capture group creation sites use the new factory method consistently.
✅ Verification successful
Let me gather more context about the capture group implementations to ensure comprehensive verification.
Let me check the implementation of the factory method to complete the verification.
Let me try one final search with a different pattern to locate the factory method implementation.
Factory method implementation is correctly used for capture groups
The verification confirms that the new factory method new_start_and_end_states_with_positive_tagged_transitions
is the only method used for creating capture group states, and it properly encapsulates the state creation and transition setup. The implementation correctly uses RAII principles and maintains consistency throughout the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of the new factory method for capture groups.
# Search for direct state creation and transition addition that should use the factory method
rg -U "new_state.*add_positive_tagged.*transition" --type cpp
# Search for the new factory method usage
rg "new_start_and_end_states_with_positive_tagged_transitions" --type cpp
Length of output: 520
Script:
#!/bin/bash
# Search for capture-related implementations and their state management
ast-grep --pattern 'class RegexASTCapture {
$$$
}'
# Find all capture-related state creation patterns
rg -U "class.*Capture.*\{[\s\S]*?}" --type cpp -A 20
# Look for any other potential capture state creation patterns
rg -U "capture.*state" --type cpp
Length of output: 3908
Script:
#!/bin/bash
# Get the implementation of the factory method
rg -U "new_start_and_end_states_with_positive_tagged_transitions.*\{[\s\S]*?\}" --type cpp
# Check if there are any other methods in RegexNFA that create states with tagged transitions
ast-grep --pattern 'class RegexNFA {
$$$
new_state$$$
$$$
}'
Length of output: 161
Script:
#!/bin/bash
# Search for the factory method implementation with more context
rg -U "new_start_and_end_states_with_positive_tagged_transitions" --type cpp -B 5 -A 20
# Look for any other state creation patterns in RegexNFA
rg -U "RegexNFA.*new_state" --type cpp -B 2 -A 10
Length of output: 8980
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)
899-929
: LGTM! Clear and helpful diagram.The ASCII diagram effectively illustrates the NFA structure with split tags. The transitions and states are well-labeled, making it easy to understand the flow.
Consider moving this diagram to a dedicated documentation file as suggested in the TODO comment on line 899. This would make it more discoverable and maintainable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/RegexAST.hpp
(2 hunks)
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (3)
src/log_surgeon/finite_automata/RegexAST.hpp (3)
696-698
: LGTM! Documentation update is clear and accurate.
The parameter name change from end_state
to dest_state
in the documentation aligns well with the method signature update.
895-898
: LGTM! Method signature update improves clarity.
The parameter name change from end_state
to dest_state
is more accurate as the state isn't necessarily the end state in the new split-tag design.
930-939
: LGTM! Clean implementation that follows RAII principles.
The implementation:
- Uses the factory method for state creation
- Properly manages root state scope
- Matches the diagram exactly
- Successfully splits the positive tags into start and end tags
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the PR title, how about:
feat: Split NFA positive tags into start and end transitions to encapsulate a capture group.
Description
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes
Tests