feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

SharafMohamed · 2024-11-18T16:20:12Z

Description

In tagged Regex, positive tags can exist in isolation. However, for capture groups to be useful we always need them to exist in pairs, when before the capture group and one after.
Split the existing positive tags into a start and end tag.

Validation performed

Unit-tests updated to output both start and end tags, runs correctly.

Summary by CodeRabbit

New Features
- Improved handling of capture groups in regex processing, enhancing NFA state transitions.
- Clearer distinction between start and end tagged transitions for better clarity in transition management.
Bug Fixes
- Enhanced traversal logic for tagged transitions to ensure accurate processing.
Tests
- Updated test cases to reflect changes in serialized NFA output, including clearer naming conventions for transitions.

… added tags

…for clairty that nothing is shared b/w tests

…egexASTgroup with min = 1 OR'd with RegexASTEmpty

…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

…(); Add docstrign to RegexDFAStatePair

… functions const

…NFA; Made add to nfa functions const

… to be explicit.

…reflect functionality change.

LinZhihao-723

I think the major renaming issue is about this new graph we introduced:

root --(pos_tagged_start_transition)--> capture_group_start_state --> [inner capture group NFA] --(neg_tagged_transition)--> neg_state --> capture_group_end_state --(pos_tagged_end_transition)--> end_state

Things we introduced:

pos_tagged_start_transition (edge)
pos_tagged_end_transition (edge)
capture_group_start_state (state)
capture_group_end_state (state)

IMO the naming of these four edges/states is important for people who first read this code. I'm not sure if the current naming is already optimized to its best state:

positive "tagged" start/end transition and "capture group" states using different terms. Is it possible to stick to one? Either tagged or capture group
If we use the transition to name a new state, i.e., new_state_with_positive_tagged_end_transition isn't clear which state it creates, as it has two states connected. And for people who first read the code, it could be either the transition points to the new state or the new state has a new transition pointing to other state. Is it more easier to understand if we explicitly name the new function by its state, i.e., new_capture_group_end_state, that matches the diagram above and can visually tell people which state it creates. (that is why I suggested new_positive_tagged_transition_end_state as it explicitly speaks out which new state it creates, but we may need to come up a better name rather than positive_tagged_transition_end_state).

It's my mistake that I should realize some of the problems above in the previous PR. When this PR introduces more complexity to the tagged transitions, I realized more clarifications might be required.
I think the action item would be to formalize the diagram above, and then we can apply to the related code and future PRs. Maybe we can send an issue to track and get this PR merged first.
@kirkrodrigues Can u help take a look as well?

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-901: Update diagram comment to match current variable names.

The diagram should use capture_start_state and capture_end_state to match the implementation.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 98c5b95 and f59cf41.

📒 Files selected for processing (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)

🧰 Additional context used

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

🔇 Additional comments (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Parameter name change improves clarity.

The rename from end_state to dest_state better reflects the parameter's role as a destination state rather than implying it's the final state.

902-911: LGTM! Well-structured implementation with clear state management.

The implementation:

Correctly splits positive tags into start and end transitions
Uses RAII-style state creation through the factory method
Properly manages root state to prevent side effects
Follows the documented NFA structure

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

52-55: Add parameter validation for tag and dest_state

Consider adding null checks for the parameters to prevent potential issues.

 auto
 add_positive_tagged_start_transition(Tag const* tag, RegexNFAState const* dest_state) -> void {
+    if (tag == nullptr || dest_state == nullptr) {
+        throw std::invalid_argument("tag and dest_state must not be null");
+    }
     m_positive_tagged_start_transitions.emplace_back(tag, dest_state);
 }

Line range hint 190-234: Fix indentation in the error handling block

The indentation of the error handling block at line 206 is inconsistent with the surrounding code.

     if (m_positive_tagged_end_transition.has_value()) {
         auto const optional_serialized_positive_end_transition
                 = m_positive_tagged_end_transition.value().serialize(state_ids);
         if (false == optional_serialized_positive_end_transition.has_value()) {
-            return std::nullopt;
+             return std::nullopt;
         }
         serialized_positive_tagged_end_transition = optional_serialized_positive_end_transition.value();
     }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 85a2d69 and 2b01433.

📒 Files selected for processing (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
src/log_surgeon/finite_automata/RegexNFAState.hpp (5 hunks)

🧰 Additional context used

📓 Learnings (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

🔇 Additional comments (4)

src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

35-35: LGTM: Clean implementation of tag splitting

The implementation correctly separates positive tags into start and end transitions, using appropriate container types (vector for multiple start transitions, optional for single end transition).

Also applies to: 113-114

Line range hint 35-234: Verify the impact on transition consumers

Let's verify that all code using positive tagged transitions has been updated to handle the split between start and end transitions.

✅ Verification successful

All code has been properly updated to handle the split between positive tagged start and end transitions

The verification shows that all consumers of the positive tagged transitions have been properly updated to handle the split between start and end transitions. This is evidenced by:

The test cases in test-NFA.cpp have been updated to use the new format with separate positive_tagged_start_transitions and positive_tagged_end_transitions
The RegexNFA class has been updated to handle both types:
- Uses add_positive_tagged_start_transition for start transitions
- Uses new_state_with_positive_tagged_end_transition for end transitions
The traversal code in get_bfs_traversal_order correctly handles both transition types separately

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct usage of the old positive tagged transitions
rg -l "get_positive_tagged_transitions|m_positive_tagged_transitions" --type cpp

# Search for potential transition consumers
ast-grep --pattern 'class $_ {
  $$$
  auto $_($$$) -> void {
    $$$
    positive_tagged_$$$
    $$$
  }
  $$$
}'

Length of output: 199

Script:

#!/bin/bash
# Search for any references to positive tagged transitions with more context
rg -A 5 -B 5 "positive_tagged" --type cpp

# Search for any files that might be handling transitions
rg -l "RegexNFAState|TaggedTransition" --type cpp

Length of output: 21775

src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Documentation updated consistently.

The parameter name in the documentation has been updated to match the method signature change from end_state to dest_state.

895-910: LGTM! Implementation follows RAII principles and maintains NFA state.

The changes effectively:

Split positive tags into start and end tags
Use RAII-style initialization for state management
Properly handle root state preservation
Match the documented NFA structure in the comment

SharafMohamed · 2024-11-20T23:04:01Z

I think the major renaming issue is about this new graph we introduced:
root --(pos_tagged_start_transition)--> capture_group_start_state --> [inner capture group NFA] --(neg_tagged_transition)--> neg_state --> capture_group_end_state --(pos_tagged_end_transition)--> end_state
Things we introduced:

pos_tagged_start_transition (edge)

pos_tagged_end_transition (edge)

capture_group_start_state (state)

capture_group_end_state (state)

IMO the naming of these four edges/states is important for people who first read this code. I'm not sure if the current naming is already optimized to its best state:

positive "tagged" start/end transition and "capture group" states using different terms. Is it possible to stick to one? Either tagged or capture group

If we use the transition to name a new state, i.e., new_state_with_positive_tagged_end_transition isn't clear which state it creates, as it has two states connected. And for people who first read the code, it could be either the transition points to the new state or the new state has a new transition pointing to other state. Is it more easier to understand if we explicitly name the new function by its state, i.e., new_capture_group_end_state, that matches the diagram above and can visually tell people which state it creates. (that is why I suggested new_positive_tagged_transition_end_state as it explicitly speaks out which new state it creates, but we may need to come up a better name rather than positive_tagged_transition_end_state).

It's my mistake that I should realize some of the problems above in the previous PR. When this PR introduces more complexity to the tagged transitions, I realized more clarifications might be required. I think the action item would be to formalize the diagram above, and then we can apply to the related code and future PRs. Maybe we can send an issue to track and get this PR merged first. @kirkrodrigues Can u help take a look as well?

root --(`m_tag` start)--> capture_start_state --> [`m_group_regex_ast` NFA] --(`m_negative_tags`)--> capture_end_state --(`m_tag` end)--> dest_state

This diagram uses concise variable names for all transitions, states, and the sub-NFA. In terms of incoming vs outgoing edges, in general LS has states store out-going edges, therefore any edge associated with a state is always out-going. Maybe this can be made more clear in the docstring for the NFAState class.

src/log_surgeon/finite_automata/RegexNFA.hpp

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-901: Consider enhancing the comment with ASCII art.

The comment clearly explains the NFA structure, but could be enhanced with ASCII art for better visualization:

/*
 * NFA structure:
 *
 *   root --(`m_tag` start)--> capture_start_state --> [`m_group_regex_ast` NFA]
 *         --(`m_negative_tags`)--> capture_end_state --(`m_tag` end)--> dest_state
 *
 *   Example ASCII art:
 *   [root] ---tag_start---> [capture_start] ---group_ast---> [capture_end] ---tag_end---> [dest]
 *                                                ^                  |
 *                                                |                  |
 *                                                +--negative_tags---+
 */

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e37b29a and c5beca3.

📒 Files selected for processing (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
src/log_surgeon/finite_automata/RegexNFA.hpp (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/log_surgeon/finite_automata/RegexNFA.hpp

🧰 Additional context used

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

🔇 Additional comments (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Documentation is clear and consistent.

The parameter name change from end_state to dest_state in the documentation matches the method signature update.

895-910: Verify usage of the new factory method.

The implementation looks good and follows RAII principles. Let's verify that all capture group creation sites use the new factory method consistently.

✅ Verification successful

Let me gather more context about the capture group implementations to ensure comprehensive verification.

Let me check the implementation of the factory method to complete the verification.

Let me try one final search with a different pattern to locate the factory method implementation.

Factory method implementation is correctly used for capture groups

The verification confirms that the new factory method new_start_and_end_states_with_positive_tagged_transitions is the only method used for creating capture group states, and it properly encapsulates the state creation and transition setup. The implementation correctly uses RAII principles and maintains consistency throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of the new factory method for capture groups.

# Search for direct state creation and transition addition that should use the factory method
rg -U "new_state.*add_positive_tagged.*transition" --type cpp

# Search for the new factory method usage
rg "new_start_and_end_states_with_positive_tagged_transitions" --type cpp

Length of output: 520

Script:

#!/bin/bash
# Search for capture-related implementations and their state management
ast-grep --pattern 'class RegexASTCapture {
  $$$
}'

# Find all capture-related state creation patterns
rg -U "class.*Capture.*\{[\s\S]*?}" --type cpp -A 20

# Look for any other potential capture state creation patterns
rg -U "capture.*state" --type cpp

Length of output: 3908

Script:

#!/bin/bash
# Get the implementation of the factory method
rg -U "new_start_and_end_states_with_positive_tagged_transitions.*\{[\s\S]*?\}" --type cpp

# Check if there are any other methods in RegexNFA that create states with tagged transitions
ast-grep --pattern 'class RegexNFA {
  $$$
  new_state$$$
  $$$
}'

Length of output: 161

Script:

#!/bin/bash
# Search for the factory method implementation with more context
rg -U "new_start_and_end_states_with_positive_tagged_transitions" --type cpp -B 5 -A 20

# Look for any other state creation patterns in RegexNFA
rg -U "RegexNFA.*new_state" --type cpp -B 2 -A 10

Length of output: 8980

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-929: LGTM! Clear and helpful diagram.

The ASCII diagram effectively illustrates the NFA structure with split tags. The transitions and states are well-labeled, making it easy to understand the flow.

Consider moving this diagram to a dedicated documentation file as suggested in the TODO comment on line 899. This would make it more discoverable and maintainable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c5beca3 and fe4a7b3.

📒 Files selected for processing (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)

🧰 Additional context used

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

🔇 Additional comments (3)

src/log_surgeon/finite_automata/RegexAST.hpp (3)

696-698: LGTM! Documentation update is clear and accurate.

The parameter name change from end_state to dest_state in the documentation aligns well with the method signature update.

895-898: LGTM! Method signature update improves clarity.

The parameter name change from end_state to dest_state is more accurate as the state isn't necessarily the end state in the new split-tag design.

930-939: LGTM! Clean implementation that follows RAII principles.

The implementation:

Uses the factory method for state creation
Properly manages root state scope
Matches the diagram exactly
Successfully splits the positive tags into start and end tags

LinZhihao-723

for the PR title, how about:
feat: Split NFA positive tags into start and end transitions to encapsulate a capture group.

SharafMohamed added 30 commits September 13, 2024 10:41

Add tags to AST; Serialize AST for testing; Add unit-test for testing…

e05acbb

… added tags

Use using to condense code; Use a unique schema object for each test …

5e61e83

…for clairty that nothing is shared b/w tests

Add has_capture_groups(); Add unit-test for has_capture_groups()

082090d

Create and use RegexASTEmpty to split RegexASTgroup with min=0 into R…

2c6d94e

…egexASTgroup with min = 1 OR'd with RegexASTEmpty

Add unit-test for 0 repetition regex

4e02f24

Add more tests for repetition regex

bb3c543

Return by value in literal getters; Use const instead of const& for l…

54027ad

…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

Refactor new_state()

e58274f

Rename get_first_matching_variable_ids() to get_matching_variable_ids…

1321871

…(); Add docstrign to RegexDFAStatePair

Remove redundant docstrings

c904755

Remove has_capture_groups()

ffe9a0f

Const and auto changes

913ed1a

Add tagged-nfa

795add3

Clarify that the add functions are adding to the nfa; Make add to nfa…

6e45657

… functions const

Changed AST add functions to indicate the AST are being added to the …

7aa8a92

…NFA; Made add to nfa functions const

Merged with previous PR

d1d87e7

Merge branch 'tagged-ast' into pre-tagged-nfa-cleanup

f386a3b

Merge branch 'pre-tagged-nfa-cleanup' into regex-ast-empty

0c600d7

Change add in RegexASTEmpty to add_to_nfa

bedad75

Merge with previous PR

c78f79c

Fix and refactor NFA unit-test

cd54e64

Merge with previous PRs and update some ints to uints.

06c7066

Fix compiler error.

38ab6fe

Fix compiler error where macos considers a struct default constructor…

4c6b9c6

… to be explicit.

Add state_type explicitly.

2e71aaa

Add state_type explicitly.

c062a2c

Remove commented out code.

eaa5674

Remove errent +=.

f150474

Replace constructors with aggregate initialization.

bdafe10

Replace static inline with static constexpr.

335bb34

SharafMohamed added 2 commits November 20, 2024 05:16

Update new_capture_group_start_states to new_capture_group_states to …

0c2c1d1

…reflect functionality change.

Linter.

b0b951a

SharafMohamed changed the title ~~Split NFA positive tags into start and end tags.~~ feat: Split NFA positive tags into start and end tags. Nov 20, 2024

SharafMohamed added 2 commits November 20, 2024 15:11

Update docstring for .

3c2a2ab

Rename to new_start_and_end_states_with_positively_tagged_transitions.

98c5b95

LinZhihao-723 reviewed Nov 20, 2024

View reviewed changes

Rename to capture_X_state.

f59cf41

coderabbitai bot reviewed Nov 20, 2024

View reviewed changes

SharafMohamed added 3 commits November 20, 2024 17:16

Update docstring.

85a2d69

Updated diagram to match vars used in code.

4c602d4

Rename vars to serialized_X.

2b01433

coderabbitai bot reviewed Nov 20, 2024

View reviewed changes

Run Linter.

e37b29a

LinZhihao-723 reviewed Nov 20, 2024

View reviewed changes

src/log_surgeon/finite_automata/RegexNFA.hpp Outdated Show resolved Hide resolved

Fix typo.

c5beca3

coderabbitai bot reviewed Nov 20, 2024

View reviewed changes

Update diagram for capture group NFA.

fe4a7b3

coderabbitai bot reviewed Nov 20, 2024

View reviewed changes

LinZhihao-723 approved these changes Nov 21, 2024

View reviewed changes

SharafMohamed changed the title ~~feat: Split NFA positive tags into start and end tags.~~ feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. Nov 21, 2024

SharafMohamed merged commit 3f13224 into y-scope:main Nov 21, 2024
8 checks passed

coderabbitai bot mentioned this pull request Nov 27, 2024

feat: Add PrefixTree and RegisterHandler to support TDFA simulation. #56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

SharafMohamed commented Nov 18, 2024 •

edited by coderabbitai bot

Loading

LinZhihao-723 left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

SharafMohamed commented Nov 20, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

LinZhihao-723 left a comment

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

Conversation

SharafMohamed commented Nov 18, 2024 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

LinZhihao-723 left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

SharafMohamed commented Nov 20, 2024 • edited Loading

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

LinZhihao-723 left a comment

Choose a reason for hiding this comment

SharafMohamed commented Nov 18, 2024 •

edited by coderabbitai bot

Loading

SharafMohamed commented Nov 20, 2024 •

edited

Loading