Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

Merged
merged 257 commits into from
Nov 21, 2024

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Nov 18, 2024

Description

  • In tagged Regex, positive tags can exist in isolation. However, for capture groups to be useful we always need them to exist in pairs, when before the capture group and one after.
  • Split the existing positive tags into a start and end tag.

Validation performed

  • Unit-tests updated to output both start and end tags, runs correctly.

Summary by CodeRabbit

  • New Features

    • Improved handling of capture groups in regex processing, enhancing NFA state transitions.
    • Clearer distinction between start and end tagged transitions for better clarity in transition management.
  • Bug Fixes

    • Enhanced traversal logic for tagged transitions to ensure accurate processing.
  • Tests

    • Updated test cases to reflect changes in serialized NFA output, including clearer naming conventions for transitions.

…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
@SharafMohamed SharafMohamed changed the title Split NFA positive tags into start and end tags. feat: Split NFA positive tags into start and end tags. Nov 20, 2024
Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the major renaming issue is about this new graph we introduced:

root --(pos_tagged_start_transition)--> capture_group_start_state --> [inner capture group NFA] --(neg_tagged_transition)--> neg_state --> capture_group_end_state --(pos_tagged_end_transition)--> end_state

Things we introduced:

  • pos_tagged_start_transition (edge)
  • pos_tagged_end_transition (edge)
  • capture_group_start_state (state)
  • capture_group_end_state (state)

IMO the naming of these four edges/states is important for people who first read this code. I'm not sure if the current naming is already optimized to its best state:

  • positive "tagged" start/end transition and "capture group" states using different terms. Is it possible to stick to one? Either tagged or capture group
  • If we use the transition to name a new state, i.e., new_state_with_positive_tagged_end_transition isn't clear which state it creates, as it has two states connected. And for people who first read the code, it could be either the transition points to the new state or the new state has a new transition pointing to other state. Is it more easier to understand if we explicitly name the new function by its state, i.e., new_capture_group_end_state, that matches the diagram above and can visually tell people which state it creates. (that is why I suggested new_positive_tagged_transition_end_state as it explicitly speaks out which new state it creates, but we may need to come up a better name rather than positive_tagged_transition_end_state).

It's my mistake that I should realize some of the problems above in the previous PR. When this PR introduces more complexity to the tagged transitions, I realized more clarifications might be required.
I think the action item would be to formalize the diagram above, and then we can apply to the related code and future PRs. Maybe we can send an issue to track and get this PR merged first.
@kirkrodrigues Can u help take a look as well?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-901: Update diagram comment to match current variable names.

The diagram should use capture_start_state and capture_end_state to match the implementation.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 98c5b95 and f59cf41.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Parameter name change improves clarity.

The rename from end_state to dest_state better reflects the parameter's role as a destination state rather than implying it's the final state.


902-911: LGTM! Well-structured implementation with clear state management.

The implementation:

  • Correctly splits positive tags into start and end transitions
  • Uses RAII-style state creation through the factory method
  • Properly manages root state to prevent side effects
  • Follows the documented NFA structure

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

52-55: Add parameter validation for tag and dest_state

Consider adding null checks for the parameters to prevent potential issues.

 auto
 add_positive_tagged_start_transition(Tag const* tag, RegexNFAState const* dest_state) -> void {
+    if (tag == nullptr || dest_state == nullptr) {
+        throw std::invalid_argument("tag and dest_state must not be null");
+    }
     m_positive_tagged_start_transitions.emplace_back(tag, dest_state);
 }

Line range hint 190-234: Fix indentation in the error handling block

The indentation of the error handling block at line 206 is inconsistent with the surrounding code.

     if (m_positive_tagged_end_transition.has_value()) {
         auto const optional_serialized_positive_end_transition
                 = m_positive_tagged_end_transition.value().serialize(state_ids);
         if (false == optional_serialized_positive_end_transition.has_value()) {
-            return std::nullopt;
+             return std::nullopt;
         }
         serialized_positive_tagged_end_transition = optional_serialized_positive_end_transition.value();
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 85a2d69 and 2b01433.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegexNFAState.hpp (5 hunks)
🧰 Additional context used
📓 Learnings (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
🔇 Additional comments (4)
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

35-35: LGTM: Clean implementation of tag splitting

The implementation correctly separates positive tags into start and end transitions, using appropriate container types (vector for multiple start transitions, optional for single end transition).

Also applies to: 113-114


Line range hint 35-234: Verify the impact on transition consumers

Let's verify that all code using positive tagged transitions has been updated to handle the split between start and end transitions.

✅ Verification successful

All code has been properly updated to handle the split between positive tagged start and end transitions

The verification shows that all consumers of the positive tagged transitions have been properly updated to handle the split between start and end transitions. This is evidenced by:

  1. The test cases in test-NFA.cpp have been updated to use the new format with separate positive_tagged_start_transitions and positive_tagged_end_transitions
  2. The RegexNFA class has been updated to handle both types:
    • Uses add_positive_tagged_start_transition for start transitions
    • Uses new_state_with_positive_tagged_end_transition for end transitions
  3. The traversal code in get_bfs_traversal_order correctly handles both transition types separately
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct usage of the old positive tagged transitions
rg -l "get_positive_tagged_transitions|m_positive_tagged_transitions" --type cpp

# Search for potential transition consumers
ast-grep --pattern 'class $_ {
  $$$
  auto $_($$$) -> void {
    $$$
    positive_tagged_$$$
    $$$
  }
  $$$
}'

Length of output: 199


Script:

#!/bin/bash
# Search for any references to positive tagged transitions with more context
rg -A 5 -B 5 "positive_tagged" --type cpp

# Search for any files that might be handling transitions
rg -l "RegexNFAState|TaggedTransition" --type cpp

Length of output: 21775

src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Documentation updated consistently.

The parameter name in the documentation has been updated to match the method signature change from end_state to dest_state.


895-910: LGTM! Implementation follows RAII principles and maintains NFA state.

The changes effectively:

  1. Split positive tags into start and end tags
  2. Use RAII-style initialization for state management
  3. Properly handle root state preservation
  4. Match the documented NFA structure in the comment

@SharafMohamed
Copy link
Contributor Author

SharafMohamed commented Nov 20, 2024

I think the major renaming issue is about this new graph we introduced:

root --(pos_tagged_start_transition)--> capture_group_start_state --> [inner capture group NFA] --(neg_tagged_transition)--> neg_state --> capture_group_end_state --(pos_tagged_end_transition)--> end_state

Things we introduced:

  • pos_tagged_start_transition (edge)
  • pos_tagged_end_transition (edge)
  • capture_group_start_state (state)
  • capture_group_end_state (state)

IMO the naming of these four edges/states is important for people who first read this code. I'm not sure if the current naming is already optimized to its best state:

  • positive "tagged" start/end transition and "capture group" states using different terms. Is it possible to stick to one? Either tagged or capture group
  • If we use the transition to name a new state, i.e., new_state_with_positive_tagged_end_transition isn't clear which state it creates, as it has two states connected. And for people who first read the code, it could be either the transition points to the new state or the new state has a new transition pointing to other state. Is it more easier to understand if we explicitly name the new function by its state, i.e., new_capture_group_end_state, that matches the diagram above and can visually tell people which state it creates. (that is why I suggested new_positive_tagged_transition_end_state as it explicitly speaks out which new state it creates, but we may need to come up a better name rather than positive_tagged_transition_end_state).

It's my mistake that I should realize some of the problems above in the previous PR. When this PR introduces more complexity to the tagged transitions, I realized more clarifications might be required. I think the action item would be to formalize the diagram above, and then we can apply to the related code and future PRs. Maybe we can send an issue to track and get this PR merged first. @kirkrodrigues Can u help take a look as well?

root --(`m_tag` start)--> capture_start_state --> [`m_group_regex_ast` NFA] --(`m_negative_tags`)--> capture_end_state --(`m_tag` end)--> dest_state

This diagram uses concise variable names for all transitions, states, and the sub-NFA. In terms of incoming vs outgoing edges, in general LS has states store out-going edges, therefore any edge associated with a state is always out-going. Maybe this can be made more clear in the docstring for the NFAState class.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-901: Consider enhancing the comment with ASCII art.

The comment clearly explains the NFA structure, but could be enhanced with ASCII art for better visualization:

/*
 * NFA structure:
 *
 *   root --(`m_tag` start)--> capture_start_state --> [`m_group_regex_ast` NFA]
 *         --(`m_negative_tags`)--> capture_end_state --(`m_tag` end)--> dest_state
 *
 *   Example ASCII art:
 *   [root] ---tag_start---> [capture_start] ---group_ast---> [capture_end] ---tag_end---> [dest]
 *                                                ^                  |
 *                                                |                  |
 *                                                +--negative_tags---+
 */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e37b29a and c5beca3.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegexNFA.hpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexNFA.hpp
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)

696-698: LGTM! Documentation is clear and consistent.

The parameter name change from end_state to dest_state in the documentation matches the method signature update.


895-910: Verify usage of the new factory method.

The implementation looks good and follows RAII principles. Let's verify that all capture group creation sites use the new factory method consistently.

✅ Verification successful

Let me gather more context about the capture group implementations to ensure comprehensive verification.


Let me check the implementation of the factory method to complete the verification.


Let me try one final search with a different pattern to locate the factory method implementation.


Factory method implementation is correctly used for capture groups

The verification confirms that the new factory method new_start_and_end_states_with_positive_tagged_transitions is the only method used for creating capture group states, and it properly encapsulates the state creation and transition setup. The implementation correctly uses RAII principles and maintains consistency throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of the new factory method for capture groups.

# Search for direct state creation and transition addition that should use the factory method
rg -U "new_state.*add_positive_tagged.*transition" --type cpp

# Search for the new factory method usage
rg "new_start_and_end_states_with_positive_tagged_transitions" --type cpp

Length of output: 520


Script:

#!/bin/bash
# Search for capture-related implementations and their state management
ast-grep --pattern 'class RegexASTCapture {
  $$$
}'

# Find all capture-related state creation patterns
rg -U "class.*Capture.*\{[\s\S]*?}" --type cpp -A 20

# Look for any other potential capture state creation patterns
rg -U "capture.*state" --type cpp

Length of output: 3908


Script:

#!/bin/bash
# Get the implementation of the factory method
rg -U "new_start_and_end_states_with_positive_tagged_transitions.*\{[\s\S]*?\}" --type cpp

# Check if there are any other methods in RegexNFA that create states with tagged transitions
ast-grep --pattern 'class RegexNFA {
  $$$
  new_state$$$
  $$$
}'

Length of output: 161


Script:

#!/bin/bash
# Search for the factory method implementation with more context
rg -U "new_start_and_end_states_with_positive_tagged_transitions" --type cpp -B 5 -A 20

# Look for any other state creation patterns in RegexNFA
rg -U "RegexNFA.*new_state" --type cpp -B 2 -A 10

Length of output: 8980

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

899-929: LGTM! Clear and helpful diagram.

The ASCII diagram effectively illustrates the NFA structure with split tags. The transitions and states are well-labeled, making it easy to understand the flow.

Consider moving this diagram to a dedicated documentation file as suggested in the TODO comment on line 899. This would make it more discoverable and maintainable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c5beca3 and fe4a7b3.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
🧰 Additional context used
📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
🔇 Additional comments (3)
src/log_surgeon/finite_automata/RegexAST.hpp (3)

696-698: LGTM! Documentation update is clear and accurate.

The parameter name change from end_state to dest_state in the documentation aligns well with the method signature update.


895-898: LGTM! Method signature update improves clarity.

The parameter name change from end_state to dest_state is more accurate as the state isn't necessarily the end state in the new split-tag design.


930-939: LGTM! Clean implementation that follows RAII principles.

The implementation:

  1. Uses the factory method for state creation
  2. Properly manages root state scope
  3. Matches the diagram exactly
  4. Successfully splits the positive tags into start and end tags

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the PR title, how about:
feat: Split NFA positive tags into start and end transitions to encapsulate a capture group.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants