Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Another approach to parse tag string with spaces to avoid long running time on regex matching. #7625

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

dkphm
Copy link
Contributor

@dkphm dkphm commented Oct 29, 2024

Which issue(s) does this change fix?

#6657

Why is this change necessary?

  • Previous fix (fix: improve tag parsing performance for list input #7049) only handle the case when tags are provided through samconfig.toml, which resulted into a list to CfnTags, but it doesn't fix the case when customer specify --tags parameter. In the later case, the provided value will be in a string, and we will face the issue with re.findall again, mainly due to the complex regex provided to parse {tag}={tag} pattern in tags.

  • The solution is instead of parsing a full complex regex with {tag}={tag}, we will look for any quote strings with space in the value (e.g. "Test App"), replace the space with some replacement, and restart the parsing process.

How does it address the issue?

Run sam deploy with --tags tag1="value 1" tag2="value 2"

What side effects does this change have?

Customer might not be able to set tags properly during sam deploy.

Mandatory Checklist

PRs will only be reviewed after checklist is complete

  • Add input/output type hints to new functions/methods
  • Write design document if needed (Do I need to write a design document?)
  • Write/update unit tests
  • Write/update integration tests
  • Write/update functional tests if needed
  • make pr passes
  • make update-reproducible-reqs if dependencies were changed
  • Write documentation

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dkphm dkphm requested a review from a team as a code owner October 29, 2024 19:09
@github-actions github-actions bot added pr/external stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. labels Oct 29, 2024
@dkphm dkphm added pr/internal and removed pr/external stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. labels Oct 29, 2024
@dkphm dkphm requested review from hawflau and removed request for mildaniel October 29, 2024 21:18
parsed, parsed_tag = CfnTags._standard_key_value_parser(value)
if not parsed:
return False, None
tags_dict = {**tags_dict, **parsed_tag}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (if parsed_tag is of type dict)

Suggested change
tags_dict = {**tags_dict, **parsed_tag}
tags_dict.update(parsed_tag)

for value in tag_value.split():
parsed, parsed_tag = CfnTags._standard_key_value_parser(value)
if not parsed:
return False, None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we're also returning a boolean here? Can we just return None, or an empty dict instead of returning a tuple?

Comment on lines 252 to 253
# first, remove all the spaces and start over again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there missing code here, or maybe duplicate comment? This comment and the one under say "first"


# Next, looking for a quote strings that contain spaces and proceed to replace them
quoted_strings_with_spaces = re.findall(self._quoted_pattern, modified_val)
quoted_strings_without_spaces = [self._replace_spaces(s) for s in quoted_strings_with_spaces]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no single character variable names

for group in groups:
key, v = group
self._add_value(result, _unquote_wrapped_quotes(key), _unquote_wrapped_quotes(v))
# Instead of parsing a full {tag}={tag} pattern, we will try to look for quoted string with spaces
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor this new logic out to a separate method to unit test? There's a lot of moving parts here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky because the block below

groups = re.findall(self._pattern, val)

take longer time to return

@@ -194,6 +219,7 @@ def __init__(self, multiple_values_per_key=False):
TAG_REGEX = '[A-Za-z0-9\\"_:\\.\\/\\+-\\@=]'

_pattern = r"{tag}={tag}".format(tag=_generate_match_regex(match_pattern=TAG_REGEX, delim=" "))
_quoted_pattern = _generate_quoted_match_regex(match_pattern=TAG_REGEX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse or refactor the old _generate_match_regex? They share similar return values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have different regex, hence creating a new function.

# Finally, restart the parsing with key=value separated by (multiple) spaces.
parsed, tags = self._multiple_space_separated_key_value_parser(modified_val)
if parsed:
for k in tags:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
for k in tags:
for key, value in tags.items():

space_positions = [i for i, char in enumerate(text) if char == " "]
modified = text.replace(" ", replacement)

return modified, space_positions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using dataclasses for return objects, instead of returning tuples and performing array indexes to get the modified strings/replacement index array?

I imagine that changing things to dataclasses will make it easier for future readers to understand what is happening quickly.

@lucashuy
Copy link
Contributor

lucashuy commented Nov 5, 2024

What side effects does this change have?

Customer might not be able to set tags properly during sam deploy.

What does this mean? Does this change how tags works for customers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants