-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved Bambenek parser to better handle description #1451
base: develop
Are you sure you want to change the base?
Conversation
amojamo
commented
Sep 13, 2019
- Changed value to values as it makes more sense.
- Use of regex to read the description for better robustness. As it stands now, there is a conflict when the Bambenek parser reads the IP list. This is because there is a slight change in the Bambenek IP list, where they have a longer description with more commas than usual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the actual problem you were facing? E.g. a line which could not be parsed?
@@ -32,32 +33,33 @@ def parse_line(self, line, report): | |||
self.tempdata.append(line) | |||
|
|||
else: | |||
value = line.split(',') | |||
m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not seem to work with IPv6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change .*
(greedy) to .*?
(non-greedy) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm sorry, I didn't consider IPv6. Regex for an IPv6 pattern is out of my scope.
The greedy to non-greedy works for the description group, but not for the URL group.
value = line.split(',') | ||
m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ | ||
(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line) | ||
values = m.groups() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line raises an exception in the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be my line split. I'm not sure how to break the regex line into two lines in order not to trigger the "Line too long" warning when it comes to code style checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just split the line like this:
m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)
@@ -32,32 +33,33 @@ def parse_line(self, line, report): | |||
self.tempdata.append(line) | |||
|
|||
else: | |||
value = line.split(',') | |||
m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \ | |||
(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change .*
(greedy) to .*?
(non-greedy) here?
Can we $
at the end to indicate the regular expression should match the full line?
They (Bambenek) have since last time edited the IP list, so it parses the line without error as of today. The problem before their last edit was on the following line(s):
This has since been changed to the following:
The problem was that the parsing failed because there were more commas than anticipated, so This kinda makes the change obsolete in a way, but without a regex expression the parser is more fragile than it needs to be. |
Well, that's obvious.
If the regular expression itself is stable - yes. I opened https://github.com/amojamo/intelmq/pull/1 for some tests which you could use for development. In case the format is proper CSV (using commas is ok then if properly escaped), we can use the csv parser of python itself like with any csv-based feed. That's IMHO the best option. |
TST: add bambenek tests for commas in descriptions
Are you still working on this? The tests are still failing on |