-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip detectors for known bad chunks #1741
base: main
Are you sure you want to change the base?
Conversation
702f879
to
4e47825
Compare
4e47825
to
8e23ec8
Compare
ae8b74e
to
048585d
Compare
pkg/detectors/parseur/parseur.go
Outdated
@@ -30,6 +33,14 @@ func (s Scanner) Keywords() []string { | |||
return []string{"parseur"} | |||
} | |||
|
|||
func (s Scanner) ScanChunk(chunk sources.Chunk) bool { | |||
// TODO: Can |chunk.SourceMetadata| be nil? | |||
if m, ok := chunk.SourceMetadata.GetData().(sources.GitSourceMetadata); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested this. Sadly, it does not work because GetData returns {"github": ...}
— life is not so easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that you need to explicitly attempt to unmarshal the metadata to determine whether it's from Git?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that GetData()
returned a top-level object, however, it actually returns an object like {"bitbucket":{...}} | {"github":{...}} | {"gitlab":{...}}
. Only calling GetX()
returns the inner object.
As far as I can tell, the only solution is to construct an ugly switch statement to handle each SourceType, similar to:
trufflehog/pkg/engine/engine.go
Lines 605 to 609 in b03cc30
func SupportsLineNumbers(sourceType sourcespb.SourceType) bool { | |
switch sourceType { | |
case sourcespb.SourceType_SOURCE_TYPE_GIT, | |
sourcespb.SourceType_SOURCE_TYPE_GITHUB, | |
sourcespb.SourceType_SOURCE_TYPE_GITLAB, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like smuggling the interface to avoid changing the entire Detector
signature, and I like the idea of using chunk metadata to help make the decision about whether a given detector should scan a given chunk - assuming we can find a more performant way to do the metadata type check than repeated attempts at unmarshaling.
As for the implied question of whether this is even worth doing: You've convinced me, but I'll leave space for others to weigh in if they have other opinions.
pkg/detectors/parseur/parseur.go
Outdated
@@ -30,6 +33,14 @@ func (s Scanner) Keywords() []string { | |||
return []string{"parseur"} | |||
} | |||
|
|||
func (s Scanner) ScanChunk(chunk sources.Chunk) bool { | |||
// TODO: Can |chunk.SourceMetadata| be nil? | |||
if m, ok := chunk.SourceMetadata.GetData().(sources.GitSourceMetadata); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that you need to explicitly attempt to unmarshal the metadata to determine whether it's from Git?
pkg/sources/sources.go
Outdated
// GitSourceMetadata defines a common interface for Git-based source metadata. | ||
// For example, this should match Git, Azure, Bitbucket, GitHub, and Gitlab. | ||
type GitSourceMetadata interface { | ||
GetRepository() string | ||
GetCommit() string | ||
GetFile() string | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea, assuming we can find a performant way to actually use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the performance implications of this right now? (Assuming that it works, which it doesn't seem to...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you have to cast the struct
to interface{}
first.
md := data.GetGit()
if d, ok := interface{}(md).(GitSource); ok {
d.GetFile() // do something
}
048585d
to
9509f7a
Compare
acc79eb
to
654a95a
Compare
Repository string | ||
Commit string | ||
File string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a struct is much less satisfying than an interface. >:(
Also, should these be pointers?
@rosecodym I've updated the implementation based on feedback and testing. Let me know your thoughts on this. |
91841b9
to
7d0c42f
Compare
@rgmz we just had an internal conversation about this and a question came up: How much of this problem is associated with lockfiles specifically? If the answer is "a lot," does it make sense to either use file exclusions or to consider skipping them at the engine level rather than adding complexity to detectors? |
I wouldn't say the problem is specific to lock files. There are lots of files that create a high volume of false detections for specific detectors, that doesn't mean the files should be skipped altogether — even lock files have a one-in-a-million chance to contain secrets. |
3bbcf42
to
bec2bc2
Compare
bec2bc2
to
3f921c6
Compare
3f921c6
to
398774c
Compare
398774c
to
5072fa0
Compare
5072fa0
to
907da3a
Compare
907da3a
to
937841b
Compare
937841b
to
e82f89f
Compare
e82f89f
to
380fea0
Compare
380fea0
to
2d9be58
Compare
Description:
This is a POC to fix #1517. Skipping detectors on chunks that are known to be problematic (e.g., #1460) should improve performance by reducing the number of false-positives and extraneous network requests.
Any feedback and suggestions are welcome. (Also, I have yet to test whether this specific code works; emphasis on "concept".)
Checklist:
make test-community
)?make lint
this requires golangci-lint)?