Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip detectors for known bad chunks #1741

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rgmz
Copy link
Contributor

@rgmz rgmz commented Sep 1, 2023

Description:

This is a POC to fix #1517. Skipping detectors on chunks that are known to be problematic (e.g., #1460) should improve performance by reducing the number of false-positives and extraneous network requests.

Any feedback and suggestions are welcome. (Also, I have yet to test whether this specific code works; emphasis on "concept".)

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

@rgmz rgmz requested a review from a team as a code owner September 1, 2023 04:17
@rgmz rgmz force-pushed the feat/detector-chunk-info branch 3 times, most recently from 702f879 to 4e47825 Compare September 7, 2023 17:16
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 4e47825 to 8e23ec8 Compare September 11, 2023 12:44
@rgmz rgmz marked this pull request as draft October 30, 2023 05:33
@rgmz rgmz force-pushed the feat/detector-chunk-info branch 3 times, most recently from ae8b74e to 048585d Compare December 31, 2023 21:53
@rgmz rgmz marked this pull request as ready for review December 31, 2023 21:53
@rgmz rgmz requested a review from a team as a code owner December 31, 2023 21:53
@@ -30,6 +33,14 @@ func (s Scanner) Keywords() []string {
return []string{"parseur"}
}

func (s Scanner) ScanChunk(chunk sources.Chunk) bool {
// TODO: Can |chunk.SourceMetadata| be nil?
if m, ok := chunk.SourceMetadata.GetData().(sources.GitSourceMetadata); ok {
Copy link
Contributor Author

@rgmz rgmz Dec 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested this. Sadly, it does not work because GetData returns {"github": ...} — life is not so easy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that you need to explicitly attempt to unmarshal the metadata to determine whether it's from Git?

Copy link
Contributor Author

@rgmz rgmz Jan 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that GetData() returned a top-level object, however, it actually returns an object like {"bitbucket":{...}} | {"github":{...}} | {"gitlab":{...}}. Only calling GetX() returns the inner object.

As far as I can tell, the only solution is to construct an ugly switch statement to handle each SourceType, similar to:

func SupportsLineNumbers(sourceType sourcespb.SourceType) bool {
switch sourceType {
case sourcespb.SourceType_SOURCE_TYPE_GIT,
sourcespb.SourceType_SOURCE_TYPE_GITHUB,
sourcespb.SourceType_SOURCE_TYPE_GITLAB,

Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like smuggling the interface to avoid changing the entire Detector signature, and I like the idea of using chunk metadata to help make the decision about whether a given detector should scan a given chunk - assuming we can find a more performant way to do the metadata type check than repeated attempts at unmarshaling.

As for the implied question of whether this is even worth doing: You've convinced me, but I'll leave space for others to weigh in if they have other opinions.

@@ -30,6 +33,14 @@ func (s Scanner) Keywords() []string {
return []string{"parseur"}
}

func (s Scanner) ScanChunk(chunk sources.Chunk) bool {
// TODO: Can |chunk.SourceMetadata| be nil?
if m, ok := chunk.SourceMetadata.GetData().(sources.GitSourceMetadata); ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that you need to explicitly attempt to unmarshal the metadata to determine whether it's from Git?

pkg/detectors/detectors.go Outdated Show resolved Hide resolved
Comment on lines 41 to 83
// GitSourceMetadata defines a common interface for Git-based source metadata.
// For example, this should match Git, Azure, Bitbucket, GitHub, and Gitlab.
type GitSourceMetadata interface {
GetRepository() string
GetCommit() string
GetFile() string
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea, assuming we can find a performant way to actually use it.

Copy link
Contributor Author

@rgmz rgmz Jan 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the performance implications of this right now? (Assuming that it works, which it doesn't seem to...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you have to cast the struct to interface{} first.

md := data.GetGit()
if d, ok := interface{}(md).(GitSource); ok {
   d.GetFile() // do something
}

pkg/detectors/detectors.go Outdated Show resolved Hide resolved
@CLAassistant
Copy link

CLAassistant commented Jan 13, 2024

CLA assistant check
All committers have signed the CLA.

@rgmz rgmz force-pushed the feat/detector-chunk-info branch 2 times, most recently from acc79eb to 654a95a Compare January 13, 2024 22:24
Comment on lines +43 to +52
Repository string
Commit string
File string
Copy link
Contributor Author

@rgmz rgmz Jan 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a struct is much less satisfying than an interface. >:(

Also, should these be pointers?

@rgmz
Copy link
Contributor Author

rgmz commented Jan 13, 2024

@rosecodym I've updated the implementation based on feedback and testing. Let me know your thoughts on this.

@rgmz rgmz force-pushed the feat/detector-chunk-info branch 3 times, most recently from 91841b9 to 7d0c42f Compare January 22, 2024 14:49
@rosecodym
Copy link
Collaborator

@rgmz we just had an internal conversation about this and a question came up: How much of this problem is associated with lockfiles specifically? If the answer is "a lot," does it make sense to either use file exclusions or to consider skipping them at the engine level rather than adding complexity to detectors?

@rgmz
Copy link
Contributor Author

rgmz commented Jan 22, 2024

I wouldn't say the problem is specific to lock files. There are lots of files that create a high volume of false detections for specific detectors, that doesn't mean the files should be skipped altogether — even lock files have a one-in-a-million chance to contain secrets.

@rgmz rgmz force-pushed the feat/detector-chunk-info branch 2 times, most recently from 3bbcf42 to bec2bc2 Compare January 27, 2024 17:05
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from bec2bc2 to 3f921c6 Compare February 3, 2024 19:24
@rgmz rgmz changed the title POC to skip detectors for known bad chunks Skip detectors for known bad chunks Feb 3, 2024
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 3f921c6 to 398774c Compare April 14, 2024 14:25
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 398774c to 5072fa0 Compare April 27, 2024 14:49
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 5072fa0 to 907da3a Compare June 5, 2024 00:54
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 907da3a to 937841b Compare December 2, 2024 14:07
@rgmz rgmz requested review from a team as code owners December 2, 2024 14:07
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 937841b to e82f89f Compare December 15, 2024 16:47
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from e82f89f to 380fea0 Compare December 31, 2024 15:19
@rgmz rgmz force-pushed the feat/detector-chunk-info branch from 380fea0 to 2d9be58 Compare January 11, 2025 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Pass additional Chunk information to detectors
3 participants