Fix flakiness on TestFilestreamMetadataUpdatedOnRename #42221

Open · wants to merge 3 commits into base: 8.x
18 changes: 18 additions & 0 deletions filebeat/input/filestream/input_integration_test.go
@@ -101,6 +101,24 @@
"prospector.scanner.check_interval": "1ms",
"prospector.scanner.fingerprint.enabled": false,
"file_identity.native": map[string]any{},

// For some reason this test became flaky. The root of the flakiness
// is not in the test itself, it is in how a rename operation is detected.
// Even though this test uses `os.Rename`, it does not seem to be an atomic
// operation. https://www.man7.org/linux/man-pages/man2/rename.2.html
// does not make it clear whether 'renameat' (used by `os.Rename`) is
// atomic.
//
// On a flaky execution, the file is actually perceived as removed
// and then a new file is created, both with the same inode. This
// happens on a system that does not reuse inodes as soon as they're
// freed. Because the file is detected as removed, its state is also
// removed. Then, when more data is added, only the offset of the new
// data is tracked by the registry, causing the test to fail.
@mauri870 (Member) commented on lines +112 to +117, Jan 6, 2025:
Could this be a bug with the file system watcher implementation? AFAIK rename is atomic on Linux.

The man page says:

> If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.

For this limitation the file watcher should have a debounce mechanism to avoid changing states multiple times within a short period.

Contributor Author:

I saw this bit of the man page, however, my understanding is that it is atomic only when "newpath" already exists:

> If newpath already exists, it will be atomically replaced

The documentation makes no statement about the behaviour when "newpath" does not exist.

> For this limitation the file watcher should have a debounce mechanism to avoid changing states multiple times within a short period.

We kind of have a debounce approach already: we only check the file system once every 10s (the default value of prospector.scanner.check_interval). At that point we list the files matched by the glob and diff it against the current list of files being harvested; if a file is not on the file system but is on our list, then the file has been removed.
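The scan-and-diff detection described here can be sketched as follows. The names (`detectRemoved`, the string slices) are illustrative, not Filebeat's actual implementation; in the real scanner the on-disk list would come from resolving the configured globs on each check interval.

```go
package main

import "fmt"

// detectRemoved mirrors the scan-and-diff idea: anything we are
// currently harvesting that no longer shows up in the latest glob
// snapshot is considered removed.
func detectRemoved(harvested, onDisk []string) []string {
	present := make(map[string]bool, len(onDisk))
	for _, p := range onDisk {
		present[p] = true
	}
	var removed []string
	for _, p := range harvested {
		if !present[p] {
			removed = append(removed, p)
		}
	}
	return removed
}

func main() {
	// b.log survived the scan, a.log did not.
	fmt.Println(detectRemoved([]string{"a.log", "b.log"}, []string{"b.log"})) // prints [a.log]
}
```

If a scan happens to land inside a rename's window, both the old and the new path can be absent from `onDisk`, which is exactly the false "removed" detection discussed in this thread.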

There hasn't been any significant change to the file watcher in a while. It is curious that this started happening now.

Member:

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

If our customers experience the same issue in their environments, it's not a flaky test; it's a bug in our implementation that needs to be fixed.

Contributor Author:

> How are we sure that this happens only in testing and our customers don't face the same issue running in production?

Honestly, we don't know. If that happens on a user's deployment, the observable effect will be data duplication for that specific file, which is not a good outcome, but at least it is not data loss.

One particularity of this test, which is very likely what makes it flaky, is that prospector.scanner.check_interval is set to 1ms. Our docs are clear: we do not recommend values below 1s. The reason is that most file systems track modification times in seconds, so scanning faster than that can have unexpected results.

Honestly, if a rename operation in a real production environment takes more than 10s (the default value), there are likely bigger problems happening there.

I do not believe it's a bug in our implementation. We take a "snapshot" of the files (well, of the globs defined in paths) and compare it with the list of files we know; if a file is not on the file system but is on our list, it has been removed. There isn't much more (if anything) we can do without getting into the inner workings of the kernel and file system. We have reasonable defaults and we document the pitfalls users might run into. Our implementation also behaves the same across all OSes and file systems.

Even inotify does not handle renames directly:

> Rename events are not handled directly; i.e., inotify issues two separate events that must be examined and matched in a context of potential race conditions.

Which is pretty much the issue this PR is circumventing.

Contributor:

> How are we sure that this happens only in testing and our customers don't face the same issue running in production?
>
> If our customers experience the same issue in their environments, it's not a flaky test; it's a bug in our implementation that needs to be fixed.

I could be wrong, but isn't this a limitation of how the native file identity interacts with inode reuse? In which case the "fix" is to use fingerprint, which shouldn't get confused by a rename.

Contributor Author:

This case has nothing to do with inode reuse; I manually validated that. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I ran some tests quickly deleting and creating different files, and the inodes are never reused.

The problem is that when we resolve the glob, none of the files are there: neither the original nor the "destination".

With a prospector.scanner.check_interval this small, there is a chance that when we resolve the glob it resolves to nothing, and only later does the renamed file appear.

Contributor:

> This case has nothing to do with inode reuse; I manually validated that. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I ran some tests quickly deleting and creating different files, and the inodes are never reused.
>
> The problem is that when we resolve the glob, none of the files are there: neither the original nor the "destination".
>
> With a prospector.scanner.check_interval this small, there is a chance that when we resolve the glob it resolves to nothing, and only later does the renamed file appear.

OK, so I misunderstood the comment about inodes. Maybe we should mention this possibility in the prospector.scanner.check_interval docs.

//
// A workaround for this is to not remove the state when the file is
// removed, hence `clean_removed: false` is set here.
"clean_removed": false,
})

testline := []byte("log line\n")
@@ -339,7 +357,7 @@
line := []byte("first line\n")
buf := bytes.NewBuffer(nil)
writer := transform.NewWriter(buf, encoder)
writer.Write(line)

Check failure on line 360 (GitHub Actions / lint (linux)): Error return value of `writer.Write` is not checked (errcheck)
writer.Close()

env.mustWriteToFile(testlogName, buf.Bytes())
@@ -1088,7 +1106,7 @@
}
n := 0
for n <= iterations {
f.Write([]byte(fmt.Sprintf("hello world %d\n", r*iterations+n)))

Check failure on line 1109 (GitHub Actions / lint (linux)): Error return value of `f.Write` is not checked (errcheck)
n += 1
time.Sleep(100 * time.Millisecond)
}