
Data loss when using compression (ros2 bag record) #978

Closed
chrmel opened this issue Mar 25, 2022 · 10 comments
Labels
bug Something isn't working

Comments

@chrmel

chrmel commented Mar 25, 2022

Description

When recording large bags with ros2 bag record --max-bag-size=2000000000 --compression-mode file --compression-format zstd, topics get lost during compression.

Expected Behavior

When a new bag is opened and the old one gets compressed, I would expect the new bag to contain all published topics (including those published while the compression is running).

Actual Behavior

While the just-closed bag (split due to max-bag-size) is being compressed, no topics are recorded in the new bag.

To Reproduce

  1. Start a system producing a lot of data (in my case a camera (1280x720 @ 10 fps) and motion data from a Bluetooth device)
  2. Record raw image topics and motion data with: ros2 bag record --max-bag-size=2000000000 --compression-mode file --compression-format zstd
  3. Replay the recorded bag

System

  • OS: Ubuntu 18.04
  • ROS 2 Distro: Foxy (built from source)
  • Version: ros2

Additional Information

When recording bags without compression this issue does not occur.

Suspicion

Is it possible that the compression of the bag runs in a new thread rather than a separate process? As you probably know, in Python a thread cannot be scheduled onto a different CPU core; only a process can. If recording and compressing are two threads running in the same process, could the compression thread block the recording thread by using all the resources of that one CPU core?
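The suspicion above can be illustrated with a small sketch. In CPython, two CPU-bound threads share the interpreter lock (GIL), so a heavy "compression" thread slows down a "recording" thread even on a multi-core machine. Note that, as clarified later in this thread, rosbag2's recording logic is actually C++, so the GIL itself does not apply there; this only demonstrates the general starvation concern the reporter raises.

```python
# Minimal sketch of thread starvation under the GIL: a busy
# "compressor" thread reduces how often a "recorder" thread runs.
import threading
import time

def count_iterations(stop_event, result):
    """Simulate a recorder loop: count how often it gets to run."""
    n = 0
    while not stop_event.is_set():
        n += 1
    result.append(n)

def busy_compress(stop_event):
    """Simulate a CPU-bound compressor competing for the GIL."""
    while not stop_event.is_set():
        sum(i * i for i in range(1000))  # pure-Python busy work

def measure(with_compressor, duration=0.5):
    """Run the recorder for `duration` seconds, optionally alongside
    the compressor, and return the recorder's iteration count."""
    stop = threading.Event()
    result = []
    threads = [threading.Thread(target=count_iterations, args=(stop, result))]
    if with_compressor:
        threads.append(threading.Thread(target=busy_compress, args=(stop,)))
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return result[0]

alone = measure(with_compressor=False)
contended = measure(with_compressor=True)
print(f"recorder iterations alone:     {alone}")
print(f"recorder iterations contended: {contended}")
```

On a typical machine the contended count drops noticeably, which is the behavior the reporter suspects (regardless of whether the real cause here is the GIL or plain CPU contention in C++).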

Thank you!

Possibly related to #973

@chrmel added the bug label Mar 25, 2022
@emersonknapp
Collaborator

For context, the threading logic all happens in a C++ layer - the Python CLI is only a thin wrapper around calling the C++ core.

You mention you're building from source; are you using the foxy branch, or the foxy-future branch? I might recommend foxy-future as it has many performance improvements and bugfixes that could not be released officially into Foxy due to API breakage.

@MichaelOrlov
Contributor

This may be related to #936, #866 and #647.

@chrmel
Author

chrmel commented Mar 30, 2022

> For context, the threading logic all happens in a C++ layer - the Python CLI is only a thin wrapper around calling the C++ core.

Thank you for the clarification about the Python CLI wrapping the C++ logic. I did not know how this worked.

> You mention you're building from source, are you using the foxy branch, or the foxy-future branch? I might recommend foxy-future as it has many performance improvements and bugfixes that could not be released officially into Foxy due to API breakage

Actually, I am not quite sure which branch. My work is largely based on the ros.foxy.Dockerfile from dusty-nv/jetson-containers, which uses rosinstall_generator --deps --rosdistro foxy ros_base ... to fetch the repos.

Update: I am in fact using the foxy branch. I tried building the foxy-future branch, but rosinstall_generator only allows the base branch names (foxy, galactic, ...).

@amacneil
Contributor

amacneil commented May 1, 2022

This is a very concerning bug - data loss is a worst-case scenario for a data recording tool. Has anyone tried to repro/investigate it in galactic/humble/rolling?

@clalancette
Contributor

> This is a very concerning bug - data loss is a worst case scenario for a data recording tool. Has anyone tried to repro/investigate it in galactic/humble/rolling?

I haven't investigated it myself, but looking at the Dockerfile in use, the original reporter is likely using the foxy branch of this repository. That branch has known performance and dataloss issues, which is why we recommend the foxy-future branch there. It would be interesting to see if the original problem can be reproduced with the foxy-future branch, which is much closer to what is in Galactic.

@amacneil
Contributor

amacneil commented May 3, 2022

Good point.

@chrmel have you tested this in Galactic?

@chrmel
Author

chrmel commented May 3, 2022

@amacneil I tried building from source with the galactic branch but it did not succeed. I can try testing it with the pre-built packages.

@chrmel
Author

chrmel commented May 3, 2022

@amacneil, @clalancette, @emersonknapp

Ok, I tested my setup with the pre-built Debian packages for the foxy and galactic distros.

Testing with sample data

Every step in the data represents the time a new split bag file is created.
(screenshot: sample data with a step at each bag split)

foxy

Recording data with ros2 bag record --max-bag-size=500000000 --compression-mode file --compression-format zstd /image_raw /image_raw/compressed /motion.

Same issue as described when built from source. Every time a new split bag file starts there is a data gap.

Play the bag with ros2 bag play --topics=/motion --rate=1.0 rosbag2_foxy/
(screenshot: /motion playback with a data gap at each split)

galactic and rolling

Recording data with ros2 bag record --max-bag-size=500000000 /image_raw /image_raw/compressed /motion.

I was not able to reproduce the problem because a different problem occurred.
It seems that split bag files cannot be played properly in galactic (with or without compression). When playing the bags, only the last file of the sequence of split files is played properly; preceding files seem to be either skipped or their data squashed at the beginning of the last split bag file.

Play the bag with ros2 bag play --read-ahead-queue-size 50000 --topics=/motion --rate=1.0 rosbag2/. Only the data from the last split bag file is played (you can see the last distinctive ripple in the data).
(screenshot: /motion playback showing only the final split file's data)

@MichaelOrlov
Contributor

The issue of only the last bag of a split being played was found before and is described in #966.

@MichaelOrlov
Contributor

Notes:

  1. With SQLite3 there is a current design limitation and suboptimal performance when --max-bag-size differs from 0.
    See Improving performance of should_split_bagfile() #647 (comment). It is recommended to either use the MCAP file format or not use the --max-bag-size parameter with the SQLite3 backend for better performance.
  2. Yes, there is a known issue that compression threads may consume all CPU resources, leaving the recording threads starved, which leads to messages being lost. The solution will be done in Add option to set compression threads priority #1457. However, there is no CLI parameter yet for the compression_thread_priority option; it will only be available via node parameters for the composable node. A follow-up PR adding a CLI parameter is welcome.

  • Closing this issue as stale and since workarounds already exist.
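For readers looking for the node-parameter workaround, a parameter file along these lines could set the compression thread priority. This is a hedged sketch: only the parameter name compression_thread_priority comes from the note above; the node name and the exact value semantics are assumptions, so check the rosbag2 documentation for your distro.

```yaml
# Hypothetical parameter file for the composable recorder node.
# The node name "recorder" is a placeholder; verify it for your setup.
recorder:
  ros__parameters:
    # Lower the compression threads' scheduling priority so they
    # cannot starve the recording threads (on Linux, a positive
    # value in the nice-value sense means lower priority).
    compression_thread_priority: 10
```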
