AP_DDS: Automatic reconnect to MicroROS Agent not working #23372

Closed
Ryanf55 opened this issue Mar 31, 2023 · 9 comments · Fixed by #25228
Labels
BUG For-4.5 Planned for 4.5 release ROS

Comments

@Ryanf55
Collaborator

Ryanf55 commented Mar 31, 2023

If the connection between the autopilot and the companion computer is flaky or severed, or if the micro-ROS agent restarts at runtime, the connection is not recovered.

The scope of this issue is to:

  • Add a callback that is triggered when the connection (heartbeat) is lost
  • Send a GCS message when the connection is lost
  • Add recovery behavior to re-initialize publishers, subscribers, data writers, data readers, and topics
  • Stretch: Add a system test to check recovery behavior
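
A minimal sketch of the first three items, assuming a "heartbeat" means any successful ping or message from the agent. ConnectionMonitor, on_lost and on_connected are hypothetical names, not AP_DDS API; the callbacks stand in for sending a GCS text message and for re-creating the DDS entities.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical connection monitor (not AP_DDS code).
// Fires on_lost once when no heartbeat has been seen for timeout_ms,
// and on_connected on every transition back to the connected state.
class ConnectionMonitor {
public:
    ConnectionMonitor(uint32_t timeout_ms,
                      std::function<void()> on_lost,
                      std::function<void()> on_connected)
        : timeout_ms_(timeout_ms),
          on_lost_(std::move(on_lost)),
          on_connected_(std::move(on_connected)) {}

    // Call whenever a heartbeat (e.g. a ping reply) is received.
    void heartbeat(uint32_t now_ms) {
        last_heartbeat_ms_ = now_ms;
        if (!connected_) {
            connected_ = true;
            if (on_connected_) on_connected_();  // (re)create publishers, subscribers, topics
        }
    }

    // Call periodically from the update loop.
    // Unsigned subtraction stays correct across a uint32_t millis wrap.
    void update(uint32_t now_ms) {
        if (connected_ && now_ms - last_heartbeat_ms_ > timeout_ms_) {
            connected_ = false;
            if (on_lost_) on_lost_();  // e.g. notify the GCS, fires once per loss
        }
    }

    bool connected() const { return connected_; }

private:
    uint32_t timeout_ms_;
    uint32_t last_heartbeat_ms_{0};
    bool connected_{false};
    std::function<void()> on_lost_;
    std::function<void()> on_connected_;
};
```

Because on_connected fires on every (re)connect, the same callback can perform both the initial entity creation and the recovery re-initialization.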
@Ryanf55 Ryanf55 added For-4.5 Planned for 4.5 release ROS labels Mar 31, 2023
@Ryanf55 Ryanf55 added this to DDS/ROS2 Mar 31, 2023
@Ryanf55 Ryanf55 moved this to 🔖 Ready in DDS/ROS2 Apr 5, 2023
@Ryanf55 Ryanf55 added the BUG label May 5, 2023
@Ryanf55 Ryanf55 changed the title AP_DDS: Support automatic reconnect to MicroROS Agent AP_DDS: Automatic reconnect to MicroROS Agent not working May 5, 2023
@vibgyor-s

vibgyor-s commented Sep 12, 2023

Currently facing an issue with running the micro-ROS agent with SITL (UDP): the agent terminates with "bad_array_new_length", so no topics are published on ROS 2.

Restarting the micro-ROS agent does not fix this either, because the client does not reconnect.
What could be the reason for the following error?


[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: Frame: QUAD/PLUS
[micro_ros_agent-1] [1694444766.334808] info | Root.cpp | create_client | create | client_key: 0xAAAABBBB, session_id: 0x81
[micro_ros_agent-1] [1694444766.335212] info | SessionManager.hpp | establish_session | session established | client_key: 0xAAAABBBB, address: 127.0.0.1:36817
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3]
[micro_ros_agent-1] terminate called after throwing an instance of 'std::bad_array_new_length'
[micro_ros_agent-1] what(): std::bad_array_new_length
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: ArduPilot Ready
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: AHRS: DCM active
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: DDS Client: Init Complete
[ERROR] [micro_ros_agent-1]: process has died [pid 90931, exit code -6, cmd '/home/vibsin/workspace/DroneSim/ros2_ardup_ws/install/micro_ros_agent/lib/micro_ros_agent/micro_ros_agent udp4 --middleware dds --port 2019 --refs /home/vibsin/workspace/DroneSim/ros2_ardup_ws/install/ardupilot_sitl/share/ardupilot_sitl/config/dds_xrce_profile.xml --ros-args -r __node:=micro_ros_agent -r __ns:=/'].
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: XRCE Client: Participant session request failure
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: DDS Client: Creation Requests failed
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: RC7: SaveWaypoint LOW
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] paramftp: bad count 1327 should be 1325
[mavproxy.py --out 127.0.0.1:14550 --out 127.0.0.1:14551 --master tcp:127.0.0.1:5760 --sitl 127.0.0.1:5501 --non-interactive -3] AP: ArduCopter V4.5.0-dev (768e240)

@srmainwaring
Contributor

@vibgyor-s I can't tell what might be causing this from the log you posted. Could you post all the steps required to replicate it (terminal commands and the full log) and some details about the system you're running?

@srmainwaring srmainwaring self-assigned this Oct 10, 2023
@srmainwaring
Contributor

srmainwaring commented Oct 10, 2023

Notes

The PX4 uxrce_dds_client has some support for reconnecting to the micro-ROS agent if the connection is dropped. It makes use of the uxr ping functions declared in uxr/client/util/ping.h to monitor the connection status:

  • uxr_ping_agent
  • uxr_ping_agent_attempts

To implement similar behaviour in ArduPilot AP_DDS we need the following:

  • Add a member variable status_ok and assign it the result of uxr_run_session_time in update.
  • Reserve the member variable connected for the result of a ping test.
  • Split the init function into init_transport and init_session.
  • Do an initial ping test after initialising the transport.
  • Add a reconnect loop around the update loop in main_loop.
  • Add a call to uxr_delete_session_retries if the connection is dropped.
  • Add a periodic ping test to the update loop.
  • Close the transport in the AP_DDS_Client destructor.
  • Move uxr_init_session out of ddsSerialInit and ddsUdpInit, as it must be called on every reconnect.
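
The plan above can be sketched as follows. This is a simplified, single-threaded simulation, not the AP_DDS_Client code: SimulatedAgentLink and its methods are hypothetical stand-ins for the uxr_* transport and session calls, and agent availability is scripted per loop cycle so the sketch terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simulation of the client's link to the micro-ROS agent.
// Each method stands in for a uxr_* call; none of this is AP_DDS code.
struct SimulatedAgentLink {
    std::vector<bool> agent_up;  // scripted agent availability per loop cycle
    int now{0};                  // current loop cycle
    int sessions_created{0};
    int sessions_deleted{0};

    bool alive() const {
        return now < static_cast<int>(agent_up.size()) &&
               agent_up[static_cast<std::size_t>(now)];
    }
    bool init_transport() { return true; }         // open serial/UDP once
    bool ping() const { return alive(); }          // cf. uxr_ping_agent_attempts
    bool init_session() {                          // cf. uxr_init_session, called on every reconnect
        ++sessions_created;
        return alive();
    }
    bool run_session() const { return alive(); }   // cf. uxr_run_session_time -> status_ok
    void delete_session() { ++sessions_deleted; }  // cf. uxr_delete_session_retries
};

// Sketch of main_loop: a reconnect loop wrapped around the update loop.
void main_loop(SimulatedAgentLink& link) {
    if (!link.init_transport()) {
        return;
    }
    const int cycles = static_cast<int>(link.agent_up.size());
    while (link.now < cycles) {
        // Reconnect loop: wait for a ping reply, then open a fresh session.
        if (!link.ping() || !link.init_session()) {
            ++link.now;
            continue;
        }
        // Update loop: run while status_ok holds, with a periodic ping.
        bool status_ok = true;
        while (status_ok && link.now < cycles) {
            status_ok = link.run_session() && link.ping();
            ++link.now;
        }
        // Connection dropped (or simulation ended): delete the session.
        link.delete_session();
    }
}
```

With the agent scripted as down, up, down, and up again, the loop opens a new session on each reconnect, consistent with the fix noted under Issues.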

Tracking in: #25228

Issues

1. micro-ROS agent is restarted

  • Issue: the agent can be stopped and restarted once and the client reconnects, but on a second restart the uxr client library segfaults.
  • Fix: uxr_init_session must be called when reconnecting, so move it out of AP_DDS_Serial and AP_DDS_UDP.

Testing

Figure: reconnection after micro-ROS agent is repeatedly restarted.

@srmainwaring srmainwaring moved this from 🔖 Ready to 🏗 In progress in DDS/ROS2 Oct 10, 2023
@srmainwaring srmainwaring moved this from 🏗 In progress to 👀 In review in DDS/ROS2 Oct 10, 2023
@KyleJewiss

Hi @srmainwaring, I've merged your "pr_dds_reconnect" branch. I'm seeing an odd issue: once I disconnect the DDS client, it reports "disconnecting", but a couple of seconds later it reports "exit". After this exit I can't reconnect to the client without a power reset. Any help would be awesome, cheers.


@srmainwaring
Contributor

Hi @KyleJewiss, thanks for testing the PR. The timeout after 10 seconds is intentional.

If the connection cannot be re-established within 10 s, the loop exits:

        // check ping
        const uint64_t ping_timeout_ms{1000};
        const uint8_t ping_max_attempts{10};
        if (!uxr_ping_agent_attempts(comm, ping_timeout_ms, ping_max_attempts)) {
            GCS_SEND_TEXT(MAV_SEVERITY_ERROR, "DDS Client: No ping response, exiting");
            return;
        }
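
The 10 s window follows from the two constants: up to 10 attempts, each waiting up to 1000 ms. The sketch below makes that arithmetic explicit with a hypothetical ping_agent_attempts stand-in; it assumes each unanswered attempt costs a full timeout and an answered one costs nothing, which simplifies the real uxr_ping_agent_attempts.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Result of a simulated ping sequence.
struct PingResult {
    bool ok;
    uint64_t elapsed_ms;  // simulated time spent waiting for replies
};

// Hypothetical stand-in for the retry behaviour of uxr_ping_agent_attempts.
// agent_replies(attempt) simulates whether the agent answers that attempt.
PingResult ping_agent_attempts(const std::function<bool(uint8_t)>& agent_replies,
                               uint64_t timeout_ms, uint8_t max_attempts) {
    uint64_t elapsed_ms = 0;
    for (uint8_t attempt = 0; attempt < max_attempts; ++attempt) {
        if (agent_replies(attempt)) {
            return {true, elapsed_ms};  // assume a reply arrives instantly
        }
        elapsed_ms += timeout_ms;  // an unanswered attempt costs a full timeout
    }
    return {false, elapsed_ms};
}
```

With ping_timeout_ms = 1000 and ping_max_attempts = 10, an agent that never answers is given up on after 10 × 1000 ms = 10 s.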

We need to implement fall-back behaviour in a future PR.
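
One possible fall-back, sketched here only as an assumption and not as the planned design, is to keep retrying instead of returning, with a capped exponential backoff between ping windows. ReconnectBackoff is a hypothetical helper. This addresses only the loop exiting after 10 s; the USB-to-serial replug case mentioned below may also need the transport to be reopened.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fall-back policy: after a failed ping window, wait before
// trying again, doubling the wait each time up to max_ms.
class ReconnectBackoff {
public:
    ReconnectBackoff(uint32_t initial_ms, uint32_t max_ms)
        : initial_ms_(initial_ms), max_ms_(max_ms), next_ms_(initial_ms) {}

    // Returns the delay before the next ping window and advances the backoff.
    uint32_t on_failure() {
        const uint32_t delay = next_ms_;
        // Compare before doubling so the multiplication cannot overflow.
        next_ms_ = (next_ms_ > max_ms_ / 2) ? max_ms_ : next_ms_ * 2;
        return delay;
    }

    // Reset once the agent answers again.
    void on_success() { next_ms_ = initial_ms_; }

private:
    uint32_t initial_ms_;
    uint32_t max_ms_;
    uint32_t next_ms_;
};
```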

@KyleJewiss

Good to know. Thanks for the code and the reply @srmainwaring. Have a good one

@srmainwaring
Contributor

By the way, were you testing in SITL or on hardware?

At the moment we can manage a reconnect of the client if the micro-ROS agent dies and is respawned (within 10s).

Unplugging and reconnecting a USB-to-serial adapter between a flight controller and a PC is not working. I have not tested a connection between an FCU and the GPIO pins on a companion computer such as an RPi4.

@KyleJewiss

We were testing on hardware, so that makes sense. We can close the agent and reconnect within those 10 seconds, but if we take longer, we need to unplug and plug back in.

@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in DDS/ROS2 Nov 10, 2023
@Ryanf55
Collaborator Author

Ryanf55 commented Dec 8, 2023

Replying to @vibgyor-s, whose micro-ROS agent terminates with "bad_array_new_length" when running with SITL (UDP):

This is related to the unresolved issue micro-ROS/micro-ROS-Agent#205.

Projects
Status: Done

4 participants