Developer Environment

GXF Failure due to Tensor Stream Creation Failed

When launching an Isaac ROS node on a Jetson that is accessed remotely with X11 forwarding enabled, the X events emitted by CUDA while initializing a tensor stream are captured by the remote machine instead of the local Jetson.

Symptom

[component_container_mt-2] 2023-09-15 14:33:11.362 ERROR /workspaces/isaac_ros-dev/src/isaac_ros_image_pipeline/isaac_ros_image_proc/gxf/tensorops/extensions/tensorops/components/TensorStream.cpp@104: tensor stream creation failed.
[component_container_mt-2] 2023-09-15 14:33:11.363 ERROR gxf/std/entity_warden.cpp@380: Failed to initialize component 00292 (stream)
[component_container_mt-2] 2023-09-15 14:33:11.363 ERROR gxf/core/runtime.cpp@616: Could not initialize entity 'IPOQAFVQYV_global' (E290): GXF_FAILURE
[component_container_mt-2] 2023-09-15 14:33:11.363 ERROR gxf/std/program.cpp@205: Failed to activate entity 00290 named IPOQAFVQYV_global: GXF_FAILURE
[component_container_mt-2] 2023-09-15 14:33:11.363 ERROR gxf/std/program.cpp@207: Deactivating...
[component_container_mt-2] 2023-09-15 14:33:11.396 ERROR gxf/core/runtime.cpp@1227: Graph activation failed with error: GXF_FAILURE
[component_container_mt-2] [ERROR] [1694759591.396604150] [left_decoder_node]: [NitrosContext] GxfGraphActivate Error: GXF_FAILURE
[component_container_mt-2] [INFO] [1694759591.396906843] [right_decoder_node]: [NitrosContext] Loading application: '/workspaces/isaac_ros-dev/install/isaac_ros_h264_decoder/share/isaac_ros_h264_decoder/EZXPATGEMM.yaml'
[component_container_mt-2] [ERROR] [1694759591.396922620] [left_decoder_node]: [NitrosNode] runGraphAsync Error: GXF_FAILURE
[component_container_mt-2] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-2] what(): [NitrosNode] runGraphAsync Error: GXF_FAILURE

Solution

Disable X11 forwarding so that X events are no longer intercepted by the remote machine; this resolves the error.
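
For example, reconnect to the Jetson with X11 forwarding explicitly disabled (a sketch; ssh -x disables X11 forwarding, and ${JETSON_IP} is a placeholder for your Jetson's address):

# Reconnect without X11 forwarding (-x explicitly disables it)
ssh -x ${USER}@${JETSON_IP}

# Alternatively, in an existing session, stop X clients from using the forwarded display
unset DISPLAY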

Docker buildx plugin not installed

The Isaac ROS Dev scripts rely on Docker being configured with the buildx plugin, which is no longer installed by default.

Symptom

nvidia@ubuntu:~/workspaces/isaac_ros-dev/src/isaac_ros_common$ ./scripts/run_dev.sh
Building aarch64.ros2_humble.ros1_noetic.user base as image: isaac_ros_dev-aarch64 using key aarch64.ros2_humble.ros1_noetic.user
Using configured docker search paths: /home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../../isaac_ros_nitros_bridge/docker /home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker
Using base image name not specified, using ''
Using docker context dir not specified, using Dockerfile directory
Resolved the following Dockerfiles for target image: aarch64.ros2_humble.ros1_noetic.user
/home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker/Dockerfile.user
/home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../../isaac_ros_nitros_bridge/docker/Dockerfile.ros1_noetic
/home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker/Dockerfile.aarch64.ros2_humble
Building /home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker/Dockerfile.aarch64.ros2_humble as image: aarch64-ros2_humble-image with base:
ERROR: BuildKit is enabled but the buildx component is missing or broken.
      Install the buildx component to build images with BuildKit:
      https://docs.docker.com/go/buildx/
Failed to build base image: isaac_ros_dev-aarch64, aborting.

Solution

Install the docker-buildx-plugin by following the official Docker install instructions:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install docker-buildx-plugin
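
You can verify that the plugin is now available:

docker buildx version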

Repository cloned without pulling git-lfs files

Symptom

Some isaac_ros packages fail to build, or some image files in the repository are empty.

Solution

Many Isaac ROS repositories are configured with Git LFS. For example, the isaac_ros_visual_slam repository uses Git LFS to accommodate the large rosbag files used for its tests.

To check whether any LFS files have not been resolved, first list all the files that are supposed to be managed by Git LFS:

cd ${ISAAC_ROS_WS}/src/isaac_ros_common/ && \
  git lfs ls-files -s

Example output of git lfs ls-files -s:

945a1a82ee - docker/tao/tao-converter-aarch64-tensorrt8.4.zip (36 KB)
cfbf5fbcee - resources/Isaac_sim_app_launcher.png (68 KB)
448d244a3e - resources/Isaac_sim_app_terminal.png (57 KB)
750f7de371 - resources/graph_read-write-speed.png (77 KB)
c29c6d6a3a - resources/isaac_ros_common_tools.png (30 KB)
87e38eea78 - resources/isaac_sim_initial_screen.png (265 KB)
e2292bc330 - resources/isaac_sim_nucleus_setup.png (221 KB)
ea8b297681 - resources/isaac_sim_ros_bridge.png (300 KB)

Now check the actual size of those files on disk:

git lfs ls-files -s | awk '{ print $3 }' | xargs ls -l

Example output of git lfs ls-files -s | awk '{ print $3 }' | xargs ls -l:

-rw-rw-r-- 1 jetson jetson 130 Mar 22 08:56 docker/tao/tao-converter-aarch64-tensorrt8.4.zip
-rw-rw-r-- 1 jetson jetson 130 Mar 22 08:56 resources/graph_read-write-speed.png
-rw-rw-r-- 1 jetson jetson 130 Mar 22 08:56 resources/isaac_ros_common_tools.png
-rw-rw-r-- 1 jetson jetson 130 Mar 22 08:56 resources/Isaac_sim_app_launcher.png
-rw-rw-r-- 1 jetson jetson 130 Mar 22 08:56 resources/Isaac_sim_app_terminal.png
-rw-rw-r-- 1 jetson jetson 131 Mar 22 08:56 resources/isaac_sim_initial_screen.png
-rw-rw-r-- 1 jetson jetson 131 Mar 22 08:56 resources/isaac_sim_nucleus_setup.png
-rw-rw-r-- 1 jetson jetson 131 Mar 22 08:56 resources/isaac_sim_ros_bridge.png

In the above example, every file is only 130 or 131 bytes, far smaller than the sizes reported by git lfs ls-files -s. This shows that the LFS files have not actually been pulled; only the small LFS pointer files are present on disk.
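
You can also confirm this directly by printing a file's contents; an unresolved Git LFS file is a small text pointer of the following form (shown for one of the files above, with the oid and size values elided):

head resources/isaac_ros_common_tools.png
version https://git-lfs.github.com/spec/v1
oid sha256:...
size ...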

Optionally, if you know a directory that should contain a large LFS file, you can check its size.

du -hs ${ISAAC_ROS_WS}/src/isaac_ros_visual_slam/isaac_ros_visual_slam/test/test_cases/rosbags/
20K     /home/username/workspaces/isaac_ros-dev/src/isaac_ros_visual_slam/isaac_ros_visual_slam/test/test_cases/rosbags/

If any of the above is the case, install git lfs and manually pull the LFS files; an installation sketch follows the pull commands below.

cd ${ISAAC_ROS_WS}/src/isaac_ros_common && \
  git lfs pull

or

cd ${ISAAC_ROS_WS}/src/isaac_ros_visual_slam && \
  git lfs pull -X "" -I isaac_ros_visual_slam/test/test_cases/rosbags/
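
If git lfs itself is not installed yet, a typical installation on Ubuntu looks like this (a sketch; git-lfs is available from the standard Ubuntu repositories):

sudo apt-get update && \
  sudo apt-get install -y git-lfs && \
  git lfs install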

X11 forwarding issue

Symptom

When you run an X11 window application inside the Isaac ROS container, you see an error message:

"X11 connection rejected because of wrong authentication."

Solution

Go back to your Docker host and SSH in again from your remote machine with the -X option.

ssh -X ${USER}@${IP}

Check whether you see an error message like this:

/usr/bin/xauth:  /home/jetson/.Xauthority not writable, changes will be ignored
/usr/bin/xauth:  /home/jetson/.Xauthority not writable, changes ignored

If so, delete the ~/.Xauthority directory.

sudo rm -rf ~/.Xauthority/

Log out and log back in with ssh -X.

Check that you now have a .Xauthority file owned by your user:

~$ ll ~/.Xauthority
-rw------- 1 jetson jetson 61 Mar 22 14:08 /home/jetson/.Xauthority

X11 forwarding should now work. Test it with xeyes on your Docker host.
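
For example (xeyes is provided by the x11-apps package on Ubuntu):

echo $DISPLAY
xeyes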

Then, launch your container and try running the app that requires X11 forwarding.

ROS 2 Domain ID Collision

Symptom

  • You see ROS 2 topics that your system is not (yet) publishing

  • ROS 2 topics don't go away even after you stop the node that you think is publishing them

Solution

By default, all ROS 2 nodes use domain ID 0, so it is possible you are seeing topics from other machines on the network (docs.ros.org).

Solution 1:

Export ROS_DOMAIN_ID in every bash instance inside the Isaac ROS container:

export ROS_DOMAIN_ID=42
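
To avoid retyping this in every shell, you can append the export to the container user's ~/.bashrc (a sketch; adjust if your shell initialization lives elsewhere):

echo 'export ROS_DOMAIN_ID=42' >> ~/.bashrc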

Solution 2:

Declare ROS_DOMAIN_ID on the host once, and modify the run_dev.sh script to carry the environment variable into the container.

vi ${ISAAC_ROS_WS}/src/isaac_ros_common/scripts/run_dev.sh

Find the portion that lists all the environment variables to carry over, and add a line like this:

DOCKER_ARGS+=("-v /tmp/.X11-unix:/tmp/.X11-unix")
DOCKER_ARGS+=("-v $HOME/.Xauthority:/home/admin/.Xauthority:rw")
DOCKER_ARGS+=("-e DISPLAY")
DOCKER_ARGS+=("-e ROS_DOMAIN_ID")

No /opt/ros/humble/install directory in Docker container

Symptom

You expect the /opt/ros/humble/install directory to exist inside the Isaac ROS Dev Docker container, but this directory does not exist.

This is the result of a change made in the Isaac ROS 2.1 release in November 2023.

In Isaac ROS 2.1, all base ROS 2 Humble packages are installed using pre-built Debian packages produced by the NVIDIA ROS 2 Buildfarm. Previously, these base packages were built from source in a customized global underlay workspace rooted at /opt/ros/humble/.

Since the base packages are no longer built from source inline within the image, the ./install subfolder of /opt/ros/humble is never created.

Solution

Solution 1:

Move any user-specific ROS 2 packages out of the global underlay workspace to an appropriate local overlay workspace, as recommended by the official ROS documentation.

Then, build and source this overlay workspace normally.
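
A minimal overlay workflow might look like this (a sketch assuming your packages live under ${ISAAC_ROS_WS}/src and ROS 2 Humble is already sourced):

cd ${ISAAC_ROS_WS} && \
  colcon build --symlink-install && \
  source install/setup.bash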

Solution 2:

Carefully recreate the customized global underlay workspace by following the steps used in Isaac ROS 2.0 and older releases:

# EXAMPLE: Install OSRF negotiated ROS 2 package from source

ENV ROS_DISTRO=humble
ENV ROS_ROOT=/opt/ros/${ROS_DISTRO}

RUN apt-get update && mkdir -p ${ROS_ROOT}/src && cd ${ROS_ROOT}/src \
   && git clone https://github.com/osrf/negotiated && cd negotiated && git checkout master \
   && source ${ROS_ROOT}/setup.bash \
   && cd negotiated_interfaces && bloom-generate rosdebian && fakeroot debian/rules binary \
   && cd ../ && apt-get install -y ./*.deb && rm ./*.deb \
   && cd negotiated && bloom-generate rosdebian && fakeroot debian/rules binary \
   && cd ../ && apt-get install -y ./*.deb && rm ./*.deb \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

This process involves overriding the behavior of bloom-generate to force it to produce and install the Debian packages inline.

run_dev.sh fails to build image with Docker certificate errors

Symptom

The Docker build of the base image fails with error messages of the following form:

tls: failed to verify certificate: x509: certificate has expired or is not yet valid

More context:

[+] Building 0.2s (2/2) FINISHED                                                                                                                                                                                 docker:default
=> [internal] load build definition from Dockerfile.aarch64                                                                                                                                                               0.0s
=> => transferring dockerfile: 8.45kB                                                                                                                                                                                     0.0s
=> ERROR [internal] load metadata for nvcr.io/nvidia/nvidia-aarch64-base:ubuntu22.04                                                                                                                                     0.1s
------
> [internal] load metadata for nvcr.io/nvidia/nvidia-aarch64-base:ubuntu22.04:
------
Dockerfile.aarch64:11
--------------------
   9 |     # Docker file for aarch64 based Jetson device
10 |     ARG BASE_IMAGE="nvcr.io/nvidia/nvidia-aarch64-base:ubuntu22.04"
11 | >>> FROM ${BASE_IMAGE}
12 |
13 |     # Store list of packages (must be first)
--------------------
ERROR: failed to solve: nvcr.io/nvidia/nvidia-aarch64-base:ubuntu22.04: failed to resolve source metadata for nvcr.io/nvidia/nvidia-aarch64-base:ubuntu22.04: failed to do request: Head "https://nvcr.io/v2/nvidia/nvidia-aarch64-base/manifests/ubuntu22.04";: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 1969-12-31T17:10:53-08:00 is before 2024-04-24T00:00:00Z
/mnt/nova_ssd/workspaces/isaac_ros-dev/src/isaac_ros_common

Solution

Restarting the Docker daemon should resolve the issue:

sudo systemctl restart docker
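
Note that the log above reports a current time of 1969-12-31, which suggests the system clock was not set; if the error persists after restarting Docker, check that the clock is synchronized (a sketch using the standard timedatectl tool):

timedatectl status
sudo timedatectl set-ntp true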

Failed to load a NITROS-enabled node with intraprocess communication enabled

When launching a NITROS-enabled ROS node with use_intra_process_comms explicitly set to True, the following error can occur.

Symptom

[component_container-1] [ERROR] [1715506914.017761602] [camera_1.camera_1_container]: Component constructor threw an exception: intraprocess communication allowed only with volatile durability
[ERROR] [launch_ros.actions.load_composable_nodes]: Failed to load node 'visual_slam_node' of type 'nvidia::isaac_ros::visual_slam::VisualSlamNode' in container '/camera_1/camera_1_container': Component constructor threw an exception: intraprocess communication allowed only with volatile durability

Solution

NITROS nodes already enable intraprocess communication by default, with the exception of a few publishers and subscribers that do not participate in data transfer. Setting use_intra_process_comms to False resolves the error and does not hamper performance.
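
For example, when loading the component from the command line, the setting can be passed as an extra argument using the standard ros2 component CLI (a sketch; the container, package, and plugin names are taken from the log above):

ros2 component load /camera_1/camera_1_container isaac_ros_visual_slam \
  nvidia::isaac_ros::visual_slam::VisualSlamNode \
  -e use_intra_process_comms:=false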