Maximize GPU network bandwidth with GPUDirect-TCPX and multi-networking

This page shows you how to maximize network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) clusters in Standard mode. This page is intended for machine learning (ML) engineers and platform administrators who facilitate ML workloads. You should already be familiar with networking technologies such as network interface cards (NICs) and TCP, and with accelerator technologies like the NVIDIA Collective Communications Library (NCCL).

Artificial intelligence (AI), ML, and high performance computing (HPC) applications require powerful acceleration to optimize performance by reducing job completion times. For example, ML models that focus on conversational AI and image generation require high scalability and compute power.

About Google Cloud GPU supercomputers

Google Cloud has accelerator-optimized supercomputers that are built for scalable, massive models. These machines have the following benefits:

Eight NVIDIA H100 GPUs per machine.
Up to 200 Gbps bandwidth on the primary NIC.
Up to four secondary NICs, each supporting up to 200 Gbps bandwidth for GPU data transfer.
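
As a rough worked example of what these figures imply, the following shell sketch multiplies the per-NIC ceiling by the number of secondary NICs and converts the result to gigabytes per second. The function names are illustrative, and the result is a theoretical ceiling derived from the "up to" figures above, not a measured throughput.

```shell
# Aggregate line rate in Gbps for NIC_COUNT ($1) NICs at GBPS_PER_NIC ($2) each.
aggregate_gbps() {
  echo $(( $1 * $2 ))
}

# Convert gigabits per second to gigabytes per second (8 bits per byte).
gbps_to_gigabytes() {
  echo $(( $1 / 8 ))
}

total=$(aggregate_gbps 4 200)
echo "${total} Gbps, or $(gbps_to_gigabytes "$total") GB/s"
# prints: 800 Gbps, or 100 GB/s
```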

For a full list of benefits, see A3 machine series in the Compute Engine documentation.

Your GKE workload must use all available GPUs and all available secondary NICs on a single node and use a significant portion of the available bandwidth. The solution described in this document is ideal for workloads that require high performance, high throughput, and low latency.

Required features and capabilities for maximized bandwidth

To maximize your network bandwidth in GPU supercomputer nodes, use all of the following features:

GPUDirect-TCPX: Reduce the overhead required to transfer packet payloads to and from GPUs, which significantly improves throughput at scale compared to GPUs that don't use GPUDirect-TCPX.
gVNIC: Enable GPUDirect-TCPX capabilities such as packet header splitting, flow steering, and buffer management. gVNIC is required to use GPUDirect-TCPX. For details about gVNIC, see Increase network traffic speed for GPU nodes.
Multi-networking: Add secondary NICs to the accelerator-optimized machine. For A3 machines, multi-networking adds four additional NICs. Each NIC is associated with a separate subnet in its own VPC to avoid conflicts. For details about multi-network support, see Set up multi-network support for Pods.
Placement policies: Use a resource placement policy to place all GPU nodes for a specific workload on physically close servers to minimize latency. For details, see Define compact placement for GKE nodes.

Procedure outline

To use all of these capabilities together, you'll do the following:

Create Virtual Private Cloud (VPC) networks and subnets
Create the GKE environment:
1. Create a cluster with multi-networking enabled
2. Create a node pool with the following characteristics:
  1. gVNIC enabled
  2. Multi-networking subnets specified for each secondary NIC
  3. A3 machine series with H100 GPUs (four secondary NICs and eight GPUs) backing the nodes
  4. Latest NVIDIA drivers installed
Install the GPUDirect-TCPX binary and the NCCL plugin
Deploy a test workload to verify GPUDirect-TCPX setup

Before you begin

Before you start, make sure you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. By setting default locations, you can avoid errors in gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location.

Ensure that you have enough quota for H100 GPUs. To request more quota, see GPU quotas.

Requirements

GPUDirect-TCPX is supported on GKE version 1.27 or later and requires:
- For GKE version 1.27, use GKE patch version 1.27.7-gke.1121000 or later.
- For GKE version 1.28, use GKE patch version 1.28.8-gke.1095000 or later.
- For GKE version 1.29, use GKE patch version 1.29.3-gke.1093000 or later.
Your GPU nodes must use NVIDIA driver version 535 or later.
You must use GKE Dataplane V2.
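
The version requirements above can be checked mechanically. The following is a minimal sketch, assuming version strings of the form MAJOR.MINOR.PATCH-gke.BUILD. The function names are hypothetical, and the two versions in the demo loop are illustrative inputs. Minor versions that the list doesn't cover are reported as unsupported, because this page doesn't state a minimum for them.

```shell
# Minimum patch version for each minor version listed in the requirements.
# Returns 1 for minor versions that the list doesn't cover.
min_tcpx_version() {
  case "$1" in
    1.27.*) echo "1.27.7-gke.1121000" ;;
    1.28.*) echo "1.28.8-gke.1095000" ;;
    1.29.*) echo "1.29.3-gke.1093000" ;;
    *) return 1 ;;
  esac
}

# Return 0 if GKE version $1 is greater than or equal to version $2.
version_ge() {
  # Turn both versions into eight numeric fields: a b c d  e f g h.
  set -- $(printf '%s %s' "$1" "$2" | sed 's/-gke//g; s/\./ /g')
  [ "$#" -eq 8 ] || return 2
  [ "$1" -ne "$5" ] && { [ "$1" -gt "$5" ]; return; }
  [ "$2" -ne "$6" ] && { [ "$2" -gt "$6" ]; return; }
  [ "$3" -ne "$7" ] && { [ "$3" -gt "$7" ]; return; }
  [ "$4" -ge "$8" ]
}

# Return 0 if the given GKE version meets the listed minimum for its minor version.
supports_tcpx() {
  min=$(min_tcpx_version "$1") || return 1
  version_ge "$1" "$min"
}

for v in 1.28.8-gke.1095000 1.27.7-gke.1120000; do
  if supports_tcpx "$v"; then
    echo "$v: meets the listed minimum"
  else
    echo "$v: below the listed minimum, or not listed"
  fi
done
```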

Limitations

The following limitations apply:

You can't use GPUDirect-TCPX in Autopilot clusters.
You can only use GPUDirect-TCPX on GKE version 1.27 or later, with the following patch versions:
- For GKE version 1.27, use GKE patch version 1.27.7-gke.1121000 or later.
- For GKE version 1.28, use GKE patch version 1.28.8-gke.1095000 or later.
- For GKE version 1.29, use GKE patch version 1.29.3-gke.1093000 or later.
You can't use GPUDirect-TCPX with multi-instance GPUs or GPU time-sharing.
You can't use NCCL FastSocket.
Your environment must support setting hostNetwork: true in the Pod specification.
To use Local SSDs for Pod storage, you must explicitly specify the exact number of Local SSDs to attach to the underlying A3 VM by using the --ephemeral-storage-local-ssd=count=SSD_COUNT flag for ephemeral storage or the --local-nvme-ssd-block=count=SSD_COUNT flag for block access. If you omit this flag, you won't be able to use the Local SSDs in your Pods. These flags are only required if you want to use Local SSD for data access.

The supported machine size in GKE is a3-highgpu-8g, and the corresponding Local SSD count is 16.

Create VPCs and subnets

Create separate VPC networks in your project for each virtual NIC that you'll add to your nodes. Each VPC must have a subnet and a firewall rule that allows internal network traffic. To maximize your bandwidth, we recommend that you create four new networks.

Update the default VPC subnetwork in your project to add secondary IP address ranges for Pods and for Services:
```
gcloud compute networks subnets update DEFAULT_NETWORK \
    --region=REGION \
    --add-secondary-ranges="CLUSTER_NAME-pods=POD_IP_ADDRESS_RANGE,CLUSTER_NAME-services=SERVICE_IP_ADDRESS_RANGE"
```
Replace the following:
- DEFAULT_NETWORK: the name of the default subnet in your project.
- REGION: the region of the default subnet.
- CLUSTER_NAME: the name of your GKE cluster.
- POD_IP_ADDRESS_RANGE: the IP address range for Pods in the cluster to use, in CIDR notation. For example, 10.64.0.0/19.
- SERVICE_IP_ADDRESS_RANGE: the IP address range for Services in the cluster to use, in CIDR notation. Must be different from the Pod range. For example, 10.65.0.0/19.
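
Because the Pod, Service, and GPUDirect-TCPX subnet ranges must not overlap, it can help to check candidate ranges before you create anything. The following is a minimal sketch for IPv4 CIDR blocks that uses only shell arithmetic. The function names are hypothetical, and the check assumes well-formed input.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address of a CIDR block as integers.
cidr_bounds() {
  base=$(ip_to_int "${1%/*}")
  size=$(( 1 << (32 - ${1#*/}) ))
  first=$(( base / size * size ))
  echo "$first $(( first + size - 1 ))"
}

# Return 0 if the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

if cidrs_overlap 10.64.0.0/19 10.65.0.0/19; then
  echo "overlap"
else
  echo "no overlap"
fi
# prints: no overlap
```
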
Create the VPC networks for GPUDirect-TCPX in your project, each with a subnet and a firewall rule:

Caution: Don't let the IP address ranges in the following commands overlap with the secondary IP address ranges that you created in the previous step.
```
for N in $(seq 1 4); do
gcloud compute networks create PROJECT_ID-net-$N \
    --subnet-mode=custom \
    --mtu=8244

gcloud compute networks subnets create PROJECT_ID-sub-$N \
    --network=PROJECT_ID-net-$N \
    --region=REGION \
    --range=SUBNET_RANGE

gcloud compute firewall-rules create PROJECT_ID-internal-$N \
  --network=PROJECT_ID-net-$N \
  --action=ALLOW \
  --rules=tcp:0-65535,udp:0-65535,icmp \
  --source-ranges=SOURCE_RANGE
done
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Compute Engine region for each subnet.
- SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates over four subnets, so use a variable to change the IP address for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, and so on.
- SOURCE_RANGE: the source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
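
To illustrate the per-subnet variable, the following sketch prints, without running them, the subnet-creation commands that the loop would issue under the 192.168.$N.0/24 example scheme. The function names are hypothetical, and PROJECT_ID and REGION are left as placeholders.

```shell
# Example addressing scheme from the text: subnet N uses 192.168.N.0/24.
subnet_range() {
  printf '192.168.%s.0/24\n' "$1"
}

# Print (but don't run) the subnet-creation command for each of the four networks.
# $1 is the project ID and $2 is the region.
print_subnet_commands() {
  for n in 1 2 3 4; do
    echo "gcloud compute networks subnets create $1-sub-$n" \
      "--network=$1-net-$n --region=$2 --range=$(subnet_range "$n")"
  done
}

print_subnet_commands PROJECT_ID REGION
```

Reviewing the printed commands first makes it easier to confirm that no range collides with the Pod and Service ranges.
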
Verify that the networks were created:
```
gcloud compute networks list
```

Create the GKE environment

Create a new GKE cluster that uses multi-networking (Preview) and create a GPU node pool that uses A3 machines with attached H100 GPUs and four additional NICs. You can't update an existing cluster to use multi-networking.

Create a cluster:
```
gcloud container clusters create CLUSTER_NAME \
    --location=LOCATION \
    --cluster-version=VERSION \
    --enable-dataplane-v2 --enable-ip-alias \
    --enable-multi-networking \
    --no-enable-autoupgrade \
    --cluster-secondary-range-name=CLUSTER_NAME-pods \
    --services-secondary-range-name=CLUSTER_NAME-services
```
Replace the following:
- CLUSTER_NAME: the name of your new cluster
- LOCATION: the Compute Engine region for the cluster
- VERSION: the GKE version for the cluster. Must be a supported version as described in the Requirements section.
This command also explicitly specifies the secondary IP address ranges for Pods and Services that you created in the previous section.

Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:

```
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc2
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc2
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc3
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc3
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc4
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc4
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: PROJECT_ID-net-1
  vpcSubnet: PROJECT_ID-sub-1
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc2
spec:
  vpc: PROJECT_ID-net-2
  vpcSubnet: PROJECT_ID-sub-2
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc3
spec:
  vpc: PROJECT_ID-net-3
  vpcSubnet: PROJECT_ID-sub-3
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc4
spec:
  vpc: PROJECT_ID-net-4
  vpcSubnet: PROJECT_ID-sub-4
  deviceMode: NetDevice
EOF
```

These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.
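
The eight resources differ only in the network index, so they can also be generated in a loop. The following is a sketch under that assumption. The function name network_manifests is hypothetical, and the field values are copied from the manifest in this section.

```shell
# Print a Network and a GKENetworkParamSet manifest for each of the four VPCs.
# $1 is the project ID that prefixes the VPC and subnet names.
network_manifests() {
  for n in 1 2 3 4; do
    cat <<EOF
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc${n}
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc${n}
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc${n}
spec:
  vpc: ${1}-net-${n}
  vpcSubnet: ${1}-sub-${n}
  deviceMode: NetDevice
EOF
  done
}
```

To apply the generated manifests, pipe the output to kubectl, for example `network_manifests PROJECT_ID | kubectl apply -f -`.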

Create a node pool for the H100 GPUs:

```
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
    --additional-node-network=network=PROJECT_ID-net-1,subnetwork=PROJECT_ID-sub-1 \
    --additional-node-network=network=PROJECT_ID-net-2,subnetwork=PROJECT_ID-sub-2 \
    --additional-node-network=network=PROJECT_ID-net-3,subnetwork=PROJECT_ID-sub-3 \
    --additional-node-network=network=PROJECT_ID-net-4,subnetwork=PROJECT_ID-sub-4 \
    --enable-gvnic \
    --no-enable-autoupgrade \
    [--ephemeral-storage-local-ssd=count=16]
```

Replace NODE_POOL_NAME with the name of the node pool.

The flag in square brackets is optional. To use Local SSDs for Pod storage, include the flag without the brackets. Otherwise, omit it and remove the trailing backslash from the previous line.

If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have quota and retry the command.

Get a list of nodes in the cluster:
```
kubectl get nodes
```

Verify that each GPU node has eight GPUs:

```
kubectl describe node NODE_NAME
```

The output is similar to the following:

```
Capacity:
  ...
  nvidia.com/gpu:             8
Allocatable:
  ...
  nvidia.com/gpu:             8
```
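
To pull only the GPU counts out of the node description, you can filter the output. The following sketch assumes the output format shown above. The function name gpu_counts is hypothetical.

```shell
# Read `kubectl describe node` output on stdin and print each nvidia.com/gpu value.
# With the format shown above, the first value is Capacity and the second is Allocatable.
gpu_counts() {
  awk '$1 == "nvidia.com/gpu:" { print $2 }'
}
```

For example, `kubectl describe node NODE_NAME | gpu_counts` should print 8 twice for a correctly provisioned A3 node.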

Install GPUDirect-TCPX and configure NCCL

This section shows you how to use a DaemonSet to install the GPUDirect-TCPX binary and a specific NCCL library on your GPU nodes.

Review the DaemonSet manifest:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nccl-tcpx-installer
  namespace: kube-system
  labels:
    k8s-app: nccl-tcpx-installer
spec:
  selector:
    matchLabels:
      k8s-app: nccl-tcpx-installer
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nccl-tcpx-installer
        k8s-app: nccl-tcpx-installer
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: In
                    values:
                      - nvidia-h100-80gb
      tolerations:
        - operator: "Exists"
      hostNetwork: true
      hostPID: true
      volumes:
        - name: var-lib
          hostPath:
            path: /var/lib
        - name: tcpx
          hostPath:
            path: /var/lib/tcpx
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin
      initContainers:
        - image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
          name: nccl-tcpx-installer
          resources:
            requests:
              cpu: 150m
          securityContext:
            privileged: true
          volumeMounts:
            - name: var-lib
              mountPath: /var/lib
            - name: library-dir-host
              mountPath: /usr/local
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -ex
              /scripts/container_entry.sh install --install-nccl
              mkdir -p /usr/local/nvidia/lib64
              cp -r /var/lib/tcpx/lib64/. /usr/local/nvidia/lib64
              echo "installation finishes"
      containers:
        - image: "gcr.io/google-containers/pause:2.0"
          name: pause
```

This DaemonSet does the following:

Installs an NCCL library and the GPUDirect-TCPX binary on the node.
Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPX.

Deploy the DaemonSet:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml
```

The NCCL plugin takes approximately two minutes to start running.

Verify the status of the DaemonSet Pods:

```
kubectl get pods -n=kube-system -l=name=nccl-tcpx-installer
```

The output is similar to the following:

```
nccl-tcpx-installer-6c2pv                    1/1     Running   0          2m11s
nccl-tcpx-installer-qgg82                    1/1     Running   0          2m11s
```
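
If you want a script to wait on this check, a small filter over the Pod list is enough. The following sketch assumes the column layout shown above, with the status in the third column, and header-free input. The function name all_running is hypothetical.

```shell
# Read header-free `kubectl get pods` output on stdin.
# Succeed only if there is at least one Pod and every Pod's STATUS is Running.
all_running() {
  awk '
    { n++ }
    $3 != "Running" { bad = 1 }
    END { if (bad || n == 0) exit 1 }
  '
}
```

For example, `kubectl get pods -n=kube-system -l=name=nccl-tcpx-installer --no-headers | all_running && echo ready` prints ready only after every installer Pod reports Running.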

Deploy a test workload

In this section, you deploy a sample workload to verify that NCCL and GPUDirect-TCPX work as expected. This workload includes a sidecar container named tcpx-daemon, which runs a service that lets the Pod use GPUDirect-TCPX. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPX. For a snippet of the required fields to add to your manifests, see Add GPUDirect-TCPX to your manifest in this document.

Review the nccl-config-default.yaml ConfigMap manifest in GitHub. This manifest deploys scripts that initialize an NCCL all-gather test and sets NCCL-specific environment variables.
Review the nccl-test.yaml manifest in GitHub. This manifest does the following:
1. Deploys two Pods, each of which runs in a node that has H100 GPUs.
2. Deploys a sidecar container named tcpx-daemon in each Pod to let those Pods use GPUDirect-TCPX.
Caution: GPUDirect-TCPX Pods require access to the host network (hostNetwork: true). The tcpx-daemon service must run as a privileged container.

Deploy the ConfigMap and the test workload:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config-default.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-test.yaml
```

Run the following commands to trigger an NCCL all-gather test for the nodes:

```
head_pod=$(kubectl get pods --output='custom-columns=POD:.metadata.name' --no-headers | head -n1)

nodes=($(kubectl get pods --output='custom-columns=NODE:.spec.nodeName' --no-headers))

kubectl exec --stdin --tty --container=nccl-test "${head_pod}" -- /configs/allgather.sh "${nodes[@]}"
```

The output is similar to the following:

```
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
     2097152         32768     float    none      -1    776.4    2.70    2.53      0    726.7    2.89    2.71      0
     4194304         65536     float    none      -1    774.3    5.42    5.08      0    805.1    5.21    4.88      0
     8388608        131072     float    none      -1    812.1   10.33    9.68      0    817.6   10.26    9.62      0
    16777216        262144     float    none      -1   1035.2   16.21   15.19      0   1067.8   15.71   14.73      0
    33554432        524288     float    none      -1   1183.3   28.36   26.59      0   1211.8   27.69   25.96      0
    67108864       1048576     float    none      -1   1593.4   42.12   39.49      0   1510.5   44.43   41.65      0
   134217728       2097152     float    none      -1   2127.8   63.08   59.13      0   2312.7   58.03   54.41      0
   268435456       4194304     float    none      -1   3603.0   74.50   69.85      0   3586.2   74.85   70.17      0
   536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293
```
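
To compare runs, it can be handy to reduce this table to a couple of numbers. The following sketch averages the two busbw columns and reports the peak value, assuming the column layout shown above (out-of-place busbw in column 8 and in-place busbw in column 12). The function name summarize_busbw is hypothetical.

```shell
# Read NCCL test output on stdin and print the average and peak bus bandwidth in GB/s.
summarize_busbw() {
  awk '
    /^#/ || NF < 12 { next }           # skip headers, comments, and blank lines
    {
      for (i = 8; i <= 12; i += 4) {   # column 8: out-of-place, column 12: in-place
        sum += $i; n++
        if ($i + 0 > peak + 0) peak = $i
      }
    }
    END { if (n) printf "avg=%.2f peak=%.2f\n", sum / n, peak }
  '
}
```

Applied to the sample output above, this prints avg=29.83 peak=71.28, which agrees with the average bus bandwidth that the test itself reports.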

Use NCCL environment variables to improve performance

You can optionally set specific environment variables to improve the performance of your workloads that use NCCL. The nccl-config-default.yaml ConfigMap that you deploy in the Deploy a test workload section sets some NCCL variables by default. The variable configuration is stored in the run-nccl.sh script in the ConfigMap.

To change the NCCL environment variables, deploy an updated ConfigMap manifest with modified variables. The nccl-config-latest.yaml manifest in GitHub contains every recommended variable with an updated run-nccl.sh script.

The following command updates the existing ConfigMap that has the default variables with the updated nccl-config-latest.yaml ConfigMap:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config-latest.yaml
```

Kubernetes takes approximately two minutes to update the ConfigMap.

To check the NCCL environment variables, run the following command:

```
head_pod=$(kubectl get pods --output='custom-columns=POD:.metadata.name' --no-headers | head -n1)

kubectl exec --stdin --tty --container=nccl-test "${head_pod}" -- cat /configs/run-nccl.sh
```
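
To see only the NCCL settings rather than the whole script, you can filter the script text. The following sketch assumes that run-nccl.sh sets variables with NAME=value assignments, which this page doesn't confirm. The function name list_nccl_vars is hypothetical.

```shell
# Read a shell script on stdin and print each distinct NCCL_* assignment it contains.
list_nccl_vars() {
  grep -o 'NCCL_[A-Z0-9_]*=[^[:space:]]*' | sort -u
}
```

For example, append `| list_nccl_vars` to the kubectl exec command above, and omit the --tty flag so that the output is piped cleanly.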

Add GPUDirect-TCPX to your manifests

This section provides the required fields that you must add to your Kubernetes manifests for your Pods to use GPUDirect-TCPX.

Add the following fields to the Pod specification:

```
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  volumes:
  - name: libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia/lib64
  - name: tcpx-socket
    hostPath:
      path: /run/tcpx
```

Add the following container to the manifest to run the tcpx-daemon service:

```
- name: tcpx-daemon
  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
  command:
    - /tcpgpudmarxd/build/app/tcpgpudmarxd
    - --gpu_nic_preset
    - a3vm
    - --gpu_shmem_type
    - fd
    - --uds_path
    - /run/tcpx
    - --setup_param
    - \"--verbose 128 2 0 \"
  securityContext:
    privileged: true
  volumeMounts:
    - name: libraries
      mountPath: /usr/local/nvidia/lib64
    - name: tcpx-socket
      mountPath: /run/tcpx
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
```

Add the following volume mounts to any containers that request GPUs:
```
volumeMounts:
- name: tcpx-socket
  mountPath: /tmp
- name: libraries
  mountPath: /usr/local/nvidia/lib64
```
Note: the default tcpx-socket path is /tmp for containers that request GPUs. If you set the NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX environment variable to a value other than /tmp, GKE mounts the tcpx-socket volume to that mountPath.

Add the following environment variable to every GPU container:

```
env:
- name: LD_LIBRARY_PATH
  value: /usr/local/nvidia/lib64
```

Optionally, add environment variables to configure NCCL options. For details, see the Use NCCL environment variables to improve performance section in this document.

A completed Pod specification looks like the following:

```
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    name: example-pod
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: tcpx-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
    command:
      - /tcpgpudmarxd/build/app/tcpgpudmarxd
      - --gpu_nic_preset
      - a3vm
      - --gpu_shmem_type
      - fd
      - --uds_path
      - /run/tcpx
      - --setup_param
      - \"--verbose 128 2 0 \"
    securityContext:
      privileged: true
    volumeMounts:
      - name: libraries
        mountPath: /usr/local/nvidia/lib64
      - name: tcpx-socket
        mountPath: /run/tcpx
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
  - name: nccl-test
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx:v3.1.2
    imagePullPolicy: Always
    command:
      - /bin/sh
      - -c
      - "while true; do echo hello; sleep 1; done"
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
    volumeMounts:
      - name: tcpx-socket
        mountPath: /run/tcpx
      - name: libraries
        mountPath: /usr/local/nvidia/lib64
    resources:
      limits:
        nvidia.com/gpu: 8
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib64
    - name: tcpx-socket
      hostPath:
        path: /run/tcpx
```
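
Before you deploy your own manifest, a quick text check can catch a missing field. The following is a rough sketch that searches a manifest file for the strings this section requires. It is a plain substring search, not a YAML parser, so it can't confirm that each field sits in the right place. The function name check_tcpx_manifest is hypothetical.

```shell
# Report which required GPUDirect-TCPX strings are missing from the manifest file $1.
# Returns 0 if all of them are present and 1 otherwise.
check_tcpx_manifest() {
  missing=0
  for needle in \
    'hostNetwork: true' \
    'dnsPolicy: ClusterFirstWithHostNet' \
    'name: tcpx-daemon' \
    'path: /home/kubernetes/bin/nvidia/lib64' \
    'path: /run/tcpx' \
    'name: LD_LIBRARY_PATH'
  do
    if ! grep -qF -- "$needle" "$1"; then
      echo "missing: $needle"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tcpx_manifest my-pod.yaml` prints one line for each required string that the file lacks, where my-pod.yaml is a placeholder file name.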