
Deploying NeMo Framework Models

NVIDIA NeMo Framework offers various deployment paths for NeMo models, tailored to different domains such as Large Language Models (LLMs) and Multimodal Models (MMs). There are three primary deployment paths for NeMo models: enterprise-level deployment with NVIDIA Inference Microservice (NIM), optimized inference via exporting to another library and deploying with Triton, and in-framework inference. To begin serving your model on these three deployment paths, all you need is a NeMo checkpoint. You can find the support matrix below.

Domain    NVIDIA NIM    Optimized    In-Framework
LLMs      Yes           N/A          N/A
MMs       N/A           N/A          N/A

While a number of deployment paths are currently available, others are still in development. As each unique deployment path becomes available, it will be added to this section.

The following sections describe the paths that are available to you today for working with LLMs.

Enterprise-Level Deployment with NVIDIA NIM

Enterprises seeking a comprehensive solution that covers both on-premises and cloud deployment can use NVIDIA NIM. This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other NVIDIA AI software.

This option is ideal for organizations requiring a reliable and scalable solution to deploy generative AI models in production environments. It also stands out as the fastest inference option, offering user-friendly scripts and APIs. Leveraging the TensorRT-LLM Triton backend, it achieves rapid inference using advanced batching algorithms, including in-flight batching. Note that this deployment path supports only selected LLM models.

To learn more about NVIDIA NIM, visit the NVIDIA website.
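As a quick illustration, NIM for LLMs is typically exposed through an OpenAI-compatible HTTP API. The sketch below assumes a microservice running locally on port 8000 and an example model name; both depend on how the container is launched, so adjust them for your deployment.

    # Minimal sketch: send a chat request to a locally running NIM for LLMs
    # endpoint. The URL, port, and model name are assumptions and must match
    # the microservice you deployed.
    import requests

    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": "meta/llama3-8b-instruct",  # example model name
        "messages": [{"role": "user", "content": "What is in-flight batching?"}],
        "max_tokens": 64,
        "temperature": 0.2,
    }
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])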

In-Framework Inference

In-framework inference involves running LLMs directly within the NeMo Framework. This approach is straightforward and eliminates the need to export models to another format. It is ideal for development and testing phases, where ease of use and flexibility are critical. The NeMo Framework supports multi-node and multi-GPU inference, maximizing throughput. This method allows for rapid iterations and direct testing within the NeMo environment. Although this is the slowest option, it supports all NeMo models.
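As a rough sketch of what this looks like in practice, the example below loads a GPT-style .nemo checkpoint and generates text with the MegatronGPTModel API. The checkpoint path is a placeholder, and argument names such as the length_params keys can change between NeMo releases, so verify them against the version you have installed.

    # Minimal in-framework inference sketch for a GPT-style NeMo checkpoint.
    # The path is a placeholder; generation arguments may differ by release.
    from pytorch_lightning import Trainer

    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

    trainer = Trainer(accelerator="gpu", devices=1, strategy=NLPDDPStrategy())
    model = MegatronGPTModel.restore_from(restore_path="/path/to/model.nemo", trainer=trainer)
    model.freeze()

    output = model.generate(
        inputs=["Deep learning is"],
        length_params={"max_length": 64, "min_length": 1},
    )
    print(output)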

This section will be updated in upcoming releases.

Optimized Inference via TensorRT-LLM and Triton

For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module, and then deploying them with the Triton Inference Server. Unlike the NIM for LLMs path, this option does not include the advanced batching algorithms, such as in-flight batching with the TensorRT-LLM Triton backend, that deliver the fastest LLM inference. Note that this deployment path supports only selected LLM models.
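For illustration, the sketch below follows the nemo.export and nemo.deploy pattern: export a .nemo checkpoint to a TensorRT-LLM engine, then serve it with Triton through PyTriton. The paths, model_type value, and argument names are assumptions that should be checked against the NeMo release you are using.

    # Minimal sketch: export a NeMo checkpoint to TensorRT-LLM and serve it
    # with Triton (PyTriton). Paths and model_type are examples only.
    from nemo.deploy import DeployPyTriton
    from nemo.export import TensorRTLLM

    # Build the TensorRT-LLM engine from the .nemo checkpoint.
    exporter = TensorRTLLM(model_dir="/opt/checkpoints/trt_llm_engine")
    exporter.export(
        nemo_checkpoint_path="/opt/checkpoints/llama.nemo",
        model_type="llama",
        n_gpus=1,
    )

    # Expose the exported engine on a local Triton endpoint.
    server = DeployPyTriton(model=exporter, triton_model_name="llama", port=8000)
    server.deploy()
    server.serve()

Once the server is running, prompts can be sent from a separate process, for example with the NemoQueryLLM helper in nemo.deploy.nlp, whose exact interface likewise depends on the installed release.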

As new information becomes available, this section will be updated in future releases.
