Large Language Models Pretraining
The table below shows the pre-training performance of various models on DGX H100 with FP8 precision. GBS stands for global batch size, MBS for micro batch size, TP for tensor parallel size, and PP for pipeline parallel size.
Please refer to the MLCommons Training results for the performance of GPT3-175B pre-training on large-scale H100 systems.
To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 1 | 2048 | 4 | 8 | 741 | 797 | 153 |
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 23574 | 746 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 5528 | 708 | 20 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 16290 | 751 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 4 | 1 | 8317 | 725 | 14 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1725 | 767 | 66 |
| Nemotron-8B | 8 | 32 | 2 | 4096 | 2 | 1 | 11538 | 593 | 10 |
| Nemotron-22B | 16 | 32 | 2 | 4096 | 1 | 4 | 3828 | 499 | 30 |
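As a rough sanity check, the last column is consistent with dividing the token budget by the aggregate throughput, taking 1K GPUs to mean 1024; for example, for GPT3-175B:

```bash
# days = tokens / (tokens/sec/GPU * GPUs * 86400 s/day), assuming 1K GPUs = 1024
echo "scale=1; 10 * 10^12 / (741 * 1024 * 86400)" | bc   # ~152.5 -> 153 days
```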
Large Language Models Fine-tuning
The following table provides performance benchmarks of LLAMA2 models under SFT (supervised fine-tuning) and LoRA (low-rank adaptation) on DGX H100 with FP8 precision.
For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to a sequence length of 4096 tokens.
To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 14761 | 591 | 1.4 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 8989 | 698 | 2.3 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1470 | 609 | 7.1 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 20750 | 556 | 1.0 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 12584 | 654 | 1.7 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2279 | 631 | 9.1 |
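The last column follows the same arithmetic over the job's #-GPUs and a 10M-token budget; for example, for LLAMA2-7B SFT:

```bash
# mins = tokens / (tokens/sec/GPU * GPUs * 60 s/min)
echo "scale=1; 10 * 10^6 / (14761 * 8 * 60)" | bc   # ~1.4 minutes
```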
These scripts run recommended NeMo configurations for GPT3, LLAMA2, and Nemotron pretraining and fine-tuning at various model sizes on A100 and H100. For example, for GPT3 pretraining the following folders provide sample scripts:
- A100: scripts to run GPT pretraining on NVIDIA A100, in bf16 data type
- H100: scripts to run GPT pretraining on NVIDIA H100, in fp8 data type
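A run can then be launched directly from the matching folder; the script name below is hypothetical, so substitute the one for your model size:

```bash
# Hypothetical script name; pick the H100 (fp8) or A100 (bf16) variant you need
bash ./H100/gpt3_175b.sh
```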
Setup
To run these scripts, you must have access to the NeMo Framework Container. Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
Update the following bash variables in the example run scripts:
- `NEMO_MEGATRON_LAUNCHER_DIR`: the directory where this repository is located
- `DATA_DIR`: the directory of the dataset used for pretraining; by default this is `NEMO_MEGATRON_LAUNCHER_DIR/data`
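For example (paths are illustrative):

```bash
# Illustrative paths; adjust to your environment
NEMO_MEGATRON_LAUNCHER_DIR="/path/to/NeMo-Megatron-Launcher"
DATA_DIR="${NEMO_MEGATRON_LAUNCHER_DIR}/data"
```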
Enter your cluster environment settings in config.yaml.
For bcm-type clusters, update the job name, partition, and account in bcm.yaml.
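A minimal sketch of the bcm.yaml fields to update (field names assumed from the launcher's Slurm settings; check the file shipped with the container):

```yaml
# Values are illustrative; set these to match your cluster
partition: batch              # Slurm partition
account: my_account           # Slurm account
job_name_prefix: "nemo-megatron-"
```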
For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
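Appended to a launcher command, this could look as follows; the `launcher_scripts/main.py` entry point and the `gpt3/5b` config name are assumptions, so substitute your own run script:

```bash
# Entry point and training config are assumed; adjust to your run script
python3 "${NEMO_MEGATRON_LAUNCHER_DIR}/launcher_scripts/main.py" \
    training=gpt3/5b \
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
```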
For further details, see General Configuration.
Collect Results
For performance, the `step_time_per_sec` value in the console output provides a quick way to read the performance of a workload.
For more details and graphs, one can use TensorBoard or Weights and Biases. To do so, use the results stored at `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>`, which has the following structure:
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml`: the config of the pretrained model
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh`: the autogenerated .sh file that was run
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/`: directory containing per-rank logs and TensorBoard data
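For example, TensorBoard can be pointed directly at that directory:

```bash
# View training curves for a given experiment
tensorboard --logdir "${NEMO_MEGATRON_LAUNCHER_DIR}/results/<experiment_name>/results/"
```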
For further details, see Interpreting the Results.