In this post I show the steps I took when following the Fine-tune with NeMo tutorial, and where I deviated from it. I completed the tutorial successfully on these systems:

  1. Dell Pro Max GB10 (NVIDIA DGX Spark clone)
    aarch64 / arm64 with 128 GB (shared) VRAM
  2. Dell Precision 7960T
    x86_64 / amd64 with an Intel w9 (60 cores) and 512 GB RAM, 4 NVIDIA RTX Pro 6000 Blackwell Max-Q Workstation GPUs with 96 GB VRAM each

Both systems use “Blackwell” GPUs, and software support for them still varies. For example, I could not install Axolotl on either system: on the aarch64 system, support was missing for a torch… library, and on the other system the CUDA compute capability was too high.

The good news is that NeMo AutoModel starts “out of the box” on both systems.

Installation

Installation with Docker is straightforward. The tutorial starts the image directly with docker; I created a docker-compose.yml instead. Both should work.

NVIDIA NeMo AutoModel Image

I installed this newer version of the (multi-arch) image:

nvcr.io/nvidia/nemo-automodel:26.04

Docker Compose and Volume

I created a docker-compose.yml and mounted volumes for:

  • models
  • datasets
  • workspace
  • results
  • huggingface cache
  • checkpoints

services:
  automodel:
    image: nvcr.io/nvidia/nemo-automodel:26.04
    container_name: automodel
    user: "0:0"
    volumes:
      - /data/nvidia/models:/models
      - /data/nvidia/datasets:/datasets
      - /data/nvidia/workspace:/workspace
      - /data/nvidia/results:/results
      - /data/nvidia/hf_cache:/root/.cache/huggingface
      - /data/automodel/checkpoints:/opt/Automodel/checkpoints
    working_dir: /workspace
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    tty: true
    stdin_open: true
    entrypoint: /usr/bin/bash
    environment:
      - TRANSFORMER_ENGINE_PTE=1
      - NVIDIA_VISIBLE_DEVICES=all
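
Docker creates missing bind-mount directories on its own, but they then belong to root, so it can be convenient to create them up front. A minimal sketch, assuming the host paths from the compose file above; the helper name ensure_host_dirs and its root parameter are hypothetical (root only exists so the sketch can be tried in a scratch directory):

```python
from pathlib import Path

# Host side of the bind mounts in the docker-compose.yml above.
HOST_DIRS = [
    "/data/nvidia/models",
    "/data/nvidia/datasets",
    "/data/nvidia/workspace",
    "/data/nvidia/results",
    "/data/nvidia/hf_cache",
    "/data/automodel/checkpoints",
]


def ensure_host_dirs(dirs=HOST_DIRS, root="/"):
    """Create each directory below `root` if it is missing.

    Returns the list of paths that were actually created.
    """
    created = []
    for d in dirs:
        path = Path(root) / d.lstrip("/")
        if not path.is_dir():
            path.mkdir(parents=True, exist_ok=True)
            created.append(str(path))
    return created

# Usage on the host (may need sudo for /data):
#   for p in ensure_host_dirs():
#       print("created", p)
```

Calling it a second time returns an empty list, so it is safe to run before every docker compose up.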

Start and Fine-Tune

NOTE: Before I started the scripts, I set HF_TOKEN in the current shell. With this, the Hugging Face downloader did not ask for a token.
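
To fail early instead of being prompted in the middle of a run, the variable can be checked before launching. A minimal sketch; require_hf_token is a hypothetical helper, and it only checks that HF_TOKEN is set, not that the token is valid or has access to the model:

```python
import os


def require_hf_token(env=None):
    """Return the value of HF_TOKEN, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before starting the fine-tuning run"
        )
    return token
```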

I used automodel (or am) instead of examples/llm_finetune/finetune.py:

cd /opt/Automodel
automodel --nproc-per-node 4 \
examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 20

Something surprised me: when I passed the YAML configuration as an absolute path (starting with “/”), automodel could not find the config file. Therefore, I had to specify the relative path.
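
If a script only has the absolute path at hand, it can be converted before the call. A minimal sketch, assuming the command is run from /opt/Automodel as in the example above; config_relpath is a hypothetical helper and not part of NeMo AutoModel:

```python
import os

AUTOMODEL_ROOT = "/opt/Automodel"  # directory from which `automodel` is run


def config_relpath(config_path, root=AUTOMODEL_ROOT):
    """Return `config_path` relative to `root`.

    Relative paths are only normalised. Absolute paths outside `root`
    are rejected instead of producing a '../..' path.
    """
    if not os.path.isabs(config_path):
        return os.path.normpath(config_path)
    rel = os.path.relpath(config_path, start=root)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{config_path} is not below {root}")
    return rel
```

The result can then be passed to automodel in place of the absolute path.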

Warnings

When I started the fine-tuning commands, the program showed several warnings but did not stop.

I got these warnings on the aarch64 system:

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
warnings.warn(
/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
warnings.warn(
cfg-path: examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml
/usr/local/lib/python3.12/dist-packages/torch/distributed/device_mesh.py:604: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)

And I got additional warnings on the x86_64 system:

/opt/Automodel/nemo_automodel/components/models/llama/model.py:338: FutureWarning: `input_embeds` is deprecated and will be removed in version 5.6.0 for `create_causal_mask`. Use `inputs_embeds` instead.
causal_mask = create_causal_mask(
/usr/local/lib/python3.12/dist-packages/torch/distributed/device_mesh.py:604: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
/opt/Automodel/nemo_automodel/components/models/llama/model.py:338: FutureWarning: `input_embeds` is deprecated and will be removed in version 5.6.0 for `create_causal_mask`. Use `inputs_embeds` instead.
causal_mask = create_causal_mask(

Library Versions

For me, the most interesting part here is the torch version, because with Axolotl I could not get a version that supports both CUDA 13.x and compute capability 12.1. But maybe I did something wrong. Who knows?

nemo_automodel: 0.4.0+9687b04c (/opt/Automodel/nemo_automodel/__init__.py)
transformers: 5.5.0 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
torch: 2.11.0a0+eb65b36914.nv26.02 CUDA 13.1
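
A listing like this can be collected from inside the container with a few lines of Python. A minimal sketch; collect_versions is a hypothetical helper, and the distribution names passed to importlib.metadata are an assumption (they may differ from the import names). The torch part only runs when torch can be imported:

```python
from importlib import metadata


def collect_versions(packages=("nemo_automodel", "transformers", "torch")):
    """Return a dict of installed versions, plus CUDA details if torch is present."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"

    try:
        import torch
    except ImportError:
        return report

    # CUDA version torch was built against (None for CPU-only builds).
    report["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        report["compute_capability"] = f"{major}.{minor}"
    return report

# Usage:
#   for key, value in collect_versions().items():
#       print(f"{key}: {value}")
```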

Execution Time

Neither system is optimised in any way; I just ran the examples from the tutorial. As expected, the RTX Pro 6000 was faster (which had nothing to do with the amount of VRAM). Maybe because the GPU alone costs nearly double the price of the complete GB10 / DGX system?

LoRA with meta-llama/Llama-3.1-8B

GB10 around 45 seconds per step

RTX Pro 6000 around 11 seconds per step (1 GPU) (tps 5589.07)

RTX Pro 6000 around 6 seconds per step (2 GPU) (tps 9882.63 (4941.31/gpu))

RTX Pro 6000 around 3 seconds per step (4 GPU) (tps 19221.33 (4805.33/gpu))

QLoRA with meta-llama/Meta-Llama-3-70B

GB10 around 251 seconds per step (tps 115.15)

RTX Pro 6000 around 53 seconds per step (1 GPU) (tps 555.74 (555.74/gpu))

RTX Pro 6000 around 47 seconds per step (2 GPU) (tps 639.61 (319.81/gpu))

RTX Pro 6000 around 27 seconds per step (4 GPU) (tps 1088.10 (272.03/gpu))

Full fine-tuning with Qwen/Qwen3-8B

GB10 around 87 seconds per step

RTX Pro 6000 around 17 seconds per step (1 GPU) (tps 3215)

RTX Pro 6000 around 38 seconds per step (2 GPU) (tps 1535.73 (767.86/gpu))

RTX Pro 6000 around 22 seconds per step (4 GPU) (tps 2680.50 (670.13/gpu))

NOTE: The power consumption of each GPU was only at 50% to 60%; it seems that the PCI bus was the bottleneck.
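
The multi-GPU behaviour is easier to compare as speedup and scaling efficiency relative to the 1-GPU run. A minimal sketch that uses only the tps values listed above; the function name scaling is hypothetical, and efficiency is simply defined here as speedup divided by the number of GPUs:

```python
# Tokens per second (tps) on the RTX Pro 6000 system, keyed by number of GPUs.
TPS = {
    "LoRA Llama-3.1-8B": {1: 5589.07, 2: 9882.63, 4: 19221.33},
    "QLoRA Meta-Llama-3-70B": {1: 555.74, 2: 639.61, 4: 1088.10},
    "Full fine-tuning Qwen3-8B": {1: 3215.0, 2: 1535.73, 4: 2680.50},
}


def scaling(tps_by_gpus):
    """Return {n_gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    base = tps_by_gpus[1]
    return {
        n: (tps / base, tps / base / n)
        for n, tps in sorted(tps_by_gpus.items())
    }


for name, runs in TPS.items():
    for n, (speedup, efficiency) in scaling(runs).items():
        print(f"{name}: {n} GPU(s) speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers, LoRA keeps roughly 86% to 88% efficiency on 2 and 4 GPUs, while full fine-tuning on 2 and 4 GPUs stays below the throughput of a single GPU.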

System Information

Dell Pro Max GB10

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Device 0 [NVIDIA GB10] PCIe GEN 1@ 1x RX: N/A TX: N/A
GPU 208MHz MEM N/A MHz TEMP 33°C FAN N/A POW 3 W
GPU[ 0%] MEM[ N/A/119.631Gi]
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03 Driver Version: 580.159.03 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 33C P8 3W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

Dell Precision 7960T

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:23:41_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0
Device 0 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 4.962 MiB/s TX: 754.0 KiB/s
GPU 180MHz MEM 405MHz TEMP 30°C FAN 30% POW 7 / 300 W
GPU[ 0%] MEM[| 1.171Gi/95.593Gi]
Device 1 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 1.546 MiB/s TX: 2.764 MiB/s
GPU 180MHz MEM 405MHz TEMP 31°C FAN 30% POW 6 / 300 W
GPU[ 0%] MEM[ 0.626Gi/95.593Gi]
Device 2 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 439.0 KiB/s TX: 528.0 KiB/s
GPU 180MHz MEM 405MHz TEMP 35°C FAN 30% POW 22 / 300 W
GPU[ 0%] MEM[ 0.635Gi/95.593Gi]
Device 3 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 1.220 MiB/s TX: 1.501 MiB/s
GPU 180MHz MEM 405MHz TEMP 32°C FAN 30% POW 8 / 300 W
GPU[ 0%] MEM[ 0.626Gi/95.593Gi]
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... On | 00000000:16:00.0 Off | Off |
| 30% 30C P8 6W / 300W | 562MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 6000 Blac... On | 00000000:34:00.0 Off | Off |
| 30% 31C P8 5W / 300W | 4MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX PRO 6000 Blac... On | 00000000:AC:00.0 Off | Off |
| 30% 35C P8 23W / 300W | 4MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX PRO 6000 Blac... On | 00000000:CA:00.0 Off | Off |
| 30% 32C P8 8W / 300W | 4MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

Summary

So, this is not really an “article” in itself, but a quick listing of the settings and a confirmation that the tutorial works on these (Blackwell) systems.
