In this post I walk through the steps I took when following the Fine-tune with NeMo tutorial, and I point out where I deviated from it. I completed the tutorial successfully on these two systems:

Dell Pro Max GB10 (NVIDIA DGX Spark clone)
aarch64 / arm64 with 128 GB of memory shared between CPU and GPU
Dell Precision 7960 T
x86_64 / amd64 with an Intel w9 (60 cores), 512 GB RAM and 4 NVIDIA RTX Pro 6000 Blackwell Max-Q Workstation GPUs with 96 GB VRAM each

Both systems have “Blackwell” GPUs, and software support for them still varies. For example, I could not install Axolotl on either system: on the GB10, aarch64 support was missing for a torch… library, and on the Precision, the CUDA compute capability of the GPUs was too new.
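To make the "capability too new" problem more concrete, here is a minimal sketch of how one could compare the compute capability of a GPU with the architectures a given torch build was compiled for. The function names are my own, and the matching rule is a simplification: it assumes that a plain `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, that a `compute_XY` (PTX) entry can be compiled at runtime for any newer device, and that suffixed targets such as `sm_90a` only match exactly.

```python
from __future__ import annotations

import platform
import re


def parse_arch(arch: str) -> tuple[str, int, int, str] | None:
    """Split e.g. 'sm_120' into ('sm', 12, 0, ''); None for unknown formats."""
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)([af]?)", arch)
    if match is None:
        return None
    kind, major, minor, suffix = match.groups()
    return kind, int(major), int(minor), suffix


def device_is_covered(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """Simplified check whether a torch build can target the given device."""
    dev_major, dev_minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, major, minor, suffix = parsed
        if suffix:
            # Architecture-specific targets: treated as exact match only.
            if kind == "sm" and (major, minor) == (dev_major, dev_minor):
                return True
            continue
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    print("CPU architecture:", platform.machine())
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
            arch_list = torch.cuda.get_arch_list()
            print("device capability:", capability)
            print("build targets:", arch_list)
            print("covered:", device_is_covered(capability, arch_list))
```

With this rule, `device_is_covered((12, 1), ["sm_80", "sm_90"])` returns `False`, which is the kind of mismatch I ran into.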

So the good news is that on both systems NeMo AutoModel starts “out of the box”.

Installation

Installation with Docker is straightforward. The tutorial starts the image directly via docker; I created a docker-compose.yml instead. Both approaches should work.

NVIDIA NeMo AutoModel Image

I installed this newer version of the (multi-arch) image:

nvcr.io/nvidia/nemo-automodel:26.04

Docker Compose and Volume

I created a docker-compose.yml and mounted volumes for:

models
datasets
workspace
results
huggingface cache
checkpoints

			
services:
  automodel:
    image: nvcr.io/nvidia/nemo-automodel:26.04
    container_name: automodel
    # Run as root inside the container
    user: "0:0"
    # Host directories (left) mounted into the container (right)
    volumes:
      - /data/nvidia/models:/models
      - /data/nvidia/datasets:/datasets
      - /data/nvidia/workspace:/workspace
      - /data/nvidia/results:/results
      - /data/nvidia/hf_cache:/root/.cache/huggingface
      - /data/automodel/checkpoints:/opt/Automodel/checkpoints
    working_dir: /workspace
    # Share the host IPC namespace (shared memory)
    ipc: host
    ulimits:
      memlock: -1        # unlimited locked memory
      stack: 67108864    # 64 MiB stack
    # Make all NVIDIA GPUs available to the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Keep an interactive bash session open
    tty: true
    stdin_open: true
    entrypoint: /usr/bin/bash
    environment:
      - TRANSFORMER_ENGINE_PTE=1
      - NVIDIA_VISIBLE_DEVICES=all
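
Before starting the container it can help to check that the host directories from the volumes section exist, because a directory that Docker has to create itself usually ends up owned by root. The following is a small sketch for that check; the helper name `missing_dirs` is my own, and the paths are simply copied from the compose file above.

```python
from pathlib import Path

# Host-side directories copied from the volumes section of the compose file.
HOST_DIRS = [
    "/data/nvidia/models",
    "/data/nvidia/datasets",
    "/data/nvidia/workspace",
    "/data/nvidia/results",
    "/data/nvidia/hf_cache",
    "/data/automodel/checkpoints",
]


def missing_dirs(paths):
    """Return the paths that do not exist as directories on this host."""
    return [path for path in paths if not Path(path).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(HOST_DIRS)
    if missing:
        for path in missing:
            print("missing:", path)
    else:
        print("all host directories exist")
```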

		

Start and Fine-Tune

NOTE: Before I started the scripts, I set HF_TOKEN in the current shell. With the token set, the Hugging Face downloader did not ask for token information.
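
A quick way to confirm that the token is actually visible to processes started from the current shell is a check like the one below. This is only a sketch: the function name is hypothetical, and it checks only that the variable is present and non-empty, not that the token is valid or has access to the gated Llama models.

```python
import os


def hf_token_status(env=None):
    """Return 'set' if HF_TOKEN is present and non-empty, else 'missing'."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return "set" if token else "missing"


if __name__ == "__main__":
    # Never print the token itself, only whether it is there.
    print("HF_TOKEN:", hf_token_status())
```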

I use automodel (or am) instead of examples/llm_finetune/finetune.py:

			
cd /opt/Automodel
automodel --nproc-per-node 4 \
examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 20

		

Something surprised me: when I specified the YAML configuration with an absolute path (starting with “/”), automodel could not find the config file. I therefore had to use a relative path.
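
If the config path comes from a script or another tool as an absolute path, it can be converted before it is passed to automodel. The sketch below is only a workaround for the behaviour I observed, not a statement about how automodel resolves paths; the helper name `config_arg` is my own, and it assumes that the command is run from /opt/Automodel as shown above.

```python
import posixpath

# Working directory used for the automodel command above.
AUTOMODEL_ROOT = "/opt/Automodel"


def config_arg(config_path, root=AUTOMODEL_ROOT):
    """Turn an absolute config path into one relative to the AutoModel root."""
    if posixpath.isabs(config_path):
        return posixpath.relpath(config_path, start=root)
    return config_path


if __name__ == "__main__":
    absolute = "/opt/Automodel/examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml"
    print(config_arg(absolute))
```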

Warnings

When I started the fine-tuning commands, the program printed various warnings but did not stop.

I got these warnings on the aarch64 system:

			
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(
/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(
cfg-path: examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml

		

			
/usr/local/lib/python3.12/dist-packages/torch/distributed/device_mesh.py:604: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
  sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)

And I got additional warnings on the x86_64 system:

			
/opt/Automodel/nemo_automodel/components/models/llama/model.py:338: FutureWarning: `input_embeds` is deprecated and will be removed in version 5.6.0 for `create_causal_mask`. Use `inputs_embeds` instead.
  causal_mask = create_causal_mask(
/usr/local/lib/python3.12/dist-packages/torch/distributed/device_mesh.py:604: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
  sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
/opt/Automodel/nemo_automodel/components/models/llama/model.py:338: FutureWarning: `input_embeds` is deprecated and will be removed in version 5.6.0 for `create_causal_mask`. Use `inputs_embeds` instead.
  causal_mask = create_causal_mask(

		

Library Versions

For me, the most interesting part here is the torch version, because with Axolotl I could not get a torch build that supports both CUDA 13.x and compute capability 12.1. But maybe I did something wrong. Who knows?

			
nemo_automodel: 0.4.0+9687b04c (/opt/Automodel/nemo_automodel/__init__.py)
transformers: 5.5.0 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
torch: 2.11.0a0+eb65b36914.nv26.02 CUDA 13.1
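
A listing like this can be produced inside the container with a few lines of Python. The sketch below is one possible way to do it, not necessarily how the output above was generated; it assumes that each package exposes a `__version__` attribute, and the helper names are my own.

```python
import importlib


def describe_module(name):
    """Return 'name: version (location)' or a note if the module is missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return f"{name}: not installed"
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return f"{name}: {version} ({location})"


def cuda_version():
    """Return the CUDA version torch was built with, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    for name in ("nemo_automodel", "transformers", "torch"):
        print(describe_module(name))
    print("CUDA:", cuda_version())
```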

Execution Time

Neither system is optimised in any way; I just ran the examples from the tutorial. As expected, the RTX Pro 6000 was faster (which had nothing to do with the amount of VRAM). Maybe because the GPU alone costs nearly twice as much as the complete GB10 / DGX system?

LoRA with `meta-llama/Llama-3.1-8B`

GB10 around 45 seconds per step

RTX Pro 6000 around 11 seconds per step (1 GPU) (tps 5589.07)

RTX Pro 6000 around 6 seconds per step (2 GPU) (tps 9882.63 (4941.31/gpu))

RTX Pro 6000 around 3 seconds per step (4 GPU) (tps 19221.33 (4805.33/gpu))

QLoRA with `meta-llama/Meta-Llama-3-70B`

GB10 around 251 seconds per step (tps 115.15)

RTX Pro 6000 around 53 seconds per step (1 GPU) (tps 555.74 (555.74/gpu))

RTX Pro 6000 around 47 seconds per step (2 GPU) (tps 639.61 (319.81/gpu))

RTX Pro 6000 around 27 seconds per step (4 GPU) (tps 1088.10 (272.03/gpu))

Full fine-tuning with `Qwen/Qwen3-8B`

GB10 around 87 seconds per step

RTX Pro 6000 around 17 seconds per step (1 GPU) (tps 3215)

RTX Pro 6000 around 38 seconds per step (2 GPU) (tps 1535.73 (767.86/gpu))

RTX Pro 6000 around 22 seconds per step (4 GPU) (tps 2680.50 (670.13/gpu))

NOTE: The power consumption of each GPU was only at 50% to 60%; it seems that the PCI bus was the bottleneck.
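
To put the multi-GPU numbers into perspective, the tokens-per-second values can be turned into a scaling efficiency: the measured throughput with n GPUs divided by n times the single-GPU throughput. The sketch below uses only the RTX Pro 6000 values listed above; the names `RUNS` and `scaling_efficiency` are my own.

```python
# Tokens per second on the RTX Pro 6000 system, keyed by number of GPUs.
RUNS = {
    "LoRA Llama-3.1-8B": {1: 5589.07, 2: 9882.63, 4: 19221.33},
    "QLoRA Meta-Llama-3-70B": {1: 555.74, 2: 639.61, 4: 1088.10},
    "Full fine-tuning Qwen3-8B": {1: 3215.0, 2: 1535.73, 4: 2680.50},
}


def scaling_efficiency(tps_by_gpus):
    """Measured throughput relative to perfect linear scaling from 1 GPU."""
    base = tps_by_gpus[1]
    return {gpus: tps / (gpus * base) for gpus, tps in tps_by_gpus.items()}


if __name__ == "__main__":
    for name, tps_by_gpus in RUNS.items():
        for gpus, efficiency in sorted(scaling_efficiency(tps_by_gpus).items()):
            print(f"{name}: {gpus} GPU(s) -> {efficiency:.0%} of linear scaling")
```

By this measure LoRA keeps about 88% (2 GPUs) and 86% (4 GPUs) of linear scaling, QLoRA about 58% and 49%, and full fine-tuning only about 24% and 21%. That would fit with the communication between the GPUs being the limiting factor, but I did not verify this.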

System Information

Dell Pro Max GB10

			
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Device 0 [NVIDIA GB10] PCIe GEN 1@ 1x RX: N/A TX: N/A
GPU 208MHz  MEM N/A MHz  TEMP  33°C   FAN N/A   POW   3 W
GPU[                           0%] MEM[                N/A/119.631Gi]
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03             Driver Version: 580.159.03     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   33C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

		

Dell Precision 7960 T

			
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:23:41_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0
Device 0 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 4.962 MiB/s TX: 754.0 KiB/s
GPU 180MHz  MEM 405MHz  TEMP  30°C FAN  30% POW   7 / 300 W
GPU[                                                 0%] MEM[|                                  1.171Gi/95.593Gi]
Device 1 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 1.546 MiB/s TX: 2.764 MiB/s
GPU 180MHz  MEM 405MHz  TEMP  31°C FAN  30% POW   6 / 300 W
GPU[                                                 0%] MEM[                                   0.626Gi/95.593Gi]
Device 2 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 439.0 KiB/s TX: 528.0 KiB/s
GPU 180MHz  MEM 405MHz  TEMP  35°C FAN  30% POW  22 / 300 W
GPU[                                                 0%] MEM[                                   0.635Gi/95.593Gi]
Device 3 [NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition] PCIe GEN 1@16x RX: 1.220 MiB/s TX: 1.501 MiB/s
GPU 180MHz  MEM 405MHz  TEMP  32°C FAN  30% POW   8 / 300 W
GPU[                                                 0%] MEM[                                   0.626Gi/95.593Gi]
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:16:00.0 Off |                  Off |
| 30%   30C    P8              6W /  300W |     562MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:34:00.0 Off |                  Off |
| 30%   31C    P8              5W /  300W |       4MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:AC:00.0 Off |                  Off |
| 30%   35C    P8             23W /  300W |       4MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:CA:00.0 Off |                  Off |
| 30%   32C    P8              8W /  300W |       4MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
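
For a more compact overview than the full table, nvidia-smi can print selected fields as CSV via `--query-gpu` and `--format=csv,noheader`. The sketch below wraps that call in Python; the helper names are my own, and I have not checked which fields the GB10 reports (it already shows "Not Supported" for memory usage above), so values are kept as plain strings.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Parse the CSV output of the query above into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, name, driver, memory = (field.strip() for field in fields)
        rows.append(
            {"index": int(index), "name": name, "driver": driver, "memory_total": memory}
        )
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print("nvidia-smi not available:", error)
```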

		

Summary

So, this was not really an “article” in itself, but a quick listing of the settings and a confirmation that the tutorial works with these Blackwell systems, which were fairly new at the time of writing.

Posted 2 months ago by Ronald Rink in Technology, Various Products. Tagged #AutoModel, dgx, docker, Lora, NeMo, nvidia, qLora, rtx pro 6000.

Ronald Rink

I am a senior auditor, consultant and architect at d-fens for business processes and information systems.

d-fens GmbH

Fine-tuning with NVIDIA NeMo AutoModel
