Training CLI reference

areno train

Run SFT, DPO, GSPO, GRPO, or PPO training with the local Areno backend. The command owns the full loop: dataset loading, optional normalization, rollout, reward scoring, loss computation, optimizer steps, metrics, and checkpoint saving.

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path gsm8k:main \
  --dataset-loader-fn examples/math/dataset_loader.py \
  --reward-fn-path examples/math/math_verify_reward.py \
  --algo gspo \
  --tp-size 1 \
  --world-size 1 \
  --batch-size 2 \
  --n-samples 2 \
  --mini-bs 1

areno train

Start a training run.

Options are grouped into sections that match areno train --help, following the RL training loop: Basic (what to run plus devices), Rollout (generate and score completions), Train (consume rollouts and update weights), Checkpoint (produced artifacts), and Observability (logs).

Basic

Inputs, dataset loader, the algorithm and run length, and device counts.

--ckpt TEXT

Actor model/tokenizer checkpoint path or Hugging Face repo ID.

--dataset-path TEXT

Training dataset path, Hugging Face save_to_disk directory, or Hugging Face dataset reference.

Dataset references use repo/name, repo/name:config, or repo/name:config:split. Examples: gsm8k:main and AI-MO/NuminaMath-TIR.
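The reference grammar above can be illustrated with a small parser. This is only a sketch: `parse_dataset_ref` and `DatasetRef` are hypothetical names, not Areno's resolver, and local file paths are not handled here.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """Parts of a 'repo/name[:config[:split]]' reference (hypothetical type)."""
    name: str
    config: Optional[str] = None
    split: Optional[str] = None


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split a dataset reference on ':' into name, config, and split."""
    parts = ref.split(":")
    # Accept one to three non-empty parts; anything else is malformed.
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"malformed dataset reference: {ref!r}")
    return DatasetRef(*parts)
```

Under these assumptions, `parse_dataset_ref("gsm8k:main")` yields the name `gsm8k` with config `main` and no split, while `AI-MO/NuminaMath-TIR` yields a name only.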

--dataset-path also accepts JSON/JSONL, Parquet, CSV/TSV, Arrow, and datasets.save_to_disk(...) directories.

--dataset-loader-fn TEXT

Optional Python dataset loader function as file.py or file.py:function.

Use --dataset-loader-fn when the raw dataset does not already match the trainer schema. Without a loader, Areno passes dataset rows through unchanged, except for SFT where a loader is required. See Dataset loaders for the per-algorithm loader contracts.

--algo TEXT

Training algorithm registered in areno.api. Default: gspo.

Built-in algorithms: sft, dpo, gspo, grpo, ppo.

--epochs INTEGER

Number of dataset epochs to train. Default: 10.

--max-steps INTEGER

Optional global cap on trainer steps. Training stops once this many steps have completed, even if the current epoch still has batches remaining.

--world-size INTEGER

Total device count for the backend. Default: 8.

--tp-size INTEGER

Tensor parallel size for the backend. Default: 4.

--world-size must be divisible by --tp-size.
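The divisibility rule can be checked before launching a run. The helper below is a hypothetical pre-flight check, not part of Areno; it only validates the two values and reports how many tensor-parallel groups they imply. How Areno uses those groups is not described here.

```python
def tp_group_count(world_size: int, tp_size: int) -> int:
    """Return how many tensor-parallel groups fit in world_size devices."""
    if world_size <= 0 or tp_size <= 0:
        raise ValueError("world_size and tp_size must be positive")
    if world_size % tp_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp_size={tp_size}"
        )
    return world_size // tp_size
```

With the documented defaults (8 devices, tensor parallel size 4) this returns 2; a pair such as 8 and 3 raises `ValueError`.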

Rollout

Everything that generates and scores completions: batch volume, sequence limits, sampling, decode runtime, the agentic-rollout hooks, and the reward signal.

--batch-size INTEGER

Prompt or pair batch size. Default: 32.

--n-samples INTEGER

Rollout samples per prompt for RL algorithms. Default: 8.

--max-running-prompts INTEGER

Overrides the global number of concurrently running rollout prompts. Defaults to batch-size * n-samples for rollout algorithms.

--max-prompt-tokens INTEGER

Maximum tokenized prompt length. Default: 1024.

--max-new-tokens INTEGER

Maximum generated or supervised response tokens. Default: 3071.

--max-context-len INTEGER

Maximum total context tokens for agentic rollout trajectories. This counts the original prompt plus all generated assistant turns concatenated into the training row. Defaults to the model context limit.

--temperature FLOAT

Rollout sampling temperature. Default: 1.0.

--top-k INTEGER

Rollout top-k; -1 disables top-k filtering. Default: -1.

--top-p FLOAT

Rollout top-p. Default: 1.0.

--greedy

Use greedy rollout decoding.
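To make the interaction of the sampling flags concrete, here is a plain-Python sketch of temperature scaling, top-k, top-p, and greedy selection over a list of logits. It follows the common textbook formulation and is an assumption about behavior, not Areno's decode kernel; in particular, the order in which top-k and top-p are applied varies between implementations. The temperature must be positive.

```python
import math
import random
from typing import List


def filter_probs(logits: List[float], temperature: float = 1.0,
                 top_k: int = -1, top_p: float = 1.0) -> List[float]:
    """Return renormalized probabilities after temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank tokens by probability; top_k <= 0 keeps every token.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


def sample(logits: List[float], greedy: bool = False, rng=None, **kwargs) -> int:
    """Pick a token index: argmax when greedy, otherwise a filtered random draw."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filter_probs(logits, **kwargs)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With the defaults (temperature 1.0, top-k -1, top-p 1.0) no token is filtered out, which matches the documented "disabled" settings.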

--eager-decode

Disable decode CUDA graph and run rollout decode eagerly.

--drop-rollout-state

Drop rollout state after each step to save GPU memory. By default, Areno keeps rollout state on GPU between steps for lower rollout setup overhead.

--attn-backend [flash|native]

Attention backend. Default: flash. Use native to run without flash-attn on the areno_accel native compatibility path. Areno automatically falls back to native on GPUs that flash-attn does not support, such as Tesla T4, and prints a warning. native is slower than flash on supported GPUs.

--disable-thinking

Pass enable_thinking=False to tokenizer chat templates when supported. This is useful for models whose tokenizer template exposes an explicit thinking-mode switch, such as some reasoning/chat checkpoints. Tokenizers that do not accept enable_thinking automatically fall back to their normal chat-template call.

Training rollouts run inside a rollout session. The session owns actor onload/offload, rollout cache state, CUDA graph state, and cleanup between rollout and train phases. Direct prompt rollout and agentic rollout both use the same session lifecycle.

--agent-fn TEXT

Python file defining async def run_agent(ctx, batch). When provided, online RL algorithms use agentic rollout mode instead of direct prompt completion. The agent receives a local OpenAI-compatible base URL from ctx.get_base_url() and can call /v1/chat/completions with tools. Use batch.iter_samples() to iterate the expanded batch-size * n-samples agent tasks. The function returns explicit trajectories: one AgentTrajectoryTurn, one AgentTrajectory, or an iterable of either. Each turn must carry its AgentItem, message list, and OpenAI response.

--agent-timeout-s FLOAT

Timeout for agentic proxy requests and the agent function. Default: 300.0.

--train-tool-results

Include tool-result spans in policy loss. Disabled by default because tool results are environment observations rather than policy actions. Assistant text and assistant tool-call spans are trainable by default.

Agentic trajectories can contain multiple chat-completion turns for the same prompt/sample pair. The agent owns the OpenAI-style message list and returns trajectory turns with the model response; Areno converts those turns into token rows, rollout logprobs, parsed assistant tool calls, reward records, and loss masks. Tool-result/context spans are included in the token row for correct scoring but are masked from policy loss unless --train-tool-results is set. When --max-context-len is set, filtering uses the full concatenated token row for the whole agentic trajectory, not only the latest chat-completion turn.
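The masking rule described above can be sketched as follows. The `Span` type and the span kind names are hypothetical stand-ins for whatever Areno uses internally; the sketch only shows which tokens would contribute to policy loss and how a whole-trajectory length check differs from a per-turn check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    # Hypothetical kinds: "prompt", "assistant_text",
    # "assistant_tool_call", or "tool_result".
    kind: str
    num_tokens: int


def build_loss_mask(spans: List[Span], train_tool_results: bool = False) -> List[int]:
    """Return one 0/1 entry per token; 1 means the token is trained on."""
    trainable = {"assistant_text", "assistant_tool_call"}
    if train_tool_results:
        trainable.add("tool_result")
    mask: List[int] = []
    for span in spans:
        mask.extend([1 if span.kind in trainable else 0] * span.num_tokens)
    return mask


def within_context(spans: List[Span], max_context_len: int) -> bool:
    """Check the full concatenated row, not only the latest turn."""
    return sum(span.num_tokens for span in spans) <= max_context_len
```

Every token stays in the row regardless of its mask value, so tool results still condition later assistant turns even when they are excluded from the loss.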

--reward-fn-path TEXT

Python file defining reward_fn(record).

Reward files should expose:

def reward_fn(record):
    return 0.0

--reward-ckpt TEXT

Optional PPO reward model checkpoint path or Hugging Face repo ID.
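As a slightly fuller illustration of the reward_fn(record) contract used by --reward-fn-path, the sketch below scores a numeric answer. It assumes that record behaves like a dictionary and that it carries `response` and `answer` fields; both are assumptions made for illustration, so check the record schema produced by your dataset loader before reusing it.

```python
import re


def reward_fn(record):
    """Return 1.0 when the last number in the response equals the reference."""
    # Assumed field names; the real record schema may differ.
    response = str(record.get("response", ""))
    answer = str(record.get("answer", "")).replace(",", "").strip()

    # Collect numbers such as 42, -3, or 1234.5 (thousands separators removed).
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers or not answer:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(answer) else 0.0
    except ValueError:
        # The reference answer was not numeric.
        return 0.0
```

Returning a plain float keeps the function compatible with the minimal `return 0.0` form shown for --reward-fn-path.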

Parameter tuning

--tune-params

Probe rollout and training memory before starting the real run, then fill safe values for --max-running-prompts, --batch-size and --mini-bs. This is intended for rollout-based algorithms such as GSPO, GRPO and PPO, including agentic rollouts where the right concurrency is hard to estimate by hand.

The tuner uses dummy-loaded model weights and synthetic token rows, so it measures the selected model architecture, --world-size/--tp-size, sequence lengths, CUDA graph setup, optimizer state, and train microbatch memory without consuming a real dataset row or writing checkpoints. It keeps the user-provided --tp-size, --n-samples, --adam-8bit, sequence limits, model path, algorithm, and backend settings.

Search is deliberately conservative:

  • rollout candidates are tried from larger to smaller --max-running-prompts values;

  • if --max-running-prompts is explicitly provided, rollout probing is skipped and that concurrency is used directly for training-parameter tuning;

  • training uses the rollout-selected concurrency to derive a batch size, then tries larger to smaller --mini-bs values;

  • --drop-rollout-state is enabled for the tuned run so rollout memory does not remain resident during the training probe or optimizer step.
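The search order in the list above can be sketched as a largest-first scan. Everything here is illustrative: `rollout_fits` and `train_fits` are hypothetical callbacks standing in for the real memory probes, and deriving the batch size as concurrency divided by --n-samples is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, Iterable, Optional


def first_fitting(candidates: Iterable[int],
                  fits: Callable[[int], bool]) -> Optional[int]:
    """Try candidates from largest to smallest and return the first that fits."""
    for value in sorted(set(candidates), reverse=True):
        if fits(value):
            return value
    return None


def tune(n_samples: int,
         rollout_candidates: Iterable[int],
         mini_bs_candidates: Iterable[int],
         rollout_fits: Callable[[int], bool],
         train_fits: Callable[[int, int], bool],
         max_running_prompts: Optional[int] = None) -> Dict[str, int]:
    # An explicitly provided concurrency skips the rollout probe.
    concurrency = max_running_prompts or first_fitting(rollout_candidates, rollout_fits)
    if concurrency is None:
        raise RuntimeError("no rollout concurrency fits in memory")

    # Assumed derivation: one prompt contributes n_samples rollout rows.
    batch_size = max(1, concurrency // n_samples)

    mini_bs = first_fitting(mini_bs_candidates,
                            lambda m: train_fits(batch_size, m))
    if mini_bs is None:
        raise RuntimeError("no training microbatch size fits in memory")
    return {"max_running_prompts": concurrency,
            "batch_size": batch_size,
            "mini_bs": mini_bs}
```

Scanning from large to small means the first success is also the largest value that passed the probe, which keeps the number of probes low when generous settings fit.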

--mem-frac FLOAT

Target maximum GPU memory fraction for tuning. Default: 0.9. Lower this when sharing a node or when the real reward/agent path has additional GPU users.

--tune-max-samples INTEGER

Upper bound for sampled rollout/train rows considered during tuning. Default: 256. The rollout search does not try --max-running-prompts above this value, and the derived train batch size is capped by tune-max-samples / n-samples.

Example:

areno train \
  --ckpt /path/to/Qwen3.5-4B \
  --dataset-path examples/agentic/coding/dataset.jsonl \
  --dataset-loader-fn examples/agentic/coding/dataset_loader.py \
  --reward-fn-path examples/agentic/coding/reward.py \
  --agent-fn examples/agentic/coding/run_agent.py \
  --algo gspo \
  --world-size 8 \
  --tp-size 4 \
  --n-samples 8 \
  --max-new-tokens 2048 \
  --max-context-len 32768 \
  --tune-params \
  --mem-frac 0.9 \
  --tune-max-samples 256

Train

Everything that consumes rollouts and updates weights: training batching and memory, the policy optimizer, reference/critic models, and the per-algorithm loss knobs. Each algorithm-specific flag applies only to the algorithms named in its description; flags for other algorithms are ignored.

--mini-bs INTEGER

Backend training microbatch size. Default: 16.

--gradient-accumulation-steps INTEGER

Number of microbatches between optimizer steps. By default, all microbatches in one train call are accumulated.
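The relationship between --mini-bs and --gradient-accumulation-steps can be worked through with a little arithmetic. The sketch assumes that one train call sees batch-size * n-samples rows for rollout algorithms and that a trailing partial group of microbatches still triggers an optimizer step; neither detail is confirmed here.

```python
import math
from typing import Optional


def optimizer_steps_per_train_call(rows: int, mini_bs: int,
                                   grad_accum_steps: Optional[int] = None) -> int:
    """Count optimizer steps taken while consuming `rows` training rows."""
    microbatches = math.ceil(rows / mini_bs)
    if grad_accum_steps is None:
        # Default: accumulate every microbatch, then step once.
        return 1
    return math.ceil(microbatches / grad_accum_steps)
```

For example, with the defaults of 32 prompts, 8 samples, and a microbatch size of 16 there are 256 rows and 16 microbatches, so an accumulation interval of 4 gives 4 optimizer steps under these assumptions.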

--activation-checkpointing / --no-activation-checkpointing

Enable decoder-layer activation recompute during training. Default: enabled.

--lr FLOAT

Policy optimizer learning rate. Default: 1.0e-6.

--min-lr FLOAT

Policy optimizer minimum learning rate. Default: 1.0e-7.

--lr-decay-steps INTEGER

Policy LR decay steps. Default: 1000.

--lr-decay-style TEXT

Policy LR decay style. Default: cosine.
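A standard cosine schedule built from --lr, --min-lr, and --lr-decay-steps looks like the sketch below. It assumes no warmup phase and a constant rate after the decay window, which may differ from Areno's scheduler.

```python
import math


def cosine_lr(step: int, lr: float = 1.0e-6, min_lr: float = 1.0e-7,
              decay_steps: int = 1000) -> float:
    """Cosine decay from lr at step 0 to min_lr at decay_steps, then flat."""
    progress = min(max(step, 0), decay_steps) / decay_steps
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the documented defaults the rate starts at 1.0e-6, passes through 5.5e-7 at step 500, and stays at 1.0e-7 from step 1000 onward.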

--adam-beta1 FLOAT

Policy optimizer Adam beta1. Default: 0.9.

--adam-beta2 FLOAT

Policy optimizer Adam beta2. Default: 0.999.

--adam-8bit

Use 8-bit Adam moment states instead of FP32 Adam states.

--weight-decay FLOAT

Policy optimizer weight decay. Default: 1.0e-2.

--grad-clip-norm FLOAT

Policy gradient clipping norm. Default: 1.0.

--ref-ckpt TEXT

Optional PPO/DPO reference model checkpoint path or Hugging Face repo ID.

--critic-ckpt TEXT

Optional PPO critic model checkpoint path or Hugging Face repo ID.

--critic-lr FLOAT

PPO critic optimizer learning rate. Default: 1.0e-5.

--critic-warmup-steps INTEGER

PPO critic-only warmup steps before actor updates. Default: 20.

--gspo-clip-eps FLOAT

GSPO sequence-ratio clipping epsilon. Default: 3.0e-4.

--grpo-clip-eps FLOAT

GRPO token-ratio clipping epsilon. Default: 0.2.
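The difference between the two clipping flags is easier to see side by side. The sketch follows the published GRPO and GSPO formulations with a symmetric clipping range: GRPO clips one importance ratio per token, while GSPO clips a single length-normalized ratio per sequence. It is an illustration for one response with one scalar advantage, not Areno's loss code.

```python
import math
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def grpo_token_objective(new_logps: List[float], old_logps: List[float],
                         advantage: float, eps: float = 0.2) -> float:
    """Mean over tokens of min(r_t * A, clip(r_t) * A)."""
    terms = []
    for new, old in zip(new_logps, old_logps):
        ratio = math.exp(new - old)  # per-token importance ratio
        terms.append(min(ratio * advantage,
                         clip(ratio, 1.0 - eps, 1.0 + eps) * advantage))
    return sum(terms) / len(terms)


def gspo_sequence_objective(new_logps: List[float], old_logps: List[float],
                            advantage: float, eps: float = 3.0e-4) -> float:
    """min(s * A, clip(s) * A) with one sequence-level ratio s."""
    mean_delta = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    ratio = math.exp(mean_delta)  # geometric mean of the token ratios
    return min(ratio * advantage,
               clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

Because the sequence ratio is a geometric mean over many tokens, it stays much closer to 1 than individual token ratios do, which is consistent with the GSPO default being far smaller than the GRPO default.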

--dpo-beta FLOAT

DPO preference margin temperature. Default: 0.1.
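For reference, the standard DPO loss for one preference pair shows where the beta value enters. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the reference model; this is the textbook formula, offered as a sketch rather than Areno's implementation.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Return -log(sigmoid(beta * margin)) for one preference pair."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the margin is zero and the loss is log 2; a larger beta makes the loss react more sharply to the same margin.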

--use-kl-loss / --no-use-kl-loss

Enable PPO actor KL loss. Default: enabled.

--kl-loss-coef FLOAT

PPO actor KL loss coefficient. Default: 0.001.

--kl-loss-type TEXT

PPO actor KL loss type. Default: low_var_kl.

--clip-eps FLOAT

PPO policy clipping epsilon. Default: 0.2.

--clip-ratio-c FLOAT

PPO lower policy clipping bound multiplier. Default: 3.0.
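One common reading of these two flags is the dual-clip PPO objective, where --clip-eps bounds the ratio in the usual way and --clip-ratio-c bounds the loss for negative advantages. That interpretation is an assumption based on the flag descriptions; the sketch shows the per-token loss under it.

```python
def ppo_dual_clip_loss(ratio: float, advantage: float,
                       clip_eps: float = 0.2, clip_ratio_c: float = 3.0) -> float:
    """Per-token policy loss with standard clipping plus a dual-clip bound."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Standard PPO: take the more pessimistic of the two surrogate losses.
    loss = max(-advantage * ratio, -advantage * clipped)
    if advantage < 0:
        # Dual clip: bound the loss when the ratio is very large.
        loss = min(loss, -advantage * clip_ratio_c)
    return loss
```

The extra bound only matters when the advantage is negative and the ratio exceeds clip_ratio_c, which limits the size of updates driven by badly off-policy tokens.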

--value-clip-eps FLOAT

PPO value clipping epsilon. Default: 0.5.

--value-loss-coef FLOAT

PPO value loss coefficient. Default: 0.5.

--gamma FLOAT

PPO GAE discount. Default: 1.0.

--lam FLOAT

PPO GAE lambda. Default: 0.95.
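The roles of --gamma and --lam are clearest in the generalized advantage estimation recursion. The sketch below is the standard algorithm over one trajectory of per-step rewards and critic values, with the value after the final step taken as zero; it is not Areno's code.

```python
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Return one advantage per step using the backward GAE recursion."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap from the next value; zero after the last step.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a single terminal reward of 1, zero critic values, and the documented defaults, the advantages decay by a factor of 0.95 per step moving back from the end of the trajectory.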

Checkpoint

--save-path TEXT

Optional checkpoint output directory.

--save-interval INTEGER

Save checkpoint every N train steps. Default: 100.

Observability

--metrics-log-dir TEXT

TensorBoard metrics log directory. See Observability for the rollout/*, train/*, and time/* metric namespaces and debugging log examples.

Examples

Tiny training smoke test

Use this command when you only want to check that a machine can run one small official training task end to end:

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path gsm8k:main \
  --dataset-loader-fn examples/math/dataset_loader.py \
  --reward-fn-path examples/math/math_verify_reward.py \
  --algo gspo \
  --tp-size 1 \
  --world-size 1 \
  --batch-size 1

This verifies the training wiring; it is not intended to measure final model quality.

GSPO math training

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path gsm8k:main \
  --dataset-loader-fn examples/math/dataset_loader.py \
  --reward-fn-path examples/math/math_verify_reward.py \
  --algo gspo \
  --tp-size 1 \
  --world-size 1 \
  --batch-size 2 \
  --n-samples 2 \
  --mini-bs 1

SFT instruction tuning

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path yahma/alpaca-cleaned \
  --dataset-loader-fn examples/sft/alpaca/dataset_loader.py \
  --algo sft \
  --tp-size 1 \
  --world-size 1 \
  --batch-size 2 \
  --mini-bs 1

SFT loaders must normalize raw rows to prompt and response dictionaries. The trainer performs tokenization and trains on the response suffix.
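As an illustration of that normalization for an Alpaca-style row, the helper below maps the `instruction`, `input`, and `output` columns to a prompt and a response. The function name and the exact output shape are hypothetical; the loader entry point that Areno expects is defined by the Dataset loaders contract, not by this sketch.

```python
def normalize_alpaca_row(row: dict) -> dict:
    """Map one Alpaca-style row to prompt and response fields."""
    instruction = row["instruction"].strip()
    # The "input" column is optional context and is often empty.
    context = (row.get("input") or "").strip()
    prompt = f"{instruction}\n\n{context}" if context else instruction
    return {"prompt": prompt, "response": row["output"].strip()}
```

Tokenization is left out on purpose, since the trainer performs it after the loader has run.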

DPO preference training

areno train \
  --ckpt /path/to/policy \
  --ref-ckpt /path/to/reference \
  --dataset-path /path/to/dpo.jsonl \
  --dataset-loader-fn /path/to/dpo_dataset_loader.py \
  --algo dpo \
  --tp-size 1 \
  --world-size 1

The DPO loader should normalize each row to prompt, chosen, and rejected.

PPO with reward and critic roles

areno train \
  --ckpt /path/to/policy \
  --ref-ckpt /path/to/reference \
  --reward-ckpt /path/to/reward-model \
  --critic-ckpt /path/to/critic \
  --dataset-path /path/to/data \
  --dataset-loader-fn examples/math/dataset_loader.py \
  --algo ppo \
  --tp-size 4 \
  --world-size 8

Agentic Tic-Tac-Toe

python examples/agentic/tictactoe/dataset_generator.py \
  --output /tmp/areno-tictactoe.jsonl \
  --count 256 \
  --seed 2026

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path /tmp/areno-tictactoe.jsonl \
  --dataset-loader-fn examples/agentic/tictactoe/dataset_loader.py \
  --reward-fn-path examples/agentic/tictactoe/reward.py \
  --agent-fn examples/agentic/tictactoe/run_agent.py \
  --algo gspo \
  --tp-size 1 \
  --world-size 1 \
  --batch-size 32 \
  --n-samples 8 \
  --max-new-tokens 32

The agent function can use the OpenAI Python client against ctx.get_base_url() and returns explicit AgentTrajectory or AgentTrajectoryTurn objects. Areno converts them to tokens, rollout logprobs, parsed tool_calls, rewards, and loss masks, then feeds the resulting batch to the same policy trainer used by non-agentic rollouts.

Agentic DuelGrid

DuelGrid is a turn-based grid tactics example for agentic RLVR. The user controls U and the LLM controls A with JSON action sequences such as MOVE, ATTACK, RANGED_ATTACK, PICKUP, and SHIELD.

python examples/agentic/duelgrid/dataset_generator.py \
  --count 256 \
  --output /tmp/areno-duelgrid.jsonl

areno train \
  --ckpt Qwen/Qwen3-0.6B \
  --dataset-path /tmp/areno-duelgrid.jsonl \
  --dataset-loader-fn examples/agentic/duelgrid/dataset_loader.py \
  --reward-fn-path examples/agentic/duelgrid/reward.py \
  --agent-fn examples/agentic/duelgrid/run_agent.py \
  --algo gspo \
  --tp-size 1 \
  --world-size 1

The browser UI can replay the same rule engine:

python examples/agentic/duelgrid/web_ui.py \
  --base-url http://127.0.0.1:8000/v1 \
  --api-key EMPTY \
  --model policy

Before GSPO/RLVR post-training, Gemma-E2B-it performs poorly in DuelGrid and often oscillates between nearby tiles. After training, it learns to collect health and energy pickups, chase the user, attack when it has position, and avoid trap tiles while spending its turn energy. The reward curve improves quickly early in training and then stabilizes after the policy has learned the game loop.

Figure panels: Train before (Gemma-E2B-it before DuelGrid training), Reward (DuelGrid training reward curve), Train after (Gemma-E2B-it after DuelGrid training).

Help

areno train --help