Inference CLI reference

areno serve

Start an OpenAI-compatible HTTP server backed by the Areno inference engine. The server exposes /v1/chat/completions, accepts standard chat-completion tools fields, and keeps one rollout session open for the process lifetime so rollout state and CUDA graph state can be reused across requests.

areno serve \
  --model-path /path/to/hf/checkpoint \
  --tp-size 1 \
  --world-size 1 \
  --host 0.0.0.0 \
  --port 8000

areno serve

Serve chat completions.

Options:

--model-path TEXT

Local checkpoint/tokenizer path or Hugging Face repo ID. Required.

--tp-size INTEGER

Tensor parallel size. Default: 1.

--world-size INTEGER

Total number of local worker ranks. Default: 1.

--host TEXT

HTTP bind host. Default: 0.0.0.0.

--port INTEGER

HTTP bind port. Default: 8000.

--max-running-prompts INTEGER

Maximum concurrent rollout prompts per request chunk. Default: 128.

--default-max-tokens INTEGER

Default max generated tokens when requests omit a token budget. Default: 1024.

--decode-progress-interval-s FLOAT

Worker decode progress log interval. Default: 0.0.

--eager-decode

Disable decode CUDA graph and run rollout decode eagerly.

--attn-backend [flash|native]

Attention backend. Default: flash. Use native to run without flash-attn on the areno_accel native compatibility path. AReno automatically falls back to native on flash-attn-unsupported GPUs such as Tesla T4 and prints a warning. native is slower than flash on supported GPUs.

--disable-thinking

Pass enable_thinking=False to tokenizer chat templates when supported. Use this when serving a model whose chat template supports a thinking-mode switch and you want normal responses without reasoning spans. Tokenizers that do not accept enable_thinking automatically fall back to their normal chat-template call.

world-size must be divisible by tp-size.

Examples

Single-rank server

areno serve \
  --model-path /path/to/model \
  --tp-size 1 \
  --world-size 1 \
  --port 8000

TP4 server

areno serve \
  --model-path /path/to/model \
  --tp-size 4 \
  --world-size 4 \
  --port 8000

Chat completion request

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "areno",
    "messages": [
      {"role": "user", "content": "Solve 12 * 13."}
    ],
    "max_tokens": 128,
    "temperature": 0.0
  }'

Request fields

POST /v1/chat/completions

Field

Type

Description

model

str | None

Optional model name echoed by the client.

messages

list[ChatMessage]

Required chat messages.

max_tokens

int | None

Generated token budget.

max_completion_tokens

int | None

Alternative generated token budget.

temperature

float

Sampling temperature. Defaults to 1.0; use 0.0 for greedy decoding.

top_p

float

Nucleus sampling threshold.

top_k

int

Top-k sampling threshold. Defaults to -1; non-positive values disable top-k filtering.

n

int

Number of completions per prompt.

stream

bool

Streaming flag. true is not supported.

stop

str | list[str] | None

Stop string or list of stop strings.

seed

int | None

Deterministic sampling seed when sampling is enabled.

tools

list[Tool] | None

OpenAI-compatible function tools. The same model-native tool-call parser used by agentic rollout converts generated tool-call text into message.tool_calls for supported model families.

tool_choice

str | dict | None

Optional tool-choice directive, including a forced function name.

ChatMessage fields:

role

Usually system, user, assistant, or tool.

content

Message content as str | list | None.

Continuous batching behavior

The server runs inside a long-lived rollout session. Compatible requests can be admitted into an active worker decode loop through continuous batching; requests with different generation settings are scheduled separately. Requests are compatible when these fields match:

  • generated token budget

  • temperature

  • top-p

  • top-k

  • seed

  • stop token ids

  • EOS token id

Requests with different generation settings are scheduled separately.

Decode progress logs

Set --decode-progress-interval-s to a positive value to print worker decode progress:

rollout decode progress: dp=0/4 active=32 cuda_graph=True tokens_per_second=2810.7

tokens_per_second is the scheduled decode throughput for that DP worker in the reporting window. It excludes prefill and is not the same as end-to-end request throughput. cuda_graph=True means the worker used CUDA graph replay for at least one decode step in that window; False means the window ran eagerly.

Tool calls

areno serve supports the Chat Completions tool-call shape and reuses the same tool-call parser as agentic rollout:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="areno",
    messages=[{"role": "user", "content": "Choose a move: left or right."}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "choose_move",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "direction": {"type": "string", "enum": ["left", "right"]},
                    },
                    "required": ["direction"],
                },
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "choose_move"}},
)

print(response.choices[0].message.tool_calls)

Tool-call parsing is selected from the model/tokenizer family. Current parsers cover Qwen/Qwen3.5/MiniCPM-style <tool_call> blocks, Gemma4 tool-call blocks, and generic JSON tool-call output.

Help

areno serve --help