Reward Usages

A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.

Summary

  • Model Usage: reward
  • Pooling Task:

    | Model Types | Pooling Tasks |
    |---|---|
    | (sequence) (outcome) reward models | `classify` |
    | token (outcome) reward models | `token_classify` |
    | process reward models | `token_classify` |

  • Offline APIs:
    • LLM.encode(..., pooling_task="...")
  • Online APIs:
    • Pooling API (/pooling)

Supported Models

Reward Models

When sequence classification models are used as (sequence) (outcome) reward models, the usage and supported features are the same as for normal classification models.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using `as_seq_cls_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
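As a sketch, the automatic conversion can be triggered when serving a generative checkpoint with the `--convert classify` flag described above; the model name here is illustrative, so substitute your own checkpoint:

```shell
# Serve a generative model as a (sequence) reward model by converting it
# into a classification model at load time. Model name is illustrative.
vllm serve Qwen/Qwen3-0.6B --runner pooling --convert classify
```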

Token Reward Models

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.

When token classification models are used as token (outcome) reward models, the usage and supported features are the same as for normal token classification models.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using `as_seq_cls_model`.

Process Reward Models

Process reward models evaluate the intermediate steps of a response, which is crucial to achieving the desired final outcome.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |

Important

For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
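The IDs in the flag above are placeholders; in practice you would look them up with the model's tokenizer (e.g. the step-tag token and the candidate "correct"/"incorrect" tokens from the model card). A minimal sketch of assembling the flag value, with hypothetical token IDs standing in for real lookups:

```python
import json

# Hypothetical token IDs -- in practice, obtain them from the model's
# tokenizer, e.g. the ID of the step-tag token and the IDs of the
# "correct" / "incorrect" candidate tokens named in the model card.
step_tag_id = 12902
plus_id, minus_id = 648, 387

# Serialize the pooler config exactly as the CLI flag expects it.
pooler_config = json.dumps({
    "pooling_type": "STEP",
    "step_tag_id": step_tag_id,
    "returned_token_ids": [plus_id, minus_id],
})
print(f"--pooler-config '{pooler_config}'")
```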

Offline Inference

Pooling Parameters

The following pooling parameters are supported.

```python
use_activation: bool | None = None
```

LLM.encode

The encode method is available to all pooling models in vLLM.

  • Reward Models

Set pooling_task="classify" when using LLM.encode for (sequence) (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
  • Token Reward Models

Set pooling_task="token_classify" when using LLM.encode for token (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
  • Process Reward Models

Set pooling_task="token_classify" when using LLM.encode for process reward models:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
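With STEP pooling, the returned data typically contains one row per step-tag position, with a probability for each of the configured `returned_token_ids`. A minimal sketch of extracting a per-step "correct" score, assuming a (num_steps x 2) layout of [incorrect, correct] probabilities and using stand-in values in place of a real `llm.encode(...)` result:

```python
# Stand-in for output.outputs.data from a process reward model:
# one [p_incorrect, p_correct] pair per step-tag position.
data = [
    [0.12, 0.88],  # step 1
    [0.40, 0.60],  # step 2
    [0.75, 0.25],  # step 3
]

# Take the "correct" column as the score for each reasoning step.
step_scores = [row[1] for row in data]
print(step_scores)  # [0.88, 0.6, 0.25]
```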

Online Serving

Please refer to the Pooling API (`/pooling`). For the pooling task corresponding to each reward model type, refer to the table above.
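As a sketch, a request to the `/pooling` endpoint of a running server (assumed at `localhost:8000`) can be built as below; the model name is illustrative and must match the served model, and the exact request fields should be checked against the Pooling API reference:

```python
import json
from urllib.request import Request, urlopen

# Request body, assuming an embeddings-style payload with a "model"
# and an "input" field; verify against the Pooling API reference.
payload = {
    "model": "Skywork/Skywork-Reward-V2-Qwen3-0.6B",
    "input": "Hello, my name is",
}

def score(url: str = "http://localhost:8000/pooling") -> dict:
    """POST the payload to a running vLLM server and return the JSON reply."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:  # requires a running server
        return json.load(resp)

# response = score()  # uncomment once a server is running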