Reward Usages¶
A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.
Summary¶
- Model Usage: reward
- Pooling Task:
| Model Types | Pooling Tasks |
|---|---|
| (sequence) (outcome) reward models | classify |
| token (outcome) reward models | token_classify |
| process reward models | token_classify |
- Offline APIs:
LLM.encode(..., pooling_task="...")
- Online APIs:
  - Pooling API (/pooling)
Supported Models¶
Reward Models¶
Using sequence classification models as (sequence) (outcome) reward models, the usage and supported features are the same as for normal classification models.
| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| JambaForSequenceClassification | Jamba | ai21labs/Jamba-tiny-reward-dev, etc. | ✅︎ | ✅︎ |
| Qwen3ForSequenceClassification^C | Qwen3-based | Skywork/Skywork-Reward-V2-Qwen3-0.6B, etc. | ✅︎ | ✅︎ |
| LlamaForSequenceClassification^C | Llama-based | Skywork/Skywork-Reward-V2-Llama-3.2-1B, etc. | ✅︎ | ✅︎ |
| *Model^C, *ForCausalLM^C, etc. | Generative models | N/A | * | * |
^C Automatically converted into a classification model via --convert classify. (details)
If your model is not in the above list, we will try to automatically convert the model using as_seq_cls_model. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
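To make the conversion concrete, here is a minimal stdlib-only sketch of the described behavior: project the hidden state of the last token through a classifier head and softmax the result. The hidden states and head weights are toy values invented for illustration; in vLLM the conversion happens inside as_seq_cls_model.

```python
import math

# Toy per-token hidden states; in practice these come from the model.
hidden_states = [[0.1, -0.4], [0.3, 0.2], [0.5, -0.1]]
# Hypothetical classifier head mapping 2 hidden dims -> 2 labels.
score_head = [[1.0, -1.0], [0.5, 2.0]]

last = hidden_states[-1]  # only the last token's hidden state is used
logits = [sum(h * w for h, w in zip(last, col)) for col in zip(*score_head)]

# Softmax over the logits -> class probabilities.
z = [math.exp(x) for x in logits]
probs = [x / sum(z) for x in z]

print(len(probs), round(sum(probs), 6))  # 2 labels, probabilities sum to 1
```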
Token Reward Models¶
The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
Using token classification models as token (outcome) reward models, the usage and supported features are the same as for normal token classification models.
| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| InternLM2ForRewardModel | InternLM2-based | internlm/internlm2-1_8b-reward, internlm/internlm2-7b-reward, etc. | ✅︎ | ✅︎ |
| Qwen2ForRewardModel | Qwen2-based | Qwen/Qwen2.5-Math-RM-72B, etc. | ✅︎ | ✅︎ |
| *Model^C, *ForCausalLM^C, etc. | Generative models | N/A | * | * |
^C Automatically converted into a classification model via --convert classify. (details)
If your model is not in the above list, we will try to automatically convert the model using as_seq_cls_model.
Process Reward Models¶
Process reward models score the intermediate steps of a multi-step response; evaluating these steps is crucial to achieving the desired final outcome.
| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| LlamaForCausalLM | Llama-based | peiyi9979/math-shepherd-mistral-7b-prm, etc. | ✅︎ | ✅︎ |
| Qwen2ForProcessRewardModel | Qwen2-based | Qwen/Qwen2.5-Math-PRM-7B, etc. | ✅︎ | ✅︎ |
Important
For process-supervised reward models such as peiyi9979/math-shepherd-mistral-7b-prm, the pooling config should be set explicitly, e.g.: --pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'.
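As a rough illustration of what STEP pooling computes, the sketch below uses the placeholder ids from the example config above (123, 456, 789 are placeholders, not real token ids): positions whose token matches step_tag_id mark completed steps, and per-step scores come from softmaxing the logits of returned_token_ids at those positions. The logit values are toy numbers.

```python
import math

STEP_TAG_ID = 123                # placeholder id marking the end of each step
RETURNED_TOKEN_IDS = [456, 789]  # placeholder label-token ids (e.g. good/bad)

token_ids = [10, 11, 123, 12, 13, 123]  # a response containing two steps
# Toy logits for the returned token ids at each step-tag position.
logits = {2: [2.0, 0.5], 5: [0.2, 1.5]}

step_scores = []
for pos, tok in enumerate(token_ids):
    if tok == STEP_TAG_ID:
        z = [math.exp(x) for x in logits[pos]]
        step_scores.append([x / sum(z) for x in z])  # softmax over label tokens

print(len(step_scores))  # one probability pair per step
```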
Offline Inference¶
Pooling Parameters¶
The following pooling parameters are supported.
LLM.encode¶
The encode method is available to all pooling models in vLLM.
- Reward Models
Set pooling_task="classify" when using LLM.encode for (sequence) (outcome) reward models:
```python
from vllm import LLM

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
- Token Reward Models
Set pooling_task="token_classify" when using LLM.encode for token (outcome) reward models:
```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
- Process Reward Models
Set pooling_task="token_classify" when using LLM.encode for process reward models:
```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
Online Serving¶
Please refer to the Pooling API. For the pooling task corresponding to each reward model type, refer to the table above.
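As a hedged sketch of what an online request might look like, the snippet below builds a JSON body whose fields mirror the offline examples above (model name and input text are taken from them); the server URL and exact request schema should be checked against the Pooling API documentation.

```python
import json

# Assumed request body for the /pooling endpoint; verify the schema
# against the Pooling API docs for your vLLM version.
payload = {
    "model": "internlm/internlm2-1_8b-reward",
    "input": "Hello, my name is",
}
body = json.dumps(payload)
print(body)

# The body would then be POSTed to the server, e.g.:
#   curl -X POST http://localhost:8000/pooling \
#        -H "Content-Type: application/json" -d '<body>'
```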