Reward Usages

A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.

Summary

  • Model Usage: reward
  • Pooling Task:

    | Model Types | Pooling Tasks |
    |---|---|
    | (sequence) (outcome) reward models | `classify` |
    | token (outcome) reward models | `token_classify` |
    | process reward models | `token_classify` |

  • Offline APIs:
    • LLM.encode(..., pooling_task="...")
  • Online APIs:
    • Pooling API (/pooling)

Supported Models

Reward Models

When sequence classification models are used as (sequence) (outcome) reward models, the usage and supported features are the same as for normal classification models.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using `as_seq_cls_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
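As a sketch, the automatic conversion can be triggered when serving a generative checkpoint with the `--convert classify` flag described above; the model name here is illustrative, so substitute your own checkpoint:

```shell
# Serve a generative model as a (sequence) reward model by converting it
# into a classification model at load time. Model name is illustrative.
vllm serve Qwen/Qwen3-0.6B --runner pooling --convert classify
```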

Token Reward Models

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.

When token classification models are used as token (outcome) reward models, the usage and supported features are the same as for normal token classification models.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using `as_seq_cls_model`.

Process Reward Models

Process reward models evaluate the intermediate steps of a response, which is crucial to achieving the desired final outcome.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |

Important

For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
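The IDs in the flag above are placeholders; in practice you would look them up with the model's tokenizer (e.g. the step-tag token and the candidate "correct"/"incorrect" tokens from the model card). A minimal sketch of assembling the flag value, with hypothetical token IDs standing in for real lookups:

```python
import json

# Hypothetical token IDs -- in practice, obtain them from the model's
# tokenizer, e.g. the ID of the step-tag token and the IDs of the
# "correct" / "incorrect" candidate tokens named in the model card.
step_tag_id = 12902
plus_id, minus_id = 648, 387

# Serialize the pooler config exactly as the CLI flag expects it.
pooler_config = json.dumps({
    "pooling_type": "STEP",
    "step_tag_id": step_tag_id,
    "returned_token_ids": [plus_id, minus_id],
})
print(f"--pooler-config '{pooler_config}'")
```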

Offline Inference

Pooling Parameters

The following pooling parameters are supported.

```python
use_activation: bool | None = None
```

LLM.encode

The encode method is available to all pooling models in vLLM.

  • Reward Models

Set pooling_task="classify" when using LLM.encode for (sequence) (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
  • Token Reward Models

Set pooling_task="token_classify" when using LLM.encode for token (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
  • Process Reward Models

Set pooling_task="token_classify" when using LLM.encode for process reward models:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
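With STEP pooling, the returned data typically contains one row per step-tag position, with a probability for each of the configured `returned_token_ids`. A minimal sketch of extracting a per-step "correct" score, assuming a (num_steps x 2) layout of [incorrect, correct] probabilities and using stand-in values in place of a real `llm.encode(...)` result:

```python
# Stand-in for output.outputs.data from a process reward model:
# one [p_incorrect, p_correct] pair per step-tag position.
data = [
    [0.12, 0.88],  # step 1
    [0.40, 0.60],  # step 2
    [0.75, 0.25],  # step 3
]

# Take the "correct" column as the score for each reasoning step.
step_scores = [row[1] for row in data]
print(step_scores)  # [0.88, 0.6, 0.25]
```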

Online Serving

Please refer to the Pooling API (`/pooling`). For the pooling task corresponding to each reward model type, refer to the table above.
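As a sketch, a request to the `/pooling` endpoint of a running server (assumed at `localhost:8000`) can be built as below; the model name is illustrative and must match the served model, and the exact request fields should be checked against the Pooling API reference:

```python
import json
from urllib.request import Request, urlopen

# Request body, assuming an embeddings-style payload with a "model"
# and an "input" field; verify against the Pooling API reference.
payload = {
    "model": "Skywork/Skywork-Reward-V2-Qwen3-0.6B",
    "input": "Hello, my name is",
}

def score(url: str = "http://localhost:8000/pooling") -> dict:
    """POST the payload to a running vLLM server and return the JSON reply."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:  # requires a running server
        return json.load(resp)

# response = score()  # uncomment once a server is running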