Scoring Usages

Score models compute similarity scores between two input prompts. vLLM supports three score model types (aka score_type): cross-encoder, late-interaction, and bi-encoder.

Note

vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks like LangChain.

Summary

  • Model Usage: Scoring
  • Pooling Task:
| Score Type | Pooling Task | Scoring function |
|------------------|--------------|----------------------------|
| cross-encoder | score | linear classifier |
| late-interaction | token_embed | late interaction (MaxSim) |
| bi-encoder | embed | cosine similarity |
  • Offline APIs:
    • LLM.score
  • Online APIs:
    • Score API (/score)
    • Rerank API (/rerank, /v1/rerank, /v2/rerank)

Supported Models

Cross-encoder models

Cross-encoder (aka reranker) models are a subset of classification models: they accept two prompts as input and have num_labels equal to 1.

Text-only Models

| Architecture | Models | Example HF Models | Score template (see note) | LoRA | PP |
|---|---|---|---|---|---|
| BertForSequenceClassification | BERT-based | cross-encoder/ms-marco-MiniLM-L-6-v2, etc. | N/A | | |
| GemmaForSequenceClassification | Gemma-based | BAAI/bge-reranker-v2-gemma (see note), etc. | bge-reranker-v2-gemma.jinja | ✅︎ | ✅︎ |
| GteNewForSequenceClassification | mGTE-TRM (see note) | Alibaba-NLP/gte-multilingual-reranker-base, etc. | N/A | | |
| LlamaBidirectionalForSequenceClassification^C | Llama-based with bidirectional attention | nvidia/llama-nemotron-rerank-1b-v2, etc. | nemotron-rerank.jinja | ✅︎ | ✅︎ |
| Qwen2ForSequenceClassification^C | Qwen2-based | mixedbread-ai/mxbai-rerank-base-v2 (see note), etc. | mxbai_rerank_v2.jinja | ✅︎ | ✅︎ |
| Qwen3ForSequenceClassification^C | Qwen3-based | tomaarsen/Qwen3-Reranker-0.6B-seq-cls, Qwen/Qwen3-Reranker-0.6B (see note), etc. | qwen3_reranker.jinja | ✅︎ | ✅︎ |
| RobertaForSequenceClassification | RoBERTa-based | cross-encoder/quora-roberta-base, etc. | N/A | | |
| XLMRobertaForSequenceClassification | XLM-RoBERTa-based | BAAI/bge-reranker-v2-m3, etc. | N/A | | |
| *Model^C, *ForCausalLM^C, etc. | Generative models | N/A | N/A | * | * |

^C Automatically converted into a classification model via --convert classify. (details)
* Feature support is the same as that of the original model.

Note

Some models require a specific prompt format to work correctly.

You can find the score template corresponding to each Example HF Model in examples/pooling/score/template/

Examples: examples/pooling/score/using_template_offline.py, examples/pooling/score/using_template_online.py

Note

To load the original BAAI/bge-reranker-v2-gemma checkpoint, use the following command.

vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'

Note

The second-generation GTE model (mGTE-TRM) is named NewForSequenceClassification. Because this name is too generic, you should set --hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}' to select the GteNewForSequenceClassification architecture.

Note

To load the original mxbai-rerank-v2 checkpoint, use the following command.

vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'

Note

To load the original Qwen3 Reranker checkpoint, use the following command. More information can be found in examples/pooling/score/qwen3_reranker_offline.py and examples/pooling/score/qwen3_reranker_online.py.

vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'

Multimodal Models

Note

For more information about multimodal models inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| JinaVLForSequenceClassification | JinaVL-based | T + I^E+ | jinaai/jina-reranker-m0, etc. | ✅︎ | ✅︎ |
| LlamaNemotronVLForSequenceClassification | Llama Nemotron Reranker + SigLIP | T + I^E+ | nvidia/llama-nemotron-rerank-vl-1b-v2 | | |
| Qwen3VLForSequenceClassification | Qwen3-VL-Reranker | T + I^E+ + V^E+ | Qwen/Qwen3-VL-Reranker-2B (see note), etc. | ✅︎ | ✅︎ |

^C Automatically converted into a classification model via --convert classify. (details)
* Feature support is the same as that of the original model.

Note

Similar to Qwen3-Reranker, you need the following --hf_overrides to load the original Qwen3-VL-Reranker checkpoint.

vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'

Late-interaction models

All models that support the token embedding task also support using the score API to compute similarity scores, by calculating the late interaction (MaxSim) between the token embeddings of the two input prompts. See this page for more information about token embedding models.
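
As a concrete illustration of the MaxSim scoring function (a pure-Python sketch over toy per-token vectors, not vLLM internals): for each query token, take the maximum dot product against all document tokens, then sum those maxima over the query tokens.

```python
# Pure-Python sketch of late interaction (MaxSim) over toy per-token
# embeddings. Each prompt is represented as a list of token vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    # For each query token, keep its best match among the document tokens,
    # then sum those maxima over all query tokens.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim(query, doc))  # ≈ 1.7 (0.9 from the first query token, 0.8 from the second)
```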

Text-only Models

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| ColBERTModernBertModel | ModernBERT | lightonai/GTE-ModernColBERT-v1 | | |
| ColBERTJinaRobertaModel | Jina XLM-RoBERTa | jinaai/jina-colbert-v2 | | |
| HF_ColBERT | BERT | answerdotai/answerai-colbert-small-v1, colbert-ir/colbertv2.0 | | |
| *Model^C, *ForCausalLM^C, etc. | Generative models | N/A | * | * |

Multimodal Models

Note

For more information about multimodal models inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| ColModernVBertForRetrieval | ColModernVBERT | T / I | ModernVBERT/colmodernvbert-merged | | |
| ColPaliForRetrieval | ColPali | T / I | vidore/colpali-v1.3-hf | | |
| ColQwen3 | Qwen3-VL | T / I | TomoroAI/tomoro-colqwen3-embed-4b, TomoroAI/tomoro-colqwen3-embed-8b | | |
| ColQwen3_5 | ColQwen3.5 | T + I + V | athrael-soju/colqwen3.5-4.5B-v3 | | |
| OpsColQwen3Model | Qwen3-VL | T / I | OpenSearch-AI/Ops-Colqwen3-4B, OpenSearch-AI/Ops-Colqwen3-8B | | |
| Qwen3VLNemotronEmbedModel | Qwen3-VL | T / I | nvidia/nemotron-colembed-vl-4b-v2, nvidia/nemotron-colembed-vl-8b-v2 | ✅︎ | ✅︎ |
| *ForConditionalGeneration^C, *ForCausalLM^C, etc. | Generative models | * | N/A | * | * |

^C Automatically converted into an embedding model via --convert embed. (details)
* Feature support is the same as that of the original model.

If your model is not in the above list, we will try to automatically convert the model using as_embedding_model.

Bi-encoder

All models that support the embedding task also support using the score API to compute similarity scores, by calculating the cosine similarity between the two input prompts' embeddings. See this page for more information about embedding models.
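
As a concrete illustration of the scoring function (a pure-Python sketch over toy vectors, not vLLM internals): each prompt is embedded separately, and the score is the cosine similarity of the two embeddings.

```python
import math

# Pure-Python sketch of bi-encoder scoring: each prompt is embedded
# separately (hard-coded toy vectors here), then the score is the cosine
# similarity of the two embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_embedding = [0.6, 0.8]
document_embedding = [0.8, 0.6]
print(cosine(query_embedding, document_embedding))  # ≈ 0.96
```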

Text-only Models

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| BertModel | BERT-based | BAAI/bge-base-en-v1.5, Snowflake/snowflake-arctic-embed-xs, etc. | | |
| BertSpladeSparseEmbeddingModel | SPLADE | naver/splade-v3 | | |
| ErnieModel | BERT-like Chinese ERNIE | shibing624/text2vec-base-chinese-sentence | | |
| Gemma2Model^C | Gemma 2-based | BAAI/bge-multilingual-gemma2, etc. | ✅︎ | ✅︎ |
| Gemma3TextModel^C | Gemma 3-based | google/embeddinggemma-300m, etc. | ✅︎ | ✅︎ |
| GritLM | GritLM | parasail-ai/GritLM-7B-vllm | ✅︎ | ✅︎ |
| GteModel | Arctic-Embed-2.0-M | Snowflake/snowflake-arctic-embed-m-v2.0 | | |
| GteNewModel | mGTE-TRM (see note) | Alibaba-NLP/gte-multilingual-base, etc. | | |
| LlamaBidirectionalModel^C | Llama-based with bidirectional attention | nvidia/llama-nemotron-embed-1b-v2, etc. | ✅︎ | ✅︎ |
| LlamaModel^C, LlamaForCausalLM^C, MistralModel^C, etc. | Llama-based | intfloat/e5-mistral-7b-instruct, etc. | ✅︎ | ✅︎ |
| ModernBertModel | ModernBERT-based | Alibaba-NLP/gte-modernbert-base, etc. | | |
| NomicBertModel | Nomic BERT | nomic-ai/nomic-embed-text-v1, nomic-ai/nomic-embed-text-v2-moe, Snowflake/snowflake-arctic-embed-m-long, etc. | | |
| Qwen2Model^C, Qwen2ForCausalLM^C | Qwen2-based | ssmits/Qwen2-7B-Instruct-embed-base (see note), Alibaba-NLP/gte-Qwen2-7B-instruct (see note), etc. | ✅︎ | ✅︎ |
| Qwen3Model^C, Qwen3ForCausalLM^C | Qwen3-based | Qwen/Qwen3-Embedding-0.6B, etc. | ✅︎ | ✅︎ |
| RobertaModel, RobertaForMaskedLM | RoBERTa-based | sentence-transformers/all-roberta-large-v1, etc. | | |
| VoyageQwen3BidirectionalEmbedModel^C | Voyage Qwen3-based with bidirectional attention | voyageai/voyage-4-nano, etc. | ✅︎ | ✅︎ |
| XLMRobertaModel | XLM-RoBERTa-based | BAAI/bge-m3 (see note), intfloat/multilingual-e5-base, jinaai/jina-embeddings-v3 (see note), etc. | | |
| *Model^C, *ForCausalLM^C, etc. | Generative models | N/A | * | * |

Note

The second-generation GTE model (mGTE-TRM) is named NewModel. Because this name is too generic, you should set --hf-overrides '{"architectures": ["GteNewModel"]}' to select the GteNewModel architecture.

Note

ssmits/Qwen2-7B-Instruct-embed-base has an improperly defined Sentence Transformers config. You need to manually set mean pooling by passing --pooler-config '{"pooling_type": "MEAN"}'.

Note

For Alibaba-NLP/gte-Qwen2-*, you need to enable --trust-remote-code for the correct tokenizer to be loaded. See relevant issue on HF Transformers.

Note

The BAAI/bge-m3 model comes with extra weights for sparse and colbert embeddings; see this page for more information.

Note

jinaai/jina-embeddings-v3 supports multiple tasks through LoRA; vLLM currently supports only the text-matching task, by merging the corresponding LoRA weights.

Multimodal Models

Note

For more information about multimodal models inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| CLIPModel | CLIP | T / I | openai/clip-vit-base-patch32, openai/clip-vit-large-patch14, etc. | | |
| LlamaNemotronVLModel | Llama Nemotron Embedding + SigLIP | T + I | nvidia/llama-nemotron-embed-vl-1b-v2 | | |
| LlavaNextForConditionalGeneration^C | LLaVA-NeXT-based | T / I | royokong/e5-v | ✅︎ | |
| Phi3VForCausalLM^C | Phi-3-Vision-based | T + I | TIGER-Lab/VLM2Vec-Full | ✅︎ | |
| Qwen3VLForConditionalGeneration^C | Qwen3-VL | T + I + V | Qwen/Qwen3-VL-Embedding-2B, etc. | ✅︎ | ✅︎ |
| SiglipModel | SigLIP, SigLIP2 | T / I | google/siglip-base-patch16-224, google/siglip2-base-patch16-224 | | |
| *ForConditionalGeneration^C, *ForCausalLM^C, etc. | Generative models | * | N/A | * | * |

^C Automatically converted into an embedding model via --convert embed. (details)
* Feature support is the same as that of the original model.

If your model is not in the above list, we will try to automatically convert the model using as_embedding_model. By default, the embeddings of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

Note

Although vLLM supports automatically converting models of any architecture into embedding models via --convert embed, to get the best results, you should use pooling models that are specifically trained as such.

Offline Inference

Pooling Parameters

The following pooling parameters are only supported by cross-encoder models and do not work for late-interaction and bi-encoder models.

    use_activation: bool | None = None
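
For single-label cross-encoders the activation is typically a sigmoid that maps the raw logit into (0, 1). A pure-Python sketch of what enabling or disabling it changes (illustrative, not vLLM internals):

```python
import math

# Sketch of what use_activation changes for a single-label cross-encoder:
# the model produces one raw logit per pair; with activation enabled
# (the default for most such models) the pooler passes it through a
# sigmoid so the score lands in (0, 1).

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

raw_logit = 2.0  # hypothetical pooler output for one query-document pair
print(raw_logit)           # use_activation=False -> the raw logit, 2.0
print(sigmoid(raw_logit))  # use_activation=True  -> ≈ 0.88
```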

LLM.score

The score method outputs similarity scores between sentence pairs.

from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")

A code example can be found here: examples/basic/offline_inference/score.py

Online Serving

Score API

Our Score API (/score) is similar to LLM.score: it computes similarity scores between two input prompts.

Parameters

The following Score API parameters are supported:

    model: str | None = None
    user: str | None = None
    truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
    truncation_side: Literal["left", "right"] | None = Field(
        default=None,
        description=(
            "Which side to truncate from when truncate_prompt_tokens is active. "
            "'right' keeps the first N tokens. "
            "'left' keeps the last N tokens."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    priority: int = Field(
        default=0,
        ge=-(2**63),
        le=2**63 - 1,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description="Additional kwargs to pass to the HF processor.",
    )
    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )
    use_activation: bool | None = Field(
        default=None,
        description="Whether to use activation for the pooler outputs. "
        "`None` uses the pooler's default, which is `True` in most cases.",
    )

Examples

Single inference

You can pass a string to both queries and documents, forming a single sentence pair.

Request
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": "What is the capital of France?",
  "documents": "The capital of France is Paris."
}'
Response
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
Batch inference

You can pass a string to queries and a list to documents, forming multiple sentence pairs: each string in documents is paired with the single queries string. The total number of pairs is len(documents).

Request
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "queries": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
Response
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

You can pass a list to both queries and documents, forming multiple sentence pairs element-wise: each string in queries is paired with the string at the same position in documents (similar to zip()). The total number of pairs is len(documents).

Request
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
Response
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
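
The pairing rules used in the examples above can be sketched in pure Python (illustrative only; the actual validation happens server-side):

```python
# Sketch of the query-document pairing rules: a string query is broadcast
# against every document, while two lists are paired element-wise like zip().

def make_pairs(queries, documents):
    if isinstance(queries, str):
        return [(queries, doc) for doc in documents]
    return list(zip(queries, documents))

docs = ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]
print(make_pairs("What is the capital of France?", docs))  # 2 pairs, same query
print(make_pairs(["Q about Brazil?", "Q about France?"], docs))  # element-wise pairs
```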
Multi-modal inputs

You can pass multi-modal inputs to scoring models by including a content list of multi-modal items (image, etc.) in the request. Refer to the examples below for illustration.

To serve the model:

vllm serve jinaai/jina-reranker-m0

Since this request schema is not defined by the OpenAI client, we post the request to the server using the lower-level requests library:

Code
import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "queries": "slm markdown",
        "documents": [
            {
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                        },
                    }
                ],
            },
            {
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                        },
                    }
                ]
            },
        ],
    },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])

Full example:

Rerank API

/rerank, /v1/rerank, and /v2/rerank APIs are compatible with both Jina AI's rerank API interface and Cohere's rerank API interface to ensure compatibility with popular open-source tools.

Code example: examples/pooling/score/rerank_api_online.py

Parameters

The following Rerank API parameters are supported:

    model: str | None = None
    user: str | None = None
    truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
    truncation_side: Literal["left", "right"] | None = Field(
        default=None,
        description=(
            "Which side to truncate from when truncate_prompt_tokens is active. "
            "'right' keeps the first N tokens. "
            "'left' keeps the last N tokens."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    priority: int = Field(
        default=0,
        ge=-(2**63),
        le=2**63 - 1,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description="Additional kwargs to pass to the HF processor.",
    )
    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )
    use_activation: bool | None = Field(
        default=None,
        description="Whether to use activation for the pooler outputs. "
        "`None` uses the pooler's default, which is `True` in most cases.",
    )

Examples

Note that the top_n request parameter is optional and will default to the length of the documents field. Result documents will be sorted by relevance, and the index property can be used to determine original order.
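
The sorting and top_n behavior can be sketched as follows (illustrative, not vLLM internals):

```python
# Sketch of the rerank response shape: results are sorted by
# relevance_score descending, truncated to top_n, and each entry keeps
# its original index so callers can recover the input order.

def rerank_results(scores, top_n=None):
    results = sorted(
        ({"index": i, "relevance_score": s} for i, s in enumerate(scores)),
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    return results if top_n is None else results[:top_n]

print(rerank_results([0.0006, 0.9985, 0.0100], top_n=2))
# [{'index': 1, 'relevance_score': 0.9985}, {'index': 2, 'relevance_score': 0.01}]
```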

Request
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'
Response
{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}

More examples

More examples can be found here: examples/pooling/score

Supported Features

As cross-encoder models are a subset of classification models that accept two prompts as input and have num_labels equal to 1, their feature support is consistent with that of (sequence) classification models. For more information, see this page.

Score Template

Score templates are supported for cross-encoder models only. If you are using an embedding model for scoring, vLLM does not apply a score template.

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the --chat-template parameter (see Chat Template).

Like chat templates, the score template receives a messages list. For scoring, each message has a role attribute of either "query" or "document". For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's selectattr filter:

  • Query: {{ (messages | selectattr("role", "eq", "query") | first).content }}
  • Document: {{ (messages | selectattr("role", "eq", "document") | first).content }}

This approach is more robust than index-based access (messages[0], messages[1]) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to messages in the future.

Example template file: examples/pooling/score/template/nemotron-rerank.jinja
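
A minimal score template using this pattern might look like the following (an illustrative sketch, not one of the templates bundled with vLLM):

```jinja
Query: {{ (messages | selectattr("role", "eq", "query") | first).content }}
Document: {{ (messages | selectattr("role", "eq", "document") | first).content }}
```

The exact surrounding text (instructions, special tokens, answer prefixes) depends on the model's training format; see the bundled templates in examples/pooling/score/template/ for real examples.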

Enable/disable activation

You can enable or disable activation via use_activation; this parameter only works for cross-encoder models.