Token Embedding Usages

Summary

  • Model Usage: Token classification models
  • Pooling Tasks: token_embed
  • Offline APIs:
    • LLM.encode(..., pooling_task="token_embed")
  • Online APIs:
    • Pooling API (/pooling)

The difference between the (sequence) embedding task and the token embedding task is that (sequence) embedding outputs one embedding per sequence, while token embedding outputs one embedding per token.

Many embedding models support both (sequence) embedding and token embedding. For further details on (sequence) embedding, please refer to this page.

Typical Use Cases

Multi-Vector Retrieval

For implementation examples, see:

Offline: examples/pooling/token_embed/multi_vector_retrieval_offline.py

Online: examples/pooling/token_embed/multi_vector_retrieval_online.py

Late interaction

Similarity scores can be computed using late interaction between two input prompts via the score API. For more information, see Score API.
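Late interaction (ColBERT-style MaxSim) scores two prompts by comparing their token embeddings directly: each query token is matched against its most similar document token, and the per-token maxima are summed. A minimal sketch with NumPy, using hypothetical hand-made, L2-normalized toy vectors in place of real token embeddings:

```python
import numpy as np

def maxsim_score(q: np.ndarray, d: np.ndarray) -> float:
    """ColBERT-style MaxSim late-interaction score.

    q: (num_query_tokens, dim) L2-normalized query token embeddings
    d: (num_doc_tokens, dim)   L2-normalized document token embeddings

    For each query token, take its maximum cosine similarity over all
    document tokens, then sum over the query tokens.
    """
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy 2-D example: each row is already unit-length.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])
print(maxsim_score(q, d))  # 1.0 + 0.8 = 1.8
```

The score API performs this kind of computation for you from the two prompts' token embeddings; the sketch above only illustrates the underlying operation.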

Extract last hidden states

Models of any architecture can be converted into embedding models using --convert embed. Token embedding can then be used to extract the last hidden states from these models.
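As a sketch of this workflow (the model name is only an illustrative placeholder; `convert="embed"` is assumed to be the offline equivalent of the `--convert embed` CLI flag):

```python
from vllm import LLM

# Sketch: wrap a generative model as an embedding model, then use the
# token_embed pooling task to read out per-token last hidden states.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder generative model
    runner="pooling",
    convert="embed",
)
(output,) = llm.encode("Hello, my name is", pooling_task="token_embed")

# One vector per input token: shape (num_tokens, hidden_size).
hidden_states = output.outputs.data
print(hidden_states.shape)
```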

Supported Models

Text-only Models

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| ColBERTModernBertModel | ModernBERT | lightonai/GTE-ModernColBERT-v1 | | |
| ColBERTJinaRobertaModel | Jina XLM-RoBERTa | jinaai/jina-colbert-v2 | | |
| HF_ColBERT | BERT | answerdotai/answerai-colbert-small-v1, colbert-ir/colbertv2.0 | | |
| *ModelC, *ForCausalLMC, etc. | Generative models | N/A | * | * |

Multimodal Models

Note

For more information about multimodal model inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| ColModernVBertForRetrieval | ColModernVBERT | T / I | ModernVBERT/colmodernvbert-merged | | |
| ColPaliForRetrieval | ColPali | T / I | vidore/colpali-v1.3-hf | | |
| ColQwen3 | Qwen3-VL | T / I | TomoroAI/tomoro-colqwen3-embed-4b, TomoroAI/tomoro-colqwen3-embed-8b | | |
| ColQwen3_5 | ColQwen3.5 | T + I + V | athrael-soju/colqwen3.5-4.5B-v3 | | |
| OpsColQwen3Model | Qwen3-VL | T / I | OpenSearch-AI/Ops-Colqwen3-4B, OpenSearch-AI/Ops-Colqwen3-8B | | |
| Qwen3VLNemotronEmbedModel | Qwen3-VL | T / I | nvidia/nemotron-colembed-vl-4b-v2, nvidia/nemotron-colembed-vl-8b-v2 | ✅︎ | ✅︎ |
| *ForConditionalGenerationC, *ForCausalLMC, etc. | Generative models | * | N/A | * | * |

C Automatically converted into an embedding model via --convert embed. (details)
* Feature support is the same as that of the original model.

If your model is not in the above list, vLLM will try to convert it automatically using as_embedding_model.

Offline Inference

Pooling Parameters

The following pooling parameters are supported.

    use_activation: bool | None = None
    dimensions: int | None = None
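These can be passed via PoolingParams when calling LLM.encode. A minimal sketch; note that a non-None dimensions value assumes the model supports reduced-dimension (Matryoshka-style) outputs, which not all models do:

```python
from vllm import LLM, PoolingParams

llm = LLM(model="answerdotai/answerai-colbert-small-v1", runner="pooling")

# Both parameters default to None, which keeps the model's own behavior;
# e.g. dimensions=64 would request 64-dimensional token embeddings if
# the model supports it.
(output,) = llm.encode(
    "Hello, my name is",
    pooling_task="token_embed",
    pooling_params=PoolingParams(use_activation=None, dimensions=None),
)
print(output.outputs.data.shape)
```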

LLM.encode

The encode method is available to all pooling models in vLLM.

Set pooling_task="token_embed" when using LLM.encode with token embedding models:

from vllm import LLM

llm = LLM(model="answerdotai/answerai-colbert-small-v1", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="token_embed")

data = output.outputs.data
print(f"Data: {data!r}")

LLM.score

The score method outputs similarity scores between sentence pairs.

All models that support the token embedding task also support the score API, which computes similarity scores via the late interaction of two input prompts.

from vllm import LLM

llm = LLM(model="answerdotai/answerai-colbert-small-v1", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")

Online Serving

Please refer to the pooling API and use "task":"token_embed".
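A sketch of a request against a locally running server (assuming the server was started with something like `vllm serve answerdotai/answerai-colbert-small-v1 --runner pooling`; the exact response layout follows the pooling API):

```python
import requests

# Request per-token embeddings from the /pooling endpoint.
response = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "answerdotai/answerai-colbert-small-v1",
        "input": "Hello, my name is",
        "task": "token_embed",
    },
)
response.raise_for_status()

# Each entry in "data" corresponds to one input prompt; its payload
# holds one embedding per token of that prompt.
print(response.json()["data"][0])
```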

More examples

More examples can be found here: examples/pooling/token_embed

Supported Features

Token embedding features should be consistent with (sequence) embedding. For more information, see this page.