Specific Model Examples¶
ColBERT Late Interaction Models¶
ColBERT (Contextualized Late Interaction over BERT) is a retrieval model that uses per-token embeddings and MaxSim scoring for document ranking. Unlike single-vector embedding models, ColBERT retains token-level representations and computes relevance through late interaction, giving higher retrieval accuracy than single-vector models while remaining much cheaper than cross-encoders.
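The MaxSim scoring rule can be sketched in a few lines of NumPy: for each query token embedding, take the maximum dot product against all document token embeddings, then sum over query tokens. This is a minimal illustration with toy one-hot vectors, assuming embeddings are already L2-normalized (as ColBERT's are); the shapes are illustrative, not tied to any particular checkpoint.

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the best-matching
    document token (max dot product), then sum over query tokens."""
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

# Toy example: 2 query tokens and 3 document tokens in a 4-d space
q = np.eye(4)[:2]         # two one-hot "query token" vectors
d = np.eye(4)[[0, 1, 2]]  # three one-hot "document token" vectors
print(maxsim(q, d))       # each query token finds an exact match -> 2.0
```

Because scoring reduces to a matrix product plus a row-wise max, document token embeddings can be precomputed and indexed offline, which is where the efficiency advantage over cross-encoders comes from.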
vLLM supports ColBERT models with multiple encoder backbones:
| Architecture | Backbone | Example HF Models |
|---|---|---|
| HF_ColBERT | BERT | answerdotai/answerai-colbert-small-v1, colbert-ir/colbertv2.0 |
| ColBERTModernBertModel | ModernBERT | lightonai/GTE-ModernColBERT-v1 |
| ColBERTJinaRobertaModel | Jina XLM-RoBERTa | jinaai/jina-colbert-v2 |
BERT-based ColBERT models work out of the box:
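For example, a plain serve invocation should suffice for these checkpoints (no architecture override needed):

```shell
vllm serve answerdotai/answerai-colbert-small-v1
```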
For non-BERT backbones, use --hf-overrides to set the correct architecture:
# ModernBERT backbone
vllm serve lightonai/GTE-ModernColBERT-v1 \
--hf-overrides '{"architectures": ["ColBERTModernBertModel"]}'
# Jina XLM-RoBERTa backbone
vllm serve jinaai/jina-colbert-v2 \
--hf-overrides '{"architectures": ["ColBERTJinaRobertaModel"]}' \
--trust-remote-code
Then you can use the rerank API:
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "answerdotai/answerai-colbert-small-v1",
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of artificial intelligence.",
"Python is a programming language.",
"Deep learning uses neural networks."
]
}'
Or the score API:
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "answerdotai/answerai-colbert-small-v1",
"text_1": "What is machine learning?",
"text_2": ["Machine learning is a subset of AI.", "The weather is sunny."]
}'
You can also get the raw token embeddings using the pooling API with the token_embed task:
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
"model": "answerdotai/answerai-colbert-small-v1",
"input": "What is machine learning?",
"task": "token_embed"
}'
An example can be found here: examples/pooling/score/colbert_rerank_online.py
ColQwen3 Multi-Modal Late Interaction Models¶
ColQwen3 is based on ColPali, which extends ColBERT's late interaction approach to multi-modal inputs. While ColBERT operates on text-only token embeddings, ColPali/ColQwen3 can embed both text and images (e.g. PDF pages, screenshots, diagrams) into per-token L2-normalized vectors and compute relevance via MaxSim scoring. ColQwen3 specifically uses Qwen3-VL as its vision-language backbone.
| Architecture | Backbone | Example HF Models |
|---|---|---|
| ColQwen3 | Qwen3-VL | TomoroAI/tomoro-colqwen3-embed-4b, TomoroAI/tomoro-colqwen3-embed-8b |
| OpsColQwen3Model | Qwen3-VL | OpenSearch-AI/Ops-Colqwen3-4B, OpenSearch-AI/Ops-Colqwen3-8B |
| Qwen3VLNemotronEmbedModel | Qwen3-VL | nvidia/nemotron-colembed-vl-4b-v2, nvidia/nemotron-colembed-vl-8b-v2 |
Start the server:
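A minimal launch command is sketched below; exact flags may vary by checkpoint, and --trust-remote-code may be needed if the model ships custom code:

```shell
vllm serve TomoroAI/tomoro-colqwen3-embed-4b
```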
Text-only scoring and reranking¶
Use the /rerank API:
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of artificial intelligence.",
"Python is a programming language.",
"Deep learning uses neural networks."
]
}'
Or the /score API:
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"text_1": "What is the capital of France?",
"text_2": ["The capital of France is Paris.", "Python is a programming language."]
}'
Multi-modal scoring and reranking (text query × image documents)¶
The /score and /rerank APIs also accept multi-modal inputs directly. Pass image documents via the data_1/data_2 fields (for /score) or the documents field (for /rerank), where each entry carries a content list of image_url and text parts in the same format used by the OpenAI chat completion API:
Score a text query against image documents:
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"data_1": "Retrieve the city of Beijing",
"data_2": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
{"type": "text", "text": "Describe the image."}
]
}
]
}'
Rerank image documents by a text query:
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"query": "Retrieve the city of Beijing",
"documents": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_1>"}},
{"type": "text", "text": "Describe the image."}
]
},
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_2>"}},
{"type": "text", "text": "Describe the image."}
]
}
],
"top_n": 2
}'
Raw token embeddings¶
You can also get the raw token embeddings using the /pooling API with the token_embed task:
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"input": "What is machine learning?",
"task": "token_embed"
}'
For image inputs via the pooling API, use the chat-style messages field:
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
"model": "TomoroAI/tomoro-colqwen3-embed-4b",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
{"type": "text", "text": "Describe the image."}
]
}
]
}'
Examples¶
- Multi-vector retrieval: examples/pooling/token_embed/colqwen3_token_embed_online.py
- Reranking (text + multi-modal): examples/pooling/score/colqwen3_rerank_online.py
ColQwen3.5 Multi-Modal Late Interaction Models¶
ColQwen3.5 is based on ColPali, extending ColBERT's late interaction approach to multi-modal inputs. It uses the Qwen3.5 hybrid backbone (linear + full attention) and produces per-token L2-normalized vectors for MaxSim scoring.
| Architecture | Backbone | Example HF Models |
|---|---|---|
| ColQwen3_5 | Qwen3.5 | athrael-soju/colqwen3.5-4.5B |
Start the server:
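A minimal launch command is sketched below; add further flags (e.g. --trust-remote-code) if the checkpoint requires them:

```shell
vllm serve athrael-soju/colqwen3.5-4.5B
```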
Then you can use the rerank endpoint:
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "athrael-soju/colqwen3.5-4.5B",
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of artificial intelligence.",
"Python is a programming language.",
"Deep learning uses neural networks."
]
}'
Or the score endpoint:
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "athrael-soju/colqwen3.5-4.5B",
"text_1": "What is the capital of France?",
"text_2": ["The capital of France is Paris.", "Python is a programming language."]
}'
An example can be found here: examples/pooling/score/colqwen3_5_rerank_online.py
Llama Nemotron Multimodal¶
Embedding Model¶
Llama Nemotron VL Embedding models combine the bidirectional Llama embedding backbone (from nvidia/llama-nemotron-embed-1b-v2) with SigLIP as the vision encoder to produce single-vector embeddings from text and/or images.
| Architecture | Backbone | Example HF Models |
|---|---|---|
| LlamaNemotronVLModel | Bidirectional Llama + SigLIP | nvidia/llama-nemotron-embed-vl-1b-v2 |
Start the server:
vllm serve nvidia/llama-nemotron-embed-vl-1b-v2 \
--trust-remote-code \
--chat-template examples/pooling/embed/template/nemotron_embed_vl.jinja
Note
The chat template bundled with this model's tokenizer is not suitable for the embeddings API. Use the provided override template above when serving with the messages-based (chat-style) embeddings API.
The override template uses the message role to automatically prepend the appropriate prefix: set role to "query" for queries (prepends query:) or "document" for passages (prepends passage:). Any other role omits the prefix.
Embed text queries:
curl -s http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-embed-vl-1b-v2",
"messages": [
{
"role": "query",
"content": [
{"type": "text", "text": "What is machine learning?"}
]
}
]
}'
Embed images via the chat-style messages field:
curl -s http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-embed-vl-1b-v2",
"messages": [
{
"role": "document",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
{"type": "text", "text": "Describe the image."}
]
}
]
}'
Reranker Model¶
Llama Nemotron VL reranker models combine the same bidirectional Llama + SigLIP backbone with a sequence-classification head for cross-encoder scoring and reranking.
| Architecture | Backbone | Example HF Models |
|---|---|---|
| LlamaNemotronVLForSequenceClassification | Bidirectional Llama + SigLIP | nvidia/llama-nemotron-rerank-vl-1b-v2 |
Start the server:
vllm serve nvidia/llama-nemotron-rerank-vl-1b-v2 \
--runner pooling \
--trust-remote-code \
--chat-template examples/pooling/score/template/nemotron-vl-rerank.jinja
Note
The chat template bundled with this checkpoint's tokenizer is not suitable for the Score/Rerank APIs. Use the provided override template when serving: examples/pooling/score/template/nemotron-vl-rerank.jinja.
Score a text query against an image document:
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
"data_1": "Find diagrams about autonomous robots",
"data_2": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
{"type": "text", "text": "Robotics workflow diagram."}
]
}
]
}'
Rerank image documents by a text query:
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
"query": "Find diagrams about autonomous robots",
"documents": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_1>"}},
{"type": "text", "text": "Robotics workflow diagram."}
]
},
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_2>"}},
{"type": "text", "text": "General skyline photo."}
]
}
],
"top_n": 2
}'
BAAI/bge-m3¶
The BAAI/bge-m3 model ships extra weights for sparse and ColBERT embeddings, but its config.json declares the architecture as XLMRobertaModel, so vLLM loads it as a vanilla RoBERTa model without those extra weights. To load the full model weights, override its architecture like this:
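A sketch of the override, reusing the --hf-overrides flag shown earlier. The architecture name BgeM3Model here is an assumption for illustration; check vLLM's model registry for the exact class that loads the sparse/ColBERT heads:

```shell
# "BgeM3Model" is a hypothetical name -- consult vLLM's registered pooling
# models for the exact architecture string before using this in production
vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3Model"]}'
```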
Then you obtain the sparse embeddings like this:
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
"model": "BAAI/bge-m3",
"task": "token_classify",
"input": ["What is BGE M3?", "Definition of BM25"]
}'
Due to limitations in the output schema, the response contains only a per-token score list for each input, without the tokens themselves. To pair tokens with their scores, call /tokenize on the same input as well. Refer to the tests in tests/models/language/pooling/test_bge_m3.py to see how to do that.
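Once both responses are in hand, pairing is a simple element-wise zip. A minimal sketch, assuming you have already fetched a parallel token list from /tokenize and a score list from /pooling for the same input (the values below are illustrative, not real model output):

```python
def pair_token_scores(tokens: list[str], scores: list[float]) -> list[tuple[str, float]]:
    """Align each token with its sparse-embedding score (lists must be parallel)."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return list(zip(tokens, scores))

# Illustrative values -- real tokens/scores come from /tokenize and /pooling
print(pair_token_scores(["What", "is", "BGE", "M3", "?"], [0.1, 0.0, 0.8, 0.7, 0.0]))
```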
You can obtain the ColBERT embeddings like this: