Token Classification Usages

Summary

  • Model Usage: token classification
  • Pooling Tasks: token_classify
  • Offline APIs:
    • LLM.encode(..., pooling_task="token_classify")
  • Online APIs:
    • Pooling API (/pooling)

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.

Many classification models support both (sequence) classification and token classification. For further details on (sequence) classification, please refer to this page.

Typical Use Cases

Named Entity Recognition (NER)

For implementation examples, see:

Offline: examples/pooling/token_classify/ner_offline.py

Online: examples/pooling/token_classify/ner_online.py

Sparse retrieval (lexical matching)

The BAAI/bge-m3 model leverages token classification for sparse retrieval. For more information, see this page.
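In sparse (lexical) retrieval, the per-token outputs act as lexical weights: each token in a text receives a scalar weight, and query/document relevance is the sum of weight products over shared tokens. A toy sketch of that scoring step, with made-up token IDs and weights (not actual BGE-M3 outputs):

```python
# Toy sparse vectors (token_id -> weight), as a model like BAAI/bge-m3 might
# produce from its token-level head. All IDs and weights here are made up.
query_weights = {101: 0.8, 2054: 0.5, 2003: 0.3}
doc_weights = {101: 0.6, 2003: 0.7, 9999: 0.9}

# Lexical matching score: sum of products over token IDs present in both.
score = sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)
print(round(score, 2))  # 0.8*0.6 + 0.3*0.7 = 0.69
```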

Supported Models

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `BertForTokenClassification` | BERT-based | `boltuix/NeuroBERT-NER` (see note), etc. | | |
| `ErnieForTokenClassification` | BERT-like Chinese ERNIE | `gyr66/Ernie-3.0-base-chinese-finetuned-ner` | | |
| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
| `Qwen3ForTokenClassification`^C | Qwen3-based | `bd2lcco/Qwen3-0.6B-finetuned` | | |
| `*Model`^C, `*ForCausalLM`^C, etc. | Generative models | N/A | * | * |

^C Automatically converted into a classification model via `--convert classify`. (details)
* Feature support is the same as that of the original model.

If your model is not in the list above, vLLM will try to convert it automatically using `as_seq_cls_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

As Reward Models

Token classification models can also be used as reward models. For details on reward models, see Reward Models.

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`^C, `*ForCausalLM`^C, etc. | Generative models | N/A | * | * |

^C Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the list above, vLLM will try to convert it automatically using `as_seq_cls_model`.

Offline Inference

Pooling Parameters

The following pooling parameters are supported:

```python
use_activation: bool | None = None
```
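`use_activation` controls whether an activation function is applied to the raw per-token logits before they are returned. Which activation is applied depends on the model head; purely as an illustration, a softmax over a made-up logit vector:

```python
import math

# Made-up per-token logits over 3 labels (illustration only).
logits = [2.0, 1.0, 0.5]

# With activation enabled, the returned scores are activated, e.g. softmaxed
# over the label dimension; with use_activation=False you get raw logits.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

print([round(p, 3) for p in probs])
```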

LLM.encode

The encode method is available to all pooling models in vLLM.

Set `pooling_task="token_classify"` when using `LLM.encode` for token classification models:

```python
from vllm import LLM

llm = LLM(model="boltuix/NeuroBERT-NER", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
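The returned data is a per-token score matrix: one row per token, one column per label. A sketch of decoding it into NER tags, using made-up probabilities and a made-up `id2label` map (in practice the matrix comes from `output.outputs.data` and the label map from the model's HF config):

```python
# Made-up per-token class probabilities: 5 tokens x 3 labels.
# In practice this is the `output.outputs.data` matrix from LLM.encode.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
    [0.85, 0.10, 0.05],
    [0.05, 0.15, 0.80],
]

# Label map as stored in the HF config (`config.id2label`); values made up.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

# Pick the highest-scoring label for each token.
labels = [id2label[max(range(len(row)), key=row.__getitem__)] for row in probs]
print(labels)  # ['O', 'B-PER', 'I-PER', 'O', 'I-PER']
```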

Online Serving

Please refer to the pooling API and set `"task": "token_classify"` in the request.
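As a sketch, the request body might look like the following (the model name and host are placeholders for your deployment); send it with any HTTP client to a running vLLM server:

```python
import json

# Request body for the /pooling endpoint; "task" selects token classification.
# The model name here is a placeholder for whatever model you serve.
payload = {
    "model": "boltuix/NeuroBERT-NER",
    "input": "Hello, my name is",
    "task": "token_classify",
}
print(json.dumps(payload))

# To send it against a running server (not executed here):
#   import requests
#   requests.post("http://localhost:8000/pooling", json=payload).json()
```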

More examples

More examples can be found here: examples/pooling/token_classify

Supported Features

Token classification features should be consistent with (sequence) classification. For more information, see this page.