vllm.v1.worker.mamba_utils ¶
postprocess_mamba ¶
postprocess_mamba(
scheduler_output: SchedulerOutput,
kv_cache_config: KVCacheConfig,
input_batch: GPUInputBatch,
requests: dict[str, CachedRequestState],
mamba_state_idx: dict[str, int],
forward_context: dict[str, Any],
mamba_state_copy_funcs: tuple[MambaStateCopyFunc, ...],
copy_bufs: MambaCopyBuffers,
)
If a block is converted from a partial block to a full block in this step, copy the state from the running-state block to the new full block.
Source code in vllm/v1/worker/mamba_utils.py
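The partial-to-full promotion can be illustrated with a minimal sketch. All names here (`states`, `running_slot`, `newly_full_slots`) are hypothetical stand-ins, not vLLM's actual data structures, which operate on GPU state tensors via the copy functions and buffers passed in above.

```python
def postprocess_state_copy(
    states: list[list[float]],
    running_slot: int,
    newly_full_slots: list[int],
) -> list[list[float]]:
    """Simplified model of the postprocess step: for every block slot
    that became full this step, copy the running state into it."""
    for dst in newly_full_slots:
        # Copy rather than alias, so the running state can keep evolving
        # without mutating the frozen full block.
        states[dst] = list(states[running_slot])
    return states


# Slot 2 holds the running state; slot 0 just became a full block.
states = [[0.0], [0.0], [7.0]]
postprocess_state_copy(states, running_slot=2, newly_full_slots=[0])
```

In the real implementation the copies are batched per layer through `mamba_state_copy_funcs` and staged in `copy_bufs`; the sketch only shows the slot-to-slot direction of the copy.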
preprocess_mamba ¶
preprocess_mamba(
scheduler_output: SchedulerOutput,
kv_cache_config: KVCacheConfig,
cache_config: CacheConfig,
mamba_state_idx: dict[str, int],
input_batch: GPUInputBatch,
requests: dict[str, CachedRequestState],
forward_context: dict[str, Any],
mamba_state_copy_funcs: tuple[MambaStateCopyFunc, ...],
copy_bufs: MambaCopyBuffers,
)
Copy the mamba state from the previous step to the last (1 + num_speculative_blocks) blocks.
Source code in vllm/v1/worker/mamba_utils.py
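A minimal sketch of that copy direction, assuming a flat list of state slots where the final `1 + num_speculative_blocks` slots are the ones written this step; the names (`preprocess_state_copy`, `prev_slot`) are illustrative and not part of vLLM's API.

```python
def preprocess_state_copy(
    states: list[list[float]],
    prev_slot: int,
    num_speculative_blocks: int,
) -> list[list[float]]:
    """Simplified model of the preprocess step: replicate the previous
    step's state into the last (1 + num_speculative_blocks) slots so
    this step (including speculative tokens) starts from it."""
    n = 1 + num_speculative_blocks
    for dst in range(len(states) - n, len(states)):
        states[dst] = list(states[prev_slot])  # copy, don't alias
    return states


# Slot 0 holds last step's state; with one speculative block,
# the final two slots receive a copy of it.
states = [[1.0], [0.0], [0.0], [0.0]]
preprocess_state_copy(states, prev_slot=0, num_speculative_blocks=1)
```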
update_accepted_tokens_for_prefill_as_decode ¶
update_accepted_tokens_for_prefill_as_decode(
input_batch: GPUInputBatch,
prefill_as_decode_num_tokens: CpuGpuBuffer,
num_accepted_tokens_gpu: Tensor,
scheduler_output: SchedulerOutput,
decode_qlen_threshold: int | None,
num_reqs: int,
)
Adjusts num_accepted_tokens for prefill chunks processed via the decode path. This ensures subsequent iterations read from the correct sequential state slot instead of the default prefill slot 0. Not used by GDN attention, which manually separates short prefills and short decodes when building the attention metadata.
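The adjustment can be sketched as follows. This is a hypothetical, CPU-only illustration: the real function works on a GPU tensor (`num_accepted_tokens_gpu`) and derives the per-request flags from the scheduler output and `decode_qlen_threshold`, whereas here both inputs are plain lists.

```python
def adjust_accepted_tokens(
    num_accepted_tokens: list[int],
    is_prefill_as_decode: list[bool],
    num_scheduled_tokens: list[int],
) -> list[int]:
    """For each request whose prefill chunk ran through the decode path,
    count all scheduled tokens as accepted so the next iteration advances
    to the correct sequential state slot instead of re-reading the
    default prefill slot 0."""
    for i, via_decode_path in enumerate(is_prefill_as_decode):
        if via_decode_path:
            num_accepted_tokens[i] = num_scheduled_tokens[i]
    return num_accepted_tokens


# Request 1 is a prefill chunk processed as a decode; only its entry
# is overwritten with its scheduled-token count.
acc = adjust_accepted_tokens([1, 1, 1], [False, True, False], [4, 8, 2])
```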