vllm.v1.worker.utils
bind_kv_cache
```python
bind_kv_cache(
    kv_caches: dict[str, Tensor],
    forward_context: dict[str, Attention],
    runner_kv_caches: list[Tensor],
) -> None
```
Bind the allocated KV cache to both ModelRunner and forward context so that the KV cache can be used in the forward pass.
This function:

1. Fills the ModelRunner's KV cache list (`runner_kv_caches`) with `kv_caches`.
2. Associates each attention layer in `forward_context` with its corresponding KV cache in `kv_caches`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `kv_caches` | `dict[str, Tensor]` | The allocated kv_caches with layer names as keys. | required |
| `forward_context` | `dict[str, Attention]` | The global forward context containing all Attention layers, keyed by layer name. | required |
| `runner_kv_caches` | `list[Tensor]` | The kv_cache declared by ModelRunner. | required |
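As a rough illustration of the two steps above, here is a minimal sketch. It assumes layer names embed an index (e.g. `model.layers.3.attn`) and that each `Attention` layer exposes a `kv_cache` attribute; both the naming scheme and the attribute are assumptions here, not guaranteed vLLM internals.

```python
import re
import torch

def bind_kv_cache_sketch(
    kv_caches: dict[str, torch.Tensor],
    forward_context: dict[str, "Attention"],  # Attention type assumed
    runner_kv_caches: list[torch.Tensor],
) -> None:
    def layer_index(name: str) -> int:
        # Assumes names like "model.layers.3.attn"; illustrative only.
        match = re.search(r"\.(\d+)\.", name)
        return int(match.group(1)) if match else 0

    # 1) Fill the ModelRunner's KV cache list in layer order.
    for name in sorted(kv_caches, key=layer_index):
        runner_kv_caches.append(kv_caches[name])

    # 2) Point each attention layer at its allocated cache so the
    #    forward pass can read and write it via the forward context.
    for name, cache in kv_caches.items():
        forward_context[name].kv_cache = [cache]
```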
gather_mm_placeholders
Reconstructs the embeddings from the placeholder tokens.

This is the inverse operation of `scatter_mm_placeholders`.
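A minimal sketch of the gather direction, assuming `placeholders` is the contiguous tensor produced by the scatter step and `is_embed` is a boolean mask over its positions (`None` meaning every position holds an embedding):

```python
from typing import Optional
import torch

def gather_mm_placeholders_sketch(
    placeholders: torch.Tensor,
    is_embed: Optional[torch.Tensor],
) -> torch.Tensor:
    if is_embed is None:
        # No mask: every placeholder position already holds an embedding.
        return placeholders
    # Keep only the positions that hold multimodal embeddings.
    return placeholders[is_embed]
```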
initialize_kv_cache_for_kv_sharing
```python
initialize_kv_cache_for_kv_sharing(
    shared_kv_cache_layers: dict[str, str],
    kv_cache_groups: list[KVCacheGroupSpec],
    kv_caches: dict[str, Tensor],
) -> None
```
Sets up KV cache sharing by reusing the allocated KV caches in `kv_caches` for layers that do not allocate their own KV cache, based on the mapping in `shared_kv_cache_layers`. Adds these layers to the corresponding KV cache group, which is needed to ensure that attention metadata is assigned to them later.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `shared_kv_cache_layers` | `dict[str, str]` | Layer pairings for cross-layer KV sharing. If an Attention layer appears as a key in this dict, it reuses the KV cache allocated for the layer named by the corresponding value. | required |
| `kv_cache_groups` | `list[KVCacheGroupSpec]` | The KV cache groups of the model. | required |
| `kv_caches` | `dict[str, Tensor]` | The allocated kv_caches with layer names as keys. Note that layers in `shared_kv_cache_layers.keys()` are not originally included, as the dict only contains layers that have their own KV cache allocation. | required |
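A sketch of the sharing setup under two assumptions: that `KVCacheGroupSpec` exposes a mutable `layer_names` list, and that every target layer already has an allocated cache in `kv_caches`.

```python
import torch

def initialize_kv_cache_for_kv_sharing_sketch(
    shared_kv_cache_layers: dict[str, str],
    kv_cache_groups: list["KVCacheGroupSpec"],  # type assumed
    kv_caches: dict[str, torch.Tensor],
) -> None:
    # Map each layer that owns a cache to the index of its KV cache group.
    layer_to_group = {
        name: idx
        for idx, group in enumerate(kv_cache_groups)
        for name in group.layer_names
    }
    for layer_name, target_name in shared_kv_cache_layers.items():
        # Reuse the target layer's allocation instead of a fresh cache.
        kv_caches[layer_name] = kv_caches[target_name]
        # Register the sharing layer in the same group so it is assigned
        # attention metadata later.
        group_idx = layer_to_group[target_name]
        kv_cache_groups[group_idx].layer_names.append(layer_name)
```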
sanity_check_mm_encoder_outputs
```python
sanity_check_mm_encoder_outputs(
    mm_embeddings: MultiModalEmbeddings,
    expected_num_items: int,
) -> None
```
Perform sanity checks for the result of `vllm.model_executor.models.SupportsMultiModal.get_multimodal_embeddings`.
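A minimal sketch of the kind of checks involved, assuming the embeddings arrive as a sequence of per-item 2-D tensors; the exact checks and error messages in vLLM may differ.

```python
from collections.abc import Sequence
import torch

def sanity_check_mm_encoder_outputs_sketch(
    mm_embeddings: Sequence[torch.Tensor],
    expected_num_items: int,
) -> None:
    # One embedding tensor is expected per multimodal item.
    assert len(mm_embeddings) == expected_num_items, (
        f"Expected {expected_num_items} multimodal items, "
        f"got {len(mm_embeddings)} embeddings"
    )
    # Each item is assumed to be (num_tokens, hidden_size).
    assert all(e.ndim == 2 for e in mm_embeddings), (
        "Each multimodal embedding should be a 2-D tensor"
    )
```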
scatter_mm_placeholders
Scatter the multimodal embeddings into a contiguous tensor that represents the placeholder tokens.

See also: `vllm.multimodal.processing.PromptUpdateDetails.is_embed`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeds` | `Tensor` | The multimodal embeddings. Shape: | required |
| `is_embed` | `Optional[Tensor]` | A boolean mask indicating which positions in the placeholder tokens need to be filled with multimodal embeddings. Shape: | required |
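A hedged sketch with usage, assuming `embeds` has shape `(num_embeds, embed_dim)`, `is_embed` has one entry per placeholder token, and unfilled positions are zero-initialized; these shape conventions are assumptions for illustration.

```python
from typing import Optional
import torch

def scatter_mm_placeholders_sketch(
    embeds: torch.Tensor,
    is_embed: Optional[torch.Tensor],
) -> torch.Tensor:
    if is_embed is None:
        # No mask: the embeddings already cover every placeholder token.
        return embeds
    # Allocate the full placeholder tensor and fill the masked positions.
    placeholders = embeds.new_zeros(is_embed.shape[0], embeds.shape[-1])
    placeholders[is_embed] = embeds
    return placeholders

embeds = torch.randn(2, 8)                    # two embedding vectors
is_embed = torch.tensor([True, False, True])  # which slots receive them
out = scatter_mm_placeholders_sketch(embeds, is_embed)
print(out.shape)  # torch.Size([3, 8])
```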