vllm.v1.kv_offload.mediums ¶
BlockIDsLoadStoreSpec ¶
Bases: LoadStoreSpec, ABC
Spec for loading/storing KV blocks from given block numbers.
Source code in vllm/v1/kv_offload/mediums.py
CPULoadStoreSpec ¶
Bases: BlockIDsLoadStoreSpec
Spec for loading/storing a KV block to CPU memory.
Source code in vllm/v1/kv_offload/mediums.py
GPULoadStoreSpec ¶
Bases: BlockIDsLoadStoreSpec
Spec for loading/storing a KV block to GPU memory.
If there are multiple KV groups, the blocks are expected to be ordered by group index. In that case, group_sizes[i] gives the number of blocks in the i-th KV group, so sum(group_sizes) == len(block_ids). group_sizes=None indicates a single KV group.
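The grouping rule above can be illustrated with a small sketch. The helper name split_by_group is hypothetical and is not part of vllm; it only assumes the invariant stated above, namely that block_ids is ordered by group index and that sum(group_sizes) == len(block_ids).

```python
from typing import Optional, Sequence


def split_by_group(
    block_ids: Sequence[int],
    group_sizes: Optional[Sequence[int]] = None,
) -> list[list[int]]:
    """Split a flat, group-ordered list of block IDs into per-group lists.

    Hypothetical helper for illustration only; not a vllm API.
    """
    # group_sizes=None means there is a single KV group holding all blocks.
    if group_sizes is None:
        return [list(block_ids)]

    # The documented invariant: the group sizes must account for every block.
    if sum(group_sizes) != len(block_ids):
        raise ValueError("sum(group_sizes) must equal len(block_ids)")

    groups: list[list[int]] = []
    start = 0
    for size in group_sizes:
        # Group i occupies the next `size` entries of the flat list.
        groups.append(list(block_ids[start:start + size]))
        start += size
    return groups
```

For example, with block_ids=[7, 8, 9, 20, 21] and group_sizes=[3, 2], the first group holds blocks 7, 8, 9 and the second holds blocks 20, 21.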
If block_indices is given, each group of block IDs (as determined by group_sizes) corresponds to logically contiguous blocks, e.g. blocks 5-10 of some request. block_indices[i] is the block index of the first block in group #i. Thus, len(block_indices) == len(group_sizes), which equals the number of KV cache groups.
This information is required to support loading from offloaded blocks that are larger than GPU blocks. In such cases, the first GPU block of each group may not be aligned to the offloaded block size, so knowing block_indices[i] allows the worker to correctly skip part of the first matching offloaded block. Offloading from GPU is always aligned to the offloaded block size, so block_indices is only set by the offloading connector when loading into GPU.
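The alignment arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the worker's actual code: it assumes each offloaded block spans a whole number of GPU blocks (called block_size_factor here), and both leading_skip and block_size_factor are hypothetical names that do not appear in the source.

```python
def leading_skip(first_block_index: int, block_size_factor: int) -> int:
    """Number of GPU-block-sized slots to skip in the first offloaded block.

    Hypothetical helper for illustration only; not a vllm API.
    Assumes one offloaded block covers `block_size_factor` GPU blocks.
    """
    if block_size_factor <= 0:
        raise ValueError("block_size_factor must be positive")
    if first_block_index < 0:
        raise ValueError("first_block_index must be non-negative")
    # An offloaded block starts at every multiple of block_size_factor,
    # so the remainder is how far into that block the group begins.
    return first_block_index % block_size_factor


def leading_skips(block_indices: list[int], block_size_factor: int) -> list[int]:
    """Apply leading_skip to the first block index of every KV group."""
    return [leading_skip(i, block_size_factor) for i in block_indices]
```

Under these assumptions, a group starting at block index 5 with a factor of 4 begins one GPU block into the offloaded block that covers GPU blocks 4-7, so one slot is skipped. A group starting at index 8 is aligned and skips nothing.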