Skip to content

vllm.utils.multi_stream_utils

maybe_execute_in_parallel

maybe_execute_in_parallel(
    fn0: Callable[[], Any],
    fn1: Callable[[], Any],
    event0: Event,
    event1: Event,
    aux_stream: Stream | None = None,
) -> tuple[Any, Any]

Run two functions potentially in parallel on separate CUDA streams.

When aux_stream is provided, fn0 runs on the current (default) stream and fn1 runs on aux_stream, synchronized via CUDA events. When aux_stream is None, both functions execute sequentially on the current stream.

This design follows TensorRT-LLM's maybe_execute_in_parallel pattern (tensorrt_llm/_torch/modules/multi_stream_utils.py).

Parameters:

Name Type Description Default
fn0 Callable[[], Any]

Callable for the default stream.

required
fn1 Callable[[], Any]

Callable for the auxiliary stream.

required
event0 Event

CUDA event recorded before fn0 so aux_stream can wait.

required
event1 Event

CUDA event recorded after fn1 so default stream can wait.

required
aux_stream Stream | None

The second CUDA stream for fn1. Multi-stream is disabled when aux_stream is None.

None

Returns:

Type Description
tuple[Any, Any]

Tuple of (fn0_result, fn1_result).

Source code in vllm/utils/multi_stream_utils.py
def maybe_execute_in_parallel(
    fn0: Callable[[], Any],
    fn1: Callable[[], Any],
    event0: torch.cuda.Event,
    event1: torch.cuda.Event,
    aux_stream: torch.cuda.Stream | None = None,
) -> tuple[Any, Any]:
    """Run two functions potentially in parallel on separate CUDA streams.

    When aux_stream is provided, fn0 runs on the current (default) stream and
    fn1 runs on aux_stream, synchronized via CUDA events.  When aux_stream is
    None, both functions execute sequentially on the current stream.

    This design follows TensorRT-LLM's maybe_execute_in_parallel pattern
    (tensorrt_llm/_torch/modules/multi_stream_utils.py).

    Args:
        fn0: Callable for the default stream.
        fn1: Callable for the auxiliary stream.
        event0: CUDA event recorded before fn0 so aux_stream can wait.
        event1: CUDA event recorded after fn1 so default stream can wait.
        aux_stream: The second CUDA stream for fn1.
            Multi-stream is disabled when aux_stream is None.

    Returns:
        Tuple of (fn0_result, fn1_result).
    """
    if aux_stream is not None:
        event0.record()
        result0 = fn0()
        with torch.cuda.stream(aux_stream):
            event0.wait()
            result1 = fn1()
            event1.record()
        event1.wait()
    else:
        result0 = fn0()
        result1 = fn1()
    return (result0, result1)