Reinforcement Learning from Human Feedback¶
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models on human preference data to align model outputs with desired behaviors. vLLM can serve as the inference engine that generates the completions (rollouts) for RLHF.
The following open-source RL libraries use vLLM for fast rollouts (sorted alphabetically and non-exhaustive):
For weight synchronization between training and inference, see the Weight Transfer documentation, which covers the pluggable backend system with NCCL (multi-GPU) and IPC (same-GPU) engines.
For pipelining generation and training to improve GPU utilization and throughput, see the Async Reinforcement Learning guide, which covers the pause/resume API for safely updating weights while requests are in flight.
See the following notebooks for examples of using vLLM for GRPO (Group Relative Policy Optimization):
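As background for the notebooks above, the core of GRPO is computing group-relative advantages: several completions are sampled per prompt, and each completion's reward is normalized against the others in its group, avoiding a learned value function. The sketch below is a minimal, self-contained illustration of that normalization step; the `grpo_advantages` helper and the example reward values are hypothetical, not part of vLLM or any of the libraries listed here.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    # Hypothetical helper: GRPO-style group-relative advantages.
    # Each completion's reward is normalized against the other
    # completions sampled for the same prompt:
    #   advantage = (reward - group mean) / (group std + eps)
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example rewards for 4 completions sampled from one prompt
# (e.g. scored by a reward model or a verifier).
rewards = [1.0, 0.0, 0.5, 0.5]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])  # → [1.414, -1.414, 0.0, 0.0]
```

Completions that score above their group's mean receive positive advantages and are reinforced; those below the mean are penalized. In a real setup, the completions would come from vLLM rollouts and the advantages would weight the policy-gradient loss during training.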