Reinforcement Learning from Human Feedback¶
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models on human preference data to align model outputs with desired behaviors. vLLM can serve as the inference engine that generates the completions (rollouts) for RLHF.
The following open-source RL libraries use vLLM for fast rollouts (sorted alphabetically and non-exhaustive):
For weight synchronization between training and inference, see the Weight Transfer documentation, which covers the pluggable backend system with NCCL (multi-GPU) and IPC (same-GPU) engines.
For pipelining generation and training to improve GPU utilization and throughput, see the Async Reinforcement Learning guide, which covers the pause/resume API for safely updating weights while requests are in flight.
See the following notebooks for examples of using vLLM for GRPO (Group Relative Policy Optimization):
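As background for the notebooks above, the core of GRPO is computing group-relative advantages: several completions are sampled per prompt, and each completion's reward is normalized against the others in its group, avoiding a learned value function. The sketch below is a minimal, self-contained illustration of that normalization step; the `grpo_advantages` helper and the example reward values are hypothetical, not part of vLLM or any of the libraries listed here.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    # Hypothetical helper: GRPO-style group-relative advantages.
    # Each completion's reward is normalized against the other
    # completions sampled for the same prompt:
    #   advantage = (reward - group mean) / (group std + eps)
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example rewards for 4 completions sampled from one prompt
# (e.g. scored by a reward model or a verifier).
rewards = [1.0, 0.0, 0.5, 0.5]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])  # → [1.414, -1.414, 0.0, 0.0]
```

Completions that score above their group's mean receive positive advantages and are reinforced; those below the mean are penalized. In a real setup, the completions would come from vLLM rollouts and the advantages would weight the policy-gradient loss during training.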