Rewards Model Explained
Chaiverse: Reward Model Submission
Last updated
Reward Modelling is an essential part of making high-quality models.
Best-of-N sampling is a core part of the InstructGPT loop.
In fact, if you read the OpenAI paper carefully, PPO out-of-the-box only achieves best-of-2 performance!
In contrast to PPO (Proximal Policy Optimization), Best-of-N sampling is simple, robust, and performant.
The figure below shows the architecture for best-of-4 sampling with a reward model.
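The best-of-N procedure itself is short: draw N candidate completions from the base model, score each with the reward model, and return the highest-scoring one. A minimal sketch, where `generate` and `reward` are hypothetical stand-ins for a language model sampler and a trained reward model:

```python
import random

def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins for illustration only (not the real Chaiverse models):
random.seed(0)
generate = lambda prompt: prompt + " " + random.choice(["hi", "hello there", "greetings, friend"])
reward = len  # toy reward: prefer longer responses

best = best_of_n("User: hi! Bot:", generate, reward, n=4)
```

Because the selection step is just an argmax over reward scores, the same reward model can be swapped in without retraining the base model, which is what makes best-of-N so robust in practice.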