Rewards Model Explained
Chaiverse: Reward Model Submission
Last updated
Reward Modelling is an essential part of making high-quality models.
Best-of-N sampling is a core part of the InstructGPT loop.
In fact, if you read the OpenAI paper carefully, PPO out-of-the-box only achieves best-of-2 performance!
In contrast to PPO (Proximal Policy Optimization), Best-of-N sampling is simple, robust, and performant.
The figure below shows the architecture for best-of-4 sampling with a reward model.
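The best-of-N procedure itself is short: draw N candidate completions from the base model, score each with the reward model, and return the highest-scoring one. A minimal sketch, where `generate` and `reward` are hypothetical stand-ins for a language model sampler and a trained reward model:

```python
import random

def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins for illustration only (not the real Chaiverse models):
random.seed(0)
generate = lambda prompt: prompt + " " + random.choice(["hi", "hello there", "greetings, friend"])
reward = len  # toy reward: prefer longer responses

best = best_of_n("User: hi! Bot:", generate, reward, n=4)
```

Because the selection step is just an argmax over reward scores, the same reward model can be swapped in without retraining the base model, which is what makes best-of-N so robust in practice.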