
Rewards Model Explained

Chaiverse: Reward Model Submission



  • Reward modelling is an essential part of building high-quality models.

  • Best-of-N sampling is the main component of the InstructGPT loop.

  • In fact, if you read OpenAI's paper carefully, PPO out of the box only achieves best-of-2 performance!

  • In contrast to PPO (Proximal Policy Optimization), best-of-N sampling is simple, robust, and performant.

The figure below shows the architecture for best-of-4 sampling with a reward model.
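To make the idea concrete, here is a minimal sketch of best-of-N sampling: generate N candidate replies from a language model, score each one with a reward model, and return the highest-scoring reply. The checkpoints, function names, and sampling parameters below are assumptions chosen for illustration, not the models or API that Chaiverse itself uses.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Hypothetical checkpoints, used purely for illustration.
GEN_MODEL = "gpt2"
REWARD_MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"

gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model = AutoModelForCausalLM.from_pretrained(GEN_MODEL)

rm_tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_MODEL)


def best_of_n(prompt: str, n: int = 4, max_new_tokens: int = 64) -> str:
    """Sample n candidate replies and return the one the reward model scores highest."""
    inputs = gen_tokenizer(prompt, return_tensors="pt")
    outputs = gen_model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=gen_tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [
        gen_tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        for seq in outputs
    ]

    # Score each (prompt, candidate) pair with the reward model and keep the best.
    scores = []
    for candidate in candidates:
        rm_inputs = rm_tokenizer(prompt, candidate, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(reward_model(**rm_inputs).logits[0].item())
    return candidates[scores.index(max(scores))]


print(best_of_n("Hi! How was your day?", n=4))
```

With n=4 this is exactly the best-of-4 setup pictured above: the language model proposes four replies and the reward model acts as the judge that picks which one the user actually sees.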

Keep these parameters in mind:

In-depth explanation