The Bradley-Terry model works like this:
- It's based on a chosen/rejected split of paired samples
- The model is trained on binary judgements that label specific content/samples as either 'preferred' or 'dispreferred'
- The log ratio between the 'preferred' and 'dispreferred' probabilities can be used as a natural reward signal (formalized just below)
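Concretely, for a preferred completion w and a dispreferred completion l with scalar rewards r_w and r_l, the Bradley-Terry relationship is:

```latex
% Probability that w is preferred over l under Bradley-Terry
P(w \succ l) = \frac{e^{r_w}}{e^{r_w} + e^{r_l}} = \sigma(r_w - r_l)

% Inverting: the log-odds of "preferred" vs. "dispreferred" recovers the
% reward gap, which is what gets read off the judgement token
r_w - r_l = \log \frac{P(w \succ l)}{P(l \succ w)}
```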
For my experimental setup, I train my reward model on chunks of the last 64 tokens of the sequence, and evaluate each chunk on a sliding window over the generation. Then I take the average of these judgements across the sequence as the reward for the whole longform generation.
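Here's a minimal sketch of the windowing and averaging, assuming non-overlapping 64-token windows and a generic `score_chunk` callable standing in for the reward-model call (both assumptions on my part):

```python
from transformers import AutoTokenizer

CHUNK = 64  # tokens per judged chunk

def sequence_reward(text: str, score_chunk, tokenizer, stride: int = CHUNK) -> float:
    """Slide a 64-token judgement window over the sequence and average the
    per-chunk scores into a single reward for the whole generation.

    `score_chunk(context, chunk)` stands in for the reward-model call,
    e.g. log P(A) - log P(B) at the judgement token.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    scores = []
    for start in range(0, max(len(ids) - CHUNK, 0) + 1, stride):
        context = tokenizer.decode(ids[:start])             # previous chunks
        chunk = tokenizer.decode(ids[start:start + CHUNK])   # chunk to judge
        scores.append(score_chunk(context, chunk))
    return sum(scores) / len(scores) if scores else 0.0

# Example:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# reward = sequence_reward(generation, my_chunk_scorer, tokenizer)
```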
In addition to this, I'm making synthetic preferred/dispreferred data via the Qwen2.5 7B base model at varying temperatures. For future revisions, I want to experiment with intentionally degrading the text in more diverse ways, such as round-trip translating it to and from another language.
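A rough sketch of how the pair generation could look, assuming the lower-temperature sample plays the 'chosen' role and the higher-temperature sample the 'rejected' role (the exact temperatures and pairing scheme are my assumptions, not taken from the setup above):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B")  # base model used for synthetic pairs

def make_preference_pair(prompt: str, max_tokens: int = 64):
    """Sample one low-temperature ('chosen') and one high-temperature
    ('rejected') continuation of the same prompt. Temperatures are
    illustrative, not the actual values used."""
    chosen_params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    rejected_params = SamplingParams(temperature=1.5, max_tokens=max_tokens)
    chosen = llm.generate([prompt], chosen_params)[0].outputs[0].text
    rejected = llm.generate([prompt], rejected_params)[0].outputs[0].text
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```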
This creates a preference-modeling baseline that is normalized across positions by default, and that, on average, always judges the same relative "volume" of information at a time.
The model expects input in this precise format:
```
[Original text from previous 64-token chunks]...
<<JUDGEMENT_REGION>>
[Next 64-token chunk to evaluate]
<</JUDGEMENT_REGION>>
<<JUDGEMENT>>letter
```
In my setup, the letter is A (chosen) or B (rejected).
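For reference, a small helper that assembles a chunk into that format (the exact whitespace around the tags is my assumption based on the template above):

```python
def build_judgement_prompt(context: str, chunk: str) -> str:
    """Wrap the next 64-token chunk in the judgement-region tags, preceded by
    the previous chunks as context, and end with the open judgement tag that
    the model completes with 'A' (chosen) or 'B' (rejected)."""
    return (
        f"{context}\n"
        "<<JUDGEMENT_REGION>>\n"
        f"{chunk}\n"
        "<</JUDGEMENT_REGION>>\n"
        "<<JUDGEMENT>>"
    )
```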
I use vLLM to evaluate the probability distribution over the A/B judgement tokens for every chunk.
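A minimal sketch of reading that distribution out of vLLM (the logprob plumbing differs a bit between vLLM versions, and the top-k fallback handling is my own assumption):

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")
tok = llm.get_tokenizer()
A_ID = tok.encode("A", add_special_tokens=False)[0]  # assumes 'A'/'B' are single tokens
B_ID = tok.encode("B", add_special_tokens=False)[0]

def chunk_reward(prompt: str) -> float:
    """Return log P(A) - log P(B) at the judgement position for one chunk,
    i.e. the Bradley-Terry style reward signal."""
    params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)
    out = llm.generate([prompt], params)[0]
    logprobs = out.outputs[0].logprobs[0]  # dict: token_id -> Logprob
    lp_a = logprobs[A_ID].logprob if A_ID in logprobs else -math.inf
    lp_b = logprobs[B_ID].logprob if B_ID in logprobs else -math.inf
    return lp_a - lp_b

# Example, using the prompt builder above:
# reward = chunk_reward(build_judgement_prompt(context, chunk))
```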
- Reward model: https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b
- Dataset: https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement