The Bradley-Terry model works like this:
- It's based on a chosen/rejected split of paired samples
- The model is trained on binary judgements that label specific content/samples as either 'preferred' or 'dispreferred'
- The log ratio between the 'preferred' and 'dispreferred' probabilities can be used as a natural reward signal (formalized just below)
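Concretely, for a preferred completion w and a dispreferred completion l with scalar rewards r_w and r_l, the Bradley-Terry relationship is:

```latex
% Probability that w is preferred over l under Bradley-Terry
P(w \succ l) = \frac{e^{r_w}}{e^{r_w} + e^{r_l}} = \sigma(r_w - r_l)

% Inverting: the log-odds of "preferred" vs. "dispreferred" recovers the
% reward gap, which is what gets read off the judgement token
r_w - r_l = \log \frac{P(w \succ l)}{P(l \succ w)}
```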
For my experimental setup, I train my reward model on chunks of the last 64 tokens of the sequence, and evaluate each chunk on a sliding window over the generation. Then I take the average of these judgements across the sequence as the reward for the whole longform generation.
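Here's a minimal sketch of the windowing and averaging, assuming non-overlapping 64-token windows and a generic `score_chunk` callable standing in for the reward-model call (both assumptions on my part):

```python
from transformers import AutoTokenizer

CHUNK = 64  # tokens per judged chunk

def sequence_reward(text: str, score_chunk, tokenizer, stride: int = CHUNK) -> float:
    """Slide a 64-token judgement window over the sequence and average the
    per-chunk scores into a single reward for the whole generation.

    `score_chunk(context, chunk)` stands in for the reward-model call,
    e.g. log P(A) - log P(B) at the judgement token.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    scores = []
    for start in range(0, max(len(ids) - CHUNK, 0) + 1, stride):
        context = tokenizer.decode(ids[:start])             # previous chunks
        chunk = tokenizer.decode(ids[start:start + CHUNK])   # chunk to judge
        scores.append(score_chunk(context, chunk))
    return sum(scores) / len(scores) if scores else 0.0

# Example:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# reward = sequence_reward(generation, my_chunk_scorer, tokenizer)
```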
In addition to this, I'm making synthetic preferred/dispreferred data via the Qwen2.5 7B base model at varying temperatures. For future revisions, I want to experiment with intentionally degrading the text in more diverse ways, such as round-trip translating it to and from another language.
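A rough sketch of how the pair generation could look, assuming the lower-temperature sample plays the 'chosen' role and the higher-temperature sample the 'rejected' role (the exact temperatures and pairing scheme are my assumptions, not taken from the setup above):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B")  # base model used for synthetic pairs

def make_preference_pair(prompt: str, max_tokens: int = 64):
    """Sample one low-temperature ('chosen') and one high-temperature
    ('rejected') continuation of the same prompt. Temperatures are
    illustrative, not the actual values used."""
    chosen_params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    rejected_params = SamplingParams(temperature=1.5, max_tokens=max_tokens)
    chosen = llm.generate([prompt], chosen_params)[0].outputs[0].text
    rejected = llm.generate([prompt], rejected_params)[0].outputs[0].text
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```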
This creates a preference-modeling baseline that is normalized across positions by default, and that, on average, always judges the same relative "volume" of information at a time.
The model expects input in this precise format:
```
[Original text from previous 64-token chunks]...
<<JUDGEMENT_REGION>>
[Next 64-token chunk to evaluate]
<</JUDGEMENT_REGION>>
<<JUDGEMENT>>letter
```
In my setup, the letter is A (chosen) or B (rejected).
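For reference, a small helper that assembles a chunk into that format (the exact whitespace around the tags is my assumption based on the template above):

```python
def build_judgement_prompt(context: str, chunk: str) -> str:
    """Wrap the next 64-token chunk in the judgement-region tags, preceded by
    the previous chunks as context, and end with the open judgement tag that
    the model completes with 'A' (chosen) or 'B' (rejected)."""
    return (
        f"{context}\n"
        "<<JUDGEMENT_REGION>>\n"
        f"{chunk}\n"
        "<</JUDGEMENT_REGION>>\n"
        "<<JUDGEMENT>>"
    )
```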
I use vLLM to evaluate the probability distribution over the A/B judgement tokens for every chunk.
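A minimal sketch of reading that distribution out of vLLM (the logprob plumbing differs a bit between vLLM versions, and the top-k fallback handling is my own assumption):

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")
tok = llm.get_tokenizer()
A_ID = tok.encode("A", add_special_tokens=False)[0]  # assumes 'A'/'B' are single tokens
B_ID = tok.encode("B", add_special_tokens=False)[0]

def chunk_reward(prompt: str) -> float:
    """Return log P(A) - log P(B) at the judgement position for one chunk,
    i.e. the Bradley-Terry style reward signal."""
    params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)
    out = llm.generate([prompt], params)[0]
    logprobs = out.outputs[0].logprobs[0]  # dict: token_id -> Logprob
    lp_a = logprobs[A_ID].logprob if A_ID in logprobs else -math.inf
    lp_b = logprobs[B_ID].logprob if B_ID in logprobs else -math.inf
    return lp_a - lp_b

# Example, using the prompt builder above:
# reward = chunk_reward(build_judgement_prompt(context, chunk))
```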
- Reward model: https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b
- Dataset: https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement