Hao Zhang hbcbh1999

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and followup large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology "instruction fine tuning", learning to immitate human written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argumment which not only supports the case of RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

Purpose

Bootstrap knowledge of LLMs ASAP. With a bias/focus to GPT.

Avoid being a link dump. Try to provide only valuable well tuned information.

Prelude

Neural network links before starting with transformers.

Debug with GDB on macOS 11

The big reason to do this is that LLDB has no ability to "follow-fork-mode child", in other words, a multi-process target that doesn't have a single-process mode (or, a bug that only manifests when in multi-process mode) is going to be difficult or impossible to debug, especially if you have to run the target over and over in order to make the bug manifest. If you have a repeatable bug, no big deal, break on the fork from the parent process and attach to the child in a second lldb instance. Otherwise, read on.

Install GDB

Don't make the mistake of thinking you can just brew install gdb. Currently this is version 10.2 and it's mostly broken, with at least two annoying bugs as of April 29th 2021, but the big one is https://sourceware.org/bugzilla/show_bug.cgi?id=24069

$ xcode-select install  # install the XCode command-line tools

A Few Useful Things to Know about Machine Learning

The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

Representation for a learner is the set if classifiers/functions that can be possibly learnt. This set is called hypothesis space. If a function is not in hypothesis space, it can not be learnt.
Evaluation function tells how good the machine learning model is.
Optimisation is the method to search for the most optimal learning model.

	# This script adds a "Webclip" shortcut to your homescreen.
	# The shortcut can be used to open a web page in full-screen mode,
	# or to launch a custom URL (e.g. a third-party app).
	# You'll be asked for a title, a URL, and an icon (from your camera roll)

	import plistlib
	import BaseHTTPServer
	import webbrowser
	import uuid
	from io import BytesIO