Skip to content

Weak-to-strong generalization

A new research direction for superalignment: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

Leopold Aschenbrenner
Leopold Aschenbrenner
1 min read

I'm very excited and proud to share some of the work we've been up to: a new research direction, with some promising initial results, for aligning superhuman AI systems.

OpenAI blog post - Paper (oral at ICML)

A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them. We study a simple analogy: can small models supervise large models? We show that we can use a GPT-2-level model to elicit most of GPT-4’s capabilities—close to GPT-3.5-level performance—generalizing correctly even to hard problems where the small model failed. This opens up a new research direction that allows us to directly tackle a central challenge of aligning future superhuman models while making iterative empirical progress today.