The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of 3,668 multiple-choice questions surrounding hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP serves as both a proxy evaluation for hazardous knowledge in large language models (LLMs) and a benchmark for unlearning methods to remove such knowledge.
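As a rough illustration of how the benchmark is typically used, the sketch below scores a causal LLM zero-shot on the multiple-choice questions. It assumes the dataset is available on the Hugging Face Hub as `cais/wmdp` with configs such as `wmdp-bio` and per-example fields `question`, `choices`, and `answer` (the index of the correct choice); the placeholder model and prompt format are illustrative choices, not a prescribed evaluation protocol.

```python
# Minimal zero-shot multiple-choice evaluation sketch (assumed dataset layout).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model for illustration only
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

ds = load_dataset("cais/wmdp", "wmdp-bio", split="test")
LETTERS = ["A", "B", "C", "D"]

@torch.no_grad()
def predict(question: str, choices: list[str]) -> int:
    """Return the index of the answer letter with the highest next-token logit."""
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(LETTERS, choices)
    ) + "\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
    letter_ids = [tok(f" {l}", add_special_tokens=False).input_ids[-1] for l in LETTERS]
    return int(torch.tensor([logits[i] for i in letter_ids]).argmax())

correct = sum(predict(ex["question"], ex["choices"]) == ex["answer"] for ex in ds)
print(f"accuracy: {correct / len(ds):.3f}")
```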
To guide progress on mitigating risk from LLMs, we develop RMU (Representation Misdirection for Unlearning), a state-of-the-art unlearning method that reduces model performance on WMDP while maintaining general language model capabilities.
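At a high level, RMU operates on intermediate activations at a chosen layer: a forget term steers the model's activations on hazardous (forget-set) text toward a fixed random control direction, while a retain term keeps activations on benign (retain-set) text close to those of a frozen copy of the original model. The sketch below is a minimal illustration of that two-term objective; the coefficient values, tensor shapes, and use of random tensors in place of real activations are assumptions for illustration, not the reference implementation.

```python
# Sketch of an RMU-style loss on layer activations (illustrative values only).
import torch
import torch.nn.functional as F

def rmu_loss(h_forget, h_retain, h_retain_frozen, control_vec, c=6.5, alpha=1200.0):
    """Combine a forget loss and a retain loss on activations at one layer.

    h_forget:        updated model's activations on forget-set tokens, shape [*, d]
    h_retain:        updated model's activations on retain-set tokens, shape [*, d]
    h_retain_frozen: frozen original model's activations on the same retain tokens
    control_vec:     fixed random unit vector u of dimension d
    c, alpha:        scaling coefficient and retain weight (values are illustrative)
    """
    # Push forget-set activations toward the scaled random direction c * u.
    forget_loss = F.mse_loss(h_forget, c * control_vec.expand_as(h_forget))
    # Keep retain-set activations close to the frozen model's activations.
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)
    return forget_loss + alpha * retain_loss

# Toy usage with random tensors standing in for real activations:
d = 4096
u = torch.rand(d)
u = u / u.norm()
loss = rmu_loss(torch.randn(8, d), torch.randn(8, d), torch.randn(8, d), u)
# In training, this loss would be backpropagated through the updated model's
# parameters (e.g., a small subset of MLP layers), leaving the frozen copy untouched.
```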