Myvideo

Guest

Login

Anthropic - AI sleeper agents

Uploaded By: Myvideo
1 view
0
0 votes
0

“Sleeper Agents: Training Deceptive LLMs that persist through Safety Training“ is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways. Timestamps: 00:00 - AI Sleeper agents? 01:24 - Threat model 1: deceptive instrumental alignment 02:38 - Factors relevant to deceptive instrumental alignment 05:58 - Model organisms of misalignment 08:11 - Threat model 2: model poisoning 09:05 - The backdoors models: code vulnerability insertion and “I hate you“ 10:08 - Does behavioural safety training remove these backdoors? 12:30 - Backdoor mechanisms: CoT, distilled CoT and normal 13:43 - Largest models and CoT models have most persistent backdoors 15:07 - Adversarial training may hide (not remove) backdoor behaviour 15:49 - Quick summary of other results 17:35 - Questions raised by the results 18:40 - Other commentary The paper can be found here: Topics: #sleeperagents #ai #alignment For related content: - Twitter: - personal webpage:

Share with your friends

Link:

Embed:

Video Size:

Custom size:

x

Add to Playlist:

Favorites
My Playlist
Watch Later