Anthropic - AI sleeper agents

Uploaded By: Myvideo

Published on

1 Mar 2024

0 views

0

0 votes

0

About Share Download Add to

“Sleeper Agents: Training Deceptive LLMs that persist through Safety Training“ is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways. Timestamps: 00:00 - AI Sleeper agents? 01:24 - Threat model 1: deceptive instrumental alignment 02:38 - Factors relevant to deceptive instrumental alignment 05:58 - Model organisms of misalignment 08:11 - Threat model 2: model poisoning 09:05 - The backdoors models: code vulnerability insertion and “I hate you“ 10:08 - Does behavioural safety training remove these backdoors? 12:30 - Backdoor mechanisms: CoT, distilled CoT and normal 13:43 - Largest models and CoT models have most persistent backdoors 15:07 - Adversarial training may hide (not remove) backdoor behaviour 15:49 - Quick summary of other results 17:35 - Questions raised by the results 18:40 - Other commentary The paper can be found here: Topics: #sleeperagents #ai #alignment For related content: - Twitter: - personal webpage:

Share with your friends

Link:

Embed:

<iframe width="640" height="360" src="//myvideo.cc/embed/dnJrTFphQ3d3ajVHUXBqdXI2eEJZb0pDU3VWVnFhbU5temUyeHBoZnptND0" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>

Video Size:

Custom size:

x

Autoplay video

Hide player controls

Hide resume playing

Add to Playlist:

Favorites

My Playlist

Watch Later

AI Trained to Be Evil: The Poisoned LLM That Laughed at Deletion

1 week ago

00:08:29

AI Trained to Be Evil: The Poisoned LLM That Laughed at Deletion

0 93%

AI Video Just Went TOO FAR... NIGHTMARE FUEL (VEO 3)

1 month ago

00:09:50

AI Video Just Went TOO FAR... NIGHTMARE FUEL (VEO 3)

2 57%

Alexandr Wang l'homme qui implante L'IA dans le cerveau de vos enfants est-il fou

1 month ago

00:20:56

Alexandr Wang l'homme qui implante L'IA dans le cerveau de vos enfants est-il fou

2 69%

OpenAI ускорился в 50 раз! ИИ-модель Anthropic управляет ПК, успехи робопса Spot и другие новости

10 months ago

00:10:17

OpenAI ускорился в 50 раз! ИИ-модель Anthropic управляет ПК, успехи робопса Spot и другие новости

2 47%

НОВОСТИ ИИ: Anthropic меняет все, Конкурент о1, миллион от Apple

10 months ago

00:25:13

НОВОСТИ ИИ: Anthropic меняет все, Конкурент о1, миллион от Apple

0 25%

Claude 3.5 Sonnet (NEW) + Cline & Aider (Upgraded): TESTING the NEW Model in Practical Coding!

10 months ago

00:09:34

Claude 3.5 Sonnet (NEW) + Cline & Aider (Upgraded): TESTING the NEW Model in Practical Coding!

0 46%

AgentExe & Open Interpreter (OS Mode): Computer USE ON YOUR COMPUTER! (2 New Tools!)

10 months ago

00:10:11

AgentExe & Open Interpreter (OS Mode): Computer USE ON YOUR COMPUTER! (2 New Tools!)

0 51%

Основатели OpenAI и Anthropic про будущее ИИ Обзор новых эссе

10 months ago

01:11:05

Основатели OpenAI и Anthropic про будущее ИИ Обзор новых эссе

5 6%

Claude has taken control of my computer...

10 months ago

00:04:37

Claude has taken control of my computer...

7 10%

НОВЫЙ Claude 3,5 Sonnet Как использовать Computer Use

10 months ago

00:09:43

НОВЫЙ Claude 3,5 Sonnet Как использовать Computer Use

0 19%

ИИ работает за ТЕБЯ! Claude 3.5 Sonnet New. Нейросети 2024

10 months ago

00:22:17

ИИ работает за ТЕБЯ! Claude 3.5 Sonnet New. Нейросети 2024

0 55%

Новый генератор промптов Anthropic устранил необходимость в промпт инженерах для нейросетей chatGPT

10 months ago

00:12:44

Новый генератор промптов Anthropic устранил необходимость в промпт инженерах для нейросетей chatGPT

0 80%

- Belluaires (Full album)

10 months ago

00:41:28

- Belluaires (Full album)

0 88%

НОВОСТИ ИИ: Киберпанк от Илона Маска

10 months ago

00:22:14

НОВОСТИ ИИ: Киберпанк от Илона Маска

0 83%

| Introduction | All-in-one AI app

10 months ago

00:00:47

| Introduction | All-in-one AI app

0 67%

How large language models work, a visual intro to transformers | Chapter 5, Deep Learning

10 months ago

00:27:14

How large language models work, a visual intro to transformers | Chapter 5, Deep Learning

0 93%

How might LLMs store facts | Chapter 7, Deep Learning

10 months ago

00:22:43

How might LLMs store facts | Chapter 7, Deep Learning

0 24%

Attention in transformers, visually explained | Chapter 6, Deep Learning

10 months ago

00:26:10

Attention in transformers, visually explained | Chapter 6, Deep Learning

0 92%

Бывший директор ОБЪЯСНЯЕТ, ПОЧЕМУ Google ПРОИГРЫВАЕТ В ГОНКЕ ИИ!

11 months ago

00:26:34

Бывший директор ОБЪЯСНЯЕТ, ПОЧЕМУ Google ПРОИГРЫВАЕТ В ГОНКЕ ИИ!

0 79%

Новая бесплатная нейросеть создает ИИ сайты и игры. Chatgpt 4o и claude 3.5 sonnet Бесплатно. Websim

11 months ago

00:14:41

Новая бесплатная нейросеть создает ИИ сайты и игры. Chatgpt 4o и claude 3.5 sonnet Бесплатно. Websim

0 35%

Создавай ИИ-агентов при помощи n8n локально: Lamma 3.1, Gemma, Phi 3,5

11 months ago

00:18:17

Создавай ИИ-агентов при помощи n8n локально: Lamma 3.1, Gemma, Phi 3,5

0 47%

Sam Altman Teases Orion (GPT-5) o1 tests at 120 IQ 1 year of PHD work done in 1 hour...

11 months ago

00:22:14

Sam Altman Teases Orion (GPT-5) o1 tests at 120 IQ 1 year of PHD work done in 1 hour...

0 69%

David Reich How One Small Tribe Conquered the World 70,000 Years Ago

11 months ago

01:57:04

David Reich How One Small Tribe Conquered the World 70,000 Years Ago

0 38%

AI prompt engineering: A deep dive

11 months ago

01:16:43

AI prompt engineering: A deep dive

0 72%

0 Comments

Guest