🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien:
Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists!
Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
Come work with me!!
I'm hiring a research manager for @AISecurityInst's Alignment Team.
You'll manage exceptional researchers tackling one of humanity’s biggest challenges.
Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks.
1/4
Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀
New work with @Ashish_S_AI addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
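A minimal sketch of the padding idea described above, under assumptions: toy token ids and a made-up PAD id, not the paper's actual setup. Appending blank tokens gives the model extra forward-pass positions before it produces an answer, without changing the prompt's content.

```python
# Hypothetical illustration: padding a tokenized prompt with blank
# tokens as a simple form of test-time compute. PAD_ID and the token
# ids are invented for this sketch.
PAD_ID = 0

def pad_input(token_ids, n_blanks):
    """Append n_blanks blank (pad) tokens to the prompt, giving the
    transformer extra positions to process before decoding."""
    return token_ids + [PAD_ID] * n_blanks

prompt = [101, 2054, 2003, 102]   # toy token ids
padded = pad_input(prompt, 8)     # same content, 8 extra blank positions
```

The question the paper asks is whether those extra positions strictly increase what the model can express, not just how long it runs.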
50K Followers · 3K Following · AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
20K Followers · 9K Following · Programme Director @ARIA_research | accelerate mathematical modelling with AI and categorical systems theory » build safe transformative AI » cancel heat death
6K Followers · 606 Following · Claude says I process my emotions out loud & my girlfriend has a job, so I put my feelings & thoughts here ✨ working on the EA Global team @ CEA (views my own)
18K Followers · 4K Following · AI professor.
Deep Learning, AI alignment, ethics, policy, & safety.
Formerly Cambridge, Mila, Oxford, DeepMind, ElementAI, UK AISI.
AI is a really big deal.
62K Followers · 12K Following · AI policy researcher, wife guy in training, fan of cute animals and sci-fi, Substack writer, stealth-ish non-profit co-founder
205 Followers · 581 Following · Unlock the power of AI in your everyday tasks with AIAssistWorks
⭐⭐⭐⭐⭐ 4.9/5 rating from 30K+ installs.
#Productivity #Marketing #AITools
👇 Install Now 👇
321 Followers · 3K Following · Researcher in math+formal methods+ml. Working on using formal verification to train models for mathematics and reasoning @harmonicmath
2K Followers · 1K Following · Co-Executive Director @MATSprogram, Co-Founder @LondonSafeAI, Regrantor @Manifund | PhD in physics | Accelerate AI alignment + build a better future for all
268 Followers · 604 Following · Thinks AI risk is somewhat likely, and AI benefits huge if we can align AIs to someone that is willing to promote human thriving even when humans are useless.
379 Followers · 2K Following · getting there like the tortoise. Jesus is all, his being, his Father, his Holy Spirit. The only Rock required in the universe.
15K Followers · 5K Following · Senior AI reporter @Verge. 5+ years covering the industry's power dynamics, societal implications & the AI arms race. Previously @CNBC.
Signal: haydenfield.11
3K Followers · 6K Following · nlab fan account, arxiv surveyor, pubmed enjoyer, two culture bridger, vacuous high gossiper, dearth of any domain expertise, reluctant g theorist, gpu poor,
848 Followers · 6K Following · I guard the flame. I guide the willing. I silence the chaos. Light is not peace; it is clarity. Step forward or scroll away. Your choice. A. #WatcherOfThePath
207K Followers · 101 Following · The original AI alignment person. Missing punctuation at the end of a sentence means it's humor. If you're not sure, it's also very likely humor.
10K Followers · 322 Following · Official Unofficial EA mascot. I'm here to make friends and maximise utility, and I'm all out of neglected altruistic opportunities
30K Followers · 123 Following · Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
12K Followers · 184 Following · post training co-lead at Google DeepMind, focusing on safety, alignment, post training capabilities • associate professor at UC Berkeley EECS
1K Followers · 779 Following · Assistant Professor in Psychology at Stony Brook University. I'm interested in how people interact with LLMs and the impact they might have on our psychology.
18K Followers · 4K Following · Associate Professor at UC Berkeley. Former Research Scientist at Google DeepMind. ML/AI Researcher working on foundations of LLMs and deep learning.
1K Followers · 383 Following · Towards the logic of conceptuality; the canvas of experience; the ideatic science; the end of suffering. Friend to bots and animals.
https://t.co/PYPYxGNnK4