Can language models be used as world simulators? In our ACL 2024 paper, we show -- not really. GPT-4 is only ~60% accurate at simulating state changes in common-sense tasks like boiling water. Preprint: arxiv.org/pdf/2406.06485 @allen_ai @MSFTResearch @aclmeeting
This is follow-on work to our EMNLP paper that asked "Can LLMs generate code for world simulators?" There we showed -- sort of. Our best technique generated runnable simulations 57% of the time, but about half of the actions those simulations allowed didn't make sense. Paper: arxiv.org/abs/2305.14879 /1

Following that paper, the big question was: can we use LLMs directly as simulators, without generating code as an intermediate step? And if so, how good are they at this task?
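To make "LLM directly as simulator" concrete, here is a minimal sketch of the setup: hand the model the current world state plus an action, and ask it to predict the next state, including environment-driven changes. This assumes a JSON state representation and the OpenAI chat API; the prompt wording and the helper name `simulate_step` are illustrative, not the paper's exact protocol.

```python
# Hedged sketch: an LLM queried as a one-step text-based world simulator.
# The state schema, prompt, and function names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a simulator for a text-based environment. Given the current "
    "state as JSON and a player action, return ONLY the JSON of the next "
    "state, applying both action-driven and environment-driven changes "
    "(e.g., water on an active stove heats up over time)."
)

def simulate_step(state: dict, action: str, model: str = "gpt-4") -> dict:
    """Ask the LLM to predict the world state after `action` is taken."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"State:\n{json.dumps(state)}\n\nAction: {action}"},
        ],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example transition of the kind the ~60% accuracy figure is measured over:
state = {"pot": {"contains": "water", "temperature_c": 20, "on_stove": True},
         "stove": {"activated": False}}
next_state = simulate_step(state, "activate the stove")
```

Scoring is then just a comparison of the predicted next state against the gold next state, transition by transition.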
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Doesn't this show that, in all cases except environment-driven changes, GPT-4 is about as good on average as a human?
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting @teortaxesTex awaiting your review
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Cool work! We explored a different approach to using LLMs as world simulators in our Breakpoint Transformers paper x.com/ai2_aristo/sta… We similarly found that simulation abilities were not reliable, especially in OOD settings
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Congrats on this inspiring work!
The findings suggest that while LLMs show promise, they are still unreliable as direct text-based world simulators, especially when it comes to capturing environment-driven transitions and transitions that require complex reasoning. The authors highlight the need for further innovations to improve LLMs' world modeling capabilities. Full paper: openread.academy/en/paper/readi…
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting I really like this work, which dives deep into language models as world models. You might also find our work interesting; it provides more such analyses on classic planning domains: arxiv.org/pdf/2402.11489
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Not surprising results from general-purpose LLMs. Could they be better world models when augmented with proprietary, domain-specific knowledge via RAG? One would expect so.
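Not something the paper tests, but to make the RAG suggestion above concrete: retrieve relevant domain transition rules and inject them into the simulator prompt. The rule text, the toy keyword retriever, and all names here are illustrative assumptions, not an evaluated method.

```python
# Hedged sketch of a RAG-augmented simulator prompt. The retriever is a
# toy keyword matcher standing in for a real vector index.
from openai import OpenAI

client = OpenAI()

DOMAIN_RULES = [
    "Water on an active stove gains roughly 20 C per simulated tick.",
    "Water boils at 100 C and becomes steam if the pot is uncovered.",
]

def retrieve_rules(action: str, k: int = 2) -> list[str]:
    """Rank rules by keyword overlap with the action (toy retrieval)."""
    overlap = lambda r: len(set(r.lower().split()) & set(action.lower().split()))
    return sorted(DOMAIN_RULES, key=overlap, reverse=True)[:k]

def simulate_with_rag(state_json: str, action: str) -> str:
    rules = "\n".join(retrieve_rules(action))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Simulate the next state as JSON. Domain rules:\n{rules}"},
            {"role": "user", "content": f"State: {state_json}\nAction: {action}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```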
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting 60%... so, less than a toddler
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting To me, the question itself already says a lot about what we have achieved.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Of course not. Their worldview is entirely based on training and alignment.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Very nice. It would be interesting to see a human-vs-LLM performance comparison where the humans don't have the refined cognitive skills typical of high-level AI researchers.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Still can't handle Nethack👏👏
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting This is tested on the text-based GPT-4 model. I wonder whether we would see a substantial improvement with GPT-4o, given its multimodality. World simulators require observational knowledge that isn't necessarily accessible in text form, even if the questions are given as text.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Here’s another paper worth digging into around benchmarks. arxiv.org/pdf/2406.04744
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting "Sparks of AGI". (Not.) Snark aside: work like this is important for systematically demonstrating what many of us feel in everyday use.