Can language models be used as world simulators? In our ACL 2024 paper, we show -- not really. GPT-4 is only ~60% accurate at simulating state changes in common-sense tasks like boiling water. Preprint: arxiv.org/pdf/2406.06485 @allen_ai @MSFTResearch @aclmeeting
This is follow-on work to our EMNLP paper that asked "Can LLMs generate code for world simulators?" There we showed -- sort of. Our best technique generated runnable simulations 57% of the time, but about half of the actions those simulations allowed didn't make sense. Paper: arxiv.org/abs/2305.14879 /1

Following that paper, the big question was: can we use LLMs directly as simulators, without generating code as an intermediate step? And if so, how good are they at this task?
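To make "LLM directly as simulator" concrete, here is a minimal sketch of the setup: hand the model the current world state plus an action, and ask it to predict the next state, including environment-driven changes. This assumes a JSON state representation and the OpenAI chat API; the prompt wording and the helper name `simulate_step` are illustrative, not the paper's exact protocol.

```python
# Hedged sketch: an LLM queried as a one-step text-based world simulator.
# The state schema, prompt, and function names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a simulator for a text-based environment. Given the current "
    "state as JSON and a player action, return ONLY the JSON of the next "
    "state, applying both action-driven and environment-driven changes "
    "(e.g., water on an active stove heats up over time)."
)

def simulate_step(state: dict, action: str, model: str = "gpt-4") -> dict:
    """Ask the LLM to predict the world state after `action` is taken."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"State:\n{json.dumps(state)}\n\nAction: {action}"},
        ],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example transition of the kind the ~60% accuracy figure is measured over:
state = {"pot": {"contains": "water", "temperature_c": 20, "on_stove": True},
         "stove": {"activated": False}}
next_state = simulate_step(state, "activate the stove")
```

Scoring is then just a comparison of the predicted next state against the gold next state, transition by transition.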
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Doesn't this show that, in all cases except environment-driven changes, GPT-4 is about as good on average as a human?
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting @teortaxesTex awaiting your review
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Cool work! We explored a different approach to using LLMs as world simulators in our Breakpoint Transformers paper x.com/ai2_aristo/sta… We similarly found that simulation abilities were not reliable, especially in OOD settings
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Congrats on this inspiring work!
The findings suggest that while LLMs show promise, they are still unreliable as direct text-based world simulators, especially when it comes to capturing environment-driven transitions and transitions that require complex reasoning. The authors highlight the need for further innovations to improve LLMs' world modeling capabilities. Full paper: openread.academy/en/paper/readi…
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting I really like this work, which dives deep into language models as world models. You might also find our work interesting; it provides more such analyses on classic planning domains: arxiv.org/pdf/2402.11489
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Not surprising results from general-purpose LLMs. Could they be better world models when augmented with proprietary, domain-specific knowledge via RAG? One would expect so.
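Not something the paper tests, but to make the RAG suggestion above concrete: retrieve relevant domain transition rules and inject them into the simulator prompt. The rule text, the toy keyword retriever, and all names here are illustrative assumptions, not an evaluated method.

```python
# Hedged sketch of a RAG-augmented simulator prompt. The retriever is a
# toy keyword matcher standing in for a real vector index.
from openai import OpenAI

client = OpenAI()

DOMAIN_RULES = [
    "Water on an active stove gains roughly 20 C per simulated tick.",
    "Water boils at 100 C and becomes steam if the pot is uncovered.",
]

def retrieve_rules(action: str, k: int = 2) -> list[str]:
    """Rank rules by keyword overlap with the action (toy retrieval)."""
    overlap = lambda r: len(set(r.lower().split()) & set(action.lower().split()))
    return sorted(DOMAIN_RULES, key=overlap, reverse=True)[:k]

def simulate_with_rag(state_json: str, action: str) -> str:
    rules = "\n".join(retrieve_rules(action))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Simulate the next state as JSON. Domain rules:\n{rules}"},
            {"role": "user", "content": f"State: {state_json}\nAction: {action}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```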
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting 60%... so, less than a toddler
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting To me, the question itself already says a lot about what we have achieved.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Of course not. Their worldview is entirely based on training and alignment.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Very nice. It would be interesting to see a human-vs-LLM performance comparison where the humans don't have the refined cognitive skills typical of high-level AI researchers.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Still can't handle Nethack👏👏
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting This is tested on the text-based GPT-4 model. I wonder whether we would see a substantial improvement with GPT-4o, given its multimodality. World simulators require observational knowledge that isn't necessarily accessible in text form, even if the questions are given as text.
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting Here’s another paper worth digging into around benchmarks. arxiv.org/pdf/2406.04744
@peterjansen_ai @allen_ai @MSFTResearch @aclmeeting "Sparks of AGI". (Not.) Snark aside: work like this is important for systematically demonstrating what many of us feel in everyday use.