Nikhil Chandak @nikhilchandak29

PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Interested in better evals, forecasting, and open-endedness.

nikhilchandak.github.io
Tübingen, Germany
Joined December 2016

Tweets: 81
Followers: 384
Following: 418
Likes: 879

Arvindh Arun @arvindh__a

3 weeks ago

SEMMA (transliteration of செம்ம - meaning awesome), my first PhD work, is accepted to #EMNLP2025 Main! I also found out today that SEMMA has the (tied) highest average reviewer score in this ARR cycle 💪 📜: arxiv.org/abs/2505.20422

Arvindh Arun @arvindh__a

3 months ago

Nikhil Chandak @nikhilchandak29

a month ago

We have hit a new high in chart crime

Akshit @akshitwt

a month ago

Greg Burnham @GregHBurnham

2 months ago

Pretty happy with how my predictions are holding up. 5/6 was the gold medal threshold this year. OAI's "experimental reasoning LLM" got that exactly, failing only to solve the one hard combinatorics problem, P6. My advice remains: look beyond the medal. Brief thread. 1/

Alexander Wei @alexwei_

2 months ago

Nikhil Chandak @nikhilchandak29

2 months ago

Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.

Florian Tramèr @florian_tramer

2 months ago

Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam. Eg if you have a trigonometry problem and the possible solutions are A: 1 B: 3.7 C: -5 D: pi/2 which would you pick (with no knowledge of the question)?

Nikhil Chandak @nikhilchandak29

2 months ago

Shashwat Goel @ShashwatGoel7

2 months ago

TIL half of SWE-Bench-Verified is fixing issues in a single repository. We really need to be careful with how we name benchmarks, and be explicit about which capabilities they test. Fix-issues-in-the-Django-repo-Bench doesn't have the same ring to it, and that's the point.

Epoch AI @EpochAIResearch

3 months ago

Deepak Pathak @pathak2206

2 months ago

A great example of scientific discourse at its best—thoughtful, constructive, and conclusive. We now have more rigorous evidence that confidence maximization improves reasoning. 👇

Mihir Prabhudesai @mihirp98

2 months ago

Mihir Prabhudesai @mihirp98

2 months ago

1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing…

Shashwat Goel @ShashwatGoel7

3 months ago

Jonas Geiping @jonasgeiping

3 months ago

Forecasting future events is a fascinating task for language models. Arguably the hardest application for a pure "oracle" that can't take actions; requiring reasoning about conflicting info, planning, information seeking... But, forecasting is also uniquely hard to evaluate: