Thank you, Elias, for pushing our thinking in this direction, and please pass my admiration to Drago for his productivity… I’m wondering why you used another LLM as the ground truth here? It just adds another abstraction… I mean, wouldn’t it be more convincing to estimate the distribution via some cross-classification table, like it’s done in epidemiology (sorry, couldn’t resist 😊), and then check it against the performance of LLMs? BTW, pls consider the SEER database, seer.cancer.gov/data/access.ht… the best population-based registry in the world… health survey data are laced with selection bias and rarely lead to even descriptive evidence… but SEER is simply the best: curated, studied, and recognized by cancer epidemiologists
@soboleffspaces Thank you, Boris! We didn’t use other LLMs as ground truth; we used the datasets listed on p. 5: causalai.net/r136.pdf. We’ve been looking for additional sources of ground truth and enlarging the benchmark, so the link is appreciated. Of course, open to collaborations!