FutureSearch benchmarks, like Deep Research Bench, find Opus 4.1 and Opus 4 the same on average, but clearly different: better at numeric & data tasks (kind of like code?), worse at qualitative reasoning.
Deep Research Bench live leaderboard is up! Results on new DeepSeek, new Gemini, Claude-4 on web tasks.
Link in reply. tl;dr of new results: DeepSeek still bad, Gemini seemingly worse for 2nd release in a row; Claude-4 is king.
Thanks @EuginaJordan for highlighting our @BuiltIn write-up in your newsletter! If you're thinking about AI hallucinations, read our research-backed take on what we can actually do to fix them builtin.com/articles/ai-ha…
Just launched: Deep Research Bench by FutureSearch.
89 tasks. Real-world research.
Agents tested: GPT‑4o, Claude 3, Gemini 2.5, DeepSeek.
Find out who leads, and where every model still fails: drb.futuresearch.ai #AIbenchmark #WebResearch #LLM
AI research tools are rapidly improving. But when accuracy matters, interactive research will remain substantially better than existing deep research tools. This will be true as long as humans are better than AI at error-checking and ideation.
builtin.com/articles/deep-…
Deep Research is a very prominent LLM use case, only backed by handwaving claims. Now, for the first time, we can actually put companies' claims to the test, tell users who is best at what, and rank which models are best at research in general
arxiv.org/abs/2506.06287
Thank you to @futuristdotai for covering Deep Research Bench! The first of its kind to score LLM agents on web-based research tasks. unite.ai/how-good-are-a…
Also, one interesting trend we noticed:
The fastest time for companies to reach $10b revenue, and then $100b revenue, is decreasing at a rate that is entirely consistent with OpenAI reaching their projected milestones!
How's that for naive extrapolation?
OpenAI reported yesterday they forecast $125B revenue in 2029.
This is way overoptimistic about ChatGPT, the API, and "monetizing free users".
But I think $125B in 2029 is still plausible, based on the AI 2027 scenario.
Short thread on where AI revenue is headed: 🧵
Ever use a "Deep Research" tool for work? New FutureSearch finding: these tools (Gemini Deep Research, OpenAI Deep Research, and Perplexity Deep Research) have surprising failures on tasks you might ask them to do: futuresearch.ai/dr-persist-ada…
Ever wonder what happened with Google's first prediction market, which ran from 2005 to 2010?
The previously unreported story, by yours truly in @asteriskmgzn, including the wild finale.
Plus the story of the new prediction market that grew from its ashes that runs there today.
Last week, we got a leak on OpenAI subscribers.
On some - ChatGPT Plus subs, and API revenue - FutureSearch's numbers from June look prescient.
Very surprised how much Enterprise growth slowed. And almost nobody pays for ChatGPT Team!?
Our sleuthing: futuresearch.ai/openai-case-st…
OpenAI says o1 plans better. Does it?
We read a bunch of agent traces line-by-line, with 4 good agent archs, on 8 messy web+stats white-collar tasks with detailed partial credit.
Result: o1-preview aces some tasks others fail, but it's... moody?
futuresearch.ai/llm-agent-eval
Much is still unknown about this important AI capability. We at futuresearch.ai are still in the trenches. Let’s not mislead the public about how good AI forecasting is.
Full takedown: lesswrong.com/posts/uGkRcHqa…
(7/7)