High-quality math data is the secret sauce for reasoning models.
The best math data is in old papers. But OCRing that math is full of insane edge cases.
Let's talk about how to solve this, and how you can get better math data than many frontier labs 🧵
Extracting specific data from long documents is a challenge, even with LLMs. We're shipping Datalab Extract beta, which makes it simpler.
- Pass a Pydantic or JSON schema
- Visual editor
- More accurate than Gemini (internal benchmark)
- Built on open source
- Can run on-prem
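To make the "pass a schema" idea concrete, here is a minimal stdlib-only sketch of what a JSON schema for an extraction job could look like, plus a tiny sanity check on a result. Everything here is an assumption for illustration: the field names, `INVOICE_SCHEMA`, and `check_extraction` are hypothetical and are not the Datalab Extract API. The check only covers top-level required keys and basic types, not full JSON Schema validation.

```python
import json

# Hypothetical schema; field names are illustrative assumptions,
# not taken from any Datalab documentation.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array"},
    },
    "required": ["vendor", "total"],
}

# Map JSON Schema type names to the Python types json.loads produces.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
    "boolean": bool,
}


def check_extraction(result, schema):
    """Return a list of problems; an empty list means the result fits.

    Only top-level required keys and basic types are checked.
    """
    problems = []
    for key in schema.get("required", []):
        if key not in result:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in result:
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue
        value = result[key]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec.get("type") != "boolean":
            ok = False
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


# The schema is plain data, so it serializes to JSON as-is.
schema_json = json.dumps(INVOICE_SCHEMA, indent=2)
```

A check like this is cheap to run on every extraction result before it goes into a downstream pipeline; a real setup would likely swap it for a full JSON Schema validator.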
We're hiring fullstack and research engineers for @datalabto:
- 7-figure ARR, 5x growth in 2025, 40k GH stars, team of 4
- Used by tier 1 AI labs, governments, universities
- Train SoTA document models w/novel arch
- 200k-400k, 1-3% eq
- Culture: low ego, collaborative, GSD
If you are at #NAACL2025, don’t miss our Oral and Poster sessions showcasing three exciting papers from our lab! 🚀 We'll dive into data protection, memorization in LLMs, and the impact of fine-tuning on CoT reasoning.