torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress. 📘 hubs.la/Q03t1Z0b0 #PyTorch #DistributedTraining #FaultTolerance #OpenSourceAI
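[Context for the thread] A rough sketch of how torchft's fault-tolerant wrappers plug into an ordinary PyTorch training loop, based on the library's documented Manager / DistributedDataParallel / Optimizer wrappers. The model, data, and constructor details below are illustrative assumptions, not the actual TorchTitan + Llama 3 configuration from the post.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Toy model standing in for the real TorchTitan model (assumption).
model = nn.Linear(128, 10)

def state_dict():
    # State the manager can replicate to a recovering worker
    # instead of reading a checkpoint from disk.
    return {"model": model.state_dict()}

def load_state_dict(sd):
    # How a rejoining worker catches up asynchronously from a healthy peer.
    model.load_state_dict(sd["model"])

manager = Manager(
    pg=ProcessGroupGloo(),            # process group used for quorum/recovery traffic
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)

# Fault-tolerant wrappers: the optimizer only commits a step if the
# quorum for that step stayed healthy; otherwise the step is skipped
# and training continues without a restart or rollback.
model = DistributedDataParallel(manager, model)
opt = Optimizer(manager, optim.AdamW(model.parameters()))

for _ in range(1000):
    batch = torch.rand(32, 128)       # placeholder data (assumption)
    opt.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    opt.step()
```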
@PyTorch Data gladiators unite. Failures? Just the bread crumbs of brilliance.
@PyTorch 2000 synthetic failures, zero restarts, progress anyway—true anti-fragile AI. This is what future markets, blockchains, and even social networks will be built on.
@PyTorch I studied TorchFT and its design. It sounds good, but there's a condition: you need to have GPUs in abundance to apply it! 😉
@PyTorch this level of resilience in training? exactly the grit we need building startups. failures ain't setbacks if you keep moving forward.
@PyTorch don't blow up the AI data checkpointing market
@PyTorch Pretty neat. Most of us are still living with checkpoints; it will be interesting to see how the big labs adopt this.