wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
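A back-of-envelope check of the memory claim (my arithmetic, assuming a 16-bit float baseline, which the tweet does not state): a ternary weight needs about log2(3) ≈ 1.58 bits versus 16 bits for fp16.

```python
import math

bits_ternary = math.log2(3)   # ~1.58 bits per ternary weight
bits_fp16 = 16.0              # assumed fp16 baseline (not stated in the tweet)

reduction = 1 - bits_ternary / bits_fp16
print(f"weight-memory reduction vs fp16: {reduction:.0%}")  # -> ~90%
```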
what about bitnet? bitnet does inference in 1.58b, but training uses precision weights. basically they clamp weights to ternary {-1,0,1} in forward pass, and pretend they didn’t in backward pass.
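Roughly what "clamp in the forward pass, pretend you didn't in the backward pass" looks like in code: a minimal PyTorch sketch of ternary quantization with a straight-through estimator. The absmean scaling follows the BitNet b1.58 report as I remember it, not anything in this thread, so treat the details as assumptions.

```python
import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    # Absmean scaling (assumption: details may differ from the actual BitNet recipe).
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1) * scale   # values in {-scale, 0, +scale}
    # Straight-through estimator: the forward pass sees w_q, the backward pass treats
    # the op as identity, so gradients still update the full-precision latent weights.
    return w + (w_q - w).detach()
```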
@_brickner Tried running the pdf, says it’s not an executable
@_brickner does this mean *every* part of the architecture could be implemented without floats? like we could train and infer a model on a chip without FP arithmetic at all?
@_brickner Are you planning on releasing any code with this?
@_brickner Drop code. GitHub doesn't have reviewers.
Hello. I am Reviewer #2, destroyer of dreams. My assessment is that the presented work cannot yet be taken seriously. The best that can be said is that if the claims are true, then their current presentation does them great disservice. It is likely that the author is not yet well-trained enough to understand what a rigorous demonstration of new techniques entails. This is not a question of gatekeeping but rather one of coherence and verifiability.

The paper is too short to prove the striking claims being made about memory and energy. The given experimental results appear to be from a toy problem (an MLP applied to MNIST) with no implementation code available for inspection. It is not made clear how one is meant to compute a gradient sign without backpropagation. It is not clear how one can compute efficiently with a model that must be reconstructed from a random seed and perturbations at each step.

The estimated memory footprint is announced as being made “with great hubris”, and the argument for correctness appears to be a naked claim that the ideas are “a priori” correct. This is not adequate: results that sound too good to be true cannot be regarded as true “a priori”, without argument or implementation. Indeed this appears more in alignment with the style of an amateur attempted proof of the Riemann hypothesis than a legitimate scientific exposition. Other language used throughout is nonstandard or otherwise too fluid to be meaningful. I recommend the manuscript be rejected.
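For what it's worth, the thread never spells out the method, but one standard way to get a gradient sign without backpropagation is a zeroth-order, SPSA-style perturbation estimate, where each step records only whether a seeded random direction raised or lowered the loss. The sketch below is an assumption about what such a scheme could look like, not the paper's algorithm; the function name and parameters are made up for illustration.

```python
import numpy as np

def perturbation_step(params, loss_fn, seed, eps=1e-3, lr=1e-2):
    """One SPSA-style update: probe the loss along a seeded random direction and
    keep only the sign of the difference. Illustrative sketch, not the paper's method."""
    rng = np.random.default_rng(seed)
    direction = rng.choice([-1.0, 1.0], size=params.shape)  # reproducible from the seed alone
    # Two forward evaluations, no backprop:
    loss_plus = loss_fn(params + eps * direction)
    loss_minus = loss_fn(params - eps * direction)
    sign = np.sign(loss_plus - loss_minus)  # +1 if this direction increased the loss
    # The whole update is determined by (seed, sign), which is what would let a model
    # be "reconstructed from a random seed and perturbations" at each step.
    return params - lr * sign * direction
```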
@_brickner fact check please if not busy @teortaxesTex i would amp and the thread is funny
@_brickner it's impressive that you're storing ~1k bits per bit there!
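The quip is roughly right as arithmetic (my numbers, not from the thread): 175B ternary weights need about 1.58 bits each, while 20 MB holds orders of magnitude fewer bits.

```python
params = 175e9                       # 175B weights
bits_needed = params * 1.585         # ~log2(3) bits per ternary weight
bits_available = 20e6 * 8            # 20 MB
print(bits_needed / bits_available)  # ~1.7e3, i.e. on the order of a thousand bits per stored bit
```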
@_brickner Do you have an implementation of this anywhere? Would love to try this with LLM training.