wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
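A back-of-envelope check of the memory claim (my arithmetic, assuming a 16-bit float baseline, which the tweet does not state): a ternary weight needs about log2(3) ≈ 1.58 bits versus 16 bits for fp16.

```python
import math

bits_ternary = math.log2(3)   # ~1.58 bits per ternary weight
bits_fp16 = 16.0              # assumed fp16 baseline (not stated in the tweet)

reduction = 1 - bits_ternary / bits_fp16
print(f"weight-memory reduction vs fp16: {reduction:.0%}")  # -> ~90%
```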
what about bitnet? bitnet does inference in 1.58b, but training uses precision weights. basically they clamp weights to ternary {-1,0,1} in forward pass, and pretend they didn’t in backward pass.
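Roughly what "clamp in the forward pass, pretend you didn't in the backward pass" looks like in code: a minimal PyTorch sketch of ternary quantization with a straight-through estimator. The absmean scaling follows the BitNet b1.58 report as I remember it, not anything in this thread, so treat the details as assumptions.

```python
import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    # Absmean scaling (assumption: details may differ from the actual BitNet recipe).
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1) * scale   # values in {-scale, 0, +scale}
    # Straight-through estimator: the forward pass sees w_q, the backward pass treats
    # the op as identity, so gradients still update the full-precision latent weights.
    return w + (w_q - w).detach()
```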
@_brickner Tried running the pdf, says it’s not an executable
@_brickner does this mean *every* part of the architecture could be implemented without floats? like we could train and infer a model on a chip without FP arithmetic at all?
@_brickner Are you planning on releasing any code with this?
@_brickner Drop code. GitHub doesn't have reviewers.
Hello. I am Reviewer #2, destroyer of dreams. My assessment is that the presented work cannot yet be taken seriously. The best that can be said is that if the claims are true, then their current presentation does them great disservice. It is likely that the author is not yet well-trained enough to understand what a rigorous demonstration of new techniques entails. This is not a question of gatekeeping but rather one of coherence and verifiability.

The paper is too short to prove the striking claims being made about memory and energy. The given experimental results appear to be from a toy problem (an MLP applied to MNIST) with no implementation code available for inspection. It is not made clear how one is meant to compute a gradient sign without backpropagation. It is not clear how one can compute efficiently with a model that must be reconstructed from a random seed and perturbations at each step.

The estimated memory footprint is announced as being made “with great hubris”, and the argument for correctness appears to be a naked claim that the ideas are “a priori” correct. This is not adequate: results that sound too good to be true cannot be regarded as true “a priori”, without argument or implementation. Indeed this appears more in alignment with the style of an amateur attempted proof of the Riemann hypothesis than a legitimate scientific exposition. Other language used throughout is nonstandard or otherwise too fluid to be meaningful. I recommend the manuscript be rejected.
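For what it's worth, the thread never spells out the method, but one standard way to get a gradient sign without backpropagation is a zeroth-order, SPSA-style perturbation estimate, where each step records only whether a seeded random direction raised or lowered the loss. The sketch below is an assumption about what such a scheme could look like, not the paper's algorithm; the function name and parameters are made up for illustration.

```python
import numpy as np

def perturbation_step(params, loss_fn, seed, eps=1e-3, lr=1e-2):
    """One SPSA-style update: probe the loss along a seeded random direction and
    keep only the sign of the difference. Illustrative sketch, not the paper's method."""
    rng = np.random.default_rng(seed)
    direction = rng.choice([-1.0, 1.0], size=params.shape)  # reproducible from the seed alone
    # Two forward evaluations, no backprop:
    loss_plus = loss_fn(params + eps * direction)
    loss_minus = loss_fn(params - eps * direction)
    sign = np.sign(loss_plus - loss_minus)  # +1 if this direction increased the loss
    # The whole update is determined by (seed, sign), which is what would let a model
    # be "reconstructed from a random seed and perturbations" at each step.
    return params - lr * sign * direction
```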
@_brickner fact check please if not busy @teortaxesTex i would amp and the thread is funny
@_brickner it's impressive that you're storing ~1k bits per bit there!
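The quip is roughly right as arithmetic (my numbers, not from the thread): 175B ternary weights need about 1.58 bits each, while 20 MB holds orders of magnitude fewer bits.

```python
params = 175e9                       # 175B weights
bits_needed = params * 1.585         # ~log2(3) bits per ternary weight
bits_available = 20e6 * 8            # 20 MB
print(bits_needed / bits_available)  # ~1.7e3, i.e. on the order of a thousand bits per stored bit
```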
@_brickner Do you have an implementation of this anywhere? Would love to try this with LLM training.