For anyone planning to efficiently fine-tune an LLM, this article by @rasbt on LoRA could be helpful. He clearly explains the trade-offs to consider, such as whether to quantize the pretrained weights, the choice of optimizer (Adam vs. SGD), the impact of learning-rate schedulers, etc.

What is LoRA? For simplicity, imagine all the weights of a model as one large matrix W. The key insight of LoRA is that, unlike in pretraining, during fine-tuning we can approximate the weight-update matrix (which has the same shape as W) with the product of two much smaller matrices, thereby saving both compute and memory. A rough sketch of the idea is below.

magazine.sebastianraschka.com/p/practical-ti…

Sebastian's contributions: authorswithcode.org/researchers/?a…
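A minimal PyTorch sketch of that low-rank idea (illustrative only, not @rasbt's code; layer names, ranks, and init choices here are my own assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pretrained weight stays frozen; only A and B receive gradients.
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        # Update ΔW ≈ B @ A with rank r << min(d_in, d_out).
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zeros, so ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (W + scale * B @ A).T, but we only ever train
        # r * (d_in + d_out) extra parameters instead of d_in * d_out.
        return x @ self.weight.T + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 trainable params vs ~1M for a full 1024x1024 update
```

With r=8 on a 1024x1024 layer, the update goes from ~1M trainable parameters to ~16K, which is where the compute and memory savings come from.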