SOTA techniques in Deep Learning
Why Your Transformer Might Not Need Normalization
=> Substack Article | Paper
Researchers from FAIR (Meta AI), including Kaiming He and Yann LeCun, introduced Dynamic Tanh (DyT) in the paper "Transformers without Normalization" (CVPR 2025). DyT is an element-wise operation that can replace normalization layers such as LayerNorm and RMSNorm. It is motivated by the observation that, in trained Transformers, normalization layers often produce tanh-like, S-shaped input-to-output mappings.
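A minimal NumPy sketch of the DyT operation, following the form described in the paper: gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar and gamma, beta are per-channel affine parameters (like LayerNorm's). The values below are illustrative, not trained:

```python
import numpy as np

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh (DyT): an element-wise replacement for LayerNorm.

    Unlike LayerNorm, no mean or variance statistics are computed:
    alpha squashes the input through tanh, then gamma/beta apply the
    usual per-channel affine transform.
    """
    return gamma * np.tanh(alpha * x) + beta

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
out = dyt(x, alpha=0.5, gamma=np.ones_like(x), beta=np.zeros_like(x))
```

Note that tanh bounds each element to (-1, 1) before the affine transform, which mimics the squashing effect observed in trained normalization layers.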
Parameter-Efficient Finetuning (PEFT): for compute-constrained environments
For compute-constrained finetuning, one of the most widely used PEFT techniques is LoRA. It can be applied to both vision and language models.
Following are some popular LoRA variants:
- LoRA
- Add two low-rank trainable matrices, A and B, alongside each frozen weight matrix W.
- Instead of fine-tuning W, the weight update is learned in these low-rank matrices.
- Even for the largest LLMs, the LoRA matrices take up only a few MBs of memory.
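The LoRA forward pass can be sketched in a few lines of NumPy; the dimensions and initialization below are illustrative, not prescriptive (B is typically zero-initialized so training starts from the pretrained model's behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4              # r << d: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init -> delta starts at 0

def lora_forward(x, scale=1.0):
    # Frozen path plus low-rank update: (W + scale * B @ A) @ x,
    # computed without ever materializing the full d_out x d_in delta.
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=(d_in,))
```

The trainable parameter count is r * (d_in + d_out) instead of d_in * d_out, which is where the memory savings come from.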
- LoRA-FA
- While LoRA significantly decreases the total trainable parameters, it requires substantial activation memory to update the low-rank weights.
- LoRA-FA (FA stands for Frozen-A) freezes matrix A and only updates matrix B.
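A rough NumPy sketch of why freezing A saves activation memory: with A fixed, the gradient of B only needs the small r-dimensional projection A @ x, not the full d-dimensional input. The manual gradient step below is a stand-in for real backprop:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4
A = rng.normal(size=(r, d))       # frozen after (random) initialization
A_frozen = A.copy()               # kept only to demonstrate A never changes
B = np.zeros((d, r))              # the only trainable LoRA matrix

x = rng.normal(size=(d,))
a = A @ x                         # r-dim activation: all we must cache for backprop

grad_out = rng.normal(size=(d,))  # gradient flowing into the LoRA branch
grad_B = np.outer(grad_out, a)    # dL/dB needs only the r-dim `a`, not the full x
B -= 0.1 * grad_B                 # update B; A is never touched
```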
- VeRA
- In LoRA, low-rank matrices A and B are unique for each layer.
- In VeRA, A and B are frozen, random, and shared across all layers.
- It learns layer-specific scaling vectors, b and d, instead.
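A NumPy sketch of the VeRA idea: one frozen random (A, B) pair shared by all layers, with only the small per-layer vectors b and d trained. Shapes and initialization here are illustrative (b starts at zero so each layer's delta starts at zero):

```python
import numpy as np

rng = np.random.default_rng(2)
dim, r, n_layers = 32, 4, 3

# One pair of frozen random matrices shared by every layer.
A = rng.normal(size=(r, dim))
B = rng.normal(size=(dim, r))

# Per-layer trainable scaling vectors: d (length r) and b (length dim).
scales = [{"d": np.ones(r), "b": np.zeros(dim)} for _ in range(n_layers)]

def vera_delta(layer, x):
    s = scales[layer]
    # Delta(W) x = diag(b) @ B @ diag(d) @ A @ x, using element-wise scaling
    # instead of materializing the diagonal matrices.
    return s["b"] * (B @ (s["d"] * (A @ x)))

x = rng.normal(size=(dim,))
```

Since only b and d are trained, the per-layer trainable parameter count drops from r * 2 * dim (LoRA) to dim + r.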
- Delta-LoRA
- It tunes the matrix W as well, but not in the traditional way.
- Here, the difference (or delta) between the product of matrices A and B in two consecutive training steps is added to W.
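The Delta-LoRA update rule can be sketched as follows; the random perturbations stand in for real gradient steps on A and B, and the point is that W absorbs the change in B @ A without needing its own optimizer state:

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, lr = 16, 2, 0.1
W = rng.normal(size=(d, d))
W0 = W.copy()
A = rng.normal(size=(r, d)) * 0.01
B = rng.normal(size=(d, r)) * 0.01

prev_product = B @ A
init_product = prev_product.copy()
for step in range(3):
    # Stand-in updates to A and B (real training uses backprop gradients).
    A -= lr * rng.normal(size=A.shape) * 0.01
    B -= lr * rng.normal(size=B.shape) * 0.01
    new_product = B @ A
    # Delta-LoRA: fold the change in B @ A between consecutive steps into W.
    W += new_product - prev_product
    prev_product = new_product
```

Because the per-step deltas telescope, after training W equals its initial value plus the total change in B @ A.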
- LoRA+
- In LoRA, both matrices A and B are updated with the same learning rate.
- Authors of LoRA+ found that setting a higher learning rate for matrix B results in better convergence.
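A minimal sketch of the LoRA+ recipe: the same gradient step as LoRA, but with B's learning rate set to a multiple of A's. The ratio of 16 below is a hypothetical choice; LoRA+ treats it as a hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 32, 4
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))

lr_A = 1e-4
ratio = 16              # hypothetical value; tuned in practice
lr_B = lr_A * ratio     # B gets a larger learning rate than A

# Stand-in gradients (real training uses backprop).
grad_A = rng.normal(size=A.shape)
grad_B = rng.normal(size=B.shape)
A -= lr_A * grad_A
B -= lr_B * grad_B
```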