Casa de Brain

SOTA techniques in Deep Learning

Why Your Transformer Might Not Need Normalization

=> Substack Article · Paper
Researchers including Kaiming He and Yann LeCun from FAIR (Meta AI) introduced Dynamic Tanh (DyT) in the paper "Transformers without Normalization" (CVPR 2025). DyT is an element-wise operation that can replace normalization layers such as LayerNorm and RMSNorm. It is motivated by the observation that, in trained Transformers, normalization layers often produce tanh-like, S-shaped input-output mappings.
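The idea can be sketched in a few lines: DyT replaces the normalization step with a learnable squashing function, DyT(x) = γ · tanh(αx) + β, so no mean or variance statistics are needed. Below is a minimal NumPy sketch; the initialization values follow the paper's defaults (α₀ = 0.5), but the class itself is illustrative, not the authors' implementation.

```python
import numpy as np

class DyT:
    """Minimal sketch of Dynamic Tanh: DyT(x) = gamma * tanh(alpha * x) + beta."""

    def __init__(self, dim, alpha_init=0.5):
        self.alpha = alpha_init        # learnable scalar (paper default 0.5)
        self.gamma = np.ones(dim)      # learnable per-channel scale
        self.beta = np.zeros(dim)      # learnable per-channel shift

    def __call__(self, x):
        # x: (..., dim); unlike LayerNorm, no statistics are computed over x
        return self.gamma * np.tanh(self.alpha * x) + self.beta
```

Because tanh saturates, extreme activations are squashed toward ±1, mimicking the S-shaped mapping observed after normalization layers.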

Parameter-Efficient Finetuning (PEFT): for compute-constrained environments

For compute-constrained finetuning, one of the most widely used PEFT techniques is LoRA. It can be applied to both vision and language models.

Following are some LoRA variants:

  1. LoRA

Even for the largest of LLMs, the LoRA matrices take up only a few MB of memory.

  2. LoRA-FA
    While LoRA significantly reduces the number of trainable parameters, it still requires substantial activation memory to update the low-rank weights.

LoRA-FA (FA stands for Frozen-A) freezes the down-projection matrix A and only updates the up-projection matrix B, which removes the need to store input activations for A's gradient.

  3. VeRA
    Freezes a single pair of random low-rank matrices shared across all layers and trains only small per-layer scaling vectors.
  4. Delta-LoRA
    In addition to the low-rank matrices, updates the pretrained weights using the difference of the product AB between consecutive training steps.
  5. LoRA+
    Applies a higher learning rate to matrix B than to matrix A, improving convergence.
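The common core of these variants can be sketched as a LoRA-adapted linear layer: the pretrained weight W is frozen and only the low-rank factors A and B (rank r ≪ min(d_in, d_out)) are trained, so ΔW = BA adds few parameters. A minimal NumPy sketch, with illustrative names and a `freeze_a` flag standing in for the LoRA-FA variant (this is an assumption-laden sketch, not any library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Sketch of a linear layer with a LoRA adapter: y = x (W + s * B A)^T."""

    def __init__(self, d_in, d_out, r=8, alpha=16, freeze_a=False):
        self.W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01   # down-projection (trainable)
        self.B = np.zeros((d_out, r))                    # up-projection, zero-init
        self.scale = alpha / r                           # standard LoRA scaling
        # LoRA-FA: freeze A as well, so only B receives gradient updates
        self.freeze_a = freeze_a

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        return x @ (self.W + self.scale * (self.B @ self.A)).T
```

Because B is zero-initialized, the adapter contributes nothing at the start of finetuning, so the adapted layer initially reproduces the pretrained layer exactly.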