AI Tricks and Techniques
How do you teach a model a new set of data without the old data, and prevent "Catastrophic Forgetting"?
Scenario: We deleted our original training dataset for GDPR compliance. We need to teach the live model a new class of data today.
-> Elastic Weight Consolidation (Paper 1, Paper 2) or LoRA adapters.
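The EWC penalty itself is only a few lines. Below is a minimal sketch in plain Python (not a framework implementation); it assumes the diagonal Fisher information and a copy of the old-task weights were saved before the original dataset was deleted, since they cannot be recomputed afterwards:

```python
# Hedged sketch of the Elastic Weight Consolidation (EWC) penalty.
# theta: current parameters; theta_star: parameters after the old task;
# fisher: diagonal Fisher information estimated on the old task's data
# (assumed precomputed and saved before the GDPR deletion).

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    """Quadratic penalty anchoring important weights near their old values."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )

def total_loss(new_task_loss, theta, theta_star, fisher, lam=1000.0):
    # New-task loss plus the EWC anchor: weights with high Fisher
    # information (important for the old task) are barely allowed to move.
    return new_task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

The hyperparameter `lam` trades plasticity (learning the new class) against stability (remembering the old ones); the value 1000.0 here is purely illustrative.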
Another newer option is Nested Learning, a machine learning paradigm that treats a model as a system of interconnected optimization problems running at different speeds. In a snapshot:
- From snapshots to continuous memory: Traditional deep-learning models store knowledge in fixed parameters (pre-training) plus a short-term context window. Once trained, they struggle to truly "learn new things while remembering old ones." Nested Learning instead treats different parts of the model as independent optimization modules, each with its own "context flow" (what it sees and learns) and its own update frequency. Some modules change quickly (for immediate context), others only slowly, much like short-term vs. long-term memory in the brain.
- Towards real continual learning: Because memory modules update at different rates, the system can integrate new information without overwriting older, more foundational knowledge. That helps address a long-standing barrier for AI, "catastrophic forgetting," and brings us closer to systems that learn over time and accumulate knowledge rather than being static snapshots.
New model architectures such as HRM and TRM also fall under the Nested Learning category and are among the first breed of impactful continual-learning models.
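As a toy illustration only (not the actual Nested Learning algorithm), here is a two-timescale update loop in the spirit of fast vs. slow memory modules; the function name, rates, and update rules are all invented for the sketch:

```python
# Toy illustration of multi-timescale updates: a "fast" module follows
# every new signal, while a "slow" module only consolidates a damped
# version of the fast state every k-th step, so older knowledge is not
# overwritten by each new observation.

def run(signals, k=4, fast_lr=0.9, slow_lr=0.1):
    fast, slow = 0.0, 0.0
    for step, x in enumerate(signals, start=1):
        fast += fast_lr * (x - fast)          # updates every step
        if step % k == 0:                     # updates only every k-th step
            slow += slow_lr * (fast - slow)   # slowly consolidates
    return fast, slow
```

After four identical signals, the fast state has almost fully adapted while the slow state has barely moved: the short-term module tracks context, the long-term module accumulates it.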
Source of Non-determinism in LLMs
Adapt a 70B model to the highly technical medical domain, under GPU constraints.
(We are GPU-constrained, so we can't do full fine-tuning. How do we proceed?)
-> To match Full Fine-Tuning performance on a LoRA budget, you need to implement The Hyper-Adaptation Trifecta:
1. Rank-Stabilized LoRA (RS-LoRA): Standard LoRA scales adapters by 1/r, which collapses learning as you increase rank. Switch to scaling by 1/sqrt(r) to stabilize gradients at higher ranks (e.g., r=256).
2. LoftQ Initialization: Random initialization of adapters is inefficient. Use LoftQ to quantize the backbone and initialize the adapters to immediately minimize the approximation error.
3. Differential Learning Rates: Not all layers learn at the same speed. Apply a lower LR to embedding layers (to retain vocabulary stability) and a higher LR to the projection layers.
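Points 1 and 3 can be sketched in a few lines of plain Python; the scaling rule is the RS-LoRA one, while the layer names and LR multipliers are invented examples, not prescriptions:

```python
import math

def lora_scaling(alpha, r, rank_stabilized=True):
    """RS-LoRA scales adapter output by alpha/sqrt(r) instead of alpha/r,
    so the effective update does not shrink as the rank grows (e.g., r=256)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

def lr_groups(base_lr=2e-4):
    # Differential learning rates as optimizer-style param groups:
    # embeddings move slowly (vocabulary stability), projections move fast.
    # Layer names and multipliers are illustrative only.
    return [
        {"name": "embed_tokens", "lr": base_lr * 0.1},
        {"name": "q_proj/k_proj/v_proj/o_proj", "lr": base_lr * 2.0},
    ]
```

For real training, recent versions of the Hugging Face PEFT library expose switches along these lines (e.g., `use_rslora=True` and a LoftQ-based `init_lora_weights` option on `LoraConfig`); check the version you have installed.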
Debugging a new Transformer implementation:
Scenario 1 >>
Say the code runs without errors. The training loop executes. But the loss curve is completely flat. What is your first move?
-> Single-Batch Overfit. Before you touch a single hyperparameter, you must prove the architecture is capable of memorization.
-- Isolate: Take exactly ONE batch of data (e.g., 32 samples).
-- Strip: Turn off all regularization (Dropout = 0.0, Weight Decay = 0.0, Data Augmentation = Off).
-- Force: Train on that single batch for many epochs.
If the model implementation is correct, the loss should drive to essentially zero and training accuracy should hit 100%. The model should perfectly memorize the inputs.
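The whole check can be demonstrated end-to-end on a toy model. Here a tiny linear model (y = w*x + b) memorizes one fixed batch and drives the MSE loss to numerically zero; in practice your Transformer replaces the toy model, but the structure of the check is identical:

```python
# Single-batch overfit check on a toy model. ONE fixed batch, no
# regularization, train until the batch is memorized.
batch = [(x, 3.0 * x - 1.0) for x in [0.5, 1.0, 2.0, -1.0]]

w, b = 0.0, 0.0
for epoch in range(2000):                      # "Force": only this batch
    gw = gb = 0.0
    for x, y in batch:                         # full-batch gradient of MSE
        err = (w * x + b) - y
        gw += 2 * err * x / len(batch)
        gb += 2 * err / len(batch)
    w -= 0.1 * gw
    b -= 0.1 * gb

loss = sum(((w * x + b) - y) ** 2 for x, y in batch) / len(batch)
# A correct implementation drives this loss to ~0 (perfect memorization);
# a flat loss here would mean the model or training loop is broken.
```

If a full-capacity model cannot drive the loss on 32 samples toward zero, the bug is in the architecture or the loss wiring, not in the hyperparameters.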
Scenario 2 >>
Once the model can overfit (memorize) a single batch, to make it actually learn multi-class detection, try the following options:
-- Lower the learning rate, say from 1e-3 to 1e-4, and tune the optimizer parameters.
-- Add proper data augmentation, and improve the dataset with class balancing.
-- Fine-tune normalization (BatchNorm vs. LayerNorm, and maybe input data normalization too), and update the loss function with class weights aimed at the imbalanced dataset.
-- Re-balance loss weights: If the bbox loss is small compared to the classification loss, increase the bbox loss weight. Try multiplying bbox losses by 2-5 and observe the effect.
-- Monitor training vs. validation: If bbox loss decreases on train but not on val, you are overfitting. Reduce augmentation or add regularization.
-- Lower the IoU threshold for positives (e.g., from 0.6 to 0.4) or use an adaptive assigner (ATSS / SimOTA) if available.
-- You can play with different loss functions, such as:
- BBox: GIoU, DIoU, Smooth L1 loss
- Classification: Focal loss (gamma=2, alpha=0.25), plus label smoothing of 0.05
- Specialized hybrid losses: Gradient Harmonizing Mechanism (GHM) Loss, Joint Optimization/Multi-task Loss.
- ...
-- Gradient clipping and batch size are also tuning options.
After all that, if the target performance is still not reached, it is time to update the model architecture.
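For reference, the focal loss mentioned above is simple to write down. A minimal single-sample sketch, where `p_t` is the model's probability for the true class and the defaults are the cited gamma=2, alpha=0.25:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
    Easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma,
    so training focuses on hard and rare classes."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma=2, an easy sample at p_t=0.9 contributes orders of magnitude less loss than a hard sample at p_t=0.1, which is exactly the rebalancing an imbalanced detector needs.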
4 strategies for multi-GPU training
- Model parallelism
- Different parts (or layers) of the model are placed on different GPUs.
- Useful for huge models that do not fit on a single GPU.
- However, model parallelism also introduces severe communication bottlenecks, since activations computed on one GPU must be transferred to the next GPU at every layer boundary.
- Tensor parallelism
- Distributes and processes individual tensor operations across multiple devices or processors.
- It is based on the idea that a large tensor operation, such as matrix multiplication, can be divided into smaller tensor operations, and each smaller operation can be executed on a separate device or processor.
- Such parallelization strategies are inherently built into standard implementations of PyTorch and other deep learning frameworks, but they become much more pronounced in a distributed setting.
- Data parallelism
- Replicate the model across all GPUs.
- Divide the available data into smaller batches, and each batch is processed by a separate GPU.
- The updates (or gradients) from each GPU are then aggregated and used to update the model parameters on every GPU.
- Pipeline parallelism
- This is often considered a combination of data parallelism and model parallelism.
- The issue with standard model parallelism is that the 1st GPU stays idle while data propagates through the layers on the 2nd GPU.
- Pipeline parallelism addresses this by loading the next micro-batch onto the 1st GPU as soon as it has finished computing on the current micro-batch and transferred the activations to the layers on the 2nd GPU.
- The process looks like this:
↳ The 1st micro-batch passes through the layers on the 1st GPU.
↳ The 2nd GPU receives the activations for the 1st micro-batch from the 1st GPU.
↳ While the 2nd GPU passes that data through its layers, the next micro-batch is loaded on the 1st GPU.
↳ And the process continues.
- GPU utilization drastically improves this way.
Validating Production readiness of a Segmentation model
Validate on metrics such as:
- Mask-based metrics: Intersection over Union (IoU) or Jaccard Index, mean IoU (mIoU), Dice Score / F1 Score, Average Precision, mAP and Average Recall at IoU thresholds (COCO style), Panoptic Quality (PQ) for panoptic segmentation, Specificity.
- Boundary-based metrics: Hausdorff Distance (HD) or HD95, Normalized Surface Dice (NSD), Meticulosity Quality (MQ) etc.
The most highly used metrics for segmentation models in top academic papers, particularly in semantic and medical image analysis, are the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU). These overlap-based metrics are often complemented by boundary-based (such as HD95) and traditional classification metrics to provide a comprehensive evaluation.
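The two headline metrics are easy to compute from binary masks. A minimal sketch, with flat 0/1 lists standing in for segmentation masks:

```python
# Mask-based metrics on binary masks (flat lists of 0/1 pixels).

def iou(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0   # empty-vs-empty convention

def dice(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total else 1.0
```

Note that Dice = 2*IoU / (1 + IoU), so the two metrics always rank predictions in the same order on a single image; they differ mainly in how harshly they penalize partial overlap.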
After that, to check for production readiness, follow The Black Patch Protocol.
Torture the model with "Adversarial Stress Testing" using two specific techniques:
- Invariance Testing: Rotate the input image by 15° or crop the edges. The anatomy hasn't changed, so the prediction shouldn't either. If confidence swings from 0.99 to 0.40 just because you rotated the camera, the model is overfitting on pixel-level noise.
- Directional Expectation (the "Black Patch"): Manually black out the actual gallbladder in the image.
  - Expected Result: The model should scream "No Gallbladder Found."
  - Fatal Result: If the model still predicts "Gallbladder" with 90% confidence, it is looking at the background, not the organ.
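A minimal harness for the invariance half of the protocol; `model` (a function returning a confidence) and `transform` (e.g., a 15° rotation) are stand-ins for your own inference and augmentation code, and the 0.10 tolerance is an invented example threshold:

```python
# Hedged sketch of an invariance stress test.

def invariance_gap(model, image, transform):
    """Confidence swing between the original and the transformed input.
    The anatomy is unchanged, so a robust model keeps this gap small."""
    return abs(model(image) - model(transform(image)))

def passes_invariance(model, image, transform, max_gap=0.10):
    return invariance_gap(model, image, transform) <= max_gap
```

Running this over a held-out set with several transforms (rotations, crops, mild noise) gives a cheap, model-agnostic robustness report before any deployment decision.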
Is maintaining model precision (Float32) advisable in order to reduce adversarial vulnerability, given that quantization would improve latency but might reduce robustness? In other words: how do we reduce adversarial noise while keeping the latency restriction?
High Precision (FP32) means your model is hyper-sensitive. It captures the signal perfectly, but it also captures the noise perfectly. In a high-stakes edge environment, that sensitivity is a liability, not an asset. By combining Lipschitz constraints with Int8 quantization, we turn discretization error into a defensive feature, stripping out small-scale adversarial noise without adding a single microsecond of latency (The Bit-Depth Barrier).
Here is the physics of why "dumbing down" the model actually saves it:
- The Bucketing Effect: Quantization forces continuous values into discrete bins. If the adversarial noise (perturbation) is small enough, it falls into the same "bucket" as the clean signal.
- The Rounding Shield: When the value is rounded to the nearest centroid, the noise is effectively stripped out. The lower precision acts as a low-pass filter for free.
- The Lipschitz Constraint: The only risk is error amplification deeper in the network. You solve this by applying Lipschitz regularization during quantization-aware training, which caps how much the output can change relative to the input.
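The bucketing effect is easy to verify numerically. A toy symmetric int8 quantizer (the scale value is illustrative): any perturbation smaller than half the bucket width cannot move an activation into a different bucket.

```python
def quantize_int8(x, scale):
    """Map a float activation to an int8 bucket (symmetric quantization)."""
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the int8 range

# A perturbation smaller than half the bucket width (scale / 2) cannot
# change the bucket, so the adversarial noise is stripped out for free.
scale = 0.1
clean = quantize_int8(0.52, scale)             # bucket 5
attacked = quantize_int8(0.52 + 0.02, scale)   # still bucket 5
```

The Lipschitz regularization from the last bullet is what keeps the residual rounding error (up to scale/2 per activation) from being amplified layer by layer into a large output change.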