AI Tricks and Techniques
How do you teach a model a new set of data without the old data, and prevent "Catastrophic Forgetting"?
Scenario: We deleted our original training dataset for GDPR compliance. We need to teach the live model a new class of data today.
-> Elastic Weight Consolidation (Paper 1, Paper 2) or LoRA adapters.
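The EWC penalty itself is only a few lines. Below is a minimal sketch in plain Python (not a framework implementation); it assumes the diagonal Fisher information and a copy of the old-task weights were saved before the original dataset was deleted, since they cannot be recomputed afterwards:

```python
# Hedged sketch of the Elastic Weight Consolidation (EWC) penalty.
# theta: current parameters; theta_star: parameters after the old task;
# fisher: diagonal Fisher information estimated on the old task's data
# (assumed precomputed and saved before the GDPR deletion).

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    """Quadratic penalty anchoring important weights near their old values."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )

def total_loss(new_task_loss, theta, theta_star, fisher, lam=1000.0):
    # New-task loss plus the EWC anchor: weights with high Fisher
    # information (important for the old task) are barely allowed to move.
    return new_task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

The hyperparameter `lam` trades plasticity (learning the new class) against stability (remembering the old ones); the value 1000.0 here is purely illustrative.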
Another newer option is Nested Learning, a machine learning paradigm that treats a model as a system of interconnected optimization problems running at different speeds. In a snapshot:
- From snapshots to continuous memory: Traditional deep-learning models store knowledge in fixed parameters (pre-training) plus a short-term context window. Once trained, they struggle to truly "learn new things while remembering old ones." Nested Learning instead treats different parts of the model as independent optimization modules, each with its own "context flow" (what it sees and learns) and its own update frequency. Some modules change quickly (for immediate context), others only slowly, much like short-term vs. long-term memory in the brain.
- Towards real continual learning: Because memory modules update at different rates, the system can integrate new information without overwriting older, more foundational knowledge. That helps address a long-standing barrier for AI, "catastrophic forgetting," and brings us closer to systems that learn over time and accumulate knowledge rather than being static snapshots.
New model architectures such as HRM and TRM also fall under the Nested Learning category and are among the first breed of impactful continual-learning models.
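As a toy illustration only (not the actual Nested Learning algorithm), here is a two-timescale update loop in the spirit of fast vs. slow memory modules; the function name, rates, and update rules are all invented for the sketch:

```python
# Toy illustration of multi-timescale updates: a "fast" module follows
# every new signal, while a "slow" module only consolidates a damped
# version of the fast state every k-th step, so older knowledge is not
# overwritten by each new observation.

def run(signals, k=4, fast_lr=0.9, slow_lr=0.1):
    fast, slow = 0.0, 0.0
    for step, x in enumerate(signals, start=1):
        fast += fast_lr * (x - fast)          # updates every step
        if step % k == 0:                     # updates only every k-th step
            slow += slow_lr * (fast - slow)   # slowly consolidates
    return fast, slow
```

After four identical signals, the fast state has almost fully adapted while the slow state has barely moved: the short-term module tracks context, the long-term module accumulates it.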
Source of Non-determinism in LLMs
Adapt a 70B model to the highly technical medical domain, under GPU constraints.
(We are GPU-constrained, so we can't do full fine-tuning. How do we proceed?)
-> To match Full Fine-Tuning performance on a LoRA budget, you need to implement The Hyper-Adaptation Trifecta:
1. Rank-Stabilized LoRA (RS-LoRA): Standard LoRA scales adapters by 1/r, which collapses learning as you increase rank. Switch to scaling by 1/sqrt(r) to stabilize gradients at higher ranks (e.g., r=256).
2. LoftQ Initialization: Random initialization of adapters is inefficient. Use LoftQ to quantize the backbone and initialize the adapters to immediately minimize the approximation error.
3. Differential Learning Rates: Not all layers learn at the same speed. Apply a lower LR to embedding layers (to retain vocabulary stability) and a higher LR to the projection layers.
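Points 1 and 3 can be sketched in a few lines of plain Python; the scaling rule is the RS-LoRA one, while the layer names and LR multipliers are invented examples, not prescriptions:

```python
import math

def lora_scaling(alpha, r, rank_stabilized=True):
    """RS-LoRA scales adapter output by alpha/sqrt(r) instead of alpha/r,
    so the effective update does not shrink as the rank grows (e.g., r=256)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

def lr_groups(base_lr=2e-4):
    # Differential learning rates as optimizer-style param groups:
    # embeddings move slowly (vocabulary stability), projections move fast.
    # Layer names and multipliers are illustrative only.
    return [
        {"name": "embed_tokens", "lr": base_lr * 0.1},
        {"name": "q_proj/k_proj/v_proj/o_proj", "lr": base_lr * 2.0},
    ]
```

For real training, recent versions of the Hugging Face PEFT library expose switches along these lines (e.g., `use_rslora=True` and a LoftQ-based `init_lora_weights` option on `LoraConfig`); check the version you have installed.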
Debugging a new Transformer implementation:
Scenario 1 >>
Say the code runs without errors. The training loop executes. But the loss curve is completely flat. What is your first move?
-> Single-Batch Overfit. Before you touch a single hyperparameter, you must prove the architecture is capable of memorization.
-- Isolate: Take exactly ONE batch of data (e.g., 32 samples).
-- Strip: Turn off all regularization (Dropout = 0.0, Weight Decay = 0.0, Data Augmentation = Off).
-- Force: Train on that single batch for many epochs.
If the model implementation is correct, the loss should drive to essentially zero and training accuracy should hit 100%. The model should perfectly memorize the inputs.
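The whole check can be demonstrated end-to-end on a toy model. Here a tiny linear model (y = w*x + b) memorizes one fixed batch and drives the MSE loss to numerically zero; in practice your Transformer replaces the toy model, but the structure of the check is identical:

```python
# Single-batch overfit check on a toy model. ONE fixed batch, no
# regularization, train until the batch is memorized.
batch = [(x, 3.0 * x - 1.0) for x in [0.5, 1.0, 2.0, -1.0]]

w, b = 0.0, 0.0
for epoch in range(2000):                      # "Force": only this batch
    gw = gb = 0.0
    for x, y in batch:                         # full-batch gradient of MSE
        err = (w * x + b) - y
        gw += 2 * err * x / len(batch)
        gb += 2 * err / len(batch)
    w -= 0.1 * gw
    b -= 0.1 * gb

loss = sum(((w * x + b) - y) ** 2 for x, y in batch) / len(batch)
# A correct implementation drives this loss to ~0 (perfect memorization);
# a flat loss here would mean the model or training loop is broken.
```

If a full-capacity model cannot drive the loss on 32 samples toward zero, the bug is in the architecture or the loss wiring, not in the hyperparameters.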
Scenario 2 >>
Once the model can overfit (memorize) a single batch, to make it actually learn multi-class detection, try the following options:
-- Lower the learning rate, say from 1e-3 to 1e-4, and tune the optimizer parameters.
-- Add proper data augmentation, and improve the dataset with class balancing.
-- Fine-tune normalization (BatchNorm vs. LayerNorm, and maybe input data normalization too), and update the loss function with class weights aimed at the imbalanced dataset.
-- Re-balance loss weights: If the bbox loss is small compared to the classification loss, increase the bbox loss weight. Try multiplying bbox losses by 2-5 and observe the effect.
-- Monitor training vs. validation: If bbox loss decreases on train but not on val, you are overfitting. Reduce augmentation or add regularization.
-- Lower the IoU threshold for positives (e.g., from 0.6 to 0.4) or use an adaptive assigner (ATSS / SimOTA) if available.
-- You can play with different loss functions, such as:
- BBox: GIoU, DIoU, Smooth L1 loss
- Classification: Focal loss (gamma=2, alpha=0.25), plus label smoothing of 0.05
- Specialized hybrid losses: Gradient Harmonizing Mechanism (GHM) Loss, Joint Optimization/Multi-task Loss.
- ...
-- Gradient clipping and batch size are also tuning options.
After all that, if the target performance is still not reached, it is time to update the model architecture.
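For reference, the focal loss mentioned above is simple to write down. A minimal single-sample sketch, where `p_t` is the model's probability for the true class and the defaults are the cited gamma=2, alpha=0.25:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
    Easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma,
    so training focuses on hard and rare classes."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma=2, an easy sample at p_t=0.9 contributes orders of magnitude less loss than a hard sample at p_t=0.1, which is exactly the rebalancing an imbalanced detector needs.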
4 strategies for multi-GPU training
- Model parallelism
- Different parts (or layers) of the model are placed on different GPUs.
- Useful for huge models that do not fit on a single GPU.
- However, model parallelism also introduces severe communication bottlenecks, since activations computed on one GPU must be transferred to the next GPU at every layer boundary.
- Tensor parallelism
- Distributes and processes individual tensor operations across multiple devices or processors.
- It is based on the idea that a large tensor operation, such as matrix multiplication, can be divided into smaller tensor operations, and each smaller operation can be executed on a separate device or processor.
- Such parallelization strategies are inherently built into standard implementations of PyTorch and other deep learning frameworks, but they become much more pronounced in a distributed setting.
- Data parallelism
- Replicate the model across all GPUs.
- Divide the available data into smaller batches, and each batch is processed by a separate GPU.
- The updates (or gradients) from each GPU are then aggregated and used to update the model parameters on every GPU.
- Pipeline parallelism
- This is often considered a combination of data parallelism and model parallelism.
- The issue with standard model parallelism is that the 1st GPU stays idle while data propagates through the layers on the 2nd GPU.
- Pipeline parallelism addresses this by loading the next micro-batch onto the 1st GPU as soon as it has finished computing on the current micro-batch and transferred the activations to the layers on the 2nd GPU.
- The process looks like this:
↳ The 1st micro-batch passes through the layers on the 1st GPU.
↳ The 2nd GPU receives the activations for the 1st micro-batch from the 1st GPU.
↳ While the 2nd GPU passes that data through its layers, the next micro-batch is loaded on the 1st GPU.
↳ And the process continues.
- GPU utilization drastically improves this way.
Validating Production readiness of a Segmentation model
Validate on metrics such as:
- Mask-based metrics: Intersection over Union (IoU) or Jaccard Index, mean IoU (mIoU), Dice Score / F1 Score, Average Precision, mAP and Average Recall at IoU thresholds (COCO style), Panoptic Quality (PQ) for panoptic segmentation, Specificity.
- Boundary-based metrics: Hausdorff Distance (HD) or HD95, Normalized Surface Dice (NSD), Meticulosity Quality (MQ) etc.
The most highly used metrics for segmentation models in top academic papers, particularly in semantic and medical image analysis, are the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU). These overlap-based metrics are often complemented by boundary-based (such as HD95) and traditional classification metrics to provide a comprehensive evaluation.
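The two headline metrics are easy to compute from binary masks. A minimal sketch, with flat 0/1 lists standing in for segmentation masks:

```python
# Mask-based metrics on binary masks (flat lists of 0/1 pixels).

def iou(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0   # empty-vs-empty convention

def dice(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total else 1.0
```

Note that Dice = 2*IoU / (1 + IoU), so the two metrics always rank predictions in the same order on a single image; they differ mainly in how harshly they penalize partial overlap.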
After that, to check for production readiness, follow The Black Patch Protocol.
Torture the model with "Adversarial Stress Testing" using two specific techniques:
- Invariance Testing: Rotate the input image by 15° or crop the edges. The anatomy hasn't changed, so the prediction shouldn't either. If confidence swings from 0.99 to 0.40 just because you rotated the camera, the model is overfitting on pixel-level noise.
- Directional Expectation (the "Black Patch"): Manually black out the actual gallbladder in the image.
  - Expected Result: The model should scream "No Gallbladder Found."
  - Fatal Result: If the model still predicts "Gallbladder" with 90% confidence, it is looking at the background, not the organ.
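A minimal harness for the invariance half of the protocol; `model` (a function returning a confidence) and `transform` (e.g., a 15° rotation) are stand-ins for your own inference and augmentation code, and the 0.10 tolerance is an invented example threshold:

```python
# Hedged sketch of an invariance stress test.

def invariance_gap(model, image, transform):
    """Confidence swing between the original and the transformed input.
    The anatomy is unchanged, so a robust model keeps this gap small."""
    return abs(model(image) - model(transform(image)))

def passes_invariance(model, image, transform, max_gap=0.10):
    return invariance_gap(model, image, transform) <= max_gap
```

Running this over a held-out set with several transforms (rotations, crops, mild noise) gives a cheap, model-agnostic robustness report before any deployment decision.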
Is maintaining model precision (Float32) advisable in order to reduce adversarial vulnerability, given that quantization would improve latency but might reduce robustness? In other words: how do we reduce adversarial noise while keeping the latency restriction?
High Precision (FP32) means your model is hyper-sensitive. It captures the signal perfectly, but it also captures the noise perfectly. In a high-stakes edge environment, that sensitivity is a liability, not an asset. By combining Lipschitz constraints with Int8 quantization, we turn discretization error into a defensive feature, stripping out small-scale adversarial noise without adding a single microsecond of latency (The Bit-Depth Barrier).
Here is the physics of why "dumbing down" the model actually saves it:
- The Bucketing Effect: Quantization forces continuous values into discrete bins. If the adversarial noise (perturbation) is small enough, it falls into the same "bucket" as the clean signal.
- The Rounding Shield: When the value is rounded to the nearest centroid, the noise is effectively stripped out. The lower precision acts as a low-pass filter for free.
- The Lipschitz Constraint: The only risk is error amplification deeper in the network. You solve this by applying Lipschitz regularization during quantization-aware training, which caps how much the output can change relative to the input.
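The bucketing effect is easy to verify numerically. A toy symmetric int8 quantizer (the scale value is illustrative): any perturbation smaller than half the bucket width cannot move an activation into a different bucket.

```python
def quantize_int8(x, scale):
    """Map a float activation to an int8 bucket (symmetric quantization)."""
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the int8 range

# A perturbation smaller than half the bucket width (scale / 2) cannot
# change the bucket, so the adversarial noise is stripped out for free.
scale = 0.1
clean = quantize_int8(0.52, scale)             # bucket 5
attacked = quantize_int8(0.52 + 0.02, scale)   # still bucket 5
```

The Lipschitz regularization from the last bullet is what keeps the residual rounding error (up to scale/2 per activation) from being amplified layer by layer into a large output change.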