Casa de Brain

AI Tricks and Techniques

How do you teach a model new data when the original training data is gone, while preventing "Catastrophic Forgetting"?

Scenario: We deleted our original training dataset for GDPR compliance. We need to teach the live model a new class of data today.
-> Elastic Weight Consolidation (EWC) or LoRA adapters.
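EWC works by anchoring the weights that mattered for the old task, so new-task gradients can't drag them away. A minimal numpy sketch of the EWC penalty term (the Fisher values and weights below are made-up toy numbers, purely for illustration):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta_star_i)^2.
    High-Fisher (important-for-the-old-task) weights are anchored more strongly."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

theta_star = np.array([1.0, -2.0])  # weights frozen after the old task
fisher     = np.array([5.0, 0.1])   # per-weight importance (diagonal Fisher)
theta      = np.array([1.2, -1.0])  # current weights while learning the new task

penalty = ewc_penalty(theta, theta_star, fisher)
# total_loss = new_task_loss + penalty   (added to the new-task objective)
```

Note the asymmetry: the first weight moved only 0.2 but is heavily penalized (Fisher 5.0), while the second moved 1.0 almost for free (Fisher 0.1).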

Another newer option is Nested Learning, a machine-learning paradigm that treats a model as a system of interconnected optimization problems running at different speeds.

New model architectures such as HRM and TRM also fall under the Nested Learning category and are among the first breed of impactful continual-learning models.

Source of Non-determinism in LLMs

Scenario: Adapt a 70B model to the highly technical medical domain. We are GPU-constrained, so full fine-tuning is not an option. How do we proceed?

-> To match full fine-tuning performance on a LoRA budget, you need to implement The Hyper-Adaptation Trifecta:
1. Rank-Stabilized LoRA (RS-LoRA): Standard LoRA scales adapters by 1/r. This collapses learning as you increase rank. You must switch to scaling by 1/sqrt(r) to stabilize gradients at higher ranks (e.g., r=256).
2. LoftQ Initialization: Random initialization of adapters is inefficient. Use LoftQ to quantize the backbone and initialize adapters to minimize the approximation error immediately.
3. Differential Learning Rates: Not all layers learn at the same speed. You must apply a lower LR to embedding layers (to retain vocabulary stability) and a higher LR to the projection layers.

Debugging a new Transformer implementation:

Scenario 1 >>
Say the code runs without errors. The training loop executes. But the loss curve is completely flat. What is your first move?
-> ๐’๐ข๐ง๐ ๐ฅ๐ž-๐๐š๐ญ๐œ๐ก ๐Ž๐ฏ๐ž๐ซ๐Ÿ๐ข๐ญ. Before you touch a single hyperparameter, you must prove the architecture is capable of memorization.
-- ๐ˆ๐ฌ๐จ๐ฅ๐š๐ญ๐ž: Take exactly ONE batch of data (e.g., 32 samples).
-- ๐’๐ญ๐ซ๐ข๐ฉ: Turn off all regularization (Dropout = 0.0, Weight Decay = 0.0, Data Augmentation = Off).
-- ๐…๐จ๐ซ๐œ๐ž: Train on that single batch for certain epochs.
If the model implementation is correct, the loss should drive to absolute zero (0.00) and training accuracy should hit 100%. The model should perfectly memorize the inputs.
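The same sanity check in miniature, with a tiny over-parameterized linear model standing in for the Transformer (pure numpy, no regularization of any kind):

```python
import numpy as np

# One fixed batch, an over-parameterized model, zero regularization.
# A correct implementation must be able to memorize this batch.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # ONE batch: 4 samples, 8 features
y = rng.standard_normal(4)
w = np.zeros(8)

for _ in range(3000):             # "Force": hammer the same batch
    resid = X @ w - y
    w -= 0.1 * X.T @ resid / len(y)

loss = float(np.mean((X @ w - y) ** 2))  # should be ~0 if memorization works
```

If `loss` plateaus well above zero here, the bug is in the model or the loss wiring, not in the hyperparameters.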

Scenario 2 >>
Once the model can overfit (memorize), to make it actually learn multi-class detection, try the following options:
-- Lower the learning rate, say from 1e-3 to 1e-4, and tune the optimizer parameters.
-- Add proper data augmentation and improve the dataset via class balancing.
-- Revisit normalization (BatchNorm vs. LayerNorm, and maybe data normalization too) and update the loss function with a weighted class loss aimed at the imbalanced dataset.
-- Re-balance loss weights: if the bbox loss is small compared to the classification loss, increase the bbox loss weight. Try multiplying bbox losses by 2-5 and observe the effect.
-- Monitor training vs. validation: if the bbox loss decreases on train but not on val, you are overfitting. Reduce augmentation or add regularization.
-- Lower the IoU threshold for positives (e.g., from 0.6 to 0.4) or use an adaptive assigner (ATSS / SimOTA) if available.
-- Try different loss functions, e.g., Focal Loss for class imbalance or IoU-based bbox losses (GIoU / DIoU / CIoU).

-- Gradient clipping and batch size are further tuning knobs.
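A minimal numpy sketch of the weighted class loss and the bbox re-balancing mentioned above (the probabilities, weights, and the 2.0 multiplier are toy values; in practice class weights often come from inverse class frequency):

```python
import numpy as np

def weighted_ce(probs, labels, class_weights):
    """Mean of -w_y * log p_y: rare classes get larger weights."""
    w = class_weights[labels]
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(-w * np.log(p)))

# Toy 2-class example: class 1 is rare, so it gets 3x weight.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
class_weights = np.array([1.0, 3.0])

cls_loss = weighted_ce(probs, labels, class_weights)

# Loss re-balancing: upweight the bbox term if it is dwarfed by the
# classification term (try a factor of 2-5 and observe the effect).
bbox_loss = 0.05                       # placeholder value
total_loss = cls_loss + 2.0 * bbox_loss
```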

After all that, if target performance is still not reached, it is time to update the model architecture.

4 strategies for multi-GPU training

  1. Model parallelism
  2. Tensor parallelism
  3. Data parallelism
  4. Pipeline parallelism
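Data parallelism is the easiest of the four to sanity-check numerically: each worker computes the gradient on its shard, and averaging the shard gradients (what all-reduce does in practice) reproduces the full-batch gradient exactly when the shards are equal-sized. A toy numpy sketch of that invariant:

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error for a linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
w = np.zeros(3)

# "Single GPU": gradient on the full batch of 8.
full_grad = mse_grad(w, X, y)

# "Two workers": each computes the gradient on its 4-sample shard,
# then the shard gradients are averaged (the all-reduce step).
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
avg_grad = np.mean([mse_grad(w, Xs, ys) for Xs, ys in shards], axis=0)
```

The other three strategies split the model itself (by layer block, by tensor dimension, or by pipeline stage) rather than the batch.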

Validating Production readiness of a Segmentation model

Validate on metrics such as:

  1. Mask-based metrics: Intersection over Union (IoU, a.k.a. Jaccard Index), Dice Score / F1 Score, Average Precision, mAP and Average Recall at IoU thresholds (COCO style), Panoptic Quality (PQ) for panoptic segmentation, Specificity, mIoU.
  2. Boundary-based metrics: Hausdorff Distance (HD) or HD95, Normalized Surface Dice (NSD), etc.

The most highly used metrics for segmentation models in top academic papers, particularly in semantic and medical image analysis, are the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU). These overlap-based metrics are often complemented by boundary-based (such as HD95) and traditional classification metrics to provide a comprehensive evaluation.
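As a quick reference, both overlap metrics are a few lines of numpy on binary masks, and they are monotonically related (Dice = 2·IoU / (1 + IoU)):

```python
import numpy as np

def iou(pred, gt):
    """Jaccard index on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def dice(pred, gt):
    """Dice similarity coefficient on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt   = np.array([[1, 0], [0, 0]], dtype=bool)

iou_val, dice_val = iou(pred, gt), dice(pred, gt)  # 0.5 and 2/3
```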

After that, to check for production readiness, follow The Black Patch Protocol:
torture the model with "Adversarial Stress Testing" using two specific techniques:

  1. ๐ˆ๐ง๐ฏ๐š๐ซ๐ข๐š๐ง๐œ๐ž ๐“๐ž๐ฌ๐ญ๐ข๐ง๐ :
    Rotate the input image by 15ยฐ or crop the edges. The anatomy hasn't changed, so the prediction shouldn't either. If confidence swings from 0.99 to 0.40 just because you rotated the camera, the model is overfitting on pixel-level noise.
  2. ๐ƒ๐ข๐ซ๐ž๐œ๐ญ๐ข๐จ๐ง๐š๐ฅ ๐„๐ฑ๐ฉ๐ž๐œ๐ญ๐š๐ญ๐ข๐จ๐ง (๐“๐ก๐ž "๐๐ฅ๐š๐œ๐ค ๐๐š๐ญ๐œ๐ก"): Manually black out the actual gallbladder in the image.

Is maintaining model precision (FP32) advisable in order to reduce adversarial vulnerability, given that quantization would improve latency but might reduce robustness? In other words, how do we reduce adversarial noise while staying within the latency budget?

High Precision (FP32) means your model is hyper-sensitive. It captures the signal perfectly, but it also captures the noise perfectly. In a high-stakes edge environment, that sensitivity is a liability, not an asset. By combining Lipschitz constraints with Int8 quantization, we turn discretization error into a defensive feature, stripping out small-scale adversarial noise without adding a single microsecond of latency (The Bit-Depth Barrier).
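A toy numpy illustration of the bit-depth barrier: any perturbation smaller than half the quantization step is erased by rounding. (This is only the discretization half of the story; the Lipschitz constraint is what keeps layer-to-layer amplification from lifting the noise above the step size.)

```python
import numpy as np

def int8_quantize_dequantize(x, scale):
    """Symmetric int8 quantization: values snap to the nearest scale step."""
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * scale

scale = 0.01  # quantization step: perturbations below scale/2 vanish
clean = np.array([0.50, -0.30, 0.10], dtype=np.float32)
noise = np.array([0.004, -0.003, 0.002], dtype=np.float32)  # sub-step adversarial noise

deq_clean = int8_quantize_dequantize(clean, scale)
deq_noisy = int8_quantize_dequantize(clean + noise, scale)
# Both round to the same int8 codes: the perturbation is stripped out.
```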

Here is the physics of why "dumbing down" the model actually saves it: