08A: Model Fitness

Materials:

Date: Tuesday, 17-Sep-2024

Pre-work:

  1. [tools] Pytorch-ood - a collection of techniques to detect out-of-distribution (OOD) samples in PyTorch. Mostly image-focused.
  2. [tools] PyOD - a collection of anomaly detection techniques

In-Class

  1. Characterizing data difficulty or sample hardness: Understanding Dataset Difficulty
  2. [notebook] Sample Fitness Metrics based on Information Theory, where we walk through these concepts on a toy dataset
  3. Model diagnostics of LLMs: Application of Random Matrix Theory (RMT) to assess model generalization ability
    • Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning paper [JMLR, 2021]
    • Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data paper [Nature Communications, 2021]
    • Application of RMT to analyze an LSA (Latent Semantic Analysis) model notebook

Post-class

  1. [paper] The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
  2. [tools] AlignTDS - common metrics to detect differences between token distributions in LLMs
  3. [video] Heavy Tails in ML: Structure, Stability, Dynamics. Invited talk at NeurIPS’23
  4. [notebooks] RMT Application for Diagnosing LLMs. See the git repo for navigation.
  5. [blog] Explaining double descent using WeightWatcher.

Notes

  1. Many metrics proposed to understand and characterize data and models are based on information theory.
  2. Likelihood ratio, deviance, cross-entropy, perplexity, and \(\mathcal{V}\text{-information}\) are all related in linear and generalized linear models. They can be useful in the modern deep learning and LLM context as well.
  3. WeightWatcher is a very interesting application of Random Matrix Theory (RMT) to study the training dynamics of LLMs and other black-box models. It needs access to the model weights, but not to the training or test data. The approach fits a power law to the eigenvalue spectrum of each layer's weight matrix and, based on a model of the learning dynamics, classifies the training regime into different stages. Models that generalize well exhibit different characteristics in the eigenspectrum than models that do not.
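
To make the relationships in note 2 concrete, here is a minimal NumPy sketch on a toy binary dataset. Everything in it is an illustrative assumption rather than the class notebook: the synthetic data, the hand-rolled `fit_logistic` helper, and the use of in-sample log-likelihoods (a proper \(\mathcal{V}\)-information estimate would use held-out data and a chosen model family). It shows that, for Bernoulli outcomes, the deviance is 2N times the cross-entropy, perplexity is the exponential of the cross-entropy, and the average pointwise information gain over a null model equals the likelihood-ratio statistic divided by 2N.

```python
import numpy as np

def mean_nll(p_observed):
    """Cross-entropy in nats: mean negative log-probability of the observed labels."""
    return float(-np.mean(np.log(p_observed)))

def fit_logistic(x, y, steps=2000, step_size=0.1):
    """Gradient descent on the mean NLL of a 1-D logistic regression (toy helper)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= step_size * np.mean((p - y) * x)
        b -= step_size * np.mean(p - y)
    return w, b

# Synthetic toy data (assumption): labels depend on x through a logistic link.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)

# Model that sees x: probability it assigns to each observed label.
w, b = fit_logistic(x, y)
p1 = 1.0 / (1.0 + np.exp(-(w * x + b)))
p_model = np.where(y == 1, p1, 1.0 - p1)

# Null model that ignores x and always predicts the base rate.
base_rate = y.mean()
p_null = np.where(y == 1, base_rate, 1.0 - base_rate)

ce_model, ce_null = mean_nll(p_model), mean_nll(p_null)
perplexity = np.exp(ce_model)               # perplexity = exp(cross-entropy)
deviance_model = 2 * n * ce_model           # Bernoulli deviance (saturated log-lik is 0)
deviance_null = 2 * n * ce_null
lr_stat = deviance_null - deviance_model    # likelihood-ratio statistic vs. the null model

# Pointwise information gain per sample (nats); its mean is the in-sample
# analogue of V-information: H(Y) - H(Y | X) under the two fitted models.
pvi = np.log(p_model) - np.log(p_null)
v_info = float(pvi.mean())

print(f"cross-entropy={ce_model:.3f}  perplexity={perplexity:.3f}")
print(f"LR statistic={lr_stat:.1f}  mean PVI={v_info:.3f}  LR/(2N)={lr_stat / (2 * n):.3f}")
```

Samples with low or negative pointwise values in `pvi` are the ones the model finds hard relative to the null model, which is one way to rank sample difficulty.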
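
The power-law fit described in note 3 can be sketched in a few lines of NumPy. This is a simplified illustration, not the WeightWatcher API: the helper names (`esd`, `powerlaw_alpha`, `layer_alpha`) are hypothetical, the tail cutoff is a fixed quantile chosen for convenience rather than an optimized `xmin`, and the two "layers" are random matrices standing in for real model weights. The exponent is estimated with the standard maximum-likelihood formula for a continuous power law, and the estimator is first checked on samples with a known exponent.

```python
import numpy as np

def esd(weight):
    """Eigenvalues of X = W^T W / N for an N x M weight matrix (empirical spectral density)."""
    n_rows = weight.shape[0]
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return singular_values**2 / n_rows

def powerlaw_alpha(values, xmin):
    """Maximum-likelihood exponent for p(x) ~ x^(-alpha) on x >= xmin."""
    tail = values[values >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

def layer_alpha(weight, tail_fraction=0.5):
    """Fit the power-law exponent to the upper tail of one layer's eigenspectrum.

    tail_fraction is an assumption made for simplicity; a careful fit would
    select xmin from the data instead of fixing a quantile.
    """
    eigenvalues = esd(weight)
    xmin = np.quantile(eigenvalues, 1.0 - tail_fraction)
    return powerlaw_alpha(eigenvalues, xmin)

rng = np.random.default_rng(0)

# Sanity check: inverse-CDF samples from a power law with known exponent 3.
true_alpha = 3.0
samples = (1.0 - rng.random(20000)) ** (-1.0 / (true_alpha - 1.0))
alpha_hat = powerlaw_alpha(samples, xmin=1.0)

# Stand-ins for layer weights (assumption): light-tailed vs. heavy-tailed entries.
gaussian_w = rng.normal(size=(600, 300))
heavy_w = rng.standard_t(df=3, size=(600, 300))
alpha_gaussian = layer_alpha(gaussian_w)
alpha_heavy = layer_alpha(heavy_w)

print(f"recovered alpha on synthetic power law: {alpha_hat:.2f}")
print(f"alpha (Gaussian entries): {alpha_gaussian:.2f}")
print(f"alpha (heavy-tailed entries): {alpha_heavy:.2f}")
```

For a real model, the same per-layer fit would be applied to each weight matrix extracted from the checkpoint, and the resulting exponents compared across layers or across checkpoints.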