Sample Fitness

We will look at dataset difficulty and sample perplexity using Iris data.

Iris is a relatively easy dataset. We will pick two features, and the versicolor and virginica labels, as there seems to be some overlap in their feature space. We will look at this two-dimensional data from many angles and see what we can learn about it.

See the Sample Hardness notebook for prior work; we will build from there.

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
y = iris.target
ind = (y == 1) | (y == 2)    # keep only versicolor (1) and virginica (2)
X = iris.data[ind, 0:2]      # first two features: sepal length, sepal width
y = y[ind] - 1               # relabel to {0, 1}
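
Since the claimed overlap is easiest to check visually, here is a quick scatter plot of the two classes in the chosen feature space (a minimal sketch; it only uses the X, y, and iris objects defined above).

# Scatter plot of the two classes in the selected 2-D feature space
plt.scatter(X[y == 0, 0], X[y == 0, 1], label='versicolor')
plt.scatter(X[y == 1, 0], X[y == 1, 1], label='virginica')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend()
plt.show()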

We will build a simple logistic model and calculate the deviance between two models given the data. This idea is intimately tied to likelihood ratio tests (LRT) and Bayes factors.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=128)


import statsmodels.api as sm

# Note: sm.Logit does not add an intercept automatically,
# so this model is fit on the two features only (no constant term).
model = sm.Logit(y_train, X_train).fit()
print(model.summary())
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.667806
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   80
Model:                          Logit   Df Residuals:                       78
Method:                           MLE   Df Model:                            1
Date:                Sun, 15 Sep 2024   Pseudo R-squ.:                 0.03482
Time:                        17:04:38   Log-Likelihood:                -53.424
converged:                       True   LL-Null:                       -55.352
Covariance Type:            nonrobust   LLR p-value:                   0.04961
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.6830      0.377      1.811      0.070      -0.056       1.422
x2            -1.4199      0.815     -1.742      0.082      -3.018       0.178
==============================================================================

Likelihood

The model summary contains many useful statistics such as the Log-Likelihood, LL-Null, and the LLR p-value. Let us explore what these different quantities are.

The generative model for logistic regression can be written as follows:

\[y_i \sim Binomial(1,\pi_i) \\ \log \left(\frac{\pi_i}{1-\pi_i} \right) = x_i^T\beta\] where \(y_i \in \{0,1\}\) is the binary response drawn from a Bernoulli trial with \(P(y_i=1)=\pi_i\), \(x_i\) is a \(p \times 1\) vector of input features, and \(\beta\) is a \(p \times 1\) vector of coefficients (weights).

The log-likelihood of \(n\) examples (from a training set) can be written as: \[\log \ell(\beta; D) = \sum_{i=1}^{n} y_i \log(\pi_i) + (1-y_i) \log(1-\pi_i)\] where \(D = \{x_i, y_i\}_{i=1}^{n}\) represents all the data available to fit (train) the model, \(\ell\) denotes the likelihood, and \(\pi_i\) is as defined earlier.

Typically, the log-likelihood \(\log(\ell)\) is reported. One can see that the cross-entropy is the negative log-likelihood for this problem. By following this procedure, we can come up with new loss functions: define the loss as the negative log-likelihood.

# log likelihood 
print('log likelihood', model.llf)
log likelihood -53.424489860304064
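
As a quick check of the claim that the cross-entropy is the negative log-likelihood here, the average log loss on the training set times the number of examples should match -llf. This is a small sketch assuming scikit-learn's log_loss and the model, X_train, and y_train objects from above.

from sklearn.metrics import log_loss

# log_loss returns the average cross-entropy (in nats);
# multiplying by n gives the total, which should equal the negative log-likelihood
p_train = model.predict(X_train)    # P(y = 1 | x) under the fitted model
print('n * cross-entropy      ', log_loss(y_train, p_train) * len(y_train))
print('negative log-likelihood', -model.llf)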

AIC

AIC is defined as \(2p - 2 \ln(\hat{L})\), where \(p\) is the number of estimated parameters and \(\hat{L}\) is the likelihood evaluated at the estimated model parameters (so \(\ln(\hat{L})\) is the maximized log-likelihood). The correction factor, or penalty term, \(2p\) penalizes complex models. AIC is often used for model comparison: the smaller the AIC, the better the fit to the data, with more complex models penalized more heavily.

Cross-validation is a very popular hyperparameter tuning technique in the ML community. It is worth noting that, asymptotically, AIC and LOOCV (leave-one-out cross-validation) are equivalent. See An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion (Stone, 1977).

# AIC
print('aic', model.aic)
aic 110.84897972060813
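
To tie the printed value back to the definition, we can recompute AIC directly from the log-likelihood. A small sketch; it assumes the fitted model has \(p = 2\) estimated coefficients (the two features, with no intercept).

# AIC = 2p - 2 * log-likelihood, with p = 2 fitted coefficients here
p = 2
print('aic (recomputed) ', 2 * p - 2 * model.llf)
print('aic (statsmodels)', model.aic)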

LLR and Deviance

The simplest model one can fit to the data is one with no features at all, i.e., an intercept-only model. This gives us a second model \(M_r \equiv \log \left(\frac{\pi_i}{1-\pi_i} \right) = \beta_0\), where \(\beta_0\) is the intercept term and no features enter the model. Here \(M_r\) stands for the reduced model.

Whereas \(M_f \equiv \log \left(\frac{\pi_i}{1-\pi_i} \right) = x_i^T\beta\), defined earlier, is the full model, meaning all features are used in the model, and perhaps this is the best available in the model class, like an Oracle. One may say \(M_r \subseteq M_f\) when the set of features \(x_i\) includes the constant \(1\) as one of the features and both \(M_r\) and \(M_f\) belong to the same model class. The difference in the log-likelihoods is a useful indicator of how large the discrepancy between \(M_r\) and \(M_f\) is.

The Log-likelihood Ratio (LLR) is a statistic that computes this quantity:

\[LLR(M_r, M_f) = \log\left( \frac{\ell(M_r;D)}{\ell(M_f;D)} \right) = \log(\ell(M_r;D))-\log(\ell(M_f;D))\]

# LLR
print('LLR', model.llr)
LLR 3.8544857609679326

The LLR scaled by \(-2\) is known as the deviance, defined as: \[D(M_r, M_f) = -2 \left[\log(\ell(M_r;D))-\log(\ell(M_f;D))\right]\] Note that statsmodels' llr attribute already includes this \(-2\) scaling, i.e., it reports \(-2\left[\log(\ell(M_r;D))-\log(\ell(M_f;D))\right]\), which is why the value printed above is positive.

Asymptotically, the deviance follows a \(\chi^2\) distribution with \(p\) degrees of freedom, where \(p\) is the difference in the number of parameters between the full and the reduced models. So we can see how useful the predictors in \(M_f\) are compared to \(M_r\), and the evidence can be expressed as a p-value.

# LLR test (likelihood-ratio test; not to be confused with the Wald test)

print('LLR Test', model.llr_pvalue)
LLR Test 0.04961313336509171
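
Both numbers can be reproduced from the fitted quantities: the statistic from the two log-likelihoods and the p-value from the \(\chi^2\) tail probability. A minimal sketch assuming scipy is available; it uses the llnull and df_model attributes of the statsmodels result.

from scipy import stats

# Deviance / LLR statistic: 2 * (log-likelihood of full model - log-likelihood of null model)
llr_stat = 2 * (model.llf - model.llnull)
print('llr (recomputed)', llr_stat)

# p-value: upper tail of the chi-squared distribution with df = difference in parameters
print('p-value (recomputed)', stats.chi2.sf(llr_stat, model.df_model))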

We reject the null (i.e., prefer \(M_f\), the more complex model, over the simpler model \(M_r\)) at type-1 error rate \(\alpha\) if \(p_{val} < \alpha\). A typical choice for \(\alpha\) is 0.05.

\(\nu\text{-information criteria}\)

At the core of this procedure is a way to compare two different models given the same data. Generally, one model will be simpler than the other. This idea was recently explored in the paper Understanding Dataset Difficulty with \(\nu\)-Usable Information. The authors' motivation was different, however: they want to characterize dataset difficulty.

The paper introduces two new information-theoretic measures called \(\nu\text{-information}\), denoted \(I_{\mathcal{V}}(X \rightarrow Y)\), and \(\text{pointwise }\nu\text{-information}\), denoted \(\text{PVI}(x \rightarrow y)\). Note that \(I_{\mathcal{V}}(X \rightarrow Y) = \mathbb{E}_{x,y \sim P(X,Y)}[\text{PVI}(x \rightarrow y)]\).

Please see the paper for the definitions and the full treatment. But informally, from an information-theoretic point of view, \(\nu\text{-information}\) is the information gained about \(Y\) by conditioning on (observing) \(X\). As a matter of fact, it closely resembles mutual information \(I(X;Y) = H(Y)-H(Y|X)\), where \(H(Y)\) is the entropy of \(Y\) and \(H(Y|X)\) is the entropy of \(Y\) conditional on \(X\). The difference lies in how \(X\) and \(Y\) are allowed to be related: \(\nu\text{-information}\) restricts the mappings \(f: X \rightarrow Y\) to the admissible class of functions that can be learnt under the hypothesis class \(\mathcal{V}\), whereas classical mutual information places no such restriction (in fact, no function class is specified at all).

The procedure to estimate \(\nu\text{-information}\) is given in Algorithm 1 of the paper. After adapting the notation to ours, \(\text{PVI}\) and \(\nu\text{-information}\) can be estimated from data as:
\[ \hat{\text{PVI}}(x_i,y_i) = -\log \hat{\ell}(M_r; x_i,y_i) + \log \hat{\ell}(M_f; x_i,y_i)\\ \hat{I}_{M_r,M_f}(X \rightarrow Y) = \sum_{i=1}^{n} \hat{\text{PVI}}(x_i,y_i) \] where \(\hat{\ell}(M; x_i,y_i)\) is the likelihood of the single example \((x_i, y_i)\) under the fitted model \(M\), i.e., the predicted probability of the observed label.

If we scale it appropriately, we notice that \(D(M_r,M_f) = 2\hat{I}_{M_r,M_f}(X \rightarrow Y)\). So \(I_{\mathcal{V}}(X \rightarrow Y)\) is essentially the LLR seen through an information-theoretic lens, applied in a modern deep-learning context, more specifically to LLMs.
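
As a minimal sketch of this estimation recipe (under the assumption that logistic regression plays the role of the model class \(\mathcal{V}\)), we can fit an intercept-only (reduced) model and the two-feature (full) model on the training data and score the held-out examples. The helper label_prob below is a hypothetical convenience function, not a library API.

import numpy as np
import statsmodels.api as sm

# Reduced model: intercept only; full model: the two features (as fit above)
null_model = sm.Logit(y_train, np.ones((len(y_train), 1))).fit(disp=0)
full_model = sm.Logit(y_train, X_train).fit(disp=0)

def label_prob(fitted, exog, y):
    # Predicted probability of the observed label under a fitted Logit model
    p1 = fitted.predict(exog)               # P(y = 1 | x)
    return np.where(y == 1, p1, 1 - p1)

# Per-example PVI on the held-out test split (natural logs)
pvi = (np.log(label_prob(full_model, X_test, y_test))
       - np.log(label_prob(null_model, np.ones((len(y_test), 1)), y_test)))
print('PVI per test example', np.round(pvi, 3))
print('mean PVI', pvi.mean())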

In the paper, \(PVI\) was used to identify mislabelled examples. Across datasets and training regimes, examples with high \(PVI\) are generally correct, while low-\(PVI\) examples are not. This suggests that one can use \(PVI\) to define the confidence in the predictions.

It is to be noted that the connection between likelihood and information theory is not new. Prof. Manny Parzen's work demonstrated the connection between goodness-of-fit tests and entropy. See, for example, Entropy Interpretation of Goodness of Fit Tests (1983) and Goodness of Fit Tests and Entropy (1990). Indeed, the LLR is a goodness-of-fit test, and the statsmodels API gives the p-value for this test, as we have seen before.

Perplexity

Perplexity, as the name implies, is about the element of surprise. Again, it is just a fancy word for a monotonic transformation of cross-entropy, widely used in the NLP/LLM community. Having trained an LLM, one wants to see how perplexing the observed data is to the LLM. The more perplexing the data (the higher the cross-entropy), the less likely the model is to have generated it, or, put in a more relatable fashion, the more additional bits are needed to encode the data.

Like the recent \(\nu\text{-information}\) criteria, cross-entropy and other information-theoretic metrics such as the KL-divergence are also used to study LLM performance. For example, in The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning, token distributions are compared. Multiple metrics, such as the KL-divergence and Jaccard similarity, among others, are implemented (see the code). Using these metrics, the authors showed that LLM alignment via RLHF mostly makes stylistic modifications to the output rather than changing the factual content.

Once the log-likelihood is available, perplexity can be calculated as \[ \text{perplexity} = 2^{H(P,Q)} \] where \(P\) is the true distribution, \(Q\) is the distribution under the proposed model, and \(H(P,Q)\) is the cross-entropy (in bits). Since the true \(P\) is unknown, we often replace it with its empirical version. When a model is fit, the average negative log-likelihood is an estimate of the cross-entropy. Therefore, \[ \text{perplexity} = 2^{-\frac{1}{n}\log_2 \hat{\ell}(M; D)} \] where \(M\) is the model under consideration (with natural logs, replace the base 2 with \(e\)).
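
As a small illustration (a sketch, not part of the fitted summary above), we can compute the perplexity of the logistic model on the held-out test split from its average negative log-likelihood.

import numpy as np

# Predicted probability of the observed label for each test example
p1 = model.predict(X_test)
p_label = np.where(y_test == 1, p1, 1 - p1)

# Empirical cross-entropy = average negative log-likelihood (here in nats)
avg_nll = -np.mean(np.log(p_label))
print('average negative log-likelihood', avg_nll)
print('perplexity', np.exp(avg_nll))   # use 2**avg_nll if working in base-2 logs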

It is possible to compute the cross-entropy between two models, where \(P\) takes the role of a reference distribution and \(Q\) takes the role of a probing model. Note that this metric is not symmetric in \(P\) and \(Q\).