06A: Evaluate
Materials:
Date: Tuesday, 03-Sep-2024
Pre-work:
- Review the Kolmogorov-Smirnov Test as a way to measure divergence between two continuous, univariate distributions
- Review KL Divergence as a way to measure divergence between two arbitrary (continuous/discrete and univariate/multivariate) distributions (both are sketched in code below)
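A minimal sketch of both pre-work ideas on synthetic samples; the Gaussian distributions, sample sizes, and binning below are illustrative assumptions, not part of the lesson material:

```python
# A two-sample KS test (continuous, univariate) and a discrete KL divergence,
# computed on synthetic Gaussian samples (the distributions are made up).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # "reference" sample
q_sample = rng.normal(loc=0.5, scale=1.0, size=1000)  # shifted "comparison" sample

# KS test: H0 is that both samples come from the same distribution.
ks_stat, p_value = stats.ks_2samp(p_sample, q_sample)
print(f"KS statistic={ks_stat:.3f}, p-value={p_value:.4f}")

# KL divergence on binned (discretized) versions of the same samples.
bins = np.linspace(-4.0, 4.0, 41)
p_hist, _ = np.histogram(p_sample, bins=bins, density=True)
q_hist, _ = np.histogram(q_sample, bins=bins, density=True)
eps = 1e-12  # smoothing so empty bins do not produce log(0)
kl = stats.entropy(p_hist + eps, q_hist + eps)  # KL(P || Q) in nats
print(f"KL(P || Q) = {kl:.3f} nats")
```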
In-Class
- Motivation for why we need Design of Experiments and Hypothesis Testing, and where they appear in the ML Life Cycle.
- Lesson 1 - a quick intro to DoE: scientific objectives, basic principles of DoE, and steps for planning, conducting, and analyzing an experiment.
- Lesson 2 - a simple comparative experiment. The name A/B Testing perhaps comes from testing the difference between two groups A and B, which is a simple comparative experiment. We demonstrate how to define a business problem as a hypothesis test, collect data, perform the test, and draw a conclusion. How to calculate the sample size, probably the most important question that gets asked, is also explained in this simple case (see the sketch below).
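As a companion to the sample-size question, here is a minimal sketch for a two-proportion A/B test using statsmodels; the baseline and target conversion rates, significance level, and power are illustrative assumptions:

```python
# Sample size per group for a two-proportion A/B test at 5% significance
# and 80% power. The conversion rates below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_a, p_b = 0.10, 0.12  # baseline vs. hoped-for conversion rate
effect = proportion_effectsize(p_b, p_a)  # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"n per group: {n_per_group:.0f}")
```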
Post-Class
- Review Model Selection from CS329s
- Review Model Evaluation chapter of ML Engineering book
References:
- [book] A/B Testing. This book gives a non-technical introduction to A/B testing and how it is applied in e-commerce, website UX optimization, and marketing campaigns. The appendix covers many A/B testing scenarios.
- [book] Design and Analysis of Experiments. This is a classic in DoE. Lessons 1-2 are necessary.
- [course] STAT 503 Design of Experiments - online course at Penn State
- [course] STAT 514 Design of Experiments - course at Purdue (stats oriented). Chapters 1-4 are needed.
- [book] Statistical Design. This is another classic, from George Casella, a celebrated statistician and author.
Notes
- QA for Data, discussed earlier, is a specific case of A/B testing. CS folks call it A/B testing, but these are all different types of hypothesis tests.
- Hypothesis Tests are tools for testing aspects of data. In the ML context, they can be used for testing (see the sketch after this list):
- data drift (concept drift, covariate drift, label drift)
- data quality (implicit and explicit)
- model performance
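For instance, label drift can be framed as a hypothesis test on class proportions. A minimal sketch with made-up class counts (the reference and production windows below are illustrative assumptions); covariate drift on a continuous feature could similarly use the KS test from the pre-work:

```python
# Testing for label drift with a one-sample chi-square test.
# All class counts below are made-up numbers for illustration.
import numpy as np
from scipy import stats

# Class counts in the training (reference) window vs. a recent production window.
reference_counts = np.array([700, 200, 100])
production_counts = np.array([650, 180, 170])

# H0: production labels follow the reference class proportions.
expected = reference_counts / reference_counts.sum() * production_counts.sum()
chi2, p_value = stats.chisquare(f_obs=production_counts, f_exp=expected)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}")  # small p-value: evidence of label drift
```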
- On Model Testing/Comparison as a Hypothesis Test
- Is the “alternate” model better than the “baseline” model? Often, ML folks report performance metrics on the Test split for all the models and pick the model with the highest performance. It is not uncommon to claim SOTA even when the performance difference is in the 1/100ths of decimal places and replication is almost absent :). We do not even know if this difference is due to chance (randomness in the data) alone. A rigorous (and perhaps the right) approach would be to formulate this as a hypothesis test, design an experiment, collect evidence, and then conclude which model is better (and whether the difference is statistically significant). Note that statistical significance does not mean practical significance, which is often the case with most SOTA claims :). Another classical example to drive this point home is the (in)famous Netflix Prize: the top-performing model in the million-dollar competition never made it to deployment. Find out why.
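A minimal sketch of framing model comparison as a hypothesis test; the synthetic dataset and the two off-the-shelf sklearn models are illustrative stand-ins for the baseline and the alternate:

```python
# Comparing a "baseline" and an "alternate" model with a paired t-test over
# cross-validation folds. Dataset and models are illustrative stand-ins.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Same folds for both models, so the per-fold scores are paired.
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
alternate = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# H0: mean fold accuracy is equal; H1: the alternate model is better.
t_stat, p_value = stats.ttest_rel(alternate, baseline, alternative="greater")
print(f"baseline={baseline.mean():.3f}, alternate={alternate.mean():.3f}, p={p_value:.4f}")
```

One caveat: a naive paired t-test over CV folds is known to be optimistic because the training sets overlap across folds; a corrected resampled t-test, or McNemar's test on a single held-out split, is often preferred.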
- Hyperparameter Tuning and AutoML are DoE in disguise
- the experimental factors are, for example, the architecture, optimizer, learning rate, and batch size, among others. The response variable is the performance. Techniques like Bayesian Optimization and many techniques used in AutoML are indeed Sequential DoEs (you explore the search space sequentially by looking at the past exploration data). Grid search, the naïve approach to hyperparameter tuning, can be seen as an implementation of a full-factorial DoE (sketched below, after this list).
- Will ideas from DoE such as Blocking lead to better search strategies? Can we find out if the learning rate and the optimizer (e.g., Adam vs. AdamW) interact with each other? The experiments that address these questions are often referred to as Ablation Studies in the ML community. So, by learning the principles of DoE, we will be able to design (data and compute) efficient ablation studies.
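A minimal sketch tying the two points above together: a full-factorial grid over learning rate and optimizer, followed by a two-way ANOVA for main effects and the interaction. The `train_and_eval` function and its response surface are hypothetical stand-ins for a real training run returning validation accuracy:

```python
# A full-factorial design over learning rate and optimizer, replicated 3x,
# followed by a two-way ANOVA for main effects and the interaction.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)

def train_and_eval(lr, optimizer):
    # Made-up response surface: a bump around lr=1e-3 plus noise.
    base = {"adam": 0.90, "adamw": 0.91}[optimizer]
    return base - 0.02 * (np.log10(lr) + 3.0) ** 2 + rng.normal(0.0, 0.005)

# Full-factorial design: every combination of the two factors, replicated.
lrs, optimizers, replicates = [1e-4, 1e-3, 1e-2], ["adam", "adamw"], 3
runs = [{"lr": lr, "optimizer": opt, "acc": train_and_eval(lr, opt)}
        for lr, opt, _ in itertools.product(lrs, optimizers, range(replicates))]
df = pd.DataFrame(runs)

# C(lr) * C(optimizer) expands to both main effects plus their interaction.
model = ols("acc ~ C(lr) * C(optimizer)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```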
- How to plan and collect data for an ML problem?
- In model-centric ML development, which is where most, if not all, students start their ML training, they work on data handed to them on a platter. It is rare for an (undergraduate) student to have taken part in a data-collection planning exercise. But once they are thrown into industry, they have to confront this very difficult question. We will address these questions in the next session.