04B: Monitor (Data)
Materials:
Date: Friday, 23-Aug-2024.
Pre-work:
- See how the logistic regression test cases are written in sklearn. In particular, see how the test data is prepared so that the fitted coefficients can be checked analytically.
- See how the Perceptron test cases are written on the Iris data.
In-class:
- QA for Data. It is all about asserting the statistical properties of the data.
- Discussion on extracting statistical quality features into a table for any modality (tabular, image, speech, text)
- Testing columnar data with explicit conditions. Great Expectations calls these conditions expectations and validates them against new data. For example, an expectation can be:
  - a column can have at most 5% missing values
  - the values of a column must lie in the range [-2, 10]
- Testing columnar data with implicit conditions. One dataset is compared against a reference dataset. Evidently comes with many tests for reporting, model comparison, and data drift detection. For example, we can check whether the label distribution is the same in the Train set and the Test set. A question for all readers: how often have you “actually” run a statistical test to see whether the Train and Test sets are similar in distribution? My guess is that fewer than 10% have done it. The rest would have just called sklearn’s train_test_split function :)
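The two explicit conditions listed above (at most 5% missing values, values within [-2, 10]) can be sketched in plain Python. This is only an illustration of the idea and does not use the Great Expectations API. The function names, the sample column, and the choice to treat None and NaN as missing are assumptions made for this sketch.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(values):
    """Fraction of entries in the column that are missing."""
    if not values:
        return 0.0
    return sum(is_missing(v) for v in values) / len(values)


def expect_missing_at_most(values, max_fraction=0.05):
    """Explicit condition 1: at most `max_fraction` of the values are missing."""
    return missing_fraction(values) <= max_fraction


def expect_values_between(values, low=-2, high=10):
    """Explicit condition 2: every non-missing value lies in [low, high]."""
    return all(low <= v <= high for v in values if not is_missing(v))


# Hypothetical column: 100 values, one of them missing, all within [-2, 10].
column = [0.5, 3.2, None, 9.9, -1.0] + [1.0] * 95
results = {
    "missing_at_most_5pct": expect_missing_at_most(column),
    "values_in_[-2,10]": expect_values_between(column),
}
print(results)
```

Each check returns a boolean, so the same functions can be rerun on every new batch of data and the results collected into a validation report.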
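The Train/Test label comparison mentioned above can be sketched with a two-proportion z-test, here limited to binary labels. This is one possible statistical test, chosen because it needs only the standard library. It is not how Evidently implements its drift tests. The function name, the sample label lists, and the 0.05 significance threshold are assumptions made for this sketch.

```python
import math
from statistics import NormalDist


def label_rate_z_test(train_labels, test_labels, positive=1):
    """Two-sided two-proportion z-test on the share of `positive` labels.

    Returns (z, p_value). A small p_value suggests that the label
    distribution differs between the two sets.
    """
    n1, n2 = len(train_labels), len(test_labels)
    p1 = sum(y == positive for y in train_labels) / n1
    p2 = sum(y == positive for y in test_labels) / n2
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both sets contain a single, identical class: nothing to distinguish.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical label sets: the reference (Train) has 30% positives.
train = [1] * 300 + [0] * 700
test_similar = [1] * 30 + [0] * 70   # also 30% positives
test_shifted = [1] * 60 + [0] * 40   # 60% positives

for name, test in [("similar", test_similar), ("shifted", test_shifted)]:
    z, p = label_rate_z_test(train, test)
    verdict = "looks different" if p < 0.05 else "no evidence of a difference"
    print(f"{name}: z={z:.2f}, p={p:.4f} -> {verdict}")
```

For multi-class labels a chi-square test on the label counts is a common alternative. The structure stays the same: compute a statistic against the reference set and flag the new set when the p-value falls below the chosen threshold.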
Post-class:
- [Blog] ETL testing with Great Expectations
- [Docs] Great Expectations Documentation. Please note that since the release of version 1.0, even the examples in its repo are not working. Read them to understand what typical tests on columnar data look like.
- [Notebooks] Evidently examples. Browse through them and run a few to see how Evidently automates many test cases. Also see the community examples.