06B: Govern
Materials:
Date: Friday, 06-Sep-2024
Pre-work:
- CRUD - create, read, update, and delete: the primitive operations we need to make IT systems work.
- ACID - atomicity, consistency, isolation, and durability: the properties a database must support for those CRUD operations in order to guarantee data validity despite errors and failures.
- Set Theory - the mathematical basis for relational databases (RDBMS).
- Graph Theory - arguably the mathematical basis for graph databases like Neo4j, with NoSQL databases lying somewhere in between. Since graphs are more general mathematical objects, it is worth understanding graph theory to design graph databases.
- D4M (Dynamic Distributed Dimensional Data Model) - a computer programming model that combines the advantages of distinct data processing technologies (sparse linear algebra, associative arrays, fuzzy algebra, distributed arrays, triple-store/NoSQL databases) to improve search, retrieval, and analysis of data.
In-Class
- The CRUD operations on Model Predictions
- The R4 (Read, Repeat, Recall, Replace) Framework for Level 5 Data Governance.
Post-class
- [book] An Elementary Introduction to Statistical Learning Theory. See Chapter 8 for how the theory of VC dimension can be applied to get an idea of the required sample size.
- [site] GDPR - in particular, focusing on privacy and the right to be forgotten (or right to erasure).
- [wiki] Self-Driving Cars - read about the six levels (Level 0 to Level 5) of autonomy; Level 5 is a fully autonomous self-driving car.
Additional Reading (optional)
- [book] Mathematics of Big Data
Notes
At the heart of it, all IT applications eventually perform CRUD operations. Depending on the type of transaction, its purpose, and the SLAs, choices such as the database (SQL vs NoSQL), batch vs stream, and OLAP vs OLTP, among others, are made. A whole lot of mathematical theory (set theory, graph theory, associative arrays, multilinear algebra, etc.) is needed to build these technologies.
In yet another simplified view, ML systems are glorified auto-fills. Take away the fundamental uncertainty in the data produced by ML systems, and they are akin to IT systems. Therefore, the same infrastructure and engineering processes that support CRUD operations can be applied to ML-produced data. But is that sufficient?
One major difference is that ML systems have memory. Deleting a record does not delete its influence on downstream tasks. A simple example will drive home the point. Imagine a model trained on a dataset in which some records leak private data. What would deleting mean here? Obviously, one can delete those records from the training set, but one must also edit the model to remove their influence. Think one step further: what if the marked records are very close to some other records? Deleting them and retraining on the updated data does not mean their residual effect is removed. So the problem is not only complex but also complicated. When ML models are cascaded and appear in a sequence of data events, controlling their exposure and effects requires very good control over the downstream consumer applications.
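As a toy illustration (not from the course notes, and with made-up numbers), the sketch below fits an ordinary least-squares model with and without a "private" record that has a near-duplicate in the training data. The prediction in that region barely changes after deleting and retraining, which is the residual-effect problem described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background data plus a "private" record and a near-duplicate of it.
background = rng.normal(size=(50, 2))
private = np.array([5.0, 5.0])
near_dup = private + rng.normal(scale=0.01, size=2)

X = np.vstack([background, private, near_dup])   # private record is row 50
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=len(X))

def fit(X, y):
    # Ordinary least squares fit.
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_full = fit(X, y)

# "Delete" only the private record (row 50) and retrain.
keep = np.arange(len(X)) != 50
w_deleted = fit(X[keep], y[keep])

# The retrained model still predicts almost the same value around the private
# record, because the near-duplicate carries nearly the same information.
print("prediction with full model     :", private @ w_full)
print("prediction after delete+retrain:", private @ w_deleted)
```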
Let us try to re-interpret CRUD in the ML context. It leads us to the R4 {Read, Repeat, Recall, Replace} framework.
Read
On demand, read (or retrieve) the prediction made by an ML model. This requires maintaining proper metadata to retrieve the records. For example, if the input (the request to the API) is available as-is, and the output (the response) of the API is logged into a persistent object store, this is possible.
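As an illustration, a minimal sketch of a Read-capable prediction log follows, with sqlite standing in for the persistent store; the schema and function names are assumptions made for this example, not a prescribed design.

```python
import json
import sqlite3
import time
import uuid

# Toy persistent store for (request, response) pairs of a model API.
conn = sqlite3.connect("prediction_log.db")
conn.execute("""CREATE TABLE IF NOT EXISTS predictions (
    request_id    TEXT PRIMARY KEY,
    model_version TEXT,
    ts            REAL,
    request       TEXT,
    response      TEXT)""")

def log_prediction(model_version, request, response):
    """Persist one (request, response) pair so it can be Read later."""
    request_id = str(uuid.uuid4())
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
                 (request_id, model_version, time.time(),
                  json.dumps(request), json.dumps(response)))
    conn.commit()
    return request_id

def read_prediction(request_id):
    """Read: retrieve a past prediction by its request_id."""
    row = conn.execute("SELECT model_version, ts, request, response "
                       "FROM predictions WHERE request_id = ?",
                       (request_id,)).fetchone()
    if row is None:
        return None
    model_version, ts, request, response = row
    return {"model_version": model_version, "ts": ts,
            "request": json.loads(request),
            "response": json.loads(response) if response is not None else None}
```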
Repeat
On demand, repeat (or reproduce) the prediction made by an ML model. This requires maintaining three things for the sake of reproducibility: 1. the data (inputs) and model artifacts, 2. the code which loads the model and scores the given data, and 3. the runtime to execute the code and data.
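A minimal sketch of a reproducibility manifest that could be stored alongside each logged prediction; the artifact paths and the git-style code version in the usage comment are hypothetical.

```python
import hashlib
import platform
import sys

def artifact_hash(path):
    """Content hash of a data or model artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def build_manifest(data_path, model_path, code_version):
    """Capture the three ingredients of Repeat: artifacts, code, runtime."""
    return {
        "data_sha256": artifact_hash(data_path),
        "model_sha256": artifact_hash(model_path),
        "code_version": code_version,  # e.g. a git commit SHA
        "runtime": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }

# Example (hypothetical paths and commit):
#   manifest = build_manifest("train.parquet", "model.pkl", "a1b2c3d")
# Store the manifest next to the logged prediction; Repeat then means
# loading the same artifacts, code, and runtime and re-scoring.
```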
Recall
On demand, recall the prediction made by an ML model. Depending on the "definition" of recall, the implementation and engineering complexity varies. In the simplest case, which can itself be very complex, all past predictions scored by the model have to be replaced with, say, nulls. However, this implementation is not actionable by itself. The downstream consumer, for example, can be notified of the recall and take an appropriate action; that consumer must have the Read ability in the R4 framework. In the best-case scenario, the downstream consumer updates the decision (say, with a human-in-the-loop) or already has a back-off strategy implemented.
In the highest form of recall, all downstream consumers of the predictions are notified, and decisions are propagated along the entire chain of events.
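Continuing the toy sqlite log from the Read sketch, the simplest form of recall could be a single update that nulls out every prediction of a given model version; the harder part, as noted above, is getting downstream consumers to act on it.

```python
import sqlite3

conn = sqlite3.connect("prediction_log.db")

def recall_model(model_version):
    """Recall: replace every logged prediction of this model with NULL."""
    cur = conn.execute(
        "UPDATE predictions SET response = NULL WHERE model_version = ?",
        (model_version,))
    conn.commit()
    # Number of decisions nullified; consumers still have to be told to react.
    return cur.rowcount
```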
Replace
On demand, replace the prediction made by an ML model with another prediction (after correcting it). Once the record is updated, in the most sophisticated case, all downstream consumers propagate the change further downstream.
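Again on the toy sqlite log, Replace could rescore the logged requests with a corrected model and overwrite the stored responses; `corrected_model` is a hypothetical callable standing in for the corrected model service.

```python
import json
import sqlite3

conn = sqlite3.connect("prediction_log.db")

def replace_predictions(old_version, corrected_model, new_version):
    """Replace: rescore every logged request of old_version and update it."""
    rows = conn.execute(
        "SELECT request_id, request FROM predictions WHERE model_version = ?",
        (old_version,)).fetchall()
    for request_id, request in rows:
        new_response = corrected_model(json.loads(request))
        conn.execute(
            "UPDATE predictions SET response = ?, model_version = ? "
            "WHERE request_id = ?",
            (json.dumps(new_response), new_version, request_id))
    conn.commit()
    return len(rows)
```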
As with self-driving cars, the level of automation can be categorized.
| Level | Name | Supported Actions | Platform Features |
|---|---|---|---|
| L0 | R0 | Fire and Forget, Zero Traceability | Model APIs |
| L1 | R1 Read | Trace past decisions | Observability |
| L2 | R2 Read & Repeat | Trace and Recreate past decisions | Observability + Versioning + CD |
| L3 | R3 Read, Repeat, Recall | Trace, Recreate, and Nullify past decisions | Observability + Versioning + CD + CImp |
| L4 | R3-N Read, Repeat, Recall with Notify | Trace, Recreate, Nullify, and Notify past decisions | Observability + Versioning + CD + CImp + Event-IO |
| L5 | R4 Read, Repeat, Recall, Replace | Trace, Recreate, and Rescore past decisions | Observability + Versioning + CD + CImp |
| L6 | R4-N Read, Repeat, Recall, Replace with Notify | Trace, Recreate, Rescore, and Notify past decisions | Observability + Versioning + CD + CImp + Event-IO |
Are these four primitive operations sufficient to support any model governance? 1. Can a model be debiased? 2. Can the right to be forgotten (erasure) be enabled? 3. And many more such questions.
As one can see, from an engineering standpoint, certain platform features and functionalities are needed.
- Read: An observability platform to log data, with CRUD APIs on the logs, can enable reading any past decision (model prediction).
- Repeat: Versioning of data, models, code, and runtime, with Continuous Deployment (CD), can ensure reproducibility.
- Recall: A Continuous Improvement (CImp) capability that can take any alternate model (one which can produce nulls), deploy it, and rescore any past data can enable recalling past decisions.
- Replace: The same capability as Recall, except that a new model corrects past decisions.
- Notify: One has to turn the memory-less system into a system with memory, in an async fashion, by implementing a pub-sub model where downstream consumers subscribe to the R4 topics published by the prediction-maker service. For example, a model service publishes a recall topic with all the decisions that need to be annulled in some way. The subscribers of this topic take whatever action they deem fit; they can even publish another recall event, and all of their subscribers act accordingly. This way, a recall pipeline is created based on event triggers, as sketched below.
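A toy, in-process sketch of this pub-sub recall pipeline; a real deployment would use a message broker (for example Kafka or a cloud pub-sub service), and the topic and service names below are purely illustrative.

```python
from collections import defaultdict

class Broker:
    """Toy topic-based pub-sub broker (in-process, synchronous)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()

# A downstream consumer subscribes to the recall topic, acts on the event,
# and may publish a recall for its own consumers -- this is how the recall
# pipeline propagates through the chain of events.
def loan_decision_service(event):
    affected = event["request_ids"]
    print(f"annulling decisions derived from {affected}")
    broker.publish("recall.loan_decisions", {"request_ids": affected})

broker.subscribe("recall.credit_model", loan_decision_service)

# The prediction-maker service publishes the recall event.
broker.publish("recall.credit_model", {"request_ids": ["req-001", "req-002"]})
```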