ML Documentation
Good documentation is a necessity for any organization to present and represent itself to the world via its products, services and people. Several bodies of work exist on written and oral communication. For example:
- The ELement of Style for general communication
- The Visual Display of Quantitative Information a classic from Edward Tufte on conveying informaiton with data via rich visuals.
- Clean Code and Clean Coder and Uncle Bob’s wisdom on writing good (clean) code
- Style Guide for Python from Google to improve code readability
- Literate Programming conceived by Don Knuth, the God of computer science, a paradigm that combines documenting code along with code.
As one can imagine, the above cover the skills needed to write documents, codes, and reports, such that they can be understood by humans and computers, and in some cases by both simultaneously. But they fall short of expectations, particularly in regards to ML Documentation. Few differentiating factors that must be considered
- Data is at the heart of all documentation.
- ML Lifecycle is very iterative - implying change is a constant and data changes (we can treat Model as a special type of data).
- Anything changes, Everything changes - so change in data causes changes in all documents depending on that data and their derivatives.
As a result, maintaining veracity of the documentation is very hard. Further, when different stakeholders need different type of results, format and at different periodicity, it is even more problematic. Further, it is not entirely what to document. The CLeAR framework has some really nice (additional) goals for a document. A document must be 1) comparable 2) legible 3) actionable and 4) robust to achieve an aspect of AI Transparency. Just to be clear, here, we are talking about documenting Data, Models, and AI Systems. We can look at them individually.
Model Cards
Google published Model Cards for Model Reporting to improve transparency in model reporting. Idea is very similar to how information related to the nutritional content is published on the packaging. See their blog for details.
Data Cards
Similar efforts were made to document the details about the data too. See Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI for details and the accompanying blog with playbook here from Google Research.
Both Model Card and Data Card only capture “an” aspect of the entire problem but at a greater depth. Both of them are reporting aspects most useful in post deployment scenarios. But what about during development? In particular, given the exploratory and iterative nature of ML model and system development, how do we give structure to this process as well as document it?
AI Canvas
AI-Canvas fills the gap on the business side during the Design phase of the development. However, the terminology is used somewhat OOD for ML folks and is not easy to get it right in first pass :).
Project Canvas is more elaborate and easier to understand than AI Canvas. But again, the terming Project can made specific and unambiguous.
That leads to project cards.
Project Cards
The purpose of Project Cards is two folds:
During development helps the developer think about the problem in a structured way w.r. t framing the problem, assessing the business value, viability, and many other aspects.
It can also serve as a document giving a high level overview of the system developed and deployed. With proper versioning, one can also see the evolution of the problem. It is meant to be a high level document and as details emerge, documents such Model Cards and Data Cards can be linked.
The following are the different sections of the Project Cards.
Business View
It focusses on the following questions:
- What is the problem being solved?
- Who is the customer?
- Why it needs to be solved?
- How does the solution like - a mental or a conceptual model or a mock of the product?
- What objectives does it achieve?
- What are the risks and challenges?
ML view
It is bit more in-depth focussing on the execution will have the following questions:
- What is the prediction problem?
- How the objective will be measured?
- How will it be tested?
- Data: What kind, how and how much and what for?
- What is the roadmap/plan?
- What resources are needed - both human, compute and admin?
Data-driven Documentation
The above cards define what is to be captured, and in part, certain goals mentioend in the CLeAR framework can be addressed. But what about the process to produce the documents? It begets the question: what qualities should an ML document should posses. We propose the following - A document must be:
- Accurate (and up to date)
- Available
- Accessible
- Reproducible
- Auditable
- Versioned
- Serves more than one audience at a time (eg: purpose of a document, and as a result, the content, form of a document will vary by the audience).
But they are not easy to achieve. An ML Developer can only deliver on some fronts. Conventionally, - (code) Development and Documentation are not part of the same process. The toolset and mindset are different, even separated in time & space. - Even a simple edit or change request requires a copy-and-paste from somewhere. If data or ask or both change, one must redo the documentation repeatedly. This is neither repeatable nor reproducible and also not sustainable.
As a result, a rigorous process oversight is needed for compliance. For example, a reporting manager may periodically check if the documentation is maintained and up to date as a part of the review processing. But this is not sustainable. We know it all too well.
We have some pointers to realize those quality attributes.
What is the solution
- Surprisingly simple. Just tag a notebook cell - who is it for?
- And take benefit of modern document publishing tools and workflows like Quarto, GitHub, GitHub actions
Core tenets
- One content - many views
- Data + Code + Content > should drive the documentation (format, style, purpose)
- Each stakeholder’s documentation need is just a view or a content rendering problem
- Publishing documentation = Publishing code
- Use the same tools and mental models both for code and documents
- Single source of truth for any derived document
- Fix in only one place and only once.
- Physical and mental distance between Documentation and Code should be (close to) ZERO
- Set up once and automate subsequently
- Automate the publishing process
- No human oversight should be necessary for process compliance
- Commit code + documentation content > rendering must be automated
Above points are put in a Project Card Template which is a Jupyter notebook.