Lineage & Provenance
Every AI Product must declare its lineage and provenance.
Together, these declarations describe where the product came from, how it was built, and what dependencies it carries.
Lineage enables trust, reproducibility, and accountability.
Why Lineage & Provenance Matter
- Trust → Consumers and regulators need transparency into origins.
- Reproducibility → Downstream users can recreate results or validate claims.
- Governance → Provenance reveals risks from upstream dependencies.
- Compliance → Regulations (e.g., EU AI Act) require traceability across the AI lifecycle.
Provenance Requirements
An AI Product must declare:
- Source Models → Pre-trained models, foundation models, or fine-tuned bases.
- Training Data Sources → Origin, licensing, and governance of data used.
- Preprocessing Pipelines → Feature engineering, augmentation, labeling.
- Contributors → Teams, organizations, or individuals involved in creation.
- Date of Creation / Release → When the product was created and first made available.
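The provenance declarations above could be captured in a small typed record. The following is a minimal sketch, not a normative schema: the class names `DataSource` and `Provenance`, the field names, and the helper `missing_declarations` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class DataSource:
    # Hypothetical shape: origin, licensing, and governance of one data source.
    origin: str
    licensing: str
    governance: str = ""


@dataclass
class Provenance:
    # Field names are assumptions that mirror the declarations listed above.
    source_models: List[str] = field(default_factory=list)
    training_data_sources: List[DataSource] = field(default_factory=list)
    preprocessing_pipelines: List[str] = field(default_factory=list)
    contributors: List[str] = field(default_factory=list)
    created: Optional[date] = None


def missing_declarations(p: Provenance) -> List[str]:
    """Return the names of provenance fields that are empty or unset."""
    return [f.name for f in fields(p) if not getattr(p, f.name)]


# Usage: a partially declared product reports what is still missing.
partial = Provenance(
    source_models=["Fine-tuned ResNet-50"],
    contributors=["Healthcare AI Lab", "MedTech Corp"],
    created=date(2025, 3, 1),
)
print(missing_declarations(partial))
```

A check like this could run when a product is registered, so that incomplete provenance is flagged before publication.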
Lineage Requirements
An AI Product must document:
- Dependencies → External libraries, frameworks, data products, or other AI Products.
- Version History → Changes across releases, retrainings, or fine-tunes.
- Upstream Products → Data Products or AI Products that feed into this one.
- Downstream Impact → Known consumers or dependent products (if declared).
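Because upstream products can themselves have upstream products, lineage forms a graph that tools may need to traverse. The sketch below shows one way to do that. It assumes a hypothetical in-memory `registry` keyed by product identifier; the `Lineage` class, its field names, and every identifier other than `urn:dp:chest-xray:v3.1` are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Lineage:
    # Field names are assumptions that mirror the requirements listed above.
    dependencies: List[str] = field(default_factory=list)
    version_history: List[str] = field(default_factory=list)
    upstream_products: List[str] = field(default_factory=list)
    downstream_impact: List[str] = field(default_factory=list)


def all_upstream(product_id: str, registry: Dict[str, Lineage]) -> Set[str]:
    """Collect every direct and transitive upstream product of product_id."""
    seen: Set[str] = set()
    stack = [product_id]
    while stack:
        current = stack.pop()
        # Products missing from the registry are treated as having no upstream.
        for upstream in registry.get(current, Lineage()).upstream_products:
            if upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    # A cyclic declaration should not make a product its own upstream.
    seen.discard(product_id)
    return seen


# Usage with hypothetical identifiers.
registry = {
    "urn:aip:diagnosis-classifier:v1.1": Lineage(
        upstream_products=["urn:dp:chest-xray:v3.1"]
    ),
    "urn:dp:chest-xray:v3.1": Lineage(upstream_products=["urn:dp:raw-scans:v1"]),
}
print(sorted(all_upstream("urn:aip:diagnosis-classifier:v1.1", registry)))
```

The `seen` set keeps the traversal from looping if two products accidentally declare each other as upstream.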
Metadata Characteristics
- Provenance metadata must be machine-readable.
- Lineage must be linked across AIPROD and AIPDS specifications.
- Lineage declarations should support integration with data lineage tools.
Example
AI Product: Medical Diagnosis Classifier
- Provenance:
  - Source Model: Fine-tuned ResNet-50.
  - Training Data: NIH Chest X-ray dataset (licensed), augmented with proprietary scans.
  - Contributors: Healthcare AI Lab, MedTech Corp.
  - Created: 2025-03-01.
- Lineage:
  - Dependencies: PyTorch v2.0, CUDA toolkit.
  - Upstream Products: Imaging Data Product urn:dp:chest-xray:v3.1.
  - Version History: v1.0 → v1.1 retrained on new cases.
  - Downstream Impact: Used in Clinical Decision Support AI Suite.
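To illustrate the machine-readable requirement, the example above could be expressed as JSON. This is only a sketch: the values come from the example, but the key names and nesting are assumptions rather than a defined schema.

```python
import json

# Values are taken from the example above; key names are hypothetical.
record = {
    "product": "Medical Diagnosis Classifier",
    "provenance": {
        "source_model": "Fine-tuned ResNet-50",
        "training_data": [
            {"name": "NIH Chest X-ray dataset", "licensing": "licensed"},
            {"name": "Proprietary scans", "role": "augmentation"},
        ],
        "contributors": ["Healthcare AI Lab", "MedTech Corp"],
        "created": "2025-03-01",
    },
    "lineage": {
        "dependencies": ["PyTorch v2.0", "CUDA toolkit"],
        "upstream_products": ["urn:dp:chest-xray:v3.1"],
        "version_history": [
            {"version": "v1.0"},
            {"version": "v1.1", "change": "retrained on new cases"},
        ],
        "downstream_impact": ["Clinical Decision Support AI Suite"],
    },
}

# Serialize for publication, then parse it back as a consuming tool would.
serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Keeping identifiers such as the upstream URN as plain strings makes it easier for a lineage tool to join this record against the products it references.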
Summary
- Provenance describes origins of data, models, and contributors.
- Lineage describes dependencies, version history, and relationships.
- Together, they provide traceability, governance, and accountability.
Principle: An AI Product without declared lineage and provenance is an opaque asset, not a transparent product.