Quality Metrics
The credibility of an AI Product rests upon the quality of its outputs.
Every AI Product must declare the metrics by which its quality is assessed, including performance, fairness, robustness, and reliability.
Without explicit metrics, claims of quality are unverifiable and governance cannot be enforced.
Why Quality Metrics Matter
- Scientific Rigor → Validates that the AI Product meets stated objectives.
- Trust → Builds confidence among consumers, regulators, and auditors.
- Comparability → Enables evaluation across products with similar capabilities.
- Governance → Provides measurable signals for compliance and monitoring.
- Lifecycle Evolution → Offers a baseline for detecting drift and regression.
Categories of Quality Metrics
1. Performance Metrics
- Accuracy → Proportion of predictions or outputs that match ground truth.
- Precision / Recall / F1 Score → Especially relevant in classification tasks (see the sketch after this list).
- BLEU, ROUGE, METEOR → For natural language generation.
- PSNR, SSIM, FID → For vision and generative media.
- Domain-Specific Benchmarks → E.g., AUROC for medical diagnosis, perplexity for language models.
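As a minimal illustration of how a performance metric can be computed for declaration purposes, the sketch below derives precision, recall, and F1 from binary predictions; the label and prediction lists are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical validation labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```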
2. Fairness Metrics
- Demographic Parity → Equal rates of favorable outcomes across groups.
- Equalized Odds → Equal true positive and false positive rates across groups defined by protected attributes.
- Disparate Impact Ratio → Ratio of favorable-outcome rates across groups (see the sketch after this list).
- Counterfactual Fairness → Consistency under hypothetical attribute changes.
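For illustration, the sketch below computes per-group favorable-outcome rates and the disparate impact ratio; the outcome and group labels are hypothetical, and the 0.8 rule of thumb mentioned in the comment is a common convention rather than a requirement of this document.

```python
from collections import defaultdict

def group_positive_rates(outcomes, groups):
    """Rate of favorable (positive) outcomes per group."""
    counts, positives = defaultdict(int), defaultdict(int)
    for outcome, group in zip(outcomes, groups):
        counts[group] += 1
        positives[group] += outcome
    return {g: positives[g] / counts[g] for g in counts}

def disparate_impact_ratio(outcomes, groups):
    """Ratio of the lowest to the highest per-group favorable-outcome rate."""
    rates = group_positive_rates(outcomes, groups)
    return min(rates.values()) / max(rates.values())

# Hypothetical outcomes (1 = favorable) and group labels.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
groups   = ["A", "A", "A", "A", "B", "B", "B", "B"]
ratio = disparate_impact_ratio(outcomes, groups)
print(f"disparate impact ratio: {ratio:.2f}")  # values below ~0.8 are often flagged
```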
3. Robustness Metrics
- Adversarial Robustness → Resistance to adversarial perturbations.
- Generalization → Performance on out-of-distribution data.
- Stability → Variance in outputs across repeated runs (see the sketch after this list).
- Resilience → Ability to handle missing or corrupted inputs.
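As a sketch of a stability measurement, the code below reports the mean per-input standard deviation of a quality score across repeated runs (lower is more stable); the scores are hypothetical.

```python
import statistics

def stability(score_runs):
    """Mean per-input standard deviation of a quality score across repeated runs."""
    return statistics.mean(statistics.pstdev(scores) for scores in score_runs)

# Hypothetical quality scores for three inputs, each evaluated over four runs.
score_runs = [
    [0.82, 0.80, 0.81, 0.83],
    [0.74, 0.75, 0.73, 0.74],
    [0.90, 0.88, 0.91, 0.89],
]
print(f"mean per-input std dev: {stability(score_runs):.4f}")
```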
4. Reliability Metrics
- Uptime / Availability → Alignment with declared SLAs.
- Latency → Response times under load.
- Throughput → Requests or inferences processed per unit time.
- Error Rates → Frequency of failures or exceptions (see the reliability-report sketch after this list).
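A minimal reliability-report sketch, assuming a simple request log with per-request latency and failure flags; the traffic data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    failed: bool

def reliability_report(requests, window_seconds):
    """Availability, p95 latency, throughput, and error rate over a monitoring window."""
    n = len(requests)
    errors = sum(r.failed for r in requests)
    latencies = sorted(r.latency_ms for r in requests)
    p95 = latencies[int(0.95 * (n - 1))]
    return {
        "availability": 1 - errors / n,
        "p95_latency_ms": p95,
        "throughput_rps": n / window_seconds,
        "error_rate": errors / n,
    }

# Hypothetical one-minute window of traffic.
log = [Request(120, False), Request(310, False), Request(95, True), Request(210, False)]
print(reliability_report(log, window_seconds=60))
```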
Required Declarations
Every AI Product must declare (a machine-readable sketch follows this list):
- Primary evaluation metrics → chosen to reflect intended purpose.
- Benchmark datasets or tasks → used for validation.
- Threshold values → minimum acceptable performance.
- Testing protocols → procedures for validation and revalidation.
- Audit intervals → frequency of quality re-assessment.
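One way to make these declarations machine-readable (see Governance Integration below) is a structured record such as the sketch below; the schema and values are illustrative assumptions, not a mandated format.

```python
import json

# Illustrative quality declaration; field names, dataset name, and protocol text are assumptions.
quality_declaration = {
    "product": "legal-document-summarizer",
    "version": "2.1.0",
    "primary_metrics": ["ROUGE-L", "BLEU"],
    "benchmark_datasets": ["internal-legal-summaries-v3"],
    "thresholds": {"ROUGE-L": 0.65, "BLEU": 0.50},
    "testing_protocol": "held-out validation with quarterly revalidation",
    "audit_interval_days": 90,
}
print(json.dumps(quality_declaration, indent=2))
```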
Governance Integration
- Quality metrics must align with Governance & Policy.
- Thresholds must be linked to risk classification (higher-risk products require stricter quality controls; see the sketch after this list).
- Metrics must be machine-readable for catalog integration and monitoring.
- Reports must be archived for audit and traceable to product versions (see Lifecycle & Versioning).
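To illustrate how thresholds could be tied to risk classification in a machine-readable way, the sketch below maps risk tiers to minimum controls; the tier names and numbers are assumptions, not prescribed values.

```python
# Hypothetical mapping from risk classification to minimum quality controls.
RISK_TIER_CONTROLS = {
    "low":    {"min_availability": 0.99,  "audit_interval_days": 180},
    "medium": {"min_availability": 0.995, "audit_interval_days": 90},
    "high":   {"min_availability": 0.999, "audit_interval_days": 30},
}

def controls_for(risk_tier):
    """Minimum controls an AI Product must declare for its risk tier."""
    return RISK_TIER_CONTROLS[risk_tier]

print(controls_for("high"))  # higher-risk products get stricter controls
```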
Example
AI Product: Legal Document Summarizer
- Performance Metrics: ROUGE-L ≥ 0.65, BLEU ≥ 0.50.
- Fairness Metrics: Consistency of summaries across dialectal English inputs.
- Robustness Metrics: Stable summarization under reordered sections.
- Reliability Metrics: 99.9% uptime, latency ≤ 300ms per request.
- Thresholds: Product flagged if any fairness or robustness metric drops below defined thresholds (see the flagging sketch below).
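The flagging rule can be expressed as a comparison of measured values against declared thresholds. The sketch below uses the numeric thresholds from this example; the measured values and the helper name are hypothetical.

```python
# Declared thresholds from the example above; measured values are hypothetical.
thresholds = {"ROUGE-L": 0.65, "BLEU": 0.50, "uptime": 0.999, "latency_ms": 300}
measured   = {"ROUGE-L": 0.67, "BLEU": 0.48, "uptime": 0.9995, "latency_ms": 240}

def flag_violations(measured, thresholds):
    """Return the metrics that fall outside their declared thresholds.
    Latency is a ceiling; the other metrics here are floors."""
    flags = []
    for name, limit in thresholds.items():
        value = measured[name]
        violated = value > limit if name == "latency_ms" else value < limit
        if violated:
            flags.append(name)
    return flags

print(flag_violations(measured, thresholds))  # ['BLEU'] -> product is flagged
```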
Summary
- Quality must be quantified, declared, and governed.
- Metrics span performance, fairness, robustness, and reliability.
- Thresholds and testing protocols are essential for trust, comparability, and compliance.
Principle: An AI Product without explicit quality metrics is unverifiable — and thus cannot be considered a true product.