Inference, Serving & Scaling

An AI Product must declare how it provides inference services, how it scales under load, and what guarantees consumers can expect.
This ensures AI Products are predictable, reliable, and fit for consumption in enterprise and ecosystem settings.


Why This Matters

  • Reliability → Consumers can trust the product’s availability and performance.
  • Transparency → Declared limits prevent misuse and misinterpretation.
  • Scalability → Supports different consumer needs (batch jobs, low-latency APIs, high-throughput streaming).
  • Governance → Performance and scaling policies influence compliance and cost.

Inference Modes

AI Products must declare supported inference modes:

  1. Batch Inference

    • Large volumes of data processed at scheduled intervals.
    • Typical for training data augmentation and analytics pipelines.
  2. Online (Real-Time) Inference

    • Low-latency responses to individual requests or small batches.
    • Typical for APIs powering user-facing applications.
  3. Streaming Inference

    • Continuous input and output streams.
    • Useful for speech recognition, event-driven systems, or agent coordination.
  4. Hybrid Inference

    • Combination of modes (e.g., batch preprocessing + real-time refinement).
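
One way to make such a declaration machine-readable is sketched below: the four modes become an enum attached to a small product descriptor. This is a minimal illustration in Python; the names `InferenceMode` and `InferenceDeclaration`, and the speech-to-text example, are hypothetical and not part of any prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class InferenceMode(str, Enum):
    """The four inference modes an AI Product can declare."""
    BATCH = "batch"          # scheduled, high-volume processing
    ONLINE = "online"        # low-latency request/response
    STREAMING = "streaming"  # continuous input/output streams
    HYBRID = "hybrid"        # combination of the above


@dataclass
class InferenceDeclaration:
    """Hypothetical descriptor listing the modes a product supports."""
    supported_modes: list[InferenceMode]
    notes: str = ""


# Example: a speech-to-text product offering batch transcription and live streaming.
speech_to_text = InferenceDeclaration(
    supported_modes=[InferenceMode.BATCH, InferenceMode.STREAMING],
    notes="Streaming limited to 16 kHz mono audio.",
)
```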

Serving Characteristics

Each AI Product must declare:

  • Latency profile → expected response times under typical conditions.
  • Throughput capacity → number of requests/second supported.
  • Concurrency model → limits on simultaneous consumers.
  • Resource utilization → expected CPU/GPU/TPU footprint per inference.
  • Error handling → error codes, retry policies, fallback behaviors.
  • SLA / SLOs → availability and performance guarantees, if applicable.
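
These characteristics can be captured in a small structured record so consumers can read them programmatically. The sketch below is illustrative only; field names such as `p99_latency_ms` and `availability_slo` are assumptions about how a product team might encode the bullet points above, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServingCharacteristics:
    """Hypothetical record of the serving characteristics listed above."""
    p50_latency_ms: float           # typical latency under normal conditions
    p99_latency_ms: float           # tail latency consumers should plan for
    max_throughput_rps: int         # requests per second supported per replica
    max_concurrent_consumers: int   # concurrency limit
    resource_footprint: str         # e.g. "1x GPU per inference", "CPU only"
    retry_policy: str               # e.g. "exponential backoff, max 3 retries"
    fallback_behavior: str          # behavior when capacity is exhausted
    availability_slo: Optional[float] = None  # e.g. 0.999 for 99.9% uptime


characteristics = ServingCharacteristics(
    p50_latency_ms=120,
    p99_latency_ms=450,
    max_throughput_rps=100,
    max_concurrent_consumers=500,
    resource_footprint="1x GPU per inference",
    retry_policy="exponential backoff, max 3 retries",
    fallback_behavior="HTTP 429 with Retry-After header",
    availability_slo=0.999,
)
```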

Scaling Strategies

AI Products must specify how they scale to meet demand:

  • Horizontal Scaling → multiple replicas across nodes.
  • Vertical Scaling → larger instances with more compute/memory.
  • Elastic Scaling → automatic adjustment based on load.
  • Edge Scaling → distributed deployments on edge devices.

Attributes to declare alongside the chosen strategy:

  • Scaling limits (min/max replicas, resource caps).
  • Cold start behavior (warmup time).
  • Cost model impacts (per-request, per-hour, reserved capacity).
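
A scaling declaration could combine the chosen strategy with these attributes in one record, as in the sketch below. `ScalingStrategy` and `ScalingPolicy` are hypothetical names used for illustration under the same assumptions as the earlier sketches.

```python
from dataclasses import dataclass
from enum import Enum


class ScalingStrategy(str, Enum):
    HORIZONTAL = "horizontal"  # multiple replicas across nodes
    VERTICAL = "vertical"      # larger instances with more compute/memory
    ELASTIC = "elastic"        # automatic adjustment based on load
    EDGE = "edge"              # distributed deployments on edge devices


@dataclass
class ScalingPolicy:
    """Hypothetical declaration of how a product scales and what that costs."""
    strategy: ScalingStrategy
    min_replicas: int
    max_replicas: int
    cold_start_seconds: float  # warmup time for a new replica
    cost_model: str            # per-request, per-hour, or reserved capacity


policy = ScalingPolicy(
    strategy=ScalingStrategy.ELASTIC,
    min_replicas=1,
    max_replicas=50,
    cold_start_seconds=30.0,
    cost_model="per-request",
)
```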

Example

LLM Inference API

  • Modes: Online (real-time), Streaming.
  • Latency Profile: < 200ms average per request (short prompts).
  • Throughput: 100 RPS per replica.
  • Scaling: Horizontal autoscaling (1–50 replicas).
  • Resource Utilization: 1 GPU per replica.
  • SLA: 99.9% uptime.
  • Error Handling: Graceful fallback if GPU quota exceeded.
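
For comparison, the same example could be published as a machine-readable declaration alongside the product. The dictionary below simply restates the bullet values; its key names are illustrative, not a standardized schema.

```python
# The LLM Inference API example above as a plain dictionary
# (key names are illustrative, not a defined standard).
llm_inference_api = {
    "modes": ["online", "streaming"],
    "latency_profile": {"average_ms": 200, "conditions": "short prompts"},
    "throughput": {"rps_per_replica": 100},
    "scaling": {"strategy": "horizontal_autoscaling", "min_replicas": 1, "max_replicas": 50},
    "resource_utilization": {"gpus_per_replica": 1},
    "sla": {"uptime": "99.9%"},
    "error_handling": {"gpu_quota_exceeded": "graceful fallback"},
}
```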

Summary

  • AI Products must declare inference modes, serving characteristics, and scaling strategies.
  • Consumers need visibility into latency, throughput, concurrency, and error handling.
  • Transparent declarations prevent over-commitment and ensure responsible consumption.

Principle: An AI Product is not truly consumable unless its inference and scaling behavior are explicit, measurable, and reliable.