Glossary

One-sentence definitions of AI infrastructure terms. Each term has its own page with a “how it works” and “when it matters” deep-dive.

AI FinOps: The emerging discipline of governing AI API spend — budgets, allocation, audit, and policy enforcement across teams and projects.
AI gateway: A proxy that sits between applications and AI providers, handling routing, caching, observability, and governance for LLM API traffic.
Cache fingerprinting: Deterministic hashing of a request's full input (messages, model, parameters) into a cache key, so that identical requests resolve to the same entry and requests differing in any field get distinct ones.
Edge inference: Running AI model inference (or proxy logic that fronts AI providers) at edge data centers near the customer, instead of from a single origin.
Exact vs semantic cache: Exact caches require byte-identical inputs; semantic caches use embeddings + cosine similarity to match near-equivalent prompts.
LLM observability: Instrumentation that captures per-request latency, cost, tokens, cache status, errors, and feedback, which is the raw data that AI cost engineering depends on.
LLM routing: Dynamically choosing which AI model handles each request based on task type, latency budget, and cost tolerance.
Multi-provider failover: Automatically routing a request to a backup AI provider when the primary returns an error or times out.
OpenAI-compatible endpoint: An API endpoint that speaks the OpenAI Chat Completions wire protocol, so any OpenAI SDK works against it without code changes.
Prompt caching: Reusing the model's processing of a shared prompt prefix across multiple requests to cut input-token cost.
Provider-native caching: Caching primitives built into the AI provider (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) rather than gateway-layer caching.
Semantic cache: A cache that returns prior responses when a new prompt is semantically similar (not just byte-identical) to a previously cached one.
Speculative routing: Firing the same request to two AI providers in parallel and returning the first successful response — a tail-latency reduction technique.
Task-type routing: Classifying a request's task category (code, summary, chat, code-fix, etc.) and routing to the model that's best at that category.
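
Example (cache fingerprinting and exact caching): The sketch below illustrates how the "Cache fingerprinting" and exact-cache entries above could fit together. It is a minimal illustration, not any particular gateway's implementation. The function names (fingerprint, lookup_or_store), the in-memory dict, and the choice of canonical JSON plus SHA-256 are all assumptions made for this example.

```python
import hashlib
import json

def fingerprint(model, messages, params):
    """Return a deterministic SHA-256 hex digest for one request.

    Hypothetical helper: serializes the full input as canonical JSON
    (sorted keys, fixed separators) so that dict ordering does not
    change the resulting key.
    """
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Assumed in-memory store; a real deployment would likely use a shared cache.
cache = {}

def lookup_or_store(model, messages, params, compute):
    """Return (response, was_cache_hit) for an exact-match cache.

    `compute` is a zero-argument callable standing in for the provider call.
    """
    key = fingerprint(model, messages, params)
    if key in cache:
        return cache[key], True
    response = compute()
    cache[key] = response
    return response, False
```

Two requests whose parameters differ only in key order produce the same fingerprint, while changing any value produces a different one. Values still need to be normalized before hashing: in this sketch a temperature of 0 and a temperature of 0.0 serialize differently and therefore get different keys.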