Glossary
One-sentence definitions of AI infrastructure terms. Each term has its own page with a deep dive on how it works and when it matters.
- AI FinOps
- The emerging discipline of governing AI API spend — budgets, allocation, audit, and policy enforcement across teams and projects.
- AI gateway
- A proxy that sits between applications and AI providers, handling routing, caching, observability, and governance for LLM API traffic.
- Cache fingerprinting
- Deterministic hashing of a request's full input (messages, model, parameters) into a cache key — the discipline that makes cache lookups correct.
- Edge inference
- Running AI model inference (or proxy logic that fronts AI providers) at edge data centers near the customer, instead of from a single origin.
- Exact vs semantic cache
- Exact caches require byte-identical inputs; semantic caches use embeddings and cosine similarity to match near-equivalent prompts.
- LLM observability
- Instrumentation that captures per-request latency, cost, tokens, cache status, errors, and feedback — the data plane for AI cost engineering.
- LLM routing
- Dynamically choosing which AI model handles each request based on task type, latency budget, and cost tolerance.
- Multi-provider failover
- Automatically routing a request to a backup AI provider when the primary returns an error or times out.
- OpenAI-compatible endpoint
- An API endpoint that speaks the OpenAI Chat Completions wire protocol, so an OpenAI SDK can typically be pointed at it by changing only the base URL and API key, with no other code changes.
- Prompt caching
- Reusing the model's processing of a shared prompt prefix across multiple requests to cut input-token cost.
- Provider-native caching
- Caching primitives built into the AI provider (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) rather than gateway-layer caching.
- Semantic cache
- A cache that returns prior responses when a new prompt is semantically similar (not just byte-identical) to a previously cached one.
- Speculative routing
- Firing the same request to two AI providers in parallel and returning the first successful response — a tail-latency reduction technique.
- Task-type routing
- Classifying a request's task category (code, summary, chat, code-fix, etc.) and routing to the model that's best at that category.
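
The difference between cache fingerprinting and semantic matching, as defined above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular gateway implements it: the function names, the `params` dictionary, the 0.95 similarity threshold, and the in-memory list of `(embedding, response)` pairs are all hypothetical, and embeddings are assumed to come from some external embedding model.

```python
import hashlib
import json
import math


def fingerprint(model, messages, params):
    """Exact-cache key: SHA-256 over a canonical JSON encoding of the full input."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,         # dict key order must not change the key
        separators=(",", ":"),  # no incidental whitespace differences
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_embedding, entries, threshold=0.95):
    """Return the cached response with the most similar embedding, or None.

    `entries` is a hypothetical list of (embedding, response) pairs; the
    threshold is an assumed value that would need tuning in practice.
    """
    best_response, best_score = None, threshold
    for embedding, response in entries:
        score = cosine_similarity(query_embedding, embedding)
        if score >= best_score:
            best_response, best_score = response, score
    return best_response
```

With this sketch, two requests that differ only in the key order of `params` produce the same fingerprint, while any change to the model, messages, or parameter values produces a different one. The semantic lookup trades that strictness for recall: a lower threshold matches more prompts but raises the risk of returning a response to a question that was not actually asked.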