Last updated:

Cache fingerprinting

Deterministic hashing of a request's full input (messages, model, parameters) into a cache key — the discipline that makes cache lookups correct.

How it works

Cache fingerprinting is the process of computing a deterministic identifier for an LLM API request — typically a SHA-256 hash over the canonicalised request — so that byte-equivalent requests map to the same cache key. The discipline matters because LLM API requests carry many fields and even tiny structural differences (a renamed field, a reordered messages array, an explicit null where the field was previously absent) produce different hashes, missing legitimate cache hits.

A well-designed fingerprint covers: the messages array (role + content per message, deterministically ordered), the model name, temperature, top_p, max_tokens, stop sequences (sorted), and tool definitions (when present). It excludes: request-ID, timestamps, idempotency keys, user-supplied opaque metadata, and any field that doesn't affect the response. Getting the inclusion and exclusion right is the whole game.

The normalisation discipline

Two requests that are logically equivalent need to fingerprint to the same hash. The canonical normalisations:

  • Sort tool definitions and stop sequences. A tools array of [A, B] and [B, A] produce the same model behaviour; they should fingerprint the same.
  • Resolve nullable fields to a canonical state. "temperature: null" and "temperature missing" and "temperature: 1.0" are often equivalent (1.0 is OpenAI's default). Pick one form, normalise to it.
  • Strip non-functional fields. Customer-supplied request IDs, internal extension fields, debug flags. None of these affect the model's response.
  • Use deterministic JSON serialisation. Python's default `json.dumps` doesn't guarantee field ordering across versions; pin it with `sort_keys=True`.
  • Normalise whitespace where semantically equivalent. Trailing newlines, repeated spaces — depends on whether the model treats them as semantically meaningful. Conservative: don't normalise. Aggressive: normalise.

The pitfalls that break implementations

The failure modes that show up in production:

Field ordering drift. The application uses one JSON serialiser; the cache layer uses another. Same data, different field order in the serialised form, different hash. Fix: canonicalise the request before hashing, using a single shared library.

Optional fields appearing inconsistently.Some requests have "top_p: 1.0", others omit it entirely (and the SDK defaults to 1.0). Different hashes for semantically identical requests. Fix: explicit defaults applied before hashing.

Extensions leaking into the fingerprint. A request carries a _prism_cache_controlmarker block that's relevant to the cache layer but shouldn't affect the model response. If the fingerprint includes it, two requests differing only in cache-control fingerprint differently — and the cache misses on what should be a hit. Fix: strip extensions before fingerprinting.

Streaming and non-streaming hashing to different keys.The `stream: true` parameter doesn't change the model output — same prompt produces the same content whether you stream it or buffer it. Some implementations include `stream` in the fingerprint, splitting the cache. Fix: exclude `stream` from the fingerprint; serve cached responses as non-streaming regardless of the request flag.

Why this matters

Exact-match cache hit rates of 5-15% are achievable with disciplined fingerprinting. Without it, hit rates collapse toward zero — every trivial structural variation invalidates what should be a cache hit. The discipline is small but consequential; getting it right is the difference between a cache that pays for itself and a cache that's overhead.

See your savings before you sign up

Run our calculator on your own workload. Real provider rates, real cache math, no email gate.

Frequently asked questions

Should I include the system prompt in the fingerprint?
Yes — the system prompt affects the model's response, so two requests with different system prompts should fingerprint differently even if user messages are identical. The only exception is if you're caching at a different layer (e.g. caching only the user-side response and re-applying the system prompt at serve time), which is unusual.
What about the user field in OpenAI's API?
The `user` field is metadata for OpenAI's abuse-detection systems; it doesn't affect the model's response. Exclude it from the fingerprint. Same logic applies to any pure-metadata field.
How do I handle tool definitions?
Tool definitions affect the model's response (the model knows what tools are available and may use them), so they belong in the fingerprint. But sort them by tool name first — a tools array of [A, B] vs [B, A] is the same set of tools and should produce the same response, so it should fingerprint the same.
What about idempotency keys?
Idempotency keys are caller-supplied and don't affect the response; they're metadata for the caller's deduplication. Exclude from the fingerprint. The cache layer is itself an idempotency mechanism — if the request fingerprints the same as a previous one, the cached response is by definition the right answer.