Cache fingerprinting
Deterministic hashing of a request's full input (messages, model, parameters) into a cache key — the discipline that makes cache lookups correct.
How it works
Cache fingerprinting is the process of computing a deterministic identifier for an LLM API request — typically a SHA-256 hash over the canonicalised request — so that byte-equivalent requests map to the same cache key. The discipline matters because LLM API requests carry many fields and even tiny structural differences (a renamed field, a reordered messages array, an explicit null where the field was previously absent) produce different hashes, missing legitimate cache hits.
A well-designed fingerprint covers: the messages array (role + content per message, deterministically ordered), the model name, temperature, top_p, max_tokens, stop sequences (sorted), and tool definitions (when present). It excludes: request-ID, timestamps, idempotency keys, user-supplied opaque metadata, and any field that doesn't affect the response. Getting the inclusion and exclusion right is the whole game.
The normalisation discipline
Two requests that are logically equivalent need to fingerprint to the same hash. The canonical normalisations:
- Sort tool definitions and stop sequences. A tools array of [A, B] and [B, A] produce the same model behaviour; they should fingerprint the same.
- Resolve nullable fields to a canonical state. "temperature: null" and "temperature missing" and "temperature: 1.0" are often equivalent (1.0 is OpenAI's default). Pick one form, normalise to it.
- Strip non-functional fields. Customer-supplied request IDs, internal extension fields, debug flags. None of these affect the model's response.
- Use deterministic JSON serialisation. Python's default `json.dumps` doesn't guarantee field ordering across versions; pin it with `sort_keys=True`.
- Normalise whitespace where semantically equivalent. Trailing newlines, repeated spaces — depends on whether the model treats them as semantically meaningful. Conservative: don't normalise. Aggressive: normalise.
The pitfalls that break implementations
The failure modes that show up in production:
Field ordering drift. The application uses one JSON serialiser; the cache layer uses another. Same data, different field order in the serialised form, different hash. Fix: canonicalise the request before hashing, using a single shared library.
Optional fields appearing inconsistently.Some requests have "top_p: 1.0", others omit it entirely (and the SDK defaults to 1.0). Different hashes for semantically identical requests. Fix: explicit defaults applied before hashing.
Extensions leaking into the fingerprint. A request carries a _prism_cache_controlmarker block that's relevant to the cache layer but shouldn't affect the model response. If the fingerprint includes it, two requests differing only in cache-control fingerprint differently — and the cache misses on what should be a hit. Fix: strip extensions before fingerprinting.
Streaming and non-streaming hashing to different keys.The `stream: true` parameter doesn't change the model output — same prompt produces the same content whether you stream it or buffer it. Some implementations include `stream` in the fingerprint, splitting the cache. Fix: exclude `stream` from the fingerprint; serve cached responses as non-streaming regardless of the request flag.
Why this matters
Exact-match cache hit rates of 5-15% are achievable with disciplined fingerprinting. Without it, hit rates collapse toward zero — every trivial structural variation invalidates what should be a cache hit. The discipline is small but consequential; getting it right is the difference between a cache that pays for itself and a cache that's overhead.