LLM Cache¶
LLM-specific cache compressor and data structures for storing and retrieving LLM requests and responses with rich metadata.
Package Separation
This module stays in causaliq-knowledge as it contains LLM-specific logic.
The core cache infrastructure (TokenCache, Compressor, JsonCompressor)
is in causaliq-core. Import from causaliq_core.cache.
Overview¶
The LLM cache module provides:
- LLMCompressor - Extends JsonCompressor with LLM-specific convenience methods
- LLMCacheEntry - Complete cache entry with request, response, and metadata
- LLMResponse - Response data (content, finish reason, model version)
- LLMMetadata - Rich metadata (provider, tokens, latency, cost)
- LLMTokenUsage - Token usage statistics
Design Philosophy¶
The LLM cache separates concerns:
| Component | Package |
|---|---|
| TokenCache | causaliq_core.cache |
| Compressor | causaliq_core.cache.compressors |
| JsonCompressor | causaliq_core.cache.compressors |
| LLMCompressor | causaliq_knowledge.llm.cache |
| LLMCacheEntry | causaliq_knowledge.llm.cache |
This allows the base cache to be reused across projects while keeping LLM-specific logic in the appropriate package.
Usage¶
Creating Cache Entries¶
Use the LLMCacheEntry.create() factory method for convenient entry creation:
```python
from causaliq_knowledge.llm.cache import LLMCacheEntry

entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
    content="Hi there! How can I help you today?",
    temperature=0.7,
    max_tokens=1000,
    provider="openai",
    latency_ms=850,
    input_tokens=25,
    output_tokens=15,
    cost_usd=0.002,
)
```
Compressing and Storing Entries¶
```python
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMCompressor

with TokenCache(":memory:") as cache:
    compressor = LLMCompressor()

    # Create an entry
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is Python?"}],
        content="Python is a programming language.",
        provider="openai",
    )

    # Compress to bytes
    blob = compressor.compress_entry(entry, cache)

    # Store in cache
    cache.put("request-hash", "llm", blob)
```
Retrieving and Decompressing Entries¶
```python
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCompressor

with TokenCache("cache.db") as cache:
    compressor = LLMCompressor()

    # Retrieve from cache
    blob = cache.get("request-hash", "llm")
    if blob:
        # Decompress to LLMCacheEntry
        entry = compressor.decompress_entry(blob, cache)
        print(f"Response: {entry.response.content}")
        print(f"Latency: {entry.metadata.latency_ms}ms")
```
Exporting and Importing Entries¶
Export entries to JSON for inspection or migration:
```python
from pathlib import Path
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMCompressor

compressor = LLMCompressor()

# Create entry
entry = LLMCacheEntry.create(
    model="claude-3",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="anthropic",
)

# Export to JSON file
compressor.export_entry(entry, Path("entry.json"))

# Import from JSON file
restored = compressor.import_entry(Path("entry.json"))
```
Using with TokenCache Auto-Compression¶
Register the compressor for automatic compression/decompression:
```python
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMCompressor

with TokenCache(":memory:") as cache:
    # Register compressor for "llm" entry type
    cache.register_compressor("llm", LLMCompressor())

    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi!",
    )

    # Store with auto-compression
    cache.put_data("hash123", "llm", entry.to_dict())

    # Retrieve with auto-decompression
    data = cache.get_data("hash123", "llm")
    restored = LLMCacheEntry.from_dict(data)
```
Using cached_completion with BaseLLMClient¶
The recommended way to use caching is via BaseLLMClient.cached_completion():
```python
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm import GroqClient, LLMConfig

with TokenCache("llm_cache.db") as cache:
    client = GroqClient(LLMConfig(model="llama-3.1-8b-instant"))
    client.set_cache(cache)

    # First call - makes API request, caches response with latency
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

    # Second call - returns from cache, no API call
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )
```
This automatically:
- Generates a deterministic cache key (SHA-256 of model + messages + params)
- Checks the cache before making an API call
- Captures latency with time.perf_counter()
- Stores the response with full metadata
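The key derivation can be sketched as follows. The exact fields and canonicalisation that BaseLLMClient hashes are assumptions here (and `cache_key` is an illustrative name, not the library's API), but the principle is a SHA-256 digest over a canonical JSON encoding of model, messages, and parameters:

```python
import hashlib
import json

def cache_key(model, messages, **params):
    # Canonical JSON (sorted keys, fixed separators) makes the digest
    # deterministic for semantically identical requests.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = cache_key(
    "gpt-4",
    [{"role": "user", "content": "What is Python?"}],
    temperature=0.7,
)
```

Because the encoding is canonical, repeating an identical request always reproduces the same key, while any change to the model, messages, or parameters yields a different one.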
Importing Pre-Cached Responses¶
Load cached responses from JSON files for testing or migration:
```python
from pathlib import Path
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCompressor

with TokenCache("llm_cache.db") as cache:
    cache.register_compressor("llm", LLMCompressor())

    # Import all LLM entries from directory
    count = cache.import_entries(Path("./cached_responses"), "llm")
    print(f"Imported {count} cached LLM responses")
```
Data Structures¶
LLMTokenUsage¶
Token usage statistics for billing and analysis:
```python
from causaliq_knowledge.llm.cache import LLMTokenUsage

usage = LLMTokenUsage(
    input=100,   # Prompt tokens
    output=50,   # Completion tokens
    total=150,   # Total tokens
)
```
LLMMetadata¶
Rich metadata for debugging and analytics:
```python
from causaliq_knowledge.llm.cache import LLMMetadata, LLMTokenUsage

metadata = LLMMetadata(
    provider="openai",
    timestamp="2024-01-15T10:30:00+00:00",
    latency_ms=850,
    tokens=LLMTokenUsage(input=100, output=50, total=150),
    cost_usd=0.005,
    cache_hit=False,
)

# Convert to/from dict
data = metadata.to_dict()
restored = LLMMetadata.from_dict(data)
```
LLMResponse¶
Response content and generation info:
```python
from causaliq_knowledge.llm.cache import LLMResponse

response = LLMResponse(
    content="The answer is 42.",
    finish_reason="stop",
    model_version="gpt-4-0125-preview",
)

# Convert to/from dict
data = response.to_dict()
restored = LLMResponse.from_dict(data)
```
LLMCacheEntry¶
Complete cache entry combining request and response:
```python
from causaliq_knowledge.llm.cache import (
    LLMCacheEntry, LLMResponse, LLMMetadata
)

entry = LLMCacheEntry(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1000,
    response=LLMResponse(content="Hi!"),
    metadata=LLMMetadata(provider="openai"),
)

# Preferred: use factory method
entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="openai",
)

# Convert to/from dict
data = entry.to_dict()
restored = LLMCacheEntry.from_dict(data)
```
API Reference¶
LLMCompressor¶
Compressor for LLM cache entries.
Extends JsonCompressor with LLM-specific convenience methods for compressing/decompressing LLMCacheEntry objects.
The compressor stores data in the standard JSON tokenised format, achieving 50-70% compression through the shared token dictionary.
Example:
```python
from causaliq_core.cache import TokenCache
from causaliq_knowledge.llm.cache import (
    LLMCompressor, LLMCacheEntry,
)

with TokenCache(":memory:") as cache:
    compressor = LLMCompressor()
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi there!",
        provider="openai",
    )
    blob = compressor.compress(entry.to_dict(), cache)
    data = compressor.decompress(blob, cache)
    restored = LLMCacheEntry.from_dict(data)
```
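The token-dictionary idea behind that compression can be illustrated with a toy sketch. This is not the actual wire format (the substitution table and the `toy_compress` helper are invented for illustration): repeated JSON keys are replaced with short tokens drawn from a dictionary shared across entries, so each stored payload shrinks.

```python
import json

def toy_compress(data: dict, dictionary: dict) -> bytes:
    # Serialise compactly, then substitute phrases that appear in the
    # shared dictionary with one-character tokens.
    text = json.dumps(data, sort_keys=True, separators=(",", ":"))
    for phrase, token in dictionary.items():
        text = text.replace(phrase, token)
    return text.encode("utf-8")

shared = {'"messages"': "\x01", '"content"': "\x02", '"role"': "\x03"}
blob = toy_compress(
    {"messages": [{"role": "user", "content": "Hello"}]}, shared
)
```

Because LLM cache entries share most of their key names, a dictionary built once and reused across entries removes most of the structural overhead.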
Methods:
- compress_entry – Compress an LLMCacheEntry to bytes.
- decompress_entry – Decompress bytes to an LLMCacheEntry.
- export_entry – Export an LLMCacheEntry to a JSON file.
- import_entry – Import an LLMCacheEntry from a JSON file.
- generate_export_filename – Generate a human-readable filename for export.
compress_entry¶
```python
compress_entry(entry: LLMCacheEntry, cache: TokenCache) -> bytes
```
Compress an LLMCacheEntry to bytes.
Convenience method that handles to_dict conversion.
Parameters:
- entry (LLMCacheEntry) – The cache entry to compress.
- cache (TokenCache) – TokenCache for token dictionary.
Returns:
- bytes – Compressed bytes.
decompress_entry¶
```python
decompress_entry(blob: bytes, cache: TokenCache) -> LLMCacheEntry
```
Decompress bytes to an LLMCacheEntry.
Convenience method that handles from_dict conversion.
Parameters:
- blob (bytes) – Compressed bytes.
- cache (TokenCache) – TokenCache for token dictionary.
Returns:
- LLMCacheEntry – Decompressed LLMCacheEntry.
export_entry¶
```python
export_entry(entry: LLMCacheEntry, path: Path) -> None
```
Export an LLMCacheEntry to a JSON file.
Uses to_export_dict() to parse JSON content for readability.
Parameters:
- entry (LLMCacheEntry) – The cache entry to export.
- path (Path) – Destination file path.
import_entry¶
```python
import_entry(path: Path) -> LLMCacheEntry
```
Import an LLMCacheEntry from a JSON file.
Parameters:
- path (Path) – Source file path.
Returns:
- LLMCacheEntry – Imported LLMCacheEntry.
generate_export_filename¶
```python
generate_export_filename(entry: LLMCacheEntry, cache_key: str) -> str
```
Generate a human-readable filename for export.
Creates a filename from request_id, timestamp, and provider: {request_id}_{yyyy-mm-dd-hhmmss}_{provider}.json
If request_id is not set, falls back to a short hash prefix.
Parameters:
- entry (LLMCacheEntry) – The cache entry to generate a filename for.
- cache_key (str) – The cache key (hash) for fallback uniqueness.
Returns:
- str – Human-readable filename with .json extension.
Example:
```python
compressor = LLMCompressor()
entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "test"}],
    content="Response",
    provider="openai",
    request_id="expt23",
)
```
Returns something like: expt23_2026-01-29-143052_openai.json
LLMCacheEntry¶
LLMCacheEntry (dataclass)¶
```python
LLMCacheEntry(
    model: str = "",
    messages: list[dict[str, Any]] = list(),
    temperature: float = 0.0,
    max_tokens: int | None = None,
    response: LLMResponse = LLMResponse(),
    metadata: LLMMetadata = LLMMetadata(),
)
```
Complete LLM cache entry with request, response, and metadata.
Attributes:
- model (str) – The model name requested.
- messages (list[dict[str, Any]]) – The conversation messages.
- temperature (float) – Sampling temperature.
- max_tokens (int | None) – Maximum tokens in response.
- response (LLMResponse) – The LLM response data.
- metadata (LLMMetadata) – Rich metadata for analysis.
Methods:
- create – Create a cache entry with common parameters.
- to_dict – Convert to dictionary for JSON serialisation.
- to_export_dict – Convert to dictionary for export with readable formatting.
- from_dict – Create from dictionary.
create (classmethod)¶
```python
create(
    model: str,
    messages: list[dict[str, Any]],
    content: str,
    *,
    temperature: float = 0.0,
    max_tokens: int | None = None,
    finish_reason: str = "stop",
    model_version: str = "",
    provider: str = "",
    latency_ms: int = 0,
    input_tokens: int = 0,
    output_tokens: int = 0,
    cost_usd: float = 0.0,
    request_id: str = ""
) -> LLMCacheEntry
```
Create a cache entry with common parameters.
Parameters:
- model (str) – The model name requested.
- messages (list[dict[str, Any]]) – The conversation messages.
- content (str) – The response content.
- temperature (float, default: 0.0) – Sampling temperature.
- max_tokens (int | None, default: None) – Maximum tokens in response.
- finish_reason (str, default: "stop") – Why generation stopped.
- model_version (str, default: "") – Actual model version.
- provider (str, default: "") – LLM provider name.
- latency_ms (int, default: 0) – Response time in milliseconds.
- input_tokens (int, default: 0) – Number of input tokens.
- output_tokens (int, default: 0) – Number of output tokens.
- cost_usd (float, default: 0.0) – Estimated cost in USD.
- request_id (str, default: "") – Optional identifier for the request (not part of hash).
Returns:
- LLMCacheEntry – Configured LLMCacheEntry.
to_export_dict¶
Convert to dictionary for export with readable formatting.
- Message content with newlines is split into arrays of lines
- Response JSON content is parsed into a proper JSON structure
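The line-splitting step can be approximated with a small stand-alone sketch (`split_multiline_content` is a hypothetical helper, not the library's actual code):

```python
def split_multiline_content(message: dict) -> dict:
    # Content containing newlines becomes a list of lines, which reads
    # (and diffs) far better in exported JSON files.
    content = message.get("content", "")
    if isinstance(content, str) and "\n" in content:
        return {**message, "content": content.split("\n")}
    return message

msg = {"role": "system", "content": "Line one\nLine two"}
exported = split_multiline_content(msg)
# exported["content"] == ["Line one", "Line two"]
```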
from_dict (classmethod)¶
```python
from_dict(data: dict[str, Any]) -> LLMCacheEntry
```
Create from dictionary.
Handles both the internal format (string content) and the export format (array of lines for content).
LLMResponse¶
LLMResponse (dataclass)¶
LLM response data for caching.
Attributes:
- content (str) – The full text response from the LLM.
- finish_reason (str) – Why generation stopped (stop, length, etc.).
- model_version (str) – Actual model version used.
Methods:
- to_dict – Convert to dictionary for JSON serialisation.
- to_export_dict – Convert to dictionary for export, parsing JSON content if valid.
- from_dict – Create from dictionary.
to_export_dict¶
Convert to dictionary for export, parsing JSON content if valid.
Unlike to_dict(), this attempts to parse the content as JSON for more readable exported files.
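A minimal sketch of that behaviour, assuming only that parsing is attempted and the raw string is kept on failure (`export_content` is a hypothetical helper):

```python
import json

def export_content(content: str):
    # Try to parse the response content as JSON so exported files show
    # structure rather than an escaped string; fall back to the raw text.
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return content

export_content('{"answer": 42}')    # -> {"answer": 42}
export_content("plain text reply")  # -> "plain text reply"
```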
LLMMetadata¶
LLMMetadata (dataclass)¶
```python
LLMMetadata(
    provider: str = "",
    timestamp: str = "",
    latency_ms: int = 0,
    tokens: LLMTokenUsage = LLMTokenUsage(),
    cost_usd: float = 0.0,
    cache_hit: bool = False,
    request_id: str = "",
)
```
Metadata for a cached LLM response.
Attributes:
- provider (str) – LLM provider name (openai, anthropic, etc.).
- timestamp (str) – When the original request was made (ISO format).
- latency_ms (int) – Response time in milliseconds.
- tokens (LLMTokenUsage) – Token usage statistics.
- cost_usd (float) – Estimated cost of the request in USD.
- cache_hit (bool) – Whether this was served from cache.
- request_id (str) – Optional identifier for the request (not in cache key).
Methods:
- to_dict – Convert to dictionary for JSON serialisation.
- from_dict – Create from dictionary.
LLMTokenUsage¶
LLMTokenUsage (dataclass)¶
Token usage statistics for an LLM request.
Attributes:
- input (int) – Number of tokens in the prompt.
- output (int) – Number of tokens in the completion.
- total (int) – Total tokens (input + output).