LLM Cache¶
LLM-specific cache encoder and data structures for storing and retrieving LLM requests and responses with rich metadata.
Package Separation
This module stays in causaliq-knowledge as it contains LLM-specific logic.
The core cache infrastructure (TokenCache, JsonEncoder) will migrate to causaliq-core.
Overview¶
The LLM cache module provides:
- LLMEntryEncoder - Extends JsonEncoder with LLM-specific convenience methods
- LLMCacheEntry - Complete cache entry with request, response, and metadata
- LLMResponse - Response data (content, finish reason, model version)
- LLMMetadata - Rich metadata (provider, tokens, latency, cost)
- LLMTokenUsage - Token usage statistics
Design Philosophy¶
The LLM cache separates concerns:
| Component | Package | Migration |
|---|---|---|
| TokenCache | causaliq_knowledge.cache | → causaliq-core |
| EntryEncoder | causaliq_knowledge.cache.encoders | → causaliq-core |
| JsonEncoder | causaliq_knowledge.cache.encoders | → causaliq-core |
| LLMEntryEncoder | causaliq_knowledge.llm.cache | Stays here |
| LLMCacheEntry | causaliq_knowledge.llm.cache | Stays here |
This allows the base cache to be reused across projects while keeping LLM-specific logic in the appropriate package.
Usage¶
Creating Cache Entries¶
Use the LLMCacheEntry.create() factory method for convenient entry creation:
```python
from causaliq_knowledge.llm.cache import LLMCacheEntry

entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
    content="Hi there! How can I help you today?",
    temperature=0.7,
    max_tokens=1000,
    provider="openai",
    latency_ms=850,
    input_tokens=25,
    output_tokens=15,
    cost_usd=0.002,
)
```
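
The factory populates the nested dataclasses described under Data Structures below, so the values can be read back directly. A quick illustrative check (the tokens mapping from input_tokens/output_tokens is assumed):

```python
# Illustrative: read back fields populated by create().
print(entry.response.content)       # "Hi there! How can I help you today?"
print(entry.metadata.provider)      # "openai"
print(entry.metadata.latency_ms)    # 850
print(entry.metadata.tokens.input)  # 25 (assumed mapping from input_tokens)
```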
Encoding and Storing Entries¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()

    # Create an entry
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is Python?"}],
        content="Python is a programming language.",
        provider="openai",
    )

    # Encode to bytes
    blob = encoder.encode_entry(entry, cache)

    # Store in cache
    cache.put("request-hash", "llm", blob)
```
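
Continuing the example above, the blob uses the tokenised JSON format, which the encoder reference below quotes at 50-70% compression. A rough, illustrative size check (actual ratios depend on your data and the shared token dictionary):

```python
import json

# Rough comparison of plain JSON vs the tokenised blob (illustrative only).
raw = json.dumps(entry.to_dict()).encode("utf-8")
print(f"plain JSON: {len(raw)} bytes, tokenised blob: {len(blob)} bytes")
```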
Retrieving and Decoding Entries¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("cache.db") as cache:
    encoder = LLMEntryEncoder()

    # Retrieve from cache
    blob = cache.get("request-hash", "llm")
    if blob:
        # Decode to LLMCacheEntry
        entry = encoder.decode_entry(blob, cache)
        print(f"Response: {entry.response.content}")
        print(f"Latency: {entry.metadata.latency_ms}ms")
```
Exporting and Importing Entries¶
Export entries to JSON for inspection or migration:
```python
from pathlib import Path
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

encoder = LLMEntryEncoder()

# Create entry
entry = LLMCacheEntry.create(
    model="claude-3",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="anthropic",
)

# Export to JSON file
encoder.export_entry(entry, Path("entry.json"))

# Import from JSON file
restored = encoder.import_entry(Path("entry.json"))
```
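
The exported file is plain JSON, so it can be inspected directly. A minimal sketch using the standard library, assuming the keys mirror the dataclass fields produced by to_dict():

```python
import json
from pathlib import Path

# Illustrative inspection; key names assumed to mirror to_dict() output.
data = json.loads(Path("entry.json").read_text())
print(data["model"])                 # "claude-3"
print(data["response"]["content"])   # "Hi!"
print(data["metadata"]["provider"])  # "anthropic"
```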
Using with TokenCache Auto-Encoding¶
Register the encoder for automatic encoding/decoding:
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    # Register encoder for "llm" entry type
    cache.register_encoder("llm", LLMEntryEncoder())

    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi!",
    )

    # Store with auto-encoding
    cache.put_data("hash123", "llm", entry.to_dict())

    # Retrieve with auto-decoding
    data = cache.get_data("hash123", "llm")
    restored = LLMCacheEntry.from_dict(data)
```
Using cached_completion with BaseLLMClient¶
The recommended way to use caching is via BaseLLMClient.cached_completion():
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm import GroqClient, LLMConfig

with TokenCache("llm_cache.db") as cache:
    client = GroqClient(LLMConfig(model="llama-3.1-8b-instant"))
    client.set_cache(cache)

    # First call - makes API request, caches response with latency
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

    # Second call - returns from cache, no API call
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )
```
This automatically:
- Generates a deterministic cache key (SHA-256 of model + messages + params); see the sketch below
- Checks the cache before making an API call
- Captures latency with time.perf_counter()
- Stores the response with full metadata
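
For illustration only, a deterministic key along those lines could be computed as follows. The request_key helper is hypothetical, and the exact canonicalisation used by cached_completion may differ:

```python
import hashlib
import json

def request_key(model: str, messages: list[dict], **params) -> str:
    """Hypothetical sketch: SHA-256 over model + messages + params."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = request_key(
    "gpt-4",
    [{"role": "user", "content": "What is Python?"}],
    temperature=0.0,
)
```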
Importing Pre-Cached Responses¶
Load cached responses from JSON files for testing or migration:
```python
from pathlib import Path
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("llm_cache.db") as cache:
    cache.register_encoder("llm", LLMEntryEncoder())

    # Import all LLM entries from directory
    count = cache.import_entries(Path("./cached_responses"), "llm")
    print(f"Imported {count} cached LLM responses")
```
Data Structures¶
LLMTokenUsage¶
Token usage statistics for billing and analysis:
```python
from causaliq_knowledge.llm.cache import LLMTokenUsage

usage = LLMTokenUsage(
    input=100,  # Prompt tokens
    output=50,  # Completion tokens
    total=150,  # Total tokens
)
```
LLMMetadata¶
Rich metadata for debugging and analytics:
```python
from causaliq_knowledge.llm.cache import LLMMetadata, LLMTokenUsage

metadata = LLMMetadata(
    provider="openai",
    timestamp="2024-01-15T10:30:00+00:00",
    latency_ms=850,
    tokens=LLMTokenUsage(input=100, output=50, total=150),
    cost_usd=0.005,
    cache_hit=False,
)

# Convert to/from dict
data = metadata.to_dict()
restored = LLMMetadata.from_dict(data)
```
LLMResponse¶
Response content and generation info:
```python
from causaliq_knowledge.llm.cache import LLMResponse

response = LLMResponse(
    content="The answer is 42.",
    finish_reason="stop",
    model_version="gpt-4-0125-preview",
)

# Convert to/from dict
data = response.to_dict()
restored = LLMResponse.from_dict(data)
```
LLMCacheEntry¶
Complete cache entry combining request and response:
```python
from causaliq_knowledge.llm.cache import (
    LLMCacheEntry, LLMResponse, LLMMetadata
)

entry = LLMCacheEntry(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1000,
    response=LLMResponse(content="Hi!"),
    metadata=LLMMetadata(provider="openai"),
)

# Preferred: use factory method
entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="openai",
)

# Convert to/from dict
data = entry.to_dict()
restored = LLMCacheEntry.from_dict(data)
```
API Reference¶
LLMEntryEncoder¶
Encoder for LLM cache entries.
Extends JsonEncoder with LLM-specific convenience methods for encoding/decoding LLMCacheEntry objects.
The encoder stores data in the standard JSON tokenised format, achieving 50-70% compression through the shared token dictionary.
Example:

```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import (
    LLMEntryEncoder, LLMCacheEntry,
)

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi there!",
        provider="openai",
    )
    blob = encoder.encode(entry.to_dict(), cache)
    data = encoder.decode(blob, cache)
    restored = LLMCacheEntry.from_dict(data)
```
Methods:
- encode_entry – Encode an LLMCacheEntry to bytes.
- decode_entry – Decode bytes to an LLMCacheEntry.
- export_entry – Export an LLMCacheEntry to a JSON file.
- import_entry – Import an LLMCacheEntry from a JSON file.
encode_entry¶

```python
encode_entry(entry: LLMCacheEntry, cache: TokenCache) -> bytes
```
Encode an LLMCacheEntry to bytes.
Convenience method that handles to_dict conversion.
Parameters:
- entry (LLMCacheEntry) – The cache entry to encode.
- cache (TokenCache) – TokenCache for token dictionary.

Returns:
- bytes – Encoded bytes.
decode_entry¶

```python
decode_entry(blob: bytes, cache: TokenCache) -> LLMCacheEntry
```
Decode bytes to an LLMCacheEntry.
Convenience method that handles from_dict conversion.
Parameters:
- blob (bytes) – Encoded bytes.
- cache (TokenCache) – TokenCache for token dictionary.

Returns:
- LLMCacheEntry – Decoded LLMCacheEntry.
export_entry¶

```python
export_entry(entry: LLMCacheEntry, path: Path) -> None
```
Export an LLMCacheEntry to a JSON file.
Parameters:
- entry (LLMCacheEntry) – The cache entry to export.
- path (Path) – Destination file path.
import_entry¶

```python
import_entry(path: Path) -> LLMCacheEntry
```
Import an LLMCacheEntry from a JSON file.
Parameters:
- path (Path) – Source file path.

Returns:
- LLMCacheEntry – Imported LLMCacheEntry.
LLMCacheEntry¶
LLMCacheEntry (dataclass)¶

```python
LLMCacheEntry(
    model: str = "",
    messages: list[dict[str, Any]] = list(),
    temperature: float = 0.0,
    max_tokens: int | None = None,
    response: LLMResponse = LLMResponse(),
    metadata: LLMMetadata = LLMMetadata(),
)
```
Complete LLM cache entry with request, response, and metadata.
Attributes:
- model (str) – The model name requested.
- messages (list[dict[str, Any]]) – The conversation messages.
- temperature (float) – Sampling temperature.
- max_tokens (int | None) – Maximum tokens in response.
- response (LLMResponse) – The LLM response data.
- metadata (LLMMetadata) – Rich metadata for analysis.
Methods:
- create – Create a cache entry with common parameters.
- to_dict – Convert to dictionary for JSON serialisation.
- from_dict – Create from dictionary.
create (classmethod)¶

```python
create(
    model: str,
    messages: list[dict[str, Any]],
    content: str,
    *,
    temperature: float = 0.0,
    max_tokens: int | None = None,
    finish_reason: str = "stop",
    model_version: str = "",
    provider: str = "",
    latency_ms: int = 0,
    input_tokens: int = 0,
    output_tokens: int = 0,
    cost_usd: float = 0.0
) -> LLMCacheEntry
```
Create a cache entry with common parameters.
Parameters:
- model (str) – The model name requested.
- messages (list[dict[str, Any]]) – The conversation messages.
- content (str) – The response content.
- temperature (float, default: 0.0) – Sampling temperature.
- max_tokens (int | None, default: None) – Maximum tokens in response.
- finish_reason (str, default: 'stop') – Why generation stopped.
- model_version (str, default: '') – Actual model version.
- provider (str, default: '') – LLM provider name.
- latency_ms (int, default: 0) – Response time in milliseconds.
- input_tokens (int, default: 0) – Number of input tokens.
- output_tokens (int, default: 0) – Number of output tokens.
- cost_usd (float, default: 0.0) – Estimated cost in USD.

Returns:
- LLMCacheEntry – Configured LLMCacheEntry.
LLMResponse¶
LLMResponse (dataclass)¶
LLM response data for caching.
Attributes:
- content (str) – The full text response from the LLM.
- finish_reason (str) – Why generation stopped (stop, length, etc.).
- model_version (str) – Actual model version used.
Methods:
- to_dict – Convert to dictionary for JSON serialisation.
- from_dict – Create from dictionary.
LLMMetadata¶
LLMMetadata (dataclass)¶

```python
LLMMetadata(
    provider: str = "",
    timestamp: str = "",
    latency_ms: int = 0,
    tokens: LLMTokenUsage = LLMTokenUsage(),
    cost_usd: float = 0.0,
    cache_hit: bool = False,
)
```
Metadata for a cached LLM response.
Attributes:
- provider (str) – LLM provider name (openai, anthropic, etc.).
- timestamp (str) – When the original request was made (ISO format).
- latency_ms (int) – Response time in milliseconds.
- tokens (LLMTokenUsage) – Token usage statistics.
- cost_usd (float) – Estimated cost of the request in USD.
- cache_hit (bool) – Whether this was served from cache.
Methods:
- to_dict – Convert to dictionary for JSON serialisation.
- from_dict – Create from dictionary.
LLMTokenUsage¶
LLMTokenUsage (dataclass)¶
Token usage statistics for an LLM request.
Attributes:
- input (int) – Number of tokens in the prompt.
- output (int) – Number of tokens in the completion.
- total (int) – Total tokens (input + output).