LLM Cache

LLM-specific cache encoder and data structures for storing and retrieving LLM requests and responses with rich metadata.

Package Separation

This module stays in causaliq-knowledge as it contains LLM-specific logic. The core cache infrastructure (TokenCache, JsonEncoder) will migrate to causaliq-core.

Overview

The LLM cache module provides:

  • LLMEntryEncoder - Extends JsonEncoder with LLM-specific convenience methods
  • LLMCacheEntry - Complete cache entry with request, response, and metadata
  • LLMResponse - Response data (content, finish reason, model version)
  • LLMMetadata - Rich metadata (provider, tokens, latency, cost)
  • LLMTokenUsage - Token usage statistics

Design Philosophy

The LLM cache separates concerns:

Component        Package                             Migration
TokenCache       causaliq_knowledge.cache            causaliq-core
EntryEncoder     causaliq_knowledge.cache.encoders   causaliq-core
JsonEncoder      causaliq_knowledge.cache.encoders   causaliq-core
LLMEntryEncoder  causaliq_knowledge.llm.cache        Stays here
LLMCacheEntry    causaliq_knowledge.llm.cache        Stays here

This allows the base cache to be reused across projects while keeping LLM-specific logic in the appropriate package.
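
To make the intended layering concrete, the sketch below shows roughly how the LLM-specific encoder sits on top of the generic JSON encoder. The base-class name and module path come from the table above; the class name and method bodies are illustrative assumptions based on the convenience methods documented later, not the actual implementation.

# Illustrative sketch of the layering, not the real implementation.
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder  # due to move to causaliq-core
from causaliq_knowledge.llm.cache import LLMCacheEntry

class LLMEntryEncoderSketch(JsonEncoder):
    """LLM-specific convenience layer over the generic JSON encoder."""

    def encode_entry(self, entry: LLMCacheEntry, cache: TokenCache) -> bytes:
        # Delegate serialisation to the generic tokenised JSON encoder.
        return self.encode(entry.to_dict(), cache)

    def decode_entry(self, blob: bytes, cache: TokenCache) -> LLMCacheEntry:
        # Rebuild the typed entry from the decoded dictionary.
        return LLMCacheEntry.from_dict(self.decode(blob, cache))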

Usage

Creating Cache Entries

Use the LLMCacheEntry.create() factory method for convenient entry creation:

from causaliq_knowledge.llm.cache import LLMCacheEntry

entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
    content="Hi there! How can I help you today?",
    temperature=0.7,
    max_tokens=1000,
    provider="openai",
    latency_ms=850,
    input_tokens=25,
    output_tokens=15,
    cost_usd=0.002,
)

Encoding and Storing Entries

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()

    # Create an entry
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is Python?"}],
        content="Python is a programming language.",
        provider="openai",
    )

    # Encode to bytes
    blob = encoder.encode_entry(entry, cache)

    # Store in cache
    cache.put("request-hash", "llm", blob)

Retrieving and Decoding Entries

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("cache.db") as cache:
    encoder = LLMEntryEncoder()

    # Retrieve from cache
    blob = cache.get("request-hash", "llm")
    if blob:
        # Decode to LLMCacheEntry
        entry = encoder.decode_entry(blob, cache)
        print(f"Response: {entry.response.content}")
        print(f"Latency: {entry.metadata.latency_ms}ms")

Exporting and Importing Entries

Export entries to JSON for inspection or migration:

from pathlib import Path
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

encoder = LLMEntryEncoder()

# Create entry
entry = LLMCacheEntry.create(
    model="claude-3",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="anthropic",
)

# Export to JSON file
encoder.export_entry(entry, Path("entry.json"))

# Import from JSON file
restored = encoder.import_entry(Path("entry.json"))

Using with TokenCache Auto-Encoding

Register the encoder for automatic encoding/decoding:

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    # Register encoder for "llm" entry type
    cache.register_encoder("llm", LLMEntryEncoder())

    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi!",
    )

    # Store with auto-encoding
    cache.put_data("hash123", "llm", entry.to_dict())

    # Retrieve with auto-decoding
    data = cache.get_data("hash123", "llm")
    restored = LLMCacheEntry.from_dict(data)

Using cached_completion with BaseLLMClient

The recommended way to use caching is via BaseLLMClient.cached_completion():

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm import GroqClient, LLMConfig

with TokenCache("llm_cache.db") as cache:
    client = GroqClient(LLMConfig(model="llama-3.1-8b-instant"))
    client.set_cache(cache)

    # First call - makes API request, caches response with latency
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

    # Second call - returns from cache, no API call
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

This automatically:

  • Generates a deterministic cache key (SHA-256 of model + messages + params); see the sketch after this list
  • Checks cache before making API calls
  • Captures latency with time.perf_counter()
  • Stores response with full metadata
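
For illustration, a deterministic key of this kind can be derived by hashing a canonical JSON rendering of the request. The helper below is a sketch under that assumption; the exact fields and ordering that cached_completion hashes may differ.

import hashlib
import json
from typing import Any

def sketch_cache_key(
    model: str, messages: list[dict[str, Any]], **params: Any
) -> str:
    # Canonicalise the request so identical requests always hash identically.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = sketch_cache_key(
    "llama-3.1-8b-instant",
    [{"role": "user", "content": "What is Python?"}],
    temperature=0.0,
)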

Importing Pre-Cached Responses

Load cached responses from JSON files for testing or migration:

from pathlib import Path
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("llm_cache.db") as cache:
    cache.register_encoder("llm", LLMEntryEncoder())

    # Import all LLM entries from directory
    count = cache.import_entries(Path("./cached_responses"), "llm")
    print(f"Imported {count} cached LLM responses")

Data Structures

LLMTokenUsage

Token usage statistics for billing and analysis:

from causaliq_knowledge.llm.cache import LLMTokenUsage

usage = LLMTokenUsage(
    input=100,   # Prompt tokens
    output=50,   # Completion tokens  
    total=150,   # Total tokens
)

LLMMetadata

Rich metadata for debugging and analytics:

from causaliq_knowledge.llm.cache import LLMMetadata, LLMTokenUsage

metadata = LLMMetadata(
    provider="openai",
    timestamp="2024-01-15T10:30:00+00:00",
    latency_ms=850,
    tokens=LLMTokenUsage(input=100, output=50, total=150),
    cost_usd=0.005,
    cache_hit=False,
)

# Convert to/from dict
data = metadata.to_dict()
restored = LLMMetadata.from_dict(data)

LLMResponse

Response content and generation info:

from causaliq_knowledge.llm.cache import LLMResponse

response = LLMResponse(
    content="The answer is 42.",
    finish_reason="stop",
    model_version="gpt-4-0125-preview",
)

# Convert to/from dict
data = response.to_dict()
restored = LLMResponse.from_dict(data)

LLMCacheEntry

Complete cache entry combining request and response:

from causaliq_knowledge.llm.cache import (
    LLMCacheEntry, LLMResponse, LLMMetadata
)

entry = LLMCacheEntry(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1000,
    response=LLMResponse(content="Hi!"),
    metadata=LLMMetadata(provider="openai"),
)

# Preferred: use factory method
entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="openai",
)

# Convert to/from dict
data = entry.to_dict()
restored = LLMCacheEntry.from_dict(data)

API Reference

LLMEntryEncoder

LLMEntryEncoder

Encoder for LLM cache entries.

Extends JsonEncoder with LLM-specific convenience methods for encoding/decoding LLMCacheEntry objects.

The encoder stores data in the standard JSON tokenised format, achieving 50-70% compression through the shared token dictionary.

Example

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import (
    LLMEntryEncoder, LLMCacheEntry,
)

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi there!",
        provider="openai",
    )
    blob = encoder.encode(entry.to_dict(), cache)
    data = encoder.decode(blob, cache)
    restored = LLMCacheEntry.from_dict(data)
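
As a rough illustration of why a shared token dictionary reduces size (a toy sketch only, not the library's actual wire format): repeated strings are stored once in the dictionary, and each entry then carries only short integer IDs.

# Toy sketch only: not the real tokenised format.
def sketch_tokenise(obj: dict, dictionary: dict[str, int]) -> list[tuple[int, int]]:
    def token_id(s: str) -> int:
        # Assign the next integer ID to strings not seen before.
        if s not in dictionary:
            dictionary[s] = len(dictionary)
        return dictionary[s]
    return [(token_id(k), token_id(str(v))) for k, v in obj.items()]

shared: dict[str, int] = {}
first = sketch_tokenise({"model": "gpt-4", "provider": "openai"}, shared)
second = sketch_tokenise({"model": "gpt-4", "provider": "openai"}, shared)
# Both entries reference the same four dictionary IDs, so repeated keys
# and values are stored once regardless of how many entries exist.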

Methods:

  • encode_entry

    Encode an LLMCacheEntry to bytes.

  • decode_entry

    Decode bytes to an LLMCacheEntry.

  • export_entry

    Export an LLMCacheEntry to a JSON file.

  • import_entry

    Import an LLMCacheEntry from a JSON file.

encode_entry

encode_entry(entry: LLMCacheEntry, cache: TokenCache) -> bytes

Encode an LLMCacheEntry to bytes.

Convenience method that handles to_dict conversion.

Parameters:

  • entry

    (LLMCacheEntry) –

    The cache entry to encode.

  • cache

    (TokenCache) –

    TokenCache for token dictionary.

Returns:

  • bytes

    Encoded bytes.

decode_entry

decode_entry(blob: bytes, cache: TokenCache) -> LLMCacheEntry

Decode bytes to an LLMCacheEntry.

Convenience method that handles from_dict conversion.

Parameters:

  • blob

    (bytes) –

    Encoded bytes.

  • cache

    (TokenCache) –

    TokenCache for token dictionary.

Returns:

  • LLMCacheEntry

    The decoded cache entry.

export_entry

export_entry(entry: LLMCacheEntry, path: Path) -> None

Export an LLMCacheEntry to a JSON file.

Parameters:

  • entry

    (LLMCacheEntry) –

    The cache entry to export.

  • path

    (Path) –

    Destination file path.

import_entry

import_entry(path: Path) -> LLMCacheEntry

Import an LLMCacheEntry from a JSON file.

Parameters:

  • path

    (Path) –

    Source file path.

Returns:

  • LLMCacheEntry

    The imported cache entry.

LLMCacheEntry

LLMCacheEntry dataclass

LLMCacheEntry(
    model: str = "",
    messages: list[dict[str, Any]] = list(),
    temperature: float = 0.0,
    max_tokens: int | None = None,
    response: LLMResponse = LLMResponse(),
    metadata: LLMMetadata = LLMMetadata(),
)

Complete LLM cache entry with request, response, and metadata.

Attributes:

  • model (str) –

    The model name requested.

  • messages (list[dict[str, Any]]) –

    The conversation messages.

  • temperature (float) –

    Sampling temperature.

  • max_tokens (int | None) –

    Maximum tokens in response.

  • response (LLMResponse) –

    The LLM response data.

  • metadata (LLMMetadata) –

    Rich metadata for analysis.

Methods:

  • create

    Create a cache entry with common parameters.

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

create classmethod

create(
    model: str,
    messages: list[dict[str, Any]],
    content: str,
    *,
    temperature: float = 0.0,
    max_tokens: int | None = None,
    finish_reason: str = "stop",
    model_version: str = "",
    provider: str = "",
    latency_ms: int = 0,
    input_tokens: int = 0,
    output_tokens: int = 0,
    cost_usd: float = 0.0
) -> LLMCacheEntry

Create a cache entry with common parameters.

Parameters:

  • model

    (str) –

    The model name requested.

  • messages

    (list[dict[str, Any]]) –

    The conversation messages.

  • content

    (str) –

    The response content.

  • temperature

    (float, default: 0.0 ) –

    Sampling temperature.

  • max_tokens

    (int | None, default: None ) –

    Maximum tokens in response.

  • finish_reason

    (str, default: 'stop' ) –

    Why generation stopped.

  • model_version

    (str, default: '' ) –

    Actual model version.

  • provider

    (str, default: '' ) –

    LLM provider name.

  • latency_ms

    (int, default: 0 ) –

    Response time in milliseconds.

  • input_tokens

    (int, default: 0 ) –

    Number of input tokens.

  • output_tokens

    (int, default: 0 ) –

    Number of output tokens.

  • cost_usd

    (float, default: 0.0 ) –

    Estimated cost in USD.

Returns:

  • LLMCacheEntry

    The created cache entry.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMCacheEntry

Create from dictionary.

LLMResponse

LLMResponse dataclass

LLMResponse(content: str = '', finish_reason: str = 'stop', model_version: str = '')

LLM response data for caching.

Attributes:

  • content (str) –

    The full text response from the LLM.

  • finish_reason (str) –

    Why generation stopped (stop, length, etc.).

  • model_version (str) –

    Actual model version used.

Methods:

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMResponse

Create from dictionary.

LLMMetadata

LLMMetadata dataclass

LLMMetadata(
    provider: str = "",
    timestamp: str = "",
    latency_ms: int = 0,
    tokens: LLMTokenUsage = LLMTokenUsage(),
    cost_usd: float = 0.0,
    cache_hit: bool = False,
)

Metadata for a cached LLM response.

Attributes:

  • provider (str) –

    LLM provider name (openai, anthropic, etc.).

  • timestamp (str) –

    When the original request was made (ISO format).

  • latency_ms (int) –

    Response time in milliseconds.

  • tokens (LLMTokenUsage) –

    Token usage statistics.

  • cost_usd (float) –

    Estimated cost of the request in USD.

  • cache_hit (bool) –

    Whether this was served from cache.

Methods:

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMMetadata

Create from dictionary.

LLMTokenUsage

LLMTokenUsage dataclass

LLMTokenUsage(input: int = 0, output: int = 0, total: int = 0)

Token usage statistics for an LLM request.

Attributes:

  • input (int) –

    Number of tokens in the prompt.

  • output (int) –

    Number of tokens in the completion.

  • total (int) –

    Total tokens (input + output).