LLM Cache

LLM-specific cache encoder and data structures for storing and retrieving LLM requests and responses with rich metadata.

Package Separation

This module stays in causaliq-knowledge as it contains LLM-specific logic. The core cache infrastructure (TokenCache, JsonEncoder) will migrate to causaliq-core.

Overview

The LLM cache module provides:

  • LLMEntryEncoder - Extends JsonEncoder with LLM-specific convenience methods
  • LLMCacheEntry - Complete cache entry with request, response, and metadata
  • LLMResponse - Response data (content, finish reason, model version)
  • LLMMetadata - Rich metadata (provider, tokens, latency, cost)
  • LLMTokenUsage - Token usage statistics

Design Philosophy

The LLM cache separates concerns:

Component        Package                             Migration
TokenCache       causaliq_knowledge.cache            causaliq-core
EntryEncoder     causaliq_knowledge.cache.encoders   causaliq-core
JsonEncoder      causaliq_knowledge.cache.encoders   causaliq-core
LLMEntryEncoder  causaliq_knowledge.llm.cache        Stays here
LLMCacheEntry    causaliq_knowledge.llm.cache        Stays here

This allows the base cache to be reused across projects while keeping LLM-specific logic in the appropriate package.
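
To make the intended layering concrete, the sketch below shows roughly how the LLM-specific encoder sits on top of the generic JSON encoder. The base-class name and module path come from the table above; the class name and method bodies are illustrative assumptions based on the convenience methods documented later, not the actual implementation.

# Illustrative sketch of the layering, not the real implementation.
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder  # due to move to causaliq-core
from causaliq_knowledge.llm.cache import LLMCacheEntry

class LLMEntryEncoderSketch(JsonEncoder):
    """LLM-specific convenience layer over the generic JSON encoder."""

    def encode_entry(self, entry: LLMCacheEntry, cache: TokenCache) -> bytes:
        # Delegate serialisation to the generic tokenised JSON encoder.
        return self.encode(entry.to_dict(), cache)

    def decode_entry(self, blob: bytes, cache: TokenCache) -> LLMCacheEntry:
        # Rebuild the typed entry from the decoded dictionary.
        return LLMCacheEntry.from_dict(self.decode(blob, cache))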

Usage

Creating Cache Entries

Use the LLMCacheEntry.create() factory method for convenient entry creation:

from causaliq_knowledge.llm.cache import LLMCacheEntry

entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
    content="Hi there! How can I help you today?",
    temperature=0.7,
    max_tokens=1000,
    provider="openai",
    latency_ms=850,
    input_tokens=25,
    output_tokens=15,
    cost_usd=0.002,
)

Encoding and Storing Entries

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()

    # Create an entry
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is Python?"}],
        content="Python is a programming language.",
        provider="openai",
    )

    # Encode to bytes
    blob = encoder.encode_entry(entry, cache)

    # Store in cache
    cache.put("request-hash", "llm", blob)

Retrieving and Decoding Entries

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("cache.db") as cache:
    encoder = LLMEntryEncoder()

    # Retrieve from cache
    blob = cache.get("request-hash", "llm")
    if blob:
        # Decode to LLMCacheEntry
        entry = encoder.decode_entry(blob, cache)
        print(f"Response: {entry.response.content}")
        print(f"Latency: {entry.metadata.latency_ms}ms")

Exporting and Importing Entries

Export entries to JSON for inspection or migration:

from pathlib import Path
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

encoder = LLMEntryEncoder()

# Create entry
entry = LLMCacheEntry.create(
    model="claude-3",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="anthropic",
)

# Export to JSON file
encoder.export_entry(entry, Path("entry.json"))

# Import from JSON file
restored = encoder.import_entry(Path("entry.json"))

Using with TokenCache Auto-Encoding

Register the encoder for automatic encoding/decoding:

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMCacheEntry, LLMEntryEncoder

with TokenCache(":memory:") as cache:
    # Register encoder for "llm" entry type
    cache.register_encoder("llm", LLMEntryEncoder())

    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi!",
    )

    # Store with auto-encoding
    cache.put_data("hash123", "llm", entry.to_dict())

    # Retrieve with auto-decoding
    data = cache.get_data("hash123", "llm")
    restored = LLMCacheEntry.from_dict(data)

Using cached_completion with BaseLLMClient

The recommended way to use caching is via BaseLLMClient.cached_completion():

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm import GroqClient, LLMConfig

with TokenCache("llm_cache.db") as cache:
    client = GroqClient(LLMConfig(model="llama-3.1-8b-instant"))
    client.set_cache(cache)

    # First call - makes API request, caches response with latency
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

    # Second call - returns from cache, no API call
    response = client.cached_completion(
        [{"role": "user", "content": "What is Python?"}]
    )

This automatically:

  • Generates a deterministic cache key (SHA-256 of model + messages + params); see the sketch after this list
  • Checks cache before making API calls
  • Captures latency with time.perf_counter()
  • Stores response with full metadata
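
For illustration, a deterministic key of this kind can be derived by hashing a canonical JSON rendering of the request. The helper below is a sketch under that assumption; the exact fields and ordering that cached_completion hashes may differ.

import hashlib
import json
from typing import Any

def sketch_cache_key(
    model: str, messages: list[dict[str, Any]], **params: Any
) -> str:
    # Canonicalise the request so identical requests always hash identically.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = sketch_cache_key(
    "llama-3.1-8b-instant",
    [{"role": "user", "content": "What is Python?"}],
    temperature=0.0,
)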

Importing Pre-Cached Responses

Load cached responses from JSON files for testing or migration:

from pathlib import Path
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import LLMEntryEncoder

with TokenCache("llm_cache.db") as cache:
    cache.register_encoder("llm", LLMEntryEncoder())

    # Import all LLM entries from directory
    count = cache.import_entries(Path("./cached_responses"), "llm")
    print(f"Imported {count} cached LLM responses")

Data Structures

LLMTokenUsage

Token usage statistics for billing and analysis:

from causaliq_knowledge.llm.cache import LLMTokenUsage

usage = LLMTokenUsage(
    input=100,   # Prompt tokens
    output=50,   # Completion tokens  
    total=150,   # Total tokens
)

LLMMetadata

Rich metadata for debugging and analytics:

from causaliq_knowledge.llm.cache import LLMMetadata, LLMTokenUsage

metadata = LLMMetadata(
    provider="openai",
    timestamp="2024-01-15T10:30:00+00:00",
    latency_ms=850,
    tokens=LLMTokenUsage(input=100, output=50, total=150),
    cost_usd=0.005,
    cache_hit=False,
)

# Convert to/from dict
data = metadata.to_dict()
restored = LLMMetadata.from_dict(data)

LLMResponse

Response content and generation info:

from causaliq_knowledge.llm.cache import LLMResponse

response = LLMResponse(
    content="The answer is 42.",
    finish_reason="stop",
    model_version="gpt-4-0125-preview",
)

# Convert to/from dict
data = response.to_dict()
restored = LLMResponse.from_dict(data)

LLMCacheEntry

Complete cache entry combining request and response:

from causaliq_knowledge.llm.cache import (
    LLMCacheEntry, LLMResponse, LLMMetadata
)

entry = LLMCacheEntry(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1000,
    response=LLMResponse(content="Hi!"),
    metadata=LLMMetadata(provider="openai"),
)

# Preferred: use factory method
entry = LLMCacheEntry.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    content="Hi!",
    provider="openai",
)

# Convert to/from dict
data = entry.to_dict()
restored = LLMCacheEntry.from_dict(data)

API Reference

LLMEntryEncoder

LLMEntryEncoder

Encoder for LLM cache entries.

Extends JsonEncoder with LLM-specific convenience methods for encoding/decoding LLMCacheEntry objects.

The encoder stores data in the standard JSON tokenised format, achieving 50-70% compression through the shared token dictionary.

Example

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.llm.cache import (
    LLMEntryEncoder, LLMCacheEntry,
)

with TokenCache(":memory:") as cache:
    encoder = LLMEntryEncoder()
    entry = LLMCacheEntry.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        content="Hi there!",
        provider="openai",
    )
    blob = encoder.encode(entry.to_dict(), cache)
    data = encoder.decode(blob, cache)
    restored = LLMCacheEntry.from_dict(data)
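
As a rough illustration of why a shared token dictionary reduces size (a toy sketch only, not the library's actual wire format): repeated strings are stored once in the dictionary, and each entry then carries only short integer IDs.

# Toy sketch only: not the real tokenised format.
def sketch_tokenise(obj: dict, dictionary: dict[str, int]) -> list[tuple[int, int]]:
    def token_id(s: str) -> int:
        # Assign the next integer ID to strings not seen before.
        if s not in dictionary:
            dictionary[s] = len(dictionary)
        return dictionary[s]
    return [(token_id(k), token_id(str(v))) for k, v in obj.items()]

shared: dict[str, int] = {}
first = sketch_tokenise({"model": "gpt-4", "provider": "openai"}, shared)
second = sketch_tokenise({"model": "gpt-4", "provider": "openai"}, shared)
# Both entries reference the same four dictionary IDs, so repeated keys
# and values are stored once regardless of how many entries exist.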

Methods:

  • encode_entry

    Encode an LLMCacheEntry to bytes.

  • decode_entry

    Decode bytes to an LLMCacheEntry.

  • export_entry

    Export an LLMCacheEntry to a JSON file.

  • import_entry

    Import an LLMCacheEntry from a JSON file.

encode_entry

encode_entry(entry: LLMCacheEntry, cache: TokenCache) -> bytes

Encode an LLMCacheEntry to bytes.

Convenience method that handles to_dict conversion.

Parameters:

  • entry

    (LLMCacheEntry) –

    The cache entry to encode.

  • cache

    (TokenCache) –

    TokenCache for token dictionary.

Returns:

  • bytes

    Encoded bytes.

decode_entry

decode_entry(blob: bytes, cache: TokenCache) -> LLMCacheEntry

Decode bytes to an LLMCacheEntry.

Convenience method that handles from_dict conversion.

Parameters:

  • blob

    (bytes) –

    Encoded bytes.

  • cache

    (TokenCache) –

    TokenCache for token dictionary.

Returns:

  • LLMCacheEntry

    The decoded cache entry.

export_entry

export_entry(entry: LLMCacheEntry, path: Path) -> None

Export an LLMCacheEntry to a JSON file.

Parameters:

  • entry

    (LLMCacheEntry) –

    The cache entry to export.

  • path

    (Path) –

    Destination file path.

import_entry

import_entry(path: Path) -> LLMCacheEntry

Import an LLMCacheEntry from a JSON file.

Parameters:

  • path

    (Path) –

    Source file path.

Returns:

  • LLMCacheEntry

    The imported cache entry.

LLMCacheEntry

LLMCacheEntry dataclass

LLMCacheEntry(
    model: str = "",
    messages: list[dict[str, Any]] = list(),
    temperature: float = 0.0,
    max_tokens: int | None = None,
    response: LLMResponse = LLMResponse(),
    metadata: LLMMetadata = LLMMetadata(),
)

Complete LLM cache entry with request, response, and metadata.

Attributes:

  • model (str) –

    The model name requested.

  • messages (list[dict[str, Any]]) –

    The conversation messages.

  • temperature (float) –

    Sampling temperature.

  • max_tokens (int | None) –

    Maximum tokens in response.

  • response (LLMResponse) –

    The LLM response data.

  • metadata (LLMMetadata) –

    Rich metadata for analysis.

Methods:

  • create

    Create a cache entry with common parameters.

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

create classmethod

create(
    model: str,
    messages: list[dict[str, Any]],
    content: str,
    *,
    temperature: float = 0.0,
    max_tokens: int | None = None,
    finish_reason: str = "stop",
    model_version: str = "",
    provider: str = "",
    latency_ms: int = 0,
    input_tokens: int = 0,
    output_tokens: int = 0,
    cost_usd: float = 0.0
) -> LLMCacheEntry

Create a cache entry with common parameters.

Parameters:

  • model

    (str) –

    The model name requested.

  • messages

    (list[dict[str, Any]]) –

    The conversation messages.

  • content

    (str) –

    The response content.

  • temperature

    (float, default: 0.0 ) –

    Sampling temperature.

  • max_tokens

    (int | None, default: None ) –

    Maximum tokens in response.

  • finish_reason

    (str, default: 'stop' ) –

    Why generation stopped.

  • model_version

    (str, default: '' ) –

    Actual model version.

  • provider

    (str, default: '' ) –

    LLM provider name.

  • latency_ms

    (int, default: 0 ) –

    Response time in milliseconds.

  • input_tokens

    (int, default: 0 ) –

    Number of input tokens.

  • output_tokens

    (int, default: 0 ) –

    Number of output tokens.

  • cost_usd

    (float, default: 0.0 ) –

    Estimated cost in USD.

Returns:

  • LLMCacheEntry

    The created cache entry.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMCacheEntry

Create from dictionary.

LLMResponse

LLMResponse dataclass

LLMResponse(content: str = '', finish_reason: str = 'stop', model_version: str = '')

LLM response data for caching.

Attributes:

  • content (str) –

    The full text response from the LLM.

  • finish_reason (str) –

    Why generation stopped (stop, length, etc.).

  • model_version (str) –

    Actual model version used.

Methods:

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMResponse

Create from dictionary.

LLMMetadata

LLMMetadata dataclass

LLMMetadata(
    provider: str = "",
    timestamp: str = "",
    latency_ms: int = 0,
    tokens: LLMTokenUsage = LLMTokenUsage(),
    cost_usd: float = 0.0,
    cache_hit: bool = False,
)

Metadata for a cached LLM response.

Attributes:

  • provider (str) –

    LLM provider name (openai, anthropic, etc.).

  • timestamp (str) –

    When the original request was made (ISO format).

  • latency_ms (int) –

    Response time in milliseconds.

  • tokens (LLMTokenUsage) –

    Token usage statistics.

  • cost_usd (float) –

    Estimated cost of the request in USD.

  • cache_hit (bool) –

    Whether this was served from cache.

Methods:

  • to_dict

    Convert to dictionary for JSON serialisation.

  • from_dict

    Create from dictionary.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialisation.

from_dict classmethod

from_dict(data: dict[str, Any]) -> LLMMetadata

Create from dictionary.

LLMTokenUsage

LLMTokenUsage dataclass

LLMTokenUsage(input: int = 0, output: int = 0, total: int = 0)

Token usage statistics for an LLM request.

Attributes:

  • input (int) –

    Number of tokens in the prompt.

  • output (int) –

    Number of tokens in the completion.

  • total (int) –

    Total tokens (input + output).