LLM (and other) Caching¶
Rationale¶
LLM requests and responses will be cached for the following purposes:
- reducing costs - unnecessary LLM requests are avoided;
- reducing latency - requests from cache can be serviced much more quickly than querying an external LLM API;
- functional testing - cached responses provide a mechanism for deterministic and fast functional testing both locally and in GitHub CI testing;
- reproducibility - of published results, avoiding issues arising from the non-deterministic nature of LLMs and their rapid evolution;
- LLM analysis - these caches provide a rich resource characterising the comparative behaviour of different models, prompt strategies etc.
In fact, these are generic requirements that may apply to other caching needs, including, perhaps:
- structure learning results such as learned graphs, debug traces and experiment metadata - the conservative principle in CausalIQ workflows means that experiments are generally only performed if the results for that experiment are not present on disk
- node scores during structure learning, although these are generally held in memory
This argues for a generic caching capability that would ultimately be provided by causaliq-core (though we may develop it first in causaliq-knowledge, as this is where the first use case arises, and then migrate it).
Design Principles¶
- Opt-out caching - caching is enabled by default; users explicitly disable it when not wanted - for instance, if non-cached experiment timings are required
- Flexible organisation - defined by the application/user
- each application defines a key which is SHA-hashed, with the truncated hash used to locate the resource within a cache
- for LLM queries this may be the model, messages, temperature and max tokens
- for structure learning results, this could be the sample size, randomisation seed and resource type
- for node scores, this could be the node and parents
- each application defines the scope of each physical cache to optimise resource re-use
- for LLM queries this is likely defined by the dataset - in this case, the cache therefore represents the "LLM learning about a particular dataset"
- for structure learning results, this could be the experiment series id and network
- for node scores, this could be the dataset, sample size, score type and score parameters
- each application defines the data held for the cached resource
- for LLM queries this includes the complete response, and rich metadata such as token counts, latency, cost, and timestamps to support LLM analysis
- for structure learning results this includes the learned graph, experiment metadata (execution time, graph scores, hyperparameters etc) and optionally, the structure learning iteration trace
- for node scores this is simply the score float value
- Simple cache semantics - on request: check cache → if hit, return cached response → if miss, make request, cache result, return response
- Efficient storage - much of this cached data has lots of repeated "tokens" - JSON and XML tags, variable names, LLM query and response words - storing it verbatim would be very inefficient
- resources should be stored internally in a compressed format, with a "dictionary" used to map between numeric tokens and the tags, words etc. they represent (i.e. similar to ZIP-style dictionary compression)
- Fast access to resources
- resource existence checking and retrieval should be very fast
- resource addition and update, and sequential scanning of all resources (for analysis), should be reasonably fast
- Import/Export - functions should be provided to import/export all or selected parts of a cache (keys and resources) to standards-based, human-readable files:
- this is useful for functional testing, for example, where test files are human-readable, and a fixture imports the test files to the cache before testing starts
- for LLM queries, the queries, responses and metadata would be exported as JSON files
- for experiment results the learned graphs would be exported as GraphML files, the metadata as JSON and the debug trace in CSV format
- Centralised usage - access to cache data should be centralised in the application code
- for LLM queries, caching logic is implemented once in BaseLLMClient, available to all providers
- for experiment results, centralised in core CausalIQ Workflow code?
- for node scores this might be called from the node_score function
- Persistent and in-memory - the application can opt to use the cache in-memory or from disk
- Concurrency - the cache manages concurrent retrieval and update by multiple processes or threads
Resolving the Format Tension¶
There is a tension between some of these principles, in particular, space/performance/concurrency and standards-based open formats. This tension is mitigated because these requirements arise at different phases of the project lifecycle:
| Phase | Primary Need | Format |
|---|---|---|
| Test setup | Human-readable, editable fixtures | JSON, GraphML |
| Production runs | Speed, concurrency, disk efficiency | SQLite + encoded blobs |
| Result archival (Zenodo) | Standards-based, long-term accessible | JSON, GraphML, CSV |
This lifecycle perspective leads to a clear resolution:
- Compact, efficient formats are the norm during active experimentation
- Import converts open formats → internal cache (test fixture setup)
- Export converts internal cache → open formats (archival, sharing)
The import/export capability is therefore not just a convenience feature - it is essential architecture that bridges incompatible requirements at different lifecycle stages.
Existing Packages¶
Evaluation of existing packages against our requirements:
Comparison Summary¶
| Requirement | zip | SQLite | diskcache | pickle |
|---|---|---|---|---|
| Fast key lookup | ⚠️ Need index | ✅ Indexed | ✅ Indexed | ❌ Load all |
| Concurrency | ❌ Poor | ✅ Built-in | ✅ Built-in | ❌ None |
| Efficient storage | ✅ Compressed | ⚠️ Per-blob only | ❌ None | ⚠️ Reasonable |
| Cross-entry compression | ✅ Shared dict | ❌ No | ❌ No | ❌ No |
| In-memory mode | ❌ File-based | ✅ :memory: | ❌ Disk-based | ✅ Native |
| Incremental updates | ❌ Rewrite | ✅ Per-entry | ✅ Per-entry | ❌ Rewrite |
| Import/Export | ✅ File extract | ⚠️ Via SQL | ❌ Python-only | ❌ Python-only |
| Standards-based | ✅ Universal | ✅ Universal | ❌ Python-only | ❌ Python-only |
| Security | ✅ Safe | ✅ Safe | ✅ Safe | ❌ Code execution risk |
zip¶
Pros:
- Excellent compression with shared dictionary across all files
- Universal standard format
- Built-in Python support (zipfile module)
- Could use hash as filename within archive
Cons:
- No true in-memory mode (must write to file or BytesIO)
- Poor incremental update support - must rewrite archive to add/modify entries
- No built-in concurrency protection
- Random access requires scanning central directory
Verdict: Good for archival/export, poor for active cache with frequent updates.
SQLite¶
Pros:
- Fast indexed key lookup (B-tree primary key index)
- Built-in concurrency with locking
- In-memory mode via :memory:
- Per-entry insert/update without rewriting
- Universal format, excellent tooling
- SQL enables rich querying for analysis
Cons:
- No cross-entry compression (can compress individual blobs, but no shared dictionary)
- Compression must be handled at application level
Verdict: Excellent for cache infrastructure. Compression concern addressed by our pluggable encoder + token dictionary approach.
diskcache¶
Pros:
- Simple key-value API
- Built-in concurrency (uses SQLite underneath)
- Automatic eviction policies (LRU, etc.)
- Good performance for typical caching scenarios
Cons:
- No compression
- Python-specific - not portable to other languages/tools
- No cross-entry compression or shared dictionaries
- Limited querying capability - just key-value lookup
- Export would require custom code
Verdict: Convenient for simple caching but lacks compression and portability requirements.
pickle¶
Pros:
- Native Python serialisation - handles arbitrary objects
- Simple API
- In-memory use is natural (just Python dicts)
Cons:
- Security risk - pickle can execute arbitrary code on load
- No concurrency protection - corruption risk with multiple writers
- Single-file approach requires loading entire cache into memory
- Directory-of-files approach has same scaling issues as JSON
- No cross-entry compression
- Python-specific - cannot share caches with other tools
- Not human-readable
Verdict: Unsuitable for production caches. Security risk alone disqualifies it. Only appropriate for small, single-process, trusted, ephemeral use cases.
Recommendation¶
SQLite + pluggable encoders with shared token dictionary combines:
- SQLite's strengths: concurrency, fast lookup, in-memory mode, incremental updates
- Custom compression: pluggable encoders achieve domain-specific compression, and the shared token dictionary provides cross-entry compression benefits
- Standards compliance: SQLite is universal, export functions produce JSON/GraphML
LLM Query/Response Caching¶
This section details the specific caching requirements for LLM queries and responses.
Cache Key Construction¶
The cache key is a SHA-256 hash (truncated to 16 hex characters) of the following request parameters that affect the response:
| Field | Type | Rationale |
|---|---|---|
| model | string | Different models produce different responses |
| messages | list[dict] | The full conversation including system prompt and user query |
| temperature | float | Affects response variability (0.0 recommended for reproducibility) |
| max_tokens | int | May truncate response |
Not included in key:
- api_key - authentication, doesn't affect response content
- timeout - client-side concern
- stream - delivery method, same content
Example cache key input:
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a causal inference expert..."},
{"role": "user", "content": "What does variable 'BMI' typically represent in health studies?"}
],
"temperature": 0.0,
"max_tokens": 1000
}
Resulting hash: a3f7b2c1e9d4f8a2
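A minimal sketch of this key derivation (the helper name cache_key is illustrative; it mirrors the _build_cache_key method shown later, and the hash value shown above is itself only illustrative):

import hashlib
import json

def cache_key(request: dict) -> str:
    """Derive the truncated SHA-256 cache key from request parameters."""
    # Canonical serialisation - sorted keys, no whitespace - so that logically
    # identical requests always hash to the same 16-character key.
    canonical = json.dumps(request, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

key = cache_key({
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a causal inference expert..."},
        {"role": "user", "content": "What does variable 'BMI' typically represent in health studies?"}
    ],
    "temperature": 0.0,
    "max_tokens": 1000,
})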
Cached Data Structure¶
Each cache entry stores both the response and rich metadata for analysis:
{
"cache_key": {
"model": "gpt-4o",
"messages": [...],
"temperature": 0.0,
"max_tokens": 1000
},
"response": {
"content": "BMI (Body Mass Index) is a measure of body fat based on height and weight...",
"finish_reason": "stop",
"model_version": "gpt-4o-2024-08-06"
},
"metadata": {
"provider": "openai",
"timestamp": "2026-01-11T10:15:32.456Z",
"latency_ms": 1847,
"tokens": {
"input": 142,
"output": 203,
"total": 345
},
"cost_usd": 0.00518,
"cache_hit": false
}
}
Field Descriptions¶
Response fields:
| Field | Description |
|---|---|
| content | The full text response from the LLM |
| finish_reason | Why generation stopped: stop, length, content_filter |
| model_version | Actual model version used (may differ from requested) |
Metadata fields:
| Field | Description | Use Case |
|---|---|---|
| provider | LLM provider name (openai, anthropic, etc.) | Analysis across providers |
| timestamp | When the original request was made | Tracking model evolution |
| latency_ms | Response time in milliseconds | Performance analysis |
| tokens.input | Prompt token count | Cost tracking |
| tokens.output | Completion token count | Cost tracking |
| tokens.total | Total tokens | Cost tracking |
| cost_usd | Estimated cost of the request | Budget monitoring |
| cache_hit | Whether this was served from cache | Always false when first stored |
Cache Scope¶
LLM caches are typically dataset-scoped:
datasets/
health_study/
llm_cache.db # All LLM queries about this dataset
economic_data/
llm_cache.db # Separate cache for different dataset
This organisation means:
- Queries like "What does BMI mean?" are shared across all experiments on that dataset
- Different experiment series (algorithms, parameters) reuse the same cached LLM knowledge
- The cache represents "what the LLM has learned about this dataset"
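As a sketch of how this scoping might look in application code (assuming the TokenCache class proposed below; the open_llm_cache helper is illustrative):

from pathlib import Path

def open_llm_cache(dataset_dir: str | Path) -> "TokenCache":
    """Open (or create) the dataset-scoped LLM cache."""
    # One cache per dataset: every experiment series on this dataset shares it.
    return TokenCache(str(Path(dataset_dir) / "llm_cache.db"))

health_cache = open_llm_cache("datasets/health_study")
economic_cache = open_llm_cache("datasets/economic_data")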
Integration with BaseLLMClient¶
import hashlib
import json
import time

class BaseLLMClient:
def __init__(self, cache: TokenCache | None = None, use_cache: bool = True):
self.cache = cache
self.use_cache = use_cache
def query(self, messages: list[dict], **kwargs) -> LLMResponse:
if self.use_cache and self.cache:
cache_key = self._build_cache_key(messages, kwargs)
cached = self.cache.get(cache_key, entry_type='llm')
if cached:
return LLMResponse.from_cache(cached)
# Make actual API call
start = time.perf_counter()
response = self._call_api(messages, **kwargs)
latency_ms = int((time.perf_counter() - start) * 1000)
# Cache the result
if self.use_cache and self.cache:
entry = self._build_cache_entry(messages, kwargs, response, latency_ms)
self.cache.put(cache_key, entry_type='llm', data=entry)
return response
def _build_cache_key(self, messages: list[dict], kwargs: dict) -> str:
key_data = {
'model': self.model,
'messages': messages,
'temperature': kwargs.get('temperature', 0.0),
'max_tokens': kwargs.get('max_tokens', 1000),
}
key_json = json.dumps(key_data, sort_keys=True, separators=(',', ':'))
return hashlib.sha256(key_json.encode()).hexdigest()[:16]
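A possible usage pattern, assuming a concrete provider subclass (the class name OpenAIClient is illustrative - actual provider classes live elsewhere in causaliq-knowledge):

# Dataset-scoped cache shared by all experiments on this dataset
cache = TokenCache("datasets/health_study/llm_cache.db")

client = OpenAIClient(cache=cache)  # caching is on by default (opt-out principle)
response = client.query([{"role": "user", "content": "What does 'BMI' mean?"}])

# Disable caching when un-cached experiment timings are required
timing_client = OpenAIClient(cache=cache, use_cache=False)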
Proposed Solution: SQLite + Pluggable Encoders¶
Given the requirements, we propose SQLite for storage/concurrency combined with pluggable type-specific encoders for compression. Generic tokenisation provides a good default, but domain-specific encoders achieve far better compression for well-defined structures.
Why Pluggable Encoders?¶
Generic tokenisation is a lowest common denominator - it works for everything but optimises nothing. Consider learned graphs:
| Approach | Per Edge | 10,000 edges |
|---|---|---|
| GraphML (verbatim) | ~80-150 bytes | 800KB - 1.5MB |
| GraphML (tokenised) | ~30-50 bytes | 300-500KB |
| Domain-specific binary | 4 bytes | 40KB |
Domain-specific encoding achieves 10-30x better compression for structured data.
SQLite Schema¶
-- Token dictionary (grows dynamically, shared across encoders)
CREATE TABLE tokens (
id INTEGER PRIMARY KEY AUTOINCREMENT, -- uint16 range (max 65,535)
token TEXT UNIQUE NOT NULL,
frequency INTEGER DEFAULT 1
);
-- Generic cache entries
CREATE TABLE cache_entries (
hash TEXT PRIMARY KEY,
entry_type TEXT NOT NULL, -- 'llm', 'graph', 'score', etc.
data BLOB NOT NULL, -- encoded by type-specific encoder
created_at TEXT NOT NULL,
metadata BLOB -- tokenised metadata (optional)
);
CREATE INDEX idx_entry_type ON cache_entries(entry_type);
CREATE INDEX idx_created_at ON cache_entries(created_at);
Encoder Architecture¶
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any
class EntryEncoder(ABC):
"""Type-specific encoder/decoder for cache entries."""
@abstractmethod
def encode(self, data: Any, token_cache: 'TokenCache') -> bytes:
"""Encode data to bytes, optionally using shared token dictionary."""
...
@abstractmethod
def decode(self, blob: bytes, token_cache: 'TokenCache') -> Any:
"""Decode bytes back to original data structure."""
...
@abstractmethod
def export(self, data: Any, path: Path) -> None:
"""Export to human-readable format (GraphML, JSON, etc.)."""
...
@abstractmethod
def import_(self, path: Path) -> Any:
"""Import from human-readable format."""
...
Concrete Encoders¶
GraphEncoder - Domain-Specific Binary¶
Achieves ~4 bytes per edge through bit-packing:
import struct

import networkx as nx

class GraphEncoder(EntryEncoder):
"""Compact binary encoding for learned graphs.
Format:
- Header: node_count (uint16)
- Node table: [node_name_token_id (uint16), ...] - maps index → name
- Edge list: [packed_edge (uint32), ...] where:
- bits 0-13: source node index (up to 16,384 nodes)
- bits 14-27: target node index
- bits 28-31: edge type (4 bits for endpoint marks: -, >, o)
"""
def encode(self, graph: nx.DiGraph, token_cache: 'TokenCache') -> bytes:
nodes = list(graph.nodes())
node_to_idx = {n: i for i, n in enumerate(nodes)}
# Header
data = struct.pack('H', len(nodes))
# Node table - uses shared token dictionary for node names
for node in nodes:
token_id = token_cache.get_or_create_token(str(node))
data += struct.pack('H', token_id)
# Edge list - packed binary
for src, tgt, attrs in graph.edges(data=True):
edge_type = self._encode_edge_type(attrs)
packed = (node_to_idx[src] |
(node_to_idx[tgt] << 14) |
(edge_type << 28))
data += struct.pack('I', packed)
return data
def export(self, graph: nx.DiGraph, path: Path) -> None:
"""Export to standard GraphML for human readability."""
nx.write_graphml(graph, path)
def import_(self, path: Path) -> nx.DiGraph:
"""Import from GraphML."""
return nx.read_graphml(path)
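The corresponding decode() method is not shown above; a minimal sketch that unpacks the binary format described in the docstring (the _decode_edge_type helper is assumed as the counterpart of _encode_edge_type):

class GraphEncoder(EntryEncoder):
    def decode(self, blob: bytes, token_cache: 'TokenCache') -> nx.DiGraph:
        """Reconstruct a graph from the packed binary format."""
        (node_count,) = struct.unpack_from('H', blob, 0)
        offset = 2
        # Node table: token IDs map back to node names via the shared dictionary
        nodes = []
        for _ in range(node_count):
            (token_id,) = struct.unpack_from('H', blob, offset)
            nodes.append(token_cache.id_to_token[token_id])
            offset += 2
        graph = nx.DiGraph()
        graph.add_nodes_from(nodes)
        # Edge list: unpack each 32-bit word into source/target indices and edge type
        while offset < len(blob):
            (packed,) = struct.unpack_from('I', blob, offset)
            src = packed & 0x3FFF              # bits 0-13
            tgt = (packed >> 14) & 0x3FFF      # bits 14-27
            edge_type = (packed >> 28) & 0xF   # bits 28-31
            graph.add_edge(nodes[src], nodes[tgt],
                           **self._decode_edge_type(edge_type))
            offset += 4
        return graph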
LLMEntryEncoder - Generic Tokenisation¶
Good default for text-heavy, variably-structured data:
class LLMEntryEncoder(EntryEncoder):
"""Tokenised encoding for LLM cache entries.
Uses shared token dictionary for JSON structure and text content.
Numbers stored as literals. Achieves 50-70% compression on typical entries.
"""
TOKEN_REF = 0x00
LITERAL_INT = 0x01
LITERAL_FLOAT = 0x02
def encode(self, entry: dict, token_cache: 'TokenCache') -> bytes:
# Tokenise JSON keys, structural chars, string words
# Store integers and floats as literals
...
def decode(self, blob: bytes, token_cache: 'TokenCache') -> dict:
# Reconstruct JSON from token IDs and literals
...
def export(self, entry: dict, path: Path) -> None:
path.write_text(json.dumps(entry, indent=2))
def import_(self, path: Path) -> dict:
return json.loads(path.read_text())
ScoreEncoder - Minimal Encoding¶
For high-volume, simple data:
class ScoreEncoder(EntryEncoder):
"""Compact encoding for node scores - just 8 bytes per entry."""
def encode(self, score: float, token_cache: 'TokenCache') -> bytes:
return struct.pack('d', score) # 8-byte double
def decode(self, blob: bytes, token_cache: 'TokenCache') -> float:
return struct.unpack('d', blob)[0]
def export(self, score: float, path: Path) -> None:
path.write_text(str(score))
def import_(self, path: Path) -> float:
return float(path.read_text())
Compression Comparison¶
| Entry Type | Encoder | Typical Size | vs Verbatim |
|---|---|---|---|
| LLM response | LLMEntryEncoder | 500 bytes | 50-70% smaller |
| Learned graph (1000 edges) | GraphEncoder | 4KB | 95% smaller |
| Node score | ScoreEncoder | 8 bytes | 50% smaller |
| Experiment metadata | LLMEntryEncoder | 200 bytes | 60% smaller |
TokenCache with Pluggable Encoders¶
import sqlite3
from datetime import datetime
from typing import Any

class TokenCache:
def __init__(self, db_path: str, encoders: dict[str, EntryEncoder] = None):
self.conn = sqlite3.connect(db_path)
self._init_schema()
self._load_token_dict()
# Register encoders - defaults provided, can be overridden
self.encoders = encoders or {
'llm': LLMEntryEncoder(),
'graph': GraphEncoder(),
'score': ScoreEncoder(),
}
def _load_token_dict(self):
"""Load token dictionary into memory for fast lookup."""
cursor = self.conn.execute("SELECT token, id FROM tokens")
self.token_to_id = {row[0]: row[1] for row in cursor}
self.id_to_token = {v: k for k, v in self.token_to_id.items()}
def get_or_create_token(self, token: str) -> int:
"""Get token ID, creating new entry if needed. Used by encoders."""
if token in self.token_to_id:
return self.token_to_id[token]
cursor = self.conn.execute(
"INSERT INTO tokens (token) VALUES (?) RETURNING id",
(token,)
)
token_id = cursor.fetchone()[0]
self.token_to_id[token] = token_id
self.id_to_token[token_id] = token
return token_id
def put(self, hash: str, entry_type: str, data: Any, metadata: dict = None):
"""Store entry using appropriate encoder."""
encoder = self.encoders[entry_type]
blob = encoder.encode(data, self)
meta_blob = self.encoders['llm'].encode(metadata, self) if metadata else None
self.conn.execute(
"INSERT OR REPLACE INTO cache_entries VALUES (?, ?, ?, ?, ?)",
(hash, entry_type, blob, datetime.utcnow().isoformat(), meta_blob)
)
self.conn.commit()
def get(self, hash: str, entry_type: str) -> Any | None:
"""Retrieve and decode entry."""
cursor = self.conn.execute(
"SELECT data FROM cache_entries WHERE hash = ? AND entry_type = ?",
(hash, entry_type)
)
row = cursor.fetchone()
if row is None:
return None
encoder = self.encoders[entry_type]
return encoder.decode(row[0], self)
def register_encoder(self, entry_type: str, encoder: EntryEncoder):
"""Register custom encoder for new entry types."""
self.encoders[entry_type] = encoder
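The _init_schema() call in __init__ is not shown above; a minimal sketch that simply applies the schema defined earlier (IF NOT EXISTS added so it is safe to call when reopening an existing cache):

class TokenCache:
    def _init_schema(self):
        """Create the token dictionary and cache entry tables if they are absent."""
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS tokens (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                token TEXT UNIQUE NOT NULL,
                frequency INTEGER DEFAULT 1
            );
            CREATE TABLE IF NOT EXISTS cache_entries (
                hash TEXT PRIMARY KEY,
                entry_type TEXT NOT NULL,
                data BLOB NOT NULL,
                created_at TEXT NOT NULL,
                metadata BLOB
            );
            CREATE INDEX IF NOT EXISTS idx_entry_type ON cache_entries(entry_type);
            CREATE INDEX IF NOT EXISTS idx_created_at ON cache_entries(created_at);
        """)
        self.conn.commit()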
Tokenisation Details (LLMEntryEncoder)¶
| Element | Tokenise? | Rationale |
|---|---|---|
| JSON structural chars ({, }, [, ], :, ,) | Yes | Very frequent, 1 char → 2 bytes is fine given volume |
| JSON keys ("role", "content") | Yes | Highly repetitive |
| String values (words) | Yes | Variable names, common words repeat often |
| Integers | No | Store as literal bytes (type marker + value) |
| Floats | No | Store as literal bytes (type marker + 8-byte double) |
| Booleans | Yes | Just two tokens: true, false |
| Null | Yes | Single token |
Token Boundary Rules¶
Tokenisation splits on:
- JSON structural characters - each is its own token
- Whitespace - discarded (JSON can be reconstructed without it)
- Within strings - split on whitespace and punctuation boundaries
Example: splitting the string "What does variable 'BMI' typically represent in health studies?" on whitespace and punctuation boundaries yields the tokens What, does, variable, ', BMI, ', typically, represent, in, health, studies and ?; most of these recur across many queries about the same dataset.
Encoding Format¶
The data blob contains a sequence of typed values:
| Type Marker (uint8) | Followed By | Description |
|---|---|---|
| 0x00 | uint16 | Token ID reference |
| 0x01 | int64 (8 bytes) | Literal integer |
| 0x02 | float64 (8 bytes) | Literal float |
Example encoding:
{"score": 3.14159, "node": "BMI"}
Token dictionary (illustrative IDs):
T1 = {   T2 = "   T3 = score   T4 = :   T5 = , (comma)   T6 = node   T7 = BMI   T8 = }
Token/literal stream:
{  "  score  "  :  <3.14159>  ,  "  node  "  :  "  BMI  "  }
T1 T2   T3    T2 T4   float   T5 T2   T6   T2 T4 T2  T7  T2 T8
Encoded: [0x00,T1, 0x00,T2, 0x00,T3, 0x00,T2, 0x00,T4,
          0x02,<8-byte float 3.14159>,
          0x00,T5, 0x00,T2, 0x00,T6, 0x00,T2, 0x00,T4,
          0x00,T2, 0x00,T7, 0x00,T2, 0x00,T8]
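A minimal sketch of decoding this typed-value stream back into tokens and literals (the decode_stream helper is illustrative; reassembling the JSON text from the resulting sequence is then straightforward):

import struct

TOKEN_REF, LITERAL_INT, LITERAL_FLOAT = 0x00, 0x01, 0x02

def decode_stream(blob: bytes, id_to_token: dict[int, str]) -> list:
    """Decode a blob of typed values into a flat list of tokens and literals."""
    items, pos = [], 0
    while pos < len(blob):
        marker = blob[pos]
        pos += 1
        if marker == TOKEN_REF:
            (token_id,) = struct.unpack_from('H', blob, pos)   # uint16 token ID
            items.append(id_to_token[token_id])
            pos += 2
        elif marker == LITERAL_INT:
            (value,) = struct.unpack_from('q', blob, pos)      # int64 literal
            items.append(value)
            pos += 8
        elif marker == LITERAL_FLOAT:
            (value,) = struct.unpack_from('d', blob, pos)      # float64 literal
            items.append(value)
            pos += 8
        else:
            raise ValueError(f"unknown type marker {marker:#x}")
    return items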
In-Memory Mode¶
For node scores during structure learning (high-frequency, session-scoped):
# In-memory SQLite - fast, no disk I/O
cache = TokenCache(":memory:")
# Can optionally persist at end of session
cache.save_to_disk("node_scores.db")
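save_to_disk() is not defined elsewhere in this document; one possible sketch uses Python's sqlite3 backup API to copy the in-memory database to a file:

class TokenCache:
    def save_to_disk(self, path: str) -> None:
        """Persist the (possibly in-memory) cache database to a file on disk."""
        disk_conn = sqlite3.connect(path)
        try:
            # Connection.backup copies the whole database, including the token
            # dictionary and all cache entries.
            self.conn.backup(disk_conn)
        finally:
            disk_conn.close()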
Import/Export¶
Import/export delegates to the appropriate encoder:
class TokenCache:
def export_entries(self, output_dir: Path, entry_type: str, format: str = None):
"""Export cache entries to human-readable files."""
encoder = self.encoders[entry_type]
query = "SELECT hash, data FROM cache_entries WHERE entry_type = ?"
for hash, blob in self.conn.execute(query, (entry_type,)):
data = encoder.decode(blob, self)
ext = format or encoder.default_export_format
encoder.export(data, output_dir / f"{hash}.{ext}")
def import_entries(self, input_dir: Path, entry_type: str):
"""Import human-readable files into cache."""
encoder = self.encoders[entry_type]
for file in input_dir.iterdir():
if file.is_file():
data = encoder.import_(file)
hash = file.stem
self.put(hash, entry_type, data)
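As an illustration of the test-fixture use case described under the design principles (a sketch only - the fixture name and the tests/data/llm_fixtures path are illustrative):

from pathlib import Path

import pytest

@pytest.fixture
def llm_cache(tmp_path):
    """Build a test cache from human-readable JSON fixtures before tests run."""
    cache = TokenCache(str(tmp_path / "cache.db"))
    cache.import_entries(Path("tests/data/llm_fixtures"), entry_type='llm')
    return cache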
Cache Organisation¶
The cache is scoped by its physical location (database file):
| Context | Cache Location | Typical Size |
|---|---|---|
| Dataset LLM queries | {dataset_dir}/llm_cache.db | 10K-100K entries |
| Experiment series | {series_dir}/results.db | 100K-10M entries |
| Functional tests | tests/data/cache.db | 100-1K entries |
| Node scores (session) | :memory: | 1M+ entries |
Collision Handling¶
With truncated SHA-256 hashes, collisions are possible. On cache read:
- Find entry by hash
- Decode and verify original key matches request exactly
- If mismatch, treat as cache miss
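A sketch of this verify-on-read pattern, assuming the original request parameters are stored alongside the response (as in the cached data structure above) and the cache_key helper sketched earlier; the get_verified name is illustrative:

def get_verified(cache: TokenCache, key_data: dict) -> dict | None:
    """Return the cached entry only if its stored key matches the request exactly."""
    hash_ = cache_key(key_data)                 # truncated SHA-256
    entry = cache.get(hash_, entry_type='llm')
    if entry is None:
        return None
    if entry.get('cache_key') != key_data:      # truncated-hash collision
        return None                             # treat as a cache miss
    return entry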
CLI Tooling¶
# Export to human-readable format (uses encoder's export method)
causaliq cache export llm_cache.db --type llm --output cache_dump/
causaliq cache export results.db --type graph --format graphml --output graphs/
# Import from human-readable format
causaliq cache import cache_dump/ --into llm_cache.db --type llm
# Statistics
causaliq cache stats results.db
# Output:
# Entries: 52,341 (llm: 12,341, graph: 40,000)
# Tokens: 4,892
# Size: 45.2 MB (estimated uncompressed: 142.8 MB)
# Query/inspect
causaliq cache query results.db --type llm --where "model='gpt-4o'" --limit 10
Implementation Plan¶
The caching system will be built incrementally across the following commits. Feature branch: feature/llm-caching
Module Structure¶
The implementation separates core infrastructure (future migration to causaliq-core) from LLM-specific code (remains in causaliq-knowledge):
src/causaliq_knowledge/
cache/ # CORE - will migrate to causaliq-core
__init__.py
token_cache.py # TokenCache class, SQLite schema
encoders/
__init__.py
base.py # EntryEncoder ABC
json_encoder.py # Generic JSON tokenisation encoder
llm/
cache.py # LLM-SPECIFIC - stays in causaliq-knowledge
# LLMEntryEncoder, cache key building, metadata
tests/
unit/
cache/ # CORE tests - migrate with core
test_token_cache.py
test_json_encoder.py
llm/
test_llm_cache.py # LLM-SPECIFIC tests - stay here
integration/
test_llm_caching.py # LLM-SPECIFIC integration tests
This separation ensures:
- Core cache infrastructure can be extracted to causaliq-core with its tests
- LLM-specific encoder and integration remain in causaliq-knowledge
- Clean dependency: causaliq-knowledge depends on causaliq-core, not vice versa
Phase 1: Core Infrastructure (migrates to causaliq-core)¶
| # | Commit | Description |
|---|---|---|
| 1 | Add cache module skeleton | Create src/causaliq_knowledge/cache/ with __init__.py, empty module structure |
| 2 | Implement SQLite schema and TokenCache base | TokenCache class with __init__, _init_schema(), connection management, in-memory support |
| 3 | Add token dictionary management | get_or_create_token(), _load_token_dict(), in-memory token lookup |
| 4 | Add basic get/put operations | put(), get(), exists() methods (without encoding - raw blob storage) |
| 5 | Add unit tests for core TokenCache | tests/unit/cache/test_token_cache.py - schema, tokens, CRUD, in-memory |
Phase 2: Encoder Architecture (core migrates, LLM-specific stays)¶
| # | Commit | Description |
|---|---|---|
| 6 | Define EntryEncoder ABC | cache/encoders/base.py - ABC with encode(), decode(), export(), import_() (core) |
| 7 | Implement generic JSON encoder | cache/encoders/json_encoder.py - tokenisation, literals (core) |
| 8 | Integrate encoders with TokenCache | register_encoder(), encoder dispatch in get()/put() (core) |
| 9 | Add unit tests for encoders | tests/unit/cache/test_json_encoder.py (core) |
| 10 | Implement LLMEntryEncoder | llm/cache.py - extends JSON encoder with LLM-specific structure (LLM-specific) |
| 11 | Add unit tests for LLMEntryEncoder | tests/unit/llm/test_llm_cache.py (LLM-specific) |
Phase 3: Import/Export (core)¶
| # | Commit | Description |
|---|---|---|
| 12 | Add export functionality | export_entries() method in TokenCache |
| 13 | Add import functionality | import_entries() method in TokenCache |
| 14 | Add import/export tests | Round-trip tests in tests/unit/cache/ |
Phase 4: LLM Client Integration (LLM-specific)¶
| # | Commit | Description |
|---|---|---|
| 15 | Add cache_key building to BaseLLMClient | _build_cache_key() method with SHA-256 hashing |
| 16 | Integrate caching into query flow | Cache lookup before API call, cache storage after |
| 17 | Add metadata capture | Latency, token counts, cost estimation, timestamps |
| 18 | Add LLM caching integration tests | tests/integration/test_llm_caching.py - mock API, cache hit/miss |
| 19 | Update functional tests to use cache import | Fixture imports test data, tests run from cache |
Phase 5: CLI Tooling (core commands, LLM-specific options)¶
| # | Commit | Description |
|---|---|---|
| 20 | Add cache CLI subcommand | causaliq cache command group (core) |
| 21 | Add cache export command | causaliq cache export with type/format options |
| 22 | Add cache import command | causaliq cache import |
| 23 | Add cache stats command | Entry counts, token count, size estimation |
Phase 6: Documentation & Polish¶
| # | Commit | Description |
|---|---|---|
| 24 | Add cache API documentation | Docstrings, mkdocs pages for cache module |
| 25 | Update architecture docs | Mark caching design as implemented, add usage examples |
Dependency Graph¶
Phase 1 (core): 1 → 2 → 3 → 4 → 5
↓
Phase 2 (core): 6 → 7 → 8 → 9
↓
Phase 2 (llm): 10 → 11
↓
Phase 3 (core): 12 → 13 → 14
↓
Phase 4 (llm): 15 → 16 → 17 → 18 → 19
↓
Phase 5: 20 → 21 → 22 → 23
↓
Phase 6: 24 → 25
Milestones¶
- After Phase 1 (commits 1-5): Working cache with raw blob storage, testable standalone
- After Phase 2 (commits 6-11): Compression working, LLM encoder ready
- After Phase 4 (commits 15-19): Cache fully usable for LLM queries
- After Phase 5 (commits 20-23): CLI tools available for cache management
Future: Migration to causaliq-core¶
When ready to migrate core infrastructure:
- Move cache/ directory to causaliq-core
- Move tests/unit/cache/ to causaliq-core
- Update imports in causaliq-knowledge to use causaliq_core.cache
- LLMEntryEncoder and LLM integration remain in causaliq-knowledge