Cache Compressors

Pluggable compressors for type-specific cache entry compression.

Overview

Compressors transform data to/from compact binary representations, using a shared token dictionary for cross-entry compression.

Class           Description
Compressor      Abstract base class for all compressors
JsonCompressor  Tokenised compression for JSON data

Compressor

The Compressor abstract base class defines the interface for pluggable cache compressors.

Creating a Custom Compressor

from pathlib import Path
from causaliq_core.cache import TokenCache
from causaliq_core.cache.compressors import Compressor


class MyCompressor(Compressor):
    """Example compressor for custom data types."""

    @property
    def default_export_format(self) -> str:
        """File extension for exports."""
        return "txt"

    def compress(self, data: dict, token_cache: TokenCache) -> bytes:
        """Convert data to bytes for storage."""
        # Use token_cache.get_or_create_token() for string compression
        return b"compressed"

    def decompress(self, blob: bytes, token_cache: TokenCache) -> dict:
        """Convert bytes back to original data."""
        # Use token_cache.get_token() to restore strings
        return {"decompressed": True}

    def export(self, data: dict, path: Path) -> None:
        """Export to human-readable file."""
        path.write_text(str(data))

    def import_(self, path: Path) -> dict:
        """Import from human-readable file."""
        # ast.literal_eval safely parses the repr written by export(),
        # without the arbitrary-code-execution risk of eval()
        import ast
        return ast.literal_eval(path.read_text())

Compressor Interface

Compressor

Abstract base class for cache entry compressors.

Compressors handle:

  • Compressing data to compact binary format for storage
  • Decompressing binary data back to original structure
  • Exporting to human-readable formats (JSON, GraphML, etc.)
  • Importing from human-readable formats

Compressors may use the shared token dictionary in TokenCache for cross-entry compression of repeated strings.
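
The token-dictionary contract can be sketched with a plain in-memory stand-in (illustrative only — a toy, not the real TokenCache, which also persists tokens and shares them across entries):

```python
class ToyTokenDict:
    """Minimal in-memory stand-in for TokenCache's token dictionary."""

    def __init__(self) -> None:
        self._by_string: dict = {}  # string -> token ID
        self._by_id: dict = {}      # token ID -> string

    def get_or_create_token(self, s: str) -> int:
        """Return the ID for s, assigning the next free ID on first use."""
        if s not in self._by_string:
            token_id = len(self._by_string)
            self._by_string[s] = token_id
            self._by_id[token_id] = s
        return self._by_string[s]

    def get_token(self, token_id: int) -> str:
        """Return the string previously assigned this token ID."""
        return self._by_id[token_id]
```

A compressor calls get_or_create_token() while compressing (repeated strings map to the same ID) and get_token() while decompressing.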

Example

class MyCompressor(Compressor):
    def compress(self, data, token_cache):
        return json.dumps(data).encode()

    def decompress(self, blob, token_cache):
        return json.loads(blob.decode())

    # ... export/import methods

Methods:

  • compress

    Compress data to binary format.

  • decompress

    Decompress binary data back to original structure.

  • export

    Export data to human-readable file format.

  • import_

    Import data from human-readable file format.

Attributes:

default_export_format property

default_export_format: str

Default file extension for exports (e.g. 'json', 'graphml').

compress abstractmethod

compress(data: Any, token_cache: TokenCache) -> bytes

Compress data to binary format.

Parameters:

  • data

    (Any) –

    The data to compress (type depends on compressor).

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • bytes

    Compact binary representation.

decompress abstractmethod

decompress(blob: bytes, token_cache: TokenCache) -> Any

Decompress binary data back to original structure.

Parameters:

  • blob

    (bytes) –

    Binary data from cache.

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • Any

    Decompressed data in original format.

export abstractmethod

export(data: Any, path: Path) -> None

Export data to human-readable file format.

Parameters:

  • data

    (Any) –

    The data to export (decompressed format).

  • path

    (Path) –

    Destination file path.

import_ abstractmethod

import_(path: Path) -> Any

Import data from human-readable file format.

Parameters:

  • path

    (Path) –

    Source file path.

Returns:

  • Any

    Imported data ready for compression.


JsonCompressor

The JsonCompressor provides tokenised compression for JSON-serialisable data, achieving its space savings by mapping repeated strings and structural characters to IDs in the shared token dictionary.

Supported Types

JsonCompressor handles any JSON-serialisable Python data structure:

  • Dictionaries
  • Lists
  • Strings
  • Integers and floats
  • Booleans
  • None
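
These are exactly the types that survive a JSON round-trip, which a quick stdlib check illustrates (note the usual JSON caveats: tuples become lists, and non-string dict keys become strings):

```python
import json

supported = [
    {"key": "value"},   # dictionaries
    [1, 2, 3],          # lists
    "text",             # strings
    42,                 # integers
    3.14,               # floats
    True,               # booleans
    None,               # None
]

# Each value survives serialisation unchanged
for value in supported:
    assert json.loads(json.dumps(value)) == value
```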

Usage

Direct Compression

from causaliq_core.cache import TokenCache
from causaliq_core.cache.compressors import JsonCompressor

with TokenCache(":memory:") as cache:
    compressor = JsonCompressor()

    # Compress any JSON-serialisable data
    data = {
        "messages": [
            {"role": "user", "content": "What is BMI?"},
        ],
        "temperature": 0.7,
        "max_tokens": 100,
    }

    # Compress to compact binary format
    blob = compressor.compress(data, cache)

    # Decompress back to original structure
    decompressed = compressor.decompress(blob, cache)
    assert decompressed == data

With TokenCache Auto-Compression

from causaliq_core.cache import TokenCache
from causaliq_core.cache.compressors import JsonCompressor

with TokenCache(":memory:") as cache:
    # Set compressor for auto-compression
    cache.set_compressor(JsonCompressor())

    # Store and retrieve with automatic compression
    cache.put_data("hash1", {"key": "value"})
    data = cache.get_data("hash1")

Export/Import for Human-Readable Files

from pathlib import Path
from causaliq_core.cache.compressors import JsonCompressor

compressor = JsonCompressor()

# Export to JSON file
data = {"messages": [{"role": "user", "content": "Hello"}]}
compressor.export(data, Path("data.json"))

# Import from JSON file
imported = compressor.import_(Path("data.json"))

Compression Format

The compressor uses three type markers for compact binary representation:

Marker         Value  Description
TOKEN_REF      0x00   Token ID reference (uint16)
LITERAL_INT    0x01   64-bit signed integer
LITERAL_FLOAT  0x02   64-bit double float
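
The three encodings can be sketched with struct. Byte order is an assumption here — the format above doesn't specify endianness, so little-endian is used purely for illustration:

```python
import struct

# Type markers from the compression format table
TOKEN_REF, LITERAL_INT, LITERAL_FLOAT = 0x00, 0x01, 0x02

def encode_token_ref(token_id: int) -> bytes:
    """0x00 marker followed by a uint16 token ID (3 bytes total)."""
    return struct.pack("<BH", TOKEN_REF, token_id)

def encode_int(value: int) -> bytes:
    """0x01 marker followed by a 64-bit signed integer (9 bytes total)."""
    return struct.pack("<Bq", LITERAL_INT, value)

def encode_float(value: float) -> bytes:
    """0x02 marker followed by a 64-bit double (9 bytes total)."""
    return struct.pack("<Bd", LITERAL_FLOAT, value)
```

This is why a token reference costs 3 bytes while numeric literals cost 9: the marker byte plus the fixed-width payload.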

What Gets Tokenised

Element                               Tokenised  Rationale
JSON structural chars ({ } [ ] : ,)   Yes        Very frequent
String quotes (")                     Yes        Frequent
String content (words)                Yes        High repetition across entries
null, true, false                     Yes        Fixed vocabulary
Integers                              No         Stored as 8-byte literals
Floats                                No         Stored as 8-byte literals

Compression Example

Original: {"role": "user", "content": "Hello world"}

Each distinct element maps to a token ID; repeated elements (such as " and :)
reuse the same ID:

{ → T1    " → T2    role → T3    : → T4    user → T5    , → T6
content → T7    Hello → T8    world → T9    } → T10

Tokenised stream: T1 T2 T3 T2 T4 T2 T5 T2 T6 T2 T7 T2 T4 T2 T8 T9 T2 T10

Each token ID: 3 bytes (0x00 marker + uint16)

Token Reuse

Tokens are shared across all entries in a cache, providing cumulative compression benefits:

with TokenCache(":memory:") as cache:
    cache.set_compressor(JsonCompressor())

    # Common terms like "role", "content", "user" are tokenised once
    cache.put_data("h1", {"role": "user", "content": "Hello"})
    cache.put_data("h2", {"role": "assistant", "content": "Hi"})
    cache.put_data("h3", {"role": "user", "content": "Bye"})

    # "role", "content", "user" reuse same token IDs across all entries
    print(f"Total tokens: {cache.token_count()}")  # Much less than unique words
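
The cumulative effect can be illustrated with a plain dict standing in for the shared dictionary (a toy sketch, not the real TokenCache):

```python
token_ids: dict = {}  # word -> token ID, shared across all entries

def tokenise(words):
    """Assign each word the next free ID, reusing IDs for repeats."""
    return [token_ids.setdefault(w, len(token_ids)) for w in words]

entry1 = tokenise(["role", "user", "content", "Hello"])
entry2 = tokenise(["role", "assistant", "content", "Hi"])

# "role" and "content" keep the same IDs across both entries
assert entry1[0] == entry2[0] and entry1[2] == entry2[2]
# Only 6 unique tokens were allocated for 8 words
assert len(token_ids) == 6
```

Every repeated word across the whole cache is stored once in the dictionary and referenced by a 3-byte token ID thereafter.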

API Reference

JsonCompressor

Tokenised compression for JSON-serialisable data.

Uses shared token dictionary for JSON structure and text content. Numbers are stored as binary literals. Typical compression is 50-70%.

Compression format
  • Token reference: 0x00 + uint16 (token ID)
  • Integer literal: 0x01 + int64 (8 bytes, signed)
  • Float literal: 0x02 + float64 (8 bytes, double)
Example

from causaliq_core.cache import TokenCache

with TokenCache(":memory:") as cache:
    compressor = JsonCompressor()
    data = {"key": "value", "count": 42}
    blob = compressor.compress(data, cache)
    decompressed = compressor.decompress(blob, cache)
    assert decompressed == data

Methods:

  • compress

    Compress JSON-serialisable data to tokenised binary format.

  • decompress

    Decompress tokenised binary data back to JSON structure.

  • export

    Export data to JSON file.

  • import_

    Import data from JSON file.

Attributes:

default_export_format property

default_export_format: str

Default file extension for exports.

compress

compress(data: Any, token_cache: TokenCache) -> bytes

Compress JSON-serialisable data to tokenised binary format.

Parameters:

  • data

    (Any) –

    Any JSON-serialisable data (dict, list, str, int, etc.).

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • bytes

    Compact binary representation using token IDs and literals.

decompress

decompress(blob: bytes, token_cache: TokenCache) -> Any

Decompress tokenised binary data back to JSON structure.

Parameters:

  • blob

    (bytes) –

    Binary data from cache.

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • Any

    Decompressed JSON-compatible data structure.

export

export(data: Any, path: Path) -> None

Export data to JSON file.

Parameters:

  • data

    (Any) –

    The decompressed data to export.

  • path

    (Path) –

    Destination file path.

import_

import_(path: Path) -> Any

Import data from JSON file.

Parameters:

  • path

    (Path) –

    Source file path.

Returns:

  • Any

    Imported JSON data ready for compression.