JsonEncoder¶
The JsonEncoder provides tokenised encoding for JSON-serialisable data,
typically achieving 50-70% compression through a shared token dictionary.
Overview¶
JsonEncoder is a concrete implementation of EntryEncoder that handles
any JSON-serialisable Python data structure:
- Dictionaries
- Lists
- Strings
- Integers and floats
- Booleans
- None
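A minimal round-trip sketch covering all of these types, using the `encode`/`decode` API described under Usage (the values here are arbitrary):

```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    # One value exercising every supported type
    data = {
        "name": "example",   # string
        "values": [1, 2.5],  # list, integer, float
        "enabled": True,     # boolean
        "missing": None,     # None
    }

    blob = encoder.encode(data, cache)
    assert encoder.decode(blob, cache) == data
```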
Usage¶
Direct Encoding¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    # Encode any JSON-serialisable data
    data = {
        "messages": [
            {"role": "user", "content": "What is BMI?"},
        ],
        "temperature": 0.7,
        "max_tokens": 100,
    }

    # Encode to compact binary format
    blob = encoder.encode(data, cache)

    # Decode back to original structure
    decoded = encoder.decode(blob, cache)
    assert decoded == data
```
With TokenCache Auto-Encoding¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    # Register for auto-encoding
    cache.register_encoder("json", JsonEncoder())

    # Store and retrieve with automatic encoding
    cache.put_data("hash1", "json", {"key": "value"})
    data = cache.get_data("hash1", "json")
```
Export/Import for Human-Readable Files¶
```python
from pathlib import Path

from causaliq_knowledge.cache.encoders import JsonEncoder

encoder = JsonEncoder()

# Export to JSON file
data = {"messages": [{"role": "user", "content": "Hello"}]}
encoder.export(data, Path("data.json"))

# Import from JSON file
imported = encoder.import_(Path("data.json"))
```
Encoding Format¶
The encoder uses three type markers for compact binary representation:
| Marker | Value | Description |
|---|---|---|
| TOKEN_REF | 0x00 | Token ID reference (uint16) |
| LITERAL_INT | 0x01 | 64-bit signed integer |
| LITERAL_FLOAT | 0x02 | 64-bit double float |
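To make the layout concrete, here is a sketch of a reader that walks an encoded blob and yields its markers. This is illustrative only: `walk` is a hypothetical helper, not part of the library, and the byte order shown (big-endian) is an assumption.

```python
import struct

TOKEN_REF, LITERAL_INT, LITERAL_FLOAT = 0x00, 0x01, 0x02

def walk(blob: bytes):
    """Yield (kind, value) pairs from an encoded blob.

    Assumes big-endian layout; the real encoder may differ.
    """
    pos = 0
    while pos < len(blob):
        marker = blob[pos]
        pos += 1
        if marker == TOKEN_REF:
            (token_id,) = struct.unpack_from(">H", blob, pos)  # uint16
            pos += 2
            yield ("token", token_id)
        elif marker == LITERAL_INT:
            (value,) = struct.unpack_from(">q", blob, pos)  # signed int64
            pos += 8
            yield ("int", value)
        elif marker == LITERAL_FLOAT:
            (value,) = struct.unpack_from(">d", blob, pos)  # float64
            pos += 8
            yield ("float", value)
        else:
            raise ValueError(f"unknown marker {marker:#04x}")
```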
What Gets Tokenised¶
| Element | Tokenised | Rationale |
|---|---|---|
| JSON structural chars (`{`, `}`, `[`, `]`, `:`, `,`) | Yes | Very frequent |
| String quotes (`"`) | Yes | Frequent |
| String content (words) | Yes | High repetition across entries |
| `null`, `true`, `false` | Yes | Fixed vocabulary |
| Integers | No | Stored as 8-byte literals |
| Floats | No | Stored as 8-byte literals |
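One practical consequence: re-encoding the same structure with different numbers should not grow the token dictionary, while new words should. A sketch, assuming (per the table above) that numbers are stored as literals, and using the `token_count()` method shown under Token Reuse below:

```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    encoder.encode({"scores": [1, 2, 3]}, cache)
    before = cache.token_count()

    # Same structure, different integers: literals only, no new tokens
    encoder.encode({"scores": [4, 5, 6]}, cache)
    assert cache.token_count() == before

    # New words enter the shared dictionary once each
    encoder.encode({"note": "hello world"}, cache)
    assert cache.token_count() > before
```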
Compression Example¶
```
Original: {"role": "user", "content": "Hello world"}

Tokens:   {   "   role  "   :   "   user  "   ,   "   content  "   :   "   Hello  world  "   }
          T1  T2  T3    T2  T4  T2  T5    T2  T6  T2  T7       T2  T4  T2  T8     T9     T2  T10
```

Each token reference costs 3 bytes (0x00 marker + uint16 token ID). Repeated elements such as `"` (T2) and `:` (T4) map to a single ID, and every ID is reused again by later cache entries.

Typical compression: 50-70% vs raw JSON
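You can check the blob size for a given entry directly. A small sketch using only the documented `encode` API; note that a single small entry may not show the full saving, since compression accumulates as later entries reuse the shared dictionary:

```python
import json

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()
    data = {"role": "user", "content": "Hello world"}

    raw_size = len(json.dumps(data).encode("utf-8"))
    blob_size = len(encoder.encode(data, cache))
    print(f"raw JSON: {raw_size} bytes, encoded: {blob_size} bytes")
```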
Token Reuse¶
Tokens are shared across all entries in a cache, providing cumulative compression benefits:
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    cache.register_encoder("json", JsonEncoder())

    # Common terms like "role", "content", "user" are tokenised once
    cache.put_data("h1", "json", {"role": "user", "content": "Hello"})
    cache.put_data("h2", "json", {"role": "assistant", "content": "Hi"})
    cache.put_data("h3", "json", {"role": "user", "content": "Bye"})

    # "role", "content", "user" reuse the same token IDs across entries
    print(f"Total tokens: {cache.token_count()}")  # Far fewer than the total word count
```
API Reference¶
JsonEncoder¶
Tokenised encoding for JSON-serialisable data.
Uses a shared token dictionary for JSON structure and text content. Numbers are stored as binary literals. Typical compression is 50-70%.
Encoding format
- Token reference: 0x00 + uint16 (token ID)
- Integer literal: 0x01 + int64 (8 bytes, signed)
- Float literal: 0x02 + float64 (8 bytes, double)
Example

```python
>>> from causaliq_knowledge.cache import TokenCache
>>> from causaliq_knowledge.cache.encoders import JsonEncoder
>>> with TokenCache(":memory:") as cache:
...     encoder = JsonEncoder()
...     data = {"key": "value", "count": 42}
...     blob = encoder.encode(data, cache)
...     decoded = encoder.decode(blob, cache)
...     assert decoded == data
```
Methods:

- `encode` – Encode JSON-serialisable data to tokenised binary format.
- `decode` – Decode tokenised binary data back to JSON structure.
- `export` – Export data to JSON file.
- `import_` – Import data from JSON file.

Attributes:

- `default_export_format` (`str`) – Default file extension for exports.
encode¶

```python
encode(data: Any, token_cache: TokenCache) -> bytes
```

Encode JSON-serialisable data to tokenised binary format.

Parameters:

- `data` (`Any`) – Any JSON-serialisable data (dict, list, str, int, etc.).
- `token_cache` (`TokenCache`) – Cache instance for shared token dictionary.

Returns:

- `bytes` – Compact binary representation using token IDs and literals.
decode¶

```python
decode(blob: bytes, token_cache: TokenCache) -> Any
```

Decode tokenised binary data back to JSON structure.

Parameters:

- `blob` (`bytes`) – Binary data from cache.
- `token_cache` (`TokenCache`) – Cache instance for shared token dictionary.

Returns:

- `Any` – Decoded JSON-compatible data structure.