
JsonEncoder

The JsonEncoder provides tokenised encoding for JSON-serialisable data, typically achieving 50-70% compression by sharing a token dictionary across all cache entries.

Overview

JsonEncoder is a concrete implementation of EntryEncoder that handles any JSON-serialisable Python data structure:

  • Dictionaries
  • Lists
  • Strings
  • Integers and floats
  • Booleans
  • None

Usage

Direct Encoding

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    # Encode any JSON-serialisable data
    data = {
        "messages": [
            {"role": "user", "content": "What is BMI?"},
        ],
        "temperature": 0.7,
        "max_tokens": 100,
    }

    # Encode to compact binary format
    blob = encoder.encode(data, cache)

    # Decode back to original structure
    decoded = encoder.decode(blob, cache)
    assert decoded == data

With TokenCache Auto-Encoding

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    # Register for auto-encoding
    cache.register_encoder("json", JsonEncoder())

    # Store and retrieve with automatic encoding
    cache.put_data("hash1", "json", {"key": "value"})
    data = cache.get_data("hash1", "json")

Export/Import for Human-Readable Files

from pathlib import Path
from causaliq_knowledge.cache.encoders import JsonEncoder

encoder = JsonEncoder()

# Export to JSON file
data = {"messages": [{"role": "user", "content": "Hello"}]}
encoder.export(data, Path("data.json"))

# Import from JSON file
imported = encoder.import_(Path("data.json"))
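
Exported files are standard JSON, so they round-trip with ordinary tools. A stdlib sketch of that guarantee (an illustration only, not the encoder's implementation):

```python
import json
import tempfile
from pathlib import Path

# Any JSON-serialisable structure survives a file round-trip unchanged.
data = {"messages": [{"role": "user", "content": "Hello"}]}
path = Path(tempfile.mkdtemp()) / "data.json"
path.write_text(json.dumps(data, indent=2))  # human-readable on disk
assert json.loads(path.read_text()) == data
```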

Encoding Format

The encoder uses three type markers for compact binary representation:

  Marker         Value  Description
  TOKEN_REF      0x00   Token ID reference (uint16)
  LITERAL_INT    0x01   64-bit signed integer
  LITERAL_FLOAT  0x02   64-bit double float
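
These layouts can be illustrated with Python's struct module. This is a sketch only: the marker values and field widths come from the table above, but the byte order (little-endian here) is an assumption, not a documented part of the format.

```python
import struct

TOKEN_REF, LITERAL_INT, LITERAL_FLOAT = 0x00, 0x01, 0x02

def pack_token_ref(token_id: int) -> bytes:
    # 0x00 marker followed by a uint16 token ID (byte order assumed)
    return struct.pack("<BH", TOKEN_REF, token_id)

def pack_int(value: int) -> bytes:
    # 0x01 marker followed by a 64-bit signed integer
    return struct.pack("<Bq", LITERAL_INT, value)

def pack_float(value: float) -> bytes:
    # 0x02 marker followed by a 64-bit double
    return struct.pack("<Bd", LITERAL_FLOAT, value)

assert len(pack_token_ref(7)) == 3   # 1 marker byte + 2-byte token ID
assert len(pack_int(42)) == 9        # 1 marker byte + 8-byte int64
assert len(pack_float(0.7)) == 9     # 1 marker byte + 8-byte double
```

A token reference costs 3 bytes regardless of the length of the tokenised string, which is where the savings come from for repeated words.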

What Gets Tokenised

  Element                                Tokenised  Rationale
  JSON structural chars ({ } [ ] : ,)    Yes        Very frequent
  String quotes (")                      Yes        Frequent
  String content (words)                 Yes        High repetition across entries
  null, true, false                      Yes        Fixed vocabulary
  Integers                               No         Stored as 8-byte literals
  Floats                                 No         Stored as 8-byte literals

Compression Example

Original: {"role": "user", "content": "Hello world"}

Tokens (unique IDs assigned on first use, reused thereafter):

  {   "   role  "   :   "   user  "   ,   "   content  "   :   "   Hello  world  "   }
  T1  T2  T3    T2  T4  T2  T5    T2  T6  T2  T7       T2  T4  T2  T8     T9     T2  T10

Each token reference: 3 bytes (0x00 marker + uint16)
Typical compression: 50-70% vs raw JSON

Token Reuse

Tokens are shared across all entries in a cache, providing cumulative compression benefits:

with TokenCache(":memory:") as cache:
    cache.register_encoder("json", JsonEncoder())

    # Common terms like "role", "content", "user" are tokenised once
    cache.put_data("h1", "json", {"role": "user", "content": "Hello"})
    cache.put_data("h2", "json", {"role": "assistant", "content": "Hi"})
    cache.put_data("h3", "json", {"role": "user", "content": "Bye"})

    # "role", "content", "user" reuse same token IDs across all entries
    print(f"Total tokens: {cache.token_count()}")  # far fewer than total word occurrences

API Reference

JsonEncoder

Tokenised encoding for JSON-serialisable data.

Uses shared token dictionary for JSON structure and text content. Numbers are stored as binary literals. Typical compression is 50-70%.

Encoding format
  • Token reference: 0x00 + uint16 (token ID)
  • Integer literal: 0x01 + int64 (8 bytes, signed)
  • Float literal: 0x02 + float64 (8 bytes, double)
Example

from causaliq_knowledge.cache import TokenCache

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()
    data = {"key": "value", "count": 42}
    blob = encoder.encode(data, cache)
    decoded = encoder.decode(blob, cache)
    assert decoded == data

Methods:

  • encode

    Encode JSON-serialisable data to tokenised binary format.

  • decode

    Decode tokenised binary data back to JSON structure.

  • export

    Export data to JSON file.

  • import_

    Import data from JSON file.

Attributes:

default_export_format property

default_export_format: str

Default file extension for exports.

encode

encode(data: Any, token_cache: TokenCache) -> bytes

Encode JSON-serialisable data to tokenised binary format.

Parameters:

  • data

    (Any) –

    Any JSON-serialisable data (dict, list, str, int, etc.).

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • bytes

    Compact binary representation using token IDs and literals.

decode

decode(blob: bytes, token_cache: TokenCache) -> Any

Decode tokenised binary data back to JSON structure.

Parameters:

  • blob

    (bytes) –

    Binary data from cache.

  • token_cache

    (TokenCache) –

    Cache instance for shared token dictionary.

Returns:

  • Any

    Decoded JSON-compatible data structure.

export

export(data: Any, path: Path) -> None

Export data to JSON file.

Parameters:

  • data

    (Any) –

    The decoded data to export.

  • path

    (Path) –

    Destination file path.

import_

import_(path: Path) -> Any

Import data from JSON file.

Parameters:

  • path

    (Path) –

    Source file path.

Returns:

  • Any

    Imported JSON data ready for encoding.