JsonEncoder¶
The JsonEncoder provides tokenised encoding for JSON-serialisable data,
typically achieving 50-70% compression through a shared token dictionary.
Overview¶
JsonEncoder is a concrete implementation of EntryEncoder that handles
any JSON-serialisable Python data structure:
- Dictionaries
- Lists
- Strings
- Integers and floats
- Booleans
- None
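A minimal round-trip sketch covering all of these types, using the `encode`/`decode` API described under Usage (the values here are arbitrary):

```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    # One value exercising every supported type
    data = {
        "name": "example",   # string
        "values": [1, 2.5],  # list, integer, float
        "enabled": True,     # boolean
        "missing": None,     # None
    }

    blob = encoder.encode(data, cache)
    assert encoder.decode(blob, cache) == data
```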
Usage¶
Direct Encoding¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    # Encode any JSON-serialisable data
    data = {
        "messages": [
            {"role": "user", "content": "What is BMI?"},
        ],
        "temperature": 0.7,
        "max_tokens": 100,
    }

    # Encode to compact binary format
    blob = encoder.encode(data, cache)

    # Decode back to original structure
    decoded = encoder.decode(blob, cache)
    assert decoded == data
```
With TokenCache Auto-Encoding¶
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    # Register for auto-encoding
    cache.register_encoder("json", JsonEncoder())

    # Store and retrieve with automatic encoding
    cache.put_data("hash1", "json", {"key": "value"})
    data = cache.get_data("hash1", "json")
```
Export/Import for Human-Readable Files¶
```python
from pathlib import Path

from causaliq_knowledge.cache.encoders import JsonEncoder

encoder = JsonEncoder()

# Export to JSON file
data = {"messages": [{"role": "user", "content": "Hello"}]}
encoder.export(data, Path("data.json"))

# Import from JSON file
imported = encoder.import_(Path("data.json"))
```
Encoding Format¶
The encoder uses three type markers for compact binary representation:
| Marker | Value | Description |
|---|---|---|
| TOKEN_REF | 0x00 | Token ID reference (uint16) |
| LITERAL_INT | 0x01 | 64-bit signed integer |
| LITERAL_FLOAT | 0x02 | 64-bit double float |
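To make the layout concrete, here is a sketch of a reader that walks an encoded blob and yields its markers. This is illustrative only: `walk` is a hypothetical helper, not part of the library, and the byte order shown (big-endian) is an assumption.

```python
import struct

TOKEN_REF, LITERAL_INT, LITERAL_FLOAT = 0x00, 0x01, 0x02

def walk(blob: bytes):
    """Yield (kind, value) pairs from an encoded blob.

    Assumes big-endian layout; the real encoder may differ.
    """
    pos = 0
    while pos < len(blob):
        marker = blob[pos]
        pos += 1
        if marker == TOKEN_REF:
            (token_id,) = struct.unpack_from(">H", blob, pos)  # uint16
            pos += 2
            yield ("token", token_id)
        elif marker == LITERAL_INT:
            (value,) = struct.unpack_from(">q", blob, pos)  # signed int64
            pos += 8
            yield ("int", value)
        elif marker == LITERAL_FLOAT:
            (value,) = struct.unpack_from(">d", blob, pos)  # float64
            pos += 8
            yield ("float", value)
        else:
            raise ValueError(f"unknown marker {marker:#04x}")
```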
What Gets Tokenised¶
| Element | Tokenised | Rationale |
|---|---|---|
| JSON structural chars (`{`, `}`, `[`, `]`, `:`, `,`) | Yes | Very frequent |
| String quotes (`"`) | Yes | Frequent |
| String content (words) | Yes | High repetition across entries |
| `null`, `true`, `false` | Yes | Fixed vocabulary |
| Integers | No | Stored as 8-byte literals |
| Floats | No | Stored as 8-byte literals |
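One practical consequence: re-encoding the same structure with different numbers should not grow the token dictionary, while new words should. A sketch, assuming (per the table above) that numbers are stored as literals, and using the `token_count()` method shown under Token Reuse below:

```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()

    encoder.encode({"scores": [1, 2, 3]}, cache)
    before = cache.token_count()

    # Same structure, different integers: literals only, no new tokens
    encoder.encode({"scores": [4, 5, 6]}, cache)
    assert cache.token_count() == before

    # New words enter the shared dictionary once each
    encoder.encode({"note": "hello world"}, cache)
    assert cache.token_count() > before
```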
Compression Example¶
```
Original: {"role": "user", "content": "Hello world"}

Tokens:   {   "   role  "   :   "   user  "   ,   "   content  "   :   "   Hello  world  "   }
          T1  T2  T3    T2  T4  T2  T5    T2  T6  T2  T7       T2  T4  T2  T8     T9     T2  T10
```

Each token reference costs 3 bytes (0x00 marker + uint16 token ID). Repeated elements such as `"` (T2) and `:` (T4) map to a single ID, and every ID is reused again by later cache entries.

Typical compression: 50-70% vs raw JSON
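You can check the blob size for a given entry directly. A small sketch using only the documented `encode` API; note that a single small entry may not show the full saving, since compression accumulates as later entries reuse the shared dictionary:

```python
import json

from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    encoder = JsonEncoder()
    data = {"role": "user", "content": "Hello world"}

    raw_size = len(json.dumps(data).encode("utf-8"))
    blob_size = len(encoder.encode(data, cache))
    print(f"raw JSON: {raw_size} bytes, encoded: {blob_size} bytes")
```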
Token Reuse¶
Tokens are shared across all entries in a cache, providing cumulative compression benefits:
```python
from causaliq_knowledge.cache import TokenCache
from causaliq_knowledge.cache.encoders import JsonEncoder

with TokenCache(":memory:") as cache:
    cache.register_encoder("json", JsonEncoder())

    # Common terms like "role", "content", "user" are tokenised once
    cache.put_data("h1", "json", {"role": "user", "content": "Hello"})
    cache.put_data("h2", "json", {"role": "assistant", "content": "Hi"})
    cache.put_data("h3", "json", {"role": "user", "content": "Bye"})

    # "role", "content", "user" reuse the same token IDs across entries
    print(f"Total tokens: {cache.token_count()}")  # Far fewer than the total word count
```
API Reference¶
JsonEncoder¶
Tokenised encoding for JSON-serialisable data.
Uses a shared token dictionary for JSON structure and text content. Numbers are stored as binary literals. Typical compression is 50-70%.
Encoding format
- Token reference: 0x00 + uint16 (token ID)
- Integer literal: 0x01 + int64 (8 bytes, signed)
- Float literal: 0x02 + float64 (8 bytes, double)
Example

```python
>>> from causaliq_knowledge.cache import TokenCache
>>> from causaliq_knowledge.cache.encoders import JsonEncoder
>>> with TokenCache(":memory:") as cache:
...     encoder = JsonEncoder()
...     data = {"key": "value", "count": 42}
...     blob = encoder.encode(data, cache)
...     decoded = encoder.decode(blob, cache)
...     assert decoded == data
```
Methods:

- `encode` – Encode JSON-serialisable data to tokenised binary format.
- `decode` – Decode tokenised binary data back to JSON structure.
- `export` – Export data to JSON file.
- `import_` – Import data from JSON file.

Attributes:

- `default_export_format` (`str`) – Default file extension for exports.
encode¶

```python
encode(data: Any, token_cache: TokenCache) -> bytes
```

Encode JSON-serialisable data to tokenised binary format.

Parameters:

- `data` (`Any`) – Any JSON-serialisable data (dict, list, str, int, etc.).
- `token_cache` (`TokenCache`) – Cache instance for shared token dictionary.

Returns:

- `bytes` – Compact binary representation using token IDs and literals.
decode¶

```python
decode(blob: bytes, token_cache: TokenCache) -> Any
```

Decode tokenised binary data back to JSON structure.

Parameters:

- `blob` (`bytes`) – Binary data from cache.
- `token_cache` (`TokenCache`) – Cache instance for shared token dictionary.

Returns:

- `Any` – Decoded JSON-compatible data structure.