Network Context Format¶

This guide describes the JSON format of the network context files that causaliq-knowledge uses for graph generation.

Overview¶

Network context files define the variables, metadata, and constraints for a causal network. They enable LLMs to generate causal graphs with appropriate domain context while allowing control over how much information is provided.

Minimal Example¶

The simplest valid network context:

{
    "network": "simple",
    "domain": "test",
    "variables": [
        {"name": "X", "type": "binary"},
        {"name": "Y", "type": "binary"}
    ]
}

Complete Example¶

A fuller network context that uses most of the optional fields:

{
    "schema_version": "2.0",
    "network": "smoking_cancer",
    "domain": "epidemiology",
    "purpose": "Causal model for smoking and lung cancer relationship",

    "provenance": {
        "source_network": "LUNG",
        "source_reference": "Lauritzen & Spiegelhalter (1988)",
        "source_url": "https://example.com/lung-network",
        "memorization_risk": "high",
        "notes": "Well-known benchmark, LLMs may have memorised structure"
    },

    "llm_guidance": {
        "usage_notes": [
            "Focus on biological plausibility",
            "Consider temporal ordering of variables"
        ],
        "do_not_provide": [
            "Ground truth edges",
            "Canonical variable names"
        ],
        "expected_difficulty": "medium"
    },

    "prompt_details": {
        "minimal": {
            "description": "Variable names only",
            "include_fields": ["name"]
        },
        "standard": {
            "description": "Names with basic metadata",
            "include_fields": ["name", "type", "short_description", "states"]
        },
        "rich": {
            "description": "Full context for complex reasoning",
            "include_fields": [
                "name", "type", "role", "short_description",
                "extended_description", "states", "sensitivity_hints"
            ]
        }
    },

    "variables": [
        {
            "name": "smoke",
            "llm_name": "tobacco_history",
            "display_name": "Smoking Status",
            "type": "binary",
            "states": ["never", "ever"],
            "role": "exogenous",
            "category": "exposure",
            "short_description": "Patient tobacco smoking history.",
            "extended_description": "Self-reported lifetime smoking history. Known major risk factor for lung cancer with dose-response relationship.",
            "base_rate": {"never": 0.65, "ever": 0.35},
            "sensitivity_hints": "Strong causal effect on respiratory outcomes.",
            "related_domain_knowledge": [
                "Smoking contains carcinogens that damage lung tissue",
                "Risk increases with duration and intensity"
            ],
            "references": ["Doll & Hill (1950)", "IARC Monograph"]
        },
        {
            "name": "cancer",
            "llm_name": "lung_malignancy",
            "display_name": "Lung Cancer Status",
            "type": "binary",
            "states": ["negative", "positive"],
            "role": "endogenous",
            "category": "outcome",
            "short_description": "Lung cancer diagnosis.",
            "extended_description": "Confirmed lung cancer diagnosis. Primary outcome variable in smoking studies.",
            "sensitivity_hints": "Caused by multiple factors including smoking and genetics."
        },
        {
            "name": "genetics",
            "llm_name": "genetic_predisposition",
            "type": "ordinal",
            "states": ["low", "medium", "high"],
            "role": "exogenous",
            "category": "genetic",
            "short_description": "Genetic predisposition to lung cancer."
        }
    ],

    "constraints": {
        "forbidden_edges": [
            ["cancer", "smoke"],
            ["cancer", "genetics"]
        ],
        "required_edges": [],
        "partial_order": [
            ["smoke", "cancer"],
            ["genetics", "cancer"]
        ]
    },

    "ground_truth": {
        "edges_expert": [
            ["smoke", "cancer"],
            ["genetics", "cancer"]
        ]
    }
}

Field Reference¶

Root Fields¶

Field	Type	Required	Description
`schema_version`	string	No	Schema version (default: "2.0")
`network`	string	Yes	Identifier for the network (e.g., "asia")
`domain`	string	Yes	Domain (e.g., "epidemiology", "genetics")
`purpose`	string	No	Purpose or description of the context
`provenance`	object	No	Source and provenance information
`llm_guidance`	object	No	Guidance for LLM interactions
`prompt_details`	object	No	Custom prompt detail definitions
`variables`	array	Yes	List of variable specifications
`constraints`	object	No	Structural constraints
`causal_principles`	array	No	Domain causal principles
`ground_truth`	object	No	Ground truth for evaluation

Variable Fields¶

Field	Type	Required	Description
`name`	string	Yes	Benchmark/literature name for ground truth
`llm_name`	string	No	Name used for LLM queries (defaults to name)
`type`	string	Yes	One of: binary, categorical, ordinal, continuous
`display_name`	string	No	Human-readable name for reports
`aliases`	array	No	Alternative names
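
The "Required" columns in the tables above can be checked with a few lines of plain Python. The sketch below is illustrative only: `check_required_fields` and `ALLOWED_TYPES` are hypothetical names, not part of causaliq-knowledge, and the library's own validation may be stricter.

```python
# Hypothetical helper, not part of causaliq-knowledge. It checks only the
# fields marked "Yes" in the Required columns and the allowed type values.
ALLOWED_TYPES = {"binary", "categorical", "ordinal", "continuous"}


def check_required_fields(context: dict) -> list[str]:
    """Return a list of problems; an empty list means the basics are present."""
    problems = []
    for field in ("network", "domain", "variables"):
        if field not in context:
            problems.append(f"missing root field: {field}")
    for index, variable in enumerate(context.get("variables", [])):
        for field in ("name", "type"):
            if field not in variable:
                problems.append(f"variable {index}: missing {field}")
        var_type = variable.get("type")
        if var_type is not None and var_type not in ALLOWED_TYPES:
            problems.append(f"variable {index}: unknown type {var_type!r}")
    return problems


minimal = {
    "network": "simple",
    "domain": "test",
    "variables": [
        {"name": "X", "type": "binary"},
        {"name": "Y", "type": "binary"},
    ],
}
print(check_required_fields(minimal))  # []
```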

Semantic Disguising with name vs llm_name¶

The name and llm_name fields enable semantic disguising - using meaningful but non-canonical names to reduce LLM memorisation whilst still allowing evaluation against benchmarks.

Example: For the ASIA network's "Tuberculosis" variable:

{
  "name": "tub",
  "llm_name": "HasTB",
  "display_name": "Tuberculosis Status",
  "type": "binary"
}

name: "tub" - the original ASIA benchmark identifier for evaluation
llm_name: "HasTB" - sent to the LLM (meaningful but not the benchmark name)
display_name: "Tuberculosis Status" - for human-readable reports

This approach:

Reduces memorisation - LLM sees "HasTB" not "tub" from the benchmark
Preserves semantics - The llm_name still conveys clinical meaning
Enables evaluation - Results mapped back via name field
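
The round trip between the two names can be sketched as a simple lookup table. The helpers below are hypothetical and only illustrate the idea; the second variable is made up to show the documented fallback of llm_name to name.

```python
# Hypothetical helpers illustrating the name / llm_name round trip.
def llm_to_benchmark_names(variables: list[dict]) -> dict[str, str]:
    """Map each llm_name (falling back to name) to its benchmark name."""
    return {v.get("llm_name", v["name"]): v["name"] for v in variables}


def map_edges_back(edges, variables):
    """Translate edges expressed in LLM names into benchmark names."""
    lookup = llm_to_benchmark_names(variables)
    return [(lookup[source], lookup[target]) for source, target in edges]


variables = [
    {"name": "tub", "llm_name": "HasTB", "type": "binary"},
    # Illustrative variable with no llm_name, so its name is used as-is.
    {"name": "xray", "type": "binary"},
]
llm_edges = [("HasTB", "xray")]  # edges as an LLM might return them
print(map_edges_back(llm_edges, variables))  # [('tub', 'xray')]
```

Edges mapped back this way use the same identifiers as the ground truth, so they can be compared directly.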

Additional Variable Fields¶

Field	Type	Required	Description
`states`	array	No	Possible values for discrete variables
`role`	string	No	One of: exogenous, endogenous, latent
`category`	string	No	Domain-specific category
`short_description`	string	No	Brief description
`extended_description`	string	No	Detailed description with context
`base_rate`	object	No	Prior probabilities
`conditional_rates`	object	No	Conditional probabilities
`sensitivity_hints`	string	No	Hints about causal relationships
`related_domain_knowledge`	array	No	Domain knowledge statements
`references`	array	No	Literature references
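
When both states and base_rate are given, it can be useful to check that they agree. The sketch below assumes that base_rate has one entry per state and that the probabilities sum to 1; the format description above does not state these rules, so treat them as assumptions. `check_base_rate` is a hypothetical helper.

```python
import math


def check_base_rate(variable: dict) -> list[str]:
    """Return problems found in a variable's base_rate (hypothetical check)."""
    problems = []
    base_rate = variable.get("base_rate")
    if base_rate is None:
        return problems  # base_rate is optional, nothing to check
    states = variable.get("states")
    # Assumption: base_rate has exactly one entry per declared state.
    if states is not None and set(base_rate) != set(states):
        problems.append("base_rate keys do not match states")
    # Assumption: the prior probabilities form a distribution.
    if not math.isclose(sum(base_rate.values()), 1.0, abs_tol=1e-9):
        problems.append("base_rate probabilities do not sum to 1")
    return problems


smoke = {
    "name": "smoke",
    "type": "binary",
    "states": ["never", "ever"],
    "base_rate": {"never": 0.65, "ever": 0.35},
}
print(check_base_rate(smoke))  # []
```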

Variable Types¶

binary - Two states (yes/no, present/absent)
categorical - Multiple unordered states (colours, blood groups)
ordinal - Multiple ordered states (low/medium/high, stages)
continuous - Numeric values (age, temperature, concentrations)

Variable Roles¶

exogenous - No parents in the causal graph (root causes)
endogenous - Has parents (caused by other variables)
latent - Unobserved/hidden variable
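
Because an exogenous variable has no parents, a declared role can be compared against a set of edges. The following is a minimal sketch with a hypothetical helper name; whether the library performs such a check is not stated here.

```python
def exogenous_with_parents(variables, edges):
    """Return names of variables declared exogenous that receive an edge."""
    exogenous = {v["name"] for v in variables if v.get("role") == "exogenous"}
    return sorted({target for _source, target in edges if target in exogenous})


variables = [
    {"name": "smoke", "type": "binary", "role": "exogenous"},
    {"name": "genetics", "type": "ordinal", "role": "exogenous"},
    {"name": "cancer", "type": "binary", "role": "endogenous"},
]

# Edges are [source, target] pairs, as in the constraints and ground truth.
consistent = [["smoke", "cancer"], ["genetics", "cancer"]]
inconsistent = [["cancer", "smoke"]]

print(exogenous_with_parents(variables, consistent))    # []
print(exogenous_with_parents(variables, inconsistent))  # ['smoke']
```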

Prompt Details¶

Prompt details control how much variable information is provided to LLMs:

Minimal View¶

Only variable names - tests pure structural reasoning:

{"include_fields": ["name"]}

Example output for a context whose variables are named smoking and cancer: [{"name": "smoking"}, {"name": "cancer"}]

Standard View¶

Names with basic metadata - balanced context:

{"include_fields": ["name", "type", "short_description", "states"]}

Rich View¶

Full context for complex reasoning:

{
    "include_fields": [
        "name", "type", "role", "short_description",
        "extended_description", "states", "sensitivity_hints"
    ]
}
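
Each view amounts to projecting every variable onto its include_fields list. The sketch below shows that projection with a hypothetical `apply_prompt_detail` function. It assumes that fields missing from a variable are simply skipped, and it does not substitute llm_name for name, which the library may do.

```python
def apply_prompt_detail(variables, include_fields):
    """Keep only the listed fields of each variable, in the listed order."""
    return [
        {field: variable[field] for field in include_fields if field in variable}
        for variable in variables
    ]


variables = [
    {"name": "smoke", "type": "binary", "states": ["never", "ever"],
     "short_description": "Patient tobacco smoking history."},
    {"name": "cancer", "type": "binary", "states": ["negative", "positive"],
     "short_description": "Lung cancer diagnosis."},
]
prompt_details = {
    "minimal": {"include_fields": ["name"]},
    "standard": {
        "include_fields": ["name", "type", "short_description", "states"]
    },
}

minimal_view = apply_prompt_detail(
    variables, prompt_details["minimal"]["include_fields"]
)
print(minimal_view)  # [{'name': 'smoke'}, {'name': 'cancer'}]
```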

Constraints¶

Structural constraints guide graph generation:

Forbidden Edges¶

Edges that must not exist, each written as a [source, target] pair:

{
    "forbidden_edges": [
        ["effect", "cause"],
        ["outcome", "exposure"]
    ]
}

Required Edges¶

Edges that must exist, each written as a [source, target] pair:

{
    "required_edges": [
        ["treatment", "outcome"]
    ]
}

Partial Order¶

Temporal ordering constraints. Each pair [A, B] means that A must precede B:

{
    "partial_order": [
        ["birth_year", "diagnosis_age"],
        ["exposure", "disease"]
    ]
}
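
A proposed edge list can be checked against all three constraint kinds in one pass. The sketch below is a hypothetical checker, not the library's implementation. It assumes that a partial order pair [A, B] rules out the direct edge from B to A, and it looks only at direct edges, not at longer paths.

```python
def check_constraints(edges, constraints):
    """Return human-readable violations of the three constraint kinds."""
    edge_set = {tuple(edge) for edge in edges}
    violations = []
    for source, target in constraints.get("forbidden_edges", []):
        if (source, target) in edge_set:
            violations.append(f"forbidden edge present: {source} -> {target}")
    for source, target in constraints.get("required_edges", []):
        if (source, target) not in edge_set:
            violations.append(f"required edge missing: {source} -> {target}")
    # Assumption: [A, B] in partial_order rules out the direct edge B -> A.
    for earlier, later in constraints.get("partial_order", []):
        if (later, earlier) in edge_set:
            violations.append(f"edge against partial order: {later} -> {earlier}")
    return violations


constraints = {
    "forbidden_edges": [["effect", "cause"]],
    "required_edges": [["treatment", "outcome"]],
    "partial_order": [["exposure", "disease"]],
}
proposed = [["cause", "effect"], ["disease", "exposure"]]

for violation in check_constraints(proposed, constraints):
    print(violation)
# required edge missing: treatment -> outcome
# edge against partial order: disease -> exposure
```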

Ground Truth¶

For evaluation, ground truth edges can be specified:

{
    "ground_truth": {
        "edges_expert": [["A", "B"], ["B", "C"]],
        "edges_experiment": [["A", "B"]],
        "edges_observational": [["A", "B"], ["B", "C"], ["A", "C"]]
    }
}

edges_expert - Edges from domain expert consensus
edges_experiment - Edges confirmed by experiments
edges_observational - Edges from observational studies
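
One common way to use these lists is to score generated edges by precision and recall against a chosen reference, for example edges_expert. The sketch below uses a hypothetical function and is not necessarily the metric that causaliq-knowledge reports.

```python
def edge_precision_recall(predicted, reference):
    """Compare directed edges; returns (precision, recall) as floats."""
    predicted_set = {tuple(edge) for edge in predicted}
    reference_set = {tuple(edge) for edge in reference}
    true_positives = len(predicted_set & reference_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


ground_truth = {
    "edges_expert": [["A", "B"], ["B", "C"]],
    "edges_experiment": [["A", "B"]],
}
# Illustrative prediction: one correct edge and one reversed edge.
predicted = [["A", "B"], ["C", "B"]]

precision, recall = edge_precision_recall(predicted, ground_truth["edges_expert"])
print(precision, recall)  # 0.5 0.5
```

Direction matters in this comparison, so the reversed edge C to B counts as both a false positive and a missed edge.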

Loading Network Context¶

from causaliq_knowledge.graph import NetworkContext

# Load from file
context = NetworkContext.load("models/cancer.json")

# Load with full validation
context, warnings = NetworkContext.load_and_validate("models/cancer.json")

# Access data
print(f"Network: {context.network}")
print(f"Domain: {context.domain}")
print(f"Variables: {context.get_variable_names()}")
print(f"LLM Names: {context.get_llm_names()}")

Example Network Contexts¶

Example network context files are in the research/models/ directory:

asia/ - ASIA network (pulmonary disease)
cancer/ - Lung cancer model
sachs/ - SACHS protein signalling network
diabetes/ - Diabetes risk factors
sepsis/ - Sepsis clinical model

Best Practices¶

Use meaningful llm_names - Aids LLM reasoning whilst avoiding memorisation
Provide short descriptions - Essential context for LLMs
Define custom prompt details - Control information disclosure
Set provenance - Document data sources
Include ground truth - Enable evaluation
Add constraints - Encode domain knowledge
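
The practices above can be turned into a simple lint pass over a context dictionary. This is a hypothetical sketch, with made-up hint wording, that only checks whether the relevant fields are present.

```python
def best_practice_hints(context: dict) -> list[str]:
    """Return hints for optional fields the best practices recommend."""
    hints = []
    root_checks = {
        "provenance": "document data sources",
        "prompt_details": "control information disclosure",
        "ground_truth": "enable evaluation",
        "constraints": "encode domain knowledge",
    }
    for field, reason in root_checks.items():
        if field not in context:
            hints.append(f"no {field}: add it to {reason}")
    for variable in context.get("variables", []):
        name = variable.get("name", "<unnamed>")
        for field in ("llm_name", "short_description"):
            if field not in variable:
                hints.append(f"{name}: no {field}")
    return hints


minimal = {
    "network": "simple",
    "domain": "test",
    "variables": [
        {"name": "X", "type": "binary"},
        {"name": "Y", "type": "binary"},
    ],
}
for hint in best_practice_hints(minimal):
    print(hint)
```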