Oracle - Synthetic Data Generator¶

The Oracle class is a specialized data adapter that stands in for data generated from a known Bayesian Network. Rather than sampling rows, it answers statistical queries analytically from the network's structure and parameters. It is primarily used for algorithm validation, benchmarking, and controlled experiments where the true underlying causal structure is known.

Class Definition¶

class Oracle(Data):
    """Oracle data adapter for synthetic data generation from Bayesian Networks.

    Args:
        bn: A BN (Bayesian Network) object from causaliq-core.

    Attributes:
        bn: The underlying Bayesian Network object.
    """

Constructor¶

`__init__(bn) -> None`¶

Creates an Oracle data adapter from a Bayesian Network.

Arguments:

bn: BN object from causaliq-core containing DAG structure and CPTs

Validation:

Input must be a valid BN object
BN must contain both DAG structure and conditional probability tables
All nodes must have associated conditional distributions

Initialization:

Extracts node names from BN DAG structure
Determines variable types from conditional distributions (CPT vs continuous)
Sets initial sample size to 1 (can be changed with set_N())

Example:

from causaliq_core.bn.io import read_bn
from causaliq_data import Oracle

# Load BN from file
bn = read_bn("cancer.dsc")

# Create Oracle adapter
oracle = Oracle(bn)
print(f"Nodes: {oracle.nodes}")
print(f"Types: {oracle.node_types}")
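The validation and initialization steps listed above can be pictured with a small stand-alone sketch. It does not use causaliq-core: plain dictionaries stand in for the BN object, and the function name `init_oracle_state`, the `"kind"` field, the type labels, and the choice of `ValueError` are all assumptions made for illustration.

```python
# Hypothetical stand-ins: plain dicts replace the BN object from causaliq-core.
def init_oracle_state(dag_parents, cnds):
    """Validate a toy BN and return the state an Oracle-like adapter might hold.

    dag_parents maps each node to a list of its parents (the DAG structure);
    cnds maps each node to a dict whose "kind" is "cpt" or "continuous".
    """
    # Every node in the DAG needs an associated conditional distribution.
    missing = [node for node in dag_parents if node not in cnds]
    if missing:
        # The exception type is an assumption; the source does not specify it.
        raise ValueError(f"nodes without a conditional distribution: {missing}")

    # Derive a variable type per node from the kind of its distribution.
    node_types = {
        node: "category" if cnds[node]["kind"] == "cpt" else "continuous"
        for node in dag_parents
    }

    # The sample size starts at 1, as described for the constructor.
    return {"nodes": tuple(dag_parents), "node_types": node_types, "N": 1}


state = init_oracle_state(
    {"Smoker": [], "Pollution": [], "Cancer": ["Smoker", "Pollution"]},
    {
        "Smoker": {"kind": "cpt"},
        "Pollution": {"kind": "cpt"},
        "Cancer": {"kind": "cpt"},
    },
)
print(state["nodes"], state["N"])
```

The real constructor may check more than this (for example, that the argument is a BN instance at all); the sketch only covers the three documented points.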

Synthetic Data Generation¶

`set_N(N, seed=None, random_selection=False) -> None`¶

Sets the effective sample size for synthetic data operations.

Arguments:

N: Target sample size for synthetic data generation
seed: Must be None (not supported for Oracle)
random_selection: Must be False (not applicable)

Behavior:

Updates internal sample size counter
Does not actually generate data (Oracle provides analytical answers)
Used by algorithms to determine confidence/precision of estimates

Validation:

N must be positive integer
seed parameter must be None (raises TypeError if provided)
random_selection must be False

Usage:

oracle.set_N(10000)  # Set effective sample size
print(f"Sample size: {oracle.N}")
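The argument checks described for set_N() could look roughly like the following. This is a stand-alone sketch, not the library's code: `check_set_n_args` is a hypothetical helper, and only the TypeError for a non-None seed comes from the description above; the other exception types are assumptions.

```python
def check_set_n_args(N, seed=None, random_selection=False):
    """Mirror the documented set_N() checks and return N if they all pass."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(N, int) or isinstance(N, bool):
        raise TypeError("N must be an integer")  # exception type assumed
    if N < 1:
        raise ValueError("N must be positive")  # exception type assumed
    if seed is not None:
        # Documented behaviour: a seed is not supported for Oracle.
        raise TypeError("seed is not supported for Oracle")
    if random_selection is not False:
        raise TypeError("random_selection must be False")  # exception type assumed
    return N


print(check_set_n_args(10000))

try:
    check_set_n_args(10000, seed=42)
except TypeError as error:
    print(f"Rejected: {error}")
```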

Statistical Operations¶

`marginals(node, parents, values_reqd=False) -> Tuple`¶

Provides exact marginal distributions from the Bayesian Network.

Arguments:

node: Target node name (internal name)
parents: Dictionary specifying parent values {parent: value}
values_reqd: Whether to return value labels (always False for Oracle)

Returns:

Exact conditional probability distribution for the node given parents
For categorical nodes: probability vector over possible values
For continuous nodes: parameters of the conditional distribution

Features:

Exact Results: Returns true probabilities, not empirical estimates
No Sampling Error: Results are analytical, not subject to sampling variation
Efficient Computation: Leverages BN's internal probability representations

Example:

# Get P(Cancer | Smoker=True, Pollution=High)
marginal = oracle.marginals("Cancer", 
                           {"Smoker": "True", "Pollution": "High"})
print(f"P(Cancer=True|evidence): {marginal[0][1]}")
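To see why an exact result differs from an empirical one, the sketch below looks up a conditional distribution in a hand-written table and compares it with an estimate from simulated draws. Everything here is illustrative: the probabilities are made up, and `exact_conditional` and `sampled_estimate` are hypothetical helpers rather than part of the Oracle API.

```python
import random

# Illustrative numbers only, not the parameters of any published network.
# Keys are (Smoker, Pollution) parent values; values are P(Cancer | parents).
CANCER_CPT = {
    ("True", "High"): {"True": 0.05, "False": 0.95},
    ("True", "Low"): {"True": 0.03, "False": 0.97},
    ("False", "High"): {"True": 0.02, "False": 0.98},
    ("False", "Low"): {"True": 0.001, "False": 0.999},
}


def exact_conditional(cpt, parents, parent_order=("Smoker", "Pollution")):
    """Return the stored distribution for the given parent values."""
    key = tuple(parents[name] for name in parent_order)
    return cpt[key]


def sampled_estimate(dist, n, seed):
    """Estimate P(True) from n simulated draws; the result varies with the seed."""
    rng = random.Random(seed)
    hits = sum(rng.random() < dist["True"] for _ in range(n))
    return hits / n


exact = exact_conditional(CANCER_CPT, {"Smoker": "True", "Pollution": "High"})
print(f"Exact P(Cancer=True | evidence): {exact['True']}")
print(f"Estimate from 1000 draws: {sampled_estimate(exact, 1000, seed=0)}")
```

The lookup returns the same value on every call, whereas the estimate changes with the seed and the number of draws.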

`values(nodes) -> np.ndarray`¶

Not Implemented: Oracle does not store actual data values.

Raises:

TypeError: Always raised with message "Oracle.values() not implemented"

Rationale:

Oracle provides analytical results, not sampled data
Use concrete adapters (Pandas/NumPy) for data value access
Consistent with Oracle's role as synthetic probability source

Specialized Oracle Features¶

True Parameter Access¶

Oracle provides direct access to the true parameters of the Bayesian Network:

Conditional Probability Tables:

# Access true CPT for a categorical node
cpt = oracle.bn.cnds["Disease"]
print("True conditional probabilities:")
for parent_config in cpt.parents_configs():
    for value in cpt.values:
        prob = cpt.get_prob(parent_config, value)
        print(f"P({value}|{parent_config}) = {prob}")

Network Structure:

# Access true DAG structure
dag = oracle.bn.dag
print(f"True edges: {dag.edges}")
print(f"True parents of X: {dag.parents('X')}")

Algorithm Validation¶

Oracle is ideal for validating causal discovery algorithms:

Score Validation:

# Compare the learned structure's score with the true structure's score
learned_score = oracle.score(learned_dag)
true_score = oracle.score(oracle.bn.dag)  # True structure score
print(f"Score difference: {abs(learned_score - true_score)}")

Conditional Independence Testing:

# Test algorithm's CI conclusions against true model
matches = 0
for x, y, z in ci_tests:
    true_independent = oracle.bn.d_separated(x, y, z)
    algorithm_independent = algorithm.ci_test(oracle, x, y, z)
    matches += (true_independent == algorithm_independent)

# Fraction of CI tests where the algorithm agrees with the true model
accuracy = matches / len(ci_tests)
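For readers who want to see what a ground-truth independence check involves, here is a self-contained sketch of d-separation using the standard ancestral moral graph criterion. It is independent of causaliq-core: the DAG is a plain dictionary mapping each node to its parents, the function names are hypothetical, and the five-node network is only an illustration.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, x, y, z):
    """True if x and y are d-separated given the set z (x and y not in z)."""
    z = set(z)
    keep = ancestors(parents, {x, y} | z)

    # Moralise the ancestral subgraph: link each child to its parents,
    # and link every pair of parents that share a child.
    adjacent = {node: set() for node in keep}
    for child in keep:
        node_parents = list(parents.get(child, ()))
        for parent in node_parents:
            adjacent[child].add(parent)
            adjacent[parent].add(child)
        for first, second in combinations(node_parents, 2):
            adjacent[first].add(second)
            adjacent[second].add(first)

    # Search from x without passing through conditioned nodes.
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for neighbour in adjacent[node] - z - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return True


# Illustrative DAG: Pollution -> Cancer <- Smoker, Cancer -> Xray, Cancer -> Dyspnoea
PARENTS = {
    "Cancer": ["Pollution", "Smoker"],
    "Xray": ["Cancer"],
    "Dyspnoea": ["Cancer"],
}

print(d_separated(PARENTS, "Pollution", "Smoker", []))         # True
print(d_separated(PARENTS, "Pollution", "Smoker", ["Cancer"]))  # False
print(d_separated(PARENTS, "Xray", "Dyspnoea", ["Cancer"]))     # True
```

Conditioning on the collider Cancer connects its two parents, while conditioning on it as a common cause separates its two children, which is the behaviour a CI test is being checked against.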

Limitations and Constraints¶

Unsupported Operations¶

Data Value Access:

values() method raises TypeError
No actual data samples available
Use for probability queries only

Randomization Restrictions:

randomise_names() raises NotImplementedError
Name randomization not meaningful for Oracle
Node names tied to BN structure

Sampling Limitations:

No row-level sampling or shuffling
set_N() only affects effective sample size for algorithms
No actual data generation performed

Data Type Constraints¶

Variable Types:

Categorical variables: Must have finite discrete values
Continuous variables: Limited to supported distribution types
Mixed networks: Handled according to individual node types

Integration with CausalIQ Ecosystem¶

Algorithm Testing Framework¶

from causaliq_core.bn.io import read_bn
from causaliq_data import Oracle

def test_algorithm_accuracy(algorithm, test_bns):
    results = []
    for bn_file in test_bns:
        # Load true BN
        bn = read_bn(bn_file)
        oracle = Oracle(bn)

        # Run algorithm
        oracle.set_N(10000)  # Large effective sample size
        learned_structure = algorithm.run(oracle)

        # Compare with true structure  
        accuracy = compare_structures(bn.dag, learned_structure)
        results.append(accuracy)

    return results
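The compare_structures helper used above is not defined in this page. One plausible shape for it, assuming structures are given as collections of (parent, child) edge tuples rather than DAG objects, is an edge-level precision, recall, and F1 comparison. The signature and return format below are assumptions for illustration.

```python
def compare_structures(true_edges, learned_edges):
    """Compare two sets of directed (parent, child) edges.

    Hypothetical helper: a reversed edge counts as both a missing true edge
    and an extra learned edge.
    """
    true_set, learned_set = set(true_edges), set(learned_edges)
    correct = len(true_set & learned_set)

    precision = correct / len(learned_set) if learned_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


true_edges = [("Smoker", "Cancer"), ("Pollution", "Cancer"), ("Cancer", "Xray")]
learned_edges = [("Smoker", "Cancer"), ("Cancer", "Pollution"), ("Cancer", "Xray")]

metrics = compare_structures(true_edges, learned_edges)
print({name: round(value, 3) for name, value in metrics.items()})
```

Here one edge is learned with the wrong orientation, so two of three edges match and all three metrics come out at 2/3. Other measures, such as structural Hamming distance, would be equally reasonable choices.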

Benchmark Experiments¶

def benchmark_scoring_functions(oracle, scoring_functions):
    true_score = {}
    for score_fn in scoring_functions:
        # Get score for true structure
        true_score[score_fn.name] = score_fn.calculate(oracle, oracle.bn.dag)

        # Test alternative structures
        for alt_structure in generate_alternatives(oracle.bn.dag):
            alt_score = score_fn.calculate(oracle, alt_structure)
            print(f"{score_fn.name}: True={true_score[score_fn.name]:.3f}, "
                  f"Alt={alt_score:.3f}")

Stability Analysis¶

def analyze_algorithm_stability(algorithm, oracle, trials=100):
    # Oracle provides consistent "data" across trials
    results = []
    for trial in range(trials):
        oracle.randomise_order(trial)  # Change processing order
        result = algorithm.run(oracle)
        results.append(result)

    # Analyze consistency of results
    return assess_stability(results)
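The assess_stability helper is likewise left undefined here. A minimal sketch, assuming each trial result is a collection of (parent, child) edges, is to measure how often the most common structure is returned. The function body and return fields are assumptions, not part of CausalIQ.

```python
from collections import Counter


def assess_stability(results):
    """Summarise how consistently the trials returned the same edge set."""
    if not results:
        raise ValueError("at least one result is required")

    # frozenset makes each edge collection hashable and order-independent.
    counts = Counter(frozenset(edges) for edges in results)
    modal_structure, frequency = counts.most_common(1)[0]
    return {
        "distinct_structures": len(counts),
        "modal_share": frequency / len(results),
        "modal_structure": modal_structure,
    }


trial_results = [
    [("A", "B"), ("B", "C")],
    [("B", "C"), ("A", "B")],  # same edges, different order
    [("B", "A"), ("B", "C")],  # one edge reversed
]
summary = assess_stability(trial_results)
print(summary["distinct_structures"], round(summary["modal_share"], 3))
```

A modal share of 1.0 would mean the algorithm returned the same structure in every trial, regardless of processing order.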

Performance Characteristics¶

Computational Efficiency¶

Analytical Operations: No sampling or counting required
Exact Computations: Probability queries return exact values
Memory Efficient: No large data arrays stored
Fast Initialization: Only stores BN structure and parameters

Scalability¶

Network Size: Performance depends on BN complexity, not sample size
Query Complexity: Marginal queries scale with network connectivity
Memory Usage: Minimal, proportional to BN size only
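As a rough illustration of what "proportional to BN size" means for categorical networks, the sketch below counts free CPT parameters: a node with r values and parents whose value counts multiply to q needs (r - 1) * q of them. The helper name and the all-binary five-node network are assumptions for illustration; the count does not involve the sample size N.

```python
from math import prod


def free_parameters(cardinality, parents):
    """Count free CPT parameters in a categorical Bayesian network."""
    total = 0
    for node, value_count in cardinality.items():
        # Number of parent value combinations (1 when the node has no parents).
        parent_configs = prod(cardinality[p] for p in parents.get(node, ()))
        total += (value_count - 1) * parent_configs
    return total


# Illustrative all-binary network: Pollution -> Cancer <- Smoker,
# Cancer -> Xray, Cancer -> Dyspnoea
cardinality = {"Pollution": 2, "Smoker": 2, "Cancer": 2, "Xray": 2, "Dyspnoea": 2}
parents = {"Cancer": ["Pollution", "Smoker"], "Xray": ["Cancer"], "Dyspnoea": ["Cancer"]}

print(free_parameters(cardinality, parents))  # 10
```

Because the parent term is a product, adding parents to a node grows its table multiplicatively, which is why connectivity matters more than node count alone.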

Best Practices¶

When to Use Oracle¶

Algorithm Validation: Testing against known ground truth
Benchmarking: Comparing algorithm performance across known structures
Method Development: Developing new algorithms with reliable test cases
Educational Use: Demonstrating causal discovery concepts

Usage Patterns¶

# Validation workflow
oracle = Oracle(known_bn)
oracle.set_N(sample_size)

# Test your algorithm
learned_result = your_algorithm.discover(oracle)

# Compare with truth
accuracy_metrics = evaluate_against_truth(learned_result, oracle.bn)

Integration Tips¶

Use Oracle early in algorithm development for debugging
Combine with Pandas/NumPy adapters for comprehensive testing
Leverage exact probabilities for theoretical analysis
Document true structure properties for result interpretation