Oracle - Synthetic Data Generator¶
The Oracle class provides a specialized data adapter backed by a known Bayesian Network. Rather than sampling rows, it answers statistical queries analytically from the network's structure and parameters. This adapter is primarily used for algorithm validation, benchmarking, and controlled experiments where the true underlying causal structure is known.
Class Definition¶
class Oracle(Data):
    """Oracle data adapter for synthetic data generation from Bayesian Networks.
    Args:
        bn: A BN (Bayesian Network) object from causaliq-core.
    Attributes:
        bn: The underlying Bayesian Network object.
    """
Constructor¶
__init__(bn) -> None¶
Creates an Oracle data adapter from a Bayesian Network.
Arguments:
- bn: BN object from causaliq-core containing DAG structure and CPTs
Validation:
- Input must be a valid BN object
- BN must contain both DAG structure and conditional probability tables
- All nodes must have associated conditional distributions
Initialization:
- Extracts node names from BN DAG structure
- Determines variable types from conditional distributions (CPT vs continuous)
- Sets initial sample size to 1 (can be changed with set_N())
Example:
from causaliq_core.bn.io import read_bn
from causaliq_data import Oracle
# Load BN from file
bn = read_bn("cancer.dsc")
# Create Oracle adapter
oracle = Oracle(bn)
print(f"Nodes: {oracle.nodes}")
print(f"Types: {oracle.node_types}")
Synthetic Data Generation¶
set_N(N, seed=None, random_selection=False) -> None¶
Sets the effective sample size for synthetic data operations.
Arguments:
- N: Target sample size for synthetic data generation
- seed: Must be None (not supported for Oracle)
- random_selection: Must be False (not applicable)
Behavior:
- Updates internal sample size counter
- Does not actually generate data (Oracle provides analytical answers)
- Used by algorithms to determine confidence/precision of estimates
Validation:
- N must be a positive integer
- seed parameter must be None (raises TypeError if provided)
- random_selection must be False
Usage:
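A minimal sketch of the contract described above, using a hypothetical `ToyOracle` stand-in rather than the real causaliq-data class. Only the `TypeError` for a supplied `seed` is stated in the text; the `ValueError` raised here for an invalid `N` or for `random_selection=True` is an assumption made for illustration.

```python
class ToyOracle:
    """Hypothetical stand-in that mirrors the documented set_N() contract."""

    def __init__(self):
        self.N = 1  # documented initial sample size

    def set_N(self, N, seed=None, random_selection=False):
        if seed is not None:
            # Documented: a seed is not supported and raises TypeError
            raise TypeError("seed is not supported for Oracle")
        if random_selection is not False:
            # Assumption: the exception type is not stated in the text
            raise ValueError("random_selection must be False")
        if isinstance(N, bool) or not isinstance(N, int) or N < 1:
            # Assumption: the exception type is not stated in the text
            raise ValueError("N must be a positive integer")
        self.N = N  # only the counter changes; no rows are generated


oracle = ToyOracle()
oracle.set_N(1000)
print(f"Effective sample size: {oracle.N}")

try:
    oracle.set_N(1000, seed=42)
except TypeError as err:
    print(f"Rejected: {err}")
```

With the real adapter the equivalent call would presumably be `oracle.set_N(1000)` on an `Oracle(bn)` instance.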
Statistical Operations¶
marginals(node, parents, values_reqd=False) -> Tuple¶
Provides exact marginal distributions from the Bayesian Network.
Arguments:
- node: Target node name (internal name)
- parents: Dictionary specifying parent values {parent: value}
- values_reqd: Whether to return value labels (always False for Oracle)
Returns:
- Exact conditional probability distribution for the node given parents
- For categorical nodes: probability vector over possible values
- For continuous nodes: parameters of the conditional distribution
Features:
- Exact Results: Returns true probabilities, not empirical estimates
- No Sampling Error: Results are analytical, not subject to sampling variation
- Efficient Computation: Leverages BN's internal probability representations
Example:
# Get P(Cancer | Smoker=True, Pollution=High)
marginal = oracle.marginals("Cancer",
                            {"Smoker": "True", "Pollution": "High"})
print(f"P(Cancer=True|evidence): {marginal[0][1]}")
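To make the "no sampling error" point concrete, the sketch below contrasts an exact CPT lookup with an estimate from simulated samples. It is self-contained and does not use the Oracle API: the `cpt` dictionary, its probabilities, and both helper functions are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical CPT for P(Cancer | Smoker, Pollution); the numbers are
# illustrative and are not taken from any published network.
cpt = {
    ("True", "High"): {"True": 0.05, "False": 0.95},
    ("True", "Low"): {"True": 0.03, "False": 0.97},
    ("False", "High"): {"True": 0.02, "False": 0.98},
    ("False", "Low"): {"True": 0.001, "False": 0.999},
}


def exact_conditional(parents):
    """Oracle-style answer: read the distribution straight from the CPT."""
    return cpt[(parents["Smoker"], parents["Pollution"])]


def empirical_conditional(parents, n, seed=0):
    """Data-style answer: estimate the same distribution from n samples."""
    rng = random.Random(seed)
    p_true = exact_conditional(parents)["True"]
    hits = sum(rng.random() < p_true for _ in range(n))
    return {"True": hits / n, "False": 1 - hits / n}


evidence = {"Smoker": "True", "Pollution": "High"}
exact = exact_conditional(evidence)
print(f"exact     P(Cancer=True | evidence) = {exact['True']}")
for n in (100, 10_000):
    est = empirical_conditional(evidence, n)
    print(f"n={n:<6}  P(Cancer=True | evidence) ~ {est['True']:.4f}")
```

The exact value never changes, while the sampled estimate varies with `n` and the seed; that difference is what an analytical adapter removes.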
values(nodes) -> np.ndarray¶
Not Implemented: Oracle does not store actual data values.
Raises:
TypeError: Always raised with message "Oracle.values() not implemented"
Rationale:
- Oracle provides analytical results, not sampled data
- Use concrete adapters (Pandas/NumPy) for data value access
- Consistent with Oracle's role as synthetic probability source
Specialized Oracle Features¶
True Parameter Access¶
Oracle provides direct access to the true parameters of the Bayesian Network:
Conditional Probability Tables:
# Access true CPT for a categorical node
cpt = oracle.bn.cnds["Disease"]
print("True conditional probabilities:")
for parent_config in cpt.parents_configs():
    for value in cpt.values:
        prob = cpt.get_prob(parent_config, value)
        print(f"P({value}|{parent_config}) = {prob}")
Network Structure:
# Access true DAG structure
dag = oracle.bn.dag
print(f"True edges: {dag.edges}")
print(f"True parents of X: {dag.parents('X')}")
Algorithm Validation¶
Oracle is ideal for validating causal discovery algorithms:
Score Validation:
# Compare the learned structure's score with the true structure's score
learned_score = oracle.score(learned_dag)
true_score = oracle.score(oracle.bn.dag)  # True structure score
print(f"Score difference: {abs(learned_score - true_score)}")
Conditional Independence Testing:
# Test algorithm's CI conclusions against true model
matches = []
for x, y, z in ci_tests:
    true_independent = oracle.bn.d_separated(x, y, z)
    algorithm_independent = algorithm.ci_test(oracle, x, y, z)
    matches.append(true_independent == algorithm_independent)
accuracy = sum(matches) / len(matches)  # Fraction of CI tests that agree
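For readers who want to see what a ground-truth independence check involves, here is a self-contained sketch of d-separation using the standard moralised ancestral graph construction. It is not the causaliq-core implementation; the `parents` dictionary format and both function names are assumptions, and the DAG is a small five-node graph in the shape of the familiar cancer example.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, x, y, z):
    """True if x and y are d-separated by the set z in the DAG `parents`."""
    keep = ancestors_of(parents, {x, y} | set(z))
    # Moralise the ancestral subgraph: link each node to its parents and
    # "marry" every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for p in pars:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(pars, 2):
            adj[a].add(b)
            adj[b].add(a)
    # Search for an undirected path from x to y that avoids z.
    blocked, seen, stack = set(z), {x}, [x]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt == y:
                return False
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return True


# Hypothetical DAG: Pollution -> Cancer <- Smoker, Cancer -> Xray, Dyspnoea
parents = {
    "Cancer": ["Pollution", "Smoker"],
    "Xray": ["Cancer"],
    "Dyspnoea": ["Cancer"],
}
ci_queries = [
    ("Pollution", "Smoker", []),
    ("Pollution", "Smoker", ["Cancer"]),
    ("Xray", "Dyspnoea", ["Cancer"]),
]
for x, y, z in ci_queries:
    print(f"{x} _||_ {y} | {z}: {d_separated(parents, x, y, z)}")
```

The function assumes `x`, `y` and `z` are disjoint, which is the usual precondition for a d-separation query.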
Limitations and Constraints¶
Unsupported Operations¶
Data Value Access:
- values() method raises TypeError
- No actual data samples available
- Use for probability queries only
Randomization Restrictions:
- randomise_names() raises NotImplementedError
- Name randomization not meaningful for Oracle
- Node names tied to BN structure
Sampling Limitations:
- No row-level sampling or shuffling
- set_N() only affects effective sample size for algorithms
- No actual data generation performed
Data Type Constraints¶
Variable Types:
- Categorical variables: Must have finite discrete values
- Continuous variables: Limited to supported distribution types
- Mixed networks: Handled according to individual node types
Integration with CausalIQ Ecosystem¶
Algorithm Testing Framework¶
def test_algorithm_accuracy(algorithm, test_bns):
    results = []
    for bn_file in test_bns:
        # Load true BN
        bn = read_bn(bn_file)
        oracle = Oracle(bn)
        # Run algorithm
        oracle.set_N(10000)  # Large effective sample size
        learned_structure = algorithm.run(oracle)
        # Compare with true structure
        accuracy = compare_structures(bn.dag, learned_structure)
        results.append(accuracy)
    return results
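The `compare_structures` helper is not defined in this page. One possible shape for it is sketched below, under the assumption that both graphs can be expressed as sets of directed `(parent, child)` edges; the function body, the returned metric names, and the example edges are all hypothetical.

```python
def compare_structures(true_edges, learned_edges):
    """Compare two directed edge sets: precision, recall, F1 and SHD."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    correct = true_edges & learned_edges
    # Edges present in both graphs but with opposite orientation
    reversed_edges = {(a, b) for a, b in learned_edges - true_edges
                      if (b, a) in true_edges}
    extra = learned_edges - true_edges - reversed_edges
    missing = {(a, b) for a, b in true_edges - learned_edges
               if (b, a) not in learned_edges}
    precision = len(correct) / len(learned_edges) if learned_edges else 0.0
    recall = len(correct) / len(true_edges) if true_edges else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Structural Hamming distance, counting a reversal as one error
    shd = len(extra) + len(missing) + len(reversed_edges)
    return {"precision": precision, "recall": recall, "f1": f1, "shd": shd}


true_edges = {("Pollution", "Cancer"), ("Smoker", "Cancer"),
              ("Cancer", "Xray"), ("Cancer", "Dyspnoea")}
learned_edges = {("Pollution", "Cancer"), ("Cancer", "Smoker"),
                 ("Cancer", "Xray"), ("Xray", "Dyspnoea")}
metrics = compare_structures(true_edges, learned_edges)
print(metrics)
```

Counting a reversed edge as a single error is one common convention for structural Hamming distance; some evaluations count it as two (one deletion plus one addition).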
Benchmark Experiments¶
def benchmark_scoring_functions(oracle, scoring_functions):
    true_score = {}
    for score_fn in scoring_functions:
        # Get score for true structure
        true_score[score_fn.name] = score_fn.calculate(oracle, oracle.bn.dag)
        # Test alternative structures
        for alt_structure in generate_alternatives(oracle.bn.dag):
            alt_score = score_fn.calculate(oracle, alt_structure)
            print(f"{score_fn.name}: True={true_score[score_fn.name]:.3f}, "
                  f"Alt={alt_score:.3f}")
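`generate_alternatives` is likewise left undefined here. A plausible sketch, assuming it should enumerate every DAG that is one edge deletion, reversal, or addition away from the true structure, is shown below. It works on a plain node list and edge set rather than a causaliq-core DAG object, and both function names and signatures are assumptions.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for a, b in edges:
        indegree[b] += 1
        children[a].append(b)
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


def generate_alternatives(nodes, edges):
    """Yield every DAG one edge deletion, reversal or addition away."""
    edges = set(edges)
    for a, b in sorted(edges):
        yield edges - {(a, b)}  # deleting an edge cannot create a cycle
        flipped = (edges - {(a, b)}) | {(b, a)}
        if is_acyclic(nodes, flipped):
            yield flipped  # reversal that keeps the graph acyclic
    for a in nodes:
        for b in nodes:
            if a != b and (a, b) not in edges and (b, a) not in edges:
                added = edges | {(a, b)}
                if is_acyclic(nodes, added):
                    yield added  # addition that keeps the graph acyclic


nodes = ["A", "B", "C"]
true_edges = {("A", "B"), ("B", "C")}
alternatives = list(generate_alternatives(nodes, true_edges))
for alt in alternatives:
    print(sorted(alt))
```

Single-edge neighbours are the same moves a hill-climbing search would consider, so they are a natural set of alternatives to score against the true structure.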
Stability Analysis¶
def analyze_algorithm_stability(algorithm, oracle, trials=100):
    # Oracle provides consistent "data" across trials
    results = []
    for trial in range(trials):
        oracle.randomise_order(trial)  # Change processing order
        result = algorithm.run(oracle)
        results.append(result)
    # Analyze consistency of results
    return assess_stability(results)
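`assess_stability` is not defined in this page either. The sketch below shows one way such a summary could be computed, assuming each trial result is a collection of directed `(parent, child)` edges; the function body and the returned keys are hypothetical.

```python
from collections import Counter


def assess_stability(results):
    """Summarise how consistent a list of learned edge sets is."""
    if not results:
        raise ValueError("at least one trial result is required")
    frozen = [frozenset(r) for r in results]
    # Share of trials that returned the single most common structure
    _, top_count = Counter(frozen).most_common(1)[0]
    # How often each individual edge appears across trials
    edge_counts = Counter(edge for r in frozen for edge in r)
    edge_frequency = {e: c / len(frozen) for e, c in edge_counts.items()}
    return {
        "distinct_structures": len(set(frozen)),
        "modal_share": top_count / len(frozen),
        "edge_frequency": edge_frequency,
    }


trial_results = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("B", "C")},
]
summary = assess_stability(trial_results)
print(summary)
```

Because the underlying probabilities are identical in every trial, any spread in these numbers reflects sensitivity to processing order rather than sampling noise.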
Performance Characteristics¶
Computational Efficiency¶
- Analytical Operations: No sampling or counting required
- Exact Computations: Probability queries return exact values
- Memory Efficient: No large data arrays stored
- Fast Initialization: Only stores BN structure and parameters
Scalability¶
- Network Size: Performance depends on BN complexity, not sample size
- Query Complexity: Marginal queries scale with network connectivity
- Memory Usage: Minimal, proportional to BN size only
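One common way to quantify "BN size" for a purely categorical network is the number of free parameters, which grows with the product of parent cardinalities rather than with any sample size. The sketch below computes it for a hypothetical all-binary, five-node network; the input dictionaries and the function name are assumptions and do not reflect how causaliq-core stores its CPTs.

```python
from math import prod


def free_parameters(cardinality, parents):
    """Count free CPT parameters: (r - 1) per parent configuration."""
    return sum(
        (cardinality[node] - 1)
        * prod(cardinality[p] for p in parents.get(node, ()))
        for node in cardinality
    )


# Hypothetical network: Pollution -> Cancer <- Smoker, Cancer -> Xray, Dyspnoea
cardinality = {"Pollution": 2, "Smoker": 2, "Cancer": 2,
               "Xray": 2, "Dyspnoea": 2}
parents = {
    "Cancer": ["Pollution", "Smoker"],
    "Xray": ["Cancer"],
    "Dyspnoea": ["Cancer"],
}
print(f"Free parameters: {free_parameters(cardinality, parents)}")
```

Adding a parent to a node multiplies that node's contribution by the new parent's cardinality, which is why densely connected networks are the ones that become expensive.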
Best Practices¶
When to Use Oracle¶
- Algorithm Validation: Testing against known ground truth
- Benchmarking: Comparing algorithm performance across known structures
- Method Development: Developing new algorithms with reliable test cases
- Educational Use: Demonstrating causal discovery concepts
Usage Patterns¶
# Validation workflow
oracle = Oracle(known_bn)
oracle.set_N(sample_size)
# Test your algorithm
learned_result = your_algorithm.discover(oracle)
# Compare with truth
accuracy_metrics = evaluate_against_truth(learned_result, oracle.bn)
Integration Tips¶
- Use Oracle early in algorithm development for debugging
- Combine with Pandas/NumPy adapters for comprehensive testing
- Leverage exact probabilities for theoretical analysis
- Document true structure properties for result interpretation