Independence Testing - Probabilistic Independence Tests¶

The indep module provides statistical independence tests for causal discovery workflows. It implements two test statistics, chi-squared and mutual information, and supports both conditional and unconditional tests on data or on Bayesian Network parameters.

Overview¶

Independence tests are fundamental to constraint-based causal discovery algorithms (PC, FCI, etc.) and structure learning validation. The module supports:

Multiple Test Statistics: Chi-squared (X²) and Mutual Information (MI) tests
Conditional Independence: Testing X ⊥ Y | Z with arbitrary conditioning sets
Flexible Data Sources: Works with pandas DataFrames, data files, or BN parameters
Comprehensive Validation: Robust argument checking and error handling

Constants¶

TESTS = ["mi", "x2"]  # Supported test types
TEST_PARAMS = {"alpha": 0.05}  # Default test parameters
MIN_P_VALUE = 1e-30  # P-values below this threshold are reported as 0.0

Core Functions¶

`indep(x, y, z, data, bn=None, N=1000000000, types="mi") -> DataFrame`¶

The main entry point for independence testing. It tests whether variables x and y are independent, optionally conditional on the variables in z.

Arguments:

x: Name of the first variable (string)
y: Name of the second variable (string)
z: Name(s) of conditioning variables (string, list of strings, or None)
data: Data source (pandas DataFrame, file path string, or None when bn is supplied)
bn: Bayesian Network for synthetic testing (BN object or None)
N: Sample size when using BN parameters (int, default: 1,000,000,000)
types: Test statistic type(s) to compute (string or list, default: "mi")

Returns:

DataFrame with independence test results where:

Columns: Test types (e.g., "mi", "x2")
Rows: Test statistics ("statistic", "df", "p_value")

Test Statistics:

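Chi-squared ("x2"): Classical Pearson chi-squared test for independence
  Formula: X² = Σ((observed - expected)² / expected), summed over all contingency table cells
  Asymptotically χ² distributed under the null hypothesis of independence
Mutual Information ("mi"): Information-theoretic measure of dependence
  Formula: G = 2 × Σ(observed × log(observed / expected)), which equals 2N times the empirical mutual information when the natural logarithm is used
  Asymptotically χ² distributed under the null hypothesis of independence (G-test statistic)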
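The two formulas above can be sketched in plain Python for a single contingency table. This is an illustration only, not the module's own code: the function names `expected_counts`, `pearson_x2` and `g_statistic` are hypothetical, and the sketch assumes a non-empty table with every row and column total greater than zero.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson chi-squared: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

def g_statistic(table):
    """G statistic: 2 * sum of observed * ln(observed / expected).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    expected = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )

# Hypothetical 2 x 2 table of counts with a clear association
table = [[30, 10], [10, 30]]
x2 = pearson_x2(table)
g = g_statistic(table)
```

For this table every expected count is 20, so the Pearson statistic is exactly 20 and the G statistic is slightly larger; the two agree more closely as the observed counts approach the expected ones.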

Usage Examples:

from causaliq_data.indep import indep
import pandas as pd

# Load data
data = pd.read_csv("dataset.csv")

# Unconditional independence test
result = indep("X", "Y", None, data, types="x2")
print(f"p-value: {result.loc['p_value', 'x2']}")

# Conditional independence test  
result = indep("X", "Y", ["Z1", "Z2"], data, types=["mi", "x2"])

# Test using BN parameters
from causaliq_core.bn.io import read_bn
bn = read_bn("network.dsc")
result = indep("A", "B", "C", None, bn=bn, N=10000)

Data Source Options:

DataFrame: Direct pandas DataFrame input
File Path: Path to CSV file (loaded automatically)
BN Parameters: Uses the conditional probability tables of a Bayesian Network, with N as the sample size

`check_test_params(params) -> Dict[str, Any]`¶

Validates and standardizes independence test parameters.

Arguments:

params: Dictionary of test parameters to validate

Supported Parameters:

alpha: Significance level for tests (float, 0 < alpha < 1, default: 0.05)

Returns:

Dictionary of validated parameters with defaults applied.

Raises:

TypeError: If parameters have incorrect types
ValueError: If parameter values are invalid

Example:

from causaliq_data.indep import check_test_params

# Validate custom parameters
params = check_test_params({"alpha": 0.01})
print(params)  # {'alpha': 0.01}

# Apply defaults
params = check_test_params({})
print(params)  # {'alpha': 0.05}
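The validation rules described above could be implemented roughly as follows. This is a sketch under stated assumptions, not the library's code: `validate_test_params` and `DEFAULTS` are hypothetical names, and rejecting unknown keys with a ValueError is an assumption the text does not confirm.

```python
DEFAULTS = {"alpha": 0.05}  # mirrors TEST_PARAMS as documented above

def validate_test_params(params):
    """Return params merged with defaults, raising on bad types or values."""
    if not isinstance(params, dict):
        raise TypeError("params must be a dict")

    # Assumption: unknown parameter names are treated as invalid values
    unknown = set(params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    merged = {**DEFAULTS, **params}  # caller's values override the defaults

    alpha = merged["alpha"]
    if not isinstance(alpha, float):
        raise TypeError("alpha must be a float")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return merged
```

Checking the type before the range keeps the two error classes distinct: a string such as "0.05" raises TypeError, while an out-of-range float such as 1.5 raises ValueError.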

Internal Functions¶

`check_indep_args(...) -> Tuple[...]`¶

Internal function that validates and standardizes all arguments for independence tests.

Key Validations:

Type checking for all arguments
Variable name uniqueness
Data/BN consistency checks
Column/node existence validation
Sample size validation

`_statistic(actuals, type) -> Tuple[int, float]`¶

Internal function that computes test statistics from contingency tables.

Arguments:

actuals: 2D list representing contingency table counts
type: Test statistic type ("x2" or "mi")

Returns:

Tuple of (degrees of freedom, test statistic value)

Features:

Handles zero-count tables (returns 0.0 statistic)
Robust error handling for malformed inputs
Optimized computation for both test types

Statistical Details¶

Test Assumptions¶

Both chi-squared and mutual information tests assume:

Categorical Variables: All variables must be discrete/categorical
Sufficient Sample Size: Large enough samples for asymptotic properties
Independent Observations: Rows represent independent samples
No Missing Data: Complete case analysis only

Degrees of Freedom¶

For contingency tables with dimensions r × c:

Degrees of Freedom: (r-1) × (c-1)
Conditional Tests: Degrees of freedom are summed across all combinations of conditioning-variable values
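A small sketch of the degrees-of-freedom rule for a conditional test follows. It assumes, hypothetically, that r and c are counted as the numbers of distinct x and y values observed within each conditioning combination; the module may count levels differently. The function name `conditional_df` and the row format are illustrative only.

```python
from collections import defaultdict

def conditional_df(rows, x, y, z):
    """Sum (r - 1) * (c - 1) over every observed combination of z values."""
    levels = defaultdict(lambda: (set(), set()))
    for row in rows:
        key = tuple(row[v] for v in z)  # one stratum per z-value combination
        x_levels, y_levels = levels[key]
        x_levels.add(row[x])
        y_levels.add(row[y])
    return sum((len(xs) - 1) * (len(ys) - 1) for xs, ys in levels.values())

# Hypothetical data: Y takes two values when Z = 0 and three when Z = 1
rows = [
    {"X": "a", "Y": "u", "Z": 0},
    {"X": "b", "Y": "v", "Z": 0},
    {"X": "a", "Y": "u", "Z": 1},
    {"X": "b", "Y": "v", "Z": 1},
    {"X": "a", "Y": "w", "Z": 1},
]
df_conditional = conditional_df(rows, "X", "Y", ["Z"])  # 1 + 2
df_unconditional = conditional_df(rows, "X", "Y", [])   # (2 - 1) * (3 - 1)
```

With an empty conditioning set there is a single stratum, so the function reduces to the unconditional (r-1) × (c-1) rule.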

P-value Computation¶

P-values are computed using the chi-squared distribution:

p_value = 1.0 - stats.chi2.cdf(statistic, degrees_of_freedom)

Values below MIN_P_VALUE (1e-30) are set to 0.0 for numerical stability.
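The p-value rule can be tried without SciPy using the sketch below. It is an illustration rather than the module's implementation: `chi2_sf` and `p_value` are hypothetical names, and the sketch evaluates the upper-tail probability directly with the standard finite series for integer degrees of freedom, which is mathematically equal to 1 - CDF.

```python
import math

MIN_P_VALUE = 1e-30  # threshold quoted in the text

def chi2_sf(statistic, df):
    """Upper-tail chi-squared probability for integer df >= 1."""
    if statistic <= 0:
        return 1.0
    half = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus (df - 1) / 2 correction terms
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * statistic / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= statistic / (2 * r + 1)
    return tail

def p_value(statistic, df):
    """P-value with the MIN_P_VALUE floor: tiny values are reported as 0.0."""
    p = chi2_sf(statistic, df)
    return 0.0 if p < MIN_P_VALUE else p

p_at_critical = p_value(3.841458820694124, 1)  # 5% critical value for df = 1
p_extreme = p_value(200.0, 1)                  # far below the floor
```

A statistic of 200 with one degree of freedom has a tail probability many orders of magnitude below 1e-30, so it is reported as exactly 0.0.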

Error Handling¶

The module provides comprehensive error checking:

Type Errors:

Non-string variable names
Invalid data types for DataFrame/BN arguments
Malformed conditioning sets or test type specifications

Value Errors:

Duplicate variable names
Variables not present in data/BN
Negative sample sizes
Unsupported or duplicate test types
Empty test specifications

File Errors:

Missing data files
Malformed CSV data

Integration with Causal Discovery¶

Independence tests are essential for:

Constraint-Based Algorithms¶

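from causaliq_data.indep import indep

ALPHA = 0.05  # significance level

# PC algorithm skeleton discovery: remove an edge when independence is not rejected
if indep("X", "Y", None, data)["mi"]["p_value"] > ALPHA:
    # Remove edge X-Y
    pass

# Conditional test used during orientation: X and Y remain dependent given Z
if indep("X", "Y", ["Z"], data)["mi"]["p_value"] <= ALPHA:
    # Orient edge based on dependence
    pass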
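To show where such tests sit inside a constraint-based algorithm, here is a minimal PC-style skeleton search. It is a sketch, not part of the module: `skeleton` and `oracle` are hypothetical names, and the conditional independence test is passed in as a callable `p_value_of(x, y, z)` so that the example runs without data.

```python
from itertools import combinations

def skeleton(variables, p_value_of, alpha=0.05, max_cond=1):
    """Start fully connected; drop an edge when some test fails to reject."""
    adjacent = {v: set(variables) - {v} for v in variables}
    for size in range(max_cond + 1):  # grow the conditioning set size
        for x in variables:
            for y in sorted(adjacent[x]):
                # Condition on subsets of x's other current neighbours
                others = sorted(adjacent[x] - {y})
                for z in combinations(others, size):
                    if p_value_of(x, y, list(z)) > alpha:
                        adjacent[x].discard(y)
                        adjacent[y].discard(x)
                        break
    return {frozenset((x, y)) for x in variables for y in adjacent[x]}

def oracle(x, y, z):
    """Stand-in test for the chain A -> B -> C: only A and C given B are independent."""
    return 1.0 if {x, y} == {"A", "C"} and "B" in z else 0.0

edges = skeleton(["A", "B", "C"], oracle)
```

With real data, `p_value_of` could wrap a call such as `indep(x, y, z or None, data)["mi"]["p_value"]`, assuming the result layout described earlier in this document.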

Structure Learning Validation¶

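# Validate learned structure
# learn_structure is a placeholder for any structure learning routine
learned_bn = learn_structure(data)
for x, y in learned_bn.edges():
    # Condition on the other parents of x and y. x and y themselves are
    # excluded because variable names within a test must be unique.
    parents_xy = set(learned_bn.parents(x)) | set(learned_bn.parents(y))
    z = sorted(parents_xy - {x, y})
    test_result = indep(x, y, z or None, data)
    if test_result["mi"]["p_value"] > 0.05:
        print(f"Warning: {x}-{y} may be spurious")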

Synthetic Data Validation¶

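# Test independence properties in generated data
# generate_data is a placeholder for any sampling routine
true_bn = read_bn("true_network.dsc")
synthetic_data = generate_data(true_bn, N=5000)

# Check the local Markov property: each node is independent of its
# non-descendants given its parents
for node in true_bn.nodes:
    parents = list(true_bn.parents(node))
    for nd in true_bn.non_descendants(node):
        if nd in parents:
            continue  # a conditioning variable cannot also be a tested variable
        result = indep(node, nd, parents or None, synthetic_data)
        # At alpha = 0.05 about 5% of true independences are rejected by chance,
        # so report rejections instead of asserting on every test
        if result["mi"]["p_value"] <= 0.05:
            print(f"Check: {node} and {nd} tested as dependent")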

Performance Considerations¶

Computational Complexity¶

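Contingency Table Construction: O(n × k) where n = sample size and k = number of variables in the test
Statistic Computation: O(r × c) where r and c are the table dimensions
Conditional Tests: Cost grows with the number of conditioning-value combinations, which is the product of the conditioning variables' cardinalities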
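The first two costs can be seen in a plain-Python sketch of contingency table construction. This is illustrative only: `contingency_table` is a hypothetical name and the list-of-dicts row format is an assumption, since the module itself works with pandas DataFrames.

```python
from collections import Counter

def contingency_table(rows, x, y):
    """Build an r x c table of counts for variables x and y."""
    # Single pass over the data: O(n)
    counts = Counter((row[x], row[y]) for row in rows)
    x_levels = sorted({xv for xv, _ in counts})
    y_levels = sorted({yv for _, yv in counts})
    # Filling the table touches each cell once: O(r * c).
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(xv, yv)] for yv in y_levels] for xv in x_levels]
    return x_levels, y_levels, table

# Hypothetical data: six observations of two binary variables
rows = (
    [{"X": "a", "Y": "u"}] * 2
    + [{"X": "a", "Y": "v"}]
    + [{"X": "b", "Y": "v"}] * 3
)
x_levels, y_levels, table = contingency_table(rows, "X", "Y")
```

For a conditional test the same construction would be repeated once per combination of conditioning-variable values, which is why large conditioning sets are expensive.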

Memory Usage¶

Sparse Tables: Efficient handling of sparse contingency tables
Batch Processing: Processes all conditioning combinations efficiently
Memory Reuse: Minimal copying in DataFrame operations

Optimization Tips¶

Batch Multiple Tests: Use types=["mi", "x2"] for multiple statistics
Limit Conditioning Sets: Large conditioning sets increase computational cost
Sample Size Management: Use appropriate N values with BN parameters
Data Preprocessing: Pre-filter and clean data before testing
