
CausalIQ Data API Reference

The CausalIQ Data API provides a unified interface for data handling in causal discovery workflows. The API is built around a plug-in architecture with concrete implementations for different data backends.

Core Design

All data adapters implement the Data abstract base class, which extends the BNFit interface from causaliq-core. This ensures consistent behavior across different data sources while allowing backend-specific optimizations.

Module Structure

Data - Abstract Base Class

The foundational abstract class defining the core interface for all data adapters (a usage sketch follows the list). It provides:

  • Node Management: Consistent handling of variable names and ordering
  • Randomisation Framework: Built-in support for data and name randomisation
  • BNFit Interface: Full compatibility with Bayesian Network fitting operations
  • Type System: Unified variable type handling across data sources
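Because Data is abstract, the sketch below exercises the shared interface through the concrete Pandas adapter; every call shown is one documented under Common Patterns later on this page.

from causaliq_data import Pandas

# Data itself is abstract, so a concrete adapter is used here;
# all calls below are part of the shared interface
data = Pandas.read("dataset.csv", dstype="categorical")

# Randomisation framework common to every adapter
data.randomise_names(seed=42)
data.randomise_order(seed=123)

# Working sample size management
data.set_N(1000, seed=456, random_selection=True)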

Pandas - DataFrame-Based Adapter

A concrete implementation that wraps pandas DataFrames for flexible data handling:

  • Rich Type Support: Native pandas categorical and numeric types
  • File I/O: Direct CSV reading with compression support
  • Data Validation: Comprehensive missing data and type checking
  • Memory Efficiency: Smart sampling and subsetting without data duplication
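A brief sketch of the adapter in isolation; the .gz filename only illustrates the compression support mentioned above (supported formats are not listed on this page), and both calls appear under Common Patterns.

from causaliq_data import Pandas

# Read a (possibly compressed) CSV with categorical variables
data = Pandas.read("dataset.csv.gz", dstype="categorical")

# Recover the underlying DataFrame for inspection or conversion
df = data.as_df()
print(df.dtypes)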

NumPy - High-Performance Array Adapter

A concrete implementation built on NumPy arrays for computational efficiency:

  • Optimized Counting: Fast categorical data counting using bincount
  • Memory Management: Efficient handling of large datasets with minimal copying
  • Type Optimization: Automatic selection of appropriate numeric types
  • Advanced Sampling: Multiple strategies for data subset selection and randomisation
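The usual route into this adapter is conversion from a DataFrame, as in the Common Patterns example later on this page; a minimal sketch:

from causaliq_data import Pandas, NumPy

# Load via pandas, then convert to the NumPy adapter for faster
# counting and sampling on larger datasets
data = Pandas.read("dataset.csv", dstype="categorical")
numpy_data = NumPy.from_df(data.as_df(), dstype="categorical")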

Oracle - Synthetic Data Generator

A specialized adapter for generating synthetic data from known Bayesian Networks:

  • BN Integration: Direct integration with causaliq-core BN objects
  • Parameter Access: Direct access to true conditional probability tables
  • Testing Support: Ideal for algorithm validation and benchmarking
  • Simulation Control: Flexible sample size management for experiments
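The Oracle constructor is not documented on this page, so the fragment below is only a guess at its shape: the import of BN from causaliq-core, the BN.read call, and the Oracle(bn) signature are all assumptions; only set_N comes from the shared Data interface described above.

from causaliq_core import BN      # import path assumed
from causaliq_data import Oracle

bn = BN.read("asia.dsc")          # hypothetical: load a known network
data = Oracle(bn)                 # constructor signature assumed
data.set_N(10000)                 # sample size via the shared Data interface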

Score - Scoring Functions for Causal Structure Learning

A comprehensive module providing scoring functions for evaluating Bayesian networks and DAGs:

  • Multiple Score Types: Support for entropy-based, Bayesian, and Gaussian scoring methods
  • Categorical Scoring: BIC, AIC, log-likelihood, BDE, K2, and other Bayesian scores
  • Gaussian Scoring: BGE, Gaussian BIC, and Gaussian log-likelihood for continuous data
  • Network Evaluation: Complete DAG and Bayesian Network scoring with per-node breakdowns
  • Parameter Validation: Automatic parameter checking and default value assignment
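The module's call signatures are not shown on this page, so rather than guess at them, the sketch below computes a categorical BIC score for a single node by hand with plain pandas and NumPy. It only illustrates what a per-node score of this kind represents (fitted log-likelihood minus a complexity penalty); it is not the causaliq-data scoring API, and conventions such as the log base may differ.

import numpy as np
import pandas as pd

def node_bic(df, node, parents):
    """Illustrative categorical BIC for one node given its parents."""
    N = len(df)
    loglik = 0.0
    groups = df.groupby(parents, observed=True) if parents else [(None, df)]
    for _, sub in groups:
        counts = sub[node].value_counts()
        probs = counts / len(sub)            # maximum-likelihood CPT row
        loglik += float((counts * np.log(probs)).sum())
    # free parameters: (cardinality - 1) x number of parent configurations
    q = int(np.prod([df[p].nunique() for p in parents])) if parents else 1
    k = (df[node].nunique() - 1) * q
    return loglik - 0.5 * k * np.log(N)

df = pd.DataFrame({"A": ["y", "n", "y", "n"], "B": ["y", "y", "n", "n"]})
print(node_bic(df, "B", ["A"]))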

Independence Testing - Probabilistic Independence Tests

Statistical independence testing functionality for constraint-based causal discovery:

  • Multiple Test Statistics: Chi-squared (X²) and Mutual Information (MI) tests
  • Conditional Independence: Support for testing X ⊥ Y | Z with arbitrary conditioning sets
  • Flexible Data Sources: Works with pandas DataFrames, data files, or Bayesian Network parameters
  • Robust Validation: Comprehensive argument checking and error handling
  • Integration Ready: Designed for use in PC, FCI, and other constraint-based algorithms
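Test-function signatures are likewise not listed here, so the sketch below shows the underlying computation instead: a chi-squared test of X ⊥ Y | Z formed by summing per-stratum statistics with scipy. It is background for what the module's X² test computes, not the causaliq-data API.

import pandas as pd
from scipy.stats import chi2, chi2_contingency

def chi2_ci_test(df, x, y, z=()):
    """Illustrative X² test of x independent of y given conditioning set z."""
    stat, dof = 0.0, 0
    groups = df.groupby(list(z), observed=True) if z else [(None, df)]
    for _, sub in groups:
        table = pd.crosstab(sub[x], sub[y])
        if table.shape[0] > 1 and table.shape[1] > 1:
            # no Yates correction, to match the plain X² statistic
            s, _, d, _ = chi2_contingency(table, correction=False)
            stat, dof = stat + s, dof + d
    p_value = chi2.sf(stat, dof) if dof else 1.0
    return stat, dof, p_value

A small p-value is evidence against the conditional independence X ⊥ Y | Z at the chosen significance level.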

Preprocess - Data Preprocessing Utilities

Data cleaning and preparation utilities for Bayesian Network workflows:

  • Single-Valued Variable Removal: Automatic detection and removal of constant variables
  • Network Restructuring: Intelligent BN reconstruction after variable removal
  • Data Validation: Ensures minimum variable requirements for meaningful analysis
  • Categorical Optimization: Proper type handling for downstream operations
  • Integration Support: Seamless workflow integration with data adapters and structure learning
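No preprocessing function names are documented on this page either; the fragment below simply shows, in plain pandas, what single-valued variable removal means for a DataFrame before it is handed to a data adapter (it is not the causaliq-data preprocessing API).

import pandas as pd

df = pd.DataFrame({
    "A": ["y", "n", "y", "n"],
    "B": ["y", "y", "y", "y"],   # constant: carries no information
    "C": ["n", "y", "y", "n"],
})

# Detect and drop variables that take a single value in every row,
# then keep the remainder as categoricals for downstream use
constant = [c for c in df.columns if df[c].nunique() == 1]
cleaned = df.drop(columns=constant).astype("category")
print(constant)   # ['B']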

Common Patterns

Data Loading

from causaliq_data import Pandas, NumPy

# Load from CSV file
data = Pandas.read("dataset.csv", dstype="categorical")

# Convert to NumPy for performance
numpy_data = NumPy.from_df(data.as_df(), dstype="categorical")

Randomisation Workflows

# Randomise node names for sensitivity testing
data.randomise_names(seed=42)

# Randomise processing order
data.randomise_order(seed=123)

# Set working sample size with random selection
data.set_N(1000, seed=456, random_selection=True)

Statistical Operations

# Get marginal distributions
marginals = data.marginals("target_node", {"parent1": 0, "parent2": 1})

# Access value counts for categorical variables
counts = data.node_values["categorical_var"]

# Get unique value combinations (num_vals is supplied by the caller)
unique_vals, counts = data.unique(("var1", "var2"), num_vals)

Type System

The API supports a comprehensive type system for different variable types:

  • Categorical: VariableType.CATEGORY for discrete variables
  • Integers: INT16, INT32, INT64 for integer data
  • Floats: FLOAT32, FLOAT64 for continuous variables

Dataset types are automatically inferred:

  • Categorical: All variables are categorical
  • Continuous: All variables are numeric
  • Mixed: Combination of categorical and numeric variables
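A small illustration of these members; the VariableType import location is an assumption, as this page does not say which module exposes the enum.

from causaliq_data import VariableType   # import location assumed

category = VariableType.CATEGORY                             # discrete variables
int_types = (VariableType.INT16, VariableType.INT32,
             VariableType.INT64)                             # integer data
float_types = (VariableType.FLOAT32, VariableType.FLOAT64)   # continuous data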

Performance Considerations

Memory Efficiency

  • Original data is preserved separately from working samples
  • Lazy evaluation of expensive operations
  • Strategic use of data views vs copies

Computational Optimization

  • NumPy adapter provides the best performance for large datasets
  • Optimized algorithms for unique value detection and counting
  • Efficient handling of categorical data through integer encoding

Scalability

  • Support for working with data subsets without loading entire datasets
  • Memory-conscious type selection based on data characteristics
  • Configurable thresholds for algorithm selection
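One concrete route to working on a subset uses the documented set_N call from Randomisation Workflows: load once, then restrict the working sample (whether rows can also be skipped at read time is not stated on this page).

from causaliq_data import Pandas

# Load the full file, then operate on a 10,000-row random subset
data = Pandas.read("large_dataset.csv.gz", dstype="categorical")
data.set_N(10000, seed=1, random_selection=True)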

Error Handling

All adapters provide comprehensive error handling with descriptive messages:

  • Type Validation: Strict checking of argument types and values
  • Data Validation: Detection of missing data, invalid formats, and size constraints
  • State Consistency: Validation of internal state consistency across operations
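A defensive pattern along these lines; the exact exception classes raised by the adapters are not documented on this page, so the except clause below is a guess at standard Python types.

from causaliq_data import Pandas

try:
    data = Pandas.read("dataset.csv", dstype=123)   # deliberately invalid dstype
except (TypeError, ValueError) as err:              # exception classes assumed
    print(f"Rejected with a descriptive message: {err}")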