# Architecture Overview

## CausalIQ Ecosystem

causaliq-data is a component of the overall CausalIQ ecosystem, providing the data layer foundation for causal discovery algorithms.
## Core Architecture: Plug-in Data Adapters
CausalIQ Data is built around a plug-in data adapter architecture that enables seamless integration of different data sources and formats through a unified interface. This design provides flexibility while maintaining consistent performance characteristics across different data backends.
### Abstract Base Class (Data)

The Data class defines the core interface that all data adapters must implement. This abstract base class:
- Extends the BNFit interface from causaliq-core for Bayesian Network fitting
- Defines standard methods for data access, manipulation, and randomisation
- Ensures consistent behavior across all concrete implementations
- Provides common functionality like node ordering and name randomisation
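The snippet below is a minimal sketch of what such an adapter contract can look like; the method names are illustrative placeholders, not the actual causaliq-data API:

```python
from abc import ABC, abstractmethod

import numpy as np


class Data(ABC):
    """Illustrative adapter contract; real method names may differ."""

    @abstractmethod
    def values(self, nodes: tuple[str, ...]) -> np.ndarray:
        """Return the sample columns for the requested nodes."""

    @abstractmethod
    def set_order(self, order: tuple[str, ...]) -> None:
        """Set the node processing order used by algorithms."""

    @abstractmethod
    def randomise_names(self, seed: int | None = None) -> None:
        """Reversibly replace node names with randomised ones."""
```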
### Concrete Data Adapters
The architecture supports multiple data adapters, each optimized for different use cases:
- Pandas Adapter - For standard tabular data with rich type support
- NumPy Adapter - For high-performance numerical operations on large datasets
- Oracle Adapter - For synthetic data generation from known Bayesian Networks
## Key Architectural Features

### In-Memory Counting and Optimization
The data adapters implement sophisticated in-memory counting mechanisms for efficient statistical operations:
- Categorical Data Counting: Optimized binning and counting for discrete variables using NumPy's `bincount` functionality (see the sketch after this list)
- Value Combination Caching: Intelligent caching of unique value combinations to avoid recomputation
- Memory-Efficient Storage: Strategic use of appropriate data types (int16, int32, float32, float64) to minimize memory footprint
- Sample Subset Management: Efficient handling of data subsets without copying underlying arrays
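As an illustration of the `bincount` approach, the sketch below counts joint value combinations by flattening each row's category codes into a single mixed-radix index; this shows the general technique, not the library's exact implementation:

```python
import numpy as np


def joint_counts(codes: np.ndarray, card: np.ndarray) -> np.ndarray:
    """Count joint value combinations of categorical columns.

    codes: (N, k) array of integer category codes, one column per node
    card:  (k,)   cardinality of each node
    """
    # Encode each row's combination as a single mixed-radix index,
    # then count all combinations in one O(N) bincount pass.
    flat = np.ravel_multi_index(codes.T, card)
    counts = np.bincount(flat, minlength=int(np.prod(card)))
    return counts.reshape(tuple(card))


codes = np.array([[0, 1], [1, 0], [0, 1], [1, 1]])
print(joint_counts(codes, np.array([2, 2])))
# [[0 2]
#  [1 1]]
```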
### Data Randomisation Capabilities
The architecture provides comprehensive data randomisation features essential for causal discovery validation:
#### Node Name Randomisation
- Purpose: Assess algorithm sensitivity to variable naming
- Implementation: Systematic generation of randomized node names while preserving data relationships
- Reversibility: Ability to revert to original names for result interpretation
#### Sample Order Randomisation

- Purpose: Test algorithm stability across different data presentations
- Methods: Multiple randomisation strategies (full shuffle, random selection, seeded ordering)
- Seed Management: Deterministic randomisation for reproducible experiments
#### Node Processing Order Randomisation
- Purpose: Evaluate algorithm sensitivity to variable processing order
- Flexibility: Support for custom orderings or random permutations
- Preservation: Maintains data integrity while changing algorithmic perspectives
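The three randomisation features above can be illustrated with plain NumPy; the snippet below is a conceptual sketch rather than the causaliq-data API:

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # seeded: reproducible
names = ["smoking", "cancer", "xray"]
data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]])

# Node name randomisation: a reversible mapping to neutral names
to_random = {n: f"X{i}" for i, n in enumerate(rng.permutation(names))}
to_original = {v: k for k, v in to_random.items()}   # revert for results

# Sample order randomisation: shuffle (or subsample) the rows
shuffled = data[rng.permutation(data.shape[0])]

# Node processing order randomisation: permute the column visit order
col_order = rng.permutation(data.shape[1])
```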
### Performance Optimizations

#### Lazy Evaluation
- Sample subsets are computed on-demand rather than pre-computed
- Type conversions happen only when necessary (e.g., float64 conversion for continuous data during scoring)
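For example, a working-sample wrapper might store compact float32 data and convert to float64 only on first use, as in this hypothetical sketch:

```python
from functools import cached_property

import numpy as np


class Samples:
    """Hypothetical wrapper showing on-demand type conversion."""

    def __init__(self, raw: np.ndarray):
        self._raw = raw                      # compact float32 storage

    @cached_property
    def as_float64(self) -> np.ndarray:
        # Computed once, and only if a scoring routine needs it.
        return self._raw.astype(np.float64)
```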
#### Memory Management
- Original data is preserved separately from working samples
- Efficient copy-on-write semantics where possible
- Strategic use of views vs copies to minimize memory overhead
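The view-versus-copy distinction follows standard NumPy semantics, as this short example shows:

```python
import numpy as np

data = np.zeros((4, 3), dtype=np.int16)   # compact dtype for storage

view = data[:2]           # basic slicing returns a view: no data copied
assert view.base is data

copy = data[[0, 2]]       # fancy indexing always allocates a copy
assert copy.base is None
```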
#### Algorithmic Efficiency

- Optimized unique value detection using both `numpy.unique` and custom counting approaches
- Threshold-based algorithm selection for optimal performance across different data sizes
- In-place operations where safe and beneficial
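A hypothetical sketch of threshold-based algorithm selection follows; the cut-off values here are invented for illustration and are not the library's actual thresholds:

```python
import numpy as np


def distinct_values(col: np.ndarray, card: int) -> np.ndarray:
    """Choose a counting strategy based on data size (illustrative).

    col holds non-negative integer category codes; card is the
    number of possible categories.
    """
    if card <= 64 and col.size > 10_000:
        # Small alphabet, many rows: bincount is linear with a tiny
        # constant, beating the sort inside np.unique.
        return np.flatnonzero(np.bincount(col, minlength=card))
    return np.unique(col)   # general-purpose, sort-based fallback
```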
## Integration Points

### CausalIQ Core Integration
- Implements BNFit interface for seamless integration with Bayesian Network fitting algorithms
- Provides marginal distributions and conditional independence testing capabilities
- Supports parameter estimation workflows
### CausalIQ Discovery Integration
- Supplies objective functions for score-based structure learning
- Provides conditional independence tests for constraint-based algorithms
- Enables stability testing through randomisation features
## Independence Testing Framework

The `indep` module provides statistical independence testing capabilities essential for constraint-based causal discovery:
### Test Statistics and Methods
- Multiple Test Types: Chi-squared (X²) and Mutual Information (MI) statistical tests
- Conditional Independence: Support for testing X ⊥ Y | Z with arbitrary conditioning variable sets
- Asymptotic Theory: Both test statistics are asymptotically χ² distributed under the null hypothesis
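The statistics themselves are standard; the sketch below computes both from a two-way contingency table and obtains p-values from the χ² distribution (using the relation G² = 2N·MI, with MI in nats). This illustrates the mathematics, not the `indep` module's API:

```python
import numpy as np
from scipy.stats import chi2


def x2_and_mi(table: np.ndarray):
    """Pearson X² and MI-based G² statistics for a 2-way table.

    Both are compared to a chi-squared null with
    (rows - 1) * (cols - 1) degrees of freedom.
    """
    N = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
    x2 = ((table - expected) ** 2 / expected).sum()
    nz = table > 0                       # zero cells contribute nothing
    g2 = 2.0 * (table[nz] * np.log(table[nz] / expected[nz])).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return x2, g2, chi2.sf(x2, dof), chi2.sf(g2, dof)


table = np.array([[30.0, 10.0], [10.0, 30.0]])
print(x2_and_mi(table))   # both statistics ≈ 20, p-values << 0.05
```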
### Data Source Flexibility
- DataFrame Integration: Direct testing on pandas DataFrame data
- File System Support: Automatic data loading from CSV files
- Bayesian Network Parameters: Synthetic testing using known conditional probability tables
- Unified Interface: Consistent API regardless of data source
### Computational Efficiency
- Contingency Table Optimization: Efficient construction and manipulation of multi-dimensional contingency tables
- Batch Processing: Simultaneous computation of multiple test statistics
- Memory Management: Minimal data copying during statistical computations
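For a conditional test X ⊥ Y | Z, one standard construction builds a contingency table per stratum of the conditioning set, then sums the per-stratum statistics and degrees of freedom; a brief sketch using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["a", "a", "b", "b", "a", "b"],
    "Y": ["u", "v", "u", "v", "u", "v"],
    "Z": ["s", "s", "s", "t", "t", "t"],
})

# One table per conditioning stratum; per-stratum statistics and
# degrees of freedom are then summed to give the conditional test.
for z, stratum in df.groupby("Z"):
    print(f"Z = {z}")
    print(pd.crosstab(stratum["X"], stratum["Y"]))
```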
## Score Function Architecture

The `score` module provides a comprehensive scoring framework for evaluating Bayesian network structures:
### Multi-Type Score Support
- Categorical Scores: BIC, AIC, log-likelihood, BDE, K2, and other Bayesian methods
- Gaussian Scores: BGE (Bayesian Gaussian Equivalent), Gaussian BIC, and continuous log-likelihood
- Mixed Data: Automatic score type selection based on variable types
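As a concrete example of a categorical score, a node's BIC contribution can be computed from its parent-child count table. This is a generic sketch of the standard formula, not necessarily how the `score` module organises the computation:

```python
import numpy as np


def bic_node(counts: np.ndarray, N: int) -> float:
    """Categorical BIC for one node.

    counts: (parent combinations, node cardinality) joint counts.
    """
    row_tot = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_tot,
                      out=np.zeros(counts.shape), where=row_tot > 0)
    nz = counts > 0                          # avoid log(0) terms
    log_lik = (counts[nz] * np.log(probs[nz])).sum()
    free_params = counts.shape[0] * (counts.shape[1] - 1)
    return log_lik - 0.5 * free_params * np.log(N)


counts = np.array([[8.0, 2.0], [3.0, 7.0]])
print(bic_node(counts, N=20))   # log-likelihood minus BIC penalty
```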
### Modular Design
- Node-Level Scoring: Independent evaluation of individual nodes with their parents
- Network-Level Scoring: Complete DAG and Bayesian Network evaluation with per-node breakdowns
- Parameter Validation: Centralized parameter checking and default assignment
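Node-level and network-level scoring fit together through decomposability: a DAG's score is the sum of per-node family scores, so a search algorithm only needs to rescore nodes whose parent sets change. A minimal sketch (with a dummy family score standing in for a real one like BIC above):

```python
def family_score(node: str, parents: list[str]) -> float:
    """Stand-in for a real per-node score such as BIC above."""
    return -float(len(parents) + 1)       # dummy value for illustration


dag = {"A": [], "B": ["A"], "C": ["A", "B"]}   # node -> parent set
total = sum(family_score(n, ps) for n, ps in dag.items())
print(total)   # -6.0: per-node breakdown sums to the network score
```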
### Performance Considerations
- Efficient Counting: Leverages data adapter counting mechanisms for marginal distributions
- Vectorized Operations: Uses NumPy operations for mathematical computations
- Memory Efficiency: Minimal data copying during score computation
## CausalIQ Workflow Integration

- Supports experimental workflows requiring data randomisation
- Provides consistent interfaces for batch processing
- Enables reproducible research through seed management
## Design Principles
- Separation of Concerns: Data access, transformation, and algorithm logic are clearly separated
- Performance by Design: Architecture prioritizes computational efficiency for large-scale causal discovery
- Extensibility: New data adapters can be added without changing existing code
- Type Safety: Comprehensive type checking and validation throughout the pipeline
- Reproducibility: Built-in support for seeded randomisation and deterministic operations