CausalIQ Data Interface Specification

Overview

This document defines the interface contract between the CausalIQ Core and CausalIQ Data packages. The core package contains Bayesian network (BN) fitting algorithms (CPT.fit, LinGauss.fit) that require data access operations; the data package will provide the concrete data-source implementations that satisfy them.

Abstract Data Interface

BNFit (Abstract Base Class)

The core package requires data sources to implement this interface:

from abc import ABC, abstractmethod
from typing import Dict, Tuple
import numpy as np

class BNFit(ABC):
    """Abstract interface for data sources used in BN fitting."""

    @abstractmethod
    def marginals(self, node: str, parents: Dict, values_reqd: bool = False) -> Tuple:
        """Return marginal counts for a node and its parents.

        Args:
            node (str): Node for which marginals are required.
            parents (dict): {node: parents} mapping giving the
                parents of non-orphan nodes
            values_reqd (bool): Whether parent and child values are required

        Returns:
            tuple: Of counts and, optionally, values:
                   - ndarray counts: 2D, rows = child values,
                     cols = parent value combinations
                   - int maxcol: number of parental value combinations
                   - tuple rowval: child value for each row
                   - tuple colval: parent combination (dict) for each column

        Raises:
            TypeError: For bad argument types
        """
        pass

    @abstractmethod
    def values(self, columns: Tuple[str, ...]) -> np.ndarray:
        """Return the (float) values for the specified set of columns.

        Suitable for passing into e.g. linearRegression fitting function

        Args:
            columns (tuple): Columns for which data required

        Returns:
            ndarray: Numpy array of values, each column for a node

        Raises:
            TypeError: If bad arg type
            ValueError: If bad arg value
        """
        pass

    @property
    @abstractmethod
    def N(self) -> int:
        """Total sample size.

        Returns:
            int: Current sample size being used
        """
        pass

    @property
    @abstractmethod
    def node_values(self) -> Dict[str, Dict]:
        """Node value counts for categorical variables.

        Returns:
            dict: Values and their counts of categorical nodes
                  in sample {n1: {v1: c1, v2: ...}, n2 ...}
        """
        pass

Usage in Core Package

CPT.fit() Dependencies

The CPT.fit() method requires these data operations:

# For nodes with parents
counts, _, rowval, colval = data.marginals(node, {node: list(parents)}, True)

# For autocomplete functionality
data.N  # Total sample size
data.node_values[node]  # {value: count} for node
data.node_values[parent]  # {value: count} for each parent

# For orphan nodes
data.N
data.node_values[node]
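
For illustration only, the snippet below shows the shape of a marginals() result for a hypothetical node 'B' with two values and a single two-valued parent 'A'. The counts are invented; only the structure follows from the interface contract above.

import numpy as np

# Hypothetical result of data.marginals('B', {'B': ['A']}, True)
counts = np.array([[30, 10],   # rows: values of 'B' ('no', 'yes')
                   [20, 40]])  # cols: parent combos of 'A' ('0', '1')
maxcol = 2                     # number of parental value combinations
rowval = ('no', 'yes')         # child value for each row
colval = ({'A': '0'}, {'A': '1'})  # parent combination (dict) per column

# A maximum-likelihood CPT estimate then normalises each column:
cpt = counts / counts.sum(axis=0)  # P(B | A), column by column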

LinGauss.fit() Dependencies

The LinGauss.fit() method requires:

# Get continuous values for regression
values = data.values((node,))  # For orphan nodes
values = data.values(tuple([node] + list(parents)))  # For nodes with parents
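
As a sketch only, these values could feed a least-squares fit like the one below. Here np.linalg.lstsq stands in for whatever estimator LinGauss.fit actually uses, and the node names are hypothetical:

import numpy as np

# Sketch: fit child ~ parents by ordinary least squares, using values()
# from a BNFit implementation called `data` (node names illustrative)
node, parents = 'Y', ('X1', 'X2')
obs = data.values(tuple([node] + list(parents)))
y = obs[:, 0]                             # child column
X = np.column_stack([np.ones(len(y)),     # intercept term
                     obs[:, 1:]])         # parent columns
coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
sigma = float(np.sqrt(np.mean((y - X @ coeffs) ** 2)))  # residual std dev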

Expected Concrete Implementations

The data package should provide these concrete classes (a minimal adapter sketch follows the two lists):

1. LegacyPandasAdapter

  • Adapts the existing legacy.data.Pandas class to the BNFit interface
  • Delegates to the existing marginals(), values(), N, node_values implementations
  • Handles pandas DataFrames efficiently with crosstab-based marginals
  • Ensures backward compatibility with existing test suites

2. LegacyNumPyAdapter

  • Adapts the existing legacy.data.NumPy class to the BNFit interface
  • Delegates to the existing marginals(), values(), N, node_values implementations
  • Enables NumPy-based data sources, which do not currently work with the core algorithms
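
A minimal sketch of such an adapter is shown below. The delegated method names follow the bullets above, but everything else (constructor, attribute names) is an assumption:

from typing import Dict, Tuple
import numpy as np

class LegacyPandasAdapter(BNFit):
    """Sketch: delegate the BNFit contract to a legacy Pandas object."""

    def __init__(self, legacy_data):
        self._legacy = legacy_data  # e.g. a legacy.data.Pandas instance

    def marginals(self, node: str, parents: Dict,
                  values_reqd: bool = False) -> Tuple:
        return self._legacy.marginals(node, parents, values_reqd)

    def values(self, columns: Tuple[str, ...]) -> np.ndarray:
        return self._legacy.values(columns)

    @property
    def N(self) -> int:
        return self._legacy.N

    @property
    def node_values(self) -> Dict[str, Dict]:
        return self._legacy.node_values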

Future Extensions

The interface is designed to be extensible for:

  • GPU-accelerated data sources (e.g., CuPy, RAPIDS cuDF)
  • Database backends (SQL, NoSQL)
  • Streaming data sources
  • Distributed data processing (Dask, Spark)
  • Custom data transformations

Legacy Compatibility Requirements

The data package must maintain compatibility with existing usage patterns:

# Existing legacy pattern that must continue working
from legacy.data.pandas import Pandas
data = Pandas(df)
cnd_spec, estimated = CPT.fit('B', ('A',), data)

# New pattern with adapter
from causaliq_data import LegacyPandasAdapter
data_adapted = LegacyPandasAdapter(data)
cnd_spec, estimated = CPT.fit('B', ('A',), data_adapted)

Key Implementation Details

marginals() Method Behavior

For orphan nodes (no parents):

  • parents parameter: {} or {node: []}
  • Returns: (counts.reshape(-1, 1), 1, rowval, colval)
  • rowval: tuple of node values
  • colval: tuple containing a single empty dict ({},)

For nodes with a single parent:

  • parents parameter: {node: [parent_name]}
  • Returns: (counts_2d, num_cols, rowval, colval)
  • rowval: tuple of child values
  • colval: tuple of dicts ({parent: value},)

For nodes with multiple parents:

  • parents parameter: {node: [parent1, parent2, ...]}
  • Returns: (counts_2d, num_cols, rowval, colval)
  • colval: tuple of dicts ({parent1: val1, parent2: val2},)
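
As a worked illustration of the orphan-node case (counts invented; the shapes follow from the rules above):

import numpy as np

# Hypothetical result of data.marginals('C', {}, True) for a
# three-valued orphan node 'C':
counts = np.array([12, 7, 31]).reshape(-1, 1)  # one column of counts
maxcol = 1                                     # single parent combination
rowval = ('high', 'low', 'mid')                # child value per row
colval = ({},)                                 # single empty parent dict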

values() Method Behavior

  • Must return numpy array with float dtype
  • Each column corresponds to a requested node
  • Row order must be consistent with data source
  • Should validate that all requested columns exist
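
A pandas-backed sketch that honours these points might look as follows; the self._df attribute is an assumption, but the column validation and float casting are the parts the contract requires (it also demonstrates the error-handling standards in the next subsection):

import numpy as np
from typing import Tuple

def values(self, columns: Tuple[str, ...]) -> np.ndarray:
    """Sketch of a pandas-backed values() implementation."""
    if not isinstance(columns, tuple) or \
            not all(isinstance(c, str) for c in columns):
        raise TypeError('values() requires a tuple of column names')
    missing = [c for c in columns if c not in self._df.columns]
    if missing:
        raise ValueError('unknown columns: {}'.format(missing))
    # Preserve the data source's row order and cast to float dtype
    return self._df.loc[:, list(columns)].to_numpy(dtype=float)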

Error Handling Standards

All methods should raise:

  • TypeError: for incorrect argument types
  • ValueError: for invalid argument values (missing columns, etc.)

Migration Path

  1. Phase 1: Create data package with interface and implementations
  2. Phase 2: Update core package to import BNFit from the data package
  3. Phase 3: Update legacy tests to use adapters
  4. Phase 4: Add new data source types as needed

Testing Requirements

The data package should include:

  • Unit tests for each concrete implementation
  • Integration tests with core CPT.fit() and LinGauss.fit()
  • Compatibility tests with the legacy test suite
  • Performance benchmarks for the marginals calculation
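
Illustrative tests in that spirit (pytest assumed; the adapter fixture and node names are hypothetical):

import pytest

def test_values_rejects_unknown_column(adapter):
    # Error-handling contract: missing columns raise ValueError
    with pytest.raises(ValueError):
        adapter.values(('no_such_column',))

def test_marginal_counts_sum_to_N(adapter):
    # Sanity check: counts over all cells equal the sample size
    counts, _, _, _ = adapter.marginals('B', {'B': ['A']}, True)
    assert counts.sum() == adapter.N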

Notes for Implementation

  • Prioritize performance in marginals() calculation (this is the bottleneck)
  • Consider caching computed marginals for repeated queries (see the sketch after this list)
  • Ensure thread safety if needed for concurrent access
  • Document any pandas version dependencies
  • Consider memory efficiency for large datasets
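
On the caching note above, a minimal sketch (the wrapper name and cache size are illustrative; dict arguments are normalised to hashable tuples before hitting the cache):

from functools import lru_cache

class CachedMarginals:
    """Sketch: cache repeated marginals() queries on a BNFit source."""

    def __init__(self, data, maxsize=128):
        self._data = data
        self._cached = lru_cache(maxsize=maxsize)(self._compute)

    def _compute(self, node, parents_key, values_reqd):
        # Rebuild the dict form expected by the underlying data source
        parents = {node: list(parents_key)} if parents_key else {}
        return self._data.marginals(node, parents, values_reqd)

    def marginals(self, node, parents, values_reqd=False):
        # Normalise the unhashable dict argument into a tuple cache key
        parents_key = tuple(parents.get(node, ())) if parents else ()
        return self._cached(node, parents_key, values_reqd)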