
NumPy - High-Performance Array Adapter

The NumPy class provides a high-performance implementation of the Data interface using NumPy arrays as the underlying storage. This adapter is optimized for computational efficiency and large-scale causal discovery operations.

Class Definition

class NumPy(Data):
    """Concrete Data subclass which holds data in NumPy arrays.

    Args:
        data (ndarray): Data provided as a 2-D NumPy array.
        dstype (DatasetType): Type of variables in dataset.
        col_values (dict): Column names and their categorical values.

    Attributes:
        data (ndarray): The original data values.
        sample (ndarray): Sample values of size N, rows possibly reordered.
        categories: Categories for each categorical node.
    """

Constructor

__init__(data, dstype, col_values=None) -> None

Creates a NumPy data adapter from a 2D NumPy array.

Arguments:

  • data: 2D NumPy array with shape (n_samples, n_features)
  • dstype: Dataset type (DatasetType.CATEGORICAL, CONTINUOUS, or MIXED)
  • col_values: Optional mapping of column names to categorical values

Validation:

  • Data must be a 2D NumPy array
  • A minimum of 2 samples and 2 features is required
  • For categorical data, values must be integer-encoded starting from 0

Initialization:

  • Sets up node names as X0, X1, X2, ... by default
  • Converts categorical values to appropriate categories
  • Determines node types based on dstype
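
Example:

A minimal construction sketch (the import path for DatasetType and the exact col_values format may differ in your installation; the column names and values here are illustrative):

import numpy as np

# Three categorical columns, integer-encoded from 0 as required
raw = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 1],
                [1, 1, 0]], dtype=np.int16)

data = NumPy(raw, DatasetType.CATEGORICAL,
             col_values={"X0": ["no", "yes"],
                         "X1": ["no", "yes"],
                         "X2": ["no", "yes"]})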

Factory Methods

from_df(df, dstype, keep_df=False) -> 'NumPy'

Creates a NumPy instance from a pandas DataFrame.

Arguments:

  • df: Pandas DataFrame containing the data
  • dstype: Target dataset type for conversion
  • keep_df: Whether to preserve the DataFrame for as_df() operations

Features:

  • Automatic conversion from pandas to NumPy format
  • Intelligent handling of categorical data encoding
  • Optional DataFrame preservation for round-trip compatibility

Example:

# Convert from Pandas
pandas_data = Pandas.read("data.csv", dstype="categorical")
numpy_data = NumPy.from_df(pandas_data.as_df(),
                           dstype="categorical",
                           keep_df=True)

High-Performance Operations

set_N(N, seed=None, random_selection=False) -> None

Sets working sample size with optimized sampling strategies.

Arguments:

  • N: Target sample size
  • seed: Random seed for reproducible results
  • random_selection: Use random subset vs first N rows

Performance Features:

  • Random Selection: Uses numpy.random.choice() for efficient random sampling
  • Row Shuffling: Optional in-place shuffling with permutation()
  • Memory Optimization: Works with array views when possible
  • Type Conversion: Converts continuous data to float64 for precision only when needed

Implementation Details:

# rng is a numpy.random.Generator created from the supplied seed

# Efficient random selection without replacement
indices = rng.choice(self.data.shape[0], size=N, replace=False)
self._sample = self.data[indices if seed != 0 else sorted(indices)]

# Row order randomization of the working sample
if seed is not None and seed != 0:
    order = rng.permutation(N)
    self._sample = self.sample[order]

unique(j_reqd, num_vals) -> Tuple[ndarray, ndarray]

Highly optimized unique value detection and counting.

Arguments:

  • j_reqd: Tuple of column indices for which unique combinations are needed
  • num_vals: Array of number of unique values per column

Returns:

  • (combinations, counts): Unique value combinations and their frequencies

Optimization Strategy:

# prod, dot and bincount come from numpy; npunique is numpy.unique;
# THRESHOLD is an internal cut-off on the size of the combination space

# Fast path for small combination spaces
max_combinations = prod(num_vals)
if max_combinations <= THRESHOLD:
    # Integer packing for ultra-fast counting: each row's values are
    # collapsed into one integer (mixed-radix encoding), then tallied
    multipliers = [prod(num_vals[i+1:]) for i in range(len(num_vals))]
    packed = dot(self.sample[:, j_reqd], multipliers)
    counts = bincount(packed)
    # Unpack results efficiently
else:
    # Fall back to numpy.unique for large spaces
    combos, counts = npunique(self.sample[:, j_reqd],
                              axis=0, return_counts=True)
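
The packing idea can be illustrated with plain NumPy (a standalone sketch of the technique, not the class's internal code):

import numpy as np

# Two categorical columns with 2 and 3 levels respectively
sample = np.array([[0, 2], [1, 0], [0, 2], [1, 1], [0, 0]])
num_vals = np.array([2, 3])

# Mixed-radix packing: each row collapses to one integer in [0, 2*3)
multipliers = np.array([3, 1])   # product of num_vals to the right of each column
packed = sample @ multipliers
counts = np.bincount(packed, minlength=num_vals.prod())

# The general-purpose route yields the same frequencies
combos, ref_counts = np.unique(sample, axis=0, return_counts=True)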

In-Memory Counting Optimizations

Categorical Value Counting

# Ultra-fast categorical counting using bincount;
# node_name denotes the name of column j
for j in range(self.sample.shape[1]):
    counts = {
        self.categories[j][v]: c
        for v, c in enumerate(bincount(self.sample[:, j]))
    }
    self._node_values[node_name] = {v: counts[v] for v in sorted(counts)}

Memory-Efficient Storage

  • Uses minimal integer types for categorical data (typically int16 or int32)
  • Lazy conversion to float64 only for continuous scoring operations
  • Strategic copying vs view usage to minimize memory footprint
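
One way to realize the narrow-type selection (a sketch using standard NumPy utilities; the adapter's exact logic may differ):

import numpy as np

def smallest_int_dtype(num_categories):
    # Narrowest integer type able to hold codes 0 .. num_categories-1
    return np.min_scalar_type(num_categories - 1)

codes = np.array([0, 3, 1, 2, 3]).astype(smallest_int_dtype(4))   # uint8 here

# Continuous columns can stay float32 until a scorer needs double precision
x = np.asarray([0.1, 0.2, 0.3], dtype=np.float32)
x64 = x.astype(np.float64)     # lazy up-conversion only when required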

Advanced Sampling

Random Selection Strategies

Random Subset Selection:

data.set_N(1000, seed=42, random_selection=True)
# Randomly selects 1000 rows from dataset

Ordered Sampling with Shuffling:

data.set_N(1000, seed=42, random_selection=False)
# Uses first 1000 rows but randomizes their order

Deterministic Reproducibility

  • seed=0 and seed=None both preserve the original data order
  • Positive seeds enable reproducible randomization
  • Consistent behavior across multiple calls with same seed
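
A quick reproducibility check (a sketch using the documented sample attribute):

import numpy as np

numpy_data.set_N(1000, seed=42, random_selection=True)
first = numpy_data.sample.copy()

numpy_data.set_N(1000, seed=42, random_selection=True)
assert np.array_equal(first, numpy_data.sample)   # same seed -> identical sample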

Statistical Operations

marginals(node, parents, values_reqd=False) -> Tuple

Efficient marginal computation using NumPy operations.

Implementation:

  • Leverages optimized unique() method for counting
  • Handles sparse parent configurations efficiently
  • Returns results in format compatible with scoring algorithms
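
The counting step it relies on can be pictured with plain NumPy (a standalone illustration of the idea, not the method's exact code or return format):

import numpy as np

sample = np.array([[0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 1, 1],
                   [0, 1, 1]])
child, parents = 0, [1, 2]

# One pass counts every (child value, parent configuration) combination
cols = [child] + parents
combos, counts = np.unique(sample[:, cols], axis=0, return_counts=True)
# combos[i, 0] is the child value, combos[i, 1:] the parent configuration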

values(nodes) -> ndarray

Direct array access for specified columns.

Performance:

  • Returns views when possible to avoid copying
  • Maintains column order as specified
  • Efficient slicing for subset access
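
The view-versus-copy distinction follows standard NumPy indexing rules: contiguous slices are views, arbitrary column selections are copies.

import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(4, 3)

view = arr[:, 1:3]            # basic slice: a view, no data copied
copy = arr[:, [2, 0]]         # fancy indexing: a copy, columns in requested order

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False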

Memory Management

Data Storage Strategy

self.data        # Original immutable data
self._sample     # Current working sample (possibly reordered)
self.categories  # Categorical value mappings (shared across samples)

Copy-on-Write Semantics

  • Original data never modified
  • Sample arrays created as views when order unchanged
  • Copies created only when shuffling or subset selection required

Type Optimization

  • Categorical data stored as smallest possible integer type
  • Continuous data uses float32 by default, converted to float64 only for scoring
  • Automatic type inference minimizes memory usage

Name Randomization

randomise_names(seed=None) -> None

Efficient node name randomization without data copying.

Features:

  • Updates only mapping dictionaries, not underlying arrays
  • Preserves all data relationships and types
  • Updates cached node_values and node_types dictionaries consistently
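
A sketch of the idea (hypothetical helper; the real method operates on the adapter's internal dictionaries):

import numpy as np

def shuffled_name_map(names, seed=None):
    # Map each existing node name to a new one by permuting the name list;
    # the underlying data arrays are never touched
    rng = np.random.default_rng(seed)
    return dict(zip(names, rng.permutation(names)))

mapping = shuffled_name_map(["X0", "X1", "X2"], seed=7)
# Cached dictionaries such as node_values and node_types are re-keyed via `mapping`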

Performance Benchmarks

Typical Performance Characteristics

Memory Usage:

  • ~50-80% less memory than equivalent pandas DataFrame
  • Categorical data: ~2-4 bytes per value vs 8+ bytes in pandas
  • Continuous data: 4 bytes (float32) vs 8 bytes (float64) by default

Computational Speed:

  • Unique value detection: 10-100x faster than pandas for categorical data
  • Sample subset creation: 5-20x faster than DataFrame operations
  • Marginal calculations: 20-50x faster for large datasets

Scalability:

  • Efficiently handles datasets with millions of rows
  • Linear scaling with data size for most operations
  • Memory usage scales predictably with dataset dimensions

Integration Examples

High-Performance Workflow

# Load and convert for performance
pandas_data = Pandas.read("large_dataset.csv", dstype="categorical")
numpy_data = NumPy.from_df(pandas_data.as_df(),
                           dstype="categorical")

# Set large working sample efficiently
numpy_data.set_N(100000, seed=42, random_selection=True)

# Perform intensive causal discovery
results = heavy_computation_algorithm(numpy_data)

Memory-Conscious Processing

# Process data in chunks for memory efficiency
for chunk_seed in range(10):
    numpy_data.set_N(10000, seed=chunk_seed, random_selection=True)
    chunk_results = process_chunk(numpy_data)
    aggregate_results(chunk_results)

Benchmarking and Experimentation

import time
import numpy as np

# Performance comparison across randomizations
timing_results = []
for trial in range(100):
    numpy_data.randomise_order(seed=trial)
    start_time = time.time()
    result = algorithm.run(numpy_data)
    timing_results.append(time.time() - start_time)

print(f"Mean runtime: {np.mean(timing_results):.3f}s")
print(f"Std deviation: {np.std(timing_results):.3f}s")

Best Practices

When to Use NumPy Adapter

  • Large datasets (>10K rows typically)
  • Performance-critical causal discovery algorithms
  • Memory-constrained environments
  • Repeated statistical computations
  • Benchmark and stability experiments requiring many randomizations

Optimization Tips

  • Use random_selection=True only when needed (creates copy)
  • Convert from Pandas early in pipeline for consistent performance
  • Leverage keep_df=True only if round-trip DataFrame access needed
  • Choose appropriate dstype for your data characteristics