NumPy - High-Performance Array Adapter¶
The NumPy class provides a high-performance implementation of the Data interface using NumPy arrays as the underlying storage. This adapter is optimized for computational efficiency and large-scale causal discovery operations.
Class Definition¶
class NumPy(Data):
    """Concrete Data subclass which holds data in NumPy arrays.

    Args:
        data (ndarray): Data provided as a 2-D NumPy array.
        dstype (DatasetType): Type of variables in dataset.
        col_values (dict): Column names and their categorical values.

    Attributes:
        data (ndarray): The original data values.
        sample (ndarray): Sample values of size N, rows possibly reordered.
        categories: Categories for each categorical node.
    """
Constructor¶
__init__(data, dstype, col_values=None) -> None¶
Creates a NumPy data adapter from a 2D NumPy array.
Arguments:
- data: 2D NumPy array with shape (n_samples, n_features)
- dstype: Dataset type (DatasetType.CATEGORICAL, CONTINUOUS, or MIXED)
- col_values: Optional mapping of column names to categorical values
Validation:
- Data must be a 2D NumPy array
- Minimum 2 samples and 2 features required
- For categorical data, values must be integer-encoded starting from 0
Initialization:
- Sets up node names as X0, X1, X2, ... by default
- Converts categorical values to appropriate categories
- Determines node types based on dstype
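Example (a sketch; the import locations of NumPy and DatasetType, and the col_values format shown, are assumptions):
import numpy as np
# Integer-encoded categorical data: 4 samples x 2 features, values from 0
raw = np.array([[0, 1],
                [1, 0],
                [1, 1],
                [0, 0]])
# Node names default to X0, X1; col_values can supply names and labels
data = NumPy(raw, dstype=DatasetType.CATEGORICAL,
             col_values={'A': ['no', 'yes'], 'B': ['no', 'yes']})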
Factory Methods¶
from_df(df, dstype, keep_df=False) -> 'NumPy'¶
Creates a NumPy instance from a pandas DataFrame.
Arguments:
- df: Pandas DataFrame containing the data
- dstype: Target dataset type for conversion
- keep_df: Whether to preserve DataFrame for as_df() operations
Features:
- Automatic conversion from pandas to NumPy format
- Intelligent handling of categorical data encoding
- Optional DataFrame preservation for round-trip compatibility
Example:
# Convert from Pandas
pandas_data = Pandas.read("data.csv", dstype="categorical")
numpy_data = NumPy.from_df(pandas_data.as_df(),
                           dstype="categorical",
                           keep_df=True)
High-Performance Operations¶
set_N(N, seed=None, random_selection=False) -> None¶
Sets working sample size with optimized sampling strategies.
Arguments:
- N: Target sample size
- seed: Random seed for reproducible results
- random_selection: Use a random subset vs the first N rows
Performance Features:
- Random Selection: Uses numpy.random.choice() for efficient random sampling
- Row Shuffling: Optional in-place shuffling with permutation()
- Memory Optimization: Works with array views when possible
- Type Conversion: Converts continuous data to float64 for precision only when needed
Implementation Details:
# rng is a seeded generator (assumed: numpy.random.default_rng(seed))
rng = numpy.random.default_rng(seed)

# Efficient random selection without replacement
indices = rng.choice(self.data.shape[0], size=N, replace=False)
self._sample = self.data[indices if seed != 0 else sorted(indices)]

# In-place row order randomization
if seed is not None and seed != 0:
    order = rng.permutation(N)
    self._sample = self._sample[order]
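Usage sketch, following the documented signature (numpy_data is an existing NumPy instance):
# First 1,000 rows in original order
numpy_data.set_N(1000)
# Reproducible random subset of 1,000 rows
numpy_data.set_N(1000, seed=42, random_selection=True)
# First 1,000 rows, order shuffled reproducibly
numpy_data.set_N(1000, seed=7)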
unique(j_reqd, num_vals) -> Tuple[ndarray, ndarray]¶
Highly optimized unique value detection and counting.
Arguments:
- j_reqd: Tuple of column indices for which unique combinations are needed
- num_vals: Array of number of unique values per column
Returns:
(combinations, counts): Unique value combinations and their frequencies
Optimization Strategy:
# Assumes: from numpy import prod, dot, bincount, unique as npunique

# Fast path for small combination spaces
max_combinations = prod(num_vals)
if max_combinations <= THRESHOLD:
    # Pack each row's values into a single integer for ultra-fast counting
    multipliers = [prod(num_vals[i + 1:]) for i in range(len(num_vals))]
    packed = dot(self.sample[:, j_reqd], multipliers)
    counts = bincount(packed)
    # ... unpack the non-zero counts back into value combinations
else:
    # Fall back to numpy.unique for large combination spaces
    combos, counts = npunique(self.sample[:, j_reqd],
                              axis=0, return_counts=True)
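Usage sketch (the column indices and value counts here are illustrative):
import numpy as np
# Unique combinations over columns 0 and 2, which have 2 and 3 values
combos, counts = numpy_data.unique(j_reqd=(0, 2), num_vals=np.array([2, 3]))
for combo, count in zip(combos, counts):
    print(combo, count)  # one observed combination and its frequency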
In-Memory Counting Optimizations¶
Categorical Value Counting¶
# Ultra-fast categorical counting using bincount
# (assumes: from numpy import bincount; node_name is column j's node name)
for j in range(self.sample.shape[1]):
    counts = {
        self.categories[j][v]: c
        for v, c in enumerate(bincount(self.sample[:, j]))
    }
    self._node_values[node_name] = {v: counts[v] for v in sorted(counts)}
Memory-Efficient Storage¶
- Uses minimal integer types for categorical data (typically int16 or int32)
- Lazy conversion to float64 only for continuous scoring operations
- Strategic copying vs view usage to minimize memory footprint
Advanced Sampling¶
Random Selection Strategies¶
Random Subset Selection:
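A sketch of the intended call, per the set_N signature above:
# Reproducible random subset of 5,000 rows
numpy_data.set_N(5000, seed=42, random_selection=True)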
Ordered Sampling with Shuffling:
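A sketch of the corresponding call, where the first N rows are taken and then shuffled:
# First 5,000 rows retained, then shuffled in place with the given seed
numpy_data.set_N(5000, seed=42)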
Deterministic Reproducibility¶
- Seed=0 and seed=None both preserve original data order
- Positive seeds enable reproducible randomization
- Consistent behavior across multiple calls with same seed
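A sketch illustrating these guarantees (numpy_data is an existing instance):
# seed=None and seed=0 both preserve the original row order
numpy_data.set_N(1000, seed=None)
numpy_data.set_N(1000, seed=0)
# The same positive seed yields the same shuffled order on every call
numpy_data.set_N(1000, seed=3)
first = numpy_data.sample.copy()
numpy_data.set_N(1000, seed=3)
assert (first == numpy_data.sample).all()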
Statistical Operations¶
marginals(node, parents, values_reqd=False) -> Tuple¶
Efficient marginal computation using NumPy operations.
Implementation:
- Leverages the optimized unique() method for counting
- Handles sparse parent configurations efficiently
- Returns results in a format compatible with scoring algorithms
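A hedged call sketch; the list-of-names format for parents is an assumption, and the docs state only that the return value is scoring-compatible:
# Marginal counts for node X0 given parents X1 and X2
result = numpy_data.marginals('X0', ['X1', 'X2'])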
values(nodes) -> ndarray¶
Direct array access for specified columns.
Performance:
- Returns views when possible to avoid copying
- Maintains column order as specified
- Efficient slicing for subset access
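Usage sketch (node names are the default X0, X1, ...):
# Columns for X2 and X0, in the order requested
arr = numpy_data.values(('X2', 'X0'))
print(arr.shape)  # (N, 2) for the current working sample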
Memory Management¶
Data Storage Strategy¶
self.data # Original immutable data
self._sample # Current working sample (possibly reordered)
self.categories # Categorical value mappings (shared across samples)
Copy-on-Write Semantics¶
- Original data never modified
- Sample arrays created as views when order unchanged
- Copies created only when shuffling or subset selection required
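This behaviour can be observed with NumPy's own tooling (a sketch, not part of the adapter's API):
import numpy as np
# Order unchanged: the sample may be a view sharing memory with data
numpy_data.set_N(1000)
print(np.shares_memory(numpy_data.data, numpy_data.sample))
# A positive seed shuffles rows, which forces a copy
numpy_data.set_N(1000, seed=42)
print(np.shares_memory(numpy_data.data, numpy_data.sample))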
Type Optimization¶
- Categorical data stored as smallest possible integer type
- Continuous data uses float32 by default, converted to float64 only for scoring
- Automatic type inference minimizes memory usage
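The dtypes in play can be inspected directly (a sketch):
import numpy as np
print(numpy_data.sample.dtype)   # e.g. int16/int32 for categorical data
print(numpy_data.sample.nbytes)  # total bytes held by the working sample
# float32 halves the per-value footprint relative to float64
print(np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize)  # 4 8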
Name Randomization¶
randomise_names(seed=None) -> None¶
Efficient node name randomization without data copying.
Features:
- Updates only mapping dictionaries, not underlying arrays
- Preserves all data relationships and types
- Updates cached node_values and node_types dictionaries consistently
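Usage sketch; the identity check encodes the no-copy claim above as an assumption:
# Randomize the node-name mapping reproducibly; arrays are untouched
before = numpy_data.sample
numpy_data.randomise_names(seed=7)
assert numpy_data.sample is before  # same underlying array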
Performance Benchmarks¶
Typical Performance Characteristics¶
Memory Usage:
- ~50-80% less memory than equivalent pandas DataFrame
- Categorical data: ~2-4 bytes per value vs 8+ bytes in pandas
- Continuous data: 4 bytes (float32) vs 8 bytes (float64) by default
Computational Speed:
- Unique value detection: 10-100x faster than pandas for categorical data
- Sample subset creation: 5-20x faster than DataFrame operations
- Marginal calculations: 20-50x faster for large datasets
Scalability:
- Efficiently handles datasets with millions of rows
- Linear scaling with data size for most operations
- Memory usage scales predictably with dataset dimensions
Integration Examples¶
High-Performance Workflow¶
# Load and convert for performance
pandas_data = Pandas.read("large_dataset.csv", dstype="categorical")
numpy_data = NumPy.from_df(pandas_data.as_df(),
                           dstype="categorical")
# Set large working sample efficiently
numpy_data.set_N(100000, seed=42, random_selection=True)
# Perform intensive causal discovery
results = heavy_computation_algorithm(numpy_data)
Memory-Conscious Processing¶
# Process data in chunks for memory efficiency
for chunk_seed in range(10):
    numpy_data.set_N(10000, seed=chunk_seed, random_selection=True)
    chunk_results = process_chunk(numpy_data)
    aggregate_results(chunk_results)
Benchmarking and Experimentation¶
# Performance comparison across randomizations
import time
import numpy as np

timing_results = []
for trial in range(100):
    numpy_data.randomise_names(seed=trial)
    start_time = time.time()
    result = algorithm.run(numpy_data)
    timing_results.append(time.time() - start_time)

print(f"Mean runtime: {np.mean(timing_results):.3f}s")
print(f"Std deviation: {np.std(timing_results):.3f}s")
Best Practices¶
When to Use NumPy Adapter¶
- Large datasets (>10K rows typically)
- Performance-critical causal discovery algorithms
- Memory-constrained environments
- Repeated statistical computations
- Benchmark and stability experiments requiring many randomizations
Optimization Tips¶
- Use random_selection=True only when needed (it creates a copy)
- Convert from Pandas early in the pipeline for consistent performance
- Leverage keep_df=True only if round-trip DataFrame access is needed
- Choose the appropriate dstype for your data characteristics