Pandas - DataFrame-Based Data Adapter¶

The Pandas class provides a concrete implementation of the Data interface using pandas DataFrames as the underlying data storage. This adapter is ideal for exploratory data analysis and moderate-sized datasets where pandas' rich functionality is beneficial.

Class Definition¶

class Pandas(Data):
    """Data subclass which holds data in a Pandas dataframe.

    Args:
        df: Data provided as a Pandas dataframe.

    Attributes:
        df: Original Pandas dataframe providing data.
        dstype: Type of dataset (categorical/numeric/mixed).
    """

Constructor¶

`__init__(df: DataFrame) -> None`¶

Creates a new Pandas data adapter from a DataFrame.

Arguments:

df: Pandas DataFrame containing the data

Validation:

Minimum 2 rows and 2 columns required
No missing data (NaN values) allowed
All column names must be strings

Raises:

TypeError: If df is not a pandas DataFrame
ValueError: If DataFrame size or data validation fails
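
The checks listed above can be sketched with plain pandas calls. This is only an illustration: `validate_df` is a hypothetical helper, not part of the Pandas class, and raising ValueError for non-string column names is an assumption based on the "Raises" list.

```python
import pandas as pd


def validate_df(df):
    """Hypothetical helper mirroring the documented constructor checks."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    n_rows, n_cols = df.shape
    if n_rows < 2 or n_cols < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    if df.isna().any().any():
        raise ValueError("missing values (NaN) are not allowed")
    if not all(isinstance(name, str) for name in df.columns):
        # Assumption: the real class may use a different exception type here.
        raise ValueError("all column names must be strings")


good = pd.DataFrame({"A": ["0", "1", "1"], "B": ["x", "y", "x"]})
validate_df(good)  # passes silently

bad_inputs = [
    [1, 2, 3],                                        # not a DataFrame
    good.iloc[:1],                                    # only one row
    pd.DataFrame({"A": [1.0, None], "B": [1.0, 2.0]}),  # contains NaN
    good.rename(columns={"A": 0}),                    # non-string column name
]
errors = []
for bad in bad_inputs:
    try:
        validate_df(bad)
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)
```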

Class Methods¶

`read(filename, dstype, N=None, **kwargs) -> 'Pandas'`¶

Factory method to create a Pandas instance by reading data from a file.

Arguments:

filename: Path to the data file (a .csv file, optionally gzip-compressed with a .gz suffix)
dstype: Dataset type ("categorical", "continuous", or "mixed")
N: Optional sample size limit
**kwargs: Additional arguments for pandas.read_csv()

Features:

Automatic compression detection for .gz files
Intelligent type inference and conversion
Categorical variable encoding
Memory-efficient loading for large files

Example:

# Load categorical data
data = Pandas.read("dataset.csv", dstype="categorical")

# Load with custom separator and sample size
data = Pandas.read("data.tsv", dstype="mixed", N=10000, sep='\t')
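
For readers who want to see what such a load might look like underneath, here is a minimal sketch using only `pandas.read_csv`. It assumes, without confirmation from the source, that the sample size limit behaves like `nrows` and that categorical loading behaves like `dtype="category"`. The file name and data are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV so the example is self-contained.
source = pd.DataFrame({"A": ["0", "1", "1", "0"], "B": ["x", "y", "x", "y"]})
path = os.path.join(tempfile.mkdtemp(), "dataset.csv.gz")
source.to_csv(path, index=False)  # compression inferred from the ".gz" suffix

# nrows caps the number of rows read (assumed analogue of the N argument);
# dtype="category" loads every column as a pandas categorical.
loaded = pd.read_csv(path, nrows=3, dtype="category")
```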

Data Management¶

`set_N(N, seed=None, random_selection=False) -> None`¶

Sets the working sample size with optional randomization.

Arguments:

N: Target sample size (must be ≤ original data size)
seed: Randomization seed for reproducible sampling
random_selection: If True, randomly selects rows; if False, uses first N rows

Behavior:

Updates internal _sample DataFrame with the specified subset
Preserves original data in df attribute
Recomputes categorical value counts for the new sample
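
The two selection modes can be illustrated with standard pandas operations. `take_sample` below is a hypothetical stand-in written for this example. Whether the class itself uses `DataFrame.sample` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"A": list("0110100110"), "B": list("xyxyxxyyxy")})


def take_sample(df, N, seed=None, random_selection=False):
    """Hypothetical stand-in for the row selection described above."""
    if not 1 <= N <= len(df):
        raise ValueError("N must be between 1 and the number of rows")
    if random_selection:
        # The same seed always yields the same rows.
        return df.sample(n=N, random_state=seed)
    return df.iloc[:N]  # first N rows


first = take_sample(df, 4)
random_a = take_sample(df, 4, seed=42, random_selection=True)
random_b = take_sample(df, 4, seed=42, random_selection=True)

# Value counts are recomputed from the sample; the original df is untouched.
counts = {col: first[col].value_counts().to_dict() for col in first.columns}
```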

`randomise_names(seed=None) -> None`¶

Randomizes node names so that an algorithm's sensitivity to variable naming can be tested.

Arguments:

seed: Randomization seed, or None to revert to original names

Implementation:

Generates randomized column names using format X###NNNNNN
Updates DataFrame column names in-place
Maintains mappings between original and external names
Updates sample DataFrame to reflect name changes
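
The name mapping can be sketched with the standard library alone. The source gives only the pattern X###NNNNNN. Reading it as "X", a zero-padded three-digit number, then the first six characters of the original name is an assumption made for this example, and `randomised_names` is a hypothetical function.

```python
import random


def randomised_names(names, seed):
    """Map each original name to a hypothetical X###NNNNNN-style name."""
    if seed is None:
        return {name: name for name in names}  # revert to original names
    numbers = list(range(len(names)))
    random.Random(seed).shuffle(numbers)  # seeded, so reproducible
    return {
        name: f"X{number:03d}{name[:6]}"
        for name, number in zip(names, numbers)
    }


original = ["smoker", "bronchitis", "xray"]
ext = randomised_names(original, seed=123)
back = {new: old for old, new in ext.items()}  # external -> original
```

Keeping both directions of the mapping lets results expressed in randomized names be translated back to the original column names.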

Statistical Operations¶

`marginals(node, parents, values_reqd=False) -> Tuple`¶

Computes marginal distributions for a node given its parents.

Arguments:

node: Target node name (external name)
parents: Dictionary of parent values {parent_name: value}
values_reqd: If True, returns actual values; if False, returns counts only

Returns:

Tuple of (marginal_counts, unique_values) for categorical data
For continuous data, returns appropriate statistical summaries

Implementation:

Uses pandas crosstab for efficient categorical marginalization
Handles continuous variables with binning strategies
Optimized for sparse parent configurations
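
As an illustration of crosstab-based marginalization, the sketch below counts the values of one column for each observed combination of two parent columns. The data and column names are invented, and the exact structure the method returns is not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rain": ["no", "no", "yes", "yes", "yes", "no"],
        "sprinkler": ["on", "off", "off", "off", "off", "on"],
        "wet": ["yes", "no", "yes", "yes", "yes", "yes"],
    }
)

# Rows are the values of the node; columns are the observed combinations of
# parent values. The combination (rain=yes, sprinkler=on) never occurs in
# this data, so it gets no column.
counts = pd.crosstab(df["wet"], [df["rain"], df["sprinkler"]])
```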

`values(nodes: Tuple[str, ...]) -> np.ndarray`¶

Returns the actual data values for specified nodes.

Arguments:

nodes: Tuple of node names (external names)

Returns:

NumPy array with shape (N, len(nodes)) containing the data values

Usage:

# Get values for specific variables
subset = data.values(("var1", "var2", "var3"))
print(subset.shape)  # (N, 3)

Properties¶

Node Information¶

nodes: Original column names from DataFrame
node_types: Mapping of node names to their data types
node_values: Value counts for categorical variables only

Sample Access¶

N: Current working sample size
sample: Current working sample as DataFrame

Data Conversion¶

`as_df() -> DataFrame`¶

Returns the current working sample as a pandas DataFrame.

Returns:

DataFrame with external column names and current sample data

Usage:

# Access current sample
current_df = data.as_df()

# Convert to NumPy for performance-critical operations
numpy_data = NumPy.from_df(current_df, dstype=data.dstype)

File I/O¶

`write(filename, compress=False, sf=10, zero=None, preserve=True) -> None`¶

Writes the current sample to a CSV file.

Arguments:

filename: Output file path
compress: Whether to gzip compress the output
sf: Significant figures for floating-point data
zero: Value to replace zeros with (for numerical stability)
preserve: Whether to preserve original formatting

Features:

Automatic compression if filename ends with .gz
Configurable precision for floating-point output
Handles categorical data appropriately
Preserves data integrity during round-trip operations
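
A rough pandas-only sketch of precision-limited, compressed output follows. It assumes that significant figures map onto a `%g`-style `float_format`, which the source does not confirm. The file name and data are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [3.14159265, 0.000123456], "y": [2.0, 1234.5678]})
path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")

# "%.3g" keeps three significant figures; the ".gz" suffix makes pandas
# gzip the output because compression defaults to "infer".
df.to_csv(path, index=False, float_format="%.3g")

# Gzip files start with the two magic bytes 0x1f 0x8b.
with open(path, "rb") as handle:
    magic = handle.read(2)

restored = pd.read_csv(path)  # decompressed automatically on read
```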

Type Handling¶

Automatic Type Inference¶

The Pandas adapter automatically infers and converts data types:

Categorical Data:

String columns are converted to pandas categorical type
Integer columns with limited unique values become categorical
Maintains category ordering where applicable

Numeric Data:

Floating-point columns preserve precision
Integer columns use appropriate NumPy integer types
Mixed columns are handled according to dstype parameter
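
The conversion rules can be approximated with pandas dtype utilities. `infer_dstype` is a hypothetical classifier written for this example. It is not the adapter's own logic and does not cover the integer-to-categorical rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_dstype(df):
    """Hypothetical classifier: 'categorical', 'continuous' or 'mixed'."""
    numeric = [is_numeric_dtype(df[col]) for col in df.columns]
    if all(numeric):
        return "continuous"
    if not any(numeric):
        return "categorical"
    return "mixed"


raw = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1.5, 2.0, 0.7]})

# String columns become pandas categoricals; float columns are left alone.
typed = raw.astype({"colour": "category"})
kind = infer_dstype(typed)
```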

Type Validation¶

# Dataset type is automatically determined
if data.dstype == "categorical":
    print("All variables are categorical")
elif data.dstype == "continuous":
    print("All variables are numeric")
else:  # "mixed"
    print("Mixed variable types detected")

Memory Management¶

Efficient Sampling¶

Original DataFrame is preserved in df attribute
Working sample stored separately in _sample
View-based operations where possible to minimize copying

Lazy Operations¶

Type conversions performed only when necessary
Value counts computed on-demand for categorical variables
Sample updates triggered only when needed
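
One common way to get on-demand computation in Python is `functools.cached_property`. The class below is a hypothetical illustration of that pattern using plain lists. The adapter's actual caching mechanism is not described in this page.

```python
from collections import Counter
from functools import cached_property


class LazyCounts:
    """Hypothetical holder that counts values on first access only."""

    def __init__(self, columns):
        self.columns = columns  # {column name: list of values}
        self.computations = 0

    @cached_property
    def node_values(self):
        self.computations += 1
        return {
            name: dict(Counter(values))
            for name, values in self.columns.items()
        }


lazy = LazyCounts({"A": ["0", "1", "1"], "B": ["x", "x", "x"]})
before = lazy.computations  # nothing computed yet
first = lazy.node_values    # triggers the computation
second = lazy.node_values   # served from the cache
```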

Performance Characteristics¶

Best Use Cases¶

Exploratory data analysis and prototyping
Moderate-sized datasets (< 100K rows typically)
Mixed data types requiring pandas functionality
File I/O with various formats and options

Performance Considerations¶

Memory overhead due to pandas metadata
String operations can be slow for large categorical data
DataFrame operations generally slower than pure NumPy

Integration Examples¶

With NumPy Adapter¶

# Load and explore with Pandas
pandas_data = Pandas.read("data.csv", dstype="mixed")
print(pandas_data.as_df().describe())

# Convert to NumPy for computational efficiency
numpy_data = NumPy.from_df(
    pandas_data.as_df(),
    dstype=pandas_data.dstype,
    keep_df=True,
)

Workflow Integration¶

# Load data
data = Pandas.read("experiment.csv", dstype="categorical")

# Set up experiment parameters
data.set_N(5000, seed=42, random_selection=True)
data.randomise_names(seed=123)

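# Run a causal discovery algorithm
# (discovery_algorithm is a placeholder for your own algorithm object)
results = discovery_algorithm.run(data)

# Save the sample the algorithm was run on
data.write("experiment_sample.csv", compress=True)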
