Pandas - DataFrame-Based Data Adapter¶
The Pandas class provides a concrete implementation of the Data interface using pandas DataFrames as the underlying data storage. This adapter is ideal for exploratory data analysis and moderate-sized datasets where pandas' rich functionality is beneficial.
Class Definition¶
class Pandas(Data):
"""Data subclass which holds data in a Pandas dataframe.
Args:
df: Data provided as a Pandas dataframe.
Attributes:
df: Original Pandas dataframe providing data.
dstype: Type of dataset (categorical/numeric/mixed).
"""
Constructor¶
__init__(df: DataFrame) -> None¶
Creates a new Pandas data adapter from a DataFrame.
Arguments:
df: Pandas DataFrame containing the data
Validation:
- Minimum 2 rows and 2 columns required
- No missing data (NaN values) allowed
- All column names must be strings
Raises:
TypeError: If df is not a pandas DataFrameValueError: If DataFrame size or data validation fails
Class Methods¶
read(filename, dstype, N=None, **kwargs) -> 'Pandas'¶
Factory method to create a Pandas instance by reading data from a file.
Arguments:
filename: Path to data file (supports .csv, .gz compression)dstype: Dataset type ("categorical", "continuous", or "mixed")N: Optional sample size limit**kwargs: Additional arguments for pandas.read_csv()
Features: - Automatic compression detection for .gz files - Intelligent type inference and conversion - Categorical variable encoding - Memory-efficient loading for large files
Example:
# Load categorical data
data = Pandas.read("dataset.csv", dstype="categorical")
# Load with custom separator and sample size
data = Pandas.read("data.tsv", dstype="mixed", N=10000, sep='\t')
Data Management¶
set_N(N, seed=None, random_selection=False) -> None¶
Sets the working sample size with optional randomization.
Arguments:
N: Target sample size (must be ≤ original data size)seed: Randomization seed for reproducible samplingrandom_selection: If True, randomly selects rows; if False, uses first N rows
Behavior:
- Updates internal
_sampleDataFrame with the specified subset - Preserves original data in
dfattribute - Recomputes categorical value counts for the new sample
randomise_names(seed=None) -> None¶
Randomizes node names for algorithm sensitivity testing.
Arguments:
seed: Randomization seed, or None to revert to original names
Implementation:
- Generates randomized column names using format
X###NNNNNN - Updates DataFrame column names in-place
- Maintains mappings between original and external names
- Updates sample DataFrame to reflect name changes
Statistical Operations¶
marginals(node, parents, values_reqd=False) -> Tuple¶
Computes marginal distributions for a node given its parents.
Arguments:
node: Target node name (external name)parents: Dictionary of parent values{parent_name: value}values_reqd: If True, returns actual values; if False, returns counts only
Returns:
- Tuple of (marginal_counts, unique_values) for categorical data
- For continuous data, returns appropriate statistical summaries
Implementation:
- Uses pandas crosstab for efficient categorical marginalization
- Handles continuous variables with binning strategies
- Optimized for sparse parent configurations
values(nodes: Tuple[str, ...]) -> np.ndarray¶
Returns the actual data values for specified nodes.
Arguments:
nodes: Tuple of node names (external names)
Returns:
- NumPy array with shape (N, len(nodes)) containing the data values
Usage:
# Get values for specific variables
subset = data.values(("var1", "var2", "var3"))
print(subset.shape) # (N, 3)
Properties¶
Node Information¶
nodes: Original column names from DataFramenode_types: Mapping of node names to their data typesnode_values: Value counts for categorical variables only
Sample Access¶
N: Current working sample sizesample: Current working sample as DataFrame
Data Conversion¶
as_df() -> DataFrame¶
Returns the current working sample as a pandas DataFrame.
Returns:
- DataFrame with external column names and current sample data
Usage:
# Access current sample
current_df = data.as_df()
# Convert to NumPy for performance-critical operations
numpy_data = NumPy.from_df(current_df, dstype=data.dstype)
File I/O¶
write(filename, compress=False, sf=10, zero=None, preserve=True) -> None¶
Writes the current sample to a CSV file.
Arguments:
filename: Output file pathcompress: Whether to gzip compress the outputsf: Significant figures for floating-point datazero: Value to replace zeros with (for numerical stability)preserve: Whether to preserve original formatting
Features:
- Automatic compression if filename ends with .gz
- Configurable precision for floating-point output
- Handles categorical data appropriately
- Preserves data integrity during round-trip operations
Type Handling¶
Automatic Type Inference¶
The Pandas adapter automatically infers and converts data types:
Categorical Data:
- String columns are converted to pandas categorical type
- Integer columns with limited unique values become categorical
- Maintains category ordering where applicable
Numeric Data:
- Floating-point columns preserve precision
- Integer columns use appropriate NumPy integer types
- Mixed columns are handled according to dstype parameter
Type Validation¶
# Dataset type is automatically determined
if data.dstype == "categorical":
print("All variables are categorical")
elif data.dstype == "continuous":
print("All variables are numeric")
else: # "mixed"
print("Mixed variable types detected")
Memory Management¶
Efficient Sampling¶
- Original DataFrame is preserved in
dfattribute - Working sample stored separately in
_sample - View-based operations where possible to minimize copying
Lazy Operations¶
- Type conversions performed only when necessary
- Value counts computed on-demand for categorical variables
- Sample updates triggered only when needed
Performance Characteristics¶
Best Use Cases¶
- Exploratory data analysis and prototyping
- Moderate-sized datasets (< 100K rows typically)
- Mixed data types requiring pandas functionality
- File I/O with various formats and options
Performance Considerations¶
- Memory overhead due to pandas metadata
- String operations can be slow for large categorical data
- DataFrame operations generally slower than pure NumPy
Integration Examples¶
With NumPy Adapter¶
# Load and explore with Pandas
pandas_data = Pandas.read("data.csv", dstype="mixed")
print(pandas_data.as_df().describe())
# Convert to NumPy for computational efficiency
numpy_data = NumPy.from_df(pandas_data.as_df(),
dstype=pandas_data.dstype,
keep_df=True)
Workflow Integration¶
# Load data
data = Pandas.read("experiment.csv", dstype="categorical")
# Set up experiment parameters
data.set_N(5000, seed=42, random_selection=True)
data.randomise_names(seed=123)
# Run causal discovery algorithm
results = discovery_algorithm.run(data)
# Save results
data.write("experiment_sample.csv", compress=True)