
Evaluating Graphs

The evaluate_graph capability structurally evaluates a graph (PDAG, CPDAG, or DAG) against a reference graph. Note that comparisons between general SDG graphs are not supported.

This is an update action (see workflow patterns): when used within a CausalIQ workflow, it updates the metadata of each existing graph in the input cache with the requested metrics.


Parameters

Parameter   CLI Flag         Required   Description
input       -i/--input       Yes        Learned graph file (CLI) or workflow cache .db (action)
reference   -r/--reference   Yes        Reference graph file (.csv, .graphml, .tetrad, .xdsl, .dsc)
metric      -m/--metric      Yes        Metric(s) to compute (repeatable in CLI)
output      -o/--output      CLI only   Output directory for _meta.json file
filter      (none)           No         Filter expression for cache entries (workflow only)

Supported Metrics: f1, shd, precision, recall, equiv.f1, equiv.shd

Notes:

  • In CLI mode, input is a graph file (.csv, .graphml, .tetrad, .xdsl, .dsc) and output is a directory where _meta.json will be written.
  • In CLI mode you can request multiple metrics by repeating the -m/--metric option, e.g. "-m f1 -m shd".
  • In workflows, input is a workflow cache (.db) and output is prohibited (UPDATE action pattern). The filter parameter can select specific cache entries.

CLI Usage

Basic Comparison

Compare a learned graph against a reference:

causaliq-analysis evaluate-graph -i learned.graphml -r ground_truth.graphml \
    -m f1 -m shd -o results/eval

This creates results/eval/_meta.json containing:

{
  "f1": 0.7,
  "shd": 3
}

All Metrics

Request all available metrics:

causaliq-analysis evaluate-graph -i learned.graphml -r ground_truth.graphml \
    -m f1 -m shd -m precision -m recall -m equiv.f1 -m equiv.shd \
    -o results/eval

Equivalence Class Metrics

Compare equivalence classes (CPDAGs) rather than raw graphs:

causaliq-analysis evaluate-graph -i learned.graphml -r ground_truth.graphml \
    -m equiv.f1 -m equiv.shd -o results/eval

Workflow Usage

In a CausalIQ workflow, evaluate_graph operates as an UPDATE action:

steps:
  - name: "Evaluate Graphs"
    uses: "causaliq-analysis"
    with:
      action: "evaluate_graph"
      input: "results/graphs.db"
      reference: "reference/asia_true.graphml"
      metric:
        - f1
        - shd
        - precision
        - recall

This computes metrics for each graph entry in the cache and adds them to the entry's metadata.
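
The optional filter parameter restricts which cache entries are evaluated. The example below is a sketch only: the expression shown ("algorithm == 'tabu'") is a hypothetical selection on a metadata field, not the documented filter syntax, so consult the workflow documentation for the actual expression language:

steps:
  - name: "Evaluate Selected Graphs"
    uses: "causaliq-analysis"
    with:
      action: "evaluate_graph"
      input: "results/graphs.db"
      reference: "reference/asia_true.graphml"
      filter: "algorithm == 'tabu'"  # hypothetical expression - see workflow docs
      metric:
        - f1
        - shd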


Supported Metrics

Available Metrics Summary

Metric      Description
f1          F1 score from direct graph comparison
shd         Structural Hamming Distance
precision   Precision from direct comparison
recall      Recall from direct comparison
equiv.f1    F1 comparing equivalence classes (CPDAGs)
equiv.shd   SHD comparing equivalence classes (CPDAGs)
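
For intuition, the direct-comparison metrics all derive from true/false positive and negative edge counts. The following is a minimal Python sketch of the standard formulae, not the CausalIQ implementation; the counts and the SHD convention (one point per extra or missing edge) are illustrative assumptions:

# Minimal sketch of the standard formulae - NOT the CausalIQ
# implementation, whose counting conventions are described below.
tp, fp, fn = 7, 3, 3                # illustrative edge-level counts

precision = tp / (tp + fp)          # 0.7 - fraction of learned edges that are correct
recall = tp / (tp + fn)             # 0.7 - fraction of reference edges recovered
f1 = 2 * precision * recall / (precision + recall)  # 0.7 - harmonic mean
shd = fp + fn                       # 6 under the simplest convention:
                                    # one point per extra or missing edge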

Metric Naming in CausalIQ

Many different structural metrics are used to evaluate graphs in causal discovery. Common ones are F1, Precision, Recall and Structural Hamming Distance (SHD), but others specific to causal discovery, such as Structural Intervention Distance (SID), are also employed.

Critical differences in structural evaluation include:

  • Whether the raw graphs (e.g., a learned DAG and a reference DAG) are compared, or whether the equivalence classes (CPDAGs or PAGs) to which they belong are compared. The former is generally more appropriate in causal discovery, where the orientation of arcs is critical.
  • Many structural metrics are built upon true/false positive/negative counts, and different authors take different approaches to computing these counts for arcs that carry an orientation (see the sketch after this list).
  • Some authors report the raw metric but others normalise it (e.g., SHD divided by the number of variables or edges).
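
To make the second point concrete, here is how a single reversed arc (reference A -> B, learned B -> A) is scored under two common counting conventions. This is illustrative only; neither convention is asserted to be the one CausalIQ uses, which is described under Comparison Semantics below:

# Illustrative only - neither convention is asserted to be CausalIQ's.
# Reference contains A -> B; the learned graph contains B -> A.

# Skeleton-level counting ignores orientation: the adjacency matches.
skeleton = {"tp": 1, "fp": 0, "fn": 0}

# Strict arc-level counting requires matching orientation: the reversed
# arc is simultaneously a false positive and a false negative.
strict = {"tp": 0, "fp": 1, "fn": 1}

for name, c in (("skeleton", skeleton), ("strict", strict)):
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    print(f"{name} precision: {precision}")   # 1.0 vs 0.0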

CausalIQ uses the following naming structure for metrics:

[<preprocessing>].<metric>.[<semantics>].[<postprocessing>].[<statistic>]

Element            Optional   Description                           Supported Values
<preprocessing>    Yes        Preprocessing before comparison       equiv (convert to CPDAGs first)
<metric>           No         The basic metric                      f1, shd, precision, recall
<semantics>        Yes        Alternative computation semantics     not currently supported
<postprocessing>   Yes        Postprocessing, e.g., normalisation   not currently supported
<statistic>        Yes        Statistic over multiple values        see summarise action
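
For example, equiv.f1 combines the equiv preprocessing element with the f1 metric, and a statistic element could be appended when summarising over multiple values. The sketch below shows one hypothetical way such names could be decomposed; it is not the CausalIQ parser:

# Hypothetical decomposition of a metric name - NOT the CausalIQ parser.
PREPROCESSING = {"equiv"}
METRICS = {"f1", "shd", "precision", "recall"}

def parse_metric(name: str) -> dict:
    parts = name.split(".")
    element = {"preprocessing": None, "metric": None, "statistic": None}
    if parts[0] in PREPROCESSING:
        element["preprocessing"] = parts.pop(0)
    if parts and parts[0] in METRICS:
        element["metric"] = parts.pop(0)
    if parts:                           # any trailing element: a statistic
        element["statistic"] = parts.pop(0)
    return element

print(parse_metric("equiv.f1"))
# {'preprocessing': 'equiv', 'metric': 'f1', 'statistic': None}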

Legacy Support

The core module that provides structural comparisons between PDAGs (graphs with a mixture of directed and undirected edges, a superset of DAGs and CPDAGs) is pdag_compare in metrics.py. It implements the comparison semantics used consistently in CausalIQ papers and the legacy discovery repository.
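
As an illustration of the kind of comparison involved, here is a simplified sketch of a strict edge-level confusion count over PDAG edges. It is not the pdag_compare implementation, and its counting convention (orientation must match exactly) is an assumption for illustration only:

# Simplified sketch - NOT pdag_compare, whose semantics are detailed
# below. Directed edges are (u, v) tuples for u -> v; undirected edges
# are frozensets, so orientation must match exactly for a true positive.

def strict_confusion(learned: set, reference: set) -> dict:
    return {
        "tp": len(learned & reference),   # edges identical in both graphs
        "fp": len(learned - reference),   # edges only in the learned graph
        "fn": len(reference - learned),   # edges only in the reference
    }

learned = {("A", "B"), frozenset({"B", "C"})}   # A -> B, B - C
reference = {("A", "B"), ("B", "C")}            # A -> B, B -> C
print(strict_confusion(learned, reference))     # {'tp': 1, 'fp': 1, 'fn': 1}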

Comparison Semantics

To be completed — will describe in detail how the CausalIQ code computes the confusion matrix counts that underlie the structural metrics.

See Also