Extracting the Best Graph¶

The best_graph capability extracts an optimal DAG from a Probabilistic Dependency Graph (PDG) using a greedy algorithm. This is useful when you have a PDG representing structural uncertainty (e.g., from merged graphs) and need to select a single best DAG for downstream analysis or inference.

This is an update action (see workflow patterns) and so updates entries in the input cache by adding a DAG object to each matched entry.

Parameters¶

Parameter	CLI	Required	Default	Description
`input`	`-i`	Yes	None	Input PDG file (`.graphml`) or cache (`.db`)
`output`	`-o`	Yes (CLI)	None	Output directory for DAG and metadata files
`threshold`	`-t`	No	0.0	Minimum edge probability threshold
`filter`	`-f`	No	None	Filter entries in input cache

Notes:

Input type is auto-detected by file extension:
.graphml: Read as PDG file (CLI only)
.db: Read all entries from WorkflowCache database containing PDG objects
The output parameter is CLI only. Creates output directory containing dag.graphml and _meta.json.
In workflows, best_graph is an update action: output is prohibited. The action processes all entries in the input cache and adds a DAG object to each entry.

How It Works¶

Step 1: Order Edges by Probability¶

For each node pair, the algorithm computes an effective directed probability by splitting undirected probability equally between forward and backward directions:

\[P_{effective}(forward) = P(forward) + \frac{P(undirected)}{2}\]

\[P_{effective}(backward) = P(backward) + \frac{P(undirected)}{2}\]

Edges are then sorted by their maximum effective probability in descending order.

Step 2: Greedy Edge Selection¶

The algorithm greedily adds edges to the DAG in probability order:

Select the highest-probability edge not yet processed
Choose direction (forward or backward) based on higher effective probability
Threshold check: Skip if max probability < threshold
Cycle check: Skip if adding this edge would create a cycle
Tie-breaking: If probabilities are equal, use alphabetical ordering (source < target)
Add edge to DAG and continue

Step 3: Return Statistics¶

The algorithm returns the DAG along with extraction statistics:

Statistic	Description
`edges_included`	Number of edges in the optimal DAG
`edges_skipped_cycle`	Edges skipped to avoid cycles
`edges_skipped_threshold`	Edges below probability threshold
`tie_breaks_applied`	Direction ties resolved alphabetically

Example¶

Consider a PDG with three nodes and the following edge probabilities:

Node Pair	P(forward)	P(backward)	P(undirected)	P(none)
(A, B)	0.6	0.1	0.2	0.1
(B, C)	0.3	0.5	0.1	0.1
(A, C)	0.0	0.0	0.0	1.0

Effective probabilities after splitting undirected:

Node Pair	P_eff(forward)	P_eff(backward)	Max
(A, B)	0.7	0.2	0.7
(B, C)	0.35	0.55	0.55

Result (threshold=0.0):

Add A → B (0.7 > 0.2)
Add C → B (0.55 > 0.35)

Final DAG: A → B ← C (2 edges, 0 skipped, 0 tie-breaks)

CLI Usage¶

Basic Usage¶

Extract optimal DAG from a PDG file:

causaliq-analysis best-graph -i merged.graphml -o results/optimal

This creates:

results/optimal/dag.graphml — The optimal DAG
results/optimal/_meta.json — Extraction metadata and statistics

With Threshold¶

Apply a minimum probability threshold to filter weak edges:

causaliq-analysis best-graph -i merged.graphml -o results/optimal \
    --threshold=0.5

Only edges with max effective probability ≥ 0.5 will be included.

Workflow Usage¶

In a CausalIQ workflow, best_graph operates as an UPDATE action that adds a DAG object to entries containing PDG objects:

steps:
  - name: "Merge experimental graphs"
    uses: "causaliq-analysis"
    with:
      action: "merge_graphs"
      input: "results/learned_graphs.db"
      output: "results/merged.db"
    matrix:
      network: [asia, cancer]
      sample_size: [100, 1K]

  - name: "Extract optimal DAGs"
    uses: "causaliq-analysis"
    with:
      action: "best_graph"
      input: "results/merged.db"
      threshold: 0.3

After this workflow:

Each entry in results/merged.db will contain both:
The original pdg object from merge_graphs
A new dag object from best_graph

Output Metadata¶

The best_graph action writes extraction statistics to metadata, which can be used for analysis or filtering:

{
  "metadata": {
    "causaliq-analysis": {
      "best_graph": {
        "action": "best_graph",
        "timestamp": "2024-01-15T10:30:00+00:00",
        "threshold": 0.3,
        "edges_included": 5,
        "edges_skipped_cycle": 2,
        "edges_skipped_threshold": 1,
        "tie_breaks_applied": 0
      }
    }
  }
}

Python API¶

Using the Action Provider¶

from causaliq_analysis.workflow_action import AnalysisActionProvider

action = AnalysisActionProvider()
status, metadata, objects = action.run(
    "best_graph",
    {"input": "merged.graphml", "threshold": 0.5},
    mode="run",
)

# Extract DAG content
dag_graphml = objects[0]["content"]
print(f"Edges included: {metadata['edges_included']}")

Using PDG Directly¶

from causaliq_core.graph import PDG, EdgeProbabilities
from causaliq_core.graph.io import graphml

# Load or create PDG
pdg = PDG(
    ["A", "B", "C"],
    {
        ("A", "B"): EdgeProbabilities(forward=0.8, none=0.2),
        ("B", "C"): EdgeProbabilities(forward=0.7, none=0.3),
    },
)

# Extract optimal DAG
result = pdg.to_dag_greedy(threshold=0.5)

print(f"Optimal DAG edges: {list(result.dag.edges)}")
print(f"Edges included: {result.edges_included}")
print(f"Edges skipped (cycle): {result.edges_skipped_cycle}")
print(f"Edges skipped (threshold): {result.edges_skipped_threshold}")
print(f"Tie-breaks applied: {result.tie_breaks_applied}")